Tell me about the current status of the tokenize endpoint in vLLM, thanks!

The /tokenize endpoint in vLLM is available and supports both prompt and chat-completions style inputs, returning token IDs, token count, and max model length. It also supports detokenization via the /detokenize endpoint, and can optionally return token strings with the return_token_strs parameter. Example usage and response formats are documented in recent PRs and the official docs. For chat-style input, the endpoint applies the chat template as in inference, ensuring tokenization matches generation behavior. See PR #5923, PR #5054, and API docs.
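Since the answer above mentions /detokenize and the return_token_strs parameter, here is a minimal sketch of the corresponding request bodies. The field names ("prompt", "tokens", "return_token_strs") are taken from the docs/PRs cited above; treat them as assumptions and verify against your vLLM version.

```python
# Build request bodies for vLLM's /tokenize and /detokenize endpoints.
# Field names follow the docs/PRs cited above; verify against your version.

def tokenize_body(model: str, prompt: str, return_token_strs: bool = False) -> dict:
    """JSON body for POST /tokenize with a plain-text prompt."""
    body = {"model": model, "prompt": prompt}
    if return_token_strs:
        # Ask the server to also return the token strings, not just the IDs.
        body["return_token_strs"] = True
    return body

def detokenize_body(model: str, token_ids: list) -> dict:
    """JSON body for POST /detokenize, turning token IDs back into text."""
    return {"model": model, "tokens": token_ids}

# Send with e.g. requests.post("http://localhost:8000/tokenize", json=...)
```
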

However, the endpoint currently only supports text-based tokenization and does not handle images or other multimodal inputs; for such cases, you must use the appropriate HuggingFace processor locally. There is ongoing discussion about extending /tokenize to support multimodal data, but this is not yet implemented (issue #20778). Would you like more detail or code examples?

Hint: Mention @RunLLM in the post for followups.

@RunLLM Give me examples or test examples in vLLM. Also, what is the endpoint? It's /tokenize, not /v1/tokenize, right?

Yes, the endpoint is /tokenize (not /v1/tokenize) for vLLM’s tokenizer API. This endpoint is used for tokenizing text or chat messages and is separate from the OpenAI-compatible /v1 routes to avoid confusion (PR #5054).

For examples, see the vLLM test file test_tokenization.py, which demonstrates POST requests to /tokenize with both prompt and chat-style inputs. Here is a minimal example:

import requests

# Tokenize a plain-text prompt against a locally running vLLM server.
response = requests.post(
    "http://localhost:8000/tokenize",
    json={"model": "meta-llama/Meta-Llama-3-8B-Instruct", "prompt": "test"},
)
print(response.json())

This returns the token IDs, the token count, and max_model_len. For chat-style input, send a messages array instead of prompt.
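For the chat-style case, here is a hedged sketch of the request body. The messages schema is assumed to mirror the OpenAI-compatible chat format, and per the first answer the server applies the model's chat template before tokenizing; verify the field names against your vLLM version.

```python
# Sketch of a chat-style /tokenize request body (messages instead of prompt).
def chat_tokenize_body(model: str, messages: list) -> dict:
    """JSON body for POST /tokenize with chat messages; the server applies
    the chat template, so the token count matches what generation sees."""
    return {"model": model, "messages": messages}

body = chat_tokenize_body(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    [{"role": "user", "content": "Hello!"}],
)
# Send with: requests.post("http://localhost:8000/tokenize", json=body)
```
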
