Tell me about the current status of the tokenize endpoint in vLLM, thanks!

The /tokenize endpoint in vLLM is available and supports both prompt and chat-completions style inputs, returning token IDs, token count, and max model length. It also supports detokenization via the /detokenize endpoint, and can optionally return token strings with the return_token_strs parameter. Example usage and response formats are documented in recent PRs and the official docs. For chat-style input, the endpoint applies the chat template as in inference, ensuring tokenization matches generation behavior. See PR #5923, PR #5054, and API docs.
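Since the answer above mentions /detokenize and the return_token_strs parameter, here is a minimal sketch of the corresponding request bodies. The field names ("prompt", "tokens", "return_token_strs") are taken from the docs/PRs cited above; treat them as assumptions and verify against your vLLM version.

```python
# Build request bodies for vLLM's /tokenize and /detokenize endpoints.
# Field names follow the docs/PRs cited above; verify against your version.

def tokenize_body(model: str, prompt: str, return_token_strs: bool = False) -> dict:
    """JSON body for POST /tokenize with a plain-text prompt."""
    body = {"model": model, "prompt": prompt}
    if return_token_strs:
        # Ask the server to also return the token strings, not just the IDs.
        body["return_token_strs"] = True
    return body

def detokenize_body(model: str, token_ids: list) -> dict:
    """JSON body for POST /detokenize, turning token IDs back into text."""
    return {"model": model, "tokens": token_ids}

# Send with e.g. requests.post("http://localhost:8000/tokenize", json=...)
```
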

However, the endpoint currently only supports text-based tokenization and does not handle images or other multimodal inputs; for such cases, you must use the appropriate HuggingFace processor locally. There is ongoing discussion about extending /tokenize to support multimodal data, but this is not yet implemented (issue #20778). Would you like more detail or code examples?

Hint: Mention @RunLLM in the post for followups.

@RunLLM Give me examples or test examples in vLLM. Also, what is the endpoint? It's /tokenize, not /v1/tokenize, right?

Yes, the endpoint is /tokenize (not /v1/tokenize) for vLLM’s tokenizer API. This endpoint is used for tokenizing text or chat messages and is separate from the OpenAI-compatible /v1 routes to avoid confusion (PR #5054).

For examples, see the vLLM test file test_tokenization.py, which demonstrates POST requests to /tokenize with both prompt and chat-style inputs. Here is a minimal example:

import requests

# Tokenize a plain-text prompt against a locally running vLLM server.
response = requests.post(
    "http://localhost:8000/tokenize",
    json={"model": "meta-llama/Meta-Llama-3-8B-Instruct", "prompt": "test"},
)
print(response.json())

This returns the token IDs, the token count, and max_model_len. For chat-style input, send a messages array instead of prompt.
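For the chat-style case, here is a hedged sketch of the request body. The messages schema is assumed to mirror the OpenAI-compatible chat format, and per the first answer the server applies the model's chat template before tokenizing; verify the field names against your vLLM version.

```python
# Sketch of a chat-style /tokenize request body (messages instead of prompt).
def chat_tokenize_body(model: str, messages: list) -> dict:
    """JSON body for POST /tokenize with chat messages; the server applies
    the chat template, so the token count matches what generation sees."""
    return {"model": model, "messages": messages}

body = chat_tokenize_body(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    [{"role": "user", "content": "Hello!"}],
)
# Send with: requests.post("http://localhost:8000/tokenize", json=body)
```
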
