Raw tokens completion via online serving

Hello,

Currently, vLLM seems to support only text-based input (messages dict) via online serving, and raw tokens seem to be available only in the offline inference setting. Is there any plan to support this, or is it already available? Thanks, folks.

Currently, vLLM’s online serving (the OpenAI-compatible API server) only supports text-based input via the messages or prompt fields; it does not accept raw token IDs directly. Raw-token (pre-tokenized) input is only available in offline inference through the LLM class, where you can pass token IDs to methods such as generate or beam_search. There is no official documentation or roadmap entry indicating that online serving will support raw token input in the near future, and this feature is not currently available according to the latest docs and issues reviewed (Quickstart, Generative Models, GitHub Issue #10084).
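For reference, here is a minimal sketch of the offline path, assuming a recent vLLM version where TokensPrompt is exported from vllm.inputs; the model name and sampling settings are placeholders:

```python
# Minimal sketch: passing raw token IDs to offline inference via the LLM class.
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

llm = LLM(model="facebook/opt-125m")  # placeholder model
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

# Tokenize however you like, or reuse token IDs you already have;
# here we borrow vLLM's own tokenizer purely for illustration.
tokenizer = llm.get_tokenizer()
token_ids = tokenizer.encode("Hello, my name is")

# Pass the pre-tokenized input instead of a text prompt.
outputs = llm.generate(TokensPrompt(prompt_token_ids=token_ids), sampling_params)
print(outputs[0].outputs[0].text)
```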

Would you like more detail on how to use raw token input in offline mode, or guidance on possible workarounds for online serving?
