Raw tokens completion via online serving

Hello,

Currently, vLLM seems to support only text-based input (messages dict) via online serving, and raw tokens seem to be available only in the offline inference setting. Is there any plan to support this, or is it already available? Thanks, folks.

Currently, vLLM’s online serving (the OpenAI-compatible API server) only supports text-based input via the messages or prompt fields; it does not accept raw token IDs directly. Raw-token (pre-tokenized) input is only available in offline inference through the LLM class, where you can pass token IDs to methods such as generate or beam_search. There is no official documentation or roadmap entry indicating that online serving will support raw token input in the near future, and this feature is not currently available according to the latest docs and issues reviewed (Quickstart, Generative Models, GitHub Issue #10084).
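For reference, here is a minimal sketch of the offline path, assuming a recent vLLM version where TokensPrompt is exported from vllm.inputs; the model name and sampling settings are placeholders:

```python
# Minimal sketch: passing raw token IDs to offline inference via the LLM class.
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

llm = LLM(model="facebook/opt-125m")  # placeholder model
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

# Tokenize however you like, or reuse token IDs you already have;
# here we borrow vLLM's own tokenizer purely for illustration.
tokenizer = llm.get_tokenizer()
token_ids = tokenizer.encode("Hello, my name is")

# Pass the pre-tokenized input instead of a text prompt.
outputs = llm.generate(TokensPrompt(prompt_token_ids=token_ids), sampling_params)
print(outputs[0].outputs[0].text)
```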

Would you like more detail on how to use raw token input in offline mode, or guidance on possible workarounds for online serving?
