GPT-OSS inference

Hey all, I have a couple of questions. First, when will vLLM support GPT-OSS on Ada Lovelace GPUs?

Second, since GPT-OSS uses the Harmony format, can we still use the regular old tokenizer.apply_chat_template()?

Is that possible? I've seen some folks recreate the chat template in Jinja.

cheers!

vLLM plans to add GPT-OSS support on Ada Lovelace GPUs (e.g., RTX 4090, 4080, 4070) after testing and integration are complete, following the current rollout order of Blackwell → Hopper → Ampere → Ada Lovelace. The necessary kernels are already integrated, but end-to-end testing on Ada GPUs is not finished, so support is not yet available. See the discussion in this issue and this comment for the current status.
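One reason each architecture in that rollout needs its own validation pass is that each one has a distinct CUDA compute capability, and kernels tuned for one SM version may not run (or run well) on another. A small illustrative sketch, using compute-capability values published by NVIDIA (the helper function and dictionary here are hypothetical, purely for illustration):

```python
# Compute capabilities of the architectures in vLLM's rollout order.
# Ada Lovelace (SM 8.9) sits between Ampere and Hopper, so kernels
# must be tested on it separately even if they work on the others.
COMPUTE_CAPABILITY = {
    "Blackwell (B200)": (10, 0),
    "Hopper (H100)": (9, 0),
    "Ampere (A100)": (8, 0),
    "Ada Lovelace (RTX 4090)": (8, 9),
}

def is_ada(major: int, minor: int) -> bool:
    """Ada Lovelace GPUs report compute capability 8.9."""
    return (major, minor) == (8, 9)

print(is_ada(*COMPUTE_CAPABILITY["Ada Lovelace (RTX 4090)"]))  # True
```

On a real machine you would get the `(major, minor)` pair from `torch.cuda.get_device_capability()` rather than a hard-coded table.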

Regarding the Harmony format: GPT-OSS models use Harmony for chat and reasoning, which is not compatible with the standard Hugging Face tokenizer.apply_chat_template() flow. vLLM instead handles Harmony formatting internally for GPT-OSS models, so you should not apply a regular Jinja chat template to them. If you call apply_chat_template() directly, it will not produce the correct prompt format for GPT-OSS. See this issue and this discussion for more details.
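To make the incompatibility concrete, here is a rough sketch of the prompt layout Harmony produces, built by hand from the special tokens in OpenAI's Harmony spec (`<|start|>`, `<|message|>`, `<|end|>`). The `render_harmony_prompt` helper is purely illustrative; in practice vLLM, or the `openai-harmony` package, renders this for you, and the real format also carries channels and other metadata that this sketch omits:

```python
# Illustrative only: a minimal rendering of chat messages into the
# Harmony token layout, to show why a generic ChatML-style Jinja
# template will not match what GPT-OSS was trained on.

def render_harmony_prompt(messages: list[dict]) -> str:
    """Render {'role', 'content'} dicts into a Harmony-style prompt
    string, ending with the assistant header so the model continues."""
    parts = [
        f"<|start|>{m['role']}<|message|>{m['content']}<|end|>"
        for m in messages
    ]
    parts.append("<|start|>assistant")  # generation begins here
    return "".join(parts)

prompt = render_harmony_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

When serving GPT-OSS through vLLM's OpenAI-compatible endpoint, you just send ordinary `messages` and let the server do this rendering internally.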

