GPT-OSS inference

Hey all, I have a couple of questions. First, when will vLLM support GPT-OSS on Ada Lovelace GPUs?

Second, since GPT-OSS uses the Harmony format, can we still use the regular old tokenizer.apply_chat_template()?

Is that possible? I've seen some folks recreate the chat template in Jinja.

cheers!

vLLM plans to add GPT-OSS support on Ada Lovelace GPUs (e.g., RTX 4090, 4080, 4070) after testing and integration are complete, following the current rollout order of Blackwell → Hopper → Ampere → Ada Lovelace. The necessary kernels are already integrated, but end-to-end testing on Ada GPUs is not finished, so support is not yet available. See the discussion in this issue and this comment for the current status.
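One reason each architecture in that rollout needs its own validation pass is that each one has a distinct CUDA compute capability, and kernels tuned for one SM version may not run (or run well) on another. A small illustrative sketch, using compute-capability values published by NVIDIA (the helper function and dictionary here are hypothetical, purely for illustration):

```python
# Compute capabilities of the architectures in vLLM's rollout order.
# Ada Lovelace (SM 8.9) sits between Ampere and Hopper, so kernels
# must be tested on it separately even if they work on the others.
COMPUTE_CAPABILITY = {
    "Blackwell (B200)": (10, 0),
    "Hopper (H100)": (9, 0),
    "Ampere (A100)": (8, 0),
    "Ada Lovelace (RTX 4090)": (8, 9),
}

def is_ada(major: int, minor: int) -> bool:
    """Ada Lovelace GPUs report compute capability 8.9."""
    return (major, minor) == (8, 9)

print(is_ada(*COMPUTE_CAPABILITY["Ada Lovelace (RTX 4090)"]))  # True
```

On a real machine you would get the `(major, minor)` pair from `torch.cuda.get_device_capability()` rather than a hard-coded table.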

Regarding the Harmony format: GPT-OSS models use Harmony for chat and reasoning, which is not compatible with the standard Hugging Face tokenizer.apply_chat_template() flow. vLLM instead handles Harmony formatting internally for GPT-OSS models, so you should not apply a regular Jinja chat template to them. If you call apply_chat_template() directly, it will not produce the correct prompt format for GPT-OSS. See this issue and this discussion for more details.
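To make the incompatibility concrete, here is a rough sketch of the prompt layout Harmony produces, built by hand from the special tokens in OpenAI's Harmony spec (`<|start|>`, `<|message|>`, `<|end|>`). The `render_harmony_prompt` helper is purely illustrative; in practice vLLM, or the `openai-harmony` package, renders this for you, and the real format also carries channels and other metadata that this sketch omits:

```python
# Illustrative only: a minimal rendering of chat messages into the
# Harmony token layout, to show why a generic ChatML-style Jinja
# template will not match what GPT-OSS was trained on.

def render_harmony_prompt(messages: list[dict]) -> str:
    """Render {'role', 'content'} dicts into a Harmony-style prompt
    string, ending with the assistant header so the model continues."""
    parts = [
        f"<|start|>{m['role']}<|message|>{m['content']}<|end|>"
        for m in messages
    ]
    parts.append("<|start|>assistant")  # generation begins here
    return "".join(parts)

prompt = render_harmony_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

When serving GPT-OSS through vLLM's OpenAI-compatible endpoint, you just send ordinary `messages` and let the server do this rendering internally.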

