I am currently using vLLM to run inference with the gpt-oss-20b model. I want to set the model's reasoning effort to low. I read that vLLM supports a parameter called reasoning_effort. Can I use that parameter when I am using vLLM from Python code rather than through the server?
No, you cannot use the reasoning_effort parameter when running vLLM via the Python LLM class for offline inference; it is only supported through the OpenAI-compatible server endpoints (e.g., /v1/chat/completions or /v1/responses). The parameter is not available in SamplingParams or the LLM.generate API for direct Python usage, only in HTTP requests to the server, as confirmed in the issues linked below.
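That said, if you stay with offline inference, you may still be able to steer the model's effort through the system prompt: the gpt-oss model card describes a "Reasoning: low/medium/high" line in the system message that the model's chat template recognizes. This is an assumption drawn from the model card, not something the linked issues confirm, so verify it against your chat template. A minimal sketch of building such messages (the actual LLM.chat call is commented out since it requires a GPU and the model weights):

```python
# Hypothetical sketch: request low reasoning effort via the system prompt
# when using vLLM offline. Assumes the gpt-oss chat template honors a
# "Reasoning: <level>" system message -- check the model card for your version.
def build_messages(user_prompt: str, effort: str = "low") -> list[dict]:
    """Prepend a system message asking for the given reasoning effort."""
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown reasoning effort: {effort}")
    return [
        {"role": "system", "content": f"Reasoning: {effort}"},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages("What is the capital of France?", effort="low")

# These messages would then be passed to vLLM's chat API, e.g.:
# from vllm import LLM, SamplingParams
# llm = LLM(model="openai/gpt-oss-20b")
# outputs = llm.chat(messages, SamplingParams(max_tokens=256))
print(messages[0]["content"])
```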
If you need to control reasoning_effort, you must run vLLM as a server and send requests via the OpenAI-compatible API. Would you like more detail or example code for using the server API?
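As a sketch of the server route: after starting the server (e.g. `vllm serve openai/gpt-oss-20b`), you send an OpenAI-compatible chat completion request with reasoning_effort in the body. The host/port and model name below are assumptions for illustration; the actual HTTP call is shown commented out so the snippet stands alone:

```python
import json

# Build an OpenAI-compatible /v1/chat/completions request body with
# reasoning_effort, as accepted by the vLLM server per the linked issues.
payload = {
    "model": "openai/gpt-oss-20b",          # must match the served model name
    "messages": [{"role": "user", "content": "Explain KV caching briefly."}],
    "reasoning_effort": "low",              # "low" | "medium" | "high"
    "max_tokens": 256,
}
body = json.dumps(payload)

# POST it with any HTTP client, assuming the server runs on localhost:8000:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
print(body)
```

The same field also works through the official openai Python client by passing it via `extra_body` if your SDK version does not expose reasoning_effort directly.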
Sources:
- https://github.com/vllm-project/vllm/issues/22359
- https://github.com/vllm-project/vllm/issues/24292
- https://github.com/vllm-project/vllm/issues/22809