Thinking Token limit setting

A client-side workaround is to monitor the model’s output for the start of the “thinking” section (e.g., <think>) and stop generation once your desired token limit is reached. In practice this takes two API calls: the first generates up to the thinking token limit, and the second sends the truncated reasoning back, prompts the model to end the reasoning, and has it produce the final answer. This approach is described in this issue.
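A minimal sketch of the two-call idea, not taken from the linked issue. It assumes the model closes its reasoning with `</think>`, and it takes a hypothetical `complete(prompt, max_tokens)` callable that wraps whatever completions endpoint you use and returns `(text, stopped)`, where `stopped` is true when the model ended on its own rather than hitting `max_tokens`. The function name and parameters are illustrative only.

```python
from typing import Callable, Tuple

# Assumption: the model marks the end of its reasoning with this tag.
THINK_CLOSE = "</think>"

# Hypothetical wrapper around a text-completion call:
# (prompt, max_tokens) -> (generated_text, stopped_on_its_own)
CompleteFn = Callable[[str, int], Tuple[str, bool]]


def generate_with_thinking_budget(
    complete: CompleteFn,
    prompt: str,
    thinking_budget: int,
    answer_budget: int,
) -> str:
    """Cap the reasoning phase at `thinking_budget` tokens using two calls."""
    # Call 1: let the model reason, but cut it off at the thinking budget.
    first, stopped = complete(prompt, thinking_budget)
    if stopped:
        # The model finished reasoning and answering within the budget.
        return first

    if THINK_CLOSE in first:
        # Reasoning already ended; only the answer was truncated.
        prefix = first
    else:
        # Budget hit mid-reasoning: close the thinking section ourselves.
        prefix = first + "\n" + THINK_CLOSE + "\n"

    # Call 2: continue from the closed reasoning to get the final answer.
    rest, _ = complete(prompt + prefix, answer_budget)
    return prefix + rest
```

With an OpenAI-compatible server, `complete` could wrap a completions request and report `stopped` from the response's finish reason. The thinking budget is then only as exact as the server's `max_tokens` accounting.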

Alternatively, you can implement a custom logits processor that forces the end-of-thinking token once a set number of tokens has been generated. This requires direct integration with the model inference pipeline and is not supported via the OpenAI API interface in vLLM v1.
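The core logic of such a processor can be sketched without any framework. The function below is not vLLM's logits-processor interface. It is a hypothetical helper that operates on a plain list of logits, and `end_think_id` stands for whatever token ID your tokenizer assigns to the end-of-thinking token.

```python
import math
from typing import List, Sequence


def force_end_of_thinking(
    generated_ids: Sequence[int],
    logits: List[float],
    end_think_id: int,
    max_thinking_tokens: int,
) -> List[float]:
    """Return logits for the next token, forcing the end-of-thinking token
    once the thinking budget is used up."""
    # The thinking section is already closed: do not interfere.
    if end_think_id in generated_ids:
        return logits
    # Still within budget: leave the distribution untouched.
    if len(generated_ids) < max_thinking_tokens:
        return logits
    # Budget exhausted: mask every token except the end-of-thinking token.
    forced = [-math.inf] * len(logits)
    forced[end_think_id] = 0.0
    return forced
```

This sketch counts every generated token as a thinking token, which is only right if generation starts inside the thinking section. A real integration would apply the same check to the logits tensor inside the inference loop.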

