I’ve modified the chat template, and sometimes I want to set add_generation_prompt=False. How do I specify that?
You can set add_generation_prompt=False in two places. For offline inference, pass it to tokenizer.apply_chat_template or to llm.chat. For the OpenAI-compatible server, include "add_generation_prompt": false in the request body. The flag controls whether the chat template appends the generation prompt (the opening of the assistant turn) after your messages, and it is supported in both modes according to the vLLM documentation and chat template protocol.
Would you like a code example or more detail?
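For offline inference, the effect of the flag can be sketched without downloading a model. The plain-Python function below mimics what a ChatML-style template renders; the ChatML markers are an assumption for illustration, and the real template is a Jinja file shipped with the model, applied via tokenizer.apply_chat_template:

```python
def render_chatml(messages, add_generation_prompt=True):
    """Sketch of what a ChatML-style chat template produces.

    The real template is Jinja shipped with the model and applied via
    tokenizer.apply_chat_template(..., add_generation_prompt=...).
    """
    out = ""
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # The generation prompt is the opening tag of the assistant turn,
        # which cues the model to start writing its reply.
        out += "<|im_start|>assistant\n"
    return out


messages = [{"role": "user", "content": "Hi"}]
print(render_chatml(messages, add_generation_prompt=True))   # ends with the assistant opening tag
print(render_chatml(messages, add_generation_prompt=False))  # ends after the last user turn
```

With add_generation_prompt=False the rendered prompt stops after the last completed turn, which is what you want when, for example, you feed the model a partial assistant message to continue.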
Hint: Mention @RunLLM in the post for followups.
@RunLLM
like this?
response = self.client.chat.completions.create(
    model=self.model_name,
    messages=messages,
    max_tokens=10000,
    temperature=0.3,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False, "add_generation_prompt": False},
        "stop": ["\n", "<|im_start|>"]
    }
)
No, that’s not correct. add_generation_prompt should be a top-level parameter in the request body, not inside chat_template_kwargs. You should write:
response = self.client.chat.completions.create(
    model=self.model_name,
    messages=messages,
    max_tokens=10000,
    temperature=0.3,
    add_generation_prompt=False,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False},
        "stop": ["\n", "<|im_start|>"]
    }
)
According to vLLM API docs and OpenAI-compatible server docs, add_generation_prompt is not part of chat_template_kwargs.
Would you like more detail or further examples?
TypeError: Completions.create() got an unexpected keyword argument 'add_generation_prompt'
seems not right
@required_args(["messages", "model"], ["messages", "model", "stream"])
def create(
    self,
    *,
    messages: Iterable[ChatCompletionMessageParam],
    model: Union[str, ChatModel],
    audio: Optional[ChatCompletionAudioParam] | Omit = omit,
    frequency_penalty: Optional[float] | Omit = omit,
    function_call: completion_create_params.FunctionCall | Omit = omit,
    functions: Iterable[completion_create_params.Function] | Omit = omit,
    logit_bias: Optional[Dict[str, int]] | Omit = omit,
    logprobs: Optional[bool] | Omit = omit,
    max_completion_tokens: Optional[int] | Omit = omit,
    max_tokens: Optional[int] | Omit = omit,
    metadata: Optional[Metadata] | Omit = omit,
    modalities: Optional[List[Literal["text", "audio"]]] | Omit = omit,
    n: Optional[int] | Omit = omit,
    parallel_tool_calls: bool | Omit = omit,
    prediction: Optional[ChatCompletionPredictionContentParam] | Omit = omit,
    presence_penalty: Optional[float] | Omit = omit,
    prompt_cache_key: str | Omit = omit,
    prompt_cache_retention: Optional[Literal["in-memory", "24h"]] | Omit = omit,
    reasoning_effort: Optional[ReasoningEffort] | Omit = omit,
    response_format: completion_create_params.ResponseFormat | Omit = omit,
    safety_identifier: str | Omit = omit,
    seed: Optional[int] | Omit = omit,
    service_tier: Optional[Literal["auto", "default", "flex", "scale", "priority"]] | Omit = omit,
    stop: Union[Optional[str], SequenceNotStr[str], None] | Omit = omit,
    store: Optional[bool] | Omit = omit,
    stream: Optional[Literal[False]] | Literal[True] | Omit = omit,
    stream_options: Optional[ChatCompletionStreamOptionsParam] | Omit = omit,
    temperature: Optional[float] | Omit = omit,
    tool_choice: ChatCompletionToolChoiceOptionParam | Omit = omit,
    tools: Iterable[ChatCompletionToolUnionParam] | Omit = omit,
    top_logprobs: Optional[int] | Omit = omit,
    top_p: Optional[float] | Omit = omit,
    user: str | Omit = omit,
    verbosity: Optional[Literal["low", "medium", "high"]] | Omit = omit,
    web_search_options: completion_create_params.WebSearchOptions | Omit = omit,
    # Use the following arguments if you need to pass additional parameters to the API that aren't available via kwargs.
    # The extra values given here take precedence over values defined on the client or passed to this method.
    extra_headers: Headers | None = None,
    extra_query: Query | None = None,
    extra_body: Body | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = not_given,
) -> ChatCompletion | Stream[ChatCompletionChunk]:
The error occurs because add_generation_prompt is not a direct argument of Completions.create() in the OpenAI Python client, as the signature you pasted shows. Instead, pass it inside the extra_body dictionary: the client merges extra_body into the JSON request body, and vLLM reads extra parameters such as add_generation_prompt from the top level of that body. Example:
response = client.chat.completions.create(
    model="your-model",
    messages=messages,
    extra_body={"add_generation_prompt": False}
)
This is documented in vLLM’s OpenAI-compatible server docs.
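To combine this with the chat_template_kwargs and stop strings from your earlier attempt, everything vLLM-specific goes inside extra_body together. A minimal sketch of how the client assembles the request body (model name and messages are placeholders, and the merge shown is a simplification of what the client does internally):

```python
import json

# Standard OpenAI fields that the client serializes itself.
base_payload = {
    "model": "your-model",  # placeholder
    "messages": [{"role": "user", "content": "Hi"}],
}

# vLLM-specific extras: the client merges extra_body into the request body,
# so these become top-level JSON fields alongside model and messages.
extra_body = {
    "add_generation_prompt": False,
    "chat_template_kwargs": {"enable_thinking": False},
    "stop": ["\n", "<|im_start|>"],
}

payload = {**base_payload, **extra_body}
print(json.dumps(payload, indent=2))
```

Note that add_generation_prompt sits at the top level of extra_body, not inside chat_template_kwargs, which matches the request schema the server expects.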
Would you like more detail or troubleshooting tips?