I noticed there's a min_tokens parameter in vLLM inference. This sampling parameter is exactly what I need. However, I am serving vLLM as an OpenAI-compatible API server, and clients use the OpenAI API to request results. How can I pass the min_tokens parameter via the OpenAI API?
I am calling it with the following code snippet, but it looks like this parameter is not allowed:
response = await client.chat.completions.create(
model="deepseek-reasoner",
messages=messages,
stream=True,
    min_tokens=4096,
)
TypeError: AsyncCompletions.create() got an unexpected keyword argument 'min_tokens'. Did you mean 'max_tokens'?
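One approach that may work: the official OpenAI Python client rejects unknown keyword arguments, but it accepts an extra_body parameter whose fields are merged into the request JSON as-is, and vLLM's OpenAI-compatible server picks up extra fields like min_tokens as sampling parameters. A minimal sketch (the model name and message are placeholders from the question; the actual client call is commented out since it needs a running server):

```python
# Build the request kwargs. min_tokens is not part of the OpenAI schema,
# so it goes into extra_body, which the client serializes into the
# request body unchanged; vLLM reads it as a sampling parameter.
request_kwargs = {
    "model": "deepseek-reasoner",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
    "extra_body": {"min_tokens": 4096},
}

# With an AsyncOpenAI client pointed at the vLLM server:
# response = await client.chat.completions.create(**request_kwargs)
```

This keeps the standard OpenAI fields as normal keyword arguments and confines the vLLM-specific parameter to extra_body, so the same client code still works against servers that ignore unknown fields.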