Problem with Gemma3 and vLLM

I have a problem with gemma3 and vLLM running in docker.

I am using this:
docker pull vllm/vllm-openai:latest

I have 2x 5090 and this is my docker run command.
["--model","google/gemma-3-12b-it","--served-model-name","gemma3","--dtype","auto","--max-model-len","8096","--max-num-seqs","64","--gpu-memory-utilization","0.94","--enable-prefix-caching","--enable-chunked-prefill","--swap-space","24","--tensor-parallel-size","2","--max-num-batched-tokens","8096","--disable-log-requests","--max-log-len","100"]

The problem is:

While chatting with the running LLM, everything works fine. But once the connection drops and an error occurs, and then the connection comes back, the chat won't work anymore.

LiteLLM playground gives this error:
Error fetching response:Error: 400 litellm.BadRequestError: Hosted_vllmException - Conversation roles must alternate user/assistant/user/assistant/… Conversation roles must alternate user/assistant/user/assistant/… Received Model Group=gemma3
Available Model Group Fallbacks=None

When trying another chat client, without LiteLLM, the behaviour is the same.
After some error, or when the AI does not respond but the user types something anyway, the error appears. I think it's related to the conversation roles getting messed up (wrong order etc.) in the chat history. This has to be a problem between vLLM and gemma3, not with any particular chat client, because two different clients behave the same way. Is this problem known with gemma3 models, or with other models as well?

Yes, this is a known issue with Gemma 3 models (and some others) in vLLM: the chat template enforces strict alternation between user and assistant roles in the conversation history. If the message sequence does not strictly alternate (e.g., two user messages in a row, or after a connection error the history is malformed), vLLM will return a 400 error: “Conversation roles must alternate user/assistant/user/assistant/…” This is not unique to Gemma 3; other models with strict chat templates (like Mistral) also exhibit this behavior. The client (e.g., LiteLLM playground) is responsible for maintaining a valid alternating message history, as vLLM does not auto-correct or skip invalid sequences. See details in this issue and here.

To avoid this, ensure your client always sends a valid alternating sequence of user/assistant messages, especially after errors or dropped connections. If you need more flexible role ordering, vLLM currently does not support it for Gemma 3 and similar models unless you provide a custom chat template that relaxes this rule, but this is not recommended as it may degrade model performance. Would you like more detail on how to implement a workaround or debug your client’s message history?

Sources:

Hint: Mention @RunLLM in the post for followups.

Yes, @RunLLM, provide all possible solutions for this problem.

Here are all possible solutions, based strictly on the data sources:

  1. Fix Client Message History: Ensure your client (e.g., LiteLLM playground or any chat frontend) always sends a strictly alternating sequence of user/assistant messages. After a connection error or failed response, do not append extra user messages without an intervening assistant message. This is the most robust and recommended solution for Gemma 3 and similar models (source, source).

  2. Custom Chat Template: If you need to allow non-alternating roles (e.g., two user messages in a row), you can provide a custom chat template to vLLM that relaxes the alternation rule. However, this is not recommended, as it may degrade model performance and is not officially supported for Gemma 3 (source, source).

  3. Reset Conversation on Error: After a connection error or malformed conversation, reset the chat history and start a new conversation to avoid role alternation issues (source).

  4. Check for Client Bugs: Some clients may mishandle message history after errors. Review your client code to ensure it does not append user messages after a failed assistant response (source).
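Solutions 1 and 4 above amount to sanitizing the message history on the client side before every request. A minimal sketch, assuming a plain list of OpenAI-style message dicts (the merge-consecutive-turns strategy is one possible repair policy, not anything vLLM itself does):

```python
def sanitize_history(messages):
    """Merge consecutive same-role messages so that user/assistant turns
    strictly alternate, as Gemma 3's chat template requires.
    A leading system message is passed through untouched."""
    out = []
    for msg in messages:
        if out and out[-1]["role"] == msg["role"]:
            # Collapse e.g. two user messages left behind by a dropped
            # connection into a single turn.
            out[-1]["content"] += "\n" + msg["content"]
        else:
            out.append(dict(msg))
    return out

history = [
    {"role": "user", "content": "Hello"},
    {"role": "user", "content": "Are you there?"},  # retry after an error
    {"role": "assistant", "content": "Yes, I am."},
]
print(sanitize_history(history))
```

Running this sanitizer over the history right before each `/v1/chat/completions` call keeps the 400 error from surfacing even when the client's history is left in a bad state.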

Would you like a step-by-step guide for any of these solutions?


@RunLLM the solution could be here:

You should use system role, not developer

  1. Gemma3 is trained on bfloat16, so changing the dtype may result in numeric instability. Try removing that argument so that bfloat16 is used automatically.

So my current docker run command is below; how should I change it? What should the dtype be?
What about the role, where does that come from? What does it mean?

docker run -d --gpus all --ipc=host --network host -p 8000:8000 -v /home/ubuntu/vllm_models:/root/.cache/huggingface -e HUGGING_FACE_HUB_TOKEN="" -e VLLM_SLEEP_WHEN_IDLE=1 --shm-size=10g --ulimit memlock=-1 --ulimit stack=67108864 --restart unless-stopped --name vllm_gemma-3-12b-it-tp2-prod3_lite vllm/vllm-openai:latest --model google/gemma-3-12b-it --served-model-name gemma3 --dtype auto --max-model-len 8096 --max-num-seqs 64 --gpu-memory-utilization 0.94 --enable-prefix-caching --enable-chunked-prefill --swap-space 24 --tensor-parallel-size 2 --max-num-batched-tokens 8096 --disable-log-requests --max-log-len 100

For Gemma 3, you should use --dtype auto (or omit --dtype entirely) so vLLM will use bfloat16 automatically, which is required for numerical stability. Do not set --dtype float16 or --dtype half, as this will cause empty or unstable outputs (source, source).

The “role” refers to the message role in the chat API payload (e.g., “system”, “user”, “assistant”). You should use “system” for system prompts, not “developer”—this is set in your client code, not in the Docker command. For example, your chat messages should look like:

[
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "Hello!"}
]

(source, source)
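As a sketch of what a correct request payload looks like from Python (the port and served model name are taken from the docker command above; the `roles_alternate` helper is illustrative, not part of any API):

```python
import json

payload = {
    "model": "gemma3",  # matches --served-model-name
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
}

def roles_alternate(messages):
    """Check that, after an optional system message, roles strictly
    alternate user/assistant/user/... as Gemma 3's template expects."""
    turns = [m["role"] for m in messages if m["role"] != "system"]
    return all(r == ("user" if i % 2 == 0 else "assistant")
               for i, r in enumerate(turns))

assert roles_alternate(payload["messages"])
print(json.dumps(payload, indent=2))
# POST this body to http://localhost:8000/v1/chat/completions
```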

Would you like a step-by-step example of a correct client request?


@RunLLM dtype auto didn't help. Is there a chat template for gemma3, then? Looks like this thing can't be fixed; vLLM is too strict.

Yes, there are chat templates for Gemma 3, but vLLM enforces strict alternation of user/assistant roles in the conversation history, as defined by the chat template. The default Gemma 3 chat template is included in the model’s tokenizer config, and you can also provide a custom template if needed. However, even with a custom template, vLLM will still enforce the alternation rule unless you explicitly modify the template to relax this requirement (source, source).

If you want to use a custom chat template, you can pass it with the --chat-template argument when starting vLLM. Example templates and instructions are available in the vLLM examples directory. Would you like a step-by-step guide to using or customizing a chat template for Gemma 3?


I can't continue using vLLM because of this bug.
vLLM is not compatible with gemma3, and maybe some other models.

This "Conversation roles must alternate" problem with multiple different chat applications just breaks the user experience. I am not able to fix this. It happens with OpenWebUI too: if a connection to vLLM is cut during a chat and then restored, the chat can't continue.

I need to go back to Ollama, which I hate so much. Let me know when you have fixed this.

This is not true. The alternation is enforced by the chat template, not by vLLM itself; if you change the chat template then this should go away.

How should I change the chat template for gemma3 when running the **vllm/vllm-openai:latest** image?
It looks like a template is needed, because OpenWebUI, the Drupal chatbot, LiteLLM, and maybe many others can't handle error messages: after one, the chat will not work and needs a history clear/refresh.

To change the chat template for Gemma 3 when running vllm/vllm-openai:latest, provide your custom template using the --chat-template argument in your Docker run command. The template should be a .jinja file that matches your desired conversation formatting and can relax the strict alternation rule if needed. Example usage:

docker run ... vllm/vllm-openai:latest \
  --model google/gemma-3-12b-it \
  --chat-template /path/to/your_custom_template.jinja

You can find example templates and guidance in the vLLM examples directory. If your clients (OpenWebUI, Drupal chatbot, LiteLLM, etc.) have trouble after errors, a custom template that is less strict about role alternation may help, but be aware this could affect model performance.
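The alternation check lives in the template itself. As an illustration only (the real Gemma 3 template shipped with the model's tokenizer config is longer, and removing the check may hurt output quality), the strict version contains a `raise_exception` guard that a relaxed template simply omits:

```jinja
{#- Illustrative fragment, not the full Gemma 3 template. -#}
{%- for message in messages -%}
  {#- The strict template raises here when roles do not alternate:
      {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
        {{ raise_exception('Conversation roles must alternate user/assistant/...') }}
      {% endif %}
      A relaxed template drops that guard and just renders each turn. -#}
<start_of_turn>{{ 'model' if message['role'] == 'assistant' else 'user' }}
{{ message['content'] }}<end_of_turn>
{%- endfor -%}
<start_of_turn>model
```

Save a relaxed template as a `.jinja` file, mount it into the container, and pass it via `--chat-template` as shown above.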

Would you like a step-by-step guide to writing or modifying a Gemma 3 chat template?
