I have a problem with gemma3 and vLLM running in docker.
I am using this:
docker pull vllm/vllm-openai:latest
I have 2x 5090 and this is my docker run command.
["--model","google/gemma-3-12b-it","--served-model-name","gemma3","--dtype","auto","--max-model-len","8096","--max-num-seqs","64","--gpu-memory-utilization","0.94","--enable-prefix-caching","--enable-chunked-prefill","--swap-space","24","--tensor-parallel-size","2","--max-num-batched-tokens","8096","--disable-log-requests","--max-log-len","100"]
The problem is:
While chatting with the running LLM, everything works fine. But if the connection drops once and an error occurs, then after the connection comes back, the chat won't work anymore.
LiteLLM playground gives this error:
Error fetching response:Error: 400 litellm.BadRequestError: Hosted_vllmException - Conversation roles must alternate user/assistant/user/assistant/… Conversation roles must alternate user/assistant/user/assistant/… Received Model Group=gemma3
Available Model Group Fallbacks=None
When trying another chat client, without LiteLLM, the behaviour is the same.
After some error, or when the AI does not respond but the user types something anyway, the error appears. I think it's related to the conversation roles getting messed up (wrong order etc.) in the chat history. This has to be a problem between vLLM and Gemma3, not with any particular chat client, because two different clients behave the same way. Is this problem known with Gemma3 models, or with other models as well?
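For reference, one client-side workaround I've seen discussed: since Gemma's chat template rejects any history where the roles don't strictly alternate user/assistant, the history can be repaired before it is resent. This is a minimal sketch (my own assumption about a fix, not part of vLLM or LiteLLM; the function name `normalize_history` is made up) that merges consecutive same-role turns, such as a user typing twice after a failed response:

```python
def normalize_history(messages):
    """Repair a chat history so non-system roles strictly alternate
    user/assistant/user/..., as Gemma-style templates require.
    Consecutive same-role messages are merged; a leading assistant
    turn (left over from a failed exchange) is dropped."""
    fixed = []
    for msg in messages:
        role, content = msg["role"], msg["content"]
        if role == "system":
            # keep system prompts untouched
            fixed.append({"role": role, "content": content})
            continue
        if fixed and fixed[-1]["role"] == role:
            # same role twice in a row: merge contents into one turn
            fixed[-1]["content"] += "\n" + content
        else:
            fixed.append({"role": role, "content": content})
    # first non-system turn must be "user"; drop a stray assistant turn
    first = next((i for i, m in enumerate(fixed) if m["role"] != "system"), None)
    if first is not None and fixed[first]["role"] == "assistant":
        del fixed[first]
    return fixed
```

This only masks the symptom on the client side; the underlying question of why the history ends up with duplicated roles after a dropped connection still stands.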