Issues with Voxtral models and omni

I’m trying to set up Voxtral models for STT and TTS. I’ve run into some issues and could use some advice.

With `mistralai/Voxtral-4B-TTS-2603` I’m able to launch vLLM Omni, but it doesn’t respect the memory limit. I use a Docker container, and the command is `vllm serve mistralai/Voxtral-4B-TTS-2603 --omni --gpu-memory-utilization 0.2`. However, vLLM seems to ignore the last parameter and still consumes 65 out of 80 GB of memory. In the logs I can see that it parses the argument, but it doesn’t seem to have any effect. Why? And what’s the right way to do it?

Then with `mistralai/Voxtral-Mini-4B-Realtime-2602` I’m not able to start it at all. The current Docker container seems to have an old transformers version (why?). After installing it directly with uv, I now get the error: “ValueError: Model class VoxtralRealtimeForConditionalGeneration not found in diffusion model registry”. I have cuda-13, python-13, transformers-5.5.3, vllm-0.19.0+cu130, vllm-omni-0.19.0rc2.dev88+g73fb68ad2. I couldn’t find any information on that issue. What am I missing?

Thank you in advance

For the Voxtral TTS memory issue: In multi-stage Omni models, the `--gpu-memory-utilization` CLI flag was previously ignored if the YAML config set its own value, due to how configs were merged. This was fixed so that explicit CLI flags now override YAML, but you must use a recent vllm-omni version with the fix applied. If you’re still seeing the issue, make sure you’re not using a stale config or an old vllm-omni version; see the fix titled “fix --gpu-memory-utilization CLI override”.
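To illustrate the precedence described above, here is a minimal sketch of "explicit CLI values win over YAML". It is not vllm-omni's actual merge code: the function name `merge_stage_config`, the dictionary layout, and the YAML value of 0.8 are all hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Optional


def merge_stage_config(
    yaml_cfg: Dict[str, Any], cli_args: Dict[str, Optional[Any]]
) -> Dict[str, Any]:
    """Hypothetical helper: merge a stage's YAML config with CLI arguments.

    A CLI value of None means "flag not passed", so the YAML value is kept.
    Any explicitly passed CLI value overrides the YAML value.
    """
    merged = dict(yaml_cfg)  # copy so the YAML defaults are not mutated
    for key, value in cli_args.items():
        if value is not None:
            merged[key] = value
    return merged


# Hypothetical stage defaults from YAML (0.8 is an assumed value).
yaml_cfg = {"gpu_memory_utilization": 0.8, "dtype": "bfloat16"}
# The user passed --gpu-memory-utilization 0.2 and did not pass --dtype.
cli_args = {"gpu_memory_utilization": 0.2, "dtype": None}

merged = merge_stage_config(yaml_cfg, cli_args)
print(merged)
```

With the buggy behavior described above, the YAML value would have survived the merge; with the fixed precedence, the stage ends up with 0.2 while the untouched `dtype` keeps its YAML default.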

For the Voxtral-Mini-4B-Realtime-2602 error: This model requires the `--omni` flag and must be launched with `vllm-omni serve`, not `vllm serve`. Also, the error about “VoxtralRealtimeForConditionalGeneration not found” suggests a version mismatch or missing model registry entry, possibly due to an outdated transformers or vllm-omni install. Make sure you’re using compatible versions and follow the official installation instructions. Would you like more detail on troubleshooting or Docker setup?

Concerning Voxtral TTS, I tried running it in the same vllm 0.19 environment to check whether the memory utilization issue is fixed. Sadly, in this environment it doesn’t work at all. I get `RuntimeError: Orchestrator initialization failed: 'VoxtralTTSConfig' object has no attribute 'text_config'`.

As for the Voxtral STT, I tried `VLLM_DISABLE_COMPILE_CACHE=1 vllm-omni serve mistralai/Voxtral-Mini-4B-Realtime-2602 --omni --port 8001 --dtype bfloat16 --enforce-eager --gpu-memory-utilization 0.2` and it still gives me “ValueError: Model class VoxtralRealtimeForConditionalGeneration not found in diffusion model registry.”

I was hoping that the latest 0.19.0 version in combination with 0.19.0rc2 for omni would produce better results. Perhaps I’ll try again when omni 0.19.0 is released and the updated docker container is available.

For Voxtral-Mini-4B-Realtime-2602, it turned out that it works without omni. The next problem is that the Docker image includes an old version of the transformers package. It’s not the first time I’ve run into a model that is too new and requires a newer version of the transformers package. Why is the Docker image shipped with an old version?
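For anyone comparing environments, a quick way to see which package versions a container or virtualenv actually ships is to query the installed distribution metadata. This is only a sketch: the helper name `installed_versions` is made up, and the distribution names (in particular `vllm-omni`) are assumed from the version strings quoted earlier in this thread.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    result: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this environment.
            result[name] = None
    return result


# Distribution names are assumptions based on the versions listed above.
versions = installed_versions(["transformers", "vllm", "vllm-omni"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this inside the Docker image and again in the uv environment makes it easy to spot which of the two has the older transformers.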