Omitting the vision stack for gemma3?

The Gemma3 documentation mentions that you can omit the vision stack by loading the gemma-3-27b-it model with Gemma3ForCausalLM instead of Gemma3ForConditionalGeneration. Is there a way to do this with vLLM? I tried overriding the model's config.json, but I'm running into errors about missing configuration parameters:

AttributeError: 'Gemma3Config' object has no attribute 'num_hidden_layers'

Possibly similar to this GitHub issue: "AttributeError: 'Gemma3Config' object has no attribute 'vocab_size'" (Issue #36683, huggingface/transformers).
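For context, both errors are consistent with Gemma3Config being a composite config: the language-model parameters such as num_hidden_layers and vocab_size sit in a nested text_config section rather than at the top level, so code that expects a text-only config does not find them. Below is a minimal sketch of that structure using plain dictionaries. The helper name extract_text_config and the sample values are hypothetical, and the "gemma3_text" model type is an assumption about what a text-only config would declare. This only illustrates the config layout; it is not a confirmed way to make vLLM load the text-only model, since the checkpoint's weight names may still carry the multimodal prefixes.

```python
import copy


def extract_text_config(multimodal_config: dict) -> dict:
    """Build a text-only config dict from the nested ``text_config`` section.

    Hypothetical helper for illustration; not part of vLLM or transformers.
    """
    if "text_config" not in multimodal_config:
        raise KeyError("expected a nested 'text_config' section")
    # Copy so the original multimodal config is left untouched.
    text_config = copy.deepcopy(multimodal_config["text_config"])
    # Assumption: names a text-only Gemma3 config would declare.
    text_config["architectures"] = ["Gemma3ForCausalLM"]
    text_config.setdefault("model_type", "gemma3_text")
    return text_config


# Illustrative values only, NOT the real gemma-3-27b-it config.
example = {
    "architectures": ["Gemma3ForConditionalGeneration"],
    "model_type": "gemma3",
    "text_config": {
        "model_type": "gemma3_text",
        "num_hidden_layers": 4,
        "vocab_size": 1000,
    },
    "vision_config": {"model_type": "siglip_vision_model"},
}

flat = extract_text_config(example)
print("num_hidden_layers" in example)  # False: the key is nested, not top-level
print(flat["num_hidden_layers"])       # 4
```

Simply swapping the architecture name in the top-level config.json leaves num_hidden_layers nested under text_config, which matches the AttributeError above.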

Currently we don’t provide a way to allow users to load only the language model from a multimodal model, but this is indeed a feature that we’re considering supporting!
