For Qwen3.5-4B on a single GPU (tp=1), you do not need --mm-encoder-tp-mode data. A minimal command for text-only inference would be:
vllm serve Qwen/Qwen3.5-4B \
--tensor-parallel-size 1 \
--enable-prefix-caching \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
For multimodal (vision) input on a single GPU, the command is exactly the same as the one above. --mm-encoder-tp-mode data only affects how the vision encoder is split across tensor-parallel ranks, so it has no effect with tp=1 and you can simply omit it.
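As a rough sketch of that rule, a launcher script could add --mm-encoder-tp-mode data only when the tensor-parallel size is greater than 1. The build_cmd helper below is a hypothetical name used for illustration, not part of vLLM. It only prints the command instead of starting the server, and it leaves out --speculative-config to keep the quoting simple.

```shell
#!/bin/sh
# Hypothetical helper: prints a vllm serve command for a given tensor-parallel size.
build_cmd() {
    tp="$1"
    cmd="vllm serve Qwen/Qwen3.5-4B --tensor-parallel-size $tp --enable-prefix-caching"
    # The encoder TP mode only matters when there is more than one TP rank,
    # so the flag is appended only for tp > 1 (an assumption of this sketch).
    if [ "$tp" -gt 1 ]; then
        cmd="$cmd --mm-encoder-tp-mode data"
    fi
    printf '%s\n' "$cmd"
}

build_cmd 1   # single GPU: no encoder TP flag
build_cmd 2   # two GPUs: flag appended
```

With tp=1 the printed command has no --mm-encoder-tp-mode flag, which matches the single-GPU commands above.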
No additional parallelism flags are needed for small models on a single GPU. Would you like more examples for multi-GPU or larger models?