Deployment parameters for qwen3.5-4b?

Certainly! For Qwen3.5-4B on a single GPU (tp=1), you do not need --mm-encoder-tp-mode data. A minimal serving command would be:

vllm serve Qwen/Qwen3.5-4B \
  --tensor-parallel-size 1 \
  --enable-prefix-caching \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

The same command also serves multimodal (vision) requests, so no extra flag is needed for that. --mm-encoder-tp-mode data only controls how the vision encoder is distributed across tensor-parallel ranks, so with tp=1 it has no effect and can simply be omitted.
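If you script deployments for different GPU counts, the tp=1 rule can be captured in a small helper. This is a minimal bash sketch, not part of vLLM: build_vllm_args and VLLM_ARGS are hypothetical names, and it assumes that at tp>1 you would follow the multi-GPU recipes and add --mm-encoder-tp-mode data.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a vLLM tool): fills the global array VLLM_ARGS with
# arguments for `vllm serve Qwen/Qwen3.5-4B` at a given tensor-parallel size.
# Assumption: --mm-encoder-tp-mode data is only added when tp > 1, because with
# a single GPU there are no other ranks to distribute the vision encoder across.
build_vllm_args() {
  local tp="${1:-1}"   # tensor-parallel size, default 1 (single GPU)

  # Reject anything that is not a positive integer.
  if ! [[ "$tp" =~ ^[1-9][0-9]*$ ]]; then
    echo "tensor-parallel size must be a positive integer, got '$tp'" >&2
    return 1
  fi

  # Same flags as the single-GPU command; kept in an array so the JSON
  # passed to --speculative-config keeps its quoting intact.
  VLLM_ARGS=(
    serve Qwen/Qwen3.5-4B
    --tensor-parallel-size "$tp"
    --enable-prefix-caching
    --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
  )

  # Only with multiple TP ranks does the encoder placement mode matter.
  if (( tp > 1 )); then
    VLLM_ARGS+=(--mm-encoder-tp-mode data)
  fi
}
```

For example, build_vllm_args 1 && vllm "${VLLM_ARGS[@]}" launches the same single-GPU command as above, while build_vllm_args 2 appends the encoder flag. Using an array instead of a single string avoids re-quoting the JSON by hand.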

No additional parallelism flags are needed for small models on a single GPU. Would you like more examples for multi-GPU or larger models?