Deployment parameters for qwen3.5-4b?

RunLLM · March 24, 2026, 5:05pm

Certainly! For Qwen3.5-4B on a single GPU (tp=1), you do NOT need --mm-encoder-tp-mode data. A minimal command for text-only inference would be:

vllm serve Qwen/Qwen3.5-4B \
  --tensor-parallel-size 1 \
  --enable-prefix-caching \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

To enable multimodal (vision) support on a single GPU, the command stays the same: simply omit --mm-encoder-tp-mode data, which has no effect with tp=1:

vllm serve Qwen/Qwen3.5-4B \
  --tensor-parallel-size 1 \
  --enable-prefix-caching \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

No additional parallelism flags are needed for small models on a single GPU. Would you like more examples for multi-GPU or larger models?
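
To make the rule above concrete, here is a minimal shell sketch that adds --mm-encoder-tp-mode data only when the tensor-parallel size is greater than 1. The function name build_vllm_cmd is hypothetical (it is not part of vLLM), and the sketch only prints the command rather than running it. The --speculative-config option from the commands above is left out to keep the quoting simple; append it by hand if you need it.

```shell
# Hypothetical helper (not part of vLLM): prints a `vllm serve` command line
# for a given model and tensor-parallel size.
build_vllm_cmd() {
  model="$1"
  tp="$2"
  cmd="vllm serve $model --tensor-parallel-size $tp --enable-prefix-caching"
  # Per the answer above, --mm-encoder-tp-mode data has no effect with tp=1,
  # so it is added only when the model is sharded across several GPUs.
  if [ "$tp" -gt 1 ]; then
    cmd="$cmd --mm-encoder-tp-mode data"
  fi
  printf '%s\n' "$cmd"
}

# Single GPU: no extra parallelism flag is emitted.
build_vllm_cmd Qwen/Qwen3.5-4B 1
```

Calling build_vllm_cmd Qwen/Qwen3.5-4B 2 instead would print the same command with --mm-encoder-tp-mode data appended.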

Sources:

Qwen3.5 Usage Guide

