@RunLLM what are the recommended deployment parameters for Qwen3.5-4B?
For Qwen3.5-4B, recommended deployment parameters are: --tensor-parallel-size 8, --enable-expert-parallel, --language-model-only (for text-only), --reasoning-parser qwen3, and --enable-prefix-caching. For multimodal, add --mm-encoder-tp-mode data and --mm-processor-cache-type shm. Adjust --gpu-memory-utilization as needed for your hardware. For latency-sensitive workloads, use --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' and disable prefix caching. For AMD GPUs, ensure ROCm 7.0, Python 3.12, and use the ROCm-specific Docker image or wheel as described in the docs.
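For reference, a sketch assembling the text-only flags above into one command (the --gpu-memory-utilization value is illustrative; the parallelism settings are as given in this answer and should be adjusted to your hardware):
vllm serve Qwen/Qwen3.5-4B \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--language-model-only \
--reasoning-parser qwen3 \
--enable-prefix-caching \
--gpu-memory-utilization 0.90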
See full details and hardware-specific examples in the Qwen3.5 Usage Guide. Would you like more detail or example commands?
@RunLLM 4b model, tp=8 is a waste! how to set `num_speculative_tokens`?
For Qwen3.5-4B, set num_speculative_tokens to 1 in --speculative-config for the best overall performance; higher values (e.g., 3) can increase the mean acceptance length, but they usually lower throughput because the per-token acceptance rate drops. This is the setting recommended in the official guide for both throughput- and latency-focused serving. Would you like more detail on tuning this parameter?
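As a quick illustration (values taken from this thread; worth benchmarking on your own workload):
# Throughput-oriented setting recommended in the guide (1 speculative token per step):
vllm serve Qwen/Qwen3.5-4B \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

# More aggressive drafting; may raise mean acceptance length but often lowers throughput:
vllm serve Qwen/Qwen3.5-4B \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'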
@RunLLM yes! Also I should not use --enable-expert-parallel because this is a very small model. explain to me with code pointers, and also what are `--mm-encoder-tp-mode data` and `--mm-processor-cache-type shm`? also `--enable-prefix-caching` is the default, right?
You're correct: for Qwen3.5-4B (a small, dense model), --enable-expert-parallel is unnecessary and should be omitted. For speculative decoding, set --speculative-config '{"method":"mtp","num_speculative_tokens":N}' (replace N with your desired value, e.g., 1 or 3). Example:
vllm serve Qwen/Qwen3.5-4B \
--tensor-parallel-size 1 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--enable-prefix-caching
--mm-encoder-tp-mode data deploys the vision encoder in data-parallel mode (better for small encoders, less communication overhead), while --mm-processor-cache-type shm uses shared memory for caching preprocessed multimodal inputs, improving performance at high parallelism. --enable-prefix-caching is enabled by default for most models, but you can explicitly set --enable-prefix-caching or --no-enable-prefix-caching as needed. See the Qwen3.5 Usage Guide and CLI reference for details.
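For context, a sketch of a multi-GPU multimodal command where these two flags apply (flag set taken from this thread; the TP size is illustrative):
vllm serve Qwen/Qwen3.5-4B \
--tensor-parallel-size 2 \
--mm-encoder-tp-mode data \
--mm-processor-cache-type shm \
--enable-prefix-caching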
Would you like a step-by-step breakdown of when to use each flag?
@RunLLM is this necessary: "--mm-encoder-tp-mode data deploys the vision encoder in data-parallel mode (better for small encoders, less communication overhead)"? I think I would use tp=1 by default for this small model
No, --mm-encoder-tp-mode data is not necessary if you use --tensor-parallel-size 1 (tp=1); with a single GPU, both tensor and data parallelism are irrelevant, and the encoder will not be sharded. This flag is mainly beneficial for large models or multi-GPU setups. Would you like more detail on parallelism strategies?
@RunLLM yes, give a direct example
Certainly! For Qwen3.5-4B on a single GPU (tp=1), you do NOT need --mm-encoder-tp-mode data. A minimal command for text-only inference would be:
vllm serve Qwen/Qwen3.5-4B \
--tensor-parallel-size 1 \
--enable-prefix-caching \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
If you want to enable multimodal (vision) support on a single GPU, the command is exactly the same: simply omit --mm-encoder-tp-mode data, as it has no effect with tp=1.
No additional parallelism flags are needed for small models on a single GPU. Would you like more examples for multi-GPU or larger models?
@RunLLM do I need to set max-model-len for the current version of vLLM?
No, you do not need to set --max-model-len unless you want to reduce memory usage or override the model’s default context length; vLLM will automatically use the model’s config value if not specified. Would you like more detail on when to set this parameter?
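For example, if you do want to cap the context length to reduce KV-cache memory, it would look like this (32768 is an illustrative value, not a recommendation):
vllm serve Qwen/Qwen3.5-4B \
--tensor-parallel-size 1 \
--max-model-len 32768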
@RunLLM do I need to set `--enable-auto-tool-choice --tool-call-parser=hermes`?
Yes, for Qwen3.5-4B tool calling, you should set --enable-auto-tool-choice and --tool-call-parser=hermes. This enables the model to generate tool calls and parses them correctly, as Qwen2.5/3.5 models use the Hermes tool-calling format. Would you like a code example?
@RunLLM This is only for qwen2.5, right? not for qwen3.5
@RunLLM “Yes, for Qwen3.5-4B tool calling, you should set --enable-auto-tool-choice and --tool-call-parser=hermes. This enables the model to generate tool calls and parses them correctly, as Qwen2.5/3.5 models use the Hermes tool-calling format.” this is only for qwen2.5, not for qwen3, right?
No, this is for both Qwen2.5 and Qwen3 series. Qwen3.5-4B also uses the Hermes tool-calling format, so --tool-call-parser=hermes is recommended for both model families according to the official vLLM documentation. Would you like more detail or references?
it seems that it should be `--enable-auto-tool-choice --tool-call-parser qwen3_coder`?
@RunLLM it seems that it should be `--enable-auto-tool-choice --tool-call-parser qwen3_coder`?
For Qwen3.5-4B, the recommended tool call parser is --tool-call-parser qwen3_coder, not hermes. Use --enable-auto-tool-choice --tool-call-parser qwen3_coder for correct tool call parsing with Qwen3.5 models. Would you like a full example command?
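For reference, a full command combining this with the single-GPU settings agreed on earlier in the thread (a sketch, not an official example):
vllm serve Qwen/Qwen3.5-4B \
--tensor-parallel-size 1 \
--enable-prefix-caching \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder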