@RunLLM what are the recommended deployment parameters for Qwen3.5-4B?
For Qwen3.5-4B, recommended deployment parameters are: --tensor-parallel-size 8, --enable-expert-parallel, --language-model-only (for text-only), --reasoning-parser qwen3, and --enable-prefix-caching. For multimodal, add --mm-encoder-tp-mode data and --mm-processor-cache-type shm. Adjust --gpu-memory-utilization as needed for your hardware. For latency-sensitive workloads, use --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' and disable prefix caching. For AMD GPUs, ensure ROCm 7.0, Python 3.12, and use the ROCm-specific Docker image or wheel as described in the docs.
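For reference, a sketch assembling the text-only flags above into one command (the --gpu-memory-utilization value is illustrative; the parallelism settings are as given in this answer and should be adjusted to your hardware):
vllm serve Qwen/Qwen3.5-4B \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--language-model-only \
--reasoning-parser qwen3 \
--enable-prefix-caching \
--gpu-memory-utilization 0.90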
See full details and hardware-specific examples in the Qwen3.5 Usage Guide. Would you like more detail or example commands?
@RunLLM 4b model, tp=8 is a waste! how to set `num_speculative_tokens`?
For Qwen3.5-4B, set num_speculative_tokens to 1 in --speculative-config for the best overall performance; higher values (e.g., 3) can increase the mean acceptance length, but they usually lower throughput because the per-token acceptance rate drops. This is the setting recommended in the official guide for both throughput- and latency-focused serving. Would you like more detail on tuning this parameter?
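As a quick illustration (values taken from this thread; worth benchmarking on your own workload):
# Throughput-oriented setting recommended in the guide (1 speculative token per step):
vllm serve Qwen/Qwen3.5-4B \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

# More aggressive drafting; may raise mean acceptance length but often lowers throughput:
vllm serve Qwen/Qwen3.5-4B \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'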
@RunLLM yes! Also I should not use --enable-expert-parallel because this is a very small model. explain to me with code pointers, and also what are `--mm-encoder-tp-mode data` and `--mm-processor-cache-type shm`? also `--enable-prefix-caching` is the default, right?
You're correct: for Qwen3.5-4B (a small, dense model), --enable-expert-parallel is unnecessary and should be omitted. For speculative decoding, set --speculative-config '{"method":"mtp","num_speculative_tokens":N}' (replace N with your desired value, e.g., 1 or 3). Example:
vllm serve Qwen/Qwen3.5-4B \
--tensor-parallel-size 1 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--enable-prefix-caching
--mm-encoder-tp-mode data deploys the vision encoder in data-parallel mode (better for small encoders, less communication overhead), while --mm-processor-cache-type shm uses shared memory for caching preprocessed multimodal inputs, improving performance at high parallelism. --enable-prefix-caching is enabled by default for most models, but you can explicitly set --enable-prefix-caching or --no-enable-prefix-caching as needed. See the Qwen3.5 Usage Guide and CLI reference for details.
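For context, a sketch of a multi-GPU multimodal command where these two flags apply (flag set taken from this thread; the TP size is illustrative):
vllm serve Qwen/Qwen3.5-4B \
--tensor-parallel-size 2 \
--mm-encoder-tp-mode data \
--mm-processor-cache-type shm \
--enable-prefix-caching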
Would you like a step-by-step breakdown of when to use each flag?
@RunLLM is this necessary: "--mm-encoder-tp-mode data deploys the vision encoder in data-parallel mode (better for small encoders, less communication overhead)"? I think I would use tp=1 by default for this small model
No, --mm-encoder-tp-mode data is not necessary if you use --tensor-parallel-size 1 (tp=1); with a single GPU, both tensor and data parallelism are irrelevant, and the encoder will not be sharded. This flag is mainly beneficial for large models or multi-GPU setups. Would you like more detail on parallelism strategies?
@RunLLM yes, give a direct example
Certainly! For Qwen3.5-4B on a single GPU (tp=1), you do NOT need --mm-encoder-tp-mode data. A minimal command for text-only inference would be:
vllm serve Qwen/Qwen3.5-4B \
--tensor-parallel-size 1 \
--enable-prefix-caching \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
If you want to enable multimodal (vision) support on a single GPU, the command is exactly the same: simply omit --mm-encoder-tp-mode data, as it has no effect with tp=1.
No additional parallelism flags are needed for small models on a single GPU. Would you like more examples for multi-GPU or larger models?
@RunLLM do I need to set max-model-len for the current version of vLLM?
No, you do not need to set --max-model-len unless you want to reduce memory usage or override the model’s default context length; vLLM will automatically use the model’s config value if not specified. Would you like more detail on when to set this parameter?
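For example, if you do want to cap the context length to reduce KV-cache memory, it would look like this (32768 is an illustrative value, not a recommendation):
vllm serve Qwen/Qwen3.5-4B \
--tensor-parallel-size 1 \
--max-model-len 32768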
@RunLLM do I need to set `--enable-auto-tool-choice --tool-call-parser=hermes`?
Yes, for Qwen3.5-4B tool calling, you should set --enable-auto-tool-choice and --tool-call-parser=hermes. This enables the model to generate tool calls and parses them correctly, as Qwen2.5/3.5 models use the Hermes tool-calling format. Would you like a code example?
@RunLLM This is only for qwen2.5, right? not for qwen3.5
@RunLLM “Yes, for Qwen3.5-4B tool calling, you should set --enable-auto-tool-choice and --tool-call-parser=hermes. This enables the model to generate tool calls and parses them correctly, as Qwen2.5/3.5 models use the Hermes tool-calling format.” this is only for qwen2.5, not for qwen3, right?
No, this is for both Qwen2.5 and Qwen3 series. Qwen3.5-4B also uses the Hermes tool-calling format, so --tool-call-parser=hermes is recommended for both model families according to the official vLLM documentation. Would you like more detail or references?
it seems that it should be `--enable-auto-tool-choice --tool-call-parser qwen3_coder`?
@RunLLM it seems that it should be `--enable-auto-tool-choice --tool-call-parser qwen3_coder`?
For Qwen3.5-4B, the recommended tool call parser is --tool-call-parser qwen3_coder, not hermes. Use --enable-auto-tool-choice --tool-call-parser qwen3_coder for correct tool call parsing with Qwen3.5 models. Would you like a full example command?
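For reference, a full command combining this with the single-GPU settings agreed on earlier in the thread (a sketch, not an official example):
vllm serve Qwen/Qwen3.5-4B \
--tensor-parallel-size 1 \
--enable-prefix-caching \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder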