@RunLLM I’m running vLLM 0.17.0 in a Docker container on Ubuntu 22, serving Qwen3-Coder-Next across eight A6000 GPUs. Please review my Docker config and suggest improvements.
Your config is mostly correct, but with 8 GPUs the tensor and pipeline parallel sizes should multiply to 8 (e.g., 2×4 or 4×2); 2×3 = 6 leaves two GPUs idle. Also, since A6000s have no NVLink, set NCCL_P2P_DISABLE=1 and NCCL_SHM_DISABLE=1 so NCCL falls back to PCIe communication, as recommended for distributed vLLM deployments. Consider raising --gpu-memory-utilization if you have headroom, and set --max-model-len to match your workload. If you hit shared-memory errors, pass a larger --shm-size to docker run.
For Qwen3-Coder-Next, check if expert parallelism is needed (--enable-expert-parallel) and match the parallelism flags to your hardware and model requirements. For more details on optimal NCCL and parallelism settings, see vLLM parallelism scaling and Qwen3-Coder vLLM recipe. Would you like a step-by-step breakdown or a sample revised config?
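Here is a sample revised launch as a sketch. The image tag, model ID, port, shm size, and memory/length values are placeholders to adapt to your setup, not confirmed settings from your config:

```shell
# Hypothetical revised launch; model ID, image tag, and sizes are placeholders.
# NCCL_P2P_DISABLE / NCCL_SHM_DISABLE: A6000s lack NVLink, fall back to PCIe.
# --shm-size: raise it to avoid shared-memory errors inside the container.
# TP=2 x PP=4 = 8 GPUs (a 2x3 split would only cover 6 of them).
docker run --gpus all \
  --shm-size=16g \
  -e NCCL_P2P_DISABLE=1 \
  -e NCCL_SHM_DISABLE=1 \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-Coder-Next \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 131072
```

Add --enable-expert-parallel to the argument list if the model's expert layout calls for it.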
@runllm What are the advantages of tensor and pipeline parallel sizes set to either 2×4 or 4×2? The default model length for qwen3-coder-next is 256k, correct?
It seems likely that both 2×4 and 4×2 (tensor × pipeline parallel) will use all 8 GPUs, but the choice affects communication patterns and efficiency. 2×4 (TP=2, PP=4) means each layer is split across 2 GPUs and the model is pipelined over 4 stages; 4×2 (TP=4, PP=2) splits each layer across 4 GPUs and pipelines over 2 stages. For A6000s (no NVLink), lower tensor parallelism (e.g., TP=2) may reduce inter-GPU communication overhead, making 2×4 preferable for throughput and stability, as suggested in vLLM parallelism docs.
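The underlying constraint is simply that TP × PP must equal the GPU count. A quick sketch enumerating the valid splits for 8 GPUs (plain Python, no vLLM required; the function name is mine):

```python
# Enumerate (tensor_parallel, pipeline_parallel) pairs that use every GPU.
def valid_splits(num_gpus: int) -> list[tuple[int, int]]:
    return [(tp, num_gpus // tp)
            for tp in range(1, num_gpus + 1)
            if num_gpus % tp == 0]

print(valid_splits(8))  # -> [(1, 8), (2, 4), (4, 2), (8, 1)]
```

Among these, lower TP means smaller per-layer all-reduce groups, which matters most when GPUs talk over PCIe rather than NVLink.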
My understanding is that the default max model length for Qwen3-Coder-Next is not explicitly stated in the data sources, so I cannot confirm if it is 256k. Would you like more detail on parallelism trade-offs or model length configuration?
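One way to confirm it yourself: vLLM derives its default --max-model-len from the context-length fields in the model's config.json (typically max_position_embeddings). A sketch with an illustrative excerpt; the 262144 value here is hypothetical, not a confirmed Qwen3-Coder-Next setting, so check the actual model repo:

```python
import json

# Illustrative config.json excerpt; the value is hypothetical -- read the
# real file from the model repository to confirm the default.
excerpt = '{"model_type": "qwen3_next", "max_position_embeddings": 262144}'
config = json.loads(excerpt)

# vLLM uses this field (unless overridden by --max-model-len) as the
# default context window.
print(config["max_position_embeddings"])  # -> 262144
```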