Critique my vLLM configuration for qwen3-coder-next

@RunLLM I’m running vLLM 0.17.0 in a Docker container on Ubuntu 22.04 and serving Qwen3-Coder-Next across eight A6000 GPUs. Please review my Docker Compose config and suggest improvements.

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    runtime: nvidia
    networks:
      - ai-network
    ipc: host
    ulimits:
      memlock:
        soft: -1
        hard: -1
    environment:
      HF_TOKEN: "${HF_TOKEN}"
      NCCL_DEBUG: "WARN"
      NCCL_SHM_DISABLE: "0"
      NCCL_P2P_DISABLE: "0"
      NCCL_IB_DISABLE: "1"
      NCCL_SOCKET_IFNAME: "eth0"
      NCCL_COMM_BLOCKING: "1"
      # https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3.5.html
    command: >
      --model Qwen/Qwen3-Coder-Next
      --tensor-parallel-size 2
      --pipeline-parallel-size 3
      --gpu-memory-utilization 0.85
      --host 0.0.0.0
      --port 8000
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
    ports:
      - "8000:8000"
    volumes:
      - hf_cache:/root/.cache/huggingface
      - triton_cache:/root/.triton
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

networks:
  ai-network:
    external: true

volumes:
  hf_cache:
  triton_cache:

Your config is mostly correct, but with 8 GPUs the tensor and pipeline parallel sizes should multiply to 8 (e.g., 2×4 or 4×2); your current 2×3 = 6 leaves two GPUs idle. Also, since A6000s have no NVLink, setting NCCL_P2P_DISABLE=1 and NCCL_SHM_DISABLE=1 can work around NCCL hangs by forcing communication over PCIe, as recommended for some distributed vLLM deployments. Consider raising --gpu-memory-utilization if you have headroom, and set --max-model-len explicitly to match your workload. If you hit shared-memory errors, add shm_size: to the Compose service (the Compose equivalent of docker run --shm-size). Finally, consider pinning the image to a specific tag rather than latest, so the container actually matches the vLLM version you tested against.
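A revised service fragment along those lines might look like the following. This is a sketch, not a definitive config: the 0.90 utilization, the 16gb shm size, and the 131072 max-model-len are placeholder assumptions you should tune to your workload; everything else keeps your existing flags.

```yaml
    shm_size: "16gb"            # assumption: generous; avoids NCCL shared-memory errors
    environment:
      HF_TOKEN: "${HF_TOKEN}"
      NCCL_DEBUG: "WARN"
      NCCL_P2P_DISABLE: "1"     # A6000s have no NVLink; fall back to PCIe
      NCCL_SHM_DISABLE: "1"
      NCCL_IB_DISABLE: "1"
      NCCL_SOCKET_IFNAME: "eth0"
    command: >
      --model Qwen/Qwen3-Coder-Next
      --tensor-parallel-size 2
      --pipeline-parallel-size 4
      --gpu-memory-utilization 0.90
      --max-model-len 131072
      --host 0.0.0.0
      --port 8000
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
```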

For Qwen3-Coder-Next, check whether expert parallelism (--enable-expert-parallel) is beneficial, and match the parallelism flags to your hardware and the model's architecture. For more detail on optimal NCCL and parallelism settings, see the vLLM parallelism-scaling docs and the Qwen3-Coder vLLM recipe. Would you like a step-by-step breakdown or a sample revised config?

Hint: Mention @RunLLM in the post for followups.

@RunLLM What are the trade-offs between setting tensor × pipeline parallel sizes to 2×4 versus 4×2? Also, the default max model length for Qwen3-Coder-Next is 256k, correct?

It seems likely that both 2×4 and 4×2 (tensor × pipeline parallel) use all 8 GPUs, but the choice changes the communication pattern. With 2×4 (TP=2, PP=4), each layer is split across 2 GPUs and the model is pipelined over 4 stages; with 4×2 (TP=4, PP=2), each layer is split across 4 GPUs and pipelined over 2 stages. Tensor parallelism requires an all-reduce across the TP group for every layer, while pipeline parallelism only passes activations between adjacent stages, so on A6000s (no NVLink, PCIe only) the lower-TP split (2×4) tends to reduce inter-GPU communication overhead, making it preferable for throughput and stability, as suggested in the vLLM parallelism docs.
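Since TP × PP must equal the GPU count, the valid splits for 8 GPUs are easy to enumerate. A small sketch (the helper `valid_splits` is illustrative, not a vLLM API):

```python
# Enumerate valid (tensor_parallel, pipeline_parallel) splits for a GPU count.
# vLLM's world size is tensor_parallel_size * pipeline_parallel_size, so the
# product must equal the number of GPUs you want to use.

def valid_splits(num_gpus: int) -> list[tuple[int, int]]:
    """Return all (tp, pp) pairs whose product equals num_gpus."""
    return [(tp, num_gpus // tp)
            for tp in range(1, num_gpus + 1)
            if num_gpus % tp == 0]

print(valid_splits(8))  # → [(1, 8), (2, 4), (4, 2), (8, 1)]
```

Note that your current 2×3 split does not appear in this list, which is why two GPUs would sit idle.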

My understanding is that the default max model length for Qwen3-Coder-Next is not explicitly stated in my data sources, so I cannot confirm whether it is 256k. Would you like more detail on parallelism trade-offs or model-length configuration?
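Rather than guessing, you can confirm the native context window by reading max_position_embeddings from the checkpoint's config.json on the Hugging Face model page. A sketch with hypothetical values (the real numbers come from the actual model repo):

```python
import json

# Hypothetical config.json excerpt -- substitute the real file from the
# model repo (e.g. the config.json under Qwen/Qwen3-Coder-Next on HF Hub).
sample_config = '{"max_position_embeddings": 262144, "rope_theta": 10000000}'

cfg = json.loads(sample_config)
ctx = cfg["max_position_embeddings"]
print(f"native context window: {ctx} tokens ({ctx // 1024}k)")
```

Whatever value you find there is the ceiling for --max-model-len; vLLM defaults to the model's maximum, which on 48 GB A6000s may force a lower --gpu-memory-utilization or a smaller explicit --max-model-len.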
