vllm serve /autodl-fs/data/weight/DeepSeek-V3.1/ --trust-remote-code --max-num-seqs 16 --max-model-len 8192 --max-num-batched-tokens 8192 --gpu-memory-utilization 0.95 --tensor-parallel-size 8 --distributed_executor_backend “mp” --no-enable-prefix-caching --enable-expert-parallel --no-async-scheduling --default-chat-template-kwargs ‘{“enable_thinking”: false}’ --compilation_config ‘{“cudagraph_mode”: “FULL_DECODE_ONLY”}’
Your command mostly uses valid vLLM serve arguments, but there are some issues:
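- The documented spelling is `--distributed-executor-backend`, not `--distributed_executor_backend`.
- The documented spelling is `--compilation-config`, not `--compilation_config`.
- Use straight quotes, not curly quotes (`“ ”`, `‘ ’`). Strings inside the JSON must use straight double quotes (`"`), and in a POSIX shell the whole JSON value should be wrapped in straight single quotes (`'`) so the inner double quotes reach vLLM intact. The same applies to `“mp”`, which should be `"mp"` or just `mp`.
- All other flags are supported and documented for distributed, expert parallel, and compilation config usage in vLLM serve. See engine args and serve args.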
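The quoting problem can be checked without starting the server. The following is a minimal sketch, not part of vLLM: the helper names `straighten_quotes` and `is_valid_json` are hypothetical, and it only assumes that the JSON-valued flags must contain valid JSON. It shows that the pasted value with curly quotes fails to parse, while the same value with straight quotes parses.

```python
import json

# Map typographic (curly) quotes to their straight ASCII equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def straighten_quotes(text: str) -> str:
    """Hypothetical helper: replace curly quotes with straight ones."""
    return text.translate(CURLY_TO_STRAIGHT)


def is_valid_json(text: str) -> bool:
    """Hypothetical helper: return True if text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# The --compilation_config value as pasted in the question (curly quotes).
pasted = "{\u201ccudagraph_mode\u201d: \u201cFULL_DECODE_ONLY\u201d}"
fixed = straighten_quotes(pasted)

print(is_valid_json(pasted))  # curly quotes are not JSON string delimiters
print(is_valid_json(fixed))   # straight double quotes parse
print(fixed)
```

Running the same check on the `--default-chat-template-kwargs` value before pasting it into the command catches the same class of error.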
Would you like a corrected command or more detail on any specific flag?