vllm serve starts bringing up the service, but it fails with an error at the very end

vllm serve /autodl-fs/data/weight/DeepSeek-V3.1/ --trust-remote-code --max-num-seqs 16 --max-model-len 8192 --max-num-batched-tokens 8192 --gpu-memory-utilization 0.95 --tensor-parallel-size 8 --distributed_executor_backend “mp” --no-enable-prefix-caching --enable-expert-parallel --no-async-scheduling --default-chat-template-kwargs ‘{“enable_thinking”: false}’ --compilation_config ‘{“cudagraph_mode”: “FULL_DECODE_ONLY”}’

Your command mostly uses valid vLLM serve arguments, but there are some issues:

  • The documented spelling is --distributed-executor-backend (hyphens), not --distributed_executor_backend.
  • Likewise, the documented spelling is --compilation-config, not --compilation_config.
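  • Use straight ASCII quotes, not curly quotes (“ ” ‘ ’). Inside the JSON, keys and string values need straight double quotes ("). Around each whole JSON value, use straight single quotes (') so the shell passes it to vLLM as a single argument with the inner double quotes intact.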
  • The remaining flags are supported and documented for distributed, expert-parallel, and compilation-config usage in vllm serve. See the vLLM engine args and serve args documentation.
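
The quoting and flag-spelling points above can be checked before launching anything. The following is a minimal sketch, not part of vLLM: the helper name normalize_command is hypothetical, and only the flags from the command that carry quoted values are included for brevity. It replaces curly quotes with straight ones, rewrites underscores in flag names to hyphens, splits the line the way a POSIX shell would, and confirms that each JSON-valued flag receives parseable JSON.

```python
import json
import shlex

# Map typographic (curly) quotes to their straight ASCII equivalents.
QUOTE_MAP = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"})

# Flags from the command above whose value is expected to be a JSON object.
JSON_FLAGS = {"--default-chat-template-kwargs", "--compilation-config"}


def normalize_command(raw: str) -> list[str]:
    """Return the argument list a POSIX shell would produce after cleanup.

    Hypothetical helper for illustration; it is not a vLLM API.
    """
    tokens = shlex.split(raw.translate(QUOTE_MAP))
    fixed = []
    for tok in tokens:
        if tok.startswith("--"):
            # Rewrite underscores only in the flag name, not in a "=value" part.
            name, sep, value = tok.partition("=")
            tok = name.replace("_", "-") + sep + value
        fixed.append(tok)
    # Fail early (json.JSONDecodeError) if a JSON-valued flag gets invalid JSON.
    for flag, value in zip(fixed, fixed[1:]):
        if flag in JSON_FLAGS:
            json.loads(value)
    return fixed


# Subset of the original command, pasted with its curly quotes.
raw = (
    "vllm serve /autodl-fs/data/weight/DeepSeek-V3.1/ "
    "--distributed_executor_backend “mp” "
    "--default-chat-template-kwargs ‘{“enable_thinking”: false}’ "
    "--compilation_config ‘{“cudagraph_mode”: “FULL_DECODE_ONLY”}’"
)
args = normalize_command(raw)

# Print a shell-safe version of the cleaned command.
print(shlex.join(args))
```

The printed line can be pasted back into a terminal. This only verifies quoting and JSON syntax; whether vLLM accepts a particular key inside the JSON still depends on the installed version and the model's chat template.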

Would you like a corrected command or more detail on any specific flag?
