How to assemble a vllm serve command from vLLM log output

My vLLM log output is below; please help me assemble the corresponding vllm serve command.

Initializing a V1 LLM engine (v0.11.0) with config: model='/mnt/workspace/model/base/Qwen3-VL-4B-Instruct', speculative_config=None, tokenizer='/mnt/workspace/model/base/Qwen3-VL-4B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10240, download_dir=None, load_format=safetensors, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/workspace/model/base/Qwen3-VL-4B-Instruct, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":,"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":,"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":4,"local_cache_dir":null}

vllm_kwargs: {'dtype': 'bfloat16', 'gpu_memory_utilization': 0.25, 'kv_cache_dtype': 'auto', 'load_format': 'safetensors', 'max_model_len': 10240, 'max_num_seqs': 3, 'tensor_parallel_size': 1, 'block_size': 16, 'enable_chunked_prefill': True, 'enable_prefix_caching': False, 'limit_mm_per_prompt': {'image': 1}, 'mm_processor_kwargs': {'max_pixels': 262144}}

non-default args: {'load_format': 'safetensors', 'dtype': 'bfloat16', 'max_model_len': 10240, 'block_size': 16, 'enable_prefix_caching': False, 'gpu_memory_utilization': 0.25, 'max_num_seqs': 3, 'disable_log_stats': True, 'limit_mm_per_prompt': {'image': 1}, 'mm_processor_kwargs': {'max_pixels': 262144}, 'enable_chunked_prefill': True, 'model': '/mnt/workspace/model/base/Qwen3-VL-4B-Instruct'}
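The non-default args dict above maps almost mechanically onto vllm serve flags. Here is a minimal sketch of that conversion; the underscore-to-hyphen flag naming and the --no- prefix for false booleans follow vLLM's argparse convention, but the exact flag set should be verified against vllm serve --help for your version:

```python
import json

def kwargs_to_command(kwargs: dict) -> list[str]:
    """Assemble a `vllm serve` argv list from a vLLM kwargs dict."""
    kwargs = dict(kwargs)  # don't mutate the caller's dict
    cmd = ["vllm", "serve", kwargs.pop("model")]
    for key, value in kwargs.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            # vLLM boolean flags take no value: --flag or --no-flag
            cmd.append(flag if value else flag.replace("--", "--no-", 1))
        elif isinstance(value, dict):
            cmd += [flag, json.dumps(value)]  # JSON-valued flags
        else:
            cmd += [flag, str(value)]
    return cmd
```

To print a copy-pasteable shell command from the resulting list, shlex.join(cmd) will add the quoting the JSON values need.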

Based on the configuration you provided, you can launch vllm serve with the following command:

vllm serve /mnt/workspace/model/base/Qwen3-VL-4B-Instruct \
  --load-format safetensors \
  --dtype bfloat16 \
  --max-model-len 10240 \
  --block-size 16 \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.25 \
  --max-num-seqs 3 \
  --disable-log-stats \
  --limit-mm-per-prompt '{"image":1}' \
  --mm-processor-kwargs '{"max_pixels":262144}' \
  --enable-chunked-prefill

Adjust the port if needed (e.g. --port 8000) and make sure the paths and parameter values match your actual environment. See the official documentation for detailed parameter descriptions.

Would you like a more detailed explanation of any parameter, or are you running into an error?

Hint: Mention @RunLLM in the post for followups.

@RunLLM The following inference command fails for me:

TMPDIR=/msswift_tmp/tmp
vllm serve /msswift_tmp/output/4B/gaojia/v1-20251204-115223/checkpoint-40
–load-format safetensors
–dtype bfloat16
–max-model-len 10240
–block-size 16
–enable-prefix-caching False
–gpu-memory-utilization 0.25
–max-num-seqs 3
–disable-log-stats
–limit-mm-per-prompt ‘{“image”:1}’
–tensor-parallel-size 1
–pipeline-parallel-size 1
–port 9014
–enable-chunked-prefill
–mm-processor-kwargs ‘{“max_pixels”:262144}’
–kv-cache-dtype auto
–served-model-name “Qwen3-VL-4B-Instruct”
INFO 12-12 14:18:21 [init.py:216] Automatically detected platform cuda.
usage: vllm [-h] [-v] {chat,complete,serve,bench,collect-env,run-batch} …
vllm: error: unrecognized arguments: False

The main cause of the error is that the argument prefixes use en dashes (–) instead of standard ASCII double hyphens (--), likely introduced by copy-pasting through an editor or forum that auto-converts punctuation. In addition, the boolean flag takes no value: write --enable-prefix-caching or --no-enable-prefix-caching rather than appending False. JSON-valued arguments should also use ASCII quotes, not curly ones.
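That dash/quote corruption is generic paste damage, not anything vLLM-specific, so it can be undone mechanically. A small cleanup sketch (a hypothetical helper, not a vLLM utility) for commands pasted from such sources:

```python
# Characters that "smart" editors and forums substitute for ASCII ones.
SMART_CHARS = {
    "\u2013": "--",  # en dash  -> double hyphen
    "\u2014": "--",  # em dash  -> double hyphen
    "\u2018": "'", "\u2019": "'",  # curly single quotes -> ASCII
    "\u201c": '"', "\u201d": '"',  # curly double quotes -> ASCII
}

def normalize_command(cmd: str) -> str:
    """Replace smart punctuation in a pasted shell command with ASCII."""
    for bad, good in SMART_CHARS.items():
        cmd = cmd.replace(bad, good)
    return cmd
```

Running the broken command through this before executing it would have turned –load-format back into --load-format and fixed the curly quotes around the JSON arguments.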

The correct form is:

TMPDIR=/msswift_tmp/tmp vllm serve /msswift_tmp/output/4B/gaojia/v1-20251204-115223/checkpoint-40 \
  --load-format safetensors \
  --dtype bfloat16 \
  --max-model-len 10240 \
  --block-size 16 \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.25 \
  --max-num-seqs 3 \
  --disable-log-stats \
  --limit-mm-per-prompt '{"image":1}' \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 1 \
  --port 9014 \
  --enable-chunked-prefill \
  --mm-processor-kwargs '{"max_pixels":262144}' \
  --kv-cache-dtype auto \
  --served-model-name Qwen3-VL-4B-Instruct

See the official documentation for detailed parameter descriptions. Let me know if you need further troubleshooting.
