How can I check what the configured batch size is?

INFO 12-15 17:44:53 [v1/engine/core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/mnt/workspace/model/base/Qwen3-VL-4B-Instruct', speculative_config=None, tokenizer='/mnt/workspace/model/base/Qwen3-VL-4B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10240, download_dir=None, load_format=safetensors, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen3-VL-4B-Instruct, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":120,"local_cache_dir":null}

The log message shows the initialization of a vLLM V1 engine with a detailed configuration, including model path, tokenizer, dtype, max sequence length, parallelism settings, and a custom compilation_config. This configuration is typical for launching a vLLM engine and matches the documented parameters for LLMEngine and LLM initialization, including support for custom compilation and CUDA graph settings as described in the official API and CLI docs (LLMEngine, LLM, compilation_config).

If you are asking for an explanation or troubleshooting, please clarify your specific question or issue regarding this configuration or log output.


The batch size is dynamic: it grows with the load (requests waiting in the queue are batched together for prefill),
but it will never exceed your configured max_num_seqs, nor (max_num_batched_tokens / input token length).
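To make the two limits above concrete, here is a minimal sketch of the arithmetic. max_num_seqs and max_num_batched_tokens are real vLLM scheduler settings, but this helper is illustrative only, not vLLM's actual scheduler code, and it assumes a fixed average prompt length:

```python
# Illustrative only: how the two limits jointly bound the dynamic batch size.
# This is NOT vLLM's scheduler, just the arithmetic described above.

def effective_batch_bound(max_num_seqs: int,
                          max_num_batched_tokens: int,
                          avg_input_len: int) -> int:
    """Upper bound on how many sequences one prefill batch can hold."""
    by_token_budget = max_num_batched_tokens // avg_input_len
    return min(max_num_seqs, by_token_budget)

# Example: 256-seq cap, 8192-token budget, 1024-token prompts.
# The token budget binds first: 8192 // 1024 = 8 sequences per batch.
print(effective_batch_bound(256, 8192, 1024))
```

In practice the actual batch is whatever the scheduler can fit at that moment, so the real value fluctuates below this bound.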

Judging from your other question, you should tune max_num_batched_tokens and max-model-len here; that can improve your concurrency.
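As a hedged sketch of how those knobs are passed on the command line (the flag names come from vLLM's CLI; the concrete values are placeholders, not recommendations, and should be sized to your GPU memory and workload):

```shell
# Illustrative values only -- tune to your hardware and request mix.
vllm serve /mnt/workspace/model/base/Qwen3-VL-4B-Instruct \
    --max-model-len 10240 \
    --max-num-seqs 256 \
    --max-num-batched-tokens 8192
```

The resulting limits show up in the engine-init log line like the one you pasted, so you can confirm what the server actually picked up there.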