How can I check what the configured batch size is?

INFO 12-15 17:44:53 [v1/engine/core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/mnt/workspace/model/base/Qwen3-VL-4B-Instruct', speculative_config=None, tokenizer='/mnt/workspace/model/base/Qwen3-VL-4B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10240, download_dir=None, load_format=safetensors, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen3-VL-4B-Instruct, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":120,"local_cache_dir":null}

The log message shows the initialization of a vLLM V1 engine with a detailed configuration, including model path, tokenizer, dtype, max sequence length, parallelism settings, and a custom compilation_config. This configuration is typical for launching a vLLM engine and matches the documented parameters for LLMEngine and LLM initialization, including support for custom compilation and CUDA graph settings as described in the official API and CLI docs (LLMEngine, LLM, compilation_config).

If you are asking for an explanation or troubleshooting, please clarify your specific question or issue regarding this configuration or log output.


The batch size is dynamic: it grows with the load (requests waiting in the queue are batched together for prefill),
but it will never exceed your configured max_num_seqs, nor (max_num_batched_tokens / input token length).
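To make the two limits above concrete, here is a minimal sketch of the arithmetic. max_num_seqs and max_num_batched_tokens are real vLLM scheduler settings, but this helper is illustrative only, not vLLM's actual scheduler code, and it assumes a fixed average prompt length:

```python
# Illustrative only: how the two limits jointly bound the dynamic batch size.
# This is NOT vLLM's scheduler, just the arithmetic described above.

def effective_batch_bound(max_num_seqs: int,
                          max_num_batched_tokens: int,
                          avg_input_len: int) -> int:
    """Upper bound on how many sequences one prefill batch can hold."""
    by_token_budget = max_num_batched_tokens // avg_input_len
    return min(max_num_seqs, by_token_budget)

# Example: 256-seq cap, 8192-token budget, 1024-token prompts.
# The token budget binds first: 8192 // 1024 = 8 sequences per batch.
print(effective_batch_bound(256, 8192, 1024))
```

In practice the actual batch is whatever the scheduler can fit at that moment, so the real value fluctuates below this bound.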

Judging from your other question, you should tune max_num_batched_tokens and max-model-len here; that can improve your concurrency.
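As a hedged sketch of how those knobs are passed on the command line (the flag names come from vLLM's CLI; the concrete values are placeholders, not recommendations, and should be sized to your GPU memory and workload):

```shell
# Illustrative values only -- tune to your hardware and request mix.
vllm serve /mnt/workspace/model/base/Qwen3-VL-4B-Instruct \
    --max-model-len 10240 \
    --max-num-seqs 256 \
    --max-num-batched-tokens 8192
```

The resulting limits show up in the engine-init log line like the one you pasted, so you can confirm what the server actually picked up there.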