Num requests running stays at 1

I am running a prefill test (max_tokens=1) that sends multiple requests of about 6000 tokens each, with max_num_batched_tokens=60000 and max_num_seqs=16. The vllm:num_requests_running metric stays at 1 even while vllm:num_requests_waiting climbs to 30. This is on an A100.
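For context, the client side of the test looks roughly like the sketch below (a hypothetical reproduction using the OpenAI-compatible completions endpoint, assuming the server listens on localhost:8000; the prompt construction and request count are illustrative, not my exact harness):

```python
# Hypothetical reproduction of the prefill test: many long prompts, max_tokens=1.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM OpenAI-compatible server
long_prompt = "hello " * 6000  # roughly 6000 tokens; illustrative only

def send_one(_):
    return client.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",
        prompt=long_prompt,
        max_tokens=1,  # prefill-only test
    )

with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(send_one, range(64)))  # flood the server so requests queue up
```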
The log for a request is:
--------------------------------------------------
params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), ………
……. prompt_embeds shape: None, lora_request: None.
--------------------------------------------------

This is the server startup log. Any ideas?
--------------------------------------------------
INFO 08-29 09:22:20 [__init__.py:235] Automatically detected platform cuda.
INFO 08-29 09:22:24 [api_server.py:1755] vLLM API server version 0.10.1.dev1+gbcc0a3cbe
INFO 08-29 09:22:24 [cli_args.py:261] non-default args: {'model_tag': 'mistralai/Mistral-7B-Instruct-v0.2', 'host': '0.0.0.0', 'model': 'mistralai/Mistral-7B-Instruct-v0.2', 'trust_remote_code': True, 'max_model_len': 16384, 'block_size': 64, 'enable_prefix_caching': False, 'max_num_batched_tokens': 60000, 'max_num_seqs': 16, 'enable_chunked_prefill': False}
INFO 08-29 09:22:35 [config.py:1604] Using max model len 16384
INFO 08-29 09:22:35 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=60000.
/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/tokenizer_group.py:24: FutureWarning: It is strongly recommended to run mistral models with --tokenizer-mode "mistral" to ensure correct encoding and decoding.
self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
INFO 08-29 09:22:44 [__init__.py:235] Automatically detected platform cuda.
INFO 08-29 09:22:48 [core.py:572] Waiting for init message from front-end.
INFO 08-29 09:22:48 [core.py:71] Initializing a V1 LLM engine (v0.10.1.dev1+gbcc0a3cbe) with config: model='mistralai/Mistral-7B-Instruct-v0.2', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=mistralai/Mistral-7B-Instruct-v0.2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":32,"local_cache_dir":null}
INFO 08-29 09:22:50 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 08-29 09:22:50 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 08-29 09:22:50 [gpu_model_runner.py:1843] Starting to load model mistralai/Mistral-7B-Instruct-v0.2…
INFO 08-29 09:22:50 [gpu_model_runner.py:1875] Loading model from scratch…
INFO 08-29 09:22:50 [cuda.py:290] Using Flash Attention backend on V1 engine.
INFO 08-29 09:22:50 [weight_utils.py:296] Using model weights format ['*.safetensors']
INFO 08-29 09:23:13 [weight_utils.py:312] Time spent downloading weights for mistralai/Mistral-7B-Instruct-v0.2: 23.020574 seconds

Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]

Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:01<00:03, 1.89s/it]

Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:03<00:01, 1.84s/it]

Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00, 1.87s/it]

Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00, 1.87s/it]

INFO 08-29 09:23:19 [default_loader.py:262] Loading weights took 5.67 seconds
INFO 08-29 09:23:20 [gpu_model_runner.py:1892] Model loading took 13.4967 GiB and 29.303647 seconds
INFO 08-29 09:23:27 [backends.py:530] Using cache directory: /tmp/.cache/vllm/torch_compile_cache/043061b246/rank_0_0/backbone for vLLM's torch.compile
INFO 08-29 09:23:27 [backends.py:541] Dynamo bytecode transform time: 6.85 s
INFO 08-29 09:23:29 [backends.py:194] Cache the graph for dynamic shape for later use
INFO 08-29 09:23:53 [backends.py:215] Compiling a graph for dynamic shape takes 25.93 s
INFO 08-29 09:24:13 [monitor.py:34] torch.compile takes 32.78 s in total
/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
INFO 08-29 09:24:14 [gpu_worker.py:255] Available KV cache memory: 19.70 GiB
INFO 08-29 09:24:14 [kv_cache_utils.py:833] GPU KV cache size: 161,344 tokens
INFO 08-29 09:24:14 [kv_cache_utils.py:837] Maximum concurrency for 16,384 tokens per request: 9.85x

Capturing CUDA graph shapes: 0%| | 0/7 [00:00<?, ?it/s]
Capturing CUDA graph shapes: 43%|████▎ | 3/7 [00:00<00:00, 24.64it/s]
Capturing CUDA graph shapes: 86%|████████▌ | 6/7 [00:00<00:00, 26.17it/s]
Capturing CUDA graph shapes: 100%|██████████| 7/7 [00:00<00:00, 25.50it/s]
INFO 08-29 09:24:15 [gpu_model_runner.py:2485] Graph capturing finished in 1 secs, took 0.11 GiB
INFO 08-29 09:24:15 [core.py:193] init engine (profile, create kv cache, warmup model) took 54.94 seconds
/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/tokenizer_group.py:24: FutureWarning: It is strongly recommended to run mistral models with --tokenizer-mode "mistral" to ensure correct encoding and decoding.
self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)

The reason vllm:num_requests_running stays at 1 even with many waiting requests comes down to how the scheduler fills its per-step budget. With max_num_batched_tokens=60000 and 6000-token requests, at most 10 requests (60000 / 6000) fit in one batch, and the scheduler additionally caps the batch at max_num_seqs=16, so here the token budget is the binding limit. If requests arrive faster than they can be processed, the rest queue; and if the scheduler cannot pack more than one request into a step (because of the token budget, the sequence cap, or the arrival pattern), only one will be reported as running at a time. This is expected if a single request already fills the batch, or if the configuration or request arrival pattern prevents further batching. See Performance Tuning with Chunked Prefill and Context length and batch size.
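As a quick sanity check on that budget math (a back-of-the-envelope sketch, not vLLM's actual scheduler code; the 6000-token prompt length is taken from the test described above):

```python
# Rough upper bound on concurrent prefills per scheduler step.
# This mirrors the budget reasoning above; it is NOT the real vLLM scheduler.
max_num_batched_tokens = 60000
max_num_seqs = 16
prompt_len = 6000  # tokens per request in the prefill test

by_token_budget = max_num_batched_tokens // prompt_len    # 10
effective_concurrency = min(max_num_seqs, by_token_budget)
print(effective_concurrency)  # -> 10, so well above 1 request per step should fit
```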

If you want more requests running concurrently, try reducing the per-request prompt length, increasing max_num_batched_tokens, or adjusting max_num_seqs to better match your workload. Also make sure chunked prefill is enabled (it is by default in vLLM V1), as it helps the scheduler batch long prefills with other work. For more details on how these parameters interact, see Optimization and Tuning.
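For reference, those knobs can also be expressed through the offline LLM entrypoint (a minimal sketch; the kwargs below are standard engine arguments that mirror the server flags --max-num-batched-tokens, --max-num-seqs, and --enable-chunked-prefill, and the values are illustrative, not a recommendation):

```python
# Sketch: offline engine with an explicit scheduling budget.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    max_model_len=16384,
    max_num_batched_tokens=60000,  # per-step token budget shared by all scheduled requests
    max_num_seqs=16,               # per-step cap on the number of scheduled requests
    enable_chunked_prefill=True,   # default in V1; lets long prefills share a step
)

params = SamplingParams(max_tokens=1)  # prefill-only, as in the test above
outputs = llm.generate(["hello " * 6000] * 20, params)
```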

Would you like a step-by-step breakdown of how vLLM batches requests and how to tune these parameters for your use case?

With max_num_batched_tokens=60000, each request being 6000 tokens, and max_num_seqs=16, more than 1 request should be processed concurrently!

Yes, with max_num_batched_tokens=60000, max_num_seqs=16, and 6000-token requests, vLLM should be able to run up to 10 requests concurrently (60000 / 6000 = 10, which is below the max_num_seqs cap of 16). If only 1 request is running at a time, that is not expected and may indicate a scheduling or configuration issue. Please ensure chunked prefill is enabled (it is by default in vLLM V1) and that no other constraints (such as memory or model-specific limits) are preventing batching. See Performance Tuning with Chunked Prefill and the SchedulerConfig docs.
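To confirm what the scheduler is actually doing, one option is to poll the server's Prometheus metrics endpoint while the test runs and watch both gauges (a small sketch, assuming the default address http://localhost:8000; the metric names are the ones quoted above):

```python
# Poll vLLM's /metrics endpoint and print the running/waiting gauges over time.
import time

import requests

METRICS_URL = "http://localhost:8000/metrics"  # assumed default server address
WATCH = ("vllm:num_requests_running", "vllm:num_requests_waiting")

for _ in range(30):
    text = requests.get(METRICS_URL, timeout=5).text
    for line in text.splitlines():
        if line.startswith(WATCH):
            print(line)
    print("---")
    time.sleep(1)
```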

Would you like a step-by-step troubleshooting guide to diagnose why batching is not occurring as expected?
