Some questions about Maximum concurrency

When initializing the LLM, I set different values of max_model_len. When max_model_len is smaller, the maximum throughput is higher, but the speed input is lower. Why?

A smaller max_model_len changes two things. First, vLLM sizes the KV cache from whatever is left of the gpu_memory_utilization budget after a startup profiling pass, and with default settings that pass (and the default max_num_batched_tokens) scales with max_model_len in many versions, so a smaller cap tends to leave a somewhat larger KV cache. Second, the reported maximum concurrency is a worst-case estimate, roughly the number of tokens the KV cache can hold divided by max_model_len; since it assumes every request could grow to the full cap, a smaller cap yields a higher figure and, generally, higher request throughput. The lower "speed input" (input tokens per second) is most likely a scheduling effect rather than the inputs themselves being shorter: with a smaller per-step token budget, fewer prompt tokens are packed into each prefill step, so prompts are processed more slowly per step even though more requests complete per second overall. See the discussion in vllm/issues/1559 and the vLLM optimization docs.
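
For concreteness, here is a minimal initialization sketch; the model name and the memory-utilization value are illustrative assumptions, not settings taken from your runs:

```python
# Minimal sketch: the two knobs discussed above. Model name and values are illustrative.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # any model; an assumed choice for illustration
    max_model_len=8192,                # per-request length cap the scheduler budgets against
    gpu_memory_utilization=0.9,        # total GPU budget; the KV cache gets what activations don't use
)
# At startup, vLLM logs a line along the lines of
#   "Maximum concurrency for 8192 tokens per request: N.NNx"
# i.e. roughly (tokens the allocated KV cache can hold) / max_model_len,
# which is why a smaller max_model_len reports a larger number.
```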

Would you like a more detailed step-by-step explanation?

I have two different sets of prompts with similar average lengths. I ran them with max_model_len=8192 and max_model_len=40960 respectively. Why is the throughput of the second run lower, while its inference speed is faster and its speed input is larger?

With max_model_len=40960, vLLM does not physically reserve 40960 tokens of KV cache per sequence: the cache is paged, and blocks are allocated on demand as a sequence actually grows. What does scale with max_model_len is the worst-case budgeting. The "Maximum concurrency" figure reported at startup is roughly the KV cache capacity in tokens divided by max_model_len, so the 40960 run advertises far fewer concurrent requests, even though that figure is a bound rather than a hard runtime limit. In addition, with default settings the startup profiling pass and max_num_batched_tokens tend to follow max_model_len (version-dependent, particularly when chunked prefill is off), which leaves a somewhat smaller KV cache and lets the scheduler pack many more prompt tokens into each prefill step. That combination can produce exactly what you observed: a higher measured input-token rate ("speed input") and faster individual requests, but fewer requests completed per second overall. The max_model_len=8192 run flips the trade-off. See the vLLM performance and optimization docs for the relevant knobs.
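
To check how much of the difference comes from max_model_len itself rather than from the defaults that follow it, one option is to pin the scheduler settings so both runs share the same per-step token budget. This is a sketch under assumptions: the model name, prompt, and concrete values are illustrative, and enable_chunked_prefill is set explicitly so that max_num_batched_tokens is allowed to stay below max_model_len.

```python
# Sketch: compare the two caps with the scheduler knobs pinned, so the only
# variable is max_model_len. All concrete values here are illustrative assumptions.
from vllm import LLM, SamplingParams

common = dict(
    model="Qwen/Qwen2.5-7B-Instruct",   # assumed model, for illustration only
    gpu_memory_utilization=0.9,
    max_num_seqs=256,                    # cap on sequences scheduled per step
    max_num_batched_tokens=8192,         # per-step token budget; pinned so it no longer follows max_model_len
    enable_chunked_prefill=True,         # lets max_num_batched_tokens stay below max_model_len
)
sampling = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Summarize the idea behind PagedAttention in one paragraph."] * 32  # placeholder workload

for cap in (8192, 40960):
    llm = LLM(max_model_len=cap, **common)
    outputs = llm.generate(prompts, sampling)
    # Compare requests/s and input/output token rates between the two runs here.
    del llm  # in practice, run each configuration in a separate process to fully release GPU memory
```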

Would you like a step-by-step breakdown of how memory allocation and scheduling affect these metrics?

But the average length of both of my datasets is about 1024 tokens, so I don't think the balance problem you describe applies here.

If both datasets average ~1024 tokens, then per-request KV usage really is similar in the two runs, because physical KV blocks are allocated on demand rather than reserved at max_model_len. What is computed from max_model_len rather than from your actual lengths is the reported maximum-concurrency estimate (KV capacity divided by max_model_len), which is why the 40960 run advertises a much lower number even though your requests are short. The runtime differences you are seeing are therefore more likely second-order effects of the larger cap, such as a somewhat smaller KV cache after the startup profiling pass and different default per-step batching (max_num_batched_tokens, chunked-prefill behavior), rather than any per-sequence reservation. Pinning those knobs, as in the sketch above, should bring the two configurations much closer together for ~1024-token data; the comparison below shows how far apart the reported bound and the achievable batching can be.
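
A back-of-the-envelope comparison makes the gap concrete. The KV-cache capacity below is an assumed figure (it also differs somewhat between your two runs because of the profiling pass); the point is only that the reported bound is computed from max_model_len, while on-demand allocation tracks your real ~1024-token requests:

```python
# Illustrative arithmetic only; kv_cache_tokens and the generation budget are assumptions.
kv_cache_tokens = 500_000          # total tokens the allocated KV cache can hold (assumed)
avg_request_tokens = 1024 + 256    # ~1024 prompt tokens plus an assumed output budget

for cap in (8192, 40960):
    reported = kv_cache_tokens / cap                   # the worst-case "Maximum concurrency" figure
    achievable = kv_cache_tokens / avg_request_tokens  # what paged, on-demand allocation can sustain
    print(f"max_model_len={cap}: reported bound ~{reported:.1f}x, "
          f"achievable at ~{avg_request_tokens} tokens/request ~{achievable:.0f}x")
```

The achievable figure is the same for both caps, which is why, once the batching knobs are pinned, the two configurations should behave similarly on ~1024-token data.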

Would you like a more technical explanation of how vLLM allocates memory and schedules requests?
