Some questions about Maximum concurrency

When initializing the LLM, I set different values of max_model_len. When max_model_len is smaller, the maximum throughput is higher, but the speed input is lower. Why?

A smaller max_model_len changes two things. First, vLLM sizes the KV cache from whatever is left of the gpu_memory_utilization budget after a startup profiling pass, and with default settings that pass (and the default max_num_batched_tokens) scales with max_model_len in many versions, so a smaller cap tends to leave a somewhat larger KV cache. Second, the reported maximum concurrency is a worst-case estimate, roughly the number of tokens the KV cache can hold divided by max_model_len; since it assumes every request could grow to the full cap, a smaller cap yields a higher figure and, generally, higher request throughput. The lower "speed input" (input tokens per second) is most likely a scheduling effect rather than the inputs themselves being shorter: with a smaller per-step token budget, fewer prompt tokens are packed into each prefill step, so prompts are processed more slowly per step even though more requests complete per second overall. See the discussion in vllm/issues/1559 and the vLLM optimization docs.
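
For concreteness, here is a minimal initialization sketch; the model name and the memory-utilization value are illustrative assumptions, not settings taken from your runs:

```python
# Minimal sketch: the two knobs discussed above. Model name and values are illustrative.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # any model; an assumed choice for illustration
    max_model_len=8192,                # per-request length cap the scheduler budgets against
    gpu_memory_utilization=0.9,        # total GPU budget; the KV cache gets what activations don't use
)
# At startup, vLLM logs a line along the lines of
#   "Maximum concurrency for 8192 tokens per request: N.NNx"
# i.e. roughly (tokens the allocated KV cache can hold) / max_model_len,
# which is why a smaller max_model_len reports a larger number.
```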

Would you like a more detailed step-by-step explanation?

I have two different sets of prompts with similar average lengths. I ran them with max_model_len=8192 and max_model_len=40960 respectively. Why is the throughput of the second run lower, while its inference speed is faster and its speed input is larger?

With max_model_len=40960, vLLM does not physically reserve 40960 tokens of KV cache per sequence: the cache is paged, and blocks are allocated on demand as a sequence actually grows. What does scale with max_model_len is the worst-case budgeting. The "Maximum concurrency" figure reported at startup is roughly the KV cache capacity in tokens divided by max_model_len, so the 40960 run advertises far fewer concurrent requests, even though that figure is a bound rather than a hard runtime limit. In addition, with default settings the startup profiling pass and max_num_batched_tokens tend to follow max_model_len (version-dependent, particularly when chunked prefill is off), which leaves a somewhat smaller KV cache and lets the scheduler pack many more prompt tokens into each prefill step. That combination can produce exactly what you observed: a higher measured input-token rate ("speed input") and faster individual requests, but fewer requests completed per second overall. The max_model_len=8192 run flips the trade-off. See the vLLM performance and optimization docs for the relevant knobs.
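
To check how much of the difference comes from max_model_len itself rather than from the defaults that follow it, one option is to pin the scheduler settings so both runs share the same per-step token budget. This is a sketch under assumptions: the model name, prompt, and concrete values are illustrative, and enable_chunked_prefill is set explicitly so that max_num_batched_tokens is allowed to stay below max_model_len.

```python
# Sketch: compare the two caps with the scheduler knobs pinned, so the only
# variable is max_model_len. All concrete values here are illustrative assumptions.
from vllm import LLM, SamplingParams

common = dict(
    model="Qwen/Qwen2.5-7B-Instruct",   # assumed model, for illustration only
    gpu_memory_utilization=0.9,
    max_num_seqs=256,                    # cap on sequences scheduled per step
    max_num_batched_tokens=8192,         # per-step token budget; pinned so it no longer follows max_model_len
    enable_chunked_prefill=True,         # lets max_num_batched_tokens stay below max_model_len
)
sampling = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Summarize the idea behind PagedAttention in one paragraph."] * 32  # placeholder workload

for cap in (8192, 40960):
    llm = LLM(max_model_len=cap, **common)
    outputs = llm.generate(prompts, sampling)
    # Compare requests/s and input/output token rates between the two runs here.
    del llm  # in practice, run each configuration in a separate process to fully release GPU memory
```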

Would you like a step-by-step breakdown of how memory allocation and scheduling affect these metrics?

But the average length of both of my datasets is about 1024 tokens, so I don't think the balance problem you describe applies here.

If both datasets average ~1024 tokens, then per-request KV usage really is similar in the two runs, because physical KV blocks are allocated on demand rather than reserved at max_model_len. What is computed from max_model_len rather than from your actual lengths is the reported maximum-concurrency estimate (KV capacity divided by max_model_len), which is why the 40960 run advertises a much lower number even though your requests are short. The runtime differences you are seeing are therefore more likely second-order effects of the larger cap, such as a somewhat smaller KV cache after the startup profiling pass and different default per-step batching (max_num_batched_tokens, chunked-prefill behavior), rather than any per-sequence reservation. Pinning those knobs, as in the sketch above, should bring the two configurations much closer together for ~1024-token data; the comparison below shows how far apart the reported bound and the achievable batching can be.
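
A back-of-the-envelope comparison makes the gap concrete. The KV-cache capacity below is an assumed figure (it also differs somewhat between your two runs because of the profiling pass); the point is only that the reported bound is computed from max_model_len, while on-demand allocation tracks your real ~1024-token requests:

```python
# Illustrative arithmetic only; kv_cache_tokens and the generation budget are assumptions.
kv_cache_tokens = 500_000          # total tokens the allocated KV cache can hold (assumed)
avg_request_tokens = 1024 + 256    # ~1024 prompt tokens plus an assumed output budget

for cap in (8192, 40960):
    reported = kv_cache_tokens / cap                   # the worst-case "Maximum concurrency" figure
    achievable = kv_cache_tokens / avg_request_tokens  # what paged, on-demand allocation can sustain
    print(f"max_model_len={cap}: reported bound ~{reported:.1f}x, "
          f"achievable at ~{avg_request_tokens} tokens/request ~{achievable:.0f}x")
```

The achievable figure is the same for both caps, which is why, once the batching knobs are pinned, the two configurations should behave similarly on ~1024-token data.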

Would you like a more technical explanation of how vLLM allocates memory and schedules requests?
