Performance Issue While Requests Are Queuing

I have observed a significant performance drop while multiple requests are queuing. The log:

2025-04-08T12:08:30.529973929+08:00 INFO 04-07 21:08:30 metrics.py:455] Avg prompt throughput: 660.4 tokens/s, Avg generation throughput: 2.1 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 7 reqs, GPU KV cache usage: 10.3%, CPU KV cache usage: 0.0%.

2025-04-08T12:08:36.720537888+08:00 INFO 04-07 21:08:36 metrics.py:455] Avg prompt throughput: 660.5 tokens/s, Avg generation throughput: 1.5 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 9 reqs, GPU KV cache usage: 13.0%, CPU KV cache usage: 0.0%.

2025-04-08T12:08:42.878858340+08:00 INFO 04-07 21:08:42 metrics.py:455] Avg prompt throughput: 663.2 tokens/s, Avg generation throughput: 2.6 tokens/s, Running: 9 reqs, Swapped: 0 reqs, Pending: 5 reqs, GPU KV cache usage: 16.3%, CPU KV cache usage: 0.0%.

2025-04-08T12:08:49.039230854+08:00 INFO 04-07 21:08:49 metrics.py:455] Avg prompt throughput: 662.6 tokens/s, Avg generation throughput: 2.6 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 5 reqs, GPU KV cache usage: 12.2%, CPU KV cache usage: 0.0%.

2025-04-08T12:08:55.246619736+08:00 INFO 04-07 21:08:55 metrics.py:455] Avg prompt throughput: 658.4 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 9 reqs, GPU KV cache usage: 10.1%, CPU KV cache usage: 0.0%.

2025-04-08T12:09:01.424944182+08:00 INFO 04-07 21:09:01 metrics.py:455] Avg prompt throughput: 661.7 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 9 reqs, GPU KV cache usage: 16.8%, CPU KV cache usage: 0.0%.

2025-04-08T12:09:07.684243808+08:00 INFO 04-07 21:09:07 metrics.py:455] Avg prompt throughput: 652.9 tokens/s, Avg generation throughput: 1.4 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 9 reqs, GPU KV cache usage: 12.7%, CPU KV cache usage: 0.0%.

2025-04-08T12:09:13.920842016+08:00 INFO 04-07 21:09:13 metrics.py:455] Avg prompt throughput: 655.6 tokens/s, Avg generation throughput: 1.4 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 8 reqs, GPU KV cache usage: 12.5%, CPU KV cache usage: 0.0%.

2025-04-08T12:09:20.080677444+08:00 INFO 04-07 21:09:20 metrics.py:455] Avg prompt throughput: 663.7 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 6 reqs, GPU KV cache usage: 11.7%, CPU KV cache usage: 0.0%.

2025-04-08T12:09:26.244572096+08:00 INFO 04-07 21:09:26 metrics.py:455] Avg prompt throughput: 663.5 tokens/s, Avg generation throughput: 1.5 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 8 reqs, GPU KV cache usage: 12.0%, CPU KV cache usage: 0.0%.

2025-04-08T12:09:32.400637360+08:00 INFO 04-07 21:09:32 metrics.py:455] Avg prompt throughput: 664.1 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 7 reqs, GPU KV cache usage: 9.1%, CPU KV cache usage: 0.0%.

2025-04-08T12:09:38.564065543+08:00 INFO 04-07 21:09:38 metrics.py:455] Avg prompt throughput: 663.3 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 7 reqs, Swapped: 0 reqs, Pending: 8 reqs, GPU KV cache usage: 12.9%, CPU KV cache usage: 0.0%.

2025-04-08T12:09:44.724271293+08:00 INFO 04-07 21:09:44 metrics.py:455] Avg prompt throughput: 662.8 tokens/s, Avg generation throughput: 2.6 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 6 reqs, GPU KV cache usage: 13.3%, CPU KV cache usage: 0.0%.

2025-04-08T12:09:50.898168467+08:00 INFO 04-07 21:09:50 metrics.py:455] Avg prompt throughput: 661.8 tokens/s, Avg generation throughput: 2.1 tokens/s, Running: 7 reqs, Swapped: 0 reqs, Pending: 7 reqs, GPU KV cache usage: 12.3%, CPU KV cache usage: 0.0%.

2025-04-08T12:09:57.048084767+08:00 INFO 04-07 21:09:57 metrics.py:455] Avg prompt throughput: 664.4 tokens/s, Avg generation throughput: 2.3 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 5 reqs, GPU KV cache usage: 9.2%, CPU KV cache usage: 0.0%.

2025-04-08T12:10:03.199702922+08:00 INFO 04-07 21:10:03 metrics.py:455] Avg prompt throughput: 664.5 tokens/s, Avg generation throughput: 2.0 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 5 reqs, GPU KV cache usage: 10.5%, CPU KV cache usage: 0.0%.

2025-04-08T12:10:09.357446536+08:00 INFO 04-07 21:10:09 metrics.py:455] Avg prompt throughput: 663.6 tokens/s, Avg generation throughput: 2.1 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 6 reqs, GPU KV cache usage: 10.4%, CPU KV cache usage: 0.0%.

2025-04-08T12:10:15.522468971+08:00 INFO 04-07 21:10:15 metrics.py:455] Avg prompt throughput: 663.1 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 8 reqs, GPU KV cache usage: 14.5%, CPU KV cache usage: 0.0%.

2025-04-08T12:10:21.753645907+08:00 INFO 04-07 21:10:21 metrics.py:455] Avg prompt throughput: 656.1 tokens/s, Avg generation throughput: 1.4 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 8 reqs, GPU KV cache usage: 15.9%, CPU KV cache usage: 0.0%.

2025-04-08T12:10:28.034414325+08:00 INFO 04-07 21:10:28 metrics.py:455] Avg prompt throughput: 651.4 tokens/s, Avg generation throughput: 0.8 tokens/s, Running: 3 reqs, Swapped: 0 reqs, Pending: 12 reqs, GPU KV cache usage: 14.7%, CPU KV cache usage: 0.0%.

2025-04-08T12:10:34.279180770+08:00 INFO 04-07 21:10:34 metrics.py:455] Avg prompt throughput: 655.1 tokens/s, Avg generation throughput: 1.0 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 12 reqs, GPU KV cache usage: 20.7%, CPU KV cache usage: 0.0%.

2025-04-08T12:10:40.602537502+08:00 INFO 04-07 21:10:40 metrics.py:455] Avg prompt throughput: 646.8 tokens/s, Avg generation throughput: 1.1 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 11 reqs, GPU KV cache usage: 21.5%, CPU KV cache usage: 0.0%.

2025-04-08T12:10:46.766708524+08:00 INFO 04-07 21:10:46 metrics.py:455] Avg prompt throughput: 662.9 tokens/s, Avg generation throughput: 2.1 tokens/s, Running: 7 reqs, Swapped: 0 reqs, Pending: 8 reqs, GPU KV cache usage: 26.4%, CPU KV cache usage: 0.0%.

2025-04-08T12:10:52.986051330+08:00 INFO 04-07 21:10:52 metrics.py:455] Avg prompt throughput: 657.0 tokens/s, Avg generation throughput: 1.9 tokens/s, Running: 7 reqs, Swapped: 0 reqs, Pending: 6 reqs, GPU KV cache usage: 22.0%, CPU KV cache usage: 0.0%.

2025-04-08T12:10:59.146242833+08:00 INFO 04-07 21:10:59 metrics.py:455] Avg prompt throughput: 662.6 tokens/s, Avg generation throughput: 2.8 tokens/s, Running: 9 reqs, Swapped: 0 reqs, Pending: 6 reqs, GPU KV cache usage: 24.7%, CPU KV cache usage: 0.0%.

2025-04-08T12:11:05.317306213+08:00 INFO 04-07 21:11:05 metrics.py:455] Avg prompt throughput: 661.0 tokens/s, Avg generation throughput: 3.4 tokens/s, Running: 9 reqs, Swapped: 0 reqs, Pending: 2 reqs, GPU KV cache usage: 23.8%, CPU KV cache usage: 0.0%.

2025-04-08T12:11:11.474292340+08:00 INFO 04-07 21:11:11 metrics.py:455] Avg prompt throughput: 662.3 tokens/s, Avg generation throughput: 3.7 tokens/s, Running: 12 reqs, Swapped: 0 reqs, Pending: 1 reqs, GPU KV cache usage: 26.5%, CPU KV cache usage: 0.0%.


The generation speed drops to less than 5 tokens/s, even though it can normally reach 18 tokens/s. Is this a vLLM bug or a configuration issue?
My vLLM information:
Version: 0.7.3
Arguments:

- '--tensor-parallel-size'
- '4'
- '--gpu-memory-utilization'
- '0.91'
- '--enable-auto-tool-choice'
- '--tool-call-parser'
- hermes
- '--disable-log-requests'
- '--uvicorn-log-level'
- warning
- '--enforce-eager'
- '--max-model-len'
- '60000'
- '--rope-scaling'
- >-
  {"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":32768}

@thiner How did you measure generation speed?
If the generation speed you observed is the speed of a single request, then what you are observing is expected. As the vLLM server processes more requests, the generation speed of each individual request drops, because the compute capacity is shared to process other requests in parallel.
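
For reference, a minimal sketch of one way to measure per-request generation speed: stream a single completion from the OpenAI-compatible endpoint and divide the approximate token count by the wall-clock time. The base URL and model name below are placeholders, not taken from this thread:

```python
import time

from openai import OpenAI

# Assumptions: the vLLM OpenAI-compatible server listens on localhost:8000
# and "my-served-model" is the served model id (both are placeholders).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
pieces = 0
stream = client.chat.completions.create(
    model="my-served-model",
    messages=[{"role": "user", "content": "Explain the KV cache in one paragraph."}],
    stream=True,
)
for chunk in stream:
    # Each streamed chunk carries roughly one token's worth of text,
    # so counting non-empty chunks gives an approximate token count.
    if chunk.choices and chunk.choices[0].delta.content:
        pieces += 1
elapsed = time.perf_counter() - start
print(f"~{pieces / elapsed:.1f} tokens/s for this single request")
```

If you measure this while the server is under load, you should see exactly the per-request slowdown described above, even though the server's aggregate throughput stays high.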

Thanks for your quick response. Given the parallel-processing nature of vLLM, is there any way to limit the number of parallel requests?

You can control the number of requests the vLLM server processes in parallel with the --max-num-seqs argument when calling vllm serve.
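
For example, appended to the argument list you posted above (the value 8 is purely illustrative; the default for --max-num-seqs is 256, so pick a cap that matches your workload):

```yaml
- '--max-num-seqs'
- '8'
```

Note that a lower cap raises per-request generation speed at the cost of keeping more requests pending in the queue.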

Engine Arguments — vLLM