I have observed a significant performance drop when multiple requests are queuing. The log:
2025-04-08T12:08:30.529973929+08:00 INFO 04-07 21:08:30 metrics.py:455] Avg prompt throughput: 660.4 tokens/s, Avg generation throughput: 2.1 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 7 reqs, GPU KV cache usage: 10.3%, CPU KV cache usage: 0.0%.
2025-04-08T12:08:36.720537888+08:00 INFO 04-07 21:08:36 metrics.py:455] Avg prompt throughput: 660.5 tokens/s, Avg generation throughput: 1.5 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 9 reqs, GPU KV cache usage: 13.0%, CPU KV cache usage: 0.0%.
2025-04-08T12:08:42.878858340+08:00 INFO 04-07 21:08:42 metrics.py:455] Avg prompt throughput: 663.2 tokens/s, Avg generation throughput: 2.6 tokens/s, Running: 9 reqs, Swapped: 0 reqs, Pending: 5 reqs, GPU KV cache usage: 16.3%, CPU KV cache usage: 0.0%.
2025-04-08T12:08:49.039230854+08:00 INFO 04-07 21:08:49 metrics.py:455] Avg prompt throughput: 662.6 tokens/s, Avg generation throughput: 2.6 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 5 reqs, GPU KV cache usage: 12.2%, CPU KV cache usage: 0.0%.
2025-04-08T12:08:55.246619736+08:00 INFO 04-07 21:08:55 metrics.py:455] Avg prompt throughput: 658.4 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 9 reqs, GPU KV cache usage: 10.1%, CPU KV cache usage: 0.0%.
2025-04-08T12:09:01.424944182+08:00 INFO 04-07 21:09:01 metrics.py:455] Avg prompt throughput: 661.7 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 9 reqs, GPU KV cache usage: 16.8%, CPU KV cache usage: 0.0%.
2025-04-08T12:09:07.684243808+08:00 INFO 04-07 21:09:07 metrics.py:455] Avg prompt throughput: 652.9 tokens/s, Avg generation throughput: 1.4 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 9 reqs, GPU KV cache usage: 12.7%, CPU KV cache usage: 0.0%.
2025-04-08T12:09:13.920842016+08:00 INFO 04-07 21:09:13 metrics.py:455] Avg prompt throughput: 655.6 tokens/s, Avg generation throughput: 1.4 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 8 reqs, GPU KV cache usage: 12.5%, CPU KV cache usage: 0.0%.
2025-04-08T12:09:20.080677444+08:00 INFO 04-07 21:09:20 metrics.py:455] Avg prompt throughput: 663.7 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 6 reqs, GPU KV cache usage: 11.7%, CPU KV cache usage: 0.0%.
2025-04-08T12:09:26.244572096+08:00 INFO 04-07 21:09:26 metrics.py:455] Avg prompt throughput: 663.5 tokens/s, Avg generation throughput: 1.5 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 8 reqs, GPU KV cache usage: 12.0%, CPU KV cache usage: 0.0%.
2025-04-08T12:09:32.400637360+08:00 INFO 04-07 21:09:32 metrics.py:455] Avg prompt throughput: 664.1 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 7 reqs, GPU KV cache usage: 9.1%, CPU KV cache usage: 0.0%.
2025-04-08T12:09:38.564065543+08:00 INFO 04-07 21:09:38 metrics.py:455] Avg prompt throughput: 663.3 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 7 reqs, Swapped: 0 reqs, Pending: 8 reqs, GPU KV cache usage: 12.9%, CPU KV cache usage: 0.0%.
2025-04-08T12:09:44.724271293+08:00 INFO 04-07 21:09:44 metrics.py:455] Avg prompt throughput: 662.8 tokens/s, Avg generation throughput: 2.6 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 6 reqs, GPU KV cache usage: 13.3%, CPU KV cache usage: 0.0%.
2025-04-08T12:09:50.898168467+08:00 INFO 04-07 21:09:50 metrics.py:455] Avg prompt throughput: 661.8 tokens/s, Avg generation throughput: 2.1 tokens/s, Running: 7 reqs, Swapped: 0 reqs, Pending: 7 reqs, GPU KV cache usage: 12.3%, CPU KV cache usage: 0.0%.
2025-04-08T12:09:57.048084767+08:00 INFO 04-07 21:09:57 metrics.py:455] Avg prompt throughput: 664.4 tokens/s, Avg generation throughput: 2.3 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 5 reqs, GPU KV cache usage: 9.2%, CPU KV cache usage: 0.0%.
2025-04-08T12:10:03.199702922+08:00 INFO 04-07 21:10:03 metrics.py:455] Avg prompt throughput: 664.5 tokens/s, Avg generation throughput: 2.0 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 5 reqs, GPU KV cache usage: 10.5%, CPU KV cache usage: 0.0%.
2025-04-08T12:10:09.357446536+08:00 INFO 04-07 21:10:09 metrics.py:455] Avg prompt throughput: 663.6 tokens/s, Avg generation throughput: 2.1 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 6 reqs, GPU KV cache usage: 10.4%, CPU KV cache usage: 0.0%.
2025-04-08T12:10:15.522468971+08:00 INFO 04-07 21:10:15 metrics.py:455] Avg prompt throughput: 663.1 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 8 reqs, GPU KV cache usage: 14.5%, CPU KV cache usage: 0.0%.
2025-04-08T12:10:21.753645907+08:00 INFO 04-07 21:10:21 metrics.py:455] Avg prompt throughput: 656.1 tokens/s, Avg generation throughput: 1.4 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 8 reqs, GPU KV cache usage: 15.9%, CPU KV cache usage: 0.0%.
2025-04-08T12:10:28.034414325+08:00 INFO 04-07 21:10:28 metrics.py:455] Avg prompt throughput: 651.4 tokens/s, Avg generation throughput: 0.8 tokens/s, Running: 3 reqs, Swapped: 0 reqs, Pending: 12 reqs, GPU KV cache usage: 14.7%, CPU KV cache usage: 0.0%.
2025-04-08T12:10:34.279180770+08:00 INFO 04-07 21:10:34 metrics.py:455] Avg prompt throughput: 655.1 tokens/s, Avg generation throughput: 1.0 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 12 reqs, GPU KV cache usage: 20.7%, CPU KV cache usage: 0.0%.
2025-04-08T12:10:40.602537502+08:00 INFO 04-07 21:10:40 metrics.py:455] Avg prompt throughput: 646.8 tokens/s, Avg generation throughput: 1.1 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 11 reqs, GPU KV cache usage: 21.5%, CPU KV cache usage: 0.0%.
2025-04-08T12:10:46.766708524+08:00 INFO 04-07 21:10:46 metrics.py:455] Avg prompt throughput: 662.9 tokens/s, Avg generation throughput: 2.1 tokens/s, Running: 7 reqs, Swapped: 0 reqs, Pending: 8 reqs, GPU KV cache usage: 26.4%, CPU KV cache usage: 0.0%.
2025-04-08T12:10:52.986051330+08:00 INFO 04-07 21:10:52 metrics.py:455] Avg prompt throughput: 657.0 tokens/s, Avg generation throughput: 1.9 tokens/s, Running: 7 reqs, Swapped: 0 reqs, Pending: 6 reqs, GPU KV cache usage: 22.0%, CPU KV cache usage: 0.0%.
2025-04-08T12:10:59.146242833+08:00 INFO 04-07 21:10:59 metrics.py:455] Avg prompt throughput: 662.6 tokens/s, Avg generation throughput: 2.8 tokens/s, Running: 9 reqs, Swapped: 0 reqs, Pending: 6 reqs, GPU KV cache usage: 24.7%, CPU KV cache usage: 0.0%.
2025-04-08T12:11:05.317306213+08:00 INFO 04-07 21:11:05 metrics.py:455] Avg prompt throughput: 661.0 tokens/s, Avg generation throughput: 3.4 tokens/s, Running: 9 reqs, Swapped: 0 reqs, Pending: 2 reqs, GPU KV cache usage: 23.8%, CPU KV cache usage: 0.0%.
2025-04-08T12:11:11.474292340+08:00 INFO 04-07 21:11:11 metrics.py:455] Avg prompt throughput: 662.3 tokens/s, Avg generation throughput: 3.7 tokens/s, Running: 12 reqs, Swapped: 0 reqs, Pending: 1 reqs, GPU KV cache usage: 26.5%, CPU KV cache usage: 0.0%.
The generation throughput drops to under 5 tokens/s while requests are queued, even though it can reach around 18 tokens/s otherwise. Is this a vLLM bug or a configuration issue?
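To put numbers on the log above, here is a rough helper I used (my own script, not part of vLLM; the log file path is just an example) that parses the metrics lines and divides the aggregate generation throughput by the number of running requests:

```python
import re

# Matches the vLLM 0.7.x metrics lines shown in the log above.
PATTERN = re.compile(
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs"
)

def per_request_speed(log_text: str) -> None:
    """Print aggregate and per-request decode throughput for each metrics line."""
    for match in PATTERN.finditer(log_text):
        gen = float(match.group("gen"))
        running = int(match.group("running"))
        per_req = gen / running if running else 0.0
        print(f"aggregate={gen:5.1f} tok/s  running={running:2d}  "
              f"per-request={per_req:4.2f} tok/s")

if __name__ == "__main__":
    # Example path; point this at the actual server log.
    with open("vllm.log") as f:
        per_request_speed(f.read())
```

Even the aggregate generation throughput never exceeds 4 tokens/s in this window, so each running request is getting well under 1 token/s, while prompt throughput stays saturated around 660 tokens/s.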
My vLLM information:
Version: 0.7.3
Arguments:
- '--tensor-parallel-size'
- '4'
- '--gpu-memory-utilization'
- '0.91'
- '--enable-auto-tool-choice'
- '--tool-call-parser'
- hermes
- '--disable-log-requests'
- '--uvicorn-log-level'
- warning
- '--enforce-eager'
- '--max-model-len'
- '60000'
- '--rope-scaling'
- >-
{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":32768}