Engine 000: Avg prompt throughput: 100.0 tokens/s, Avg generation throughput: 121.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 99.2%, Prefix cache hit rate: 94.9%
[1m39s < 10m23s] Progress: 137/1000 (0 failed) (13.7%), Decoding: 1, Prefilling: 38, TTFT: 0.00 (20684.01), ITL: 22.06 (23.27), Decoding throughput: 45.32 (181.19), Prefilling throughput: 0.00 (49.64)
Traceback (most recent call last):
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client_reqrep.py", line 539, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/streams.py", line 680, in read
    await self._waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/ysgg1/vllm_split/scripts/benchmark/backend_request_func.py", line 188, in async_request_openai
    async with session.post(url=api_url, json=payload, headers=headers) as response:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client.py", line 1517, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client.py", line 786, in _request
    resp = await handler(req)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client.py", line 764, in _connect_and_send_request
    await resp.start(conn)
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/client_reqrep.py", line 534, in start
    with self._timer:
         ^^^^^^^^^^^
  File "/opt/ac2/lib/python3.12/site-packages/aiohttp/helpers.py", line 713, in __exit__
    raise asyncio.TimeoutError from exc_val
TimeoutError
http: proxy error: context canceled
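For context on the traceback above: the CancelledError raised inside the awaited read, then re-raised as TimeoutError, is the normal shape of a client-side timeout in asyncio code — the timeout handler cancels the task that is still waiting on the stalled server, and the cancellation is converted into a TimeoutError. A minimal stdlib-only sketch of that mechanism (the function names here are illustrative, not from the benchmark script):

```python
import asyncio

async def stalled_server_response() -> None:
    # Stands in for `await protocol.read()` on a connection where the
    # decode server never sends bytes, so this await never completes.
    await asyncio.sleep(3600)

async def request_with_timeout(timeout_s: float) -> str:
    try:
        # wait_for cancels the inner coroutine when the deadline passes,
        # producing the same CancelledError -> TimeoutError chain seen
        # in the traceback above.
        await asyncio.wait_for(stalled_server_response(), timeout=timeout_s)
        return "ok"
    except asyncio.TimeoutError:
        return "timeout"

print(asyncio.run(request_with_timeout(0.05)))  # -> timeout
```

So the TimeoutError is only the symptom: the request sits in the server's queue past the client timeout because decoding has stalled, and the client then gives up.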
With the v0 engine, GPU KV cache usage never climbs this high and all requests complete normally. With the V1 engine, once GPU KV cache usage reaches about 99%, requests get stuck and the client reports timeouts (see the traceback above); after a longer wait, the decode server reports that memory is insufficient to handle the 1000 requests.