Error when running disaggregated prefill

Command
HIP_VISIBLE_DEVICES=0,1,2,3 python3 \
  -m vllm.entrypoints.openai.api_server \
  --model $model \
  --port 8100 \
  --max-model-len 10000 \
  --quantization fp8 \
  --dtype float16 \
  --gpu-memory-utilization 0.9 \
  --max-num-batched-tokens 10000 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --kv-transfer-config \
  '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9}' &

HIP_VISIBLE_DEVICES=4,5,6,7 python3 \
  -m vllm.entrypoints.openai.api_server \
  --model $model \
  --port 8200 \
  --max-model-len 10000 \
  --quantization fp8 \
  --dtype float16 \
  --gpu-memory-utilization 0.9 \
  --max-num-batched-tokens 10000 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --kv-transfer-config \
  '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":5e9}' &
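Both servers sit behind vLLM's disagg_prefill_proxy_server.py, the script that appears in the traceback below. Assuming the upstream benchmark layout, it is started after both servers (path relative to the vLLM repo root; in the upstream benchmark it listens on port 8000 by default):

python3 benchmarks/disagg_benchmarks/disagg_prefill_proxy_server.py &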

qps = 10, input len = 256, output len = 256
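That load roughly corresponds to a benchmark_serving.py invocation along these lines (a hedged sketch; flag names can vary across vLLM versions, and requests should go through the proxy port rather than 8100/8200 directly):

python3 benchmarks/benchmark_serving.py \
  --backend vllm \
  --model $model \
  --dataset-name random \
  --random-input-len 256 \
  --random-output-len 256 \
  --request-rate 10 \
  --port 8000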

Error
Traceback (most recent call last):
[rank0]: File "<string>", line 1, in <module>
[rank0]: File "/opt/conda/envs/py_3.12/lib/python3.12/multiprocessing/spawn.py", line 122, in spawn_main
[rank0]: exitcode = _main(fd, parent_sentinel)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/py_3.12/lib/python3.12/multiprocessing/spawn.py", line 135, in _main
[rank0]: return self._bootstrap(parent_sentinel)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/py_3.12/lib/python3.12/multiprocessing/process.py", line 332, in _bootstrap
[rank0]: threading._shutdown()
[rank0]: File "/opt/conda/envs/py_3.12/lib/python3.12/threading.py", line 1594, in _shutdown
[rank0]: atexit_call()
[rank0]: File "/opt/conda/envs/py_3.12/lib/python3.12/concurrent/futures/thread.py", line 31, in _python_exit
[rank0]: t.join()
[rank0]: File "/opt/conda/envs/py_3.12/lib/python3.12/threading.py", line 1149, in join
[rank0]: self._wait_for_tstate_lock()
[rank0]: File "/opt/conda/envs/py_3.12/lib/python3.12/threading.py", line 1169, in _wait_for_tstate_lock
[rank0]: if lock.acquire(block, timeout):
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/var/lib/jenkins/vllm/vllm/engine/multiprocessing/engine.py", line 426, in signal_handler
[rank0]: raise KeyboardInterrupt("MQLLMEngine terminated")
[rank0]: KeyboardInterrupt: MQLLMEngine terminated
[2025-07-10 23:19:50,661] ERROR in app: Exception on request POST /v1/chat/completions
Traceback (most recent call last):
File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/aiohttp/connector.py", line 1115, in _wrap_create_connection
sock = await aiohappyeyeballs.start_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/aiohappyeyeballs/impl.py", line 122, in start_connection
raise first_exception
File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/aiohappyeyeballs/impl.py", line 73, in start_connection
sock = await _connect_sock(
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/aiohappyeyeballs/impl.py", line 208, in _connect_sock
await loop.sock_connect(sock, address)
File "/opt/conda/envs/py_3.12/lib/python3.12/asyncio/selector_events.py", line 651, in sock_connect
return await fut
^^^^^^^^^
File "/opt/conda/envs/py_3.12/lib/python3.12/asyncio/selector_events.py", line 691, in _sock_connect_cb
raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [Errno 111] Connect call failed ('127.0.0.1', 8100)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/quart/app.py", line 1464, in handle_request
return await self.full_dispatch_request(request_context)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/quart/app.py", line 1502, in full_dispatch_request
result = await self.handle_user_exception(error)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/quart/app.py", line 1059, in handle_user_exception
raise error
File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/quart/app.py", line 1500, in full_dispatch_request
result = await self.dispatch_request(request_context)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/quart/app.py", line 1597, in dispatch_request
return await self.ensure_async(handler)(**request_.view_args) # type: ignore[return-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/jenkins/vllm/benchmarks/disagg_benchmarks/disagg_prefill_proxy_server.py", line 87, in catch_all
async with session.request(method, upstream, headers=request.headers, json=data) as resp:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/aiohttp/client.py", line 1425, in __aenter__
self._resp: _RetType = await self._coro
^^^^^^^^^^^^^^^^
File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/aiohttp/client.py", line 703, in _request
conn = await self._connector.connect(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/aiohttp/connector.py", line 548, in connect
proto = await self._create_connection(req, traces, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/aiohttp/connector.py", line 1056, in _create_connection
_, proto = await self._create_direct_connection(req, traces, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/aiohttp/connector.py", line 1406, in _create_direct_connection
raise last_exc
File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/aiohttp/connector.py", line 1375, in _create_direct_connection
transp, proto = await self._wrap_create_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/py_3.12/lib/python3.12/site-packages/aiohttp/connector.py", line 1130, in _wrap_create_connection
raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host localhost:8100 ssl:default [Connect call failed ('127.0.0.1', 8100)]

The error message Cannot connect to host localhost:8100 ssl:default [Connect call failed ('127.0.0.1', 8100)] means the proxy or client is trying to reach the vLLM server on port 8100, but nothing is listening at that address. This is a connection error, not a vLLM-specific bug: the server process on port 8100 either failed to start, crashed, or was no longer listening at the time of the request. The traceback also shows KeyboardInterrupt: MQLLMEngine terminated, which indicates the engine process was killed or exited unexpectedly, whether by a manual interrupt or an internal error. Nothing in the provided logs points to the command-line arguments themselves; the server simply must be running and healthy before the proxy or client can connect to it.

To resolve this, make sure the vLLM server on port 8100 starts successfully and stays up before you launch the proxy or send requests. Check the server logs for earlier errors or crashes, and confirm that no other process is bound to the port. Similar connection errors in the vLLM issue tracker and troubleshooting discussions confirm this is a general networking/process-management issue rather than a vLLM API or configuration bug.
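For example, you can confirm the server is actually listening before starting the proxy (the /health endpoint is exposed by vLLM's OpenAI-compatible server; ss and curl availability depends on your image):

# check that something is listening on port 8100
ss -ltnp | grep 8100

# probe the health endpoint; prints 200 once the engine is up
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8100/health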

Would you like more detail on how to debug why the vLLM server process is not staying up?

The 8100 connection was successful at first; it completed about 30% of the requests and then gave the error.

If the vLLM server on port 8100 completed some requests and then started returning connection errors, it most likely crashed or was killed mid-run. This can happen due to out-of-memory (OOM) errors, internal exceptions, or resource exhaustion. The KeyboardInterrupt: MQLLMEngine terminated in your logs indicates the engine process was interrupted or terminated unexpectedly, which would cause every subsequent connection attempt to fail.

Check the vLLM server logs for OOM errors, assertion failures, or stack traces immediately before the crash. Common causes include exceeding GPU memory limits, too many concurrent requests, or long-running requests that exhaust resources. After a crash, the server must be restarted before it can handle requests again.
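A few quick checks on a ROCm host like the one above (server_8100.log is a placeholder for wherever you redirected the server's output):

# host-RAM OOM: did the kernel OOM-killer terminate the process?
dmesg | grep -i -E 'killed process|out of memory'

# GPU OOM or HIP failures in the server log
grep -i -E 'out of memory|hip error' server_8100.log

# current VRAM usage per GPU
rocm-smi --showmeminfo vram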

Would you like step-by-step guidance on how to diagnose the server crash?

Is it running into OOM because of the KV buffer in disaggregated prefill?

Yes, it is possible that the server is running into out-of-memory (OOM) errors due to the KV buffer size in disaggregated prefill. If kv_buffer_size is set too large, it reduces the GPU memory available for inference, which can lead to OOM, especially under high concurrency or large batch sizes. A common recommendation is to keep kv_buffer_size at roughly 5-10% of GPU memory and to monitor for OOM events.

If you see OOM errors, try lowering kv_buffer_size or reducing gpu_memory_utilization to leave more headroom for inference and buffer operations.
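As a concrete starting point, the producer command above could be relaunched with a smaller buffer and a little more headroom (illustrative values, not tuned for your hardware; mirror the change on the consumer side):

HIP_VISIBLE_DEVICES=0,1,2,3 python3 \
  -m vllm.entrypoints.openai.api_server \
  --model $model \
  --port 8100 \
  --max-model-len 10000 \
  --quantization fp8 \
  --dtype float16 \
  --gpu-memory-utilization 0.8 \
  --max-num-batched-tokens 10000 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --kv-transfer-config \
  '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":2e9}' &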

Would you like more detail on how to tune these parameters for your workload?