Hi.
I’m running Gemma 3 27B on a multi-chip TPU via:
MODEL=google/gemma-3-27b-it
VLLM_USE_V1=1 vllm serve $MODEL --seed 42 --disable-log-requests --gpu-memory-utilization 0.95 --max-num-batched-tokens 4096 --max-num-seqs 512 --tensor-parallel-size 4 --max-model-len 2048 &
python benchmarks/benchmark_serving.py --backend vllm --model $MODEL --dataset-name random --random-input-len 1800 --random-output-len 128 --random-prefix-len 0 --num-prompts 1000
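For reference, the benchmark's initial "single prompt test run" boils down to one streamed POST to /v1/completions. The sketch below is my rough approximation of it (the real payload is built in backend_request_func.py, and the real run sends a random 1800-token prompt, which I haven't copied here):

# Rough stand-in for benchmark_serving.py's initial single prompt test run:
# one streamed completion request against the server started above.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/completions",
    json={
        "model": "google/gemma-3-27b-it",
        "prompt": "Hello",   # placeholder; the benchmark uses a random 1800-token prompt
        "max_tokens": 128,
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if line:
        print(line.decode())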
The server came up successfully, but as soon as I send the benchmark requests it crashes with the error below:
INFO: Application startup complete.
python benchmarks/benchmark_serving.py --backend vllm --model $MODEL --dataset-name random --random-input-len 1800 --random-output-len 128 --random-prefix-len 0 --num-prompts 1000
INFO 04-04 20:14:52 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
INFO 04-04 20:14:53 [__init__.py:239] Automatically detected platform tpu.
Namespace(backend='vllm', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='google/gemma-3-27b-it', tokenizer=None, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1800, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
INFO 04-04 20:15:02 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Starting initial single prompt test run...
INFO: 127.0.0.1:41240 - "POST /v1/completions HTTP/1.1" 200 OK
CRITICAL 04-04 20:15:09 [multiproc_executor.py:49] MulitprocExecutor got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
CRITICAL 04-04 20:15:09 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Traceback (most recent call last):
File "/mnt/disks/persist/vllm/benchmarks/benchmark_serving.py", line 1030, in <module>
main(args)
File "/mnt/disks/persist/vllm/benchmarks/benchmark_serving.py", line 659, in main
benchmark_result = asyncio.run(
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/mnt/disks/persist/vllm/benchmarks/benchmark_serving.py", line 294, in benchmark
raise ValueError(
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Traceback (most recent call last):
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/aiohttp/client_proto.py", line 93, in connection_lost
uncompleted = self._parser.feed_eof()
File "aiohttp/_http_parser.pyx", line 508, in aiohttp._http_parser.HttpParser.feed_eof
aiohttp.http_exceptions.TransferEncodingError: 400, message:
Not enough data for satisfy transfer length header.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/disks/persist/vllm/benchmarks/backend_request_func.py", line 287, in async_request_openai_completions
async for chunk_bytes in response.content:
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/aiohttp/streams.py", line 52, in __anext__
rv = await self.read_func()
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/aiohttp/streams.py", line 352, in readline
return await self.readuntil()
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/aiohttp/streams.py", line 386, in readuntil
await self._wait("readuntil")
File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/aiohttp/streams.py", line 347, in _wait
await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data for satisfy transfer length header.'>
From the client-side error alone it’s not clear where the real failure actually is: the CRITICAL lines say to see the worker stack trace above, but no such trace appears in my output. My guess is that the TransferEncodingError is only a symptom, i.e. the server closed the chunked response mid-stream once the worker process died.
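To narrow it down, my next step is to replay one long request without streaming, hoping a mid-generation crash then surfaces as an HTTP status and error body instead of a truncated chunked stream. A sketch (the repeated-word prompt is only a rough stand-in for --random-input-len 1800):

# Same request, non-streamed: a server-side crash should now show up as an
# HTTP status code plus error body rather than a TransferEncodingError.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/completions",
    json={
        "model": "google/gemma-3-27b-it",
        "prompt": "hello " * 1800,   # rough stand-in for a 1800-token random prompt
        "max_tokens": 128,
        "stream": False,
    },
    timeout=600,
)
print(resp.status_code)
print(resp.text[:1000])

Has anyone seen something similar? Thanks!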