Running Gemma 3 on multi-chip TPU fails

Hi.

I’m running Gemma 3 27B on multi-chip TPU via:

MODEL=google/gemma-3-27b-it
VLLM_USE_V1=1 vllm serve $MODEL --seed 42 --disable-log-requests --gpu-memory-utilization 0.95 --max-num-batched-tokens 4096 --max-num-seqs 512 --tensor-parallel-size 4 --max-model-len 2048 &
python benchmarks/benchmark_serving.py --backend vllm --model $MODEL --dataset-name random --random-input-len 1800 --random-output-len 128 --random-prefix-len 0 --num-prompts 1000

The server started successfully, but when I send the benchmark requests it crashes with this error:

INFO:     Application startup complete.
python benchmarks/benchmark_serving.py --backend vllm --model $MODEL --dataset-name random --random-input-len 1800 --random-output-len 128 --random-prefix-len 0 --num-prompts 1000
INFO 04-04 20:14:52 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
INFO 04-04 20:14:53 [__init__.py:239] Automatically detected platform tpu.
Namespace(backend='vllm', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='google/gemma-3-27b-it', tokenizer=None, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1800, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
INFO 04-04 20:15:02 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Starting initial single prompt test run...
INFO:     127.0.0.1:41240 - "POST /v1/completions HTTP/1.1" 200 OK
CRITICAL 04-04 20:15:09 [multiproc_executor.py:49] MulitprocExecutor got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
CRITICAL 04-04 20:15:09 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Traceback (most recent call last):
  File "/mnt/disks/persist/vllm/benchmarks/benchmark_serving.py", line 1030, in <module>
    main(args)
  File "/mnt/disks/persist/vllm/benchmarks/benchmark_serving.py", line 659, in main
    benchmark_result = asyncio.run(
  File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/mnt/disks/persist/vllm/benchmarks/benchmark_serving.py", line 294, in benchmark
    raise ValueError(
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Traceback (most recent call last):
  File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/aiohttp/client_proto.py", line 93, in connection_lost
    uncompleted = self._parser.feed_eof()
  File "aiohttp/_http_parser.pyx", line 508, in aiohttp._http_parser.HttpParser.feed_eof
aiohttp.http_exceptions.TransferEncodingError: 400, message:
  Not enough data for satisfy transfer length header.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/disks/persist/vllm/benchmarks/backend_request_func.py", line 287, in async_request_openai_completions
    async for chunk_bytes in response.content:
  File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/aiohttp/streams.py", line 52, in __anext__
    rv = await self.read_func()
  File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/aiohttp/streams.py", line 352, in readline
    return await self.readuntil()
  File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/aiohttp/streams.py", line 386, in readuntil
    await self._wait("readuntil")
  File "/home/xiowei/miniconda3/envs/vllm/lib/python3.10/site-packages/aiohttp/streams.py", line 347, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed: <TransferEncodingError: 400, message='Not enough data for satisfy transfer length header.'>

From this error it’s not clear where the real failure is. Has anyone seen something similar? Thanks!
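For what it’s worth, one way to take the benchmark script out of the picture is to send a single streaming request to the completions endpoint by hand (a sketch; the prompt and max_tokens here are arbitrary):

curl http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3-27b-it", "prompt": "Explain TPUs in one paragraph.", "max_tokens": 128, "stream": true}'

If the server process itself is dying, the stream should break here in the same way it does under the benchmark.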

I tried the offline test by running a modified tpu.py, and it fails too: gist:d641f2c1e33418f48d27cfc65386937f · GitHub
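For context, the offline test has roughly this shape (a minimal sketch following the examples/offline_inference tpu.py pattern, not the exact contents of the gist; the model and parallelism settings are taken from my serve command):

# Minimal offline sketch: load the model with the same tensor parallelism as
# the serve command and run a single generation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-3-27b-it",
    tensor_parallel_size=4,        # same TP degree as the failing serve run
    max_model_len=2048,
    gpu_memory_utilization=0.95,
)
sampling = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain TPUs in one paragraph."], sampling)
print(outputs[0].outputs[0].text)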

The error "MulitprocExecutor got fatal signal from worker processes, shutting down. See stack trace above for root cause issue." is very vague, though, and doesn’t say exactly what failed. Also, precompilation succeeded, as you can see from the log line "Compilation finished in in 4.54".

Note that Gemma 3 1B on a single chip works.
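For reference, the single-chip run is essentially the same serve command with the 1B model and no tensor parallelism, roughly:

MODEL=google/gemma-3-1b-it
VLLM_USE_V1=1 vllm serve $MODEL --seed 42 --disable-log-requests --gpu-memory-utilization 0.95 --max-num-batched-tokens 4096 --max-num-seqs 512 --tensor-parallel-size 1 --max-model-len 2048 &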

I pulled the latest code from main and, by running the tpu.py script, I was able to see the real error: gist:dad1febba89027747cb740b1e042e14a · GitHub

I also tried a smaller model, google/gemma-3-1b-it, but it didn’t trigger the error; I think that’s because gemma-3-1b-it does not support multi-modality. Both google/gemma-3-4b-it and google/gemma-3-27b-it result in the same error.
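A quick way to see the multi-modality difference (a sketch, not something verified in this thread) is to print which architecture each checkpoint registers in its Hugging Face config; the text-only 1B variant should report a plain causal-LM class, while the 4B and 27B variants should report a conditional-generation (text + image) class:

# Print the registered architecture for each Gemma 3 checkpoint.
from transformers import AutoConfig

for name in ("google/gemma-3-1b-it",
             "google/gemma-3-4b-it",
             "google/gemma-3-27b-it"):
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.architectures)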

The error you are encountering seems to be related to the multi-modality support in the Gemma 3 models. According to the vLLM documentation, the Gemma 3 models, such as google/gemma-3-4b-it and google/gemma-3-27b-it, support text and image inputs. However, the google/gemma-3-1b-it model is likely text-only, which is why it does not trigger the same error.

If multi-modality is indeed the trigger, the issue may be vLLM’s current handling of the mixed attention pattern Gemma 3 uses for text and image inputs. The documentation notes that V0 correctly implements the model’s attention pattern, while V1 uses a simplified pattern that may not fully match the original model’s behavior, especially with image inputs. That could explain why only the larger, multimodal models hit the error.
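If you want to check whether the V1 path is the trigger, one option (an untested suggestion; depending on your vLLM version the TPU backend may only support V1, in which case this won’t apply) is to fall back to V0 with the same flags you already use:

VLLM_USE_V1=0 vllm serve google/gemma-3-27b-it --seed 42 --disable-log-requests --gpu-memory-utilization 0.95 --max-num-batched-tokens 4096 --max-num-seqs 512 --tensor-parallel-size 4 --max-model-len 2048 &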

Would you like more details on how to potentially resolve this issue or further information on the differences between V0 and V1 handling of these models?
