I have started the vLLM server with:

```
vllm serve meta-llama/Llama-3.1-8B --port 8003 --max-model-len 2400 --uvicorn-log-level error --max-num-seqs 8 2>&1 | tee server.log
```
and I have started the benchmark CLI with the following command:

```
vllm bench serve --backend openai --base-url http://localhost:8003 --model meta-llama/Llama-3.1-8B --dataset-name custom --dataset-path vllm_custom_dataset.jsonl --custom-skip-chat-template --num-prompts 50 --max-concurrency 3 --result-dir ./ --result-filename vllm_benchmark_custom_prompts_concurrency2.json
```
As you can see, I set the number of prompts to 50, but the benchmark output below shows only 36 successful requests. What could be the reason for this?
```
============ Serving Benchmark Result ============
Successful requests:                     36
Maximum request concurrency:             3
Benchmark duration (s):                  45.18
Total input tokens:                      70188
Total generated tokens:                  9216
Request throughput (req/s):              0.80
Output token throughput (tok/s):         203.99
Peak output token throughput (tok/s):    207.00
Peak concurrent requests:                6.00
Total Token throughput (tok/s):          1757.58
---------------Time to First Token----------------
Mean TTFT (ms):                          37.15
Median TTFT (ms):                        35.54
P99 TTFT (ms):                           51.97
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.60
Median TPOT (ms):                        14.59
P99 TPOT (ms):                           14.71
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.60
Median ITL (ms):                         14.57
P99 ITL (ms):                            15.06
```
The most common reason for fewer successful requests than prompts in vLLM benchmarks is that some requests fail, typically because the prompt plus the expected output tokens exceeds the model’s maximum context length, because of invalid input formatting, or because of server-side errors. In your case, with `--max-model-len` set to 2400, any request whose prompt plus expected output exceeds that limit will fail, as seen in similar reports and error logs in the vLLM community. Malformed prompts or issues in the custom dataset can also cause failures, but the benchmark tool only counts these as failed requests and does not include detailed error messages in the summary output. You can check the detailed error logs or the result file for per-request error messages to confirm the cause.
This behavior is documented in several vLLM issues, where users observe fewer successful requests than expected, often due to input length or dataset formatting problems, and the benchmark tool marks these as failed without always surfacing the exact error in the summary output. For more details, see the discussion in vllm-project/vllm/issues/11167 and vllm-project/vllm/issues/10163.
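If you want to check the context-length hypothesis directly, you can tokenize the prompts in your dataset and compare them against `--max-model-len`. Below is a minimal sketch, not the benchmark's own logic: it assumes each JSONL line stores the text under a `prompt` key (adjust to your schema) and takes ~256 output tokens per request from your summary (9216 generated tokens / 36 requests). The Llama-3.1-8B tokenizer is gated, so loading it may require a Hugging Face token.

```python
# Rough sketch: count prompt tokens in the custom dataset and flag entries that
# leave too little room for output under --max-model-len 2400.
# Assumptions: each JSONL line has a "prompt" field (adjust to your schema),
# and the expected output length is ~256 tokens per request
# (9216 generated tokens / 36 requests in your summary).
import json
from transformers import AutoTokenizer

MAX_MODEL_LEN = 2400       # value passed to --max-model-len on the server
EXPECTED_OUTPUT_LEN = 256  # assumed per-request output budget

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

with open("vllm_custom_dataset.jsonl") as f:
    for i, line in enumerate(f):
        prompt = json.loads(line)["prompt"]  # assumed field name
        n_tokens = len(tokenizer.encode(prompt))
        if n_tokens + EXPECTED_OUTPUT_LEN > MAX_MODEL_LEN:
            print(f"line {i}: {n_tokens} prompt tokens + {EXPECTED_OUTPUT_LEN} "
                  f"output tokens exceeds max-model-len {MAX_MODEL_LEN}")
```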
Would you like a step-by-step guide on how to debug which requests failed and why?
To find out why some requests are not successful, you should inspect the detailed benchmark result file (e.g., vllm_benchmark_custom_prompts_concurrency2.json) and the server logs. The result file contains per-request status and error messages, which can help you identify if failures are due to input length, formatting, or server errors. This approach is recommended in vllm-project/vllm/issues/10163, where users are advised to check the saved results for error details.
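For example, here is a minimal sketch for checking the saved result file. It assumes per-request fields such as `errors` are present in the JSON; in recent vLLM versions these are only written when the benchmark is run with `--save-detailed`, so re-run with that flag if the keys are missing.

```python
# Rough sketch: inspect the benchmark result JSON for per-request errors.
# Assumes the file contains a per-request "errors" list (written when the
# benchmark is run with --save-detailed in recent vLLM versions).
import json
from collections import Counter

with open("vllm_benchmark_custom_prompts_concurrency2.json") as f:
    result = json.load(f)

errors = result.get("errors", [])
if not errors:
    print("No per-request error field found; re-run the benchmark with --save-detailed.")
else:
    failed = [e for e in errors if e]
    print(f"{len(failed)} of {len(errors)} requests reported an error")
    # Group identical error messages so the dominant failure mode stands out.
    for msg, count in Counter(failed).most_common():
        print(f"{count}x  {msg}")
```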
Additionally, reviewing the server.log for error traces during the benchmark run can reveal issues like context length exceeded or malformed input. This method is also suggested in vllm-project/vllm/issues/11167.
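A similarly minimal sketch for scanning the server log captured via `tee server.log`; the keyword list is just an assumption, so extend it if your failures surface under different messages:

```python
# Rough sketch: scan the server log for error traces emitted during the run.
# The keywords below are assumptions, not an exhaustive list of vLLM messages.
KEYWORDS = ("ERROR", "Traceback", "maximum context length", "exceeds")

with open("server.log", errors="replace") as f:
    for lineno, line in enumerate(f, start=1):
        if any(k in line for k in KEYWORDS):
            print(f"{lineno}: {line.rstrip()}")
```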
Would you like a step-by-step example of how to analyze the result file and logs?
In my experience, the detokenizer often causes failures: sometimes the model produces invalid or unexpected tokens, which leads to a crash or failed request in my setup.
@Binwon the issue doesn’t occur with a concurrency of 1, but whenever I benchmark with higher concurrency, say 2 or 3, there are always a few failed requests. I wanted to know if there’s a reason for this.
Also, for the requests that did succeed, is it fine to assume the latency numbers are correct?