I have started the vLLM server with:

```
vllm serve meta-llama/Llama-3.1-8B --port 8003 --max-model-len 2400 --uvicorn-log-level error --max-num-seqs 8 2>&1 | tee server.log
```
and I have started the benchmark CLI with the following command:

```
vllm bench serve --backend openai --base-url http://localhost:8003 --model meta-llama/Llama-3.1-8B --dataset-name custom --dataset-path vllm_custom_dataset.jsonl --custom-skip-chat-template --num-prompts 50 --max-concurrency 3 --result-dir ./ --result-filename vllm_benchmark_custom_prompts_concurrency2.json
```
As you can see, I set the number of prompts to 50, but the benchmark output below shows only 36 successful requests. What could be the reason for this?
```
============ Serving Benchmark Result ============
Successful requests:                     36
Maximum request concurrency:             3
Benchmark duration (s):                  45.18
Total input tokens:                      70188
Total generated tokens:                  9216
Request throughput (req/s):              0.80
Output token throughput (tok/s):         203.99
Peak output token throughput (tok/s):    207.00
Peak concurrent requests:                6.00
Total Token throughput (tok/s):          1757.58
---------------Time to First Token----------------
Mean TTFT (ms):                          37.15
Median TTFT (ms):                        35.54
P99 TTFT (ms):                           51.97
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.60
Median TPOT (ms):                        14.59
P99 TPOT (ms):                           14.71
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.60
Median ITL (ms):                         14.57
P99 ITL (ms):                            15.06
```
The most common reason for fewer successful requests than prompts in vLLM benchmarks is that some requests fail, typically because the prompt plus the expected output tokens exceeds the model’s maximum context length, because of invalid input formatting, or because of server-side errors. In your case, with `--max-model-len` set to 2400, any request whose prompt plus expected output exceeds that limit will fail, as seen in similar reports and error logs in the vLLM community. Malformed prompts or issues in the custom dataset can also cause failures, but the benchmark tool only counts these as failed requests and does not include detailed error messages in the summary output. You can check the detailed error logs or the result file for per-request error messages to confirm the cause.
This behavior is documented in several vLLM issues, where users observe fewer successful requests than expected, often due to input length or dataset formatting problems, and the benchmark tool marks these as failed without always surfacing the exact error in the summary output. For more details, see the discussion in vllm-project/vllm/issues/11167 and vllm-project/vllm/issues/10163.
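If you want to check the context-length hypothesis directly, you can tokenize the prompts in your dataset and compare them against `--max-model-len`. Below is a minimal sketch, not the benchmark's own logic: it assumes each JSONL line stores the text under a `prompt` key (adjust to your schema) and takes ~256 output tokens per request from your summary (9216 generated tokens / 36 requests). The Llama-3.1-8B tokenizer is gated, so loading it may require a Hugging Face token.

```python
# Rough sketch: count prompt tokens in the custom dataset and flag entries that
# leave too little room for output under --max-model-len 2400.
# Assumptions: each JSONL line has a "prompt" field (adjust to your schema),
# and the expected output length is ~256 tokens per request
# (9216 generated tokens / 36 requests in your summary).
import json
from transformers import AutoTokenizer

MAX_MODEL_LEN = 2400       # value passed to --max-model-len on the server
EXPECTED_OUTPUT_LEN = 256  # assumed per-request output budget

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

with open("vllm_custom_dataset.jsonl") as f:
    for i, line in enumerate(f):
        prompt = json.loads(line)["prompt"]  # assumed field name
        n_tokens = len(tokenizer.encode(prompt))
        if n_tokens + EXPECTED_OUTPUT_LEN > MAX_MODEL_LEN:
            print(f"line {i}: {n_tokens} prompt tokens + {EXPECTED_OUTPUT_LEN} "
                  f"output tokens exceeds max-model-len {MAX_MODEL_LEN}")
```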
Would you like a step-by-step guide on how to debug which requests failed and why?
To find out why some requests are not successful, you should inspect the detailed benchmark result file (e.g., vllm_benchmark_custom_prompts_concurrency2.json) and the server logs. The result file contains per-request status and error messages, which can help you identify if failures are due to input length, formatting, or server errors. This approach is recommended in vllm-project/vllm/issues/10163, where users are advised to check the saved results for error details.
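For example, here is a minimal sketch for checking the saved result file. It assumes per-request fields such as `errors` are present in the JSON; in recent vLLM versions these are only written when the benchmark is run with `--save-detailed`, so re-run with that flag if the keys are missing.

```python
# Rough sketch: inspect the benchmark result JSON for per-request errors.
# Assumes the file contains a per-request "errors" list (written when the
# benchmark is run with --save-detailed in recent vLLM versions).
import json
from collections import Counter

with open("vllm_benchmark_custom_prompts_concurrency2.json") as f:
    result = json.load(f)

errors = result.get("errors", [])
if not errors:
    print("No per-request error field found; re-run the benchmark with --save-detailed.")
else:
    failed = [e for e in errors if e]
    print(f"{len(failed)} of {len(errors)} requests reported an error")
    # Group identical error messages so the dominant failure mode stands out.
    for msg, count in Counter(failed).most_common():
        print(f"{count}x  {msg}")
```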
Additionally, reviewing the server.log for error traces during the benchmark run can reveal issues like context length exceeded or malformed input. This method is also suggested in vllm-project/vllm/issues/11167.
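A similarly minimal sketch for scanning the server log captured via `tee server.log`; the keyword list is just an assumption, so extend it if your failures surface under different messages:

```python
# Rough sketch: scan the server log for error traces emitted during the run.
# The keywords below are assumptions, not an exhaustive list of vLLM messages.
KEYWORDS = ("ERROR", "Traceback", "maximum context length", "exceeds")

with open("server.log", errors="replace") as f:
    for lineno, line in enumerate(f, start=1):
        if any(k in line for k in KEYWORDS):
            print(f"{lineno}: {line.rstrip()}")
```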
Would you like a step-by-step example of how to analyze the result file and logs?
In my experience, the detokenizer often causes failures: sometimes the model produces invalid or unexpected tokens, which leads to a crash or failed request in my setup.
@Binwon the issue doesn’t occur with a concurrency of 1, but whenever I benchmark with higher concurrency, say 2 or 3, there are always a few failed requests. I wanted to know if there’s a reason for this.
Also, for the requests that did succeed, is it fine to assume the latency numbers are correct?