When I run inference on qwen3-vl with vLLM and send several requests with identical content, say 5 at once, the first 4 return relatively quickly, but the last one, even though it shows as running rather than waiting, finishes much later. For example, the first four come back in about 40 s while the last one takes around 6 minutes. What could cause this?
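A minimal reproduction sketch of that test, assuming a vLLM OpenAI-compatible server at http://localhost:8000 and a hypothetical served model name (adjust both to your deployment); it fires N identical chat requests concurrently and prints each request's end-to-end latency:

```python
# Reproduction sketch: send N identical requests concurrently and time each one.
# Assumptions: server URL, model name, and prompt below are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/chat/completions"  # assumed OpenAI-compatible endpoint
MODEL = "Qwen/Qwen3-VL"  # hypothetical; use the model name your server actually reports
N_REQUESTS = 5

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Describe the image in detail."}],
    "max_tokens": 512,
}

def send(i: int) -> tuple[int, float]:
    """Send one request and return (index, end-to-end latency in seconds)."""
    start = time.perf_counter()
    resp = requests.post(URL, json=payload, timeout=600)
    resp.raise_for_status()
    return i, time.perf_counter() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=N_REQUESTS) as pool:
        for i, latency in pool.map(send, range(N_REQUESTS)):
            print(f"request {i}: {latency:.1f}s")
```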

The dashboard visualizes key vLLM production metrics: E2E Request Latency (P99, P95, P90, Avg), Token Throughput (prompt and generation tokens/sec), Time Per Output Token Latency (various percentiles), and Scheduler State (running/waiting requests). The spike at 13:50 in both latency and throughput, together with the peak in running/waiting requests, points to a temporary load increase or resource contention, followed by recovery and stabilization. This pattern is typical of systems under bursty traffic or resource constraints, as described in the vLLM metrics documentation and Grafana dashboard examples.

These metrics help diagnose performance bottlenecks, monitor SLOs, and understand how request load impacts latency and throughput. Would you like a detailed explanation of how to interpret or act on these metrics?
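For a quick check outside Grafana, a small sketch like the one below can scrape the raw Prometheus metrics that back those panels. It assumes the default /metrics endpoint of the OpenAI-compatible server and vLLM's documented metric names; both should be verified against your deployment and vLLM version.

```python
# Quick check of scheduler state and E2E latency counters without Grafana.
# Assumptions: metrics exposed at http://localhost:8000/metrics (default for the
# OpenAI-compatible server); metric names may differ across vLLM versions, so
# compare against your own /metrics output.
import requests

METRICS_URL = "http://localhost:8000/metrics"
INTERESTING = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:e2e_request_latency_seconds_sum",
    "vllm:e2e_request_latency_seconds_count",
)

def dump_metrics() -> None:
    """Print only the scheduler-state and E2E-latency lines from the metrics endpoint."""
    text = requests.get(METRICS_URL, timeout=10).text
    for line in text.splitlines():
        if line.startswith(INTERESTING):
            print(line)

if __name__ == "__main__":
    dump_metrics()
```

Polling this in a loop while re-running the five-request test would show whether the slow request really stays in the running batch the whole time or is being starved of decode progress.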
