I am running Qwen/Qwen3-4B-Instruct-2507 on vLLM 0.6.x using an NVIDIA L4 (24GB) GPU, and I am trying to optimize latency for a real-time voice application. My deployment uses the following vLLM configuration:
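(The original config block was not preserved; the launch is a near-default OpenAI-compatible server start, along these lines, with flag values shown for illustration only:)

```
vllm serve Qwen/Qwen3-4B-Instruct-2507 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096
# batching flags (--max-num-seqs, --max-num-batched-tokens) left at defaults
```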
To measure performance, I ran 20 concurrent requests with a short prompt (“Explain KV caching like I’m five.”) and max_tokens=256. The results are significantly worse than I expected for a 4B model on an L4. In streaming mode, average end-to-end latency was about 15.7 seconds (p50 ≈ 16.5s), with TTFT around 1.1–1.3 seconds. Non-streaming calls averaged about 13.3 seconds (p50 ≈ 13.7s). Aggregate throughput ranged from 300 to 380 tokens/sec across both tests. These numbers are far above the sub-second latency typically required for real-time workloads.
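For reference, the benchmark I used is roughly the following sketch (assumptions: the `openai` package is installed and a vLLM OpenAI-compatible endpoint is running at localhost:8000; prompt and model name as above):

```python
import asyncio
import statistics
import time

def summarize(samples):
    """Return (mean, p50) in seconds for a list of latency samples."""
    return statistics.mean(samples), statistics.median(samples)

async def one_request(client, prompt, max_tokens=256):
    """Stream one chat completion; return (TTFT, end-to-end latency)."""
    start = time.perf_counter()
    ttft = None
    stream = await client.chat.completions.create(
        model="Qwen/Qwen3-4B-Instruct-2507",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    async for chunk in stream:
        # First chunk carrying content marks time-to-first-token.
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start
    return ttft, time.perf_counter() - start

async def main(concurrency=20):
    from openai import AsyncOpenAI  # assumes `pip install openai`
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    results = await asyncio.gather(
        *[one_request(client, "Explain KV caching like I'm five.")
          for _ in range(concurrency)]
    )
    ttfts, e2es = zip(*results)
    print("TTFT mean/p50:", summarize(list(ttfts)))
    print("E2E  mean/p50:", summarize(list(e2es)))

# asyncio.run(main())  # run against a live vLLM server
```

The 20-way `asyncio.gather` is what produces the concurrency; sequential requests would hide the queueing delay being measured.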
I’m trying to determine whether this behavior is expected with vLLM’s continuous batching defaults on an L4, or whether specific configuration changes are necessary to reduce both TTFT and end-to-end latency. I would appreciate recommendations on settings such as --max-num-batched-tokens, batching limits, disabling or tuning continuous batching, or whether applying AWQ/GPTQ quantization meaningfully improves latency on this GPU. My goal is to reach <1s response time for short outputs at moderate concurrency, so any guidance from others who have optimized vLLM for low-latency inference on L4 would be helpful.
Your observed latencies are higher than expected for a 4B model on an L4, especially for real-time use. To optimize for low latency, reduce the batching parameters: lower --max-num-batched-tokens (e.g., to 2048 or even 1024) and cap --max-num-seqs at a small value (e.g., 4–8). This prioritizes prompt processing and reduces queueing delay, at the cost of some throughput. Also make sure chunked prefill is enabled (the default in vLLM 0.6.x); it balances prompt and decode latency, and smaller batch sizes further reduce TTFT and end-to-end latency for short prompts. Quantization (AWQ/GPTQ) may help throughput, but at small batch sizes on an L4 it often does not improve, and can even worsen, first-token latency due to kernel overheads and dequantization costs, as noted in vLLM discussions.
For real-time workloads, the recommended approach is to tune for minimal batching and concurrency, even if that reduces aggregate throughput.
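For example, a launch tuned for latency over throughput might look like this (a sketch; the flag values are starting points for an L4, not tuned numbers):

```
vllm serve Qwen/Qwen3-4B-Instruct-2507 \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 8 \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096
```

With --max-num-seqs 8, requests beyond the first 8 queue instead of inflating every in-flight request's decode step; if TTFT is still too high at your concurrency, try dropping --max-num-batched-tokens to 1024.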