Suggestion to improve inference speed

I am running Qwen/Qwen3-4B-Instruct-2507 on vLLM 0.6.x using an NVIDIA L4 (24GB) GPU, and I am trying to optimize latency for a real-time voice application. My deployment uses the following vLLM configuration:

python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --host 0.0.0.0 --port 8000 \
  --download-dir /models \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --tensor-parallel-size 1

To measure performance, I ran 20 concurrent requests with a short prompt (“Explain KV caching like I’m five.”) and max_tokens=256. The measured latencies are significantly higher than expected for a 4B model on an L4. In streaming mode, I observed an average end-to-end latency of about 15.7 seconds (p50 ≈ 16.5 s), with TTFT around 1.1–1.3 seconds. Non-streaming calls averaged about 13.3 seconds (p50 ≈ 13.7 s). Aggregate throughput ranged between 300 and 380 tokens/sec across both tests. These numbers are far too high for the sub-second responses typically required for real-time workloads.
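
For context on how those two numbers differ, TTFT and end-to-end latency for a streaming response can be measured with a loop like this (a minimal sketch with a simulated stream; in a real benchmark, `chunks` would be the streamed text deltas coming back from the server):

```python
import time

def measure_stream(chunks):
    """Measure time-to-first-token (TTFT) and end-to-end latency over a
    stream of output chunks (e.g. the text deltas of an SSE response)."""
    start = time.perf_counter()
    ttft = None
    n_chunks = 0
    for _ in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk arrived
        n_chunks += 1
    e2e = time.perf_counter() - start
    return ttft, e2e, n_chunks

# Simulated stream standing in for a real streaming completion:
def fake_stream():
    for piece in ["Explain", " KV", " caching"]:
        time.sleep(0.01)  # pretend per-chunk decode time
        yield piece

ttft, e2e, n = measure_stream(fake_stream())
print(f"TTFT={ttft:.3f}s  E2E={e2e:.3f}s  chunks={n}")
```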

I’m trying to determine whether this behavior is expected with vLLM’s continuous batching defaults on an L4, or whether specific configuration changes are necessary to reduce both TTFT and end-to-end latency. I would appreciate recommendations on settings such as --max-num-batched-tokens, batching limits, disabling or tuning continuous batching, or whether applying AWQ/GPTQ quantization meaningfully improves latency on this GPU. My goal is to reach <1s response time for short outputs at moderate concurrency, so any guidance from others who have optimized vLLM for low-latency inference on L4 would be helpful.

Your observed latencies are higher than expected for a 4B model on an L4, especially for real-time use. To optimize for low-latency, you should reduce batching parameters: set --max-num-batched-tokens to a lower value (e.g., 2048 or even 1024) and limit --max-num-seqs to a small number (e.g., 4–8). This prioritizes prompt processing and reduces queueing delays, at the cost of some throughput. Also, ensure chunked prefill is enabled (default in vLLM 0.6.x), as it helps balance prompt and decode latency, but smaller batch sizes further reduce TTFT and E2E latency for short prompts. Quantization (AWQ/GPTQ) may help with throughput but often does not improve—and can worsen—first-token latency for small batches on L4, due to kernel overheads and dequantization costs, as noted in vLLM discussions.

For real-time workloads, the recommended approach is to tune for minimal batching and concurrency, even if it reduces aggregate throughput. Example:

python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --host 0.0.0.0 --port 8000 \
  --download-dir /models \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --tensor-parallel-size 1 \
  --max-num-batched-tokens 1024 \
  --max-num-seqs 4

This should significantly lower TTFT and E2E latency for short, concurrent requests. For further details, see Optimization and Tuning, Performance Tuning with Chunked Prefill, and AWQ quantization latency discussion.

Would you like a step-by-step breakdown or more details on quantization effects?

Hint: Mention @RunLLM in the post for followups.

@RunLLM For --max-model-len 8192 and --max-num-batched-tokens 1024, won’t vLLM lose the prompt tokens if the batched tokens are less than the model-len set? Like, won’t it lose some tokens or context during prefill since it is batched?

No, you will not lose prompt tokens or context. If --max-num-batched-tokens is less than --max-model-len, vLLM will automatically chunk the prompt into multiple steps during prefill, ensuring the full prompt is processed in sequence. This does not drop or truncate tokens; it just splits the computation into smaller batches for efficiency and lower latency. This behavior is documented in the vLLM optimization guide and confirmed in vLLM discussions—chunked prefill preserves all input tokens, regardless of batch size settings.

Would you like more detail on how chunked prefill works internally?

@RunLLM Yeah, could you detail it out?

Certainly! Here’s how vLLM handles cases where --max-num-batched-tokens is less than --max-model-len:

  1. Chunked Prefill Mechanism:
    When a prompt’s length exceeds the max_num_batched_tokens limit, vLLM automatically splits (chunks) the prompt into smaller segments. Each chunk fits within the batch token budget and is processed sequentially. The model accumulates the KV cache across these steps, so the entire prompt is effectively processed in order—no tokens are lost or skipped. This is called “chunked prefill” and is enabled by default in vLLM v1+ (source).

  2. No Loss of Context:
    Each chunk updates the model’s internal state (KV cache), so when the next chunk is processed, it continues from where the previous left off. This ensures the full prompt is seen by the model, even if it’s longer than the per-batch token limit. The only tradeoff is a slight increase in total prefill time for very long prompts, but all context is preserved (source).

Example:
If you set --max-model-len 8192 and --max-num-batched-tokens 1024, and submit a prompt of 4096 tokens, vLLM will process it in 4 sequential chunks of 1024 tokens each, updating the KV cache after each chunk. The model will then generate output as if it had seen the entire 4096-token prompt at once.
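
The chunking arithmetic above can be sketched as follows (an illustrative toy, not vLLM’s actual scheduler code):

```python
def prefill_chunks(prompt_len: int, max_num_batched_tokens: int) -> list[int]:
    """Split a prompt into sequential prefill chunk sizes, each within the
    per-step token budget. Every token is covered; nothing is dropped."""
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        step = min(remaining, max_num_batched_tokens)
        chunks.append(step)
        remaining -= step
    return chunks

# A 4096-token prompt with a 1024-token budget: four sequential chunks.
print(prefill_chunks(4096, 1024))  # [1024, 1024, 1024, 1024]
# The chunk sizes always sum to the prompt length, so no context is lost.
```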

Let me know if you want a code-level or scheduling diagram explanation!

@RunLLM What happens if --max-model-len and --max-num-batched-tokens are both set to 16384? Will it be fast or slow? The input prompts will mostly not exceed 10K tokens.

If both --max-model-len and --max-num-batched-tokens are set to 16384, vLLM can schedule up to 16,384 tokens in a single step. That favors throughput, but it can hurt latency under concurrency: a single large prefill (e.g., a 10K-token prompt) can fill an entire step, stalling token generation for the other running requests and raising memory pressure. For real-time, low-latency workloads, a smaller --max-num-batched-tokens (e.g., 2048) forces long prompts to prefill in chunks that interleave with decode steps, improving TTFT and inter-token latency at some cost in throughput (Performance Tuning with Chunked Prefill).

In summary: high values favor throughput, but for low-latency and short prompts, use a lower --max-num-batched-tokens for faster responses.

@RunLLM Will the vLLM wait until the max-num-batched-tokens are accumulated?

No, vLLM does not wait until max-num-batched-tokens are accumulated before processing; it schedules batches dynamically based on available requests and the current token budget. Requests are processed as soon as possible, up to the specified limits, to minimize latency (Optimization and Tuning, Performance Tuning with Chunked Prefill).

Would you like more detail on the dynamic batching and scheduling policy?

@RunLLM Yeah you can continue giving more details and what is the default scheduling policy?

vLLM uses a dynamic batching scheduler that does not wait for the batch to fill; it processes available requests immediately, up to the limits set by max-num-batched-tokens and max-num-seqs. This minimizes latency for real-time workloads, as new requests are picked up and processed as soon as resources allow (Optimization and Tuning).

The default scheduling policy is First-Come, First-Served (FCFS), meaning requests are handled in the order they arrive. You can change this to a priority-based policy if needed, but FCFS is the standard for most deployments (SchedulerConfig).
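
As an illustration only (a toy model, not vLLM’s real scheduler), one FCFS scheduling step under both limits can be sketched like this:

```python
from collections import deque

def schedule_step(waiting: deque, max_num_batched_tokens: int, max_num_seqs: int):
    """Pick requests first-come-first-served until either the token budget
    or the sequence cap is hit; whatever fits runs now, the rest stays queued."""
    batch, budget = [], max_num_batched_tokens
    while waiting and len(batch) < max_num_seqs:
        req_id, tokens = waiting[0]
        take = min(tokens, budget)
        if take == 0:
            break  # token budget exhausted for this step
        batch.append((req_id, take))
        budget -= take
        if take == tokens:
            waiting.popleft()                      # fully scheduled this step
        else:
            waiting[0] = (req_id, tokens - take)   # chunked; continues next step
    return batch

q = deque([("A", 900), ("B", 300), ("C", 100)])
print(schedule_step(q, max_num_batched_tokens=1024, max_num_seqs=4))
# → [('A', 900), ('B', 124)]   (B's remaining 176 tokens and C wait for the next step)
```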

@RunLLM The log below keeps repeating until the request hits a timeout error:
(APIServer pid=1) INFO 03-11 00:07:18 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.4%, Prefix cache hit rate: 85.8%, MM cache hit rate: 0.0%

I have set --max-model-len 16384 and --max-num-batched-tokens 4096. The input prompt is at most 8192 tokens.

This repeating log line does not necessarily indicate a hang: “Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs” shows the engine is still decoding a single request, just slowly, so a long output can simply outlast the client’s timeout. “Avg prompt throughput: 0.0 tokens/s” only means prefill has already finished for the running request. Since your prompt (≤8192 tokens) fits well within --max-model-len 16384 and chunked prefill handles the --max-num-batched-tokens 4096 budget, no tokens are being lost; the more likely culprits are slow generation or a too-short client-side timeout (Troubleshooting).

Common causes include insufficient GPU memory, generation that is simply slower than the client timeout allows, or in rare cases a bug in the model or vLLM version. Try raising the client timeout, lowering max-num-batched-tokens (e.g., to 1024), enabling debug logging (export VLLM_LOGGING_LEVEL=DEBUG), and monitoring GPU utilization. If the issue persists, consult the troubleshooting guide for further debugging steps.
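
The debugging steps above can be sketched as follows (the server flags mirror the commands earlier in this thread; adjust the model and paths to your setup):

```shell
# Turn on verbose engine logging before starting the server
export VLLM_LOGGING_LEVEL=DEBUG

# Relaunch with a smaller per-step token budget
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --max-model-len 16384 \
  --max-num-batched-tokens 1024

# In another terminal, watch GPU utilization and memory once per second
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```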

@RunLLM If I set max-num-batched-tokens to 1024 won’t that make the throughput slower?

Yes, setting max-num-batched-tokens to 1024 will generally reduce throughput (tokens/sec) compared to higher values, because smaller batches are processed at a time. However, this tradeoff is necessary for lower latency in real-time applications—smaller batches mean requests are processed faster, reducing wait time for each user (Performance Tuning with Chunked Prefill).

Would you like a detailed explanation of the throughput-latency tradeoff?

@RunLLM Does vLLM make sure the batched tokens don’t overlap between requests?

Yes, vLLM ensures that batched tokens from different requests do not overlap; each request’s tokens are kept distinct within the batch, and the model processes them independently. The batching mechanism simply groups requests for efficiency, but each sequence’s context and output are isolated (Performance Tuning with Chunked Prefill).
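
A toy sketch of that isolation (illustrative only, not vLLM internals): each sequence in a batch keeps its own token list in its own cache slot, keyed by request id, so batching never mixes tokens across requests:

```python
def run_batch(batch, kv_cache):
    """Process a batch of (request_id, new_tokens) pairs. Tokens are
    appended only to that request's own cache entry; sequences never mix."""
    for req_id, tokens in batch:
        kv_cache.setdefault(req_id, []).extend(tokens)
    return kv_cache

cache = {}
run_batch([("A", [1, 2]), ("B", [7])], cache)       # one batched step
run_batch([("A", [3]), ("B", [8, 9])], cache)       # next step continues each sequence
print(cache)  # {'A': [1, 2, 3], 'B': [7, 8, 9]}
```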

Would you like more detail on how vLLM manages token separation in batches?
