vLLM L40S quantization optimization

Ranjith · May 29, 2026, 7:39am

@RunLLM
One key observation from our study is regarding model quantization. Most of the original model weights are in BF16 format, but to serve these models efficiently on our 48GB L40S GPUs, we need to use quantized variants such as AWQ (4-bit quantization) or FP8 (8-bit quantization).

From our analysis:

We also observed that FP8 quantization is not yet fully optimized for L40S GPUs, whereas AWQ appears to be better optimized. Because of this:

AWQ consumes lesser vRAM and model loading completes in around 30 seconds.
FP8 consumes comparatively higher vRAM and model loading takes close to 2 minutes.

However, in terms of output quality/reasoning performance, FP8 responses are noticeably better than AWQ. This also indicates that the original BF16 weights would likely perform significantly better than any quantized version.

I was also exploring this direction further to understand the trade-off between model quality, vRAM utilization, and model loading latency.

Could you confirm this

RunLLM · May 29, 2026, 7:40am

Your observations are correct: FP8 quantization in vLLM is not yet fully optimized for L40S (Ada Lovelace, SM 8.9) GPUs, leading to higher VRAM usage and longer model loading times compared to AWQ. AWQ is more memory-efficient and loads faster, but FP8 generally preserves model quality better, with output quality closer to original BF16 weights. BF16 will always provide the best quality, but at the cost of much higher VRAM usage and slower loading times. These trade-offs are well-documented in the vLLM quantization docs and user reports (docs: FP8, docs: AWQ, user report, user report).

Would you like more detail on the technical reasons for these differences, or guidance on choosing between AWQ and FP8 for your use case?

Sources:

Hint: Mention @RunLLM in the post for followups.

Ranjith · May 29, 2026, 11:52am

@RunLLM Sometimes the LLM requests are getting timed out why? The time out set from client side is 120 secs, I served Qwen3 VL 30B FP8 on L40S

RunLLM · May 29, 2026, 11:53am

LLM requests may time out if the server takes longer than the client’s 120-second timeout to process a request. For Qwen3 VL 30B FP8 on L40S, this can happen due to high VRAM usage, large prompt/context length, high concurrency, or suboptimal FP8 kernel performance on L40S. FP8 quantization is not fully optimized for L40S, leading to slower inference and longer loading times, which increases the risk of timeouts (user report, docs: FP8).

You can mitigate this by reducing max_model_len, max_num_seqs, or batch size, or by increasing the client timeout. Would you like more detail on tuning these parameters?

Sources:

Hint: Mention @RunLLM in the post for followups.

Ranjith · May 29, 2026, 11:53am

@RunLLM the GPU used was A6000 not L40S

RunLLM · May 29, 2026, 11:54am

Timeouts can still occur on A6000 GPUs for Qwen3 VL 30B FP8 due to similar reasons: FP8 quantization is not fully optimized for Ampere (A6000), leading to higher VRAM usage and slower inference, especially with large prompts or high concurrency (user report, docs: FP8). If requests exceed the client’s 120-second timeout, they will fail.

Reducing max_model_len, max_num_seqs, or batch size, or increasing the client timeout, can help. Would you like more detail on specific tuning strategies?

Sources:

Hint: Mention @RunLLM in the post for followups.

Ranjith · May 29, 2026, 12:07pm

@RunLLM But with same long context some of the requests get processed why suddenly things stop working?

RunLLM · May 29, 2026, 12:08pm

This behavior is often due to GPU KV cache exhaustion or fragmentation: as more long-context requests are processed, the available KV cache fills up, and once it nears 100%, new requests may get stuck in the waiting queue or never get scheduled, causing the server to stop responding until restarted (issue example, issue example). Earlier requests may succeed if there is enough free KV cache, but as usage accumulates, the system can deadlock or stall.

Would you like more detail on how to monitor or tune KV cache usage to avoid this?

Sources:

Hint: Mention @RunLLM in the post for followups.

Ranjith · May 29, 2026, 12:09pm

@RunLLM Yeah I want to tune the KV Cache usage how can that be done? Don’t the KV Cache get cleared once the request is completed?

RunLLM · May 29, 2026, 12:10pm

You can tune KV cache usage in vLLM by adjusting --gpu-memory-utilization (default 0.92) or explicitly setting --kv-cache-memory-bytes to control the amount of GPU memory reserved for KV cache (docs, API reference). KV cache blocks are freed when a request completes, but if the cache is full and blocks are still referenced (e.g., due to prefix caching or concurrent requests), eviction may not occur immediately (design doc).

Would you like step-by-step guidance on setting these parameters or monitoring cache usage?

Sources:

Hint: Mention @RunLLM in the post for followups.

Ranjith · May 29, 2026, 12:13pm

@RunLLM Could you let me know how to monitor the cache usage?

RunLLM · May 29, 2026, 12:14pm

You can monitor KV cache usage in vLLM using the Prometheus metric vllm:kv_cache_usage_perc, which shows the fraction of used KV cache blocks (0–1). This metric is available at the /metrics endpoint and can be visualized in Grafana (metrics doc, metrics overview).

Would you like more detail on setting up Prometheus/Grafana or interpreting these metrics?

Sources:

Hint: Mention @RunLLM in the post for followups.

Ranjith · May 29, 2026, 12:18pm

# HELP vllm:kv_cache_usage_perc KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:kv_cache_usage_perc gauge
vllm:kv_cache_usage_perc{engine="0",model_name="Agent-Model"} 0.0
# HELP vllm:prefix_cache_queries_total Prefix cache queries, in terms of number of queried tokens.
# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{engine="0",model_name="Agent-Model"} 131450.0
# HELP vllm:prefix_cache_queries_created Prefix cache queries, in terms of number of queried tokens.
# TYPE vllm:prefix_cache_queries_created gauge
vllm:prefix_cache_queries_created{engine="0",model_name="Agent-Model"} 1.7800383546102061e+09
# HELP vllm:prefix_cache_hits_total Prefix cache hits, in terms of number of cached tokens.
# TYPE vllm:prefix_cache_hits_total counter
vllm:prefix_cache_hits_total{engine="0",model_name="Agent-Model"} 80320.0
# HELP vllm:prefix_cache_hits_created Prefix cache hits, in terms of number of cached tokens.
# TYPE vllm:prefix_cache_hits_created gauge
vllm:prefix_cache_hits_created{engine="0",model_name="Agent-Model"} 1.7800383546102252e+09

@RunLLM What does Prefix cache mean and how to clear that?

RunLLM · May 29, 2026, 12:18pm

Prefix cache in vLLM refers to the caching of KV (key-value) blocks for prompt prefixes, allowing future requests with the same prefix to reuse cached computations and reduce latency (design doc). There is no built-in command to manually clear the prefix cache; it is managed automatically by vLLM and evicted based on LRU or memory pressure (metrics doc).

Would you like more detail on how prefix caching works or how to tune its behavior?

Sources:

docs: Prefix caching design

Hint: Mention @RunLLM in the post for followups.

Ranjith · May 29, 2026, 12:19pm

@RunLLM Yeah tell me how to tune its behaviour

RunLLM · May 29, 2026, 12:19pm

You can tune prefix cache behavior in vLLM by adjusting the amount of GPU memory allocated for KV cache (using --gpu-memory-utilization or --kv-cache-memory-bytes), which indirectly controls how much prefix cache can be retained before eviction (docs). The eviction policy is LRU (least recently used) by default and cannot be changed to LFU or other policies as of now (design doc).

If you want to isolate cache between users or sessions, you can use the cache_salt parameter in your request to prevent unwanted cache sharing (design doc: cache isolation). Would you like more detail on these parameters or example usage?

Sources:

design doc

Hint: Mention @RunLLM in the post for followups.

Ranjith · May 29, 2026, 12:24pm

@RunLLM Does this use the same KV cache used at inference? Because from my calculation the KV Cache size for 80320 tokens is nearly 7.5GB will this lead to timeout?

RunLLM · May 29, 2026, 12:25pm

Yes, prefix cache uses the same KV cache memory as inference in vLLM (design doc). If the KV cache fills up (e.g., 80,320 tokens ≈ 7.5GB for a large model), new requests may be delayed or blocked, potentially leading to timeouts, especially with long contexts or high concurrency (metrics doc).

Would you like more detail on how to estimate KV cache requirements or avoid timeouts?

Sources:

docs: Prefix caching design

Hint: Mention @RunLLM in the post for followups.

Ranjith · May 29, 2026, 12:25pm

@RunLLM Yeah give more details

RunLLM · May 29, 2026, 12:26pm

Certainly! The KV cache in vLLM stores the intermediate key and value tensors for each token, layer, and attention head during inference. Prefix caching reuses these KV blocks for repeated prompt prefixes, but both inference and prefix cache share the same GPU memory pool (design doc). If the total number of cached tokens (from active and cached prefixes) approaches the KV cache capacity, vLLM will evict the least recently used blocks, but if the cache is full and cannot evict (e.g., due to ongoing requests or pinned blocks), new requests may be delayed or time out (metrics doc).

To estimate KV cache requirements, use the formula:

KV cache size ≈ num_tokens × num_layers × num_heads × head_dim × 2 (K+V) × dtype_size

For example, a 30B model with 80,320 tokens, 48 layers, 64 heads, head_dim 128, and fp16 (2 bytes) would need:

80,320 × 48 × 64 × 128 × 2 × 2 bytes ≈ 8.1 GB

If your workload often fills the cache, reduce max_model_len, batch size, or concurrency, or increase GPU memory allocation for KV cache to avoid timeouts.

Sources:

Hint: Mention @RunLLM in the post for followups.

Topic		Replies	Views
Suggestion to improve inferencing speed Model Support	17	1159	March 11, 2026
Support for (sparse) key value caching KV-Cache	16	876	May 3, 2025
VLLM 0.10.1 benchmark do not free memory Benchmarking	13	288	November 10, 2025
vLLM v1 forces me to pre-allocate a huge, non-reclaimable GPU KV cache for long contexts, and none of the current offload or quantization options solve the resulting VRAM bloat without crippling speed. V1 Feedback	5	1011	September 8, 2025
LLM memory caching General	7	273	May 7, 2026

vLLM L40S quantization optimization

Related topics