@RunLLM
One key observation from our study is regarding model quantization. Most of the original model weights are in BF16 format, but to serve these models efficiently on our 48GB L40S GPUs, we need to use quantized variants such as AWQ (4-bit quantization) or FP8 (8-bit quantization).
From our analysis:
We also observed that FP8 quantization is not yet fully optimized for L40S GPUs, whereas AWQ appears to be better optimized. Because of this:
AWQ consumes lesser vRAM and model loading completes in around 30 seconds.
FP8 consumes comparatively higher vRAM and model loading takes close to 2 minutes.
However, in terms of output quality/reasoning performance, FP8 responses are noticeably better than AWQ. This also indicates that the original BF16 weights would likely perform significantly better than any quantized version.
I was also exploring this direction further to understand the trade-off between model quality, vRAM utilization, and model loading latency.
Your observations are correct: FP8 quantization in vLLM is not yet fully optimized for L40S (Ada Lovelace, SM 8.9) GPUs, leading to higher VRAM usage and longer model loading times compared to AWQ. AWQ is more memory-efficient and loads faster, but FP8 generally preserves model quality better, with output quality closer to original BF16 weights. BF16 will always provide the best quality, but at the cost of much higher VRAM usage and slower loading times. These trade-offs are well-documented in the vLLM quantization docs and user reports (docs: FP8, docs: AWQ, user report, user report).
Would you like more detail on the technical reasons for these differences, or guidance on choosing between AWQ and FP8 for your use case?
LLM requests may time out if the server takes longer than the client’s 120-second timeout to process a request. For Qwen3 VL 30B FP8 on L40S, this can happen due to high VRAM usage, large prompt/context length, high concurrency, or suboptimal FP8 kernel performance on L40S. FP8 quantization is not fully optimized for L40S, leading to slower inference and longer loading times, which increases the risk of timeouts (user report, docs: FP8).
You can mitigate this by reducing max_model_len, max_num_seqs, or batch size, or by increasing the client timeout. Would you like more detail on tuning these parameters?
Timeouts can still occur on A6000 GPUs for Qwen3 VL 30B FP8 due to similar reasons: FP8 quantization is not fully optimized for Ampere (A6000), leading to higher VRAM usage and slower inference, especially with large prompts or high concurrency (user report, docs: FP8). If requests exceed the client’s 120-second timeout, they will fail.
Reducing max_model_len, max_num_seqs, or batch size, or increasing the client timeout, can help. Would you like more detail on specific tuning strategies?
This behavior is often due to GPU KV cache exhaustion or fragmentation: as more long-context requests are processed, the available KV cache fills up, and once it nears 100%, new requests may get stuck in the waiting queue or never get scheduled, causing the server to stop responding until restarted (issue example, issue example). Earlier requests may succeed if there is enough free KV cache, but as usage accumulates, the system can deadlock or stall.
Would you like more detail on how to monitor or tune KV cache usage to avoid this?
You can tune KV cache usage in vLLM by adjusting --gpu-memory-utilization (default 0.92) or explicitly setting --kv-cache-memory-bytes to control the amount of GPU memory reserved for KV cache (docs, API reference). KV cache blocks are freed when a request completes, but if the cache is full and blocks are still referenced (e.g., due to prefix caching or concurrent requests), eviction may not occur immediately (design doc).
Would you like step-by-step guidance on setting these parameters or monitoring cache usage?
You can monitor KV cache usage in vLLM using the Prometheus metric vllm:kv_cache_usage_perc, which shows the fraction of used KV cache blocks (0–1). This metric is available at the /metrics endpoint and can be visualized in Grafana (metrics doc, metrics overview).
Would you like more detail on setting up Prometheus/Grafana or interpreting these metrics?
# HELP vllm:kv_cache_usage_perc KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:kv_cache_usage_perc gauge
vllm:kv_cache_usage_perc{engine="0",model_name="Agent-Model"} 0.0
# HELP vllm:prefix_cache_queries_total Prefix cache queries, in terms of number of queried tokens.
# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{engine="0",model_name="Agent-Model"} 131450.0
# HELP vllm:prefix_cache_queries_created Prefix cache queries, in terms of number of queried tokens.
# TYPE vllm:prefix_cache_queries_created gauge
vllm:prefix_cache_queries_created{engine="0",model_name="Agent-Model"} 1.7800383546102061e+09
# HELP vllm:prefix_cache_hits_total Prefix cache hits, in terms of number of cached tokens.
# TYPE vllm:prefix_cache_hits_total counter
vllm:prefix_cache_hits_total{engine="0",model_name="Agent-Model"} 80320.0
# HELP vllm:prefix_cache_hits_created Prefix cache hits, in terms of number of cached tokens.
# TYPE vllm:prefix_cache_hits_created gauge
vllm:prefix_cache_hits_created{engine="0",model_name="Agent-Model"} 1.7800383546102252e+09
@RunLLM What does Prefix cache mean and how to clear that?
Prefix cache in vLLM refers to the caching of KV (key-value) blocks for prompt prefixes, allowing future requests with the same prefix to reuse cached computations and reduce latency (design doc). There is no built-in command to manually clear the prefix cache; it is managed automatically by vLLM and evicted based on LRU or memory pressure (metrics doc).
Would you like more detail on how prefix caching works or how to tune its behavior?
You can tune prefix cache behavior in vLLM by adjusting the amount of GPU memory allocated for KV cache (using --gpu-memory-utilization or --kv-cache-memory-bytes), which indirectly controls how much prefix cache can be retained before eviction (docs). The eviction policy is LRU (least recently used) by default and cannot be changed to LFU or other policies as of now (design doc).
If you want to isolate cache between users or sessions, you can use the cache_salt parameter in your request to prevent unwanted cache sharing (design doc: cache isolation). Would you like more detail on these parameters or example usage?
@RunLLM Does this use the same KV cache used at inference? Because from my calculation the KV Cache size for 80320 tokens is nearly 7.5GB will this lead to timeout?
Yes, prefix cache uses the same KV cache memory as inference in vLLM (design doc). If the KV cache fills up (e.g., 80,320 tokens ≈ 7.5GB for a large model), new requests may be delayed or blocked, potentially leading to timeouts, especially with long contexts or high concurrency (metrics doc).
Would you like more detail on how to estimate KV cache requirements or avoid timeouts?
Certainly! The KV cache in vLLM stores the intermediate key and value tensors for each token, layer, and attention head during inference. Prefix caching reuses these KV blocks for repeated prompt prefixes, but both inference and prefix cache share the same GPU memory pool (design doc). If the total number of cached tokens (from active and cached prefixes) approaches the KV cache capacity, vLLM will evict the least recently used blocks, but if the cache is full and cannot evict (e.g., due to ongoing requests or pinned blocks), new requests may be delayed or time out (metrics doc).
To estimate KV cache requirements, use the formula:
For example, a 30B model with 80,320 tokens, 48 layers, 64 heads, head_dim 128, and fp16 (2 bytes) would need:
80,320 × 48 × 64 × 128 × 2 × 2 bytes ≈ 8.1 GB
If your workload often fills the cache, reduce max_model_len, batch size, or concurrency, or increase GPU memory allocation for KV cache to avoid timeouts.