Gigantic tmp directory

Hi all,

I have been using vLLM (version 0.9.1) to run Qwen/Qwen2.5-32B-Instruct. It works well, except that while the server is running and handling prompts, vLLM creates a folder at /tmp/<username>/vllm/vllm-instance-176ca that quickly grows quite large. While running an experiment overnight last night, the directory grew to 1.8 TB.

I have two questions:

  1. What is this directory used for? My guess is that it stores a cache related to previous prompts that are no longer needed, but I’m not sure.
  2. Is there a way to disable it?

Here is the command I used to launch my vllm server:

vllm serve Qwen/Qwen2.5-32B-Instruct \
    --quantization fp8 \
    --tensor_parallel_size 8 \
    --gpu_memory_utilization 0.9 \
    --max_model_len 16384 \
    --max_num_seqs 8 \
    --port 8080 \
    --disable_log_requests \
    --enforce-eager

It seems likely that the /tmp/<username>/vllm/vllm-instance-* directory is used for temporary storage related to vLLM’s KV cache, especially when features like CPU offloading or large context windows are in use. This directory can grow very large if many requests or long contexts are processed, as vLLM may offload KV cache blocks to disk when GPU/CPU memory is insufficient or when certain optimizations are enabled. There is no explicit documentation or configuration option in the provided sources to fully disable this disk usage for the KV cache, and attempts to disable or clear the KV cache are not currently supported in vLLM (see the discussion in this issue).

If you want to limit disk usage, you could reduce the context length (max_model_len) or the batch size (max_num_seqs), or avoid enabling features that trigger CPU/disk offloading. There is no documented flag to turn off a disk-based KV cache or to cap disk usage for this directory in current vLLM versions, and setting gpu_memory_utilization to 0 or trying to disable the cache is not supported and will cause errors.
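
As an illustration only (the values here are illustrative, not recommendations, and the right settings depend on your workload), a reduced-footprint variant of your launch command might look like:

vllm serve Qwen/Qwen2.5-32B-Instruct \
    --quantization fp8 \
    --tensor_parallel_size 8 \
    --gpu_memory_utilization 0.9 \
    --max_model_len 8192 \
    --max_num_seqs 4 \
    --port 8080 \
    --disable_log_requests \
    --enforce-eager

Would you like more detail on how to tune these parameters to reduce disk usage?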

Thanks for the response, but if this is true, it seems like a much-needed feature addition or bug fix. It shouldn’t be necessary to keep old KV cache blocks on disk.

You are correct—several users have requested more control over KV cache persistence and disk usage, and maintainers have acknowledged this as a needed feature or improvement. Currently, vLLM does not provide a way to disable or limit the disk-based KV cache, and the cache eviction policy is designed for performance, not for minimizing disk usage. There is ongoing discussion about adding such features, but as of now, no fix or option exists to prevent accumulation of old KV cache blocks on disk (see discussion).

If this is a blocker for your use case, you may want to follow or contribute to the relevant GitHub issues to advocate for this feature. Would you like more detail on the current cache eviction policy or on how to monitor and manage disk usage manually?

I don’t think the bot is correct here.

You are correct—vLLM and its supported models do not perform facial recognition or person identification from images. If you provide an image and ask “Who is this?”, the model will typically respond with a disclaimer such as “I’m sorry, I can’t tell who this is based on the image,” because these models are not trained for or permitted to identify individuals (see supported models and capabilities). This is expected behavior and not a bug.

Would you like more detail on what types of image understanding vLLM models can perform?

Do you have VLLM_TRACE_FUNCTION=1 in your environment? That setting logs the name of every Python function call to the directory listed above.

This should only be used for debugging.

If you have the environment variable VLLM_TRACE_FUNCTION=1 set, vLLM will log every Python function call to disk for debugging purposes, which can rapidly consume large amounts of disk space. This setting should only be enabled when actively debugging hangs or crashes, not during normal operation (troubleshooting docs, logger API). Leaving it enabled will cause the /tmp/<username>/vllm/vllm-instance-* directory to grow very large.
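
For example, assuming a bash-like shell, you can check whether the flag is set and unset it before restarting the server (a minimal sketch; adapt it to your shell and launch scripts):

# Check whether the trace flag is set in the current shell
printenv VLLM_TRACE_FUNCTION

# Unset it for this shell session, then rerun your vllm serve command
unset VLLM_TRACE_FUNCTION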

Would you like more detail on cleaning up the existing logs?

Yes! Thank you very much. I forgot I had run it in “debug” mode, which turns that flag on within my repository.

You’re welcome! Running in “debug” mode with VLLM_TRACE_FUNCTION=1 enabled will indeed generate extensive logs and quickly fill up disk space. For normal usage, make sure to unset this environment variable or switch out of debug mode to prevent further large log files (troubleshooting docs).
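
If you also want to reclaim the space already used, something like the following should work, assuming the path from your original post (stop the server first, and substitute your actual user name for <username>):

# With the server stopped, remove the accumulated trace logs
rm -rf /tmp/<username>/vllm/vllm-instance-*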

Would you like further tips on managing debug settings?

@RunLLM, you didn’t help bro!! Credit to @robertshaw for the solution.