Hi! While running zai-org/GLM-4.5V with vLLM 0.10.*, I ran into a problem: GPU memory fills up rapidly during inference. I am using 8 A100s with gpu_memory_utilization=0.5 and max_seq_len=1, but the memory still fills up. Of everything I tried, the most effective fix for stability, though not for speed, was setting VLLM_USE_V1=0. Memory still fills up with that flag, but there is no CUDA out-of-memory error. Is there a known way to overcome this problem without using VLLM_USE_V1=0? My current script:
vLLM V1 (VLLM_USE_V1=1) is known to use more GPU memory than V0 because of CUDA graph capture and a larger KV cache allocation, especially with large models like GLM-4.5V, even when gpu_memory_utilization is set low and max_seq_len is minimal. This can lead to rapid memory consumption and OOM errors. Disabling V1 (VLLM_USE_V1=0) improves stability at the cost of speed, as you observed. Within V1, the only available mitigations are reducing max_num_seqs or max_model_len, or setting enforce_eager=True to disable CUDA graphs, and each of these also hurts performance. They may help, but memory usage will still be higher than in V0. This is a known issue, and for now the only reliable route to stability is V0, as reported by multiple users and maintainers (source, source, source, source).
If you want to experiment, try lowering max_num_seqs, max_model_len, and/or set enforce_eager=True in your config to reduce memory usage, but expect a trade-off in throughput and latency. There is no parameter or flag in the data sources that fully resolves the V1 memory spike for large models without these trade-offs. Would you like a step-by-step breakdown or code example for these adjustments?
Can you elaborate on what you mean by “clogging up memory”? You can try disabling CUDA graph as suggested by RunLLM first.
“Clogging up memory” refers to vLLM rapidly allocating most or all available GPU memory, often leading to out-of-memory (OOM) errors or leaving little room for other processes. This is common in vLLM V1 due to aggressive KV cache allocation and CUDA graph compilation, which both consume significant memory—even with low gpu_memory_utilization and minimal max_seq_len (source, source).
Disabling CUDA graph (by setting enforce_eager=True in your LLM config or using the --enforce-eager CLI flag) can reduce memory usage, as CUDA graph compilation is a major contributor to the memory spike in V1 (source, source). Would you like a code example for disabling CUDA graph or further details on memory allocation in vLLM?
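To make "clogging up" concrete, it can help to report per-GPU numbers before and after the model loads. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the helper names parse_gpu_memory and read_gpu_memory are hypothetical and not part of vLLM.

```python
import subprocess

# Standard nvidia-smi query: one CSV row per GPU, values in MiB, no header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse rows like '0, 40960, 81920' into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used_mib, total_mib = (int(field.strip()) for field in line.split(","))
        rows.append(
            {
                "gpu": index,
                "used_mib": used_mib,
                "total_mib": total_mib,
                "used_fraction": used_mib / total_mib,
            }
        )
    return rows


def read_gpu_memory():
    """Run nvidia-smi and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling read_gpu_memory() once right after the server finishes loading and again during inference shows whether usage stays near the gpu_memory_utilization fraction or keeps growing past it.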
The 40GB free space on each GPU will be completely used up on the inference after loading the model.
“Clogging up memory” here means that after loading the model, vLLM quickly allocates all available GPU memory (e.g., the full 40GB on each A100), leaving no free space for other processes or additional inference tasks. This is due to vLLM’s static allocation of memory for the model weights, KV cache, and (in V1) CUDA graph compilation, which can consume nearly all GPU resources regardless of actual inference needs (source, source).
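The static allocation described above can be sketched as simple arithmetic. This is a rough model, not vLLM's actual profiling code: it assumes the per-GPU budget is total memory times gpu_memory_utilization, and that whatever remains after weights and peak activations is handed to the KV cache. The function name and all numbers are hypothetical.

```python
def kv_cache_budget_gib(total_gib, gpu_memory_utilization, weights_gib, activation_peak_gib):
    """Estimate the per-GPU memory left for the KV cache, in GiB.

    Rough model (an assumption, not vLLM's exact accounting):
    budget = total * gpu_memory_utilization, and the KV cache gets
    whatever the budget has left after weights and peak activations.
    """
    budget_gib = total_gib * gpu_memory_utilization
    kv_gib = budget_gib - weights_gib - activation_peak_gib
    if kv_gib <= 0:
        raise ValueError("budget too small for weights and activations")
    return kv_gib


# Hypothetical numbers: an 80 GiB A100 at 0.5 utilization, with a 28 GiB
# weight shard and a 4 GiB activation peak on this GPU.
print(kv_cache_budget_gib(80.0, 0.5, 28.0, 4.0))
```

Under this model the whole budget is reserved up front regardless of how many requests are in flight, so seeing usage sit at the budget is expected. Growth beyond the budget is the part worth reporting.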
Disabling CUDA graph (by setting enforce_eager=True) can reduce this memory pressure, as CUDA graph compilation is a major source of extra allocation in V1. Would you like a code example for this adjustment?
What parameters did you pass when using V1 Engine?
The main parameters affecting memory usage in vLLM V1 Engine are gpu_memory_utilization, max_model_len, max_num_seqs, and optionally enforce_eager. For example, you might use:
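from vllm import LLM

llm = LLM(
    model="your-model",
    gpu_memory_utilization=0.5,  # fraction of each GPU's memory vLLM may use
    max_model_len=1,             # maximum context length in tokens
    max_num_seqs=1,              # maximum number of sequences batched per step
    enforce_eager=True,          # disable CUDA graphs to reduce memory usage
)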
or via CLI:
python -m vllm.entrypoints.openai.api_server --model your-model --gpu-memory-utilization 0.5 --max-model-len 1 --max-num-seqs 1 --enforce-eager
These parameters control how much GPU memory is pre-allocated and how large the context/batch can be (source, source). Would you like more details on each parameter?
--tensor-parallel-size 4
--pipeline-parallel-size 2
--gpu-memory-utilization 0.5
--swap-space 32
--max-num-seqs 1
--disable-mm-preprocessor-cache
--enable-expert-parallel
--enforce-eager (I'm trying it now)
Your parameters are valid and align with vLLM’s documented options for controlling memory and parallelism. Using --enforce-eager disables CUDA graph, which should help reduce memory usage in V1, though it may lower throughput (source). The other flags (--tensor-parallel-size, --pipeline-parallel-size, --gpu-memory-utilization, --swap-space, --enable-expert-parallel) are all supported and affect how the model is distributed and how much memory is reserved (source).
If memory is still fully consumed, further reduction of --max-model-len or --max-num-seqs may be necessary. Would you like a breakdown of how each parameter impacts memory?
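When experimenting with these flags one at a time, it can be convenient to build the launch command from a dictionary so each run differs by a single value. The following is a minimal sketch reusing the flags from this thread; the helper build_server_command is hypothetical, and the --max-model-len value of 8192 is an assumed placeholder, not a recommendation.

```python
import shlex


def build_server_command(model, flags):
    """Turn a dict of options into an argv list for the OpenAI-compatible server."""
    argv = ["python", "-m", "vllm.entrypoints.openai.api_server", "--model", model]
    for name, value in flags.items():
        option = "--" + name.replace("_", "-")
        if value is True:
            argv.append(option)  # boolean switch, e.g. --enforce-eager
        elif value is False or value is None:
            continue  # omit disabled switches entirely
        else:
            argv.extend([option, str(value)])
    return argv


flags = {
    "tensor_parallel_size": 4,
    "pipeline_parallel_size": 2,
    "gpu_memory_utilization": 0.5,
    "max_num_seqs": 1,
    "max_model_len": 8192,  # assumed placeholder; lower it if memory is still exhausted
    "enforce_eager": True,
}

command = build_server_command("zai-org/GLM-4.5V", flags)
print(shlex.join(command))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run directly. Changing one dictionary entry per run makes it easier to see which setting actually moves memory usage.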
