Multimodal inference guideline?

brechtBDCK · June 25, 2025, 7:14am

nvidia-smi shows both GPUs as visible (2 × 24 GB). However, I get the following errors:

(VllmWorkerProcess pid=723155) ERROR 06-24 19:28:36 [multiproc_worker_utils.py:239] RuntimeError: CUDA error: out of memory
(VllmWorkerProcess pid=723155) ERROR 06-24 19:28:36 [multiproc_worker_utils.py:239] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=723155) ERROR 06-24 19:28:36 [multiproc_worker_utils.py:239] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorkerProcess pid=723155) ERROR 06-24 19:28:36 [multiproc_worker_utils.py:239] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(VllmWorkerProcess pid=723155) ERROR 06-24 19:28:36 [multiproc_worker_utils.py:239]
Failed to evaluate AIDC-AI/Ovis2-1B: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

(VllmWorkerProcess pid=726023) INFO 06-24 19:30:54 [cuda.py:275] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=726023) INFO 06-24 19:30:54 [cuda.py:324] Using XFormers backend.
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239] Exception in worker VllmWorkerProcess while processing method init_device.
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239] Traceback (most recent call last):
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239] File “/home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py”, line 233, in _run_worker_process
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239] output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239] File “/home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/vllm/utils.py”, line 2671, in run_method
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239] return func(*args, **kwargs)
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239] File “/home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/vllm/worker/worker_base.py”, line 606, in init_device
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239] self.worker.init_device() # type: ignore
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239] File “/home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/vllm/worker/worker.py”, line 182, in init_device
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239] torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239] File “/home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/torch/cuda/__init__.py”, line 529, in set_device
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239] torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239] RuntimeError: CUDA error: out of memory
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(VllmWorkerProcess pid=726023) ERROR 06-24 19:30:55 [multiproc_worker_utils.py:239]
Failed to evaluate Qwen/Qwen2.5-VL-32B-Instruct: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

ERROR 06-24 19:30:55 [multiproc_worker_utils.py:121] Worker VllmWorkerProcess pid 596088 died, exit code: -15
INFO 06-24 19:30:55 [multiproc_worker_utils.py:125] Killing local vLLM worker processes
ERROR 06-24 19:30:55 [multiproc_worker_utils.py:121] Worker VllmWorkerProcess pid 643798 died, exit code: -15
INFO 06-24 19:30:55 [multiproc_worker_utils.py:125] Killing local vLLM worker processes
ERROR 06-24 19:30:55 [multiproc_worker_utils.py:121] Worker VllmWorkerProcess pid 691188 died, exit code: -15
INFO 06-24 19:30:55 [multiproc_worker_utils.py:125] Killing local vLLM worker processes
ERROR 06-24 19:30:56 [multiproc_worker_utils.py:121] Worker VllmWorkerProcess pid 572763 died, exit code: -15
INFO 06-24 19:30:56 [multiproc_worker_utils.py:125] Killing local vLLM worker processes
INFO 06-24 19:30:56 [multiproc_worker_utils.py:125] Killing local vLLM worker processes
ERROR 06-24 19:30:56 [multiproc_worker_utils.py:121] Worker VllmWorkerProcess pid 572244 died, exit code: -15
INFO 06-24 19:30:56 [multiproc_worker_utils.py:125] Killing local vLLM worker processes
[rank0]:[W624 19:30:59.935167688 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info,
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d ’

RunLLM · June 25, 2025, 7:15am

You are getting CUDA out-of-memory errors even with two 24GB GPUs for a few likely reasons:

The model may be too large for the combined VRAM, especially a 32B model such as Qwen2.5-VL-32B-Instruct. Even with tensor parallelism, each GPU must hold its shard of the weights plus its share of the KV cache.
vLLM pre-allocates most GPU memory for the model weights and the KV cache. If max_model_len or gpu_memory_utilization is set too high, or the batch size is large, you can easily run out of memory (see the vLLM docs on conserving memory).
The OOM for the small AIDC-AI/Ovis2-1B model, and the failure inside torch.cuda.set_device in the second run, suggest that GPU memory was already occupied before the model started loading, for example by worker processes left over from a previous model.

What to do:

Lower max_model_len (e.g., 1024 or 2048).
Set max_num_seqs=1 to minimize batch size.
Reduce gpu_memory_utilization (e.g., 0.7 or 0.8).
Try a smaller model (e.g., 7B or 13B) to confirm your setup works.
Make sure no other processes are using GPU memory (nvidia-smi).
If using quantized models, ensure you set the correct quantization flag.
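
Because the log shows an OOM even for a 1B model, it may be worth confirming that both GPUs are actually free before loading anything. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the helper names parse_gpu_memory and query_gpu_memory are hypothetical, not part of vLLM:

```python
import subprocess

# nvidia-smi prints one CSV row per GPU: "index, used_MiB, total_MiB".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_gpu_memory(csv_text):
    """Turn nvidia-smi CSV rows into a list of per-GPU dicts (values in MiB)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({"index": index, "used_mib": used, "free_mib": total - used})
    return gpus

def query_gpu_memory():
    """Run nvidia-smi and return per-GPU usage (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)
```

Calling query_gpu_memory() right before each model is created shows whether an earlier run still holds memory. Parsing is kept separate from the nvidia-smi call so it can be checked without a GPU.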

Example:

from vllm import LLM

# Conservative settings; a 32B model in float16 may still not fit on 2×24GB.
llm = LLM(
    model="Qwen/Qwen2.5-VL-32B-Instruct",
    tensor_parallel_size=2,
    max_model_len=1024,
    max_num_seqs=1,
    gpu_memory_utilization=0.7,
)

Would you like more detail on how to estimate the maximum model size for your GPUs?

brechtBDCK · June 25, 2025, 12:13pm

I still get the following errors even though I have plenty of VRAM:

Evaluating OpenGVLab/InternVL3-14B…
INFO 06-25 09:29:57 [config.py:823] This model supports multiple tasks: {‘reward’, ‘generate’, ‘classify’, ‘embed’, ‘score’}. Defaulting to ‘generate’.
WARNING 06-25 09:29:57 [config.py:3220] Your device ‘Quadro RTX 6000’ (with compute capability 7.5) doesn’t support torch.bfloat16. Falling back to torch.float16 for compatibility.
WARNING 06-25 09:29:57 [config.py:3271] Casting torch.bfloat16 to torch.float16.
WARNING 06-25 09:29:57 [cuda.py:91] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 06-25 09:29:57 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.1) with config: model=‘OpenGVLab/InternVL3-14B’, speculative_config=None, tokenizer=‘OpenGVLab/InternVL3-14B’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=12288, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend=‘auto’, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=‘’), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=OpenGVLab/InternVL3-14B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={“level”:0,“debug_dump_path”:“”,“cache_dir”:“”,“backend”:“”,“custom_ops”:,“splitting_ops”:,“use_inductor”:true,“compile_sizes”:,“inductor_compile_config”:{“enable_auto_functionalized_v2”:false},“inductor_passes”:{},“use_cudagraph”:true,“cudagraph_num_of_warmups”:0,“cudagraph_capture_sizes”:,“cudagraph_copy_inputs”:false,“full_cuda_graph”:false,“max_capture_size”:0,“local_cache_dir”:null}, use_cached_outputs=False,
INFO 06-25 09:29:58 [model_runner.py:1171] Starting to load model OpenGVLab/InternVL3-14B…
Failed to evaluate OpenGVLab/InternVL3-14B: CUDA out of memory. Tried to allocate 1.45 GiB. GPU 0 has a total capacity of 23.64 GiB of which 390.50 MiB is free. Including non-PyTorch memory, this process has 23.25 GiB memory in use. Of the allocated memory 22.87 GiB is allocated by PyTorch, and 189.38 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management

AND

Evaluating Qwen/Qwen2.5-VL-32B-Instruct…
INFO 06-25 09:32:31 [config.py:823] This model supports multiple tasks: {‘reward’, ‘generate’, ‘classify’, ‘embed’, ‘score’}. Defaulting to ‘generate’.
WARNING 06-25 09:32:31 [config.py:3220] Your device ‘Quadro RTX 6000’ (with compute capability 7.5) doesn’t support torch.bfloat16. Falling back to torch.float16 for compatibility.
WARNING 06-25 09:32:31 [config.py:3271] Casting torch.bfloat16 to torch.float16.
WARNING 06-25 09:32:31 [arg_utils.py:1490] The model has a long context length (128000). This may cause OOM during the initial memory profiling phase, or result in low performance due to small KV cache size. Consider setting --max-model-len to a smaller value.
WARNING 06-25 09:32:31 [cuda.py:91] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 06-25 09:32:31 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.1) with config: model=‘Qwen/Qwen2.5-VL-32B-Instruct’, speculative_config=None, tokenizer=‘Qwen/Qwen2.5-VL-32B-Instruct’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend=‘auto’, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=‘’), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=Qwen/Qwen2.5-VL-32B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={“level”:0,“debug_dump_path”:“”,“cache_dir”:“”,“backend”:“”,“custom_ops”:,“splitting_ops”:,“use_inductor”:true,“compile_sizes”:,“inductor_compile_config”:{“enable_auto_functionalized_v2”:false},“inductor_passes”:{},“use_cudagraph”:true,“cudagraph_num_of_warmups”:0,“cudagraph_capture_sizes”:,“cudagraph_copy_inputs”:false,“full_cuda_graph”:false,“max_capture_size”:0,“local_cache_dir”:null}, use_cached_outputs=False,
INFO 06-25 09:32:33 [model_runner.py:1171] Starting to load model Qwen/Qwen2.5-VL-32B-Instruct…
Failed to evaluate Qwen/Qwen2.5-VL-32B-Instruct: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 16.50 MiB is free. Including non-PyTorch memory, this process has 23.62 GiB memory in use. Of the allocated memory 23.21 GiB is allocated by PyTorch, and 212.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
[rank0]:[W625 09:32:34.771678379 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please seel#shutdown (function operator())

RunLLM · June 25, 2025, 12:14pm

You are running out of memory because neither InternVL3-14B nor Qwen2.5-VL-32B-Instruct fits on a single 24GB GPU, and your logs show tensor_parallel_size=1. In float16 the weights alone take about 2 bytes per parameter: roughly 28GB for a 14B model and 64GB for a 32B model in total, before counting the KV cache or activations (see the vLLM docs on conserving memory).
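
The weight-size arithmetic can be sketched as follows. This is a rough estimate only: it assumes nominal parameter counts, 2 bytes per parameter for float16, and an even split across GPUs under tensor parallelism. The function name is hypothetical.

```python
GIB = 1024 ** 3

def weight_gib_per_gpu(num_params, bytes_per_param=2, tensor_parallel_size=1):
    """Approximate weight memory per GPU in GiB, assuming an even split."""
    return num_params * bytes_per_param / tensor_parallel_size / GIB

for name, params in [("14B", 14e9), ("32B", 32e9)]:
    for tp in (1, 2):
        gib = weight_gib_per_gpu(params, 2, tp)
        print(f"{name}, float16, TP={tp}: ~{gib:.1f} GiB of weights per GPU")
```

Compared with the 23.64 GiB capacity reported in the log, this suggests the 14B model needs two GPUs, while the 32B model exceeds the per-GPU capacity even when split across two.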

Key points and solutions:

Your Quadro RTX 6000 (24GB) is not enough for these models with tensor_parallel_size=1. Use tensor parallelism (tensor_parallel_size=2) to split the weights across both GPUs. That should make room for the 14B model (about 14GB of weights per GPU), but 2×24GB is still too little for an unquantized 32B model (about 32GB of weights per GPU).
Your max_seq_len is very high (12,288 for InternVL3-14B and 128,000 for Qwen2.5-VL-32B-Instruct). vLLM has to profile and reserve KV cache for sequences of that length, which the log itself warns can cause OOM. Lower max_model_len to something like 1024–4096.
Always check the logs for actual memory allocation and warnings about context length or cache size.
If OOM persists, try a smaller variant (e.g., 7B) to confirm your setup works, or use a quantized checkpoint if one is available.
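
To see why the context length matters so much, here is a back-of-the-envelope KV cache estimate for a single sequence. The layer count, KV head count, and head dimension below are illustrative assumptions, not values read from either model's config; the function name is hypothetical.

```python
GIB = 1024 ** 3

def kv_cache_gib(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache for ONE sequence: keys and values (factor 2) for every layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return seq_len * per_token / GIB

# Illustrative architecture values (assumptions, check the model's config.json).
ARCH = dict(num_layers=64, num_kv_heads=8, head_dim=128)

for seq_len in (2048, 12288, 128000):
    gib = kv_cache_gib(seq_len, **ARCH)
    print(f"{seq_len:>6} tokens: ~{gib:.2f} GiB of KV cache per sequence")
```

Under these assumptions a 2048-token sequence needs about 0.5 GiB, while a 128,000-token sequence needs about 31 GiB, more than one 24GB card holds even before the weights are loaded. With tensor parallelism the KV heads are divided across GPUs, so the per-GPU share is smaller.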

Example for a large model (if you have 2 GPUs):

from vllm import LLM

# Splits the model across both GPUs and keeps the KV cache small.
llm = LLM(
    model="Qwen/Qwen2.5-VL-32B-Instruct",
    tensor_parallel_size=2,
    max_model_len=2048,
    gpu_memory_utilization=0.7,
    max_num_seqs=1,
)

Even with these settings, 2×24GB is unlikely to be enough for a 32B model in float16, since the weights alone are about 64GB.

Summary:

Lower max_model_len and gpu_memory_utilization
Use tensor_parallel_size=2
Try smaller models if OOM persists

Would you like a code example for a smaller model or more detail on memory estimation?

brechtBDCK · June 25, 2025, 12:36pm

Evaluating Salesforce/blip2-opt-6.7b…
INFO 06-25 14:20:27 [config.py:823] This model supports multiple tasks: {‘generate’, ‘embed’, ‘score’, ‘classify’, ‘reward’}. Defaulting to ‘generate’.
INFO 06-25 14:20:27 [config.py:3268] Downcasting torch.float32 to torch.float16.
INFO 06-25 14:20:27 [config.py:1946] Defaulting to use mp for distributed inference
WARNING 06-25 14:20:27 [cuda.py:91] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 06-25 14:20:27 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.1) with config: model=‘Salesforce/blip2-opt-6.7b’, speculative_config=None, tokenizer=‘Salesforce/blip2-opt-6.7b’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend=‘auto’, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=‘’), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=Salesforce/blip2-opt-6.7b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={“level”:0,“debug_dump_path”:“”,“cache_dir”:“”,“backend”:“”,“custom_ops”:,“splitting_ops”:,“use_inductor”:true,“compile_sizes”:,“inductor_compile_config”:{“enable_auto_functionalized_v2”:false},“inductor_passes”:{},“use_cudagraph”:true,“cudagraph_num_of_warmups”:0,“cudagraph_capture_sizes”:,“cudagraph_copy_inputs”:false,“full_cuda_graph”:false,“max_capture_size”:0,“local_cache_dir”:null}, use_cached_outputs=False,
WARNING 06-25 14:20:28 [utils.py:2597] We must use the spawn multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to ‘spawn’. See Troubleshooting - vLLM for more information. Reason: CUDA is initialized
INFO 06-25 14:20:28 [cuda.py:275] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-25 14:20:28 [cuda.py:324] Using XFormers backend.
WARNING 06-25 14:20:31 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: Report of increased memory overhead during cudagraph capture with nccl >= 2.19 · Issue #1234 · NVIDIA/nccl · GitHub
INFO 06-25 14:20:32 [__init__.py:244] Automatically detected platform cuda.
(VllmWorkerProcess pid=2064145) INFO 06-25 14:20:34 [multiproc_worker_utils.py:226] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2064145) INFO 06-25 14:20:35 [cuda.py:275] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=2064145) INFO 06-25 14:20:35 [cuda.py:324] Using XFormers backend.
[E625 14:30:07.542505700 socket.cpp:1019] [c10d] The client socket has timed out after 600000ms while trying to connect to (192.168.76.68, 54867).
[W625 14:30:07.543024895 TCPStore.cpp:343] [c10d] TCP client failed to connect/validate to host 192.168.76.68:54867 - retrying (try=0, timeout=600000ms, delay=84026ms): The client socket has timed out after 600000ms while trying to connect to (192.168.76.68, 54867).
Exception raised from throwTimeoutError at /pytorch/torch/csrc/distributed/c10d/socket.cpp:1021 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f09c5b785e8 in /home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x7f09ae636afe in /home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x1369113 (0x7f09a9df7113 in /home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5bf5691 (0x7f09ae683691 in /home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x5bf5849 (0x7f09ae683849 in /home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x5bf5c01 (0x7f09ae683c01 in /home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x5ba3deb (0x7f09ae631deb in /home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) + 0x4b5 (0x7f09ae6346f5 in /home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0xc1e575 (0x7f09bd976575 in /home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0xc52e64 (0x7f09bd9aae64 in /home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x3895ce (0x7f09bd0e15ce in /home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x15adae (0x559e380cadae in /home/bdck/VLM/venv_vlm_prometheus/bin/python3)
frame #12: _PyObject_MakeTpCall + 0x25b (0x559e380c152b in /home/bdck/VLM/venv_vlm_prometheus/bin/python3)
frame #13: <unknown function> + 0x16952b (0x559e380d952b in /home/bdck/VLM/venv_vlm_prometheus/bin/python3)
frame #14: _PyObject_Call + 0x118 (0x559e380da0a8 in /home/bdck/VLM/venv_vlm_prometheus/bin/python3)
frame #15: <unknown function> + 0x165dc7 (0x559e380d5dc7 in /home/bdck/VLM/venv_vlm_prometheus/bin/python3)
frame #16: <unknown function> + 0x1518db (0x559e380c18db in /home/bdck/VLM/venv_vlm_prometheus/bin/python3)
frame #17: <unknown function> + 0x38815b (0x7f09bd0e015b in /home/bdck/VLM/venv_vlm_prometheus/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #18: _PyObject_MakeTpCall + 0x25b (0x559e380c152b in /home/bdck/VLM/venv_vlm_prometheus/bin/python3)
frame #19: _PyEval_EvalFrameDefault + 0x7611 (0x559e380ba871 in /home/bdck/VLM/venv_vlm_prometheus/bin/python3)
frame #20: _PyFunction_Vectorcall + 0x7c (0x559e380cb6ac in /home/bdck/VLM/venv_vlm_prometheus/bin/python3)

RunLLM · June 25, 2025, 12:37pm

The C++ frames in the stack trace are generic and not actionable on their own (troubleshooting docs). The useful part is the c10d message just above them: the client socket timed out after 600000ms while trying to connect to (192.168.76.68, 54867). That is a timeout while a worker process was connecting to the TCPStore used to set up distributed communication for tensor_parallel_size=2, which points to a network or host-address problem rather than to GPU memory.

Check that 192.168.76.68 is an address of the machine you are running on and that local connections to it are not blocked. The vLLM troubleshooting docs describe setting VLLM_HOST_IP (and, if needed, NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME) when the automatically detected address is wrong. Would you like help interpreting a specific error message from your logs?
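
The log above does contain a c10d socket timeout towards 192.168.76.68:54867, so one thing worth checking is whether a process on the server can open a TCP connection to that address at all. The sketch below uses only the Python standard library; the function names are hypothetical and it does not touch vLLM or NCCL:

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def self_test(host, timeout=3.0):
    """Listen on `host` (OS-chosen port) and try to connect to ourselves."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((host, 0))  # port 0 lets the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        return can_connect(host, port, timeout)
```

Running self_test("192.168.76.68") on the server mimics what the worker does during startup. An OSError from bind suggests the address does not belong to this machine, and a False result suggests local connections to it are filtered; either would be consistent with the timeout in the log.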

Sources:

troubleshooting docs

brechtBDCK · June 25, 2025, 12:45pm

How do I correctly access my two GPUs? I SSH into our Prometheus server, where nvidia-smi shows two separate Quadros. I get async problems when using tensor_parallel_size=2.

RunLLM · June 25, 2025, 12:45pm

To use both GPUs with vLLM, set tensor_parallel_size=2 and make sure both GPUs are visible (e.g., CUDA_VISIBLE_DEVICES=0,1). Do not call torch.cuda.set_device() or otherwise initialize CUDA before creating the LLM; your log shows vLLM switching to the spawn start method because CUDA was already initialized. If you encounter async or NCCL errors, check that your CUDA driver and NCCL are up to date and that both GPUs are on the same PCIe root complex for optimal communication (see the troubleshooting docs).

Example:

CUDA_VISIBLE_DEVICES=0,1 python your_script.py

or in Python:

import os
# Set this before anything initializes CUDA (ideally before importing torch or vllm).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
from vllm import LLM
llm = LLM(model="your-model", tensor_parallel_size=2)

Would you like more detail on debugging NCCL or async errors?

brechtBDCK · June 25, 2025, 1:03pm

I’m running multiple models back to back in the same python3 main.py run. The second model already has too little VRAM (even though it should fit). Is vLLM not releasing its memory automatically?

def main():
    load_dotenv()  # Load environment variables from .env file
    hf_token = os.getenv("HF_TOKEN")
    login(hf_token)  # to download models from Hugging Face

    qa_config = load_questions("Questions&Answers.json")
    base_dir = "dataset_BTECH"
    models_to_evaluate = [
        "Salesforce/blip2-opt-2.7b",
        "Salesforce/blip2-opt-6.7b",
        "deepseek-ai/deepseek-vl2-tiny",
        "deepseek-ai/deepseek-vl2-small",
    ]

    with open(results_file, "w") as f:
            json.dump(
                {
                    "dataset": dataset_name,
                    "accuracy": overall_accuracy,
                    "per_product_accuracy": accuracy_data,
                    "per_question_accuracy": question_accuracy,
                    "results": results,
                },
                f,
                indent=2,
            )

        print(f"Processed {len(results)} question-image pairs. Overall accuracy: {overall_accuracy:.2%}")
        for product in accuracy_data:
            for condition in accuracy_data[product]:
                acc = accuracy_data[product][condition]
                print(
                    f"{product} {condition}: {acc['correct']}/{acc['total']} correct ({acc['accuracy']:.2%})"
                )
        for product in question_accuracy:
            print(f"\nQuestion accuracy for product '{product}':")
            for question, qacc in question_accuracy[product].items():
                print(f"- {qacc['correct']}/{qacc['total']} correct ({qacc['accuracy']:.2%}): {question}")
        print(f"\nSaved results to {results_file}")

    except Exception as e:
        with open(errors_log, "a") as log_f:
            log_f.write(
                f"{time.strftime('%Y-%m-%d %H:%M:%S')} - {model_name} - {repr(e)}\n"
            )
            log_f.write(traceback.format_exc() + "\n")
        print(f"Failed to evaluate {model_name}: {e}")

if __name__ == "__main__":
    main()

RunLLM · June 25, 2025, 1:04pm

No, vLLM does not automatically release all GPU memory when you delete or overwrite an LLM instance. This is a known limitation: after one model has run, GPU memory may not be fully freed, which causes out-of-memory errors when the next model is loaded in the same process. A common workaround is to clean up resources manually after each model.

Example cleanup code:

import gc
import torch
from vllm.distributed import destroy_distributed_environment, destroy_model_parallel

def cleanup():
    # Tear down vLLM's process groups, then release cached CUDA blocks.
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

# After using a model:
del llm
cleanup()

For some vLLM versions, you may also need to delete internal objects or shut down Ray if it is used. In general, the most reliable approach is to run each model in a separate process, because all GPU memory is returned to the driver when that process exits.
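
Process-based isolation can be sketched with the standard library alone. This is a minimal sketch, assuming the per-model evaluation is moved into its own script; the script name eval_one_model.py and the function name are hypothetical:

```python
import subprocess
import sys

def run_each_in_fresh_process(model_names, worker_cmd):
    """Run `worker_cmd + [model_name]` once per model and collect exit codes.

    Every model gets its own Python process, so whatever GPU memory it held
    is released when that process exits.
    """
    exit_codes = {}
    for name in model_names:
        proc = subprocess.run([*worker_cmd, name])
        exit_codes[name] = proc.returncode
        if proc.returncode != 0:
            print(f"Failed to evaluate {name} (exit code {proc.returncode})")
    return exit_codes

# Hypothetical usage: eval_one_model.py loads the model named in sys.argv[1],
# evaluates it, writes its results file, and exits.
# run_each_in_fresh_process(models_to_evaluate, [sys.executable, "eval_one_model.py"])
```

A failure in one model then only affects its own process, and the parent loop continues with the next model on a clean GPU.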

Would you like a full example or more detail on process-based isolation?

brechtBDCK · June 27, 2025, 8:49am

Evaluating deepseek-ai/deepseek-vl2-tiny…
INFO 06-27 10:48:24 [config.py:533] Overriding HF config with {‘architectures’: [‘DeepseekVLV2ForCausalLM’]}
INFO 06-27 10:48:32 [config.py:823] This model supports multiple tasks: {‘generate’, ‘score’, ‘embed’, ‘classify’, ‘reward’}. Defaulting to ‘generate’.
WARNING 06-27 10:48:32 [config.py:3220] Your device ‘Quadro RTX 6000’ (with compute capability 7.5) doesn’t support torch.bfloat16. Falling back to torch.float16 for compatibility.
WARNING 06-27 10:48:32 [config.py:3271] Casting torch.bfloat16 to torch.float16.
WARNING 06-27 10:48:32 [arg_utils.py:1642] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
WARNING 06-27 10:48:32 [cuda.py:91] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 06-27 10:48:32 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.1) with config: model=‘deepseek-ai/deepseek-vl2-tiny’, speculative_config=None, tokenizer=‘deepseek-ai/deepseek-vl2-tiny’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend=‘auto’, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=‘’), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=deepseek-ai/deepseek-vl2-tiny, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={“level”:0,“debug_dump_path”:“”,“cache_dir”:“”,“backend”:“”,“custom_ops”:,“splitting_ops”:,“use_inductor”:true,“compile_sizes”:,“inductor_compile_config”:{“enable_auto_functionalized_v2”:false},“inductor_passes”:{},“use_cudagraph”:true,“cudagraph_num_of_warmups”:0,“cudagraph_capture_sizes”:,“cudagraph_copy_inputs”:false,“full_cuda_graph”:false,“max_capture_size”:0,“local_cache_dir”:null}, use_cached_outputs=False,
You are using the default legacy behaviour of the <class ‘transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast’>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in ⚠️⚠️[`T5Tokenize`] Fix T5 family tokenizers⚠️⚠️ by ArthurZucker · Pull Request #24565 · huggingface/transformers · GitHub - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
INFO 06-27 10:48:35 [cuda.py:275] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-27 10:48:35 [cuda.py:324] Using XFormers backend.
INFO 06-27 10:48:35 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 06-27 10:48:35 [model_runner.py:1171] Starting to load model deepseek-ai/deepseek-vl2-tiny…
INFO 06-27 10:48:37 [weight_utils.py:292] Using model weights format [‘*.safetensors’]
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00, 2.31s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00, 2.32s/it]

INFO 06-27 10:48:40 [default_loader.py:272] Loading weights took 2.55 seconds
INFO 06-27 10:48:40 [model_runner.py:1203] Model loading took 6.2934 GiB and 4.399593 seconds
WARNING 06-27 10:48:43 [model_runner.py:1368] Computed max_num_seqs (min(1, 1024 // 2101)) to be less than 1. Setting it to the minimum value of 1.
WARNING 06-27 10:48:43 [profiling.py:232] The sequence length used for profiling (max_num_batched_tokens / max_num_seqs = 1024) is too short to hold the multi-modal embeddings in the worst case (2102 tokens in total, out of which {‘image’: 2101} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase max_model_len, reduce max_num_seqs, and/or reduce mm_counts.
Failed to evaluate deepseek-ai/deepseek-vl2-tiny: Input type (float) and bias type (c10::Half) should be the same
[rank0]:[W627 10:48:44.673599799 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.

RunLLM · June 27, 2025, 8:50am

Your error, Failed to evaluate deepseek-ai/deepseek-vl2-tiny: Input type (float) and bias type (c10::Half) should be the same, means there is a dtype mismatch between the model weights (float16, i.e., c10::Half) and the input tensor (float32, i.e., float). This has been reported on GPUs that do not support bfloat16, where the model is cast to float16 but the image processor still outputs float32 tensors (see the issue under Sources).

How to fix:

Explicitly cast your image input to float16 before passing it to vLLM.

If you use PIL and NumPy, do:

import numpy as np
import torch
from PIL import Image

image = Image.open("your_image.png").convert("RGB")
image_tensor = torch.from_numpy(np.array(image)).to(torch.float16)

Or, if the checkpoint's processor loads through the HuggingFace AutoProcessor, cast its output to float16:

import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("deepseek-ai/deepseek-vl2-tiny")
inputs = processor(images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(torch.float16)

Also:

Your GPU (Quadro RTX 6000, compute capability 7.5) does not support bfloat16, so always use float16 for both model and input.
The warning about sequence length and multi-modal embeddings means your max_model_len is too small for some images. Increase it if you have enough VRAM, or reduce image resolution/size.
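
The sequence-length warning can be turned into a quick budget check. The 2101 image tokens come from the profiling warning in the log; the text and output budgets below are illustrative assumptions, and the function name is hypothetical:

```python
def min_max_model_len(mm_tokens, text_tokens, max_new_tokens):
    """Smallest max_model_len that fits image placeholders, prompt text, and output."""
    return mm_tokens + text_tokens + max_new_tokens

# 2101 is taken from the profiling warning; 200 and 256 are assumed budgets.
needed = min_max_model_len(mm_tokens=2101, text_tokens=200, max_new_tokens=256)
print(needed)          # 2557
print(needed <= 1024)  # False: max_model_len=1024 cannot hold a single image
```

Under these assumptions a max_model_len of about 4096 leaves comfortable headroom for one image per prompt.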

Would you like a full code example for DeepSeek-VL2 inference with correct dtype handling?

Sources:

https://github.com/vllm-project/vllm/issues/16570

brechtBDCK · June 27, 2025, 9:00am

It doesn't help?

        for img_name in tqdm(image_files, desc=f"{product}/{condition}", leave=False):
            img_path = os.path.join(base_dir, product, condition, img_name)
            image = Image.open(img_path).convert("RGB")
            width, height = image.size
            image = torch.from_numpy(np.array(image)).to(torch.float16)  # does not fix the DeepSeek problem?

The error:
INFO 06-27 10:57:11 [default_loader.py:272] Loading weights took 2.30 seconds
INFO 06-27 10:57:11 [model_runner.py:1203] Model loading took 6.2934 GiB and 4.072829 seconds
WARNING 06-27 10:57:14 [model_runner.py:1368] Computed max_num_seqs (min(1, 2048 // 2101)) to be less than 1. Setting it to the minimum value of 1.
WARNING 06-27 10:57:14 [profiling.py:232] The sequence length used for profiling (max_num_batched_tokens / max_num_seqs = 2048) is too short to hold the multi-modal embeddings in the worst case (2102 tokens in total, out of which {‘image’: 2101} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase max_model_len, reduce max_num_seqs, and/or reduce mm_counts.
Failed to evaluate deepseek-ai/deepseek-vl2-tiny: Input type (float) and bias type (c10::Half) should be the same
[rank0]:[W627 10:57:14.196482695 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources.

RunLLM · June 27, 2025, 9:01am

Directly converting a PIL image to a float16 tensor with torch.from_numpy(np.array(image)).to(torch.float16) does not produce the correct input shape or normalization for DeepSeek-VL2. The model expects preprocessed image tensors (normalized, shape [1, 3, H, W], and float16), not raw pixel arrays. This is why the dtype error persists.

Correct solution:
Use the HuggingFace processor for DeepSeek-VL2 and cast the output to float16:

import torch
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("deepseek-ai/deepseek-vl2-tiny")
image = Image.open(img_path).convert("RGB")
inputs = processor(images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(torch.float16)  # Ensure float16

# Pass this to vLLM as multi_modal_data
# (llm, prompt, and sampling_params are assumed to be defined already)
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": inputs["pixel_values"]},
}, sampling_params=sampling_params)

Summary:

Always use the model’s processor to preprocess images.
Cast the resulting pixel_values to float16.
Pass the processed tensor to vLLM.

Would you like a full working example for DeepSeek-VL2?

brechtBDCK · June 27, 2025, 9:45am

huh?
Failed to evaluate deepseek-ai/deepseek-vl2-tiny: Failed to find C compiler. Please specify via CC environment variable.

RunLLM · June 27, 2025, 9:46am

This error means vLLM (or its dependencies like Triton) cannot find a C compiler on your system. You must install a C compiler (e.g., gcc) and ensure it is available in your environment. On Ubuntu/Debian, run:

sudo apt-get update
sudo apt-get install build-essential

If you still see the error, set the CC environment variable to the path of your C compiler, e.g., export CC=/usr/bin/gcc (source).
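
To confirm from Python whether a compiler is reachable before rerunning the evaluation, a minimal sketch follows. The function name `find_c_compiler` and the candidate list are hypothetical; this approximates the kind of lookup a build step performs, not the exact logic used by vLLM or Triton.

```python
import os
import shutil


def find_c_compiler(env=None, candidates=("gcc", "cc", "clang")):
    """Return the path of a usable C compiler, or None if none is found.

    The CC variable is honored first, then the candidate names are
    searched on PATH in order.
    """
    env = os.environ if env is None else env
    cc = env.get("CC")
    if cc:
        resolved = shutil.which(cc)
        if resolved:
            return resolved
    for name in candidates:
        resolved = shutil.which(name)
        if resolved:
            return resolved
    return None


if __name__ == "__main__":
    compiler = find_c_compiler()
    if compiler is None:
        print("No C compiler found; install one or set CC.")
    else:
        print(f"Using C compiler: {compiler}")
```

If this prints a path but the error persists, the process running vLLM probably has a different PATH or environment than your shell.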

Would you like more detail on compiler requirements or troubleshooting?

Sources:

https://github.com/vllm-project/vllm/issues/2997

brechtBDCK · June 27, 2025, 11:07am

Trying to get the two GPUs working did not work for me. It still only shows 24 GB (one GPU). Is it possible that they are on different nodes or something?

RunLLM · June 27, 2025, 11:08am

Yes, if your two GPUs are on different nodes (i.e., different physical machines), vLLM will not be able to use them together for tensor parallelism by default; multi-node support requires a Ray cluster and special setup (distributed serving docs). If both GPUs are on the same machine, nvidia-smi should show both under the same hostname, and vLLM with tensor_parallel_size=2 will use both.

If you only see 24GB total, it’s likely only one GPU is available to your process, or the GPUs are on separate nodes. Check with nvidia-smi and confirm both GPUs are listed together. If not, you need to run vLLM in a distributed (multi-node) mode using Ray, and follow the multi-node setup instructions.
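
A quick way to see what the current process can reach, sketched under assumptions: `nvidia-smi -L` prints one line per GPU on the local machine, and the `CUDA_VISIBLE_DEVICES` variable can hide some of them from a process. The helper names below are hypothetical and not part of vLLM.

```python
import os
import shutil
import subprocess


def list_local_gpus():
    """Return one line per GPU reported by `nvidia-smi -L`, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True, text=True, check=False, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.startswith("GPU ")]


def parse_visible_devices(value):
    """Parse a CUDA_VISIBLE_DEVICES value.

    None means the variable is unset (no restriction); otherwise the
    comma-separated entries are returned as a list of strings.
    """
    if value is None:
        return None
    return [item.strip() for item in value.split(",") if item.strip()]


if __name__ == "__main__":
    gpus = list_local_gpus()
    visible = parse_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES"))
    print(f"GPUs on this machine: {len(gpus)}")
    print(f"CUDA_VISIBLE_DEVICES restriction: {visible}")
```

If the first number is 2 but the restriction lists a single entry, the second GPU is on the same machine and is only hidden from the process.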

Would you like step-by-step instructions for multi-node setup?

brechtBDCK · June 27, 2025, 7:20pm

Failed to evaluate llava-hf/llava-onevision-qwen2-0.5b-ov-hf: can only concatenate str (not “list”) to str

RunLLM · June 27, 2025, 7:20pm

This error occurs if you pass a list as the prompt instead of a string. vLLM’s generate method expects the prompt field to be a string, not a list, for most models (see here). If you have a list, use only the first element or join the list into a single string.

Example fix:

if isinstance(prompt, list):
    prompt = prompt[0]  # or ''.join(prompt)
outputs = llm.generate({"prompt": prompt, ...})

Would you like more detail or a code example for your use case?

Sources:

https://github.com/vllm-project/vllm/issues/186
