Two vLLM containers on a single GPU

I have a 16 GB GPU, which should be enough to run two instances of 8B models with vLLM. But when I try it, the second container fails to start with a CUDA error, even though nvidia-smi shows plenty of VRAM left. Can anyone tell me whether this is possible and, if so, how?

Docker command:

docker run -d --name vllmeta --runtime=nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=" \
  --env "VLLM_SERVER_DEV_MODE=1" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --gpu-memory-utilization 0.5 \
  --quantization bitsandbytes \
  --dtype float16 \
  --enforce-eager \
  --max-model-len 2048
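
The second container is launched the same way, just with a different container name and host port; roughly like this (the name vllmeta2 and port 8001 below are placeholders, not necessarily what I used):

docker run -d --name vllmeta2 --runtime=nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=" \
  --env "VLLM_SERVER_DEV_MODE=1" \
  -p 8001:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --gpu-memory-utilization 0.5 \
  --quantization bitsandbytes \
  --dtype float16 \
  --enforce-eager \
  --max-model-len 2048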


nvidia-smi output while the first container is running:

Mon May 12 07:58:02 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   78C    P0             33W /  70W  |    6631MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          329374      C   /usr/bin/python3                       6620MiB |
+-----------------------------------------------------------------------------------------+

The error I get after starting the second container:


INFO 05-12 00:40:44 [__init__.py:239] Automatically detected platform cuda.

INFO 05-12 00:40:47 [api_server.py:1043] vLLM API server version 0.8.5.post1

INFO 05-12 00:40:47 [api_server.py:1044] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='float16', max_model_len=2048, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.5, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization='bitsandbytes', rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)

WARNING 05-12 00:40:48 [config.py:2972] Casting torch.bfloat16 to torch.float16.

INFO 05-12 00:40:57 [config.py:717] This model supports multiple tasks: {'reward', 'generate', 'score', 'embed', 'classify'}. Defaulting to 'generate'.

WARNING 05-12 00:40:57 [config.py:830] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.

WARNING 05-12 00:40:57 [arg_utils.py:1658] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.

WARNING 05-12 00:40:57 [cuda.py:93] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used

INFO 05-12 00:40:58 [api_server.py:246] Started engine process with PID 48

INFO 05-12 00:41:02 [__init__.py:239] Automatically detected platform cuda.

INFO 05-12 00:41:04 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,

INFO 05-12 00:41:06 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.

INFO 05-12 00:41:06 [cuda.py:289] Using XFormers backend.

INFO 05-12 00:41:07 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0

INFO 05-12 00:41:07 [model_runner.py:1108] Starting to load model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B...

INFO 05-12 00:41:08 [loader.py:1187] Loading weights with BitsAndBytes quantization. May take a while ...

INFO 05-12 00:41:08 [weight_utils.py:265] Using model weights format ['*.safetensors']

Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]

Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:06<00:06, 6.23s/it]

Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:08<00:00, 3.97s/it]

Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:08<00:00, 4.31s/it]

INFO 05-12 00:41:18 [model_runner.py:1140] Model loading took 5.2273 GiB and 9.910612 seconds

INFO 05-12 00:41:30 [worker.py:287] Memory profiling takes 12.44 seconds

INFO 05-12 00:41:30 [worker.py:287] the current vLLM instance can use total_gpu_memory (14.56GiB) x gpu_memory_utilization (0.50) = 7.28GiB

INFO 05-12 00:41:30 [worker.py:287] model weights take 5.23GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.40GiB; the rest of the memory reserved for KV Cache is 0.61GiB.

INFO 05-12 00:41:30 [executor_base.py:112] # cuda blocks: 709, # CPU blocks: 4681

INFO 05-12 00:41:30 [executor_base.py:117] Maximum concurrency for 2048 tokens per request: 5.54x

ERROR 05-12 00:41:31 [engine.py:448] CUDA error: invalid argument

ERROR 05-12 00:41:31 [engine.py:448] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

ERROR 05-12 00:41:31 [engine.py:448] For debugging consider passing CUDA_LAUNCH_BLOCKING=1

ERROR 05-12 00:41:31 [engine.py:448] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

ERROR 05-12 00:41:31 [engine.py:448] Traceback (most recent call last):

ERROR 05-12 00:41:31 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine

ERROR 05-12 00:41:31 [engine.py:448] engine = MQLLMEngine.from_vllm_config(

ERROR 05-12 00:41:31 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ERROR 05-12 00:41:31 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config

ERROR 05-12 00:41:31 [engine.py:448] return cls(

ERROR 05-12 00:41:31 [engine.py:448] ^^^^

ERROR 05-12 00:41:31 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__

ERROR 05-12 00:41:31 [engine.py:448] self.engine = LLMEngine(*args, **kwargs)

ERROR 05-12 00:41:31 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^

Process SpawnProcess-1:

ERROR 05-12 00:41:31 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 278, in __init__

ERROR 05-12 00:41:31 [engine.py:448] self._initialize_kv_caches()

ERROR 05-12 00:41:31 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 435, in _initialize_kv_caches

ERROR 05-12 00:41:31 [engine.py:448] self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)

ERROR 05-12 00:41:31 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 123, in initialize_cache

ERROR 05-12 00:41:31 [engine.py:448] self.collective_rpc("initialize_cache",

ERROR 05-12 00:41:31 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc

ERROR 05-12 00:41:31 [engine.py:448] answer = run_method(self.driver_worker, method, args, kwargs)

ERROR 05-12 00:41:31 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ERROR 05-12 00:41:31 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method

ERROR 05-12 00:41:31 [engine.py:448] return func(*args, **kwargs)

ERROR 05-12 00:41:31 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^

ERROR 05-12 00:41:31 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 327, in initialize_cache

ERROR 05-12 00:41:31 [engine.py:448] self._init_cache_engine()

ERROR 05-12 00:41:31 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 333, in _init_cache_engine

ERROR 05-12 00:41:31 [engine.py:448] CacheEngine(self.cache_config, self.model_config,

ERROR 05-12 00:41:31 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/cache_engine.py", line 66, in __init__

ERROR 05-12 00:41:31 [engine.py:448] self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, "cpu")

ERROR 05-12 00:41:31 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ERROR 05-12 00:41:31 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/cache_engine.py", line 95, in _allocate_kv_cache

ERROR 05-12 00:41:31 [engine.py:448] layer_kv_cache = torch.zeros(

ERROR 05-12 00:41:31 [engine.py:448] ^^^^^^^^^^^^

ERROR 05-12 00:41:31 [engine.py:448] RuntimeError: CUDA error: invalid argument

ERROR 05-12 00:41:31 [engine.py:448] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

ERROR 05-12 00:41:31 [engine.py:448] For debugging consider passing CUDA_LAUNCH_BLOCKING=1

ERROR 05-12 00:41:31 [engine.py:448] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

ERROR 05-12 00:41:31 [engine.py:448]

Traceback (most recent call last):

File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap

self.run()

File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run

self._target(*self._args, **self._kwargs)

File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 450, in run_mp_engine

raise e

File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine

engine = MQLLMEngine.from_vllm_config(

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config

return cls(

^^^^

File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__

self.engine = LLMEngine(*args, **kwargs)

^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 278, in __init__

self._initialize_kv_caches()

File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 435, in _initialize_kv_caches

self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)

File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 123, in initialize_cache

self.collective_rpc("initialize_cache",

File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc

answer = run_method(self.driver_worker, method, args, kwargs)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method

return func(*args, **kwargs)

^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 327, in initialize_cache

self._init_cache_engine()

File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 333, in _init_cache_engine

CacheEngine(self.cache_config, self.model_config,

File "/usr/local/lib/python3.12/dist-packages/vllm/worker/cache_engine.py", line 66, in __init__

self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, "cpu")

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.12/dist-packages/vllm/worker/cache_engine.py", line 95, in _allocate_kv_cache

layer_kv_cache = torch.zeros(

^^^^^^^^^^^^

RuntimeError: CUDA error: invalid argument

CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

For debugging consider passing CUDA_LAUNCH_BLOCKING=1

Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank0]:[W512 00:41:31.212053077 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

Traceback (most recent call last):

File "<frozen runpy>", line 198, in _run_module_as_main

File "<frozen runpy>", line 88, in _run_code

File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1130, in <module>

uvloop.run(run_server(args))

File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run

return __asyncio.run(

^^^^^^^^^^^^^^

File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run

return runner.run(main)

^^^^^^^^^^^^^^^^

File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run

return self._loop.run_until_complete(task)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete

File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper

return await main

^^^^^^^^^^

File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server

async with build_async_engine_client(args) as engine_client:

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__

return await anext(self.gen)

^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client

async with build_async_engine_client_from_engine_args(

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__

return await anext(self.gen)

^^^^^^^^^^^^^^^^^^^^^

File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 269, in build_async_engine_client_from_engine_args

raise RuntimeError(

RuntimeError: Engine process failed to start. See stack trace for the root cause.

This is a known issue with vLLM: when running multiple vLLM instances on the same GPU, the second instance often fails to start because of how GPU memory is accounted for. vLLM's memory profiling counts memory already used by other vLLM instances against the new instance, so the memory available to the second instance is underestimated, which leads to errors like "CUDA error: invalid argument" or a negative/insufficient KV cache size even when nvidia-smi shows free VRAM. This behavior has been confirmed in multiple reports and is not resolved in vLLM as of recent versions.
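
For concreteness, here is your log's arithmetic next to what the driver actually reports. The nvidia-smi query below uses standard flags; the figures in the comments are simply restated from the profiling lines and the nvidia-smi output earlier in this thread:

# What the driver reports while the first container is running:
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
# -> roughly 6.6 GiB used by the first container, ~8.5 GiB free of 15360 MiB total.

# What the second instance budgets for itself, per its own profiling output:
#   total_gpu_memory (14.56 GiB) x gpu_memory_utilization (0.50) = 7.28 GiB
# After model weights (5.23 GiB), non-torch memory (0.05 GiB) and the PyTorch
# activation peak (1.40 GiB), only ~0.6 GiB is left for the KV cache.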

The official documentation and user reports indicate that, despite the --gpu-memory-utilization flag, vLLM V1 does not reliably support multiple instances sharing a single GPU, because each instance accounts for the GPU's total memory rather than just its own allocation. There is no supported workaround for running two vLLM containers on the same GPU; the only reliable setup is one vLLM instance per GPU. Would you like more detail on the technical reasons, or on possible (unsupported) workarounds?
