@RunLLM What are the options to serve this big model (Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 on Hugging Face)? I can cluster two L40S GPU cards with 48 GB VRAM each, and I can extend the CPU RAM accordingly. Can you suggest ways I can serve this model using vLLM?
Serving Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 on two L40S GPUs (48 GB each) is not officially supported for full-capacity inference; the model typically requires at least 8×80 GB GPUs (A100/H100/H200) for standard deployment. You can attempt to serve it with aggressive memory optimizations, but you must significantly reduce the context length (--max-model-len) and batch size, and likely offload model weights and KV cache to CPU RAM. Expect very limited throughput and possible instability.
Recommended steps:
- Use pipeline parallelism: set --tensor-parallel-size 1 and --pipeline-parallel-size 2 to split the model across both GPUs. Because the L40S lacks NVLink, pipeline parallelism is preferred over tensor parallelism on this hardware (see docs).
- Aggressively lower --max-model-len (e.g., 2048 or less) and --max-num-seqs to fit within 96GB total VRAM.
- Set --gpu-memory-utilization to 0.9 or lower.
- Use --cpu-offload-gb to offload part of the model to CPU RAM, but note this will reduce performance (see memory guide).
- Optionally, use quantized models or GGUF format for further memory reduction, but GGUF support is experimental (see GGUF guide).
- You may need to set the environment variable VLLM_PP_LAYER_PARTITION to manually control layer distribution between GPUs (see example).
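To see why these reductions are unavoidable, here is a back-of-the-envelope memory sketch. The parameter count and byte sizes are rough planning figures (FP8 ≈ 1 byte per parameter, ignoring activations, KV cache, and CUDA overhead), not measured values:

```python
# Rough memory sketch for a 235B-parameter FP8 model on two 48 GB L40S GPUs.
# All figures are approximations for capacity planning only.
total_params_b = 235           # total parameters, in billions (A22B = active params)
bytes_per_param = 1            # FP8 weights ~= 1 byte/parameter
weights_gb = total_params_b * bytes_per_param   # ~235 GB of weights alone

gpu_vram_gb = 48 * 2           # two L40S cards
shortfall_gb = weights_gb - gpu_vram_gb

print(f"weights ~{weights_gb} GB, total VRAM {gpu_vram_gb} GB, "
      f"must offload/quantize at least ~{shortfall_gb} GB")
```

Even before reserving any VRAM for KV cache, the weights alone exceed the combined VRAM by well over 100 GB, which is why CPU offloading or further quantization is mandatory here.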
Would you like a step-by-step example command or more detail on each configuration?
Sources:
- Qwen3-VL Usage Guide - vLLM Recipes
- Distributed Inference and Serving — vLLM
- Conserving Memory - vLLM
- GGUF Quantization (experimental)
- GitHub Issue: Deploying on GPUs with different memory
Hint: Mention @RunLLM in the post for followups.
@RunLLM Below is the docker-compose.yaml that I am currently using to serve models. How should I change it for the following cases?
- If I want to load the model using two GPU cards? (Do the GPU cards need to be in the same server, or can they be in two different servers?)
- If I want to offload to CPU, how do I do it?
services:
  qwen3-instruct:
    image: vllm/vllm-openai:v0.15.0
    container_name: qwen3-instruct
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - /home/ranjith/.cache/huggingface:/root/.cache/huggingface
      - /dev/null:/etc/ld.so.conf.d/00-cuda-compat.conf
    ports:
      - "9122:8000"
    ipc: host
    command: >
      /root/.cache/huggingface/hub/models--Qwen--Qwen3-30B-A3B-Instruct-2507-FP8/snapshots/5a5a776300a41aaa681dd7ff0106608ef2bc90db
      --served-model-name ranjith-model
      --gpu-memory-utilization 0.8
      --max_model_len 16384
      --max_num_batched_tokens 16384
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --enforce-eager
To use two GPU cards in the same server, set NVIDIA_VISIBLE_DEVICES=0,1 and add --tensor-parallel-size 2 to your command. For multi-server (multi-node) deployment, vLLM supports distributed serving, but you must follow the distributed serving guide and cannot do this with a single docker-compose file.
To offload to CPU, add --cpu-offload-gb to your command (e.g., --cpu-offload-gb 32 for 32GB CPU offload). This will move part of the model weights and/or KV cache to CPU RAM, but will reduce performance (see memory guide).
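As a rough sizing aid, the sketch below estimates a --cpu-offload-gb value. All numbers here are illustrative assumptions (a hypothetical ~30 GB FP8 model, a 10 GB KV-cache reserve), not measured values; replace them with your own hardware and model figures:

```python
# Sketch: estimate how many GB of weights to offload to CPU so the rest
# fits in the GPU memory budget vLLM is allowed to use.
# All inputs are illustrative assumptions, not measured values.
weights_gb = 30                # e.g. a ~30B-parameter FP8 model
vram_gb = 48                   # one L40S card
gpu_mem_util = 0.8             # matches --gpu-memory-utilization 0.8
kv_cache_reserve_gb = 10       # leave headroom for KV cache / activations

budget_gb = vram_gb * gpu_mem_util                  # VRAM vLLM may use
offload_gb = max(0.0, weights_gb - (budget_gb - kv_cache_reserve_gb))
print(f"suggested --cpu-offload-gb ~= {offload_gb:.1f}")
```

The result is only a starting point: more offload means lower VRAM pressure but slower inference, since offloaded weights are streamed over PCIe each forward pass.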
Would you like a full example docker-compose.yaml for these settings?
@RunLLM
Yes, I need a full example docker-compose.yaml for these settings.
Here is an example docker-compose.yaml for serving a model on two GPUs in the same server, with optional CPU offloading. This assumes both GPUs are available on the same machine. For multi-node (multi-server) setups, vLLM requires additional distributed configuration and cannot be handled by a single docker-compose file alone (see distributed guide).
version: "3.9"
services:
  qwen3-instruct:
    image: vllm/vllm-openai:v0.15.0
    container_name: qwen3-instruct
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0,1
    volumes:
      - /home/ranjith/.cache/huggingface:/root/.cache/huggingface
      - /dev/null:/etc/ld.so.conf.d/00-cuda-compat.conf
    ports:
      - "9122:8000"
    ipc: host
    # --cpu-offload-gb on the last line is optional; remove it to disable
    # CPU offload. (Do not put "#" comments inside the folded command block
    # below, or they are passed to vLLM as literal arguments.)
    command: >
      --model /root/.cache/huggingface/hub/models--Qwen--Qwen3-30B-A3B-Instruct-2507-FP8/snapshots/5a5a776300a41aaa681dd7ff0106608ef2bc90db
      --served-model-name ranjith-model
      --gpu-memory-utilization 0.8
      --tensor-parallel-size 2
      --max-model-len 16384
      --max-num-batched-tokens 16384
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --enforce-eager
      --cpu-offload-gb 32
- For CPU offloading, adjust --cpu-offload-gb as needed.
- For multi-node, see distributed serving docs.
Would you like more detail on multi-node deployment?
@RunLLM , Can I completely run the model on CPU?
Yes, you can run vLLM models entirely on CPU, but you must use a CPU-specific build or Docker image, and performance will be much slower than on GPU. Use the vllm-cpu Docker image (e.g., public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.9.0) and set appropriate environment variables like VLLM_CPU_KVCACHE_SPACE and VLLM_CPU_OMP_THREADS_BIND (see official CPU install guide).
Note: Very large models (like Qwen3-VL-235B) are generally not practical to run on CPU due to extreme memory and compute requirements (see FAQ). Small models are feasible.
Would you like a step-by-step example for CPU-only deployment?
@RunLLM , I'd like a step-by-step example for CPU-only deployment using docker-compose.
Certainly! Here’s a step-by-step example for CPU-only vLLM deployment using Docker Compose. This is suitable for small models; very large models like Qwen3-VL-235B are not practical on CPU (see CPU FAQ).
- Use the official CPU image: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest
- Set environment variables for KV cache and thread binding.
Example docker-compose.yaml:
version: "3.9"
services:
  vllm-cpu:
    image: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest
    container_name: vllm-cpu
    environment:
      - VLLM_CPU_KVCACHE_SPACE=40        # 40 GiB for KV cache (adjust as needed)
      - VLLM_CPU_OMP_THREADS_BIND=0-15   # Bind 16 CPU cores (adjust as needed)
    volumes:
      - /home/ranjith/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Llama-3.2-1B-Instruct
      --dtype bfloat16
- Replace --model with your desired model (small models recommended).
- Adjust VLLM_CPU_KVCACHE_SPACE and VLLM_CPU_OMP_THREADS_BIND for your hardware.
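Once the container is up, the server exposes an OpenAI-compatible API on port 8000. A minimal sketch of a chat-completions request body (the model name must match what the server loaded; send the JSON with any HTTP client to the endpoint noted in the comment):

```python
import json

# Build an OpenAI-compatible chat-completions request body for the
# vLLM server defined above. POST this to:
#   http://localhost:8000/v1/chat/completions
payload = {
    "model": "meta-llama/Llama-3.2-1B-Instruct",   # must match the served model
    "messages": [
        {"role": "user", "content": "Say hello in one word."},
    ],
    "max_tokens": 16,
}
body = json.dumps(payload)
print(body)
```

On CPU, expect first-token latency in seconds even for a 1B model, so keep max_tokens small while testing.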
Would you like more detail on tuning these parameters?