@RunLLM What are the options to serve this big model (Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 on Hugging Face)? I can cluster two L40S GPU cards with 48 GB VRAM each, and I can extend the CPU RAM accordingly. Can you suggest ways I can serve this model using vLLM?
Serving Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 on two L40S GPUs (48 GB each) is not officially supported for full-capacity inference; the model typically requires at least 8×80 GB GPUs (A100/H100/H200) for standard deployment. You can attempt to serve it with aggressive memory optimizations, but you must significantly reduce the context length (--max-model-len) and batch size, and likely offload model weights and KV cache to CPU RAM. Expect very limited throughput and possible instability.
Recommended steps:
- Use pipeline parallelism: set --tensor-parallel-size 1 and --pipeline-parallel-size 2 to split the model across both GPUs. Because the L40S lacks NVLink, pipeline parallelism is preferred over tensor parallelism on this hardware (see docs).
- Aggressively lower --max-model-len (e.g., 2048 or less) and --max-num-seqs to fit within 96GB total VRAM.
- Set --gpu-memory-utilization to 0.9 or lower.
- Use --cpu-offload-gb to offload part of the model to CPU RAM, but note this will reduce performance (see memory guide).
- Optionally, use quantized models or GGUF format for further memory reduction, but GGUF support is experimental (see GGUF guide).
- You may need to set the environment variable VLLM_PP_LAYER_PARTITION to manually control layer distribution between GPUs (see example).
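To see why these reductions are unavoidable, here is a back-of-the-envelope memory sketch. The parameter count and byte sizes are rough planning figures (FP8 ≈ 1 byte per parameter, ignoring activations, KV cache, and CUDA overhead), not measured values:

```python
# Rough memory sketch for a 235B-parameter FP8 model on two 48 GB L40S GPUs.
# All figures are approximations for capacity planning only.
total_params_b = 235           # total parameters, in billions (A22B = active params)
bytes_per_param = 1            # FP8 weights ~= 1 byte/parameter
weights_gb = total_params_b * bytes_per_param   # ~235 GB of weights alone

gpu_vram_gb = 48 * 2           # two L40S cards
shortfall_gb = weights_gb - gpu_vram_gb

print(f"weights ~{weights_gb} GB, total VRAM {gpu_vram_gb} GB, "
      f"must offload/quantize at least ~{shortfall_gb} GB")
```

Even before reserving any VRAM for KV cache, the weights alone exceed the combined VRAM by well over 100 GB, which is why CPU offloading or further quantization is mandatory here.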
Would you like a step-by-step example command or more detail on each configuration?
Sources:
- Qwen3-VL Usage Guide - vLLM Recipes
- Distributed Inference and Serving — vLLM
- Conserving Memory - vLLM
- GGUF Quantization (experimental)
- GitHub Issue: Deploying on GPUs with different memory
Hint: Mention @RunLLM in the post for followups.
@RunLLM Below is the docker-compose.yaml that I am currently using to serve models. How should I change it for the following cases?
- If I want to load the model using two GPU cards? (Do the GPU cards need to be in the same server, or can they be in two different servers?)
- If I want to offload to CPU, how do I do it?
services:
  qwen3-instruct:
    image: vllm/vllm-openai:v0.15.0
    container_name: qwen3-instruct
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - /home/ranjith/.cache/huggingface:/root/.cache/huggingface
      - /dev/null:/etc/ld.so.conf.d/00-cuda-compat.conf
    ports:
      - "9122:8000"
    ipc: host
    command: >
      /root/.cache/huggingface/hub/models--Qwen--Qwen3-30B-A3B-Instruct-2507-FP8/snapshots/5a5a776300a41aaa681dd7ff0106608ef2bc90db
      --served-model-name ranjith-model
      --gpu-memory-utilization 0.8
      --max_model_len 16384
      --max_num_batched_tokens 16384
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --enforce-eager
To use two GPU cards in the same server, set NVIDIA_VISIBLE_DEVICES=0,1 and add --tensor-parallel-size 2 to your command. For multi-server (multi-node) deployment, vLLM supports distributed serving, but you must follow the distributed serving guide and cannot do this with a single docker-compose file.
To offload to CPU, add --cpu-offload-gb to your command (e.g., --cpu-offload-gb 32 for 32GB CPU offload). This will move part of the model weights and/or KV cache to CPU RAM, but will reduce performance (see memory guide).
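As a rough sizing aid, the sketch below estimates a --cpu-offload-gb value. All numbers here are illustrative assumptions (a hypothetical ~30 GB FP8 model, a 10 GB KV-cache reserve), not measured values; replace them with your own hardware and model figures:

```python
# Sketch: estimate how many GB of weights to offload to CPU so the rest
# fits in the GPU memory budget vLLM is allowed to use.
# All inputs are illustrative assumptions, not measured values.
weights_gb = 30                # e.g. a ~30B-parameter FP8 model
vram_gb = 48                   # one L40S card
gpu_mem_util = 0.8             # matches --gpu-memory-utilization 0.8
kv_cache_reserve_gb = 10       # leave headroom for KV cache / activations

budget_gb = vram_gb * gpu_mem_util                  # VRAM vLLM may use
offload_gb = max(0.0, weights_gb - (budget_gb - kv_cache_reserve_gb))
print(f"suggested --cpu-offload-gb ~= {offload_gb:.1f}")
```

The result is only a starting point: more offload means lower VRAM pressure but slower inference, since offloaded weights are streamed over PCIe each forward pass.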
Would you like a full example docker-compose.yaml for these settings?
@RunLLM
Yes, I need a full example docker-compose.yaml for these settings.
Here is an example docker-compose.yaml for serving a model on two GPUs in the same server, with optional CPU offloading. This assumes both GPUs are available on the same machine. For multi-node (multi-server) setups, vLLM requires additional distributed configuration and cannot be handled by a single docker-compose file alone (see distributed guide).
version: "3.9"
services:
  qwen3-instruct:
    image: vllm/vllm-openai:v0.15.0
    container_name: qwen3-instruct
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0,1
    volumes:
      - /home/ranjith/.cache/huggingface:/root/.cache/huggingface
      - /dev/null:/etc/ld.so.conf.d/00-cuda-compat.conf
    ports:
      - "9122:8000"
    ipc: host
    # --cpu-offload-gb on the last line is optional; remove it to disable
    # CPU offload. (Do not put "#" comments inside the folded command block
    # below, or they are passed to vLLM as literal arguments.)
    command: >
      --model /root/.cache/huggingface/hub/models--Qwen--Qwen3-30B-A3B-Instruct-2507-FP8/snapshots/5a5a776300a41aaa681dd7ff0106608ef2bc90db
      --served-model-name ranjith-model
      --gpu-memory-utilization 0.8
      --tensor-parallel-size 2
      --max-model-len 16384
      --max-num-batched-tokens 16384
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --enforce-eager
      --cpu-offload-gb 32
- For CPU offloading, adjust --cpu-offload-gb as needed.
- For multi-node, see distributed serving docs.
Would you like more detail on multi-node deployment?
@RunLLM , Can I completely run the model on CPU?
Yes, you can run vLLM models entirely on CPU, but you must use a CPU-specific build or Docker image, and performance will be much slower than on GPU. Use the vllm-cpu Docker image (e.g., public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.9.0) and set appropriate environment variables like VLLM_CPU_KVCACHE_SPACE and VLLM_CPU_OMP_THREADS_BIND (see official CPU install guide).
Note: Very large models (like Qwen3-VL-235B) are generally not practical to run on CPU due to extreme memory and compute requirements (see FAQ). Small models are feasible.
Would you like a step-by-step example for CPU-only deployment?
@RunLLM , I'd like a step-by-step example for CPU-only deployment using docker-compose.
Certainly! Here’s a step-by-step example for CPU-only vLLM deployment using Docker Compose. This is suitable for small models; very large models like Qwen3-VL-235B are not practical on CPU (see CPU FAQ).
- Use the official CPU image: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest
- Set environment variables for KV cache and thread binding.
Example docker-compose.yaml:
version: "3.9"
services:
  vllm-cpu:
    image: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest
    container_name: vllm-cpu
    environment:
      - VLLM_CPU_KVCACHE_SPACE=40        # 40 GiB for KV cache (adjust as needed)
      - VLLM_CPU_OMP_THREADS_BIND=0-15   # Bind 16 CPU cores (adjust as needed)
    volumes:
      - /home/ranjith/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Llama-3.2-1B-Instruct
      --dtype bfloat16
- Replace --model with your desired model (small models recommended).
- Adjust VLLM_CPU_KVCACHE_SPACE and VLLM_CPU_OMP_THREADS_BIND for your hardware.
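Once the container is up, the server exposes an OpenAI-compatible API on port 8000. A minimal sketch of a chat-completions request body (the model name must match what the server loaded; send the JSON with any HTTP client to the endpoint noted in the comment):

```python
import json

# Build an OpenAI-compatible chat-completions request body for the
# vLLM server defined above. POST this to:
#   http://localhost:8000/v1/chat/completions
payload = {
    "model": "meta-llama/Llama-3.2-1B-Instruct",   # must match the served model
    "messages": [
        {"role": "user", "content": "Say hello in one word."},
    ],
    "max_tokens": 16,
}
body = json.dumps(payload)
print(body)
```

On CPU, expect first-token latency in seconds even for a 1B model, so keep max_tokens small while testing.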
Would you like more detail on tuning these parameters?