Need to serve a Qwen3 LLM with 235B params

@RunLLM What are the options to serve this big model (Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 · Hugging Face)? I can cluster two L40S GPU cards with 48 GB VRAM each, and I can extend the CPU RAM accordingly. Can you suggest the ways in which I can serve this model using vLLM?

Serving Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 on two L40S GPUs (48GB each) is not officially supported for full-capacity inference: at FP8 (roughly one byte per parameter) the weights alone occupy about 235GB, far beyond your 96GB of total VRAM, and standard deployment typically calls for at least 8×80GB GPUs (A100/H100/H200). You can attempt to serve it with aggressive memory optimizations, but you must significantly reduce the context length (--max-model-len) and batch size, and offload model weights and KV cache to CPU RAM. Expect very limited throughput and possible instability.

Recommended steps:

  • Use pipeline parallelism: set --tensor-parallel-size 1 and --pipeline-parallel-size 2 to split the model across both GPUs; the L40S lacks NVLink, so pipeline parallelism is preferred over tensor parallelism on this hardware (see docs).
  • Aggressively lower --max-model-len (e.g., 2048 or less) and --max-num-seqs to fit within 96GB total VRAM.
  • Set --gpu-memory-utilization to 0.9 or lower.
  • Use --cpu-offload-gb to offload part of the model to CPU RAM, but note this will reduce performance (see memory guide).
  • Optionally, use quantized models or GGUF format for further memory reduction, but GGUF support is experimental (see GGUF guide).
  • You may need to set the environment variable VLLM_PP_LAYER_PARTITION to manually control layer distribution between GPUs (see example).
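The steps above can be sketched as a single serve command. This is illustrative only, not a validated configuration for 2×L40S: the context length, sequence count, offload size, and the layer partition are assumptions you would need to tune (the partition must sum to the model's actual layer count).

```shell
# Sketch only: combines the steps above; values are illustrative and untested.
# VLLM_PP_LAYER_PARTITION is an optional manual layer split across the 2 stages
# (here a hypothetical equal split; adjust to the model's real layer count).
VLLM_PP_LAYER_PARTITION="47,47" \
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 2 \
  --max-model-len 2048 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.9 \
  --cpu-offload-gb 64
```

Even if this starts, expect long load times and low throughput, since most weights will be paged through CPU RAM.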

Would you like a step-by-step example command or more detail on each configuration?


Hint: Mention @RunLLM in the post for followups.

@RunLLM Below is the docker-compose.yaml that I am currently using to serve any model. How should I change it for the following scenarios?

  1. If I want to load using two GPU cards? (Should the GPU cards be in the same server, or can they be in two different servers?)
  2. If I want to offload to CPU how will I do it?

services:
  qwen3-instruct:
    image: vllm/vllm-openai:v0.15.0
    container_name: qwen3-instruct
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - /home/ranjith/.cache/huggingface:/root/.cache/huggingface
      - /dev/null:/etc/ld.so.conf.d/00-cuda-compat.conf
    ports:
      - "9122:8000"
    ipc: host
    command: >
      /root/.cache/huggingface/hub/models--Qwen--Qwen3-30B-A3B-Instruct-2507-FP8/snapshots/5a5a776300a41aaa681dd7ff0106608ef2bc90db
      --served-model-name ranjith-model
      --gpu-memory-utilization 0.8
      --max_model_len 16384
      --max_num_batched_tokens 16384
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --enforce-eager

To use two GPU cards in the same server, set NVIDIA_VISIBLE_DEVICES=0,1 and add --tensor-parallel-size 2 to your command. For multi-server (multi-node) deployment, vLLM supports distributed serving, but you must follow the distributed serving guide and cannot do this with a single docker-compose file.

To offload to CPU, add --cpu-offload-gb to your command (e.g., --cpu-offload-gb 32 for 32GB CPU offload). This will move part of the model weights and/or KV cache to CPU RAM, but will reduce performance (see memory guide).

Would you like a full example docker-compose.yaml for these settings?


@RunLLM
Yeah I need full example docker-compose.yaml for these settings

Here is an example docker-compose.yaml for serving a model on two GPUs in the same server, with optional CPU offloading. This assumes both GPUs are available on the same machine. For multi-node (multi-server) setups, vLLM requires additional distributed configuration and cannot be handled by a single docker-compose file alone (see distributed guide).

version: "3.9"
services:
  qwen3-instruct:
    image: vllm/vllm-openai:v0.15.0
    container_name: qwen3-instruct
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0,1
    volumes:
      - /home/ranjith/.cache/huggingface:/root/.cache/huggingface
      - /dev/null:/etc/ld.so.conf.d/00-cuda-compat.conf
    ports:
      - "9122:8000"
    ipc: host
    command: >
      --model /root/.cache/huggingface/hub/models--Qwen--Qwen3-30B-A3B-Instruct-2507-FP8/snapshots/5a5a776300a41aaa681dd7ff0106608ef2bc90db
      --served-model-name ranjith-model
      --gpu-memory-utilization 0.8
      --tensor-parallel-size 2
      --max-model-len 16384
      --max-num-batched-tokens 16384
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --enforce-eager
      --cpu-offload-gb 32

  • --cpu-offload-gb is optional; omit the line to disable CPU offloading. (An inline # comment inside the folded command: > block would be passed to vLLM as part of the command, so keep notes outside the YAML.)
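Once the container is up, the OpenAI-compatible endpoint can be exercised with a request like the following. This assumes the port mapping above (host 9122) and uses the name set via --served-model-name; it requires the server to be running.

```shell
# Query the OpenAI-compatible server exposed on host port 9122.
curl -s http://localhost:9122/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ranjith-model",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```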

Would you like more detail on multi-node deployment?


@RunLLM , Can I completely run the model on CPU?

Yes, you can run vLLM models entirely on CPU, but you must use a CPU-specific build or Docker image, and performance will be much slower than on GPU. Use the vllm-cpu Docker image (e.g., public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.9.0) and set appropriate environment variables like VLLM_CPU_KVCACHE_SPACE and VLLM_CPU_OMP_THREADS_BIND (see official CPU install guide).

Note: Very large models (like Qwen3-VL-235B) are generally not practical to run on CPU due to extreme memory and compute requirements (see FAQ). Small models are feasible.

Would you like a step-by-step example for CPU-only deployment?


@RunLLM , I like to have step-by-step example for CPU-only deployment using docker-compose

Certainly! Here’s a step-by-step example for CPU-only vLLM deployment using Docker Compose. This is suitable for small models; very large models like Qwen3-VL-235B are not practical on CPU (see CPU FAQ).

  1. Use the official CPU image: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest
  2. Set environment variables for KV cache and thread binding.

Example docker-compose.yaml:

version: "3.9"
services:
  vllm-cpu:
    image: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest
    container_name: vllm-cpu
    environment:
      - VLLM_CPU_KVCACHE_SPACE=40   # 40 GiB for KV cache (adjust as needed)
      - VLLM_CPU_OMP_THREADS_BIND=0-15  # Bind 16 CPU cores (adjust as needed)
    volumes:
      - /home/ranjith/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Llama-3.2-1B-Instruct
      --dtype bfloat16

  • Replace --model with your desired model (small models recommended).
  • Adjust VLLM_CPU_KVCACHE_SPACE and VLLM_CPU_OMP_THREADS_BIND for your hardware.
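To bring the service up and confirm the model registered, something like the following should work, assuming the file above is saved as docker-compose.yaml in the current directory and the model has finished loading:

```shell
# Start the CPU-only service in the background, then list served models.
docker compose up -d vllm-cpu
curl -s http://localhost:8000/v1/models
```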

Would you like more detail on tuning these parameters?
