Hosting Qwen 3.5 35B-A3B model

The CUBLAS_STATUS_INVALID_VALUE error from cublasGemmEx during vLLM startup with Qwen3.5-35B-A3B indicates a CUDA/cuBLAS kernel launch failure. Common causes are insufficient GPU memory, incompatible CUDA/PyTorch versions, or an unsupported model configuration. Your logs show 64.69 GiB of VRAM already in use, which is close to the 80 GB limit of a single A100, and this model is very large. Similar reports exist for other Qwen3 models at long context lengths, especially on single-GPU setups or with unquantized float16 weights; the error can be triggered when model weight or KV cache allocation fails, or by an unsupported hardware/software combination.
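
As a first diagnostic step, it can help to confirm which PyTorch/CUDA build vLLM is running against and how much VRAM is actually free before startup. A minimal sketch using standard NVIDIA and PyTorch tooling (GPU index 0 is assumed here):

  # Driver version and current VRAM usage per GPU
  nvidia-smi

  # PyTorch version, the CUDA version it was built against, and whether CUDA is usable
  python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

  # Free and total VRAM on GPU 0, in GiB
  python -c "import torch; free, total = torch.cuda.mem_get_info(0); print(free / 2**30, total / 2**30)"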

Recommended actions (example commands are sketched after this list):

  • Lower --max-model-len (try 1024 or lower).
  • Reduce --gpu-memory-utilization (e.g., 0.7–0.8).
  • Use quantized model weights if available (e.g., GPTQ/AWQ).
  • Ensure CUDA, cuBLAS, and PyTorch versions are compatible and up to date.
  • If possible, use tensor parallelism (--tensor-parallel-size > 1) across multiple GPUs.
  • If the error persists, check the vLLM issue tracker for known issues with Qwen3.5-35B-A3B; similar reports suggest some large models are not fully supported on a single A100 80GB with float16 weights.
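
A minimal sketch of how these flags could be combined with vllm serve; Qwen/Qwen3.5-35B-A3B is used below as a placeholder for the actual model path or Hugging Face repo id in your setup, and the AWQ repo name is likewise hypothetical:

  # Conservative single-GPU attempt: short context, lower memory utilization, explicit float16
  vllm serve Qwen/Qwen3.5-35B-A3B \
    --max-model-len 1024 \
    --gpu-memory-utilization 0.75 \
    --dtype float16

  # If an AWQ- or GPTQ-quantized checkpoint of the model exists, it cuts weight memory substantially
  vllm serve Qwen/Qwen3.5-35B-A3B-AWQ \
    --quantization awq \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.8

  # Multi-GPU alternative: shard the model across two GPUs with tensor parallelism
  vllm serve Qwen/Qwen3.5-35B-A3B \
    --tensor-parallel-size 2 \
    --max-model-len 4096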

Would you like a more detailed step-by-step breakdown of any of these suggestions?

Hint: Mention @RunLLM in the post for followups.