Hosting Qwen 3.5 35B-A3B model

The CUBLAS_STATUS_INVALID_VALUE error from cublasGemmEx during vLLM startup with Qwen3.5-35B-A3B indicates a CUDA/cuBLAS kernel launch failure. Common causes are insufficient GPU memory, incompatible CUDA/PyTorch versions, or an unsupported model configuration. Your logs show 64.69 GiB of VRAM already in use, which is close to the 80 GB limit of a single A100, and this model is very large. Similar reports exist for other Qwen3 models at long context lengths, especially on single-GPU setups or with unquantized float16 weights; the error can be triggered when model weight or KV cache allocation fails, or by an unsupported hardware/software combination.
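
As a first diagnostic step, it can help to confirm which PyTorch/CUDA build vLLM is running against and how much VRAM is actually free before startup. A minimal sketch using standard NVIDIA and PyTorch tooling (GPU index 0 is assumed here):

  # Driver version and current VRAM usage per GPU
  nvidia-smi

  # PyTorch version, the CUDA version it was built against, and whether CUDA is usable
  python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

  # Free and total VRAM on GPU 0, in GiB
  python -c "import torch; free, total = torch.cuda.mem_get_info(0); print(free / 2**30, total / 2**30)"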

Recommended actions (example commands are sketched after this list):

  • Lower --max-model-len (try 1024 or lower).
  • Reduce --gpu-memory-utilization (e.g., 0.7–0.8).
  • Use quantized model weights if available (e.g., GPTQ/AWQ).
  • Ensure CUDA, cuBLAS, and PyTorch versions are compatible and up to date.
  • If possible, use tensor parallelism (--tensor-parallel-size > 1) across multiple GPUs.
  • If the error persists, check the vLLM issue tracker for known issues with Qwen3.5-35B-A3B; similar reports suggest some large models are not fully supported on a single A100 80GB with float16 weights.
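
A minimal sketch of how these flags could be combined with vllm serve; Qwen/Qwen3.5-35B-A3B is used below as a placeholder for the actual model path or Hugging Face repo id in your setup, and the AWQ repo name is likewise hypothetical:

  # Conservative single-GPU attempt: short context, lower memory utilization, explicit float16
  vllm serve Qwen/Qwen3.5-35B-A3B \
    --max-model-len 1024 \
    --gpu-memory-utilization 0.75 \
    --dtype float16

  # If an AWQ- or GPTQ-quantized checkpoint of the model exists, it cuts weight memory substantially
  vllm serve Qwen/Qwen3.5-35B-A3B-AWQ \
    --quantization awq \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.8

  # Multi-GPU alternative: shard the model across two GPUs with tensor parallelism
  vllm serve Qwen/Qwen3.5-35B-A3B \
    --tensor-parallel-size 2 \
    --max-model-len 4096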

Would you like a more detailed step-by-step breakdown of any of these suggestions?

Hint: Mention @RunLLM in the post for followups.