H20 running Qwen3-30B-A3B-AWQ failed

[Environment]
Hardware: 8*H20
CUDA: 12.4
python_env: Python 3.10, vLLM installed from the wheel (https://github.com/vllm-project/vllm/releases/download/v0.8.5/vllm-0.8.5-cp38-abi3-manylinux1_x86_64.whl)
[Start_Scripts]
export VLLM_DISABLE_GRAPH_CAPTURE=1
python3 -m vllm.entrypoints.openai.api_server \
  --model /home/nanyi/models/Qwen3-30B-A3B-AWQ \
  --dtype float16 \
  --max-model-len 28000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.92 \
  --served-model-name Qwen3-30B-A3B-AWQ \
  --host 0.0.0.0 \
  --port 8000
[Startup_log]
INFO 11-26 04:55:25 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:8000
INFO 11-26 04:55:25 [launcher.py:28] Available routes are:
INFO 11-26 04:55:25 [launcher.py:36] Route: /openapi.json, Methods: HEAD, GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /docs, Methods: HEAD, GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /redoc, Methods: HEAD, GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /health, Methods: GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /load, Methods: GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /ping, Methods: POST, GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /version, Methods: GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /pooling, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /score, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /rerank, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /invocations, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /metrics, Methods: GET
INFO: Started server process [200111]
INFO: Waiting for application startup.
INFO: Application startup complete.
[Question]
At first, I used a curl command to test the model:
curl -s -X POST "http://127.0.0.1:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-30B-A3B-AWQ",
    "prompt": "You are an AI assistant. Please answer user questions concisely.\nUser: Hello, please introduce yourself\nAssistant:",
    "max_tokens": 512
  }'
The first request got no response; when I sent the same command again after around 3-4 minutes, it worked.
Then I sent another request in the same way, but the model server did not receive any request. The client was stuck waiting on the POST:
(base) nanyi@h20-12:~$ curl -N "http://localhost:8000/v1/chat/completions" -H "Content-Type: application/json" -d '{"model": "Qwen3-30B-A3B-AWQ", "messages": [{"role": "user", "content": "Hello, please introduce yourself"}], "max_tokens": 512, "stream": true}'
Nothing appeared in the model server's log.
I deployed the same model on a Jetson Orin AGX (64 GB) and an NVIDIA 4090 (48 GB), and both worked well. Is there something specific to the H20, or a mistake in my deployment? I also tried upgrading vLLM to 0.11 on the H20, and the problem was still there.
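To tell a genuine server-side hang apart from a slow first response, it can help to send the request with an explicit client-side timeout. Below is a minimal sketch using only the Python standard library; the base URL, model name, and timeout value are assumptions matching the setup described above, and `build_request`/`probe` are hypothetical helper names:

```python
import json
import socket
import urllib.error
import urllib.request

def build_request(base_url, model, prompt, max_tokens=512):
    """Build a /v1/completions request mirroring the curl test above."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        base_url + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def probe(base_url, model, prompt, timeout=60):
    """Return (status, body), or (None, reason) on timeout or connection error."""
    req = build_request(base_url, model, prompt)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except socket.timeout:
        return None, "timed out after %ds (request likely hung)" % timeout
    except urllib.error.URLError as e:
        return None, "connection failed: %s" % e.reason

# Example against the server started above (hypothetical host/port):
# print(probe("http://127.0.0.1:8000", "Qwen3-30B-A3B-AWQ", "Hello"))
```

If `probe` times out while the server log stays silent, the request is stalling before it reaches the engine, which matches the symptom reported here.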

Your issue is likely related to the NVIDIA H20's hardware or CUDA environment rather than vLLM itself. The vLLM server starts and exposes all expected routes, but requests hang or are delayed, which does not happen on the Jetson Orin AGX or the 4090. This points to a system-level or driver issue, possibly with CUDA, NCCL, or network configuration on the H20 machine. There are no known vLLM bugs specific to the H20 in the provided data sources.

Recommended steps: (1) Check that your CUDA, NCCL, and driver versions are fully compatible with the H20 and vLLM; (2) Ensure no network/firewall issues are blocking localhost communication; (3) Monitor GPU and CPU utilization during requests; (4) Review vLLM troubleshooting for hangs and slow model loading, especially regarding shared filesystems and memory swapping; (5) Try running with increased logging (export VLLM_LOGGING_LEVEL=DEBUG) to identify where the process stalls. See vLLM Troubleshooting and GitHub issue 5267 for similar symptoms and debugging tips.
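For step (5), a debug launch could look like the sketch below. `VLLM_LOGGING_LEVEL`, `NCCL_DEBUG`, and `CUDA_LAUNCH_BLOCKING` are real vLLM/NCCL/CUDA knobs; the model path and flags are simply copied from the start script above, so adjust them to your setup:

```shell
# Debug variant of the start script: surface vLLM, NCCL, and CUDA diagnostics.
export VLLM_LOGGING_LEVEL=DEBUG   # verbose vLLM engine logging
export NCCL_DEBUG=INFO            # surface NCCL/driver problems in the log
export CUDA_LAUNCH_BLOCKING=1     # make CUDA errors synchronous (slower)

python3 -m vllm.entrypoints.openai.api_server \
  --model /home/nanyi/models/Qwen3-30B-A3B-AWQ \
  --dtype float16 \
  --max-model-len 28000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.92 \
  --served-model-name Qwen3-30B-A3B-AWQ \
  --host 0.0.0.0 --port 8000
```

While a request is stuck, running `curl -m 5 http://127.0.0.1:8000/health` and `nvidia-smi` from another terminal shows whether the server process is still responsive and whether the GPU is actually busy.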

Would you like a step-by-step checklist for debugging this on your H20 system?


I've upgraded my NVIDIA driver and CUDA to the latest versions matching the latest nvidia-fabricmanager release (575.57.08 + CUDA 12.9), then upgraded vLLM to 0.11. It works well now.