H20 running Qwen3-30B-A3B-AWQ failed

[Environment]
Hardware: 8*H20
CUDA: 12.4
python_env: Python 3.10, vLLM installed from the wheel (https://github.com/vllm-project/vllm/releases/download/v0.8.5/vllm-0.8.5-cp38-abi3-manylinux1_x86_64.whl)
[Start_Scripts]
export VLLM_DISABLE_GRAPH_CAPTURE=1
python3 -m vllm.entrypoints.openai.api_server \
  --model /home/nanyi/models/Qwen3-30B-A3B-AWQ \
  --dtype float16 \
  --max-model-len 28000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.92 \
  --served-model-name Qwen3-30B-A3B-AWQ \
  --host 0.0.0.0 \
  --port 8000
[Startup_log]
INFO 11-26 04:55:25 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:8000
INFO 11-26 04:55:25 [launcher.py:28] Available routes are:
INFO 11-26 04:55:25 [launcher.py:36] Route: /openapi.json, Methods: HEAD, GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /docs, Methods: HEAD, GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /redoc, Methods: HEAD, GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /health, Methods: GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /load, Methods: GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /ping, Methods: POST, GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /version, Methods: GET
INFO 11-26 04:55:25 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /pooling, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /score, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /rerank, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /invocations, Methods: POST
INFO 11-26 04:55:25 [launcher.py:36] Route: /metrics, Methods: GET
INFO: Started server process [200111]
INFO: Waiting for application startup.
INFO: Application startup complete.
[Question]
At first, I used a curl command to test the model:
curl -s -X POST "http://127.0.0.1:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-30B-A3B-AWQ",
    "prompt": "You are an AI assistant. Please answer user questions concisely.\nUser: Hello, please introduce yourself\nAssistant:",
    "max_tokens": 512
  }'
The first request got no response; when I sent the same command again after around 3-4 minutes, it worked.
Then I sent another request in the same way, but the model server did not receive any request. The client was stuck waiting on the POST:
(base) nanyi@h20-12:~$ curl -N "http://localhost:8000/v1/chat/completions" -H "Content-Type: application/json" -d '{"model": "Qwen3-30B-A3B-AWQ", "messages": [{"role": "user", "content": "Hello, please introduce yourself"}], "max_tokens": 512, "stream": true}'
Nothing appeared in the model server's log.
I deployed the same model on a Jetson Orin AGX (64 GB) and an NVIDIA 4090 (48 GB), and both worked well. Is there something specific to the H20, or a mistake in my deployment? I also tried upgrading vLLM to 0.11 on the H20, and the problem was still there.
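To tell a genuine server-side hang apart from a slow first response, it can help to send the request with an explicit client-side timeout. Below is a minimal sketch using only the Python standard library; the base URL, model name, and timeout value are assumptions matching the setup described above, and `build_request`/`probe` are hypothetical helper names:

```python
import json
import socket
import urllib.error
import urllib.request

def build_request(base_url, model, prompt, max_tokens=512):
    """Build a /v1/completions request mirroring the curl test above."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        base_url + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def probe(base_url, model, prompt, timeout=60):
    """Return (status, body), or (None, reason) on timeout or connection error."""
    req = build_request(base_url, model, prompt)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except socket.timeout:
        return None, "timed out after %ds (request likely hung)" % timeout
    except urllib.error.URLError as e:
        return None, "connection failed: %s" % e.reason

# Example against the server started above (hypothetical host/port):
# print(probe("http://127.0.0.1:8000", "Qwen3-30B-A3B-AWQ", "Hello"))
```

If `probe` times out while the server log stays silent, the request is stalling before it reaches the engine, which matches the symptom reported here.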

Your issue is likely related to the NVIDIA H20's hardware or CUDA environment rather than vLLM itself. The vLLM server starts and exposes all expected routes, but requests hang or are delayed, which does not happen on the Jetson Orin AGX or the 4090. This points to a system-level or driver issue, possibly with CUDA, NCCL, or network configuration on the H20 machine. There are no known vLLM bugs specific to the H20 in the provided data sources.

Recommended steps: (1) Check that your CUDA, NCCL, and driver versions are fully compatible with the H20 and vLLM; (2) Ensure no network/firewall issues are blocking localhost communication; (3) Monitor GPU and CPU utilization during requests; (4) Review vLLM troubleshooting for hangs and slow model loading, especially regarding shared filesystems and memory swapping; (5) Try running with increased logging (export VLLM_LOGGING_LEVEL=DEBUG) to identify where the process stalls. See vLLM Troubleshooting and GitHub issue 5267 for similar symptoms and debugging tips.
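For step (5), a debug launch could look like the sketch below. `VLLM_LOGGING_LEVEL`, `NCCL_DEBUG`, and `CUDA_LAUNCH_BLOCKING` are real vLLM/NCCL/CUDA knobs; the model path and flags are simply copied from the start script above, so adjust them to your setup:

```shell
# Debug variant of the start script: surface vLLM, NCCL, and CUDA diagnostics.
export VLLM_LOGGING_LEVEL=DEBUG   # verbose vLLM engine logging
export NCCL_DEBUG=INFO            # surface NCCL/driver problems in the log
export CUDA_LAUNCH_BLOCKING=1     # make CUDA errors synchronous (slower)

python3 -m vllm.entrypoints.openai.api_server \
  --model /home/nanyi/models/Qwen3-30B-A3B-AWQ \
  --dtype float16 \
  --max-model-len 28000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.92 \
  --served-model-name Qwen3-30B-A3B-AWQ \
  --host 0.0.0.0 --port 8000
```

While a request is stuck, running `curl -m 5 http://127.0.0.1:8000/health` and `nvidia-smi` from another terminal shows whether the server process is still responsive and whether the GPU is actually busy.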

Would you like a step-by-step checklist for debugging this on your H20 system?


I've upgraded my NVIDIA driver and CUDA to the latest versions matching the latest nvidia-fabricmanager release (575.57.08 + CUDA 12.9), then upgraded vLLM to 0.11. It works well now.