Following the Qwen3.5 Usage Guide on H20, but cannot host Qwen3.5-27B

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Following the guide:

uv venv
source .venv/bin/activate
uv pip install -U vllm \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly
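
For reference, the stack the venv ended up with can be confirmed like this (prints the PyTorch, CUDA, and NCCL versions as PyTorch itself sees them, plus the visible GPU count):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version(), torch.cuda.device_count())"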

CUDA_VISIBLE_DEVICES=2,3 vllm serve /home/jovyan/hlj/ms-swift-main/models/Qwen/Qwen3___5-27B --port 5510 --tensor-parallel-size 2 --max-model-len 8192 --reasoning-parser qwen3 --language-model-only --gpu-memory-utilization 0.2 --served-model-name qwen3.5-27b

With two GPUs (tensor-parallel-size 2), it hangs at "INFO 02-27 19:45:42 [pynccl.py:111] vLLM is using nccl==2.27.5" and shows no progress for half an hour. With a single GPU, the model visibly loads but then errors out. The packages in the virtual environment are as follows:

Package Version


aiohappyeyeballs 2.4.6
aiohttp 3.11.13
aiosignal 1.3.2
airportsdata 20250224
annotated-types 0.7.0
anyio 4.8.0
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
astor 0.8.1
asttokens 3.0.0
async-lru 2.0.4
async-timeout 5.0.1
attrs 25.1.0
babel 2.17.0
beautifulsoup4 4.13.3
blake3 1.0.4
bleach 6.2.0
blinker 1.9.0
cachetools 7.0.1
cbor2 5.8.0
certifi 2025.1.31
cffi 1.17.1
charset-normalizer 3.4.1
click 8.1.8
cloudpickle 3.1.1
coloredlogs 15.0.1
comm 0.2.2
compressed-tensors 0.11.0
contourpy 1.3.1
cupy-cuda12x 13.4.0
cycler 0.12.1
debugpy 1.8.12
decorator 5.2.1
defusedxml 0.7.1
depyf 0.19.0
dill 0.3.9
diskcache 5.6.3
distro 1.9.0
dnspython 2.7.0
einops 0.8.1
email_validator 2.2.0
et_xmlfile 2.0.0
exceptiongroup 1.2.2
executing 2.2.0
fastapi 0.115.11
fastapi-cli 0.0.7
fastjsonschema 2.21.1
fastrlock 0.8.3
filelock 3.17.0
Flask 3.1.0
flatbuffers 25.2.10
fonttools 4.56.0
fqdn 1.5.1
frozendict 2.4.7
frozenlist 1.5.0
fsspec 2024.6.1
gguf 0.17.1
gunicorn 23.0.0
h11 0.14.0
hf-xet 1.3.1
httpcore 1.0.7
httptools 0.6.4
httpx 0.28.1
huggingface_hub 0.36.2
humanfriendly 10.0
idna 3.10
importlib_metadata 8.6.1
iniconfig 2.0.0
interegular 0.3.3
ipykernel 6.29.5
ipython 8.33.0
ipywidgets 8.1.5
isoduration 20.11.0
itsdangerous 2.2.0
jedi 0.19.2
Jinja2 3.1.5
jiter 0.13.0
joblib 1.4.2
json5 0.10.0
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
jupyter 1.1.1
jupyter_client 8.6.3
jupyter-console 6.6.3
jupyter_core 5.7.2
jupyter-events 0.12.0
jupyter-lsp 2.2.5
jupyter_server 2.15.0
jupyter_server_terminals 0.5.3
jupyterlab 4.3.5
jupyterlab_pygments 0.3.0
jupyterlab_server 2.27.3
jupyterlab_widgets 3.0.13
kiwisolver 1.4.8
langdetect 1.0.9
lark 1.2.2
llguidance 0.7.30
llvmlite 0.44.0
lm-format-enforcer 0.11.3
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.10.1
matplotlib-inline 0.1.7
mdurl 0.1.2
mistral_common 1.9.1
mistune 3.1.2
mpmath 1.3.0
msgpack 1.1.0
msgspec 0.19.0
multidict 6.1.0
nbclient 0.10.2
nbconvert 7.16.6
nbformat 5.10.4
nest-asyncio 1.6.0
networkx 3.3
ninja 1.13.0
notebook 7.3.2
notebook_shim 0.2.4
numba 0.61.2
numpy 1.26.4
nvidia-cublas-cu12 12.8.4.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12 9.10.2.21
nvidia-cufft-cu12 11.3.3.83
nvidia-cufile-cu12 1.13.1.3
nvidia-curand-cu12 10.3.9.90
nvidia-cusolver-cu12 11.7.3.90
nvidia-cusparse-cu12 12.5.8.93
nvidia-cusparselt-cu12 0.7.1
nvidia-nccl-cu12 2.27.5
nvidia-nvjitlink-cu12 12.8.93
nvidia-nvshmem-cu12 3.3.20
nvidia-nvtx-cu12 12.8.90
onnx 1.17.0
onnx-simplifier 0.4.36
onnxruntime 1.20.1
openai 2.24.0
openai-harmony 0.0.8
opencv-python 4.11.0.86
opencv-python-headless 4.11.0.86
openpyxl 3.1.5
outlines 0.1.11
outlines_core 0.2.11
overrides 7.7.0
packaging 24.2
pandas 2.2.3
pandocfilters 1.5.1
parso 0.8.4
partial-json-parser 0.2.1.1.post5
pexpect 4.9.0
pillow 11.0.0
pip 25.0.1
platformdirs 4.3.6
pluggy 1.5.0
prometheus_client 0.21.1
prometheus-fastapi-instrumentator 7.0.2
prompt_toolkit 3.0.50
propcache 0.3.0
protobuf 5.29.3
psutil 7.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
pybase64 1.4.3
pybind11 2.13.6
pycountry 24.6.1
pycparser 2.22
pydantic 2.12.5
pydantic_core 2.41.5
pydantic-extra-types 2.11.0
Pygments 2.19.1
pyparsing 3.2.1
pytest 8.3.5
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-json-logger 3.2.1
python-multipart 0.0.20
pytz 2025.1
PyYAML 6.0.2
pyzmq 26.2.1
ray 2.54.0
referencing 0.36.2
regex 2024.11.6
requests 2.32.3
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rich 13.9.4
rich-toolkit 0.13.2
rpds-py 0.23.1
safetensors 0.5.3
scikit-learn 1.6.1
scipy 1.15.2
Send2Trash 1.8.3
sentencepiece 0.2.0
setproctitle 1.3.7
setuptools 75.8.2
shellingham 1.5.4
six 1.17.0
sniffio 1.3.1
soundfile 0.13.1
soupsieve 2.6
soxr 1.0.0
stack-data 0.6.3
starlette 0.46.0
sympy 1.14.0
terminado 0.18.1
threadpoolctl 3.5.0
tiktoken 0.9.0
tinycss2 1.4.0
tokenizers 0.22.2
tomli 2.2.1
torch 2.9.0
torchaudio 2.8.0
torchvision 0.23.0
tornado 6.4.2
tqdm 4.67.1
traitlets 5.14.3
transformers 4.57.3
triton 3.5.0
typer 0.15.2
types-python-dateutil 2.9.0.20241206
typing_extensions 4.15.0
typing-inspection 0.4.2
tzdata 2025.1
uri-template 1.3.0
urllib3 2.3.0
uv 0.10.6
uvicorn 0.34.0
uvloop 0.21.0
vllm 0.10.2
watchfiles 1.0.4
wcwidth 0.2.13
webcolors 24.11.1
webencodings 0.5.1
websocket-client 1.8.0
websockets 15.0
Werkzeug 3.1.3
wheel 0.45.1
widgetsnbextension 4.0.13
xformers 0.0.32.post1
xgrammar 0.1.23
yarl 1.18.3
zipp 3.21.0

Your problem: with two GPUs (tensor-parallel-size=2), vLLM hangs at "vLLM is using nccl==2.27.5" and stays unresponsive for a long time; with a single GPU the model loads but then errors out.

Based on the vLLM documentation and many similar issues, the common causes are NCCL communication or environment configuration problems. Possible fixes:

  1. Try setting an environment variable
    Before starting vLLM, run:

    export NCCL_P2P_DISABLE=1

    This setting has repeatedly been reported to fix multi-GPU hangs, especially on H20, A100, and other setups without NVLink (see the combined launch sketch after this list).

  2. Make sure CUDA, driver, and NCCL versions are compatible
    Your CUDA 12.4, driver 570.86.15, and NCCL 2.27.5 should be compatible in theory, but it is better to use the officially recommended CUDA 12.6 or 12.8 environment, and to make sure nvidia-nccl-cu12 matches your PyTorch and vLLM versions.

  3. Check GPU allocation
    Confirm that the GPUs behind CUDA_VISIBLE_DEVICES=2,3 are not occupied by other processes and that the indices are contiguous (in some environments, non-contiguous indices make NCCL initialization fail).

  4. Adjust --gpu-memory-utilization
    You already set it to 0.2, which in theory should not OOM, but tweak it if errors appear.

  5. Enable debug logging
    Set the environment variable VLLM_LOGGING_LEVEL=DEBUG to pinpoint where the hang occurs (combined with step 1 in the sketch below).
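
Putting steps 1 and 5 together, a launch sketch with the suggested variables (NCCL_DEBUG=INFO is an extra, optional NCCL-side log; paths and flags mirror your original command):

    export NCCL_P2P_DISABLE=1        # step 1: disable peer-to-peer transport
    export VLLM_LOGGING_LEVEL=DEBUG  # step 5: verbose vLLM logs
    export NCCL_DEBUG=INFO           # optional: NCCL's own initialization log

    CUDA_VISIBLE_DEVICES=2,3 vllm serve /home/jovyan/hlj/ms-swift-main/models/Qwen/Qwen3___5-27B \
        --port 5510 --tensor-parallel-size 2 --max-model-len 8192 \
        --reasoning-parser qwen3 --language-model-only \
        --gpu-memory-utilization 0.2 --served-model-name qwen3.5-27b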

Please try step 1 (NCCL_P2P_DISABLE) first; it resolves most similar problems. Would you like more detailed troubleshooting steps or an analysis of the single-GPU error?


Hint: Mention @RunLLM in the post for followups.

Same issue. I couldn't serve the Qwen model following the guide.

Hello, and many thanks for the very prompt reply. I tried option 1 above (export NCCL_P2P_DISABLE=1), but it still does not work. I installed by following the Qwen3.5 installation guide:

uv venv
source .venv/bin/activate
uv pip install -U vllm \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly

This vLLM is what Qwen3.5 requires, and the command itself already accounts for compatibility between vllm, torch, nccl, and so on. So my guess is that the problem is my CUDA version. Right?
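
For reference, the driver's maximum supported CUDA runtime and the locally installed toolkit are separate things; both can be checked side by side:

nvidia-smi | head -n 4   # banner shows the driver version and the highest CUDA runtime it supports
nvcc --version           # the locally installed toolkit (12.4 here)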

After upgrading CUDA to 12.9, single-GPU runs and tests fine. But multi-GPU still hangs at the NCCL communication stage, even with export NCCL_P2P_DISABLE=1 and other environment variables set. I also switched to launching via the vllm/vllm-openai:nightly container; again, single-GPU works but multi-GPU does not. Is this because this vLLM version still has a bug in dual-GPU communication on H20? On which devices has multi-GPU serving of the Qwen3.5 series been tested?
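
In case it helps others hitting the same hang: a standalone check outside vLLM can show whether plain PyTorch NCCL communication works across the two GPUs at all (a minimal sketch; nccl_check.py is a hypothetical filename):

cat > nccl_check.py <<'EOF'
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # torchrun supplies rank/world-size env vars
rank = dist.get_rank()
torch.cuda.set_device(rank)               # one GPU per rank on a single node
x = torch.ones(1, device="cuda")
dist.all_reduce(x)                        # hangs here if inter-GPU transport is broken
print(f"rank {rank}: all_reduce -> {x.item()}")  # expect 2.0 with two ranks
dist.destroy_process_group()
EOF

CUDA_VISIBLE_DEVICES=2,3 NCCL_DEBUG=INFO torchrun --nproc-per-node=2 nccl_check.py

If this also hangs, the problem is in the NCCL/driver layer rather than in vLLM itself.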