Problem Description
When deploying the Qwen3.5-122B-A10B model on 8 Ascend 910B NPUs (not single card) via the Ascend-adapted version of vLLM, testing model accuracy with a 1,319-sample gsm8k_gen dataset results in critical issues:
- The client, configured with 256 concurrency, sends all requests at once, but the server's `Running reqs` peaks at only ~60 (stabilizing at ~27), far below the configured concurrency;
- After the server processes part of the requests, `Running reqs` and KV Cache usage drop to 0. The client process then freezes at 8% completion (107/1319), with no accuracy results output and no error logs;
- Hardware issues have been ruled out (8× Ascend 910B with 32 GB VRAM each, sufficient memory), and no request timeouts or crashes were observed on the server side.
Environment Details
- Model: Qwen3.5-122B-A10B
- Framework: vLLM (Ascend-adapted version)
- Hardware: 8× Ascend 910B NPUs (32 GB VRAM per card)
- Key Configurations:
  - Server: `--device npu --disable-multiprocessing --kv-cache-dtype bfloat16 --max-num-seqs 128 --max-model-len 1024 --tensor-parallel-size 8`
  - Client (AISBench): 256 concurrency, `request_rate=0` (all requests sent simultaneously); accuracy calculation relies on the full response set
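For reference, the server flags above correspond to a launch command along these lines. This is a sketch only: the entrypoint module and model path are placeholders and may differ in the Ascend-adapted vLLM build.

```shell
# Hypothetical launch command reconstructed from the flags listed above;
# the model path is a placeholder.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/Qwen3.5-122B-A10B \
    --device npu \
    --disable-multiprocessing \
    --kv-cache-dtype bfloat16 \
    --max-num-seqs 128 \
    --max-model-len 1024 \
    --tensor-parallel-size 8
```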
Troubleshooting Attempts
- Reset Ascend NPU resources and restarted the vLLM server; the issue persists;
- Reduced server `max-num-seqs` to 80 and client concurrency to 60; the process still freezes;
- Confirmed the KV Cache usage drop itself is normal (requests are processed sequentially), but the client freezes waiting for the incomplete responses;
- Verified request token sizes (gsm8k samples: ~100-500 tokens each); no oversized requests that could cause OOM.
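The AISBench internals are not visible here, but one way to isolate the client-side freeze is to rerun the request loop with a bounded semaphore and a per-request timeout, so a stuck request fails loudly instead of stalling the progress bar indefinitely. A minimal sketch (all names hypothetical; `coros` would be factories that issue the actual HTTP requests):

```python
import asyncio

async def run_with_limits(coros, max_concurrency=60, timeout_s=120):
    """Run request-coroutine factories with bounded concurrency and a
    per-request timeout, so a hung request surfaces as a timeout marker
    instead of freezing the whole run."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(factory, idx):
        async with sem:
            try:
                return idx, await asyncio.wait_for(factory(), timeout_s)
            except asyncio.TimeoutError:
                return idx, None  # record the timeout rather than hang

    tasks = [asyncio.create_task(guarded(f, i)) for i, f in enumerate(coros)]
    return await asyncio.gather(*tasks)
```

Requests that exceed `timeout_s` come back as `(index, None)`, which makes it possible to see exactly which of the 1,319 samples never received a response.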
Core Questions
- Why does the client stop sending the remaining requests (only 8% complete) and freeze after the server's `Running reqs` and KV Cache usage drop to 0?
- How should vLLM/client parameters be configured to fully process all 1,319 samples and output accuracy results?
- What are the optimal concurrency/KV Cache configurations for Qwen3.5-122B-A10B on 8 Ascend 910B NPUs?
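On the last question, a useful starting point is estimating how many tokens fit in the per-card KV cache budget, since that bounds the effective concurrency at `max-model-len` 1024. The exact layer/head counts of Qwen3.5-122B-A10B are not stated in this issue, so the numbers below are placeholder assumptions; only the arithmetic is the point:

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim,
                             dtype_bytes=2, tp_size=8):
    """Approximate per-card KV cache bytes for one token:
    2 (K and V) * layers * KV heads * head dim * dtype size,
    split across tensor-parallel ranks.  dtype_bytes=2 matches
    the bfloat16 KV cache configured above."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes // tp_size

def max_cached_tokens(free_bytes_per_card, per_token_bytes):
    """How many tokens the per-card KV cache budget can hold."""
    return free_bytes_per_card // per_token_bytes

# Placeholder architecture values (NOT the real Qwen3.5-122B-A10B figures):
per_tok = kv_cache_bytes_per_token(num_layers=48, num_kv_heads=8, head_dim=128)
tokens = max_cached_tokens(10 * 2**30, per_tok)  # assume ~10 GiB free per card
```

Dividing the resulting token budget by `max-model-len` gives a rough upper bound on how many full-length sequences can be resident at once, which helps sanity-check whether `max-num-seqs` 128 is even reachable on this setup.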
Supplementary Log Details
- Server logs: `Running reqs` gradually drops from 27 to 0, `Waiting reqs` remains 0 throughout, and generation throughput stabilizes at 270+ tokens/s;
- Client logs: request rate 0 (all requests sent simultaneously); progress bar stuck at 107/1319 (8%).