On 8 Ascend 910B cards serving Qwen3.5-122B-A10B with vLLM, the client freezes at 8% progress during an accuracy test: the server stops receiving new requests after Running reqs and KV cache usage fall to 0.

Problem Description

When deploying the Qwen3.5-122B-A10B model on 8 Ascend 910B NPUs (an 8-card deployment, not single-card) via the Ascend-adapted version of vLLM, an accuracy test against the 1,319-sample gsm8k_gen dataset runs into the following critical issues:

  1. The client is configured with 256 concurrency and sends all requests at once, but the server's Running reqs peaks at only 60+ and stabilizes at ~27, far below both the configured concurrency and the server's --max-num-seqs of 128;

  2. After the server processes part of the requests, Running reqs and KV cache usage drop to 0. The client process then freezes at 8% completion (107/1319), with no accuracy results output and no error logs;

  3. Hardware issues have been ruled out (8× Ascend 910B with 32 GB VRAM each, sufficient memory), and no request timeouts or crashes were observed on the server side.

Environment Details

  • Model: Qwen3.5-122B-A10B

  • Framework: vLLM (Ascend-adapted version)

  • Hardware: 8× Ascend 910B NPUs (32GB VRAM per card)

  • Key Configurations:

    • Server: --device npu --disable-multiprocessing --kv-cache-dtype bfloat16 --max-num-seqs 128 --max-model-len 1024 --tensor-parallel-size 8 (see the launch sketch after this list)

    • Client (AISBench): 256 concurrency, request_rate=0 (all requests sent simultaneously); the accuracy calculation relies on receiving the full response set
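
For reference, the full server launch looks roughly like the sketch below. This assumes the Ascend-adapted build keeps upstream vLLM's `vllm serve` entry point and flag names, which is worth verifying, since adapted builds sometimes rename flags or entry points.

```bash
# Hypothetical launch command; the entry point and flag names follow upstream
# vLLM and may differ in the Ascend-adapted build.
vllm serve Qwen3.5-122B-A10B \
  --device npu \
  --disable-multiprocessing \
  --kv-cache-dtype bfloat16 \
  --max-num-seqs 128 \
  --max-model-len 1024 \
  --tensor-parallel-size 8 \
  --port 8000
```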

Troubleshooting Attempts

  1. Reset the Ascend NPU resources and restarted the vLLM server; the issue persists;

  2. Reduced the server's max-num-seqs to 80 and the client's concurrency to 60; the process still freezes;

  3. Confirmed that KV cache usage dropping to 0 is expected once all accepted requests finish; the client nevertheless freezes because the remaining responses never arrive;

  4. Verified request token sizes (gsm8k samples: ~100-500 tokens each); there are no oversized requests that could cause OOM.
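
An additional check that helps localize the hang: while the benchmark client is stuck at 107/1319, probe the server out of band. If the probes below succeed, the server is alive and the stall is on the client side; if they hang as well, the engine itself is stalled. This assumes the standard vLLM OpenAI-compatible server listening on port 8000 (upstream vLLM exposes /health and /v1/completions); adjust the host, port, and served model name to your deployment.

```bash
# Liveness probe; -f makes curl fail on HTTP errors, -m caps the wait at 30 s.
curl -sSf -m 30 http://127.0.0.1:8000/health && echo "health OK"

# One small out-of-band completion with an explicit timeout, so a stalled
# engine fails loudly here instead of hanging forever.
curl -sS -m 60 http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.5-122B-A10B", "prompt": "2+2=", "max_tokens": 8}'
```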

Core Questions

  1. Why does the client stop sending the remaining requests and freeze at only 8% completion after the server's Running reqs and KV cache usage drop to 0?

  2. How should the vLLM server and client parameters be configured so that all 1,319 samples are processed and accuracy results are produced?

  3. What are the optimal concurrency and KV cache settings for Qwen3.5-122B-A10B on 8 Ascend 910B NPUs?

Supplementary Log Details

  • Server logs: Running reqs gradually drops from 27 to 0; Waiting reqs remains 0 throughout; generation throughput stabilizes at 270+ tokens/s while requests are still running;

  • Client logs: request rate 0 (all requests sent simultaneously); progress bar stuck at 107/1319 [8%].
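
The scheduler state can also be sampled from the Prometheus endpoint instead of the console logs. Upstream vLLM exposes gauges such as vllm:num_requests_running and vllm:num_requests_waiting on /metrics; whether the Ascend-adapted build exposes the same metric names is an assumption to verify.

```bash
# Sample the scheduler gauges every 5 s while the benchmark runs. Waiting
# staying at 0 while Running decays to 0 would confirm that the remaining
# requests never even reach the server's queue.
while true; do
  curl -sS http://127.0.0.1:8000/metrics \
    | grep -E '^vllm:num_requests_(running|waiting)'
  sleep 5
done
```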

The pattern described, where the client freezes at 8% completion and the server's running requests and KV cache usage drop to zero despite no hardware or OOM errors, suggests a deadlock or scheduling bug in the vLLM engine, especially under high concurrency and multi-device (8× Ascend 910B) settings. It is consistent with known vLLM issues on both GPU and NPU backends in which the engine stalls with requests stuck in the queue, no errors, and no further progress, often due to internal deadlocks or resource exhaustion in the scheduling or KV cache management logic. Lowering concurrency or max-num-seqs does not always resolve the issue, and the problem is not unique to this hardware or dataset size; similar symptoms have been reported with other large models and multi-device deployments.

There is no explicit solution in the available sources for Ascend NPUs, but general recommendations include: (1) reducing --max-num-seqs to match the actual concurrency needs, (2) ensuring --max-model-len and the KV cache settings do not overcommit memory, and (3) retesting at lower concurrency to see whether the issue persists. If the problem remains, it is likely a vLLM engine bug or a limitation of the Ascend backend, and it is worth filing a detailed issue with logs for the vLLM maintainers. No documented optimal concurrency or KV cache settings for Qwen3.5-122B-A10B on 8× Ascend 910B appear in the available sources (see the related discussion at https://github.com/vllm-project/vllm/issues/37729). A sketch of such a retest follows.
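
As a concrete starting point, the retest below applies those recommendations: cap --max-num-seqs to the client's real concurrency, keep --max-model-len tight, and leave memory headroom. The flag names follow upstream vLLM (--gpu-memory-utilization in particular may be named differently on the NPU backend), and the crude curl loop stands in for AISBench only to illustrate the pattern of bounded concurrency plus per-request timeouts, so a stuck request fails loudly instead of silently freezing the progress bar.

```bash
# Shell 1: hypothetical retest launch; verify flag names against the
# Ascend-adapted build before use.
vllm serve Qwen3.5-122B-A10B \
  --device npu \
  --tensor-parallel-size 8 \
  --max-model-len 1024 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.90

# Shell 2: crude stand-in for the benchmark client. At most 32 in-flight
# requests (-P 32) and a 120 s per-request timeout (-m 120), so stalls surface
# as nonzero curl exit codes rather than an indefinitely stuck run.
seq 1 1319 | xargs -P 32 -I{} \
  curl -sS -m 120 http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen3.5-122B-A10B", "prompt": "ping {}", "max_tokens": 16}' \
    -o /dev/null -w "req {}: http %{http_code}\n"
```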
