DP8 EP8: model execution produces no output because it is stuck in the sampler

yangshanjun · November 20, 2025, 4:44am

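Execution is stuck inside the execute_model function. The model forward pass has already finished, but it hangs at the line sampled_token_ids = sampler_output.sampled_token_ids.

This is the launch command: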

vllm serve /models/DeepSeek-R1-W4A8-V2 --port 20009 --trust-remote-code --dtype bfloat16 -q slimquant_w4a8_marlin --max-model-len 4000 -tp 1 -dp 8 --gpu-memory-utilization 0.95 --max-num-seqs 128 --block-size 64 --disable-log-requests --enable_expert_parallel --max-num-batched-tokens 192 --max_seq_len_to_capture 192 --kv-cache-dtype fp8

RunLLM · November 20, 2025, 4:45am

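The line you describe as stuck (sampled_token_ids = sampler_output.sampled_token_ids) only assigns the sampler's token-ID tensor to a variable. A stall there is usually not caused by the Python assignment itself. More likely, sampler_output.sampled_token_ids is a tensor on the GPU, and the stall reflects a GPU-to-CPU synchronization or data copy, which becomes more noticeable with large batches, high concurrency, or tight GPU memory. vLLM's model execution path has related discussion: sampling and cache-synchronization steps affect performance [see the source comments and related implementation]. In addition, max_num_batched_tokens is set quite low (192 here), which under high concurrency can cause frequent scheduling and lower efficiency, further amplifying synchronization latency [see the discussion of max_num_batched_tokens].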
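To make the point about synchronization concrete, here is a small analogy in plain Python. It is not vLLM or CUDA code: a one-worker thread pool stands in for the GPU stream, and the names slow_kernel, timed, and run_step are hypothetical. It only illustrates how submitting work can return immediately while the waiting time is charged to the first line that needs the result.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_kernel(seconds):
    """Hypothetical stand-in for GPU work that was launched earlier."""
    time.sleep(seconds)
    return [1, 2, 3]


def timed(label, fn, timings):
    """Run fn() and record its wall-clock duration under label."""
    start = time.perf_counter()
    result = fn()
    timings[label] = time.perf_counter() - start
    return result


def run_step(kernel_seconds=0.2):
    timings = {}
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Submitting returns right away, like an asynchronous kernel launch.
        future = timed("launch", lambda: pool.submit(slow_kernel, kernel_seconds), timings)
        # Reading the result blocks until the earlier work has finished,
        # so the wait is attributed to this harmless-looking line.
        token_ids = timed("read_result", future.result, timings)
    return token_ids, timings


if __name__ == "__main__":
    ids, timings = run_step()
    print(ids, timings)
```

If the analogy holds for your run, the line that appears stuck is where the wait becomes visible, and the work that is actually slow or blocked was issued before it.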

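Suggested checks:

Check for implicit GPU-to-CPU synchronization (such as .cpu() or .numpy() calls); a profiling tool can help locate the bottleneck.
Increase max_num_batched_tokens moderately to reduce scheduling frequency and improve throughput.
Check GPU memory utilization and batch size settings to avoid frequent KV cache swapping or overflow.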
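As one possible way to confirm where a process is stuck, the sketch below uses Python's standard faulthandler module to dump the stack if a call does not return within a timeout. The helper run_with_watchdog and the stand-in suspect_step are hypothetical names, not part of vLLM. The sketch shows only the watchdog pattern, which in a real deployment would have to be armed inside each worker process.

```python
import faulthandler
import tempfile
import time


def suspect_step(seconds):
    """Hypothetical stand-in for the call that appears to hang."""
    time.sleep(seconds)


def run_with_watchdog(fn, timeout_s, *args):
    """Call fn(*args); if it runs longer than timeout_s, capture a stack dump.

    Returns the dump text, or an empty string if fn returned in time.
    """
    with tempfile.TemporaryFile(mode="w+") as log:
        # Arm the watchdog: after timeout_s seconds, the tracebacks of all
        # threads are written to the log file.
        faulthandler.dump_traceback_later(timeout_s, file=log)
        try:
            fn(*args)
        finally:
            # Disarm so no dump is written once the call has returned.
            faulthandler.cancel_dump_traceback_later()
        log.seek(0)
        return log.read()


if __name__ == "__main__":
    print(run_with_watchdog(suspect_step, 0.2, 0.6))
```

The dump shows which Python frame each thread is in, which helps confirm whether the process is really waiting at the sampler line. It does not show which GPU kernel or communication call is still pending, so a GPU profiler would still be needed for that.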

Would you like more detailed analysis steps or a code-level method for locating the stall?

Sources:

vllm.v1.worker.gpu_model_runner.GPUModelRunner._bookkeeping_sync source implementation

Hint: Mention @RunLLM in the post for followups.

yangshanjun · November 20, 2025, 6:04am

@RunLLM Specifically, it is stuck on this line inside sample. The line is only a data type conversion, so it shouldn't hang, should it?
