[image] Execution hangs inside the execute_model function. The model forward pass has already finished, but it gets stuck on the line sampled_token_ids = sampler_output.sampled_token_ids. This is the launch command: vllm serve /models/DeepSeek-R1-W4A8-V2 --port 20009 --trust-remote-code --dtype bfloat16 -q slimquant_w4a8_marlin --max-model-len 4000 -tp 1 -dp 8 --gpu-memory-utilization 0.95 --max-num-seq…

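Based on where you say it hangs (sampled_token_ids = sampler_output.sampled_token_ids), this step only assigns the sampler's output token-ID tensor to a variable. A stall here is usually not caused by the Python assignment itself. More likely, sampler_output.sampled_token_ids is a tensor on the GPU, and the stall involves GPU-to-CPU synchronization or a data copy, which becomes more noticeable with large batches, high concurrency, or tight GPU memory. There is related discussion around vLLM's model execution, where sampling and cache synchronization affect performance [see the source comments and related implementation]. In addition, max_num_batched_tok…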
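One way to check whether the assignment line is only where an earlier asynchronous GPU operation surfaces: CUDA kernels are launched asynchronously, so the host thread typically blocks at the first operation that needs a result, not at the kernel that is actually slow or stuck. The sketch below is a minimal, hypothetical timing helper (synced_timer and device_sync are illustrative names, not part of vLLM). It calls torch.cuda.synchronize() around each stage so the wait is charged to the stage that caused it, and it falls back to a no-op when PyTorch or CUDA is unavailable. The two stand-in stage functions are placeholders for the real forward pass and sampler call.

```python
import time
from contextlib import contextmanager

try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:  # PyTorch not installed: the sketch still runs, CPU only
    torch = None
    HAS_CUDA = False


def device_sync():
    """Block until all queued GPU work has finished (no-op without CUDA)."""
    if HAS_CUDA:
        torch.cuda.synchronize()


@contextmanager
def synced_timer(label, timings):
    """Time a block, synchronizing before and after it."""
    device_sync()  # drain earlier async work so it is not charged to this block
    start = time.perf_counter()
    yield
    device_sync()  # wait for this block's own kernels before stopping the clock
    timings[label] = time.perf_counter() - start


# Hypothetical stand-ins for the real stages (model forward pass, sampler).
def fake_forward():
    return [i * 2 for i in range(1000)]


def fake_sample(hidden):
    return max(hidden)


timings = {}
with synced_timer("forward", timings):
    hidden = fake_forward()
with synced_timer("sample", timings):
    token = fake_sample(hidden)

for label, seconds in timings.items():
    print(f"{label}: {seconds * 1e3:.3f} ms")
```

If the forward stage never returns once the synchronize call is in place, the hang is probably in model execution rather than in the sampler line itself. One possible cause, stated here only as an assumption, is a collective operation waiting on another DP/EP rank. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 is another way to make kernel launches synchronous while debugging.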


Dp8ep8: no output when executing the model; the cause is a hang in the sampler

General

yangshanjun November 20, 2025, 6:04am 3

@RunLLM Specifically, it hangs on this line inside sample. This line is only a data type conversion, so it shouldn't hang, right?
