When I run the model, the CPU sits at 100% while the NPU briefly hits 10% and then drops right back down. How can I tell whether the model is actually using the NPU or the CPU?
You can tell whether the model is actually running on the NPU or the CPU by checking vLLM's runtime logs together with system monitoring tools. Running vLLM on an NPU (e.g. Huawei Ascend) requires specific environment configuration and launch parameters, and the logs will contain hints such as "Using ASCEND_TORCH backend" indicating the model was loaded onto the NPU; otherwise the backend is typically reported as CPU or GPU. Beyond that, low NPU utilization combined with 100% CPU usage usually means inference is running mainly on the CPU and the NPU is not being used effectively.
To confirm, monitor the NPU and CPU separately with tools such as npu-smi and top: if NPU utilization stays very low while the CPU is pinned at 100%, the model is most likely running on the CPU. For detailed configuration and support status, see the vLLM Ascend NPU support PR and its discussion.
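As a rough illustration of the log-checking step, here is a minimal sketch (a hypothetical helper, not part of vLLM) that scans log text for device-related keywords to guess which backend the engine initialized. The keyword lists are assumptions based on commonly seen vLLM log messages, not an exhaustive vocabulary:

```python
# Hypothetical helper: scan vLLM log text for device/backend hints.
# The keyword lists below are assumptions, not official vLLM vocabulary.
DEVICE_HINTS = {
    "npu": ["npu", "ascend"],
    "gpu": ["cuda", "gpu kv cache", "gpu blocks"],
    "cpu": ["cpu kv cache", "cpu blocks"],
}

def guess_backends(log_text: str) -> set[str]:
    """Return the set of device families mentioned in the log text."""
    lowered = log_text.lower()
    return {
        device
        for device, keywords in DEVICE_HINTS.items()
        if any(keyword in lowered for keyword in keywords)
    }

sample = "INFO [executor_base.py:113] # npu blocks: 3386, # CPU blocks: 292"
print(sorted(guess_backends(sample)))  # ['cpu', 'npu']
```

If the result never contains "npu" anywhere in your startup logs, that is a strong sign the engine was not initialized on the NPU at all.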
Would you like more detailed troubleshooting steps or configuration instructions?
Sources:
INFO 08-06 10:03:42 [metrics.py:417] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 7.9 tokens/s, Running: 20 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 15.8%, CPU KV cache usage: 0.0%.
INFO 08-06 10:03:49 [metrics.py:417] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 8.5 tokens/s, Running: 20 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 15.8%, CPU KV cache usage: 0.0%.
INFO 08-06 10:03:54 [metrics.py:417] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 7.6 tokens/s, Running: 20 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 15.8%, CPU KV cache usage: 0.0%.
INFO 08-06 10:04:00 [metrics.py:417] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 7.8 tokens/s, Running: 20 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 15.9%, CPU KV cache usage: 0.0%.
INFO 08-06 10:04:05 [metrics.py:417] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 7.7 tokens/s, Running: 20 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 15.9%, CPU KV cache usage: 0.0%.
INFO 08-06 10:04:12 [metrics.py:417] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 8.1 tokens/s, Running: 20 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 15.9%, CPU KV cache usage: 0.0%.
INFO 08-06 10:04:19 [metrics.py:417] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 8.7 tokens/s, Running: 20 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 15.9%, CPU KV cache usage: 0.0%.
INFO 08-06 10:04:26 [metrics.py:417] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 8.1 tokens/s, Running: 20 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 15.9%, CPU KV cache usage: 0.0%.
Is this using the NPU?
From the log lines you provided alone, it is not possible to tell whether the NPU was used. They only show KV cache usage on the GPU and CPU, with no NPU-related fields or messages. If the model were actually running on an NPU, the logs would normally contain "NPU" or "ASCEND", and KV cache usage would be reported for the NPU rather than for GPU/CPU.
Check your launch arguments, environment variables, and the more detailed startup logs to confirm that an NPU device was specified, and monitor NPU utilization with a tool such as npu-smi. If no NPU-related information appears, the current inference run is not using the NPU. See the referenced discussion for details.
Do you need further help troubleshooting the NPU configuration?
Sources:
INFO 08-05 20:01:03 [default_loader.py:272] Loading weights took 0.23 seconds
INFO 08-05 20:01:04 [model_runner.py:1000] Loading model weights took 3.2153 GB
INFO 08-05 20:01:12 [executor_base.py:113] # npu blocks: 3386, # CPU blocks: 292
INFO 08-05 20:01:12 [executor_base.py:118] Maximum concurrency for 40960 tokens per request: 10.58x
INFO 08-05 20:01:13 [llm_engine.py:428] init engine (profile, create kv cache, warmup model) took 8.76 seconds
WARNING 08-05 20:01:13 [config.py:1363] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with --generation-config vllm.
INFO 08-05 20:01:13 [serving_chat.py:118] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 08-05 20:01:13 [serving_completion.py:66] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 08-05 20:01:13 [api_server.py:1349] Starting vLLM API server 0 on http://0.0.0.0:8020
The line "# npu blocks: 3386, # CPU blocks: 292" confirms the model is using the NPU: the number of NPU KV cache blocks far exceeds the number of CPU blocks, which means the KV cache is allocated primarily on the NPU and inference runs there (the CPU blocks are reserved for swapping). See the referenced discussion for details.
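This check can be automated with a small sketch that parses the block counts out of the executor log line. The regex pattern below is an assumption based on the observed line format, not an official vLLM interface:

```python
import re

# Parse KV cache block counts from a vLLM executor log line such as:
#   "... [executor_base.py:113] # npu blocks: 3386, # CPU blocks: 292"
# The pattern is an assumption based on that observed format.
BLOCK_RE = re.compile(r"#\s*(\w+)\s+blocks:\s*(\d+)", re.IGNORECASE)

def kv_block_counts(line: str) -> dict[str, int]:
    """Return {device: block_count} parsed from one log line."""
    return {dev.lower(): int(n) for dev, n in BLOCK_RE.findall(line)}

line = "INFO 08-05 20:01:12 [executor_base.py:113] # npu blocks: 3386, # CPU blocks: 292"
counts = kv_block_counts(line)
# The device holding the bulk of the KV cache is where inference runs.
main_device = max(counts, key=counts.get)
print(counts, main_device)  # {'npu': 3386, 'cpu': 292} npu
```

If the device with the largest block count is "npu", the KV cache (and hence inference) lives on the NPU; CPU blocks are only the swap area.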
Would you like more detail on NPU utilization monitoring or configuration?
Sources: