Hello there,
I am relatively new to this topic, so please pardon any mistakes.
Up to and including version 0.19.1, vLLM used around 96 CPU threads. Starting with version 0.20.0, however, it keeps only two CPU threads busy at 100%. This is disappointing, because the machine I have available for testing has 384 CPU threads (2x AMD Epyc 9655) and 1.5 TB of RAM.
vLLM mentions Triton at startup, so I tried to download and compile the Triton-CPU backend (triton-lang/triton-cpu on GitHub, an experimental CPU backend for Triton), but the compilation terminated with an error that I was unable to resolve.
For completeness, here is the vLLM startup log from that machine:
$ vllm serve Qwen/Qwen3-8B --port 8000 --max-model-len 32768
INFO 06-09 16:43:11 [importing.py:45] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 06-09 16:43:11 [importing.py:69] Triton not installed or not compatible; certain GPU-related functions will not be available.
(APIServer pid=72390) INFO 06-09 16:43:12 [utils.py:344]
(APIServer pid=72390) INFO 06-09 16:43:12 [utils.py:344] █ █ █▄ ▄█
(APIServer pid=72390) INFO 06-09 16:43:12 [utils.py:344] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.22.1
(APIServer pid=72390) INFO 06-09 16:43:12 [utils.py:344] █▄█▀ █ █ █ █ model Qwen/Qwen3-8B
(APIServer pid=72390) INFO 06-09 16:43:12 [utils.py:344] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=72390) INFO 06-09 16:43:12 [utils.py:344]
(APIServer pid=72390) INFO 06-09 16:43:12 [utils.py:278] non-default args: {'model_tag': 'Qwen/Qwen3-8B', 'model': 'Qwen/Qwen3-8B', 'max_model_len': 32768}
(APIServer pid=72390) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=72390) INFO 06-09 16:43:13 [model.py:617] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=72390) INFO 06-09 16:43:13 [model.py:1752] Using max model len 32768
(APIServer pid=72390) INFO 06-09 16:43:13 [vllm.py:977] Asynchronous scheduling is enabled.
(APIServer pid=72390) INFO 06-09 16:43:13 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(APIServer pid=72390) WARNING 06-09 16:43:13 [vllm.py:502] Model runner v2 requires Triton; using the v1 model runner instead.
INFO 06-09 16:43:18 [importing.py:45] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 06-09 16:43:18 [importing.py:69] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore pid=72591) INFO 06-09 16:43:19 [core.py:112] Initializing a V1 LLM engine (v0.22.1) with config: model='Qwen/Qwen3-8B', speculative_config=None, tokenizer='Qwen/Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-8B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.DYNAMO_TRACE_ONCE: 2>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': None, 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': 
False, 'dce': True, 'nan_asserts': False, 'epilogue_fusion': True, 'cpp.dynamic_threads': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
(EngineCore pid=72591) INFO 06-09 16:43:19 [multiproc_executor.py:139] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.10.51.190 (local), world_size=1, local_world_size=1
(EngineCore pid=72591) INFO 06-09 16:43:20 [ompmultiprocessing.py:185] OpenMP thread binding info:
(EngineCore pid=72591) INFO 06-09 16:43:20 [ompmultiprocessing.py:185] VLLM_CPU_OMP_THREADS_BIND='auto', auto_setup=True, skip_setup=False
(EngineCore pid=72591) INFO 06-09 16:43:20 [ompmultiprocessing.py:185] local_world_size=1, reserve_cpu_num=1
(EngineCore pid=72591) INFO 06-09 16:43:20 [ompmultiprocessing.py:185] local_rank=0, core ids=[192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286]
(EngineCore pid=72591) INFO 06-09 16:43:20 [ompmultiprocessing.py:185] reserved_cpus=[287]
INFO 06-09 16:43:22 [importing.py:45] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 06-09 16:43:22 [importing.py:69] Triton not installed or not compatible; certain GPU-related functions will not be available.
[W609 16:43:24.492166409 utils.cpp:68] Warning: NUMA binding: Using MEMBIND policy for memory allocation on the NUMA nodes (0). Memory allocations will be strictly bound to these NUMA nodes. (function init_cpu_memory_env)
WARNING 06-09 16:43:24 [vllm.py:502] Model runner v2 requires Triton; using the v1 model runner instead.
(Worker pid=72791) INFO 06-09 16:43:24 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:36457 backend=gloo
(Worker pid=72791) INFO 06-09 16:43:24 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=72791) INFO 06-09 16:43:24 [cpu_model_runner.py:104] Starting to load model Qwen/Qwen3-8B...
(Worker pid=72791) INFO 06-09 16:43:24 [selector.py:130] Using HND KV cache layout for CPU_ATTN backend.
(Worker pid=72791) INFO 06-09 16:43:25 [weight_utils.py:922] Filesystem type for checkpoints: NFS4. Checkpoint size: 15.26 GiB. Available RAM: 1508.73 GiB.
(Worker pid=72791) INFO 06-09 16:43:25 [weight_utils.py:884] Prefetching checkpoint files into page cache started (in background, num_threads=8, block_size=16777216 bytes)
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
(Worker pid=72791) INFO 06-09 16:43:26 [weight_utils.py:856] Prefetching checkpoint files: 10% (1/5)
(Worker pid=72791) INFO 06-09 16:43:26 [weight_utils.py:856] Prefetching checkpoint files: 20% (2/5)
(Worker pid=72791) INFO 06-09 16:43:26 [weight_utils.py:856] Prefetching checkpoint files: 30% (3/5)
(Worker pid=72791) INFO 06-09 16:43:26 [weight_utils.py:856] Prefetching checkpoint files: 40% (4/5)
(Worker pid=72791) INFO 06-09 16:43:26 [weight_utils.py:856] Prefetching checkpoint files: 50% (5/5)
(Worker pid=72791) INFO 06-09 16:43:26 [weight_utils.py:879] Prefetching checkpoint files into page cache finished in 0.89s
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:01<00:05, 1.39s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:02<00:02, 1.02it/s]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:02<00:01, 1.19it/s]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:03<00:00, 1.38it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.84it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.42it/s]
(Worker pid=72791)
(Worker pid=72791) INFO 06-09 16:43:29 [default_loader.py:397] Loading weights took 3.54 seconds
(Worker pid=72791) INFO 06-09 16:43:37 [cpu_model_runner.py:121] Warming up model for the compilation...
(Worker pid=72791) INFO 06-09 16:44:10 [decorators.py:708] saved AOT compiled function to /home/michalek/.cache/vllm/torch_compile_cache/torch_aot_compile/a84770d3632a2f224a48b63c886c0d28c9b2dfa7750641820143c524c373e06b/rank_0_0/model
(EngineCore pid=72591) INFO 06-09 16:44:38 [shm_broadcast.py:698] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker pid=72791) INFO 06-09 16:45:04 [monitor.py:81] Initial profiling/warmup run took 53.78 s
(Worker pid=72791) INFO 06-09 16:45:09 [cpu_model_runner.py:125] Warming up done.
(Worker pid=72791) INFO 06-09 16:45:10 [cpu_worker.py:211] Auto set (675.87/755.19) GiB for KV cache on node 0, with 694.77 GiB requested memory for the worker. 18.9 GiB memory was consumed by non-kv usages.
(EngineCore pid=72591) INFO 06-09 16:45:10 [kv_cache_utils.py:1733] GPU KV cache size: 4,921,472 tokens
(EngineCore pid=72591) INFO 06-09 16:45:10 [kv_cache_utils.py:1734] Maximum concurrency for 32,768 tokens per request: 150.19x
(EngineCore pid=72591) INFO 06-09 16:46:11 [shm_broadcast.py:698] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=72591) INFO 06-09 16:46:42 [core.py:309] init engine (profile, create kv cache, warmup model) took 184.74 s
(EngineCore pid=72591) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=72591) WARNING 06-09 16:46:44 [vllm.py:502] Model runner v2 requires Triton; using the v1 model runner instead.
(EngineCore pid=72591) INFO 06-09 16:46:44 [vllm.py:977] Asynchronous scheduling is disabled.
(EngineCore pid=72591) WARNING 06-09 16:46:44 [vllm.py:1058] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=72591) INFO 06-09 16:46:44 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(APIServer pid=72390) INFO 06-09 16:46:44 [api_server.py:592] Supported tasks: ['generate']
(APIServer pid=72390) WARNING 06-09 16:46:44 [model.py:1509] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=72390) INFO 06-09 16:46:46 [hf.py:488] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=72390) INFO 06-09 16:46:47 [api_server.py:596] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:37] Available routes are:
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=72390) INFO: Started server process [72390]
(APIServer pid=72390) INFO: Waiting for application startup.
(APIServer pid=72390) INFO: Application startup complete.
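To make the "OpenMP thread binding info" lines in the log above easier to read, here is a minimal sketch that pulls the core IDs out of a `core ids=[...]` log line and summarizes them as a count and a compact range. The function names `parse_core_ids` and `to_ranges` are hypothetical helpers written for this post, not part of vLLM, and the sample line is rebuilt from the IDs shown in the log (192 through 286) rather than read from a file.

```python
import re


def parse_core_ids(log_line):
    """Return the integers listed after 'core ids=' in a log line (empty list if absent)."""
    match = re.search(r"core ids=\[([0-9,\s]*)\]", log_line)
    if match is None:
        return []
    return [int(tok) for tok in match.group(1).split(",") if tok.strip()]


def to_ranges(ids):
    """Collapse integers into a compact range string, e.g. [1, 2, 3, 7] -> '1-3,7'."""
    ranges = []
    for cpu in sorted(set(ids)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu  # extend the current consecutive run
        else:
            ranges.append([cpu, cpu])  # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


# Sample line rebuilt from the log above (assumption: same format as the real line).
sample = f"local_rank=0, core ids={list(range(192, 287))}"
ids = parse_core_ids(sample)
print(len(ids), to_ranges(ids))  # prints: 95 192-286
```

Applied to the log above, this shows that the worker was bound to 95 cores (192-286) with core 287 reserved, so the automatic binding itself selected far more than two cores.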
Is there a way to improve the performance of current vLLM releases on CPU-only systems?
Thanks in advance.