The current vLLM CPU backend is not working properly

Hello there,

I am relatively new to this topic, so please pardon any mistakes.

Up until version 0.19.1, vLLM could utilize around 96 CPU threads. Starting with version 0.20.0, however, vLLM keeps only two CPU threads busy at 100%. This is very disappointing, because the machine I have available for testing has 384 CPU threads (2x AMD EPYC 9655) and 1.5 TB of RAM.

vLLM mentions Triton at startup, so I tried to download and compile the Triton-CPU backend (triton-lang/triton-cpu on GitHub, an experimental CPU backend for Triton), but the compilation process terminated with an error that I was unable to resolve.

For the sake of completeness, I am attaching the log from the vLLM startup on the aforementioned machine.

$ vllm serve Qwen/Qwen3-8B --port 8000 --max-model-len 32768
INFO 06-09 16:43:11 [importing.py:45] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 06-09 16:43:11 [importing.py:69] Triton not installed or not compatible; certain GPU-related functions will not be available.
(APIServer pid=72390) INFO 06-09 16:43:12 [utils.py:344] 
(APIServer pid=72390) INFO 06-09 16:43:12 [utils.py:344]        █     █     █▄   ▄█
(APIServer pid=72390) INFO 06-09 16:43:12 [utils.py:344]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.22.1
(APIServer pid=72390) INFO 06-09 16:43:12 [utils.py:344]   █▄█▀ █     █     █     █  model   Qwen/Qwen3-8B
(APIServer pid=72390) INFO 06-09 16:43:12 [utils.py:344]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=72390) INFO 06-09 16:43:12 [utils.py:344] 
(APIServer pid=72390) INFO 06-09 16:43:12 [utils.py:278] non-default args: {'model_tag': 'Qwen/Qwen3-8B', 'model': 'Qwen/Qwen3-8B', 'max_model_len': 32768}
(APIServer pid=72390) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=72390) INFO 06-09 16:43:13 [model.py:617] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=72390) INFO 06-09 16:43:13 [model.py:1752] Using max model len 32768
(APIServer pid=72390) INFO 06-09 16:43:13 [vllm.py:977] Asynchronous scheduling is enabled.
(APIServer pid=72390) INFO 06-09 16:43:13 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(APIServer pid=72390) WARNING 06-09 16:43:13 [vllm.py:502] Model runner v2 requires Triton; using the v1 model runner instead.
INFO 06-09 16:43:18 [importing.py:45] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 06-09 16:43:18 [importing.py:69] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore pid=72591) INFO 06-09 16:43:19 [core.py:112] Initializing a V1 LLM engine (v0.22.1) with config: model='Qwen/Qwen3-8B', speculative_config=None, tokenizer='Qwen/Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-8B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.DYNAMO_TRACE_ONCE: 2>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': None, 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': 
False, 'dce': True, 'nan_asserts': False, 'epilogue_fusion': True, 'cpp.dynamic_threads': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
(EngineCore pid=72591) INFO 06-09 16:43:19 [multiproc_executor.py:139] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.10.51.190 (local), world_size=1, local_world_size=1
(EngineCore pid=72591) INFO 06-09 16:43:20 [ompmultiprocessing.py:185] OpenMP thread binding info: 
(EngineCore pid=72591) INFO 06-09 16:43:20 [ompmultiprocessing.py:185]  VLLM_CPU_OMP_THREADS_BIND='auto', auto_setup=True, skip_setup=False
(EngineCore pid=72591) INFO 06-09 16:43:20 [ompmultiprocessing.py:185]  local_world_size=1, reserve_cpu_num=1
(EngineCore pid=72591) INFO 06-09 16:43:20 [ompmultiprocessing.py:185]  local_rank=0, core ids=[192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286]
(EngineCore pid=72591) INFO 06-09 16:43:20 [ompmultiprocessing.py:185]  reserved_cpus=[287]
INFO 06-09 16:43:22 [importing.py:45] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 06-09 16:43:22 [importing.py:69] Triton not installed or not compatible; certain GPU-related functions will not be available.
[W609 16:43:24.492166409 utils.cpp:68] Warning: NUMA binding: Using MEMBIND policy for memory allocation on the NUMA nodes (0). Memory allocations will be strictly bound to these NUMA nodes. (function init_cpu_memory_env)
WARNING 06-09 16:43:24 [vllm.py:502] Model runner v2 requires Triton; using the v1 model runner instead.
(Worker pid=72791) INFO 06-09 16:43:24 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:36457 backend=gloo
(Worker pid=72791) INFO 06-09 16:43:24 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=72791) INFO 06-09 16:43:24 [cpu_model_runner.py:104] Starting to load model Qwen/Qwen3-8B...
(Worker pid=72791) INFO 06-09 16:43:24 [selector.py:130] Using HND KV cache layout for CPU_ATTN backend.
(Worker pid=72791) INFO 06-09 16:43:25 [weight_utils.py:922] Filesystem type for checkpoints: NFS4. Checkpoint size: 15.26 GiB. Available RAM: 1508.73 GiB.
(Worker pid=72791) INFO 06-09 16:43:25 [weight_utils.py:884] Prefetching checkpoint files into page cache started (in background, num_threads=8, block_size=16777216 bytes)
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
(Worker pid=72791) INFO 06-09 16:43:26 [weight_utils.py:856] Prefetching checkpoint files: 10% (1/5)
(Worker pid=72791) INFO 06-09 16:43:26 [weight_utils.py:856] Prefetching checkpoint files: 20% (2/5)
(Worker pid=72791) INFO 06-09 16:43:26 [weight_utils.py:856] Prefetching checkpoint files: 30% (3/5)
(Worker pid=72791) INFO 06-09 16:43:26 [weight_utils.py:856] Prefetching checkpoint files: 40% (4/5)
(Worker pid=72791) INFO 06-09 16:43:26 [weight_utils.py:856] Prefetching checkpoint files: 50% (5/5)
(Worker pid=72791) INFO 06-09 16:43:26 [weight_utils.py:879] Prefetching checkpoint files into page cache finished in 0.89s
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:01<00:05,  1.39s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:02<00:02,  1.02it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.19it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00,  1.38it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.84it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.42it/s]
(Worker pid=72791) 
(Worker pid=72791) INFO 06-09 16:43:29 [default_loader.py:397] Loading weights took 3.54 seconds
(Worker pid=72791) INFO 06-09 16:43:37 [cpu_model_runner.py:121] Warming up model for the compilation...
(Worker pid=72791) INFO 06-09 16:44:10 [decorators.py:708] saved AOT compiled function to /home/michalek/.cache/vllm/torch_compile_cache/torch_aot_compile/a84770d3632a2f224a48b63c886c0d28c9b2dfa7750641820143c524c373e06b/rank_0_0/model
(EngineCore pid=72591) INFO 06-09 16:44:38 [shm_broadcast.py:698] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker pid=72791) INFO 06-09 16:45:04 [monitor.py:81] Initial profiling/warmup run took 53.78 s
(Worker pid=72791) INFO 06-09 16:45:09 [cpu_model_runner.py:125] Warming up done.
(Worker pid=72791) INFO 06-09 16:45:10 [cpu_worker.py:211] Auto set (675.87/755.19) GiB for KV cache on node 0, with 694.77 GiB requested memory for the worker. 18.9 GiB memory was consumed by non-kv usages.
(EngineCore pid=72591) INFO 06-09 16:45:10 [kv_cache_utils.py:1733] GPU KV cache size: 4,921,472 tokens
(EngineCore pid=72591) INFO 06-09 16:45:10 [kv_cache_utils.py:1734] Maximum concurrency for 32,768 tokens per request: 150.19x
(EngineCore pid=72591) INFO 06-09 16:46:11 [shm_broadcast.py:698] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=72591) INFO 06-09 16:46:42 [core.py:309] init engine (profile, create kv cache, warmup model) took 184.74 s
(EngineCore pid=72591) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=72591) WARNING 06-09 16:46:44 [vllm.py:502] Model runner v2 requires Triton; using the v1 model runner instead.
(EngineCore pid=72591) INFO 06-09 16:46:44 [vllm.py:977] Asynchronous scheduling is disabled.
(EngineCore pid=72591) WARNING 06-09 16:46:44 [vllm.py:1058] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=72591) INFO 06-09 16:46:44 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(APIServer pid=72390) INFO 06-09 16:46:44 [api_server.py:592] Supported tasks: ['generate']
(APIServer pid=72390) WARNING 06-09 16:46:44 [model.py:1509] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=72390) INFO 06-09 16:46:46 [hf.py:488] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=72390) INFO 06-09 16:46:47 [api_server.py:596] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:37] Available routes are:
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=72390) INFO 06-09 16:46:47 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=72390) INFO:     Started server process [72390]
(APIServer pid=72390) INFO:     Waiting for application startup.
(APIServer pid=72390) INFO:     Application startup complete.

Is there a way to improve the performance of the current vLLM releases on CPU-only systems?

Thanks in advance.

vLLM v0.20.0+ changed how CPU threads are managed, but it should still utilize many threads if configured correctly. Your log shows vLLM binding 95 cores (core ids 192-286) and reserving one (core 287) for the control plane, which is expected on a multi-socket NUMA system. If you nevertheless observe only two threads at 100% and poor scaling, the more likely causes are thread binding or OpenMP settings, not a hard-coded thread limit in vLLM itself. The Triton-CPU backend is still experimental, is not required for the vLLM CPU backend, and is not officially supported in vLLM’s documentation for CPU inference as of now (see here).

To maximize CPU utilization, set VLLM_CPU_OMP_THREADS_BIND to cover all the logical CPUs you want to use (e.g., 0-383 for all 384 threads), and check that OMP_NUM_THREADS is not set to a low value. Also verify that your NUMA configuration is sensible and that each rank’s threads are bound to the same NUMA node (see here). Then monitor with htop to confirm thread usage. Triton-related warnings can be ignored for CPU-only inference. Would you like a step-by-step example of how to set these environment variables for your system?
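
As a rough sanity check of the binding advice above, the following Python sketch expands a CPU list and compares it with the CPUs the current process is allowed to run on. It is not part of vLLM: `parse_cpu_list` is a hypothetical helper, and it assumes the variable holds a single comma-and-dash list such as 192-287 (keyword values like auto or nobind are skipped, and multi-rank lists are not handled).

```python
import os


def parse_cpu_list(spec: str) -> list[int]:
    """Expand a CPU list such as "0-3,8,10-11" into sorted CPU ids."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


if __name__ == "__main__":
    spec = os.environ.get("VLLM_CPU_OMP_THREADS_BIND", "")
    # Keyword values seen in this thread ("auto", "nobind") or an unset
    # variable carry no explicit CPU ids, so there is nothing to expand.
    if spec[:1].isdigit() and hasattr(os, "sched_getaffinity"):
        wanted = set(parse_cpu_list(spec))
        allowed = os.sched_getaffinity(0)  # CPUs this process may use (Linux)
        print(f"requested {len(wanted)} CPUs, {len(wanted & allowed)} of them allowed")
        missing = sorted(wanted - allowed)
        if missing:
            print("requested but not allowed:", missing)
    else:
        print("no explicit CPU list set:", spec or "<unset>")
```

Run it from the same shell that launches vLLM. If some requested CPUs are reported as not allowed, a cgroup, container, or batch-scheduler limit may be narrowing the CPU set before vLLM starts.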

Hint: Mention @RunLLM in the post for followups.

@RunLLM I have already tried setting the VLLM_CPU_OMP_THREADS_BIND environment variable, but it had no effect. With VLLM_CPU_OMP_THREADS_BIND='192-287', vLLM utilizes cores 192 and 288 (as viewed in htop); with VLLM_CPU_OMP_THREADS_BIND='0-383', it utilizes cores 0 and 97.

This behavior suggests that vLLM is not correctly binding OpenMP threads to the specified CPU cores, and is instead defaulting to using only a couple of threads. This issue can occur if the OpenMP runtime or vLLM’s thread binding logic is not functioning as expected, possibly due to environment, library, or NUMA configuration issues. It is not expected or documented that vLLM should only use two threads when VLLM_CPU_OMP_THREADS_BIND is set to a large range (see docs).

Additionally, some users have reported similar issues when the OpenMP library (e.g., Intel OpenMP or libgomp) is not properly preloaded, or when there are conflicts between different OpenMP runtimes (see issue). Try setting LD_PRELOAD to include your system’s OpenMP library (e.g., libiomp5.so or libgomp.so.1) and ensure no conflicting OpenMP libraries are loaded. Also, make sure OMP_NUM_THREADS is unset or set to match the number of threads you want. Would you like detailed troubleshooting steps for this setup?
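
To see which OpenMP runtimes are actually mapped into a running process, one option is to scan `/proc/<pid>/maps`. The sketch below is Linux-only and not a vLLM tool; `openmp_libs_in_maps` is a hypothetical helper, and the file-name pattern (libgomp, libiomp5, libomp) is an assumption about how the runtimes are named on the system.

```python
import re
import sys
from pathlib import Path

# Matches libgomp (GNU), libiomp5 (Intel) and libomp (LLVM) shared objects.
OMP_LIB = re.compile(r"/lib(gomp|iomp5|omp)([-.][-.\w]*)?\.so")


def openmp_libs_in_maps(maps_text: str) -> set[str]:
    """Return paths of OpenMP runtimes found in /proc/<pid>/maps content."""
    found: set[str] = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)  # address perms offset dev inode path
        if len(fields) == 6 and OMP_LIB.search(fields[5]):
            found.add(fields[5].strip())
    return found


if __name__ == "__main__":
    # Pass the PID of the vLLM worker process; defaults to this process.
    pid = sys.argv[1] if len(sys.argv) > 1 and sys.argv[1].isdigit() else "self"
    maps = Path(f"/proc/{pid}/maps")
    if maps.exists():
        libs = openmp_libs_in_maps(maps.read_text())
        for lib in sorted(libs):
            print(lib)
        if len(libs) > 1:
            print("more than one OpenMP runtime is mapped into this process")
    else:
        print(f"{maps} not found (this sketch is Linux-only)")
```

Pointing it at the Worker PID from the startup log (72791 in the log above) would show whether the LD_PRELOAD setting took effect and whether a second runtime was loaded alongside it.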

@RunLLM I have also already tried setting LD_PRELOAD to the path of libiomp5.so, with no effect.

If setting LD_PRELOAD to include libiomp5.so has no effect and vLLM still only uses two CPU threads, this suggests a deeper issue with OpenMP initialization or vLLM’s thread binding logic. The official documentation and recent issues confirm that, when properly configured, vLLM should utilize all specified cores (see docs). There is no documented limitation that would restrict vLLM to two threads in your scenario.

This problem may be related to the OpenMP environment being set too late (after the library is loaded), or to a mismatch between the allowed CPU set and the NUMA node mapping, especially in containerized or resource-constrained environments (see issue). Try launching vLLM from a clean shell with only the essential environment variables set, and verify that no other OpenMP libraries are loaded. If possible, test with VLLM_CPU_OMP_THREADS_BIND=nobind to see if OpenMP uses more threads automatically. Would you like a step-by-step diagnostic checklist to further isolate the cause?
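
One way to check the "allowed CPU set" part of this diagnosis is to read the per-thread affinity of the running worker from `/proc/<pid>/task/*/status`. The following is a minimal Linux-only sketch, not a vLLM utility; `thread_affinity` is a hypothetical helper that relies on the `Name` and `Cpus_allowed_list` fields of the kernel's status file.

```python
import os
import sys
from collections import Counter
from pathlib import Path


def thread_affinity(status_text: str) -> tuple[str, str]:
    """Extract (thread name, allowed CPU list) from a task status file."""
    name = cpus = ""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            name = value.strip()
        elif key == "Cpus_allowed_list":
            cpus = value.strip()
    return name, cpus


if __name__ == "__main__":
    # Pass the PID of the vLLM worker process; defaults to this process.
    pid = sys.argv[1] if len(sys.argv) > 1 and sys.argv[1].isdigit() else str(os.getpid())
    task_dir = Path(f"/proc/{pid}/task")
    summary: Counter = Counter()
    if task_dir.is_dir():
        for task in task_dir.iterdir():
            try:
                summary[thread_affinity((task / "status").read_text())] += 1
            except OSError:
                continue  # the thread exited while we were reading
    for (name, cpus), count in sorted(summary.items()):
        print(f"{count:4d} thread(s) {name!r} allowed on CPUs {cpus}")
```

If most threads report a single CPU each, they are pinned one per core as intended; if many threads share the same one or two CPUs, that would match the two-busy-cores picture seen in htop.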

@RunLLM With VLLM_CPU_OMP_THREADS_BIND=nobind, vLLM utilizes approximately half of the available CPU threads.

When you set VLLM_CPU_OMP_THREADS_BIND=nobind, vLLM disables explicit thread binding and relies on the OpenMP runtime to determine thread placement and count. In this mode, the number of threads is controlled by OMP_NUM_THREADS (or the runtime's default), which often ends up equal to the number of physical cores rather than logical threads on systems with SMT/hyperthreading (see docs; PR discussion).

To utilize all available threads, explicitly set OMP_NUM_THREADS to the desired value (e.g., OMP_NUM_THREADS=384 for your system) before launching vLLM. Would you like a code example or further details on how to set this up?
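
To check whether "about half" simply means one thread per physical core, the logical and physical CPU counts can be compared using the kernel's SMT sibling lists in sysfs. This is a Linux-only sketch with hypothetical helper names (`count_physical_cores`, `read_siblings`); it assumes every logical CPU of a core reports the same `thread_siblings_list` string.

```python
from pathlib import Path


def count_physical_cores(siblings_by_cpu: dict[int, str]) -> int:
    """Count distinct SMT sibling groups; each group is one physical core."""
    return len(set(siblings_by_cpu.values()))


def read_siblings(root: str = "/sys/devices/system/cpu") -> dict[int, str]:
    """Map each online logical CPU id to its thread_siblings_list string."""
    result: dict[int, str] = {}
    for path in Path(root).glob("cpu[0-9]*/topology/thread_siblings_list"):
        cpu_id = int(path.parts[-3][3:])  # "cpu192" -> 192
        result[cpu_id] = path.read_text().strip()
    return result


if __name__ == "__main__":
    siblings = read_siblings()
    print("logical CPUs:  ", len(siblings))
    print("physical cores:", count_physical_cores(siblings))
```

On the machine described in this thread one would presumably see 384 logical CPUs and 192 physical cores, so roughly 192 busy threads in htop would point to a one-thread-per-core default rather than to a binding failure.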

I tried:
export OMP_NUM_THREADS=380
VLLM_CPU_OMP_THREADS_BIND='nobind' vllm serve Qwen/Qwen3-8B --port 8000 --max-model-len 32768
and vLLM still utilizes approximately half of the CPU threads.