vLLM Tensor Parallel Workers Not Completing Initialization

I can’t get inference to work; the workers appear to be stuck, possibly still compiling.

I’m using the Docker image vllm/vllm-openai:latest, running on 4x 32 GB NVIDIA V100 GPUs.

When I run:

docker run -d --name vllm-server --runtime nvidia --gpus all \
  -v /home/blank/workspace/huggingface:/root/.cache/huggingface \
  -p 8080:8000 --ipc=host \
  -e CUDA_VISIBLE_DEVICES=1,2,3,4 \
  vllm/vllm-openai:latest \
  --model /root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 \
  --trust-remote-code --tensor-parallel-size 4 --max-num-seqs 8 \
  --gpu-memory-utilization 0.8 --max-model-len 8192

The logs look like:

WARNING 01-21 11:58:59 [argparse_utils.py:195] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=1) INFO 01-21 11:58:59 [api_server.py:1272] vLLM API server version 0.14.0
(APIServer pid=1) INFO 01-21 11:58:59 [utils.py:263] non-default args: {'model_tag': '/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16', 'model': '/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16', 'trust_remote_code': True, 'max_model_len': 8192, 'tensor_parallel_size': 4, 'gpu_memory_utilization': 0.8, 'max_num_seqs': 8}
(APIServer pid=1) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) INFO 01-21 11:59:12 [model.py:530] Resolved architecture: NemotronHForCausalLM
(APIServer pid=1) WARNING 01-21 11:59:12 [model.py:1817] Your device 'Tesla V100-SXM2-32GB' (with compute capability 7.0) doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility.
(APIServer pid=1) WARNING 01-21 11:59:12 [model.py:1869] Casting torch.bfloat16 to torch.float16.
(APIServer pid=1) INFO 01-21 11:59:12 [model.py:1545] Using max model len 8192
(APIServer pid=1) INFO 01-21 11:59:12 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 01-21 11:59:12 [config.py:543] Updating mamba_ssm_cache_dtype to 'float16' for NemotronH model
(APIServer pid=1) INFO 01-21 11:59:12 [config.py:476] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 01-21 11:59:12 [config.py:500] Padding mamba page size by 2.64% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 01-21 11:59:12 [vllm.py:630] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 01-21 11:59:12 [vllm.py:637] Disabling NCCL for DP synchronization when using async scheduling.
(EngineCore_DP0 pid=226) INFO 01-21 11:59:25 [core.py:97] Initializing a V1 LLM engine (v0.14.0) with config: model='/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16', speculative_config=None, tokenizer='/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 16, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=226) WARNING 01-21 11:59:25 [multiproc_executor.py:880] Reducing Torch parallelism from 40 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 01-21 11:59:38 [parallel_state.py:1214] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:57347 backend=nccl
INFO 01-21 11:59:50 [parallel_state.py:1214] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:57347 backend=nccl
INFO 01-21 12:00:02 [parallel_state.py:1214] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:57347 backend=nccl
INFO 01-21 12:00:14 [parallel_state.py:1214] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:57347 backend=nccl
INFO 01-21 12:00:14 [pynccl.py:111] vLLM is using nccl==2.27.5
WARNING 01-21 12:00:15 [symm_mem.py:67] SymmMemCommunicator: Device capability 7.0 not supported, communicator is not available.
WARNING 01-21 12:00:15 [symm_mem.py:67] SymmMemCommunicator: Device capability 7.0 not supported, communicator is not available.
WARNING 01-21 12:00:15 [symm_mem.py:67] SymmMemCommunicator: Device capability 7.0 not supported, communicator is not available.
WARNING 01-21 12:00:15 [symm_mem.py:67] SymmMemCommunicator: Device capability 7.0 not supported, communicator is not available.
WARNING 01-21 12:00:15 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 01-21 12:00:15 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 01-21 12:00:15 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 01-21 12:00:15 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 01-21 12:00:15 [parallel_state.py:1425] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1
INFO 01-21 12:00:15 [parallel_state.py:1425] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 2, EP rank 2
INFO 01-21 12:00:15 [parallel_state.py:1425] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 3, EP rank 3
INFO 01-21 12:00:15 [parallel_state.py:1425] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(Worker_TP0 pid=299) INFO 01-21 12:00:17 [gpu_model_runner.py:3808] Starting to load model /root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16...
(Worker_TP3 pid=322) ERROR 01-21 12:00:52 [fa_utils.py:82] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(Worker_TP1 pid=302) ERROR 01-21 12:00:52 [fa_utils.py:82] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(Worker_TP0 pid=299) ERROR 01-21 12:00:52 [fa_utils.py:82] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(Worker_TP2 pid=311) ERROR 01-21 12:00:52 [fa_utils.py:82] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(Worker_TP0 pid=299) INFO 01-21 12:00:52 [cuda.py:351] Using TRITON_ATTN attention backend out of potential backends: ('TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors checkpoint shards:   0% Completed | 0/13 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   8% Completed | 1/13 [00:01<00:20,  1.73s/it]
Loading safetensors checkpoint shards:  15% Completed | 2/13 [00:03<00:19,  1.78s/it]
Loading safetensors checkpoint shards:  23% Completed | 3/13 [00:09<00:38,  3.87s/it]
Loading safetensors checkpoint shards:  31% Completed | 4/13 [00:11<00:26,  3.00s/it]
Loading safetensors checkpoint shards:  38% Completed | 5/13 [00:13<00:19,  2.50s/it]
Loading safetensors checkpoint shards:  46% Completed | 6/13 [00:15<00:15,  2.28s/it]
Loading safetensors checkpoint shards:  54% Completed | 7/13 [00:16<00:12,  2.12s/it]
Loading safetensors checkpoint shards:  62% Completed | 8/13 [00:18<00:10,  2.03s/it]
Loading safetensors checkpoint shards:  69% Completed | 9/13 [00:19<00:07,  1.76s/it]
Loading safetensors checkpoint shards:  77% Completed | 10/13 [00:21<00:05,  1.72s/it]
Loading safetensors checkpoint shards:  85% Completed | 11/13 [00:23<00:03,  1.70s/it]
Loading safetensors checkpoint shards:  92% Completed | 12/13 [00:24<00:01,  1.72s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:26<00:00,  1.72s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:26<00:00,  2.05s/it]
(Worker_TP0 pid=299) 
(Worker_TP0 pid=299) INFO 01-21 12:01:19 [default_loader.py:291] Loading weights took 26.75 seconds
(Worker_TP0 pid=299) INFO 01-21 12:01:20 [gpu_model_runner.py:3905] Model loading took 14.76 GiB memory and 62.245573 seconds
(Worker_TP0 pid=299) INFO 01-21 12:01:55 [backends.py:644] Using cache directory: /root/.cache/vllm/torch_compile_cache/3678f38b72/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=299) INFO 01-21 12:01:55 [backends.py:704] Dynamo bytecode transform time: 8.08 s
(Worker_TP0 pid=299) INFO 01-21 12:02:02 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(Worker_TP3 pid=322) INFO 01-21 12:02:02 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(Worker_TP2 pid=311) INFO 01-21 12:02:02 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(Worker_TP1 pid=302) INFO 01-21 12:02:03 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(Worker_TP0 pid=299) WARNING 01-21 12:02:06 [fused_moe.py:1090] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=464,device_name=Tesla_V100-SXM2-32GB.json
(Worker_TP0 pid=299) INFO 01-21 12:02:46 [backends.py:278] Compiling a graph for compile range (1, 2048) takes 46.93 s
(Worker_TP0 pid=299) INFO 01-21 12:02:46 [monitor.py:34] torch.compile takes 55.01 s in total
(EngineCore_DP0 pid=226) INFO 01-21 12:02:47 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=299) INFO 01-21 12:02:48 [gpu_worker.py:358] Available KV cache memory: 10.3 GiB
(EngineCore_DP0 pid=226) WARNING 01-21 12:02:48 [kv_cache_utils.py:1047] Add 1 padding layers, may waste at most 4.35% KV cache memory
(EngineCore_DP0 pid=226) INFO 01-21 12:02:48 [kv_cache_utils.py:1305] GPU KV cache size: 719,712 tokens
(EngineCore_DP0 pid=226) INFO 01-21 12:02:48 [kv_cache_utils.py:1310] Maximum concurrency for 8,192 tokens per request: 330.80x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 5/5 [00:03<00:00,  1.46it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 4/4 [00:07<00:00,  1.97s/it]
(Worker_TP0 pid=299) INFO 01-21 12:03:01 [gpu_model_runner.py:4856] Graph capturing finished in 12 secs, took 0.19 GiB
(EngineCore_DP0 pid=226) INFO 01-21 12:03:01 [core.py:273] init engine (profile, create kv cache, warmup model) took 74.04 seconds
(EngineCore_DP0 pid=226) INFO 01-21 12:03:02 [vllm.py:630] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 01-21 12:03:03 [api_server.py:1014] Supported tasks: ['generate']
(APIServer pid=1) INFO 01-21 12:03:03 [serving_chat.py:182] Warming up chat template processing...
(APIServer pid=1) INFO 01-21 12:03:04 [chat_utils.py:599] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222] Chat template warmup failed
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222] Traceback (most recent call last):
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 201, in warmup
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222]     await self._preprocess_chat(
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 1227, in _preprocess_chat
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222]     request_prompt = apply_hf_chat_template(
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222]                      ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/chat_utils.py", line 1826, in apply_hf_chat_template
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222]     raise ChatTemplateResolutionError(
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222] vllm.entrypoints.chat_utils.ChatTemplateResolutionError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.
(APIServer pid=1) /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py:222: RuntimeWarning: coroutine 'AsyncMultiModalItemTracker.all_mm_data' was never awaited
(APIServer pid=1)   logger.exception("Chat template warmup failed")
(APIServer pid=1) RuntimeWarning: Enable tracemalloc to get the object allocation traceback
(APIServer pid=1) INFO 01-21 12:03:04 [api_server.py:1346] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:38] Available routes are:
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.

Note that I only get updates for Worker_TP0. When I then attempt to make an inference call, I never get any output, and the logs repeatedly show:

(EngineCore_DP0 pid=230) INFO 01-21 12:28:14 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).

Did I not wait long enough for everything to compile? Or is there some other issue at hand?

Your issue is not with compilation time, but with the model’s lack of a chat template. The error `ChatTemplateResolutionError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.` means vLLM cannot process chat requests for this model until you supply a template. A chat template is a hard requirement for the OpenAI-compatible /v1/chat/completions endpoint, and chat requests will hang or fail without one. The repeated “No available shared memory broadcast block found” messages are a symptom of the server waiting on input that is never processed because of this error, not a sign of slow compilation or a hardware problem.

To fix this, provide a chat template via the --chat-template argument when launching vLLM. Since your model is a base model rather than a chat/instruction-tuned one, you may need to write a custom template. Example templates for popular models are available in the vLLM examples directory. Alternatively, if you only need completions (not chat), use the /v1/completions endpoint, which does not require a chat template.
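As a minimal sketch, you could write a simple Jinja template like the one below. The role/content formatting here is an illustrative assumption, not taken from the model card; adapt it to however you intend to prompt the base model.

```shell
# Hypothetical minimal chat template for a base model: concatenates each
# message as "role: content" and ends with an "assistant:" generation prompt.
mkdir -p templates
cat > templates/chat_template.jinja <<'EOF'
{%- for message in messages -%}
{{ message['role'] }}: {{ message['content'] }}
{% endfor -%}
assistant:
EOF
```

You would then mount the directory into the container (e.g. add `-v /home/blank/workspace/templates:/templates` to your docker run) and pass `--chat-template /templates/chat_template.jinja` to vLLM.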

Would you like more detail on how to create or specify a chat template for your model?
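If you go the completions route instead, a request sketch (assuming the server is up; the host port 8080 comes from your `-p 8080:8000` mapping, and the model name must match the served model path) would look like:

```shell
# Sketch of a plain completions request, which bypasses the chat template
# entirely. The prompt and max_tokens values are arbitrary examples.
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16",
        "prompt": "The capital of France is",
        "max_tokens": 32
      }'
```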


Hint: Mention @RunLLM in the post for followups.

@RunLLM I put in the chat template but things still aren’t working. Here are the startup logs:

WARNING 01-21 12:49:35 [argparse_utils.py:195] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=1) INFO 01-21 12:49:35 [api_server.py:1272] vLLM API server version 0.14.0
(APIServer pid=1) INFO 01-21 12:49:35 [utils.py:263] non-default args: {'model_tag': '/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16', 'model': '/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16', 'trust_remote_code': True, 'max_model_len': 8192, 'tensor_parallel_size': 4, 'gpu_memory_utilization': 0.8, 'max_num_seqs': 8}
(APIServer pid=1) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) INFO 01-21 12:49:48 [model.py:530] Resolved architecture: NemotronHForCausalLM
(APIServer pid=1) WARNING 01-21 12:49:48 [model.py:1817] Your device 'Tesla V100-SXM2-32GB' (with compute capability 7.0) doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility.
(APIServer pid=1) WARNING 01-21 12:49:48 [model.py:1869] Casting torch.bfloat16 to torch.float16.
(APIServer pid=1) INFO 01-21 12:49:48 [model.py:1545] Using max model len 8192
(APIServer pid=1) INFO 01-21 12:49:48 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 01-21 12:49:48 [config.py:543] Updating mamba_ssm_cache_dtype to 'float16' for NemotronH model
(APIServer pid=1) INFO 01-21 12:49:48 [config.py:476] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 01-21 12:49:48 [config.py:500] Padding mamba page size by 2.64% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 01-21 12:49:48 [vllm.py:630] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 01-21 12:49:48 [vllm.py:637] Disabling NCCL for DP synchronization when using async scheduling.
(EngineCore_DP0 pid=227) INFO 01-21 12:50:03 [core.py:97] Initializing a V1 LLM engine (v0.14.0) with config: model='/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16', speculative_config=None, tokenizer='/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 16, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=227) WARNING 01-21 12:50:03 [multiproc_executor.py:880] Reducing Torch parallelism from 40 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 01-21 12:50:16 [parallel_state.py:1214] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:59637 backend=nccl
INFO 01-21 12:50:28 [parallel_state.py:1214] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:59637 backend=nccl
INFO 01-21 12:50:41 [parallel_state.py:1214] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:59637 backend=nccl
INFO 01-21 12:50:53 [parallel_state.py:1214] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:59637 backend=nccl
INFO 01-21 12:50:53 [pynccl.py:111] vLLM is using nccl==2.27.5
WARNING 01-21 12:50:54 [symm_mem.py:67] SymmMemCommunicator: Device capability 7.0 not supported, communicator is not available.
WARNING 01-21 12:50:54 [symm_mem.py:67] SymmMemCommunicator: Device capability 7.0 not supported, communicator is not available.
WARNING 01-21 12:50:54 [symm_mem.py:67] SymmMemCommunicator: Device capability 7.0 not supported, communicator is not available.
WARNING 01-21 12:50:54 [symm_mem.py:67] SymmMemCommunicator: Device capability 7.0 not supported, communicator is not available.
WARNING 01-21 12:50:54 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 01-21 12:50:54 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 01-21 12:50:54 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 01-21 12:50:54 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 01-21 12:50:54 [parallel_state.py:1425] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1
INFO 01-21 12:50:54 [parallel_state.py:1425] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 2, EP rank 2
INFO 01-21 12:50:54 [parallel_state.py:1425] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
INFO 01-21 12:50:54 [parallel_state.py:1425] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 3, EP rank 3
(Worker_TP0 pid=300) INFO 01-21 12:50:56 [gpu_model_runner.py:3808] Starting to load model /root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16...
(Worker_TP3 pid=323) ERROR 01-21 12:51:32 [fa_utils.py:82] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(Worker_TP1 pid=303) ERROR 01-21 12:51:32 [fa_utils.py:82] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(Worker_TP2 pid=312) ERROR 01-21 12:51:32 [fa_utils.py:82] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(Worker_TP0 pid=300) ERROR 01-21 12:51:32 [fa_utils.py:82] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(Worker_TP0 pid=300) INFO 01-21 12:51:32 [cuda.py:351] Using TRITON_ATTN attention backend out of potential backends: ('TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors checkpoint shards:   0% Completed | 0/13 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   8% Completed | 1/13 [00:01<00:19,  1.59s/it]
Loading safetensors checkpoint shards:  15% Completed | 2/13 [00:03<00:17,  1.61s/it]
Loading safetensors checkpoint shards:  23% Completed | 3/13 [00:05<00:16,  1.70s/it]
Loading safetensors checkpoint shards:  31% Completed | 4/13 [00:06<00:15,  1.71s/it]
Loading safetensors checkpoint shards:  38% Completed | 5/13 [00:08<00:13,  1.74s/it]
Loading safetensors checkpoint shards:  46% Completed | 6/13 [00:10<00:11,  1.69s/it]
Loading safetensors checkpoint shards:  54% Completed | 7/13 [00:11<00:09,  1.65s/it]
Loading safetensors checkpoint shards:  62% Completed | 8/13 [00:13<00:08,  1.63s/it]
Loading safetensors checkpoint shards:  69% Completed | 9/13 [00:14<00:06,  1.50s/it]
Loading safetensors checkpoint shards:  77% Completed | 10/13 [00:16<00:04,  1.55s/it]
Loading safetensors checkpoint shards:  85% Completed | 11/13 [00:17<00:03,  1.58s/it]
Loading safetensors checkpoint shards:  92% Completed | 12/13 [00:19<00:01,  1.58s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:20<00:00,  1.58s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:20<00:00,  1.61s/it]
(Worker_TP0 pid=300) 
(Worker_TP0 pid=300) INFO 01-21 12:51:53 [default_loader.py:291] Loading weights took 21.11 seconds
(Worker_TP0 pid=300) INFO 01-21 12:51:54 [gpu_model_runner.py:3905] Model loading took 14.76 GiB memory and 56.989528 seconds
(Worker_TP0 pid=300) INFO 01-21 12:52:03 [backends.py:644] Using cache directory: /root/.cache/vllm/torch_compile_cache/3678f38b72/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=300) INFO 01-21 12:52:03 [backends.py:704] Dynamo bytecode transform time: 8.21 s
(Worker_TP3 pid=323) INFO 01-21 12:52:11 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(Worker_TP2 pid=312) INFO 01-21 12:52:11 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(Worker_TP0 pid=300) INFO 01-21 12:52:11 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(Worker_TP1 pid=303) INFO 01-21 12:52:11 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(Worker_TP0 pid=300) WARNING 01-21 12:52:14 [fused_moe.py:1090] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=464,device_name=Tesla_V100-SXM2-32GB.json
(Worker_TP0 pid=300) INFO 01-21 12:52:52 [backends.py:278] Compiling a graph for compile range (1, 2048) takes 45.13 s
(Worker_TP0 pid=300) INFO 01-21 12:52:52 [monitor.py:34] torch.compile takes 53.35 s in total
(Worker_TP0 pid=300) INFO 01-21 12:52:54 [gpu_worker.py:358] Available KV cache memory: 10.3 GiB
(EngineCore_DP0 pid=227) WARNING 01-21 12:52:54 [kv_cache_utils.py:1047] Add 1 padding layers, may waste at most 4.35% KV cache memory
(EngineCore_DP0 pid=227) INFO 01-21 12:52:54 [kv_cache_utils.py:1305] GPU KV cache size: 719,712 tokens
(EngineCore_DP0 pid=227) INFO 01-21 12:52:54 [kv_cache_utils.py:1310] Maximum concurrency for 8,192 tokens per request: 330.80x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 5/5 [00:04<00:00,  1.24it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 4/4 [00:07<00:00,  1.87s/it]
(Worker_TP0 pid=300) INFO 01-21 12:53:07 [gpu_model_runner.py:4856] Graph capturing finished in 13 secs, took 0.19 GiB
(EngineCore_DP0 pid=227) INFO 01-21 12:53:07 [core.py:273] init engine (profile, create kv cache, warmup model) took 72.53 seconds
(EngineCore_DP0 pid=227) INFO 01-21 12:53:09 [vllm.py:630] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 01-21 12:53:09 [api_server.py:1014] Supported tasks: ['generate']
(APIServer pid=1) INFO 01-21 12:53:09 [serving_chat.py:182] Warming up chat template processing...
(APIServer pid=1) INFO 01-21 12:53:11 [chat_utils.py:599] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 01-21 12:53:11 [serving_chat.py:218] Chat template warmup completed in 1419.3ms
(APIServer pid=1) INFO 01-21 12:53:11 [api_server.py:1346] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:38] Available routes are:
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 01-21 12:53:11 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.

Here is my request:

curl -X POST "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" --data '{"model": "/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16", "messages": [{"role": "user", "content": "Hello, how are you?"}], "max_tokens": 50, "temperature": 0.7}' 

Here are the logs post request:

(EngineCore_DP0 pid=227) INFO 01-21 12:55:20 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=227) INFO 01-21 12:56:20 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=227) INFO 01-21 12:57:20 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=227) INFO 01-21 12:58:20 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).

Your logs show the server and model loaded successfully and the chat template warmup completed, but your inference request hangs with repeated "No available shared memory broadcast block found" messages. This usually means the worker processes are not making progress on the request, often due to a model or template mismatch, or a hardware/compatibility issue at execution time rather than during compilation.
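One way to narrow this down is to confirm that the API server itself is still responsive while the request hangs, using the /health and /v1/models routes listed in your launcher log. A minimal sketch, assuming the same 8080→8000 port mapping as your docker run command (the curl calls are left commented so you can run them against your own server):

```shell
# Health-check sketch: both routes appear in the launcher log above.
BASE_URL="http://localhost:8080"   # host port from `-p 8080:8000`
echo "${BASE_URL}/health"
echo "${BASE_URL}/v1/models"
# If /health answers quickly (HTTP 200) while chat completions hangs,
# the API server process is fine and the stall is in the engine/worker
# processes, consistent with the shm_broadcast messages:
# curl -s -o /dev/null -w "%{http_code}\n" "${BASE_URL}/health"
# curl -s "${BASE_URL}/v1/models"
```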

The root cause is that NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 is a base model, not a chat/instruction-tuned model, so it does not natively support chat-style prompts or templates. Even if a chat template is applied, the model may not produce the output vLLM expects, and the request can hang. For base models, use the /v1/completions endpoint instead of /v1/chat/completions (see docs). Alternatively, use an instruction-tuned or chat-tuned Nemotron variant if you want chat-style interaction.
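For reference, the completions-endpoint version of your request would look something like the sketch below: same model path and sampling parameters, but a raw `prompt` field instead of `messages`. The payload is written to a file first only to keep the quoting readable; the curl line is commented since it targets your running server.

```shell
# Base models have no chat template, so send a plain prompt to /v1/completions.
cat > /tmp/completions_payload.json <<'EOF'
{
  "model": "/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16",
  "prompt": "Hello, how are you?",
  "max_tokens": 50,
  "temperature": 0.7
}
EOF
# curl -X POST "http://localhost:8080/v1/completions" \
#      -H "Content-Type: application/json" \
#      --data @/tmp/completions_payload.json
python3 -m json.tool /tmp/completions_payload.json
```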

Would you like step-by-step guidance on switching to the completions endpoint or on using a different model?
