I can’t get inference to work, and it looks like the workers might be stuck compiling.
I’m using the Docker image vllm/vllm-openai:latest, running on four 32 GB NVIDIA V100s.
When I run:
docker run -d --name vllm-server \
  --runtime nvidia --gpus all \
  -v /home/blank/workspace/huggingface:/root/.cache/huggingface \
  -p 8080:8000 --ipc=host \
  -e CUDA_VISIBLE_DEVICES=1,2,3,4 \
  vllm/vllm-openai:latest \
  --model /root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.8 \
  --max-model-len 8192
The logs look like this:
WARNING 01-21 11:58:59 [argparse_utils.py:195] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=1) INFO 01-21 11:58:59 [api_server.py:1272] vLLM API server version 0.14.0
(APIServer pid=1) INFO 01-21 11:58:59 [utils.py:263] non-default args: {'model_tag': '/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16', 'model': '/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16', 'trust_remote_code': True, 'max_model_len': 8192, 'tensor_parallel_size': 4, 'gpu_memory_utilization': 0.8, 'max_num_seqs': 8}
(APIServer pid=1) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) INFO 01-21 11:59:12 [model.py:530] Resolved architecture: NemotronHForCausalLM
(APIServer pid=1) WARNING 01-21 11:59:12 [model.py:1817] Your device 'Tesla V100-SXM2-32GB' (with compute capability 7.0) doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility.
(APIServer pid=1) WARNING 01-21 11:59:12 [model.py:1869] Casting torch.bfloat16 to torch.float16.
(APIServer pid=1) INFO 01-21 11:59:12 [model.py:1545] Using max model len 8192
(APIServer pid=1) INFO 01-21 11:59:12 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 01-21 11:59:12 [config.py:543] Updating mamba_ssm_cache_dtype to 'float16' for NemotronH model
(APIServer pid=1) INFO 01-21 11:59:12 [config.py:476] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 01-21 11:59:12 [config.py:500] Padding mamba page size by 2.64% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 01-21 11:59:12 [vllm.py:630] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 01-21 11:59:12 [vllm.py:637] Disabling NCCL for DP synchronization when using async scheduling.
(EngineCore_DP0 pid=226) INFO 01-21 11:59:25 [core.py:97] Initializing a V1 LLM engine (v0.14.0) with config: model='/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16', speculative_config=None, tokenizer='/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 16, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=226) WARNING 01-21 11:59:25 [multiproc_executor.py:880] Reducing Torch parallelism from 40 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 01-21 11:59:38 [parallel_state.py:1214] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:57347 backend=nccl
INFO 01-21 11:59:50 [parallel_state.py:1214] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:57347 backend=nccl
INFO 01-21 12:00:02 [parallel_state.py:1214] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:57347 backend=nccl
INFO 01-21 12:00:14 [parallel_state.py:1214] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:57347 backend=nccl
INFO 01-21 12:00:14 [pynccl.py:111] vLLM is using nccl==2.27.5
WARNING 01-21 12:00:15 [symm_mem.py:67] SymmMemCommunicator: Device capability 7.0 not supported, communicator is not available.
WARNING 01-21 12:00:15 [symm_mem.py:67] SymmMemCommunicator: Device capability 7.0 not supported, communicator is not available.
WARNING 01-21 12:00:15 [symm_mem.py:67] SymmMemCommunicator: Device capability 7.0 not supported, communicator is not available.
WARNING 01-21 12:00:15 [symm_mem.py:67] SymmMemCommunicator: Device capability 7.0 not supported, communicator is not available.
WARNING 01-21 12:00:15 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 01-21 12:00:15 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 01-21 12:00:15 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 01-21 12:00:15 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 01-21 12:00:15 [parallel_state.py:1425] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1
INFO 01-21 12:00:15 [parallel_state.py:1425] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 2, EP rank 2
INFO 01-21 12:00:15 [parallel_state.py:1425] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 3, EP rank 3
INFO 01-21 12:00:15 [parallel_state.py:1425] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(Worker_TP0 pid=299) INFO 01-21 12:00:17 [gpu_model_runner.py:3808] Starting to load model /root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16...
(Worker_TP3 pid=322) ERROR 01-21 12:00:52 [fa_utils.py:82] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(Worker_TP1 pid=302) ERROR 01-21 12:00:52 [fa_utils.py:82] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(Worker_TP0 pid=299) ERROR 01-21 12:00:52 [fa_utils.py:82] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(Worker_TP2 pid=311) ERROR 01-21 12:00:52 [fa_utils.py:82] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(Worker_TP0 pid=299) INFO 01-21 12:00:52 [cuda.py:351] Using TRITON_ATTN attention backend out of potential backends: ('TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors checkpoint shards: 0% Completed | 0/13 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 8% Completed | 1/13 [00:01<00:20, 1.73s/it]
Loading safetensors checkpoint shards: 15% Completed | 2/13 [00:03<00:19, 1.78s/it]
Loading safetensors checkpoint shards: 23% Completed | 3/13 [00:09<00:38, 3.87s/it]
Loading safetensors checkpoint shards: 31% Completed | 4/13 [00:11<00:26, 3.00s/it]
Loading safetensors checkpoint shards: 38% Completed | 5/13 [00:13<00:19, 2.50s/it]
Loading safetensors checkpoint shards: 46% Completed | 6/13 [00:15<00:15, 2.28s/it]
Loading safetensors checkpoint shards: 54% Completed | 7/13 [00:16<00:12, 2.12s/it]
Loading safetensors checkpoint shards: 62% Completed | 8/13 [00:18<00:10, 2.03s/it]
Loading safetensors checkpoint shards: 69% Completed | 9/13 [00:19<00:07, 1.76s/it]
Loading safetensors checkpoint shards: 77% Completed | 10/13 [00:21<00:05, 1.72s/it]
Loading safetensors checkpoint shards: 85% Completed | 11/13 [00:23<00:03, 1.70s/it]
Loading safetensors checkpoint shards: 92% Completed | 12/13 [00:24<00:01, 1.72s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:26<00:00, 1.72s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:26<00:00, 2.05s/it]
(Worker_TP0 pid=299)
(Worker_TP0 pid=299) INFO 01-21 12:01:19 [default_loader.py:291] Loading weights took 26.75 seconds
(Worker_TP0 pid=299) INFO 01-21 12:01:20 [gpu_model_runner.py:3905] Model loading took 14.76 GiB memory and 62.245573 seconds
(Worker_TP0 pid=299) INFO 01-21 12:01:55 [backends.py:644] Using cache directory: /root/.cache/vllm/torch_compile_cache/3678f38b72/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=299) INFO 01-21 12:01:55 [backends.py:704] Dynamo bytecode transform time: 8.08 s
(Worker_TP0 pid=299) INFO 01-21 12:02:02 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(Worker_TP3 pid=322) INFO 01-21 12:02:02 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(Worker_TP2 pid=311) INFO 01-21 12:02:02 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(Worker_TP1 pid=302) INFO 01-21 12:02:03 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(Worker_TP0 pid=299) WARNING 01-21 12:02:06 [fused_moe.py:1090] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=464,device_name=Tesla_V100-SXM2-32GB.json
(Worker_TP0 pid=299) INFO 01-21 12:02:46 [backends.py:278] Compiling a graph for compile range (1, 2048) takes 46.93 s
(Worker_TP0 pid=299) INFO 01-21 12:02:46 [monitor.py:34] torch.compile takes 55.01 s in total
(EngineCore_DP0 pid=226) INFO 01-21 12:02:47 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=299) INFO 01-21 12:02:48 [gpu_worker.py:358] Available KV cache memory: 10.3 GiB
(EngineCore_DP0 pid=226) WARNING 01-21 12:02:48 [kv_cache_utils.py:1047] Add 1 padding layers, may waste at most 4.35% KV cache memory
(EngineCore_DP0 pid=226) INFO 01-21 12:02:48 [kv_cache_utils.py:1305] GPU KV cache size: 719,712 tokens
(EngineCore_DP0 pid=226) INFO 01-21 12:02:48 [kv_cache_utils.py:1310] Maximum concurrency for 8,192 tokens per request: 330.80x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 5/5 [00:03<00:00, 1.46it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 4/4 [00:07<00:00, 1.97s/it]
(Worker_TP0 pid=299) INFO 01-21 12:03:01 [gpu_model_runner.py:4856] Graph capturing finished in 12 secs, took 0.19 GiB
(EngineCore_DP0 pid=226) INFO 01-21 12:03:01 [core.py:273] init engine (profile, create kv cache, warmup model) took 74.04 seconds
(EngineCore_DP0 pid=226) INFO 01-21 12:03:02 [vllm.py:630] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 01-21 12:03:03 [api_server.py:1014] Supported tasks: ['generate']
(APIServer pid=1) INFO 01-21 12:03:03 [serving_chat.py:182] Warming up chat template processing...
(APIServer pid=1) INFO 01-21 12:03:04 [chat_utils.py:599] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222] Chat template warmup failed
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222] Traceback (most recent call last):
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 201, in warmup
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222] await self._preprocess_chat(
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 1227, in _preprocess_chat
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222] request_prompt = apply_hf_chat_template(
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222] ^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/chat_utils.py", line 1826, in apply_hf_chat_template
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222] raise ChatTemplateResolutionError(
(APIServer pid=1) ERROR 01-21 12:03:04 [serving_chat.py:222] vllm.entrypoints.chat_utils.ChatTemplateResolutionError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.
(APIServer pid=1) /usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py:222: RuntimeWarning: coroutine 'AsyncMultiModalItemTracker.all_mm_data' was never awaited
(APIServer pid=1) logger.exception("Chat template warmup failed")
(APIServer pid=1) RuntimeWarning: Enable tracemalloc to get the object allocation traceback
(APIServer pid=1) INFO 01-21 12:03:04 [api_server.py:1346] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:38] Available routes are:
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 01-21 12:03:04 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
Note that I only see progress updates from Worker_TP0. When I then attempt to make an inference call, I never get any output, and the logs repeatedly show:
(EngineCore_DP0 pid=230) INFO 01-21 12:28:14 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
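For reference, the call I'm making is just a plain OpenAI-style completions request against the mapped port (8080 on the host, per the docker run above, with the model name equal to the model path as shown in the server's own config log). A minimal sketch of it, with the actual POST commented out since it simply hangs with no response:

```python
import json
import urllib.request

# Request body for the /v1/completions endpoint; the model name is the
# local path, matching served_model_name in the engine config log above.
payload = {
    "model": "/root/.cache/huggingface/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16",
    "prompt": "Hello, world",
    "max_tokens": 32,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# resp = urllib.request.urlopen(req)  # never returns; the shm_broadcast
#                                     # warning above just repeats instead
```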
Did I simply not wait long enough for compilation to finish, or is there some other issue at hand?