I am running Qwen3.6-35B-A3B on 2 DGX Spark nodes. When I run the model with tensor parallel size 2, it starts up without problems, but when I send it a prompt it just logs this message:
(EngineCore pid=2696350) INFO 06-22 08:27:36 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
The part that makes the least sense to me is that when I run the same model with pipeline parallel size 2 instead, it runs smoothly.
The message "No available shared memory broadcast block found in 60 seconds" with tensor parallelism (TP=2) across 2 DGX Spark nodes most likely points to a distributed communication or synchronization problem rather than a problem with the model itself. It is a common symptom when NCCL (the backend for TP) cannot establish working cross-node GPU communication, often because of network, environment variable, or NCCL configuration issues: if a worker hangs in a collective, the engine's shared-memory broadcast can time out waiting for it. Pipeline parallelism can still work because its communication pattern is different: each stage only passes activations to the next stage, whereas TP runs all-reduce collectives inside every layer, so PP may not hit the same bottleneck.
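To make the difference in communication pattern concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a Megatron-style TP layout with two all-reduces per transformer layer in the forward pass (one after attention, one after the MLP/MoE block) and one activation transfer per pipeline stage boundary. The layer count is a hypothetical placeholder, not the real value for Qwen3.6-35B-A3B, and the function names are made up for illustration.

```python
def tp_allreduces_per_forward(num_layers: int, allreduces_per_layer: int = 2) -> int:
    """Cross-GPU all-reduces in one forward pass under tensor parallelism.

    Assumes a Megatron-style split: every layer synchronizes all TP ranks
    allreduces_per_layer times (2 is an assumption, not a measured value).
    """
    return num_layers * allreduces_per_layer


def pp_transfers_per_forward(pp_size: int) -> int:
    """Point-to-point activation transfers in one forward pass under
    pipeline parallelism: one per boundary between consecutive stages."""
    return pp_size - 1


# Hypothetical layer count, used only to show the order of magnitude.
NUM_LAYERS = 40

print("TP=2 all-reduces per forward pass:", tp_allreduces_per_forward(NUM_LAYERS))
print("PP=2 transfers per forward pass:", pp_transfers_per_forward(2))
```

Under these assumptions, every generated token with TP=2 crosses the inter-node link many times, while PP=2 crosses it once per forward pass, which is one plausible reason a marginal or misconfigured link shows up only with TP.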
Key troubleshooting steps:
Ensure you set all required NCCL environment variables for cross-node communication, such as NCCL_SOCKET_IFNAME, NCCL_IB_HCA, and possibly NCCL_P2P_DISABLE=1 if there are P2P issues.
Confirm that both nodes can communicate over the correct network interface and that firewalls are not blocking required ports.
Make sure you use the --headless flag on the follower node (--node-rank 1) and that the model is available at the same path on both nodes.
Check that your InfiniBand or high-speed interconnect is configured and working (look for [send] via NET/IB/GDRDMA in NCCL logs).
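Two of the checks above can be scripted. The following is a minimal sketch using only the Python standard library, not an official vLLM or NCCL tool: `can_connect` tests whether a TCP port on the other node is reachable, and `count_nccl_transports` tallies which transport NCCL reports after `via` in its log lines (as printed when `NCCL_DEBUG=INFO` is set). Both function names are hypothetical, and the regular expression is an assumption about the log format based on the `[send] via NET/IB/GDRDMA` pattern mentioned above.

```python
import re
import socket
from collections import Counter


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Captures the transport named after "via" in NCCL INFO channel lines,
# e.g. "... [send] via NET/IB/0/GDRDMA" is counted as "NET/IB".
_TRANSPORT = re.compile(r"\bvia\s+(NET/[A-Za-z]+|P2P|SHM)")


def count_nccl_transports(log_text: str) -> Counter:
    """Count how often each transport appears in NCCL log text."""
    return Counter(m.group(1) for m in _TRANSPORT.finditer(log_text))
```

For example, run `can_connect(<other node IP>, <port>)` from each node against the port your launch command uses, and feed the captured server log to `count_nccl_transports`. If the counts show only `NET/Socket` and no `NET/IB`, NCCL is probably falling back to plain TCP sockets instead of the high-speed interconnect.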