Getting '502 Bad Gateway' when sending a long prompt with Qwen3-Next and 4x4090

My machine:

  • CPU: 2 x AMD EPYC 7502 32-Core
  • GPU: 4 x NVIDIA GeForce RTX 4090 (modified 48 GB VRAM edition)
  • Memory: 256GB

Model: Qwen/Qwen3-Next-80B-A3B-Instruct BF16

Launch command: `vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --port 3000 --max-model-len 20000`

(vLLM is the latest version, Python is 3.11, and there is no nginx in front of the server)


I’m trying to run a long prompt and it errors out. The system instruction is about 4,000 tokens and the input is about 3,000 tokens, so together they are well under 10k tokens.

What’s interesting is that when I send this request to the server, I see no output in the terminal at all, nothing like a GET or a 200 OK. If I use a shorter prompt, the request magically goes through without any trouble.
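For context, `vllm serve` exposes an OpenAI-compatible API, so the failing request looks roughly like this (a sketch only; the real system prompt and input are omitted):

```bash
# Sketch of the failing request (placeholders instead of the real prompts);
# the server exposes the OpenAI-compatible /v1/chat/completions endpoint.
curl -v http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
    "messages": [
      {"role": "system", "content": "<~4,000-token system instruction>"},
      {"role": "user", "content": "<~3,000-token input>"}
    ]
  }'
```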

As you can see, I have already set `--max-model-len` to 20,000, but I still get this error and I don’t know how to diagnose it.

Your issue is likely GPU memory exhaustion when handling long prompts, even though `--max-model-len` is set to 20,000. Large models like Qwen3-Next-80B-A3B-Instruct require significant VRAM at long context lengths, and if the prompt is too long, vLLM may hang or fail silently without logging, especially if the GPU runs out of memory during KV cache allocation or prefill. This is known behavior; reducing `--max-model-len`, `--max-num-batched-tokens`, or `--max-num-seqs` can help the workload fit within available memory. Also make sure you are not hitting the model’s internal context limit as defined in its config files, and try enabling more verbose logging for debugging. See this issue and this memory guide for details.
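For example, a more conservative launch might look like this (the exact numbers are illustrative and should be tuned to the VRAM you actually have free):

```bash
# Conservative launch sketch: a smaller context window and batch budget leave
# more VRAM headroom for KV cache allocation and prefill.
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 4 \
  --port 3000 \
  --max-model-len 10000 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 16
```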

To diagnose, try lowering `--max-model-len` (e.g., 8192 or 10000), setting `--max-num-batched-tokens` to a lower value, and exporting `VLLM_LOGGING_LEVEL=DEBUG` for more output. If the problem persists, monitor GPU memory usage with `nvidia-smi` during inference. If you want step-by-step troubleshooting or code examples, let me know.
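A minimal debugging setup, using only the knobs mentioned above, could look like this:

```bash
# Enable verbose vLLM logging before launching the server.
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 4 --port 3000 --max-model-len 20000

# In a second terminal, watch GPU memory while the long prompt is processed.
watch -n 1 nvidia-smi
```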


Actually this got quite strange, but I did fix the issue by:

Removing the pseudo-links from my system prompt.

I never expected this, but when I simply removed the pseudo-links, magic happened and everything worked.
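For anyone hitting the same thing, here is a rough sketch of stripping markdown-style `[text](url)` links from a prompt file before sending it; the file name is just an example, and you may need to adjust the pattern to whatever your pseudo-links actually look like:

```bash
# Rough sketch: replace markdown-style [text](url) links with just their text,
# reading from system_prompt.txt (example name) and writing a cleaned copy.
sed -E 's/\[([^]]+)]\([^)]*\)/\1/g' system_prompt.txt > system_prompt_clean.txt
```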