Qwen3.5-27B-FP8 Speculative Decoding

I want to test speculative decoding with Qwen3.5-27B-FP8. The only configuration that works for me is `{"method": "mtp", "num_speculative_tokens": 1}`. Increasing `num_speculative_tokens` to 2 results in an error. Is that expected?
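For reference, this is how the working MTP setup is passed to the server (a sketch mirroring the draft-model invocation shown further down; only the `--speculative-config` flag differs):

```shell
vllm serve Qwen/Qwen3.5-27B-FP8 \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```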

Looking at EAGLE, it appears that special EAGLE draft models are required for it to work. The links in the documentation pointing to the Red Hat collection and some other collections don't have anything for Qwen3.5. The only model I could find is for Qwen-9B. Does that mean it's currently not possible to use EAGLE with my model?

Then I tried to use a draft model. I tried Qwen-9B and Qwen-4B, but in both cases I got `RuntimeError: The size of tensor a (5120) must match the size of tensor b (4096) at non-singleton dimension 1` or a similar error. I also wasn't able to find anything resembling a dedicated draft model on HF. Does that mean using a draft model is also not possible at the moment? Or is there some other model that could work as a draft model?
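One plausible reading of the error (5120 vs. 4096 at dimension 1) is a hidden-size mismatch between the target and the candidate draft models. A hypothetical illustration of that check; the concrete sizes here are assumptions, and the real values live in each model's `config.json`:

```python
# Hypothetical illustration of the suspected incompatibility: elementwise
# ops between hidden states of different widths fail, which is exactly
# what the RuntimeError reports at dimension 1.
TARGET_HIDDEN_SIZE = 5120  # assumed hidden size of the 27B target
DRAFT_HIDDEN_SIZE = 4096   # assumed hidden size of the tried draft model

def hidden_sizes_compatible(target: int, draft: int) -> bool:
    """A draft whose hidden states are combined with the target's must
    match the target's hidden size (or be adapted to it)."""
    return target == draft

print(hidden_sizes_compatible(TARGET_HIDDEN_SIZE, DRAFT_HIDDEN_SIZE))  # False
```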

Does it make sense to test any other methods?

With MTP my throughput benchmark result went from 120 requests per minute to 140, which is about 17%. Not bad for adding one config parameter, but also not the 1.5x to 2x performance gains I saw reported online. Is that considered a good improvement, or is there potential to get more?
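The observed gain works out as follows (plain arithmetic on the numbers above):

```python
baseline_rpm = 120  # requests per minute without speculative decoding
mtp_rpm = 140       # requests per minute with MTP, num_speculative_tokens=1

speedup = mtp_rpm / baseline_rpm
print(f"{speedup:.3f}x  (~{(speedup - 1) * 100:.1f}% more throughput)")
# → 1.167x  (~16.7% more throughput)
```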

The command I use to run vLLM in a Docker container:

```shell
Qwen/Qwen3.5-27B-FP8 \
      --port 8000 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --reasoning-parser qwen3 \
      --enable-prefix-caching \
      --speculative-config '{"method": "draft_model", "num_speculative_tokens": 5, "model": "Qwen/Qwen3.5-4B"}'
```

RunLLM is wrong here. You should be able to use MTP with multiple tokens, and that will be the most performant way to do it.

Please create a GitHub issue and include the error log so we can triage.

Qwen3.5 uses hybrid linear attention throughout. Its `conv_states` and `recurrent_states` do not have a sequence-length dimension, so they cannot be selectively accepted the way a traditional KV cache can. In other words, if your draft contains 4 tokens but the target LLM only accepts the first 2 during parallel verification, we have no way to restore the corresponding `conv_states` and `recurrent_states`.
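The difference can be sketched in a few lines (a toy illustration with made-up shapes and a made-up update rule, not Qwen3.5's actual state math):

```python
# A traditional KV cache keeps one entry per position, so rejecting draft
# tokens is just truncating along the sequence axis.
kv_cache = [f"kv_for_token_{i}" for i in range(10)]  # seq_len = 10
num_drafted, num_accepted = 4, 2  # target accepted only 2 of 4 draft tokens
kv_cache = kv_cache[: 10 - (num_drafted - num_accepted)]  # drop rejected tail
print(len(kv_cache))  # 8

# A linear-attention recurrent state is a single running summary with no
# sequence axis: once it has been updated with 4 draft tokens, no slice
# recovers the state "as of token 2".
recurrent_state = 0.0
for token_value in [1.0, 2.0, 3.0, 4.0]:  # 4 draft tokens folded in
    recurrent_state = 0.5 * recurrent_state + token_value  # destructive mix
print(recurrent_state)  # → 6.125, irreversibly contains all 4 contributions
```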

I think this may be the reason why only MTP-1 is supported.