Qwen3.5-27B-FP8 Speculative Decoding

I want to test speculative decoding with Qwen3.5-27B-FP8. The only configuration that works for me is `{"method": "mtp", "num_speculative_tokens": 1}`. Increasing `num_speculative_tokens` to 2 results in an error. Is that expected?
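For reference, this is how the working MTP setup is passed to the server (a sketch mirroring the draft-model invocation shown further down; only the `--speculative-config` flag differs):

```shell
vllm serve Qwen/Qwen3.5-27B-FP8 \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```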

Looking at EAGLE, it appears that special EAGLE draft models are required for it to work. The links in the documentation pointing to the Red Hat collection and some other collections don't have anything for Qwen3.5. The only model I could find is for Qwen-9B. Does that mean it's currently not possible to use EAGLE with my model?

Then I tried to use a draft model. I tried Qwen-9B and Qwen-4B, but in both cases I got `RuntimeError: The size of tensor a (5120) must match the size of tensor b (4096) at non-singleton dimension 1` or a similar error. I also wasn't able to find anything resembling a dedicated draft model on HF. Does that mean using a draft model is also not possible at the moment? Or is there some other model that could work as a draft model?
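One plausible reading of the error (5120 vs. 4096 at dimension 1) is a hidden-size mismatch between the target and the candidate draft models. A hypothetical illustration of that check; the concrete sizes here are assumptions, and the real values live in each model's `config.json`:

```python
# Hypothetical illustration of the suspected incompatibility: elementwise
# ops between hidden states of different widths fail, which is exactly
# what the RuntimeError reports at dimension 1.
TARGET_HIDDEN_SIZE = 5120  # assumed hidden size of the 27B target
DRAFT_HIDDEN_SIZE = 4096   # assumed hidden size of the tried draft model

def hidden_sizes_compatible(target: int, draft: int) -> bool:
    """A draft whose hidden states are combined with the target's must
    match the target's hidden size (or be adapted to it)."""
    return target == draft

print(hidden_sizes_compatible(TARGET_HIDDEN_SIZE, DRAFT_HIDDEN_SIZE))  # False
```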

Does it make sense to test any other methods?

With MTP my throughput benchmark result went from 120 requests per minute to 140, which is about 17%. Not bad for adding one config parameter, but also not the 1.5x to 2x performance gains I saw reported online. Is that considered a good improvement, or is there potential to get more?
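The observed gain works out as follows (plain arithmetic on the numbers above):

```python
baseline_rpm = 120  # requests per minute without speculative decoding
mtp_rpm = 140       # requests per minute with MTP, num_speculative_tokens=1

speedup = mtp_rpm / baseline_rpm
print(f"{speedup:.3f}x  (~{(speedup - 1) * 100:.1f}% more throughput)")
# → 1.167x  (~16.7% more throughput)
```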

The command I use to run vLLM in a Docker container:

```shell
Qwen/Qwen3.5-27B-FP8 \
      --port 8000 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --reasoning-parser qwen3 \
      --enable-prefix-caching \
      --speculative-config '{"method": "draft_model", "num_speculative_tokens": 5, "model": "Qwen/Qwen3.5-4B"}'
```

RunLLM is wrong here. You should be able to use MTP with multiple tokens, and that will be the most performant way to do it.

Please create a GitHub issue and include the error log so we can triage.

Qwen3.5 uses hybrid linear attention throughout. Its `conv_states` and `recurrent_states` do not have a sequence-length dimension, so they cannot be selectively accepted the way a traditional KV cache can. In other words, if your draft contains 4 tokens but the target LLM only accepts the first 2 during parallel verification, we have no way to restore the corresponding `conv_states` and `recurrent_states`.
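The difference can be sketched in a few lines (a toy illustration with made-up shapes and a made-up update rule, not Qwen3.5's actual state math):

```python
# A traditional KV cache keeps one entry per position, so rejecting draft
# tokens is just truncating along the sequence axis.
kv_cache = [f"kv_for_token_{i}" for i in range(10)]  # seq_len = 10
num_drafted, num_accepted = 4, 2  # target accepted only 2 of 4 draft tokens
kv_cache = kv_cache[: 10 - (num_drafted - num_accepted)]  # drop rejected tail
print(len(kv_cache))  # 8

# A linear-attention recurrent state is a single running summary with no
# sequence axis: once it has been updated with 4 draft tokens, no slice
# recovers the state "as of token 2".
recurrent_state = 0.0
for token_value in [1.0, 2.0, 3.0, 4.0]:  # 4 draft tokens folded in
    recurrent_state = 0.5 * recurrent_state + token_value  # destructive mix
print(recurrent_state)  # → 6.125, irreversibly contains all 4 contributions
```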

I think this may be the reason why only MTP-1 is supported.