I want to test speculative decoding with Qwen3.5-27B-FP8. The only configuration that works for me is `{"method": "mtp", "num_speculative_tokens": 1}`. Increasing `num_speculative_tokens` to 2 results in an error. Is that expected?
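For completeness, this is the exact flag that runs for me (note the straight quotes; curly quotes will break JSON parsing). Anything above 1 speculative token errors out:

```shell
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```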
Looking at EAGLE, it appears that special EAGLE draft models are needed for it to work. The links in the documentation pointing to RedHat and some other collections don't have anything for Qwen3.5; the only model I could find is for Qwen-9B. Does that mean it's currently not possible to use EAGLE with my model?
Then I tried the draft-model method. I tried Qwen-9B and Qwen-4B as drafts, but in both cases I got `RuntimeError: The size of tensor a (5120) must match the size of tensor b (4096) at non-singleton dimension 1` or a similar error. I also wasn't able to find anything resembling a dedicated draft model on HF. Does that mean using a draft model is also not possible at the moment, or is there some other model that could work as a draft?
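My guess (unverified) is that 5120 and 4096 are the hidden sizes of the 27B target and the 4B draft, respectively. This is the kind of pre-flight check I'd like to run on the two models' `config.json` files before launching; the function and the config values below are my own assumptions, not taken from vLLM:

```python
# Hypothetical pre-flight compatibility check for draft-model speculative
# decoding. Key names mirror HF config.json; the values are assumptions
# chosen to match the sizes in the RuntimeError, not verified configs.

def report_mismatches(target_cfg: dict, draft_cfg: dict,
                      keys=("hidden_size", "vocab_size")) -> list[str]:
    """Return the config keys on which target and draft disagree."""
    return [k for k in keys if target_cfg.get(k) != draft_cfg.get(k)]

target = {"hidden_size": 5120, "vocab_size": 151936}  # assumed 27B config
draft = {"hidden_size": 4096, "vocab_size": 151936}   # assumed 4B config

print(report_mismatches(target, draft))  # ['hidden_size']
```

If the hidden sizes really do differ like this, that would at least explain the shape of the error I'm seeing.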
Does it make sense to test any other methods?
With MTP, my throughput benchmark went from 120 requests per minute to 140, an improvement of roughly 17%. Not bad for adding one config parameter, but also not the 1.5x to 2x performance gains I saw reported online. Is that considered a good improvement, or is there potential to get more?
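The arithmetic behind that figure, for reference (numbers are my own benchmark results from above):

```python
baseline_rpm = 120  # requests per minute without speculative decoding
mtp_rpm = 140       # with MTP, num_speculative_tokens = 1

speedup = mtp_rpm / baseline_rpm
print(f"{speedup:.2f}x, i.e. {(speedup - 1):.1%} more throughput")
# 1.17x, i.e. 16.7% more throughput
```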
The command I use to run vLLM in a Docker container:
```shell
Qwen/Qwen3.5-27B-FP8 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --enable-prefix-caching \
  --speculative-config '{"method": "draft_model", "num_speculative_tokens": 5, "model": "Qwen/Qwen3.5-4B"}'
```