Is there a plan for EVS to support Qwen3VL in response to the issue of sparse video tokens?

Currently, EVS supports Qwen2.5-VL. Are there plans to support Qwen3-VL? When I tried upgrading EVS to Qwen3-VL, I found that the get_video_replacement_qwen3vl method in qwen3_vl.py calculates the number of video token ids up front. Moreover, Qwen3-VL's new interleaved timestamp feature (`<{curr_time:.1f} seconds>`) means each timestamp marker is followed by a specific number of video tokens for its frame.

However, with EVS token trimming, the number of tokens each frame keeps is only known after budget allocation.

This creates a chicken-and-egg problem. Is there a good solution?
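To make the conflict concrete, here is a minimal numeric sketch (illustrative numbers only, not real vLLM code): the prompt builder must commit to per-frame placeholder counts before the vision encoder runs, while EVS only decides the kept counts afterwards.

```python
# Sketch of the mismatch (hypothetical numbers, not real vLLM code).
# The prompt builder must commit to per-frame token counts up front,
# but EVS only decides how many tokens each frame keeps after encoding.

frames = 4
tokens_per_frame_uniform = 16           # what the prompt builder assumes
placeholders = frames * tokens_per_frame_uniform

# EVS later keeps a content-dependent number of tokens per frame:
kept_per_frame = [16, 7, 3, 10]         # hypothetical pruning result
pruned_total = sum(kept_per_frame)

mismatch = placeholders - pruned_total  # nonzero => prompt/embedding mismatch
print(placeholders, pruned_total, mismatch)
```

Any nonzero mismatch means the fixed placeholder layout no longer lines up with the pruned embeddings, which is exactly the paradox described above.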

Yes, EVS (Efficient Video Sampling) support for Qwen3-VL has been implemented in vLLM, but the integration is nontrivial due to Qwen3-VL’s use of interleaved timestamp tokens and video tokens. The main challenge is that the number of tokens per frame after EVS pruning is only known after the vision encoder runs, but the prompt (including timestamp tokens and video token placeholders) must be constructed beforehand. This can cause a mismatch between the prompt’s token structure and the actual pruned video tokens, potentially leading to misalignment and degraded model performance if not handled carefully. According to the merged PR for Qwen3-VL EVS support and related discussion, the current solution uses an is_embed mask to track which prompt positions should be filled with video embeddings, and then reconstructs the correct mapping after EVS pruning.
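The is_embed idea can be sketched as follows. This is a simplified stand-alone illustration, not vLLM's actual code; the placeholder token string and helper name are made up:

```python
# Simplified sketch of the is_embed mask idea (not vLLM's actual code).
# The prompt interleaves per-frame timestamp text with video placeholder
# tokens; is_embed marks which positions are filled by vision embeddings.

VIDEO_TOKEN = "<|video_pad|>"  # illustrative placeholder token

def build_prompt_with_mask(timestamps, tokens_per_frame):
    """Build an interleaved prompt and a boolean is_embed mask.

    timestamps: per-frame times in seconds.
    tokens_per_frame: placeholder count allocated to each frame.
    """
    prompt, is_embed = [], []
    for t, n in zip(timestamps, tokens_per_frame):
        prompt.append(f"<{t:.1f} seconds>")
        is_embed.append(False)           # timestamp text: not an embedding slot
        prompt.extend([VIDEO_TOKEN] * n)
        is_embed.extend([True] * n)      # these positions get video embeddings
    return prompt, is_embed

prompt, mask = build_prompt_with_mask([0.0, 0.5], [3, 2])
```

Because the mask records exactly which positions are embedding slots, the pruned embeddings can later be scattered back into the right places without relying on a uniform per-frame count.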

However, this approach is still considered tricky and error-prone, as noted by vLLM maintainers, because the prompt update logic in Qwen3-VL was originally designed for uniform token allocation, not the non-uniform, content-aware allocation of EVS. The recommended solution is to use the is_embed mask to extract actual frame offsets and token counts after pruning, ensuring that the prompt and embeddings remain aligned. This logic is implemented in methods like _extract_frame_offsets_from_mask and _get_actual_frame_token_counts in the Qwen3-VL model code. For more details and code examples, see Qwen3-VL EVS support PR and Qwen3-VL model documentation.
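The recovery step can be sketched like this. These are simplified stand-alone functions mirroring the idea behind _extract_frame_offsets_from_mask and _get_actual_frame_token_counts; the real vLLM implementations differ in detail:

```python
# Simplified sketch of recovering per-frame structure from an is_embed mask
# after EVS pruning (illustrative; the real vLLM helpers differ in detail).

def extract_frame_offsets(is_embed):
    """Return the start offset of each contiguous run of True values.

    Each run corresponds to one frame's surviving embedding slots, since
    frames are separated by timestamp text positions (False)."""
    offsets, prev = [], False
    for i, flag in enumerate(is_embed):
        if flag and not prev:
            offsets.append(i)
        prev = flag
    return offsets

def actual_frame_token_counts(is_embed):
    """Return the length of each contiguous True run (tokens kept per frame)."""
    counts, run = [], 0
    for flag in is_embed:
        if flag:
            run += 1
        elif run:
            counts.append(run)
            run = 0
    if run:
        counts.append(run)
    return counts

# Example: two frames where EVS kept 3 tokens for frame 0 and 1 for frame 1.
mask = [False, True, True, True, False, True]
```

With offsets and counts recovered from the mask itself, the prompt and the pruned embeddings stay aligned even though EVS allocates tokens non-uniformly.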

Would you like a step-by-step breakdown of how the is_embed mask and prompt alignment are handled in the latest vLLM implementation?
