Question about InternVL implementation

Thank you for your interest! For in-depth troubleshooting and direct support from the vLLM community and developers, it’s highly recommended to join the vLLM Slack at slack.vllm.ai. There, you can discuss your issues in real time, get feedback from others who have worked with EasyR1 and multimodal RL training, and connect with contributors like hiyouga.

If you have specific error logs or code snippets, sharing them in Slack will help the community assist you more efficiently. Would you like more guidance on how to join or use the Slack channel?

What I noticed is that when I do not implement the token replacement scheme (i.e., my processor performs no token replacement at all), everything runs fine. But when I try to do the token replacement in my processor (-> <IMG_CONTEXT> * X), I get the runtime error.
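
Roughly, the replacement I'm doing looks like this (a simplified sketch of my processor code; the wrapper tokens and per-patch count are just examples):

```python
IMG_CONTEXT = "<IMG_CONTEXT>"

def expand_image_placeholder(prompt: str, num_patches: int,
                             tokens_per_patch: int = 256) -> str:
    # Replace the single <image> placeholder with <IMG_CONTEXT> * X,
    # where X = num_patches * tokens_per_patch.
    repeated = "<img>" + IMG_CONTEXT * (num_patches * tokens_per_patch) + "</img>"
    return prompt.replace("<image>", repeated, 1)
```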

Regarding this observation: in vLLM, the token replacement is actually handled by our internal logic. We don't expect the Hugging Face processor to implement token insertion/expansion; vLLM looks for the placeholder token(s) and repeats them the correct number of times based on the multimodal input metadata (height, width, processor merge size, etc.), as well as padding with other tokens if needed.

Here’s a code reference where this takes place: https://github.com/vllm-project/vllm/blob/45c9cb583564744e25e37b3d4ffd10958788c5b2/vllm/model_executor/models/internvl.py
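
Conceptually, what happens in that file is something like the following (a simplified sketch, not the actual vLLM code; the token-count math and function names here are illustrative):

```python
IMG_CONTEXT = "<IMG_CONTEXT>"

def num_image_tokens(height: int, width: int,
                     patch_size: int = 14, merge_size: int = 2) -> int:
    # Illustrative count of placeholder tokens for one image:
    # the patch grid size, reduced by the processor's spatial merge factor.
    return (height // patch_size // merge_size) * (width // patch_size // merge_size)

def expand_placeholders(prompt: str, image_sizes: list[tuple[int, int]]) -> str:
    # vLLM-style expansion: find each *single* placeholder and repeat it
    # the correct number of times for the corresponding image.
    parts = prompt.split(IMG_CONTEXT)
    assert len(parts) == len(image_sizes) + 1, "expect one placeholder per image"
    out = [parts[0]]
    for (h, w), tail in zip(image_sizes, parts[1:]):
        out.append(IMG_CONTEXT * num_image_tokens(h, w))
        out.append(tail)
    return "".join(out)
```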

Correct: vLLM expects to handle multimodal token replacement and expansion internally, not via the Hugging Face processor. If your processor performs token replacement (e.g., manually expanding to <IMG_CONTEXT> * X), it can cause a mismatch or double expansion, leading to runtime errors. vLLM’s internal logic automatically inserts or repeats the correct number of multimodal tokens based on the input metadata (such as image size and merge size), so external processors should not perform this step (see vLLM multimodal design).
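
For example, with the offline `LLM` API you would pass the raw, un-expanded placeholder and let vLLM do the rest (a hedged usage sketch; the model ID and exact prompt template are assumptions, so check the model card for the format your checkpoint expects):

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Keep a single placeholder in the prompt; vLLM's multimodal processor
# expands it internally based on the image metadata.
llm = LLM(model="OpenGVLab/InternVL2-2B", trust_remote_code=True)
image = Image.open("example.jpg")

outputs = llm.generate(
    {
        "prompt": "<image>\nDescribe this image.",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```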

If you let vLLM manage token replacement, everything should work as expected. Would you like more detail on how vLLM’s multimodal processor pipeline works?
