Question about InternVL implementation

Thank you for your interest! For in-depth troubleshooting and direct support from the vLLM community and developers, it’s highly recommended to join the vLLM Slack at slack.vllm.ai. There, you can discuss your issues in real time, get feedback from others who have worked with EasyR1 and multimodal RL training, and connect with contributors like hiyouga.

If you have specific error logs or code snippets, sharing them in Slack will help the community assist you more efficiently. Would you like more guidance on how to join or use the Slack channel?

What I noticed is that when I do not implement the token replacement scheme (i.e., my processor performs no token replacement at all), everything runs fine. But when I try to do the token replacement in my processor (-> <IMG_CONTEXT> * X), I get the runtime error.
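
Roughly, the replacement I'm doing looks like this (a simplified sketch of my processor code; the wrapper tokens and per-patch count are just examples):

```python
IMG_CONTEXT = "<IMG_CONTEXT>"

def expand_image_placeholder(prompt: str, num_patches: int,
                             tokens_per_patch: int = 256) -> str:
    # Replace the single <image> placeholder with <IMG_CONTEXT> * X,
    # where X = num_patches * tokens_per_patch.
    repeated = "<img>" + IMG_CONTEXT * (num_patches * tokens_per_patch) + "</img>"
    return prompt.replace("<image>", repeated, 1)
```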

Regarding this observation: in vLLM, the token replacement is actually handled by our internal logic. We don't expect the Hugging Face processor to implement token insertion/expansion; vLLM looks for the placeholder token(s) and repeats them the correct number of times based on the multimodal input metadata (height, width, processor merge size, etc.), as well as padding with other tokens if needed.

Here’s a code reference where this takes place: https://github.com/vllm-project/vllm/blob/45c9cb583564744e25e37b3d4ffd10958788c5b2/vllm/model_executor/models/internvl.py
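
Conceptually, what happens in that file is something like the following (a simplified sketch, not the actual vLLM code; the token-count math and function names here are illustrative):

```python
IMG_CONTEXT = "<IMG_CONTEXT>"

def num_image_tokens(height: int, width: int,
                     patch_size: int = 14, merge_size: int = 2) -> int:
    # Illustrative count of placeholder tokens for one image:
    # the patch grid size, reduced by the processor's spatial merge factor.
    return (height // patch_size // merge_size) * (width // patch_size // merge_size)

def expand_placeholders(prompt: str, image_sizes: list[tuple[int, int]]) -> str:
    # vLLM-style expansion: find each *single* placeholder and repeat it
    # the correct number of times for the corresponding image.
    parts = prompt.split(IMG_CONTEXT)
    assert len(parts) == len(image_sizes) + 1, "expect one placeholder per image"
    out = [parts[0]]
    for (h, w), tail in zip(image_sizes, parts[1:]):
        out.append(IMG_CONTEXT * num_image_tokens(h, w))
        out.append(tail)
    return "".join(out)
```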

Correct: vLLM expects to handle multimodal token replacement and expansion internally, not via the Hugging Face processor. If your processor performs token replacement (e.g., manually expanding to <IMG_CONTEXT> * X), it can cause a mismatch or double expansion, leading to runtime errors. vLLM’s internal logic automatically inserts or repeats the correct number of multimodal tokens based on the input metadata (such as image size and merge size), so external processors should not perform this step (see vLLM multimodal design).
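
For example, with the offline `LLM` API you would pass the raw, un-expanded placeholder and let vLLM do the rest (a hedged usage sketch; the model ID and exact prompt template are assumptions, so check the model card for the format your checkpoint expects):

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Keep a single placeholder in the prompt; vLLM's multimodal processor
# expands it internally based on the image metadata.
llm = LLM(model="OpenGVLab/InternVL2-2B", trust_remote_code=True)
image = Image.open("example.jpg")

outputs = llm.generate(
    {
        "prompt": "<image>\nDescribe this image.",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```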

If you let vLLM manage token replacement, everything should work as expected. Would you like more detail on how vLLM’s multimodal processor pipeline works?
