Question about InternVL implementation

Hello,
I am working with the InternVL model. I am new to vLLM and would appreciate clarification on the following questions:

  1. Why do you implement your own InternVLProcessor rather than using the one from Transformers, as is done with Qwen? Is the reason that the Transformers InternVLProcessor uses the GotOcr2ImageProcessorFast processor to do the dynamic patching, while in InternVL’s documentation the patching is done directly?

  2. I started a pdb session at vllm/v1/engine/processor.py line 231 and performed the following pdb steps:

  1. processed_inputs: ProcessorInputs = self.input_preprocessor.preprocess( …
  2. entered the preprocess function (vllm/inputs/preprocess.py line 729)
  3. entered self._process_encoder_decoder_prompt at line 743
  4. entered self._prompt_to_llm_inputs (line 305)
  5. entered self._process_multimodal (line 347)
  6. entered mm_processor.apply (line 275). Note that mm_processor is of type vllm.model_executor.models.internvl.InternVLMultiModalProcessor. So does this mean the processor used is this one rather than the HF processor, even though the names of the functions I can step into from here suggest the latter:
  7. entered self._cached_apply_hf_processor (line 1614)
  8. entered self._apply_hf_processor_main (line 1356), whose description is “Apply the HF processor on the prompt text and multi-modal data.”

If my understanding is incorrect and it is in fact the HF processors, why is InternVL using the HF processor rather than the InternVLProcessor vllm implemented?

  3. When testing vLLM with FSDP, did anyone ever receive the error:
    RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet.
    If you’re using torch.compile/export/fx, it is likely that we are erroneously tracing into a custom kernel. To fix this, please wrap the custom kernel into an opaque custom op. Please see the following for details: PyTorch Custom Operators — PyTorch Tutorials 2.8.0+cu128 documentation
    If you’re using Caffe2, Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
    This is specifically happening when I am trying to access the weights and bias of the layer norm.

    Now, I know that vLLM uses its own InternVLProcessor for prompt updates and InternVLMultiModalProcessor for the processing of multimodal prompts, so I tried to also use the InternVLProcessor from vLLM to process my prompts before passing them into the vLLM engine, but the error persists. I have also tried using the InternVLProcessor from transformers (version 4.52.2), but the issue remains. I have also tried disabling gradient checkpointing.

    I am new to vLLM and FSDP, so I am unsure what the issue could be or how to resolve it. I am using vLLM 0.8.4 and torch 2.6.

vLLM implements its own InternVLProcessor and InternVLMultiModalProcessor because the InternVL model does not define its own HuggingFace (HF) processor, unlike Qwen, and requires custom logic for dynamic patching and prompt updates that are not handled by the standard HF processors. The vLLM processor directly implements the image patching and prompt token replacement logic, as described in the InternVL model documentation and source code. When you step through the vLLM code, the processor used is indeed vllm.model_executor.models.internvl.InternVLMultiModalProcessor, which wraps and calls the custom vLLM InternVLProcessor for multimodal input processing, not the HF processor. This is necessary because the HF ecosystem does not provide a suitable processor for InternVL, and vLLM’s implementation is tailored for its inference optimizations and multimodal pipeline (source, source).

Regarding the FSDP error about unallocated tensor data, this is a PyTorch issue that can occur when tracing into custom kernels or using lazy tensor allocation, especially with torch.compile or FSDP. The error is not specific to vLLM or InternVL, and the recommended solution is to wrap custom kernels as opaque custom ops, as described in the PyTorch custom ops documentation. Disabling gradient checkpointing or switching between different processor implementations will not resolve this, as it is a lower-level memory allocation issue in PyTorch. Would you like a step-by-step breakdown of how vLLM’s InternVL processing pipeline works or more detail on the FSDP error?

  1. You mention “model does not define its own HuggingFace (HF) processor, unlike Qwen”; however, as mentioned in my post, transformers does have an InternVLProcessor as of version 4.52.2. Please see link: transformers/src/transformers/models/internvl/processing_internvl.py at v4.52.2 · huggingface/transformers · GitHub . And the image processor used is GotOcr2ImageProcessorFast.from_pretrained(model_path), which appears to implement the same dynamic patching ( transformers/src/transformers/models/got_ocr2/image_processing_got_ocr2_fast.py at v4.52.2 · huggingface/transformers · GitHub ), specifically the crop_image function.
  2. “Would you like a step-by-step breakdown of how vLLM’s InternVL processing pipeline works or more detail on the FSDP error?” Yes please. And can you please elaborate more on what you mean by it being an intrinsic PyTorch issue?

You are correct: as of transformers v4.52.2, there is a HuggingFace InternVLProcessor that uses GotOcr2ImageProcessorFast for dynamic patching (processing_internvl.py). However, vLLM still implements its own InternVLProcessor and multimodal pipeline because its inference engine requires precise control over prompt token updates, multimodal placeholder handling, and batch processing optimizations that are not fully compatible with the HuggingFace processor interface. The vLLM implementation directly manages prompt updates and image patching to ensure correct alignment between placeholder tokens and image features, as described in vLLM InternVLProcessor docs and source code. This is necessary for vLLM’s chunked prefill, prefix caching, and efficient multimodal batching (see design doc).

Step-by-step breakdown of vLLM’s InternVL processing pipeline (a short usage sketch follows the list):

  1. The user prompt and images are passed to vLLM.
  2. vLLM’s InternVLMultiModalProcessor (not the HF processor) is called, which:
    • Calls the custom InternVLProcessor to preprocess images (dynamic patching, resizing, etc.) and update the prompt with the correct number of placeholder tokens.
    • Tokenizes the updated prompt and prepares the multimodal tensors (e.g., pixel_values_flat, image_num_patches).
  3. The processed prompt tokens and multimodal tensors are passed to the model for inference, ensuring that the number and position of image tokens match the model’s expectations.
  4. This pipeline is designed to work with vLLM’s optimizations for batching, chunked prefill, and prefix caching (see design details).
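
To make the entry point concrete, here is a hedged usage sketch. The model name, image path, and the simplified prompt are placeholders (a real request would go through the model’s chat template); calling llm.generate with multi_modal_data is what eventually drives the InternVLMultiModalProcessor.apply() step described above.

# Hedged sketch: passing a prompt plus an image into vLLM, which internally
# runs the InternVL multimodal processing pipeline described above.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="OpenGVLab/InternVL2-8B", trust_remote_code=True)
image = Image.open("example.jpg").convert("RGB")

outputs = llm.generate(
    {
        # Simplified prompt; in practice, apply the model's chat template.
        "prompt": "<image>\nDescribe this image.",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)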

About the FSDP error:
The error “The tensor has a non-zero number of elements, but its data is not allocated yet” is a PyTorch runtime error that can occur when using FSDP, torch.compile, or tracing with custom kernels. It is not specific to vLLM or InternVL, but rather a result of how PyTorch handles lazy tensor allocation and tracing. If a custom kernel or operation is not properly wrapped as an opaque custom op, PyTorch’s tracing machinery may attempt to access unallocated memory, leading to this error. The recommended solution is to wrap such kernels as custom ops, as described in the PyTorch custom ops documentation. This is an intrinsic PyTorch issue because it relates to how the framework manages memory and tracing, not to vLLM or the model implementation.

Would you like a more detailed walkthrough of the vLLM InternVL code, or more troubleshooting steps for the FSDP error?

Can you be a little more detailed about what you mean by this? The only differences I could find between the transformers InternVLProcessor and the InternVLProcessor from vLLM were:

  1. the method to do the dynamic patching
  2. the image token that was expected for input — i.e., “<IMG_CONTEXT>” for transformers and “<image>” for vLLM. However, simply creating a subclass that inherits from the transformers InternVLProcessor to modify the expected input token could work (a rough sketch of what I mean follows this list). But I can see that this may not matter, because vLLM loads the processor directly from HF when needed rather than a processor the user passes in (actually, is it even possible to give vLLM the processor to use, even if it is not advised?)
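
Roughly, what I have in mind is the sketch below. I would probably make it a thin wrapper rather than a true subclass so I do not depend on the exact constructor of the HF InternVLProcessor; the only assumption is that the HF processor is callable with text/images, and the token mapping follows the difference described above.

# Rough, untested sketch: accept vLLM-style <image> placeholders and rewrite
# them to the <IMG_CONTEXT> token the HF processor expects before delegating.
from transformers import AutoProcessor

class PatchedInternVLProcessor:
    def __init__(self, model_path: str):
        self.inner = AutoProcessor.from_pretrained(model_path)

    def __call__(self, text=None, images=None, **kwargs):
        if isinstance(text, str):
            text = text.replace("<image>", "<IMG_CONTEXT>")
        elif isinstance(text, list):
            text = [t.replace("<image>", "<IMG_CONTEXT>") for t in text]
        return self.inner(text=text, images=images, **kwargs)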

If I understand the link (Multi-Modal Data Processing - vLLM) correctly, InternVLMultiModalProcessor is the main processor, and it calls InternVLProcessor to update the prompts (update the mm_placeholders) and also to process the multimodal prompt.

And yes, please help me troubleshoot the FSDP error. Where should I perform the custom-op wrapping?

The main reason vLLM implements its own InternVLProcessor and multimodal pipeline, rather than using the HuggingFace (HF) InternVLProcessor directly, is to ensure compatibility with vLLM’s unique inference optimizations—specifically, chunked prefill, prefix caching, and efficient multimodal batching. These require precise, model-specific control over how prompt tokens and multimodal placeholders are inserted, tracked, and aligned with the processed image features. The HF processor interface is designed for generality and may not expose the fine-grained hooks or deterministic token alignment needed for vLLM’s batching and scheduling logic. For example, vLLM’s processor must guarantee that the number and position of placeholder tokens in the prompt exactly match the number and order of image features, which is critical for correct attention mask construction and memory management during high-throughput inference (see design doc).

While it is technically possible to subclass the HF InternVLProcessor and adapt it for vLLM (e.g., to change the expected image token), vLLM does not currently support passing a user-defined processor instance directly; it loads and manages processors internally to ensure consistency and compatibility with its scheduling and memory profiling. The InternVLMultiModalProcessor in vLLM orchestrates both prompt updates and multimodal data processing, calling the custom InternVLProcessor as needed to maintain this alignment (source). This design also allows vLLM to cache processor outputs and handle tokenized prompts, dummy text, and prompt updates in a way that is tightly integrated with its engine.

FSDP error troubleshooting:
The error you’re seeing is a PyTorch runtime issue that occurs when a tensor’s data is not allocated before access, often due to tracing or checkpointing with custom kernels. To resolve this, you need to wrap any custom CUDA/C++ kernels (or Python functions that call into them) as opaque custom ops, so PyTorch’s tracing machinery does not attempt to step into their internals. This is typically done at the point where you define or register the custom op, not in the model code itself. For more details and examples, see the PyTorch custom ops documentation. If you’re not using custom CUDA/C++ kernels directly, check if any third-party library or model component (e.g., a custom layer norm or vision op) is doing so, and ensure it is properly registered as a custom op.
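
Concretely, the registration pattern looks roughly like the sketch below. The op name, the layer-norm body, and the signature are placeholders (not an existing vLLM or PAPO op); it only illustrates the mechanism, and torch.library.custom_op requires PyTorch 2.4 or newer.

# Hedged sketch: wrapping a kernel as an opaque custom op so that
# torch.compile/export do not trace into its internals.
import torch
from torch.library import custom_op

@custom_op("mylib::fused_layer_norm", mutates_args=())
def fused_layer_norm(x: torch.Tensor, weight: torch.Tensor,
                     bias: torch.Tensor, eps: float) -> torch.Tensor:
    # Eager implementation; in practice this would call into the custom kernel.
    return torch.nn.functional.layer_norm(x, x.shape[-1:], weight, bias, eps)

@fused_layer_norm.register_fake
def _(x, weight, bias, eps):
    # "Fake" (meta) implementation: only describes the output shape/dtype, so
    # tracing never touches real (possibly unallocated) storage.
    return torch.empty_like(x)

Under torch.compile or export, calls to fused_layer_norm are then treated as a single opaque node instead of being decomposed and traced into.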

Would you like more guidance on debugging which op is causing the FSDP error?

Hey @SStoica12! A few clarifications:

  1. The support for InternVL was added way before transformers added support for the model (meaning you couldn’t even use trust_remote_code to run the model with transformers). That’s why the input processing logic differs; it actually has nothing to do with the batching logic. We documented this when we originally added support for the model: vllm/vllm/model_executor/models/internvl.py at ed16d0f26fbd1fecde8285b82d898bfd79ceaa87 · vllm-project/vllm · GitHub

  2. This means all the default logic of calling the “huggingface processor” is in fact calling our overridden implementation. You can see the override here: vllm/vllm/model_executor/models/internvl.py at ed16d0f26fbd1fecde8285b82d898bfd79ceaa87 · vllm-project/vllm · GitHub. We actually return the InternVLProcessor that we defined instead of loading the AutoProcessor class from transformers (which would be the default option).

Could you clarify what you mean by “testing vllm with FSDP”? Could you give us a repro example so we can see if the error is coming from somewhere else? FWIW, today input processing for multimodal models takes place only on the CPU, so I wonder why this would affect model topology on the GPU.

It seems likely that the main reason for vLLM’s custom InternVLProcessor is historical: vLLM added InternVL support before HuggingFace Transformers did, so vLLM needed its own processor implementation. As documented in vllm/model_executor/models/internvl.py#L299-L306, vLLM overrides the default processor loading logic to always use its own InternVLProcessor, even when the code path appears to call a “HuggingFace processor.” This override is implemented at vllm/model_executor/models/internvl.py#L883-L890, where vLLM returns its own processor instead of the default AutoProcessor from transformers.

Regarding FSDP, could you clarify what you mean by “testing vllm with FSDP”? If you can provide a minimal reproducible example, it would help determine if the error is related to vLLM or something else. Also, as of now, all multimodal input processing in vLLM is performed on the CPU, so it should not affect the GPU model topology or FSDP directly.

Would you like more detail on how the processor override works, or help with a specific FSDP error?

Thank you for the detailed context. Based on your description and the code structure in PAPO/EasyR1, the key issue seems to be that when your processor performs token replacement (e.g., replacing <image> with <IMG_CONTEXT>), you encounter a PyTorch FSDP runtime error related to unallocated tensor data, especially when accessing layer norm weights/biases. When you skip token replacement, the error does not occur.

This suggests that the problem is not with vLLM’s or Ray’s integration per se, but with how the processor’s token replacement logic interacts with the model’s expected input format and FSDP’s memory management. If the prompt tokens and multimodal placeholders do not exactly match the model’s expectations (e.g., number, order, or type of image tokens), it can lead to shape mismatches or uninitialized tensors during distributed training or inference, which FSDP is sensitive to. This is especially true for custom normalization layers or models with strict multimodal token alignment requirements (see vLLM multimodal design).

To debug:

  • Ensure your processor’s token replacement logic produces prompts and multimodal tensors that exactly match the model’s requirements (see vLLM InternVLProcessor source).
  • Check that the number of <IMG_CONTEXT> tokens matches the number of image patches/features expected by the model (a quick check is sketched after this list).
  • If using FSDP, make sure all model parameters (including custom norm layers) are initialized before any forward pass, and that no tensors are lazily allocated after FSDP wrapping.
  • If possible, test with vLLM’s built-in InternVLProcessor to see if the error persists, as this implementation is known to be compatible with vLLM’s batching and token alignment.
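
For the second point, a quick sanity check might look like the sketch below; model_path, prompt_after_replacement, pixel_values, and num_image_token are placeholders for your own variables (for InternVL2 with 448x448 tiles, num_image_token is typically 256).

# Hedged sanity check: the number of <IMG_CONTEXT> tokens in the prompt should
# equal num_patches * num_image_token. Variable names are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
img_context_id = tokenizer.convert_tokens_to_ids("<IMG_CONTEXT>")

prompt_ids = tokenizer(prompt_after_replacement)["input_ids"]
num_context_tokens = sum(1 for t in prompt_ids if t == img_context_id)

# pixel_values: (num_patches, 3, H, W) tensor from your processor;
# num_image_token: context tokens emitted per patch.
expected = pixel_values.shape[0] * num_image_token
assert num_context_tokens == expected, (num_context_tokens, expected)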

If you can share a minimal code snippet that reproduces the error (especially the processor logic and how prompts are constructed), it would help pinpoint the root cause. Would you like a step-by-step example of correct token replacement for InternVL in vLLM, or more FSDP-specific debugging tips?

Hello @ywang96
Thank you so much for taking the time to answer my questions and look at my issue. I truly appreciate your insights and clarifications.

  1. Thank you; this really helps me better understand vllm’s InternVL setup.

  2. Absolutely. For my vLLM + FSDP pipeline, please refer to GitHub - MikeWangWZHL/PAPO: Official repo for "PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning" , which is built off of EasyR1. Specifically, it uses Ray and FSDP for parallel training and vLLM for efficient rollouts.

I see. Yes, I am also puzzled as to why this issue is occurring. In the PAPO code, you will notice that a processor is used to filter overlong prompts in PAPO/verl/utils/dataset.py, to initialize the RayPPO trainer in PAPO/verl/trainer/ray_trainer.py, and to get the logit bias in PAPO/verl/workers/rollout/vllm_rollout_spmd.py. (Apologies for not providing direct links; as a new user I can only give a maximum of 2.)

I have played with 3 different processors: one that I implemented from scratch, inspired by the Qwen2Processor (the processor EasyR1 uses with Qwen; I processed the images directly using dynamic patching as implemented in InternVL’s documentation, with the tokenizer loaded from AutoTokenizer); a processor that inherited from the transformers InternVLProcessor but corrected the token replacement code to, similar to vLLM, expect <image> rather than <IMG_CONTEXT> (note: transformers created a chat template that converts <image> to <IMG_CONTEXT>, rather than leaving it alone like InternVL’s original chat template and the one vLLM uses do); and vLLM’s own InternVLProcessor.

What I noticed is that when I do not implement the correct token replacement scheme (meaning my processor does not perform any token replacement), everything runs well. But when I try to do the token replacement with my processor (<image> → <IMG_CONTEXT> * X), I get the runtime error.

Your observation is correct: when your processor does not perform token replacement, everything runs, but when you implement token replacement (<image> → <IMG_CONTEXT> * X), you get a runtime error. This strongly suggests that the issue is with the alignment between your prompt token replacement logic and the model’s expected input format, especially under FSDP and distributed settings.

In vLLM, the number and position of multimodal placeholder tokens (like <IMG_CONTEXT>) must exactly match the number and order of image features expected by the model. If your processor’s token replacement logic does not produce this exact alignment, it can cause shape mismatches or uninitialized tensors, which FSDP is sensitive to. This is especially true for custom normalization layers or models with strict multimodal token alignment (see vLLM multimodal design). Using vLLM’s built-in InternVLProcessor is recommended, as it is designed to ensure this alignment and is known to be compatible with vLLM’s batching and token management (source).

To debug:

  • Double-check that your processor’s token replacement produces the exact number and order of <IMG_CONTEXT> tokens required by the model for each image.
  • Ensure that all model parameters are initialized before FSDP wrapping, and that no tensors are lazily allocated after FSDP is applied.
  • If possible, use vLLM’s InternVLProcessor directly to see if the error persists.

Would you like a step-by-step example of correct token replacement for InternVL in vLLM, or more FSDP-specific debugging tips?

I am currently using InternVLProcessor from vllm.

If you are using vLLM’s built-in InternVLProcessor and still encountering runtime errors during FSDP training or inference, the issue is likely not with the processor’s token replacement logic itself. vLLM’s InternVLProcessor is designed to ensure correct alignment between prompt tokens and image features, and is compatible with vLLM’s batching and memory management (source).

If the error persists, it may be due to:

  • A mismatch between the number of image tokens and the model’s expected input (e.g., prompt or image size).
  • An FSDP-related issue, such as parameters being lazily initialized or accessed before allocation, especially in custom layers or when using torch.compile/tracing (see PyTorch custom ops docs).
  • Integration issues between Ray, FSDP, and vLLM, particularly if model parameters are not fully initialized before FSDP wrapping.

Would you like step-by-step debugging tips for FSDP with vLLM, or a checklist to verify your input pipeline?

1. The input is the same input as described in the InternVL documentation, of the form [text]….
2. How would I know whether model parameters were fully initialized?
3. Yes, please provide debugging tips and a checklist.

  1. Input Format: If your input strictly follows the InternVL documentation (e.g., [text]<IMG_CONTEXT>... with the correct number of image tokens for each image), and you use vLLM’s InternVLProcessor, then input format is likely not the issue—unless the number of <IMG_CONTEXT> tokens does not match the number of image patches/features expected by the model (see vLLM InternVLProcessor source).

  2. Parameter Initialization: To ensure model parameters are fully initialized before FSDP wrapping, instantiate your model and call its .to(device) or .cuda() method before applying FSDP. All model weights (including custom layers) should be allocated and not lazily created after FSDP is applied. If you use custom layers or dynamic module creation, make sure these are constructed before FSDP wrapping (see PyTorch FSDP docs).

  3. Debugging Tips & Checklist for FSDP with vLLM (a minimal initialization sketch follows the checklist):

    • a. Input Pipeline:
      • Confirm the number of <IMG_CONTEXT> tokens in the prompt matches the number of image features/patches.
      • Ensure all image preprocessing (patching, resizing) is consistent with model expectations.
    • b. Model Initialization:
      • Instantiate the model and move it to the correct device before FSDP wrapping.
      • Avoid creating new parameters or layers after FSDP is applied.
      • If using torch.compile or tracing, ensure all custom ops are registered as opaque (see PyTorch custom ops docs).
    • c. FSDP Wrapping:
      • Wrap the model with FSDP only after all parameters are initialized.
      • Use FSDP’s model = FSDP(model) pattern after model construction.
    • d. Integration:
      • If using Ray, ensure each worker initializes the model and FSDP correctly and independently.
      • Check for any warnings or errors about parameter allocation or device mismatch in logs.
    • e. Runtime Checks:
      • If you see errors about unallocated tensors, add print/debug statements after model instantiation and after FSDP wrapping to check parameter shapes and device allocation.
      • Use for name, param in model.named_parameters(): print(name, param.device, param.shape, param.is_leaf) to verify all parameters are allocated and on the correct device.
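
For reference, here is a minimal sketch of the initialization order from (b) and (c), assuming a plain PyTorch model and an already-initialized process group; PAPO/EasyR1 wire this up inside Ray workers, but the ordering constraint is the same.

# Hedged sketch: materialize the model first, then wrap with FSDP.
# Assumes torch.distributed.init_process_group() has already been called.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def build_fsdp_model(model_factory, device: torch.device) -> FSDP:
    model = model_factory()               # 1. all parameters/buffers materialized here
    model = model.to(device)              # 2. move to the target device before wrapping
    return FSDP(model, device_id=device)  # 3. wrap last; create no new parameters afterwards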

Would you like more troubleshooting steps for a specific error message?

What do you mean by “lazily created”? How would we be able to see whether this is indeed the case? Also, do you know of any reason why specific data may be freed? Someone told me that this was PyTorch’s way of getting rid of large data when it thought it didn’t need it anymore, so I tried setting gradient_checkpointing to False. However, the error persists. Can you give more intuition? Again, I am new to parallel training, so I would appreciate specificity and detail.

It could be possible that “lazily created” means parameters or buffers are not actually allocated in memory until they are first used (e.g., inside a forward pass or when a method is called), rather than being allocated at model construction time. In PyTorch, this can happen if you define parameters inside a method or use certain custom layers that delay allocation. To check for this, you might print out all model parameters and their .data or .device attributes right after model instantiation and before/after FSDP wrapping. If any parameter’s .data_ptr() is zero or you get an error accessing .data, it might not be allocated yet (see PyTorch FSDP docs).

Though I’m not entirely certain, it might be that PyTorch sometimes frees data (e.g., parameter storage) if it thinks it’s not needed, especially under memory pressure or with features like gradient checkpointing, activation checkpointing, or when using torch.compile/tracing. However, disabling gradient checkpointing does not always prevent this, as FSDP and PyTorch’s internal memory management may still deallocate tensors if they are not referenced or if the computation graph is broken. If you see errors about unallocated or freed tensors, it could be due to a mismatch between model usage and FSDP’s expectations, or due to custom ops/layers not being properly registered or initialized.

Would you like a code snippet to check parameter allocation, or more detail on PyTorch’s memory management with FSDP?

That would be very helpful; thank you.

It seems likely that you want to check whether all model parameters are allocated before FSDP wrapping. Here’s a simple code snippet to help you verify parameter allocation and device placement in PyTorch:

for name, param in model.named_parameters():
    print(f"{name}: device={param.device}, shape={param.shape}, allocated={param.data_ptr() != 0}")

If any parameter shows allocated=False (i.e., param.data_ptr() == 0), it means the parameter is not yet allocated in memory—this could cause FSDP errors. You should run this check right after model instantiation and before applying FSDP. After FSDP wrapping, you can repeat the check to ensure all parameters remain allocated and on the correct device.

For more detail on PyTorch’s memory management with FSDP: FSDP expects all parameters to be allocated and on the correct device before wrapping. If parameters are created lazily (e.g., inside a forward pass or via custom layers), or if memory is freed due to checkpointing or tracing, FSDP may encounter errors when trying to access them. Always ensure model construction is complete and all parameters are materialized before FSDP is applied (see PyTorch FSDP docs).

Would you like more detail on debugging FSDP or PyTorch memory management?

Thanks for getting back to me! I’m not too familiar with this project but I can take a look. I can also ask hiyouga about this model, since we briefly talked back when EasyR1 started using vLLM for multimodal RL training.

If it makes it easier, I strongly recommend you join our slack channel via slack.vllm.ai so it’s easier to discuss what problems you’re running into!
