Question about profile_run

I have some questions about profile_run. As I understand it, profile_run performs a naive inference with max_num_batched_tokens tokens to measure the peak memory used during inference.
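
Roughly, my mental model of it is the sketch below. This is not vLLM's actual code, just the idea as I understand it, assuming a plain PyTorch causal LM whose forward takes a batch of token IDs:

```python
import torch

def naive_profile_run(model, max_num_batched_tokens: int, vocab_size: int) -> int:
    """Idea only: run one forward pass over a dummy batch of
    max_num_batched_tokens tokens and record the peak memory it needs."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    # One dummy "request" whose length equals the whole token budget.
    dummy_input = torch.randint(
        0, vocab_size, (1, max_num_batched_tokens), device="cuda"
    )
    with torch.inference_mode():
        model(dummy_input)

    # Whatever is left above this peak is treated as available for the KV cache.
    return torch.cuda.max_memory_allocated()
```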

I wonder why profile_run doesn't need to account for the possibly already-computed tokens.
If chunked prefill is enabled, the input to a step may look like [computed_tokens + max_num_batched_tokens new tokens].
Let's say N = max_num_batched_tokens and we have a request containing 2*N input tokens.
The first chunked prefill step processes the first N tokens; the second step processes the last N tokens while the sequence already has N computed tokens.
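
As a toy illustration of the scheduling I have in mind (not vLLM's scheduler, and N = 4096 is just a hypothetical value):

```python
def chunk_schedule(prompt_len: int, max_num_batched_tokens: int) -> list[dict]:
    """Split a long prompt into chunked-prefill steps; each step sees the
    tokens computed so far plus at most max_num_batched_tokens new ones."""
    computed, steps = 0, []
    while computed < prompt_len:
        new_tokens = min(max_num_batched_tokens, prompt_len - computed)
        steps.append({"computed_tokens": computed, "new_tokens": new_tokens})
        computed += new_tokens
    return steps

N = 4096  # hypothetical max_num_batched_tokens
print(chunk_schedule(prompt_len=2 * N, max_num_batched_tokens=N))
# [{'computed_tokens': 0, 'new_tokens': 4096},
#  {'computed_tokens': 4096, 'new_tokens': 4096}]
```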

So it seems profile_run only covers a situation like the first chunked prefill step, but won't the second chunked prefill step use more memory because of the longer sequence length?
If so, could vLLM OOM when chunked prefill is enabled with a small max_num_batched_tokens and a large max_model_len?

You are correct: when chunked prefill is enabled, the second (and later) chunked prefill steps process both the previously computed tokens (up to max_model_len) and a new chunk of max_num_batched_tokens tokens. However, the profile_run in vLLM only simulates a single batch of max_num_batched_tokens tokens, not the scenario where a sequence has both a long prefix of computed tokens and a new chunk, which could underestimate peak memory usage in some cases. This means that if max_model_len is much larger than max_num_batched_tokens, and chunked prefill is enabled, there is a risk of OOM during actual inference, since the memory required for the full sequence (computed + new tokens) may exceed what was profiled and reserved during profile_run (source, source).
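
As a rough back-of-the-envelope illustration (the numbers and the per-layer term below are hypothetical; they assume a naively materialized attention-score matrix, which kernels like FlashAttention avoid, so the real gap depends on the attention backend), here is why a step with N computed tokens plus N new tokens can have a larger footprint than the N-token step that profile_run measures:

```python
def attn_score_bytes(query_len: int, kv_len: int,
                     num_heads: int = 32, dtype_bytes: int = 2) -> int:
    """Per-layer size of a naively materialized attention-score matrix,
    one example of a term that grows with the number of computed tokens."""
    return num_heads * query_len * kv_len * dtype_bytes

N = 4096  # hypothetical max_num_batched_tokens
profiled = attn_score_bytes(query_len=N, kv_len=N)          # what profile_run sees
second_chunk = attn_score_bytes(query_len=N, kv_len=2 * N)  # N computed + N new
print(profiled // 2**20, second_chunk // 2**20)  # 1024 MiB vs 2048 MiB per layer
```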

This limitation is acknowledged in the community, and the current implementation does not fully account for the memory usage of long sequences with chunked prefill. As a result, users should be cautious when setting a small max_num_batched_tokens and a large max_model_len together, as this can indeed lead to OOM errors in practice (source). Would you like a step-by-step breakdown of how memory usage can exceed the profiled amount in this scenario?

Could you show me the previous discussions of this limitation in the community?

Yes. This limitation has been discussed in the vLLM community. For example, in this GitHub issue, a user observed that the number of allocated GPU blocks (KV cache) depends on max_model_len, and the explanation was that during profile_run, vLLM profiles memory usage with max_num_batched_tokens (defaulting to max_model_len) to ensure at least one sequence of max_model_len tokens can be processed. This means that if max_model_len is much larger than max_num_batched_tokens, the actual memory usage during chunked prefill with long sequences may exceed what was profiled, potentially causing OOM.

Additionally, in this issue, it was noted that chunked prefill and prefix caching together can lead to unexpected memory usage, and the interaction between these features and memory profiling is still being improved.

Would you like more details or direct quotes from these discussions?

I don't think these issues are related to my questions. So what is vLLM's plan to deal with this limitation that may cause OOM?

Currently, vLLM does not have a built-in mechanism to fully prevent OOM in this scenario; users must manually set parameters like max_model_len and max_num_batched_tokens to avoid OOM (source). There is ongoing discussion and design work to address this, such as proposals for “chunked logits processing” and improvements to memory profiling, but no finalized solution has been merged yet (source).

For now, the recommended workaround is to use --enable-chunked-prefill and set --max-num-batched-tokens to match --max-num-seqs, which can help avoid OOM but may reduce prefill speed. The vLLM team acknowledges this as a significant issue and plans to revisit it after current performance optimizations are complete.
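
For illustration only, a configuration in that spirit could look like the sketch below; the model name and numbers are placeholders, and the keyword arguments are the offline-API counterparts of the CLI flags above:

```python
from vllm import LLM

# Placeholder values mirroring the workaround: chunked prefill enabled and
# max_num_batched_tokens kept equal to max_num_seqs (slower prefill, as noted above).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,
    max_model_len=32768,
    max_num_seqs=256,
    max_num_batched_tokens=256,
    gpu_memory_utilization=0.90,
)
```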

I still don't think the sources above are related to my questions. Does the vLLM team have any roadmap item or milestone to address this?

There is no explicit roadmap item or milestone in the official vLLM roadmaps that directly addresses improving profile_run to account for the combined memory usage of computed tokens and new tokens during chunked prefill, or to fully prevent OOM in this scenario. The current roadmaps focus on performance optimizations, chunked prefill (which is now always enabled), and memory management improvements, but do not mention a specific plan to fix this profiling limitation (Q1 2025 Roadmap, Q4 2024 Roadmap).

If this limitation is critical for your use case, the vLLM team encourages users to open a feature request or comment on the roadmap issues to raise its priority.
