I’m trying to benchmark the CPU/host overhead of vLLM serving independent of the model’s compute. In other words, I’d like to keep the normal serving pipeline (HTTP, tokenization, batching/scheduling, queuing, KV-cache mgmt, etc.) but skip the actual model inference (no matmuls, no decode compute) so I can measure TTFT/ITL that come purely from host-side work.
Questions:
Is there a built-in flag or mode that bypasses the model forward pass?
If not, what’s the recommended hook to implement a no-op runner that returns dummy logits but preserves engine contracts? (e.g., overriding CPUModelRunner.execute_model / forward, or another extension point?)
Any other caveats I should worry about when the model is bypassed?
Ideally, I want to:
Keep vllm serve running normally and drive it with vllm bench serve.
Return immediately from the model execution path so TTFT reflects only host overhead.
There is no built-in flag or mode in vLLM to bypass the model forward pass and return dummy logits for benchmarking host-side overhead only. However, the recommended approach is to implement a custom model that overrides the forward or compute_logits method to return dummy outputs, as shown in the vLLM test plugins. For example, you can subclass an existing model (e.g., OPTForCausalLM) and override compute_logits to return a fixed tensor, as demonstrated in my_opt.py. This preserves the engine contracts and allows the rest of the serving pipeline (HTTP, batching, KV-cache, etc.) to run normally.
To use this, register your dummy model, launch vllm serve as usual, and benchmark with vllm bench serve. One caveat: overriding only compute_logits fixes the sampled token, but the transformer forward pass still runs, so TTFT/ITL will still include model compute; to measure host-side overhead alone you also need to short-circuit forward (see the sketch at the end of this thread). Beyond that, make sure your dummy outputs have the shape, dtype, and device the engine expects to avoid downstream errors. Overriding CPUModelRunner or GPUModelRunner is not necessary if you override the model's forward/compute_logits as shown in the plugin example. Would you like a step-by-step code example for this setup?
Certainly! Here’s a step-by-step example for creating a no-op/dummy model in vLLM that bypasses the actual model computation but preserves the serving pipeline:
Subclass an Existing Model and Override compute_logits
Create a Python file (e.g., my_dummy_model.py) and subclass a vLLM model (e.g., LlavaForConditionalGeneration). Override the compute_logits method to return a fixed tensor, as in the official plugin example:
import torch

from vllm.model_executor.models.llava import (
    LlavaDummyInputsBuilder,
    LlavaForConditionalGeneration,
    LlavaMultiModalProcessor,
    LlavaProcessingInfo,
)
from vllm.multimodal import MULTIMODAL_REGISTRY


@MULTIMODAL_REGISTRY.register_processor(
    LlavaMultiModalProcessor,
    info=LlavaProcessingInfo,
    dummy_inputs=LlavaDummyInputsBuilder,
)
class MyLlava(LlavaForConditionalGeneration):

    def compute_logits(self, hidden_states: torch.Tensor) -> torch.Tensor | None:
        # Always predict the first token
        logits = super().compute_logits(hidden_states)
        if logits is not None:
            logits.zero_()
            logits[:, 0] += 1.0
        return logits
Register the Model
Make sure your model is registered with vLLM’s model registry so it can be loaded by name.
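A hedged sketch of this step (the module, package, and function names here are illustrative): ModelRegistry.register_model is the documented API for out-of-tree models, and when using vllm serve the registration usually lives in a small plugin package exposed through the vllm.general_plugins entry point, so that every engine/worker process runs it.

# my_dummy_plugin.py (illustrative name)
from vllm import ModelRegistry


def register_dummy_models() -> None:
    # Import lazily so the plugin stays cheap to load.
    from my_dummy_model import MyLlava  # the class from the previous step

    # "MyLlava" is the architecture name vLLM will resolve, e.g. via
    # --hf-overrides '{"architectures": ["MyLlava"]}' at serve time.
    ModelRegistry.register_model("MyLlava", MyLlava)

The register_dummy_models function is then listed under the vllm.general_plugins entry-point group in your package metadata; see the vLLM plugin-system docs for the exact packaging.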
Serve the Model
Start the vLLM server with your dummy model:
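For example (hedged: this assumes a LLaVA checkpoint such as llava-hf/llava-1.5-7b-hf, and that the --hf-overrides flag in your vLLM version accepts inline JSON to map the architecture to the registered MyLlava class):

vllm serve llava-hf/llava-1.5-7b-hf \
    --hf-overrides '{"architectures": ["MyLlava"]}'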
Benchmark
Use vllm bench serve to drive the server and measure TTFT/ITL.
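For example (hedged: these flag names follow the standard vllm bench serve options and may differ slightly between versions):

vllm bench serve \
    --model llava-hf/llava-1.5-7b-hf \
    --dataset-name random \
    --random-input-len 512 \
    --random-output-len 128 \
    --num-prompts 200 \
    --request-rate 8

The printed summary includes mean/median/p99 TTFT, TPOT, and ITL.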
This exercises the full vLLM serving pipeline while fixing the sampled output; to actually remove the per-step model compute from those measurements, the forward pass also needs to be short-circuited (more on that at the end of this thread). Let me know if you need a more detailed setup or help with a different model class.
Yes, you can create a dummy version of the Llama-3.1-8B-Instruct model by subclassing LlamaForCausalLM and overriding its compute_logits method, similar to the OPT and Llava dummy examples. Here’s a minimal example:
import torch

from vllm.model_executor.models.llama import LlamaForCausalLM


class DummyLlamaForCausalLM(LlamaForCausalLM):

    def compute_logits(self, hidden_states: torch.Tensor) -> torch.Tensor | None:
        logits = super().compute_logits(hidden_states)
        if logits is not None:
            logits.zero_()
            logits[:, 0] += 1.0  # Always predict the first token
        return logits
You would then register this class in your plugin or test setup (see the registration sketch above) and launch vLLM with this model. This preserves the full vLLM serving pipeline and makes the output deterministic, as in the dummy model plugin example; to also skip the transformer compute itself, override forward as sketched below. Would you like more detail on registration or usage?
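If the goal is to exclude the matmuls themselves from TTFT/ITL rather than just fix the sampled token, the forward pass has to be short-circuited as well. Below is a minimal, untested sketch; it assumes the decoder-model interface where forward(input_ids, positions, intermediate_tensors=None, inputs_embeds=None) returns a hidden-states tensor of shape [num_tokens, hidden_size], so check the LlamaForCausalLM signature in the vLLM version you run, and note that pipeline parallelism, torch.compile, and CUDA-graph capture may need extra handling (serving with --enforce-eager sidesteps graph capture).

import torch

from vllm.model_executor.models.llama import LlamaForCausalLM


class NoComputeLlamaForCausalLM(LlamaForCausalLM):
    """Hypothetical no-op variant: skips the transformer layers entirely."""

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        intermediate_tensors=None,
        inputs_embeds: torch.Tensor | None = None,
    ) -> torch.Tensor:
        # Return zero hidden states of the expected shape/dtype/device instead
        # of running the model (single-rank case, no pipeline parallelism).
        param = next(self.parameters())
        return torch.zeros(
            (input_ids.shape[0], self.config.hidden_size),
            dtype=param.dtype,
            device=param.device,
        )

    def compute_logits(self, hidden_states: torch.Tensor) -> torch.Tensor | None:
        # Same dummy-logits trick as above so sampling still yields token 0.
        logits = super().compute_logits(hidden_states)
        if logits is not None:
            logits.zero_()
            logits[:, 0] += 1.0
        return logits

Model weights are still loaded and KV-cache blocks are still allocated, so startup time and memory footprint should be unchanged; only the per-step model compute is removed, which is what lets TTFT/ITL isolate the host-side work.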