I’m trying to benchmark the CPU/host overhead of vLLM serving independent of the model’s compute. In other words, I’d like to keep the normal serving pipeline (HTTP, tokenization, batching/scheduling, queuing, KV-cache mgmt, etc.) but skip the actual model inference (no matmuls, no decode compute) so I can measure TTFT/ITL that come purely from host-side work.
Questions:
Is there a built-in flag or mode that bypasses the model forward pass?
If not, what’s the recommended hook to implement a no-op runner that returns dummy logits but preserves engine contracts? (e.g., overriding CPUModelRunner.execute_model / forward, or another extension point?)
Any other caveats I should worry about when the model is bypassed?
Ideally, I want to:
Keep vllm serve running normally and drive it with vllm bench serve.
Return immediately from the model execution path so TTFT reflects only host overhead.
There is no built-in flag or mode in vLLM to bypass the model forward pass and return dummy logits for benchmarking host-side overhead only. However, the recommended approach is to implement a custom model that overrides the forward or compute_logits method to return dummy outputs, as shown in the vLLM test plugins. For example, you can subclass an existing model (e.g., OPTForCausalLM) and override compute_logits to return a fixed tensor, as demonstrated in my_opt.py. This preserves the engine contracts and allows the rest of the serving pipeline (HTTP, batching, KV-cache, etc.) to run normally.
To use this, register your dummy model, launch vllm serve as usual, and benchmark with vllm bench serve. One caveat: overriding only compute_logits fixes the sampled token, but the transformer forward pass still runs, so TTFT/ITL will still include model compute; to measure host-side overhead alone you also need to short-circuit forward (see the sketch at the end of this thread). Beyond that, make sure your dummy outputs have the shape, dtype, and device the engine expects to avoid downstream errors. Overriding CPUModelRunner or GPUModelRunner is not necessary if you override the model's forward/compute_logits as shown in the plugin example. Would you like a step-by-step code example for this setup?
Certainly! Here’s a step-by-step example for creating a no-op/dummy model in vLLM that bypasses the actual model computation but preserves the serving pipeline:
Subclass an Existing Model and Override compute_logits
Create a Python file (e.g., my_dummy_model.py) and subclass a vLLM model (e.g., LlavaForConditionalGeneration). Override the compute_logits method to return a fixed tensor, as in the official plugin example:
import torch

from vllm.model_executor.models.llava import (
    LlavaDummyInputsBuilder,
    LlavaForConditionalGeneration,
    LlavaMultiModalProcessor,
    LlavaProcessingInfo,
)
from vllm.multimodal import MULTIMODAL_REGISTRY


@MULTIMODAL_REGISTRY.register_processor(
    LlavaMultiModalProcessor,
    info=LlavaProcessingInfo,
    dummy_inputs=LlavaDummyInputsBuilder,
)
class MyLlava(LlavaForConditionalGeneration):

    def compute_logits(self, hidden_states: torch.Tensor) -> torch.Tensor | None:
        # Always predict the first token
        logits = super().compute_logits(hidden_states)
        if logits is not None:
            logits.zero_()
            logits[:, 0] += 1.0
        return logits
Register the Model
Make sure your model is registered with vLLM’s model registry so it can be loaded by name.
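A hedged sketch of this step (the module, package, and function names here are illustrative): ModelRegistry.register_model is the documented API for out-of-tree models, and when using vllm serve the registration usually lives in a small plugin package exposed through the vllm.general_plugins entry point, so that every engine/worker process runs it.

# my_dummy_plugin.py (illustrative name)
from vllm import ModelRegistry


def register_dummy_models() -> None:
    # Import lazily so the plugin stays cheap to load.
    from my_dummy_model import MyLlava  # the class from the previous step

    # "MyLlava" is the architecture name vLLM will resolve, e.g. via
    # --hf-overrides '{"architectures": ["MyLlava"]}' at serve time.
    ModelRegistry.register_model("MyLlava", MyLlava)

The register_dummy_models function is then listed under the vllm.general_plugins entry-point group in your package metadata; see the vLLM plugin-system docs for the exact packaging.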
Serve the Model
Start the vLLM server with your dummy model:
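For example (hedged: this assumes a LLaVA checkpoint such as llava-hf/llava-1.5-7b-hf, and that the --hf-overrides flag in your vLLM version accepts inline JSON to map the architecture to the registered MyLlava class):

vllm serve llava-hf/llava-1.5-7b-hf \
    --hf-overrides '{"architectures": ["MyLlava"]}'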
Benchmark
Use vllm bench serve to drive the server and measure TTFT/ITL.
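For example (hedged: these flag names follow the standard vllm bench serve options and may differ slightly between versions):

vllm bench serve \
    --model llava-hf/llava-1.5-7b-hf \
    --dataset-name random \
    --random-input-len 512 \
    --random-output-len 128 \
    --num-prompts 200 \
    --request-rate 8

The printed summary includes mean/median/p99 TTFT, TPOT, and ITL.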
This exercises the full vLLM serving pipeline while fixing the sampled output; to actually remove the per-step model compute from those measurements, the forward pass also needs to be short-circuited (more on that at the end of this thread). Let me know if you need a more detailed setup or help with a different model class.
Yes, you can create a dummy version of the Llama-3.1-8B-Instruct model by subclassing LlamaForCausalLM and overriding its compute_logits method, similar to the OPT and Llava dummy examples. Here’s a minimal example:
import torch

from vllm.model_executor.models.llama import LlamaForCausalLM


class DummyLlamaForCausalLM(LlamaForCausalLM):

    def compute_logits(self, hidden_states: torch.Tensor) -> torch.Tensor | None:
        logits = super().compute_logits(hidden_states)
        if logits is not None:
            logits.zero_()
            logits[:, 0] += 1.0  # Always predict the first token
        return logits
You would then register this class in your plugin or test setup (see the registration sketch above) and launch vLLM with this model. This preserves the full vLLM serving pipeline and makes the output deterministic, as in the dummy model plugin example; to also skip the transformer compute itself, override forward as sketched below. Would you like more detail on registration or usage?
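If the goal is to exclude the matmuls themselves from TTFT/ITL rather than just fix the sampled token, the forward pass has to be short-circuited as well. Below is a minimal, untested sketch; it assumes the decoder-model interface where forward(input_ids, positions, intermediate_tensors=None, inputs_embeds=None) returns a hidden-states tensor of shape [num_tokens, hidden_size], so check the LlamaForCausalLM signature in the vLLM version you run, and note that pipeline parallelism, torch.compile, and CUDA-graph capture may need extra handling (serving with --enforce-eager sidesteps graph capture).

import torch

from vllm.model_executor.models.llama import LlamaForCausalLM


class NoComputeLlamaForCausalLM(LlamaForCausalLM):
    """Hypothetical no-op variant: skips the transformer layers entirely."""

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        intermediate_tensors=None,
        inputs_embeds: torch.Tensor | None = None,
    ) -> torch.Tensor:
        # Return zero hidden states of the expected shape/dtype/device instead
        # of running the model (single-rank case, no pipeline parallelism).
        param = next(self.parameters())
        return torch.zeros(
            (input_ids.shape[0], self.config.hidden_size),
            dtype=param.dtype,
            device=param.device,
        )

    def compute_logits(self, hidden_states: torch.Tensor) -> torch.Tensor | None:
        # Same dummy-logits trick as above so sampling still yields token 0.
        logits = super().compute_logits(hidden_states)
        if logits is not None:
            logits.zero_()
            logits[:, 0] += 1.0
        return logits

Model weights are still loaded and KV-cache blocks are still allocated, so startup time and memory footprint should be unchanged; only the per-step model compute is removed, which is what lets TTFT/ITL isolate the host-side work.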