vLLM Engine Metrics

Hi, I'm using vLLM 0.6.3 for RL training. I want to benchmark the token-generation performance of the model with vLLM. I have set up $PROMETHEUS_MULTIPROC_DIR and I do see data files like counter_2771563.db, counter_2771565.db, gauge_all_2771563.db, gauge_all_2771565.db, histogram_2771563.db, histogram_2771565.db.
I executed:
curl -i http://127.0.0.1:8000/metrics
HTTP/1.1 404 Not Found
date: Wed, 11 Jun 2025 19:27:00 GMT
server: uvicorn
content-length: 22
content-type: application/json
Response received:
{"detail":"Not Found"}

I also set up a Ray Prometheus dashboard and am fetching metrics from the localhost:9090/metrics endpoint, but I'm not observing any vLLM-related metrics. Can you share documentation on the changes needed to see the vLLM metrics?

To expose vLLM metrics at the /metrics endpoint, you must run the vLLM OpenAI-compatible API server (e.g., with vllm serve or python -m vllm.entrypoints.openai.api_server). If you are running vLLM inside Ray Serve or another custom setup, you need to manually expose the /metrics endpoint and ensure the PrometheusStatLogger is active. For Ray Serve, you must add the RayPrometheusStatLogger to the engine using add_logger, as the default PrometheusStatLogger is not automatically used in Ray deployments. The /metrics endpoint is not available if you are not running the API server directly or have not set up the logger correctly. See this Slack thread and official docs for details.

If you see a 404 on /metrics, it means the endpoint is not being served—likely because the API server is not running, or the metrics middleware is not mounted. For Ray Serve, see the RayPrometheusStatLogger API reference and ensure you add the logger and expose the endpoint in your Ray deployment. The Prometheus dashboard at :9090/metrics will not show vLLM metrics unless the vLLM server exposes them at /metrics and Prometheus is configured to scrape that endpoint.

Would you like a step-by-step example for Ray Serve or for the standard vLLM API server?


Yes, a step-by-step example of enabling the metrics endpoint within Ray and configuring Prometheus to scrape the endpoint would be helpful.

I also see the .db files created under PROMETHEUS_MULTIPROC_DIR. Is this expected? What kind of DB is it? How can I parse it?

It seems likely that the .db files under PROMETHEUS_MULTIPROC_DIR are expected when using Prometheus multiprocess mode with vLLM. These files are created by the prometheus_client library to store metric data from each process, and their names (e.g., counter_*.db, gauge_all_*.db, histogram_*.db) correspond to different metric types. The format is internal to prometheus_client and not intended for direct parsing; you should access metrics via the /metrics HTTP endpoint, not by reading these files directly. For more, see the Prometheus multiprocess mode docs, which vLLM relies on as described in vLLM metrics design.
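
If you do want to see what those files add up to, the supported way is to let prometheus_client itself do the merging. Here is a minimal sketch, assuming PROMETHEUS_MULTIPROC_DIR is exported and already points at the directory containing the .db files; this is essentially the same aggregation that backs the /metrics endpoint:

    import os

    from prometheus_client import CollectorRegistry, generate_latest
    from prometheus_client import multiprocess

    # MultiProcessCollector reads the per-process counter_*.db / gauge_*.db /
    # histogram_*.db files from PROMETHEUS_MULTIPROC_DIR and merges them into
    # a single registry.
    assert "PROMETHEUS_MULTIPROC_DIR" in os.environ

    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)

    # Render the merged metrics in the Prometheus text exposition format.
    print(generate_latest(registry).decode())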

Would you like more detail on how Prometheus aggregates these files or how vLLM interacts with them?


Can I parse them with promtool?

No, you cannot parse the .db files directly with promtool. These files are internal to the prometheus_client Python library and are not in the Prometheus TSDB or text exposition format; promtool only works with Prometheus scrape targets or TSDB blocks, not these .db files. You should always access metrics via the /metrics HTTP endpoint, as described in the Prometheus multiprocess mode docs and vLLM metrics documentation.
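
If you want to inspect the scraped values programmatically instead, the text served at /metrics can be parsed with prometheus_client's own parser. A minimal sketch, assuming the endpoint is reachable at http://localhost:8000/metrics (adjust host and port to your deployment):

    import urllib.request

    from prometheus_client.parser import text_string_to_metric_families

    # Fetch the Prometheus text exposition output from the /metrics endpoint.
    raw = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()

    for family in text_string_to_metric_families(raw):
        # vLLM's own metrics are prefixed with "vllm:".
        if family.name.startswith("vllm:"):
            for sample in family.samples:
                print(sample.name, sample.labels, sample.value)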

Would you like more detail on how to use promtool or how to expose metrics for scraping?


Yes, details on how to expose the metrics would be helpful.

To expose vLLM metrics for Prometheus when running with Ray, you must (1) add the RayPrometheusStatLogger to your engine, and (2) create an HTTP endpoint that serves the metrics in Prometheus format. The .db files are only for internal use; Prometheus scrapes metrics from an HTTP endpoint, not from these files directly. See the vLLM Ray metrics API reference and Slack discussion.

Step-by-step:

  1. Add RayPrometheusStatLogger to the engine:

    from vllm.engine.metrics import RayPrometheusStatLogger
    engine.add_logger("ray", RayPrometheusStatLogger(...))
    

    (Fill in the constructor as needed for your setup; an example sketch follows this list.)

  2. Expose the /metrics endpoint:
    You must create a FastAPI (or similar) HTTP server in your Ray deployment that, when /metrics is called, collects and returns the current metrics from RayPrometheusStatLogger. vLLM does not do this automatically in Ray Serve mode; you must implement this endpoint yourself (see the sketch after this list).

  3. Configure Prometheus to scrape:
    In your prometheus.yml, add:

    scrape_configs:
      - job_name: 'vllm-ray'
        static_configs:
          - targets: ['<host>:<port>']  # The host/port of your /metrics endpoint
    

For a full example and more context, see the Slack thread and vLLM metrics docs.
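
To make steps 1 and 2 concrete, here is a minimal sketch. It fills in the RayPrometheusStatLogger constructor with the same keyword arguments the standard PrometheusStatLogger takes in the snippet quoted near the end of this thread (local_interval, labels, max_model_len), and mounts a /metrics route on a FastAPI app via prometheus_client's ASGI app. The model name, interval value, and engine construction are illustrative assumptions, and since RayPrometheusStatLogger reports through Ray's own metrics system (see below), treat this as a sketch of the endpoint shape rather than a drop-in implementation:

    from fastapi import FastAPI
    from prometheus_client import make_asgi_app

    from vllm import EngineArgs, LLMEngine
    from vllm.engine.metrics import RayPrometheusStatLogger

    # Illustrative engine construction; in a Ray deployment you would build the
    # engine inside your Ray actor or worker instead.
    engine = LLMEngine.from_engine_args(
        EngineArgs(model="mistralai/Mistral-7B-v0.1", max_model_len=2048)
    )

    # Step 1: attach the Ray-aware stat logger. The keyword arguments mirror the
    # PrometheusStatLogger call quoted later in this thread; the values here are
    # assumptions to adjust for your setup.
    engine.add_logger(
        "ray",
        RayPrometheusStatLogger(
            local_interval=5,  # seconds between stat flushes (assumed value)
            labels=dict(model_name="mistralai/Mistral-7B-v0.1"),
            max_model_len=2048,
        ),
    )

    # Step 2: expose /metrics. make_asgi_app() serves whatever is registered in
    # prometheus_client's default registry; RayPrometheusStatLogger itself reports
    # through ray.util.metrics (see below), so in a Ray cluster Prometheus usually
    # scrapes Ray's metrics export as well.
    app = FastAPI()
    app.mount("/metrics", make_asgi_app())

Run the app with uvicorn (for example, uvicorn your_module:app --port 8000, where your_module is whatever file holds this code) and point the scrape config above at that host and port.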

Would you like a more tailored code example for the FastAPI /metrics endpoint?


Do we need to set up PROMETHEUS_MULTIPROC_DIR?

No, you do not need to set up PROMETHEUS_MULTIPROC_DIR when using RayPrometheusStatLogger. That environment variable is only required for the standard PrometheusStatLogger (which uses prometheus_client in multiprocess mode). RayPrometheusStatLogger uses Ray’s own metrics system and does not rely on PROMETHEUS_MULTIPROC_DIR or the prometheus_client multiprocess backend, as confirmed in the vLLM Ray metrics API docs and source code.

Would you like more detail on how RayPrometheusStatLogger works or how to configure it?


Why is PrometheusStatLogger not sufficient, and why do we need RayPrometheusStatLogger?

PrometheusStatLogger is designed for standard (single-process or multiprocessing) vLLM deployments and uses the prometheus_client library, which does not natively support Ray’s distributed metrics collection. RayPrometheusStatLogger is required for Ray-based deployments because it uses Ray’s metrics API (ray.util.metrics), ensuring metrics are correctly aggregated and exposed in Ray clusters. This distinction is documented in the vLLM metrics API and source code.
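
For a sense of what that difference means in practice, here is a small, self-contained sketch of Ray's application-level metrics API (ray.util.metrics), which is what RayPrometheusStatLogger builds on per the above; the metric names and tag values are purely illustrative:

    import ray
    from ray.util.metrics import Counter, Histogram

    ray.init()  # metrics are reported via this node's Ray metrics agent

    # Purely illustrative metrics, not the ones vLLM itself defines.
    request_counter = Counter(
        "demo_requests_total",
        description="Requests processed by a demo worker.",
        tag_keys=("model_name",),
    )
    latency_hist = Histogram(
        "demo_e2e_latency_seconds",
        description="End-to-end request latency of a demo worker.",
        boundaries=[0.1, 0.5, 1.0, 2.5, 5.0, 10.0],
        tag_keys=("model_name",),
    )

    # Recorded values flow to Ray's metrics system, not to prometheus_client's
    # multiprocess .db files, which is why PROMETHEUS_MULTIPROC_DIR is not needed.
    request_counter.inc(1, tags={"model_name": "my-model"})
    latency_hist.observe(0.42, tags={"model_name": "my-model"})

Ray then exposes these on each node's metrics export endpoint, which a Prometheus server can scrape in a Ray cluster.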

Would you like more detail on the technical differences or usage examples?


I'm not using Ray Serve. How can I expose the Prometheus metrics?

If you are not using Ray Serve, you can expose Prometheus metrics by running the vLLM OpenAI-compatible API server (e.g., with vllm serve or python -m vllm.entrypoints.openai.api_server). The server will automatically expose metrics at the /metrics HTTP endpoint, which Prometheus can scrape. This is the standard and recommended way, as described in the official vLLM metrics documentation and Prometheus/Grafana example.

No extra code is needed: just launch the server, then configure Prometheus to scrape http://<host>:8000/metrics.

Would you like a sample Prometheus configuration or a code snippet for launching the server?


Yes, a code snippet would be useful.

To expose Prometheus metrics with vLLM (not using Ray Serve), simply launch the OpenAI-compatible API server. Metrics will be available at http://localhost:8000/metrics by default.

Example code snippet:

vllm serve mistralai/Mistral-7B-v0.1 --max-model-len 2048 --disable-log-requests

Or, running the server module directly:

python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-v0.1 --max-model-len 2048 --disable-log-requests

After starting the server, you can verify metrics are available:

curl http://localhost:8000/metrics

Prometheus can then be configured to scrape this endpoint. For more, see the official example.


How does the Prometheus server use the data from PROMETHEUS_MULTIPROC_DIR? I do see .db files in this directory. Are these generated by vLLM?

Prometheus server does not directly use or read the files in PROMETHEUS_MULTIPROC_DIR. These .db files are generated by the prometheus_client Python library (used by vLLM) when running in multiprocess mode. vLLM and prometheus_client aggregate metrics from these files and expose them via the /metrics HTTP endpoint. Prometheus server then scrapes metrics from this endpoint, not from the .db files themselves. This behavior is described in the Prometheus multiprocess mode documentation and vLLM metrics design.

Would you like more detail on how the aggregation works internally?


I'm using this code in the vLLM engine. Can you confirm whether this is the right setting to enable the vLLM metrics? I'm particularly interested in these metrics:

  • vllm:time_to_first_token_seconds (Histogram)
  • vllm:time_per_output_token_seconds (Histogram)
  • vllm:e2e_request_latency_seconds (Histogram)

Will this code be able to generate the needed metrics?

                    "logging": LoggingStatLogger(local_interval=_LOCAL_LOGGING_INTERVAL_SEC),
                    "prometheus": PrometheusStatLogger(
                        local_interval=_LOCAL_LOGGING_INTERVAL_SEC,
                        labels=dict(model_name=model_config.served_model_name),
                        max_model_len=self.model_config.max_model_len,
                    ),
                }