Logprobs output from vllm bench serve

Hello! When I run the following benchmarking command, the logprobs output is not generated anywhere; I cannot find it in the result file either. This is the case whether or not --use-beam-search is used. According to the documentation, --logprobs is a valid option. Could you please let me know where I can find the logprobs output?

nohup vllm serve Qwen/Qwen2.5-0.5B-Instruct > vllm.log 2>&1 &

vllm bench serve \
	--dataset-name custom \
	--dataset-path dataset.jsonl \
	--num-prompts 1024 \
	--logprobs 1 \
	--model Qwen/Qwen2.5-0.5B-Instruct \
	--save-result --save-detailed \
	--result-filename result.json

Please note that I am aware of a couple of ways to generate logprobs elsewhere. For example, this command works well.

curl -X POST "http://127.0.0.1:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Hello, how are you?",
    "max_tokens": 10,
    "logprobs": 1,
  }'
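
For reference, the logprobs I am after come back inside choices[0].logprobs of the completions response, roughly like this (abridged and illustrative; the exact fields may vary by vLLM version):

{
  "choices": [
    {
      "text": " I'm doing well,",
      "logprobs": {
        "tokens": [" I", "'m"],
        "token_logprobs": [-0.1, -0.4],
        "top_logprobs": [{" I": -0.1}, {"'m": -0.4}],
        "text_offset": [19, 21]
      }
    }
  ]
}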

However, the curl command (or its Python equivalent) does not calculate speed metrics or account for aspects like concurrency. That is why I want to use vllm bench serve, so that I do not have to evaluate each dataset twice.

So, my question is: can I obtain the logprobs output from vllm bench serve? I am fine with any approach that produces both the performance metrics (measured under concurrency) and the logprobs. Thanks!

The --logprobs argument in vllm bench serve configures the number of logprobs to request per output token, but the logprobs themselves are not written to the result file. The benchmarking tool focuses on performance metrics such as throughput and latency and does not save per-token logprobs in the output file. This differs from the OpenAI-compatible API, where logprobs are returned in the response body. According to the documentation and recent issues, there is currently no built-in way to make vllm bench serve save the actual logprob values for each token in the result file, even when --logprobs is set; only performance metrics are saved.

If you need both speed metrics and logprobs, you would need to modify the benchmarking script to save the logprobs from each response, or run your own script that measures latency/concurrency and saves logprobs, since vllm bench serve does not currently support this out of the box. Would you like more detail on how to implement such a modification or an alternative workflow?

How can I do this, please?

"run your own script that measures latency/concurrency and saves logprobs"

Any hints? @RunLLM

You can write a Python script that sends concurrent requests to the vLLM OpenAI-compatible endpoint (using httpx or aiohttp), records the latency of each request, and saves the logprobs from each response. This approach lets you control concurrency and collect both timing and logprobs data, similar to what vllm bench serve does, but with custom output handling.

Would you like a sample script or a step-by-step breakdown of how to implement this?

Hint: Mention @RunLLM in the post for followups.

Please give more details @RunLLM.

Certainly! Here’s a step-by-step breakdown for writing a Python script that sends concurrent requests to a vLLM OpenAI-compatible endpoint, measures latency, and saves logprobs from the response:

1. Use an async HTTP client:
Choose httpx.AsyncClient or aiohttp.ClientSession for concurrency.

2. Prepare your request payload:
Include "logprobs": N in the JSON body to request logprobs from the vLLM server (see example).

3. Implement concurrency:
Use asyncio to launch multiple requests in parallel, controlling concurrency with a semaphore if needed.

4. Measure latency:
Record the time before and after each request to calculate latency.

5. Save results:
Parse the response JSON to extract logprobs and other relevant data, then write to a file.

Example using httpx and asyncio:

import asyncio
import httpx
import time
import json

API_URL = "http://localhost:8000/v1/completions"
MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
PROMPTS = ["Hello, how are you?", "What is AI?", "Tell me a joke."]  # Example prompts

async def send_request(prompt, client, semaphore):
    # Send a single completion request and return its latency and logprobs.
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 10,
        "logprobs": 1
    }
    async with semaphore:
        start = time.perf_counter()
        response = await client.post(API_URL, json=payload)
        latency = time.perf_counter() - start
        data = response.json()
        return {
            "prompt": prompt,
            "latency": latency,
            "logprobs": data.get("choices", [{}])[0].get("logprobs", {})
        }

async def main():
    semaphore = asyncio.Semaphore(10)  # Limit concurrency
    async with httpx.AsyncClient(timeout=None) as client:  # No request timeout; generation can be slow under load
        tasks = [send_request(p, client, semaphore) for p in PROMPTS]
        results = await asyncio.gather(*tasks)
    with open("results.json", "w") as f:
        json.dump(results, f, indent=2)

if __name__ == "__main__":
    asyncio.run(main())

This script sends concurrent requests, measures latency, and saves the logprobs for each prompt. Adjust PROMPTS and the semaphore limit as needed.
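
If you also want aggregate speed metrics similar to what vllm bench serve reports, you can compute them from the collected results. Here is a minimal sketch; the summarize helper and total_wall_time are illustrative names, not part of any vLLM API, and total_wall_time would come from timing the asyncio.gather call in main():

import statistics

def summarize(results, total_wall_time):
    # results: list of dicts produced by send_request above
    latencies = sorted(r["latency"] for r in results)
    return {
        "num_requests": len(results),
        "mean_latency_s": statistics.mean(latencies),
        "p99_latency_s": latencies[int(0.99 * (len(latencies) - 1))],
        "requests_per_second": len(results) / total_wall_time,
    }

# In main(), for example:
#   start = time.perf_counter()
#   results = await asyncio.gather(*tasks)
#   total_wall_time = time.perf_counter() - start
#   print(summarize(results, total_wall_time))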

Would you like more details or a version using aiohttp?
