In vLLM V1, how can I modify the code to directly obtain probs and skip the logprobs calculation?
In vLLM V1, logprobs are returned immediately from the model’s raw output, before any post-processing, and there is currently no built-in option to skip the logprobs calculation and return only probabilities (`probs`) directly. The logprobs calculation is part of the core output pipeline, and support for post-sampling adjustments or direct probability output is not yet available, though it is planned for future updates. You would need to modify the internal code (e.g., in the sampler) to extract and return the softmax probabilities (`probs`) instead of, or in addition to, logprobs, as there is no user-facing API or parameter for this in V1 as of now. See the V1 User Guide section on logprobs and the logprobs API docs.
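If your immediate goal is just to read probabilities for generated tokens (rather than to save the cost of computing logprobs), one workaround that avoids touching vLLM internals is to request logprobs through the public API and exponentiate them in your own code. A minimal sketch; the model name and top-k count are placeholders:

```python
import math

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
# Ask for the top-5 logprobs of every generated token.
params = SamplingParams(max_tokens=8, logprobs=5)

outputs = llm.generate("Hello", params)
for step_logprobs in outputs[0].outputs[0].logprobs:
    # Each step is a dict of token_id -> Logprob; exp() turns a logprob into a prob.
    probs = {tok_id: math.exp(lp.logprob) for tok_id, lp in step_logprobs.items()}
    print(probs)
```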
Would you like more detail on which part of the code to modify for this purpose?
Yes, I want to know where in the V1 code the logprobs calculation is implemented, and where I can modify it to calculate probs instead of logprobs. Please tell me!
In vLLM V1, the logprobs calculation is implemented in the sampler, specifically in the `compute_logprobs` method of the `Sampler` class (see the API docs). The logprobs are computed from the model’s raw logits using `torch.log_softmax`, and this happens before any post-processing like temperature or top-k/top-p is applied.

To modify vLLM V1 to return probabilities (`probs`) instead of logprobs, locate the `compute_logprobs` method in `vllm/v1/sample/sampler.py` and change the calculation from `torch.log_softmax(logits, dim=-1)` to `torch.softmax(logits, dim=-1)`. You will also need to update the output structure to return these probabilities instead of logprobs.
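As a rough illustration, the change could look like the sketch below. This is not a drop-in patch: the exact `compute_logprobs` signature and dtype handling may differ between vLLM versions, and the downstream logprobs processors still expect log-probabilities, so any consumers of this value would also need to be updated.

```python
import torch


class SamplerSketch:
    """Illustrative stand-in for the V1 Sampler in vllm/v1/sample/sampler.py (not the real class)."""

    def compute_logprobs(self, logits: torch.Tensor) -> torch.Tensor:
        # Current V1 behavior: log-probabilities from the raw logits.
        return logits.log_softmax(dim=-1, dtype=torch.float32)

    def compute_probs(self, logits: torch.Tensor) -> torch.Tensor:
        # Modified variant: plain probabilities instead of log-probabilities.
        return torch.softmax(logits, dim=-1, dtype=torch.float32)
```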
Would you like a step-by-step example of how to make this code change?
I saw this code in the V1 sampler, and I found that the Sampler class can specify which tokens to process via the sampling_metadata: SamplingMetadata parameter. How is this used? During generation, I only want to get probs for some tokens.
You are correct: in vLLM V1, the `Sampler` class receives a `sampling_metadata: SamplingMetadata` parameter, which contains information about which tokens to process, including allowed-token masks and other per-token controls (Sampler API, SamplingMetadata API). To get probabilities (`probs`) for only specific tokens during generation, you can use the `allowed_token_ids_mask` field in `SamplingMetadata` to mask out unwanted tokens, so the softmax is only computed over the allowed tokens.

However, there is no built-in user-facing API to return `probs` for arbitrary token subsets; you would need to customize the sampler code to extract and return the probabilities for your desired tokens after applying the mask. This involves modifying the `Sampler.forward` method to gather and output `probs` for the indices specified by your mask in `sampling_metadata.allowed_token_ids_mask`.
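Here is a minimal sketch of what that gather could look like inside a customized sampler. `probs_for_allowed_tokens` is a hypothetical helper, not part of vLLM, and note that in the V1 code I have seen, `allowed_token_ids_mask` is `True` for tokens that are *not* allowed (those positions are filled with `-inf` before sampling); verify that convention against your vLLM version.

```python
import torch


def probs_for_allowed_tokens(
    logits: torch.Tensor,                  # [num_reqs, vocab_size] raw logits
    allowed_token_ids_mask: torch.Tensor,  # [num_reqs, vocab_size] bool; True = token is disallowed
) -> torch.Tensor:
    """Hypothetical helper: softmax probabilities restricted to the allowed token set."""
    # Mirror the sampler's masking: disallowed positions get -inf and thus zero probability.
    masked_logits = logits.masked_fill(allowed_token_ids_mask, float("-inf"))
    return torch.softmax(masked_logits, dim=-1, dtype=torch.float32)
```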
Would you like a code example for how to do this modification?
I want to control, in the Sampler class, how the logprobs are calculated based on my custom parameter sampling_metadata.extra_args. But in V1 we can’t pass extra_args as a parameter the way we can in V0. How should I implement this?
In vLLM V1, you can pass custom arguments for sampling via the `extra_args` field in `SamplingParams`, which are then accessible in `sampling_metadata.extra_args` within the `Sampler` class. This is supported in V1 as of recent updates, and these custom arguments can be set through the REST API using the `vllm_xargs` field in your request (RFC: Custom sampling params, PR: Expose custom args in OpenAI APIs).

To implement a custom logprobs calculation, add your custom parameter to `vllm_xargs` in your API request (or to `extra_args` when calling the Python API). Then, in your custom `Sampler` logic, access it via `sampling_metadata.extra_args` and branch your logprobs calculation accordingly.
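A hedged sketch of that flow is shown below. It assumes your vLLM version already has the `SamplingParams.extra_args` field from the RFC/PR cited above and that, as described above, the value ends up available on the metadata object your sampler sees; it also glosses over the fact that `SamplingMetadata` is built per batch rather than per request. `compute_token_scores` and the `return_probs` key are made-up names for illustration.

```python
import torch

from vllm import SamplingParams

# Client side: attach a custom flag (over the REST API this would go in `vllm_xargs`).
params = SamplingParams(max_tokens=8, extra_args={"return_probs": True})


# Sampler side (sketch): branch the calculation on the custom flag.
def compute_token_scores(logits: torch.Tensor, sampling_metadata) -> torch.Tensor:
    # `extra_args` may be missing or None when the request did not set it; guard accordingly.
    extra_args = getattr(sampling_metadata, "extra_args", None) or {}
    if extra_args.get("return_probs"):
        return torch.softmax(logits, dim=-1, dtype=torch.float32)
    return logits.log_softmax(dim=-1, dtype=torch.float32)
```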
Would you like a code snippet showing how to access and use `extra_args` in the Sampler?
Please tell me how the parameters in SamplingParams are passed to SamplingMetadata in V1, and where the corresponding code is.
In vLLM V1, parameters in `SamplingParams` are passed to `SamplingMetadata` during batch preparation. This mapping occurs in the input batch logic, specifically in the `InputBatch` class (see the InputBatch code reference), where each request’s `SamplingParams` is used to populate fields in `SamplingMetadata` via the `_make_sampling_metadata()` method. This method collects the per-request sampling parameters and constructs the `SamplingMetadata` object, which is then passed to the sampler.

For example, in `gpu_input_batch.py`, the `_make_sampling_metadata()` method gathers fields like temperature, top_k, top_p, penalties, and any `extra_args` from `SamplingParams` and sets them on the resulting `SamplingMetadata` instance (SamplingMetadata API).
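To make the flow concrete, here is a heavily simplified, hypothetical sketch of that mapping. Field names follow the real classes where possible, but this is not the actual vLLM code; see `gpu_input_batch.py` for the real `_make_sampling_metadata()`.

```python
from dataclasses import dataclass

import torch

from vllm import SamplingParams


@dataclass
class TinySamplingMetadata:
    """Simplified stand-in for vLLM V1's SamplingMetadata (illustration only)."""
    temperature: torch.Tensor
    top_k: torch.Tensor
    top_p: torch.Tensor


def make_sampling_metadata(per_request_params: list[SamplingParams]) -> TinySamplingMetadata:
    # Conceptually what InputBatch._make_sampling_metadata() does: batch the
    # per-request SamplingParams values into tensors the sampler consumes.
    return TinySamplingMetadata(
        temperature=torch.tensor([p.temperature for p in per_request_params]),
        top_k=torch.tensor([p.top_k for p in per_request_params]),
        top_p=torch.tensor([p.top_p for p in per_request_params]),
    )
```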
Would you like a code snippet showing this mapping process?
V1 can also restrict sampling to a set of allowed tokens, right? How do I do that? Please give an example.
Yes, vLLM V1 supports specifying allowed sampling tokens using the `allowed_token_ids` parameter in `SamplingParams`. This restricts generation to only the provided token IDs (SamplingParams API, Sampler API).
Example usage in Python:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

# Only tokens 42, 43, and 44 may be sampled at each generation step.
params = SamplingParams(allowed_token_ids=[42, 43, 44], max_tokens=10)

outputs = llm.generate("Hello", params)
print(outputs[0].outputs[0].text)
```
Would you like more detail on how this works internally?