If the top token’s probability is always near 1.0, the model is extremely confident in its prediction and is assigning almost all of the probability mass to a single token. This can happen when the prompt context is highly deterministic or when the model’s logits are very peaked. With temperature=1.0 the logits are not rescaled before the softmax, so you would typically expect a more spread-out distribution unless the model is inherently certain about the next token. This is not a bug in vLLM; it reflects the model’s output distribution for your specific prompt and context, as discussed in issue #2613.
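To make the "peaked logits" point concrete, here is a minimal, library-free sketch of a temperature-scaled softmax. It is not vLLM code; the logit values and the names `softmax`, `peaked_logits`, and `flat_logits` are made up purely for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: one token far ahead of the rest vs. a close race.
peaked_logits = [12.0, 2.0, 1.0, 0.5]
flat_logits = [2.0, 1.5, 1.0, 0.5]

peaked = softmax(peaked_logits, temperature=1.0)
flat = softmax(flat_logits, temperature=1.0)
hotter = softmax(peaked_logits, temperature=5.0)

print("peaked, T=1.0:", [round(p, 5) for p in peaked])
print("flat,   T=1.0:", [round(p, 5) for p in flat])
print("peaked, T=5.0:", [round(p, 5) for p in hotter])
```

With a 10-point gap between the top two logits, the top token takes nearly all of the mass even at temperature=1.0, while the closely spaced logits give a visibly spread-out distribution. Raising the temperature flattens the peaked case but does not change which token is on top.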
Would you like more detail or help diagnosing your prompt/context?
Sources:
- https://github.com/vllm-project/vllm/issues/9453
- https://docs.vllm.ai/en/latest/usage/v1_guide.html#embedding-models