Degraded RL training with v0.10.2

It seems likely that the reward degradation after upgrading vLLM is related to changes in how logprobs are computed and exported in the new version. According to the discussion in PrimeIntellect-ai/prime-rl#919, vLLM 0.10.2 now exports processed logprobs, i.e. after temperature scaling and top-k/top-p filtering, which aligns with your previous custom patch. However, your experiments suggest that exporting logprobs before versus after top-k/top-p filtering does not explain the reward drop, so the root cause may lie elsewhere in the sampler, or in numerical differences in how the logprobs are computed.
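To make the "processed vs. raw" distinction concrete, here is a minimal PyTorch sketch of what processed logprobs mean in this context: the log-softmax taken after temperature scaling and top-k/top-p masking, rather than from the unmodified logits. This is an illustrative re-implementation under common conventions, not vLLM's actual sampler code, and the exact masking order may differ from what vLLM does internally.

```python
import torch


def processed_logprobs(logits: torch.Tensor, temperature: float,
                       top_k: int, top_p: float) -> torch.Tensor:
    """Recompute 'processed' logprobs (temperature + top-k/top-p) from raw logits.

    Useful for checking which distribution a set of exported logprobs
    corresponds to: compare against both this and a plain log_softmax(logits).
    """
    # Temperature scaling (temperature=0 is greedy in practice; clamp to avoid div-by-zero)
    logits = logits / max(temperature, 1e-6)

    # Top-k: mask everything below the k-th largest logit
    if 0 < top_k < logits.size(-1):
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    # Top-p: keep the smallest prefix of tokens whose cumulative probability reaches top_p
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        remove = cum_probs > top_p
        # Shift right so the token that crosses the threshold is kept
        remove[..., 1:] = remove[..., :-1].clone()
        remove[..., 0] = False
        mask = torch.zeros_like(remove).scatter(-1, sorted_idx, remove)
        logits = logits.masked_fill(mask, float("-inf"))

    return torch.log_softmax(logits, dim=-1)
```

Comparing a few exported per-token values against both this function and a raw `log_softmax` of the logits should tell you which distribution each vLLM version is actually reporting.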

My understanding is that vLLM logprobs can vary due to numerical instability or batching differences, especially after major changes to the sampler or logprob export logic. This is consistent with reports of numerical differences between vLLM and HuggingFace Transformers logprobs, which can affect RL reward scaling and convergence (discuss.vllm.ai: numerical difference between vllm logprobs and huggingface logprobs). You may want to check for subtle changes in logprob values, sampling order, or batch handling between versions, as these can shift RL reward curves even when the logprob export location is unchanged.
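If it helps in the meantime, below is a rough harness for the version-to-version comparison. The model name, prompts, and `max_tokens` are placeholders; the idea is to run the same script once in a 0.10.1 environment and once in a 0.10.2 environment, then diff the per-token values (e.g. max absolute difference) to see whether the numbers themselves moved, independently of where logprobs are exported.

```python
"""Dump per-token logprobs for a fixed prompt set under one vLLM install.

Run once per vLLM version (separate environments) and diff the JSON outputs.
"""
import json
import sys

from vllm import LLM, SamplingParams

PROMPTS = [
    "The capital of France is",
    "In reinforcement learning, the policy gradient",
]


def dump_logprobs(model: str, out_path: str) -> None:
    llm = LLM(model=model)
    # Greedy decoding keeps the sampled tokens identical across versions so
    # logprobs can be compared token-for-token; rerun with your production
    # temperature/top-p afterwards to match training behaviour.
    params = SamplingParams(temperature=0.0, max_tokens=32, logprobs=1)
    records = []
    for output in llm.generate(PROMPTS, params):
        completion = output.outputs[0]
        # completion.logprobs is a list of {token_id: Logprob}; take the entry
        # for the token that was actually sampled at each position.
        token_logprobs = [
            pos_logprobs[token_id].logprob
            for token_id, pos_logprobs in zip(completion.token_ids, completion.logprobs)
        ]
        records.append({
            "prompt": output.prompt,
            "token_ids": list(completion.token_ids),
            "logprobs": token_logprobs,
        })
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)


if __name__ == "__main__":
    dump_logprobs(model=sys.argv[1], out_path=sys.argv[2])
```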

Would you like a step-by-step breakdown of how to compare logprobs and sampler behavior between vLLM versions to help isolate the cause?
