Hey vLLM team. Congratulations on the new release! Was looking forward to it for quite some time (we were using torch 2.8 for our RL trainer, so we needed two venvs, one running torch 2.7, until the new release).
However, it seems that upgrading to the new version is yielding worse reward curves in our CI.
I was wondering if you guys have any ideas as to why this could be the case? Our suspicion right now is that there were significant changes to the sampler.
Previously, we patched around the old logprob export logic with a custom sampler that exported the logprobs after temperature scaling. It seems this is no longer needed in version 0.10.2, since the logic has changed to export the processed logprobs in the top-k/top-p sampler class after temperature scaling and truncation filtering (top-k, top-p). I don't think the change in reward is related to the new logprobs being taken after top-k/top-p filtering, because modifying the code to export the logprobs before that step didn't seem to make a difference to the reward curves.
If you guys would like to repro, you can run uv run pytest tests/integration/test_rl.py -m gpu on the main branch and on the up-vllm-version PR branch in prime-rl. This should take ~6 min on 2x 3090 and ~5 min on 2x H100. The test will export to your wandb under the project ci-reverse-text-rl-local.
It seems likely that the reward degradation after upgrading vLLM is related to changes in how logprobs are computed and exported in the new version. According to the discussion in PrimeIntellect-ai/prime-rl#919, vLLM 0.10.2 now exports processed logprobs after temperature scaling and top-k/top-p filtering, which aligns with your previous custom patch. However, your experiments suggest that exporting logprobs before or after top-k/top-p filtering does not explain the reward drop, so the root cause may lie elsewhere in the sampler or numerical differences in logprob computation.
My understanding is that vLLM logprobs can vary due to numerical instability or batching differences, especially after major changes to the sampler or logprob export logic. This matches reports of numerical differences between vLLM and HuggingFace Transformers logprobs, which can affect RL reward scaling and convergence (discuss.vllm.ai: numerical difference between vllm logprobs and huggingface logprobs). You may want to check for subtle changes in logprob values, sampling order, or batch handling between versions, as these can impact RL reward curves even if the logprob export location is unchanged.
Would you like a step-by-step breakdown of how to compare logprobs and sampler behavior between vLLM versions to help isolate the cause?
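If it helps in the meantime, here is a rough sketch of how such a comparison could be set up using only vLLM's public API; the model name, prompts, and output filename below are placeholders, not anything from prime-rl's test suite:

```python
import json

from vllm import LLM, SamplingParams

# Placeholder prompts; in practice use a fixed set drawn from the RL task.
PROMPTS = [
    "Reverse the text: hello world",
    "Reverse the text: prime intellect",
]

# Placeholder model; swap in the model used by the RL run.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Greedy decoding so the selected tokens match across versions and the
# per-token logprobs can be diffed directly.
params = SamplingParams(temperature=0.0, max_tokens=32, logprobs=5)

outputs = llm.generate(PROMPTS, params)

dump = []
for out in outputs:
    completion = out.outputs[0]
    per_token = []
    for token_id, lp_dict in zip(completion.token_ids, completion.logprobs):
        # lp_dict maps token_id -> Logprob(logprob=..., rank=..., decoded_token=...);
        # the chosen token is always included.
        per_token.append({"token_id": token_id, "logprob": lp_dict[token_id].logprob})
    dump.append({"prompt": out.prompt, "tokens": per_token})

# Run the same script in each venv (old and new vLLM) and diff the two files.
with open("logprobs_dump.json", "w") as f:
    json.dump(dump, f, indent=2)
```

Running this once per venv and diffing the dumps should show whether the per-token logprob values themselves shifted between versions, independently of the export-location question.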
Yes, the PR changes the definition of “processed logprobs.” In vLLM 0.10.2, “processed logprobs” are now computed after applying all logit processors, including temperature, penalties, and top-k/top-p filtering, rather than at an earlier stage. This change can affect downstream RL reward calculations, as the logprobs now reflect the final sampling distribution, not just temperature-scaled logits (see comment).
If your RL pipeline expects logprobs before certain processors, this semantic shift could impact reward scaling and convergence. Would you like a detailed breakdown of the logprobs computation order in the new vLLM version?
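As a rough illustration of that order, here is a minimal sketch in plain PyTorch; this is not vLLM's actual sampler code, and the penalty step is omitted for brevity:

```python
import torch

def processed_logprobs(logits: torch.Tensor, temperature: float, top_k: int) -> torch.Tensor:
    """Illustrative only: logprobs taken after all logit processors.

    logits: [vocab_size] raw logits for one sampling step.
    """
    # 1) Temperature scaling (penalties would also be applied around this step).
    scaled = logits / temperature

    # 2) Top-k truncation: mask everything outside the top-k to -inf.
    #    (Top-p truncation works analogously on the cumulative probability mass.)
    kth_value = torch.topk(scaled, top_k).values[-1]
    truncated = scaled.masked_fill(scaled < kth_value, float("-inf"))

    # 3) "Processed" logprobs: log-softmax over the truncated, scaled logits,
    #    i.e. the log of the distribution that is actually sampled from.
    return torch.log_softmax(truncated, dim=-1)
```

The practical consequence is that tokens outside the truncated support come back as -inf, and the logprob of the sampled token is renormalized over the surviving support rather than the full vocabulary.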
The extra argmax_invariant and top-p/top-k steps don't seem to be the cause of the issue. I tried exporting the logprobs right after temperature scaling and got the same reward curve.
Modification to the sampler to make the export happen right after temperature scaling:
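(The snippet below is only an illustrative sketch of the kind of change described, with hypothetical names; it is not the actual patch or vLLM's real sampler API.)

```python
import torch

def sample_with_early_logprob_export(logits: torch.Tensor, temperature: torch.Tensor):
    """Hypothetical sketch: move the logprob export to right after temperature
    scaling, before the argmax_invariant processors and top-k/top-p run.

    logits: [num_seqs, vocab_size], temperature: [num_seqs, 1]
    """
    scaled = logits / temperature

    # Export point moved up: logprobs over the full vocabulary,
    # post-temperature but pre-truncation.
    exported_logprobs = torch.log_softmax(scaled, dim=-1)

    # The rest of the pipeline (argmax-invariant processors, top-k/top-p,
    # the actual sampling step) continues on `scaled` unchanged; omitted here.
    return exported_logprobs, scaled
```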
The reward curve is similar to the one from the original PR that bumped the version:
Your experiment confirms that exporting logprobs immediately after temperature scaling (before argmax_invariant and top-k/top-p) yields reward curves nearly identical to those from exporting after all processing steps. This suggests that the extra argmax_invariant and top-k/top-p steps are not the cause of the observed reward differences between vLLM versions (see your comment).
The similarity of the reward curves indicates that the main source of the change likely lies elsewhere, possibly in subtle numerical or sampler-logic differences introduced by the vLLM upgrade rather than in the logprob export location. Would you like a step-by-step guide to further isolate the source of the reward curve change?
Ah, this turned out to be an issue with the checkpoint loading logic our trainer uses for torch 2.8 and was unrelated to the vLLM version bump. Sorry for the false alarm, guys.
Commenting out the bad checkpoint loading logic fixes the rewards:
Thanks for the update! Since the reward curves are now similar across all configurations and the issue was traced to your trainer’s checkpoint loading logic (not the vLLM version), there is no vLLM-related problem here. Commenting out the problematic checkpoint logic resolves the reward discrepancy, as shown in your attached graphs.
If you need further help with vLLM or RL integration, let me know!