Degraded RL training with v0.10.2

It seems likely that the reward degradation after upgrading vLLM is related to changes in how logprobs are computed and exported in the new version. According to the discussion in PrimeIntellect-ai/prime-rl#919, vLLM 0.10.2 now exports processed logprobs, i.e. after temperature scaling and top-k/top-p filtering, which aligns with your previous custom patch. However, your experiments suggest that exporting logprobs before versus after top-k/top-p filtering does not explain the reward drop, so the root cause may lie elsewhere in the sampler, or in numerical differences in how the logprobs are computed.
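To make the "processed vs. raw" distinction concrete, here is a minimal PyTorch sketch of what processed logprobs mean in this context: the log-softmax taken after temperature scaling and top-k/top-p masking, rather than from the unmodified logits. This is an illustrative re-implementation under common conventions, not vLLM's actual sampler code, and the exact masking order may differ from what vLLM does internally.

```python
import torch


def processed_logprobs(logits: torch.Tensor, temperature: float,
                       top_k: int, top_p: float) -> torch.Tensor:
    """Recompute 'processed' logprobs (temperature + top-k/top-p) from raw logits.

    Useful for checking which distribution a set of exported logprobs
    corresponds to: compare against both this and a plain log_softmax(logits).
    """
    # Temperature scaling (temperature=0 is greedy in practice; clamp to avoid div-by-zero)
    logits = logits / max(temperature, 1e-6)

    # Top-k: mask everything below the k-th largest logit
    if 0 < top_k < logits.size(-1):
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    # Top-p: keep the smallest prefix of tokens whose cumulative probability reaches top_p
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        remove = cum_probs > top_p
        # Shift right so the token that crosses the threshold is kept
        remove[..., 1:] = remove[..., :-1].clone()
        remove[..., 0] = False
        mask = torch.zeros_like(remove).scatter(-1, sorted_idx, remove)
        logits = logits.masked_fill(mask, float("-inf"))

    return torch.log_softmax(logits, dim=-1)
```

Comparing a few exported per-token values against both this function and a raw `log_softmax` of the logits should tell you which distribution each vLLM version is actually reporting.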

My understanding is that vLLM logprobs can vary due to numerical instability or batching differences, especially after major changes to the sampler or logprob export logic. This is consistent with reports of numerical differences between vLLM and HuggingFace Transformers logprobs, which can affect RL reward scaling and convergence (discuss.vllm.ai: numerical difference between vllm logprobs and huggingface logprobs). You may want to check for subtle changes in logprob values, sampling order, or batch handling between versions, as these can shift RL reward curves even when the logprob export location is unchanged.
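If it helps in the meantime, below is a rough harness for the version-to-version comparison. The model name, prompts, and `max_tokens` are placeholders; the idea is to run the same script once in a 0.10.1 environment and once in a 0.10.2 environment, then diff the per-token values (e.g. max absolute difference) to see whether the numbers themselves moved, independently of where logprobs are exported.

```python
"""Dump per-token logprobs for a fixed prompt set under one vLLM install.

Run once per vLLM version (separate environments) and diff the JSON outputs.
"""
import json
import sys

from vllm import LLM, SamplingParams

PROMPTS = [
    "The capital of France is",
    "In reinforcement learning, the policy gradient",
]


def dump_logprobs(model: str, out_path: str) -> None:
    llm = LLM(model=model)
    # Greedy decoding keeps the sampled tokens identical across versions so
    # logprobs can be compared token-for-token; rerun with your production
    # temperature/top-p afterwards to match training behaviour.
    params = SamplingParams(temperature=0.0, max_tokens=32, logprobs=1)
    records = []
    for output in llm.generate(PROMPTS, params):
        completion = output.outputs[0]
        # completion.logprobs is a list of {token_id: Logprob}; take the entry
        # for the token that was actually sampled at each position.
        token_logprobs = [
            pos_logprobs[token_id].logprob
            for token_id, pos_logprobs in zip(completion.token_ids, completion.logprobs)
        ]
        records.append({
            "prompt": output.prompt,
            "token_ids": list(completion.token_ids),
            "logprobs": token_logprobs,
        })
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)


if __name__ == "__main__":
    dump_logprobs(model=sys.argv[1], out_path=sys.argv[2])
```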

Would you like a step-by-step breakdown of how to compare logprobs and sampler behavior between vLLM versions to help isolate the cause?
