When training with Verl using vLLM as the inference engine for DAPO, I observed that the actor rollout mode (sync vs. async) has a significant impact on the final test results. Specifically, using async inference during training (and deploying the trained model with vLLM in async mode) yields better performance. What could be the reason for this discrepancy? Additionally, the quality of responses during async inference appears to be better.
I am very curious about this. Is there anyone who can explain it? Thx!
The difference in results between sync and async rollout modes in vLLM is likely due to how batching and numerical determinism are handled. In async mode, vLLM schedules and batches requests differently and may process them concurrently, so the same prompt can end up in different batch compositions from step to step. Because floating-point reductions and kernel selection depend on batch shape, this produces slight numerical differences in logit/logprob values; when two candidate tokens are nearly tied, those tiny differences can resolve the tie differently, and once a single token differs the continuations diverge in content and quality, even with the same seed and prompts. This is known behavior in vLLM: batching variations and non-deterministic Torch operations can affect output stability, especially in async settings where request scheduling is less predictable. For more details, see the discussion in the vLLM FAQ and its reproducibility issues.
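To make that mechanism concrete, here is a minimal, self-contained sketch (plain Python, not tied to vLLM or Verl; the numbers are illustrative, not from any real model) of the two ingredients: floating-point reductions are order-dependent, and a perturbation of roughly that size is enough to flip the selected token when two candidates are nearly tied.

```python
# 1) Floating-point addition is not associative: reducing the same values in a
#    different order (as happens when a request shares a batch with different
#    neighbours, or a kernel splits the reduction differently) gives a slightly
#    different result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right, right_to_left, left_to_right == right_to_left)
# 0.6000000000000001 0.6 False

# 2) Near a tie between two candidate tokens, a perturbation on the order of
#    fp16 rounding noise is enough to change which token is picked; after one
#    flip, the two generations share a shorter prefix and diverge from there.
def greedy(logits):
    """Index of the highest logit (greedy decoding)."""
    return max(range(len(logits)), key=logits.__getitem__)

logits    = [12.00000, 11.99995, 3.0]          # token 0 barely wins
perturbed = [12.00000, 11.99995 + 1e-4, 3.0]   # ~1e-4 numerical noise on token 1
print("greedy token, original :", greedy(logits))     # 0
print("greedy token, perturbed:", greedy(perturbed))  # 1
```

The same effect applies to sampling rather than greedy decoding: a shifted logprob moves the probability mass slightly, so the same random draw can land on a different token. That is why sync and async rollouts can produce different trajectories, and why the RL training signal (and the resulting policy) can end up different between the two modes.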
Would you like a step-by-step breakdown of how async batching and numerical instability affect RL training and inference quality?