Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training • Paper • 2605.12380 • Published May 12 • 2
Article • Efficient LLM Pretraining: Packed Sequences and Masked Attention • sirluk • Oct 7, 2024 • 71