Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation • Paper • 2602.12125 • Published 4 days ago • 56
Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments • Paper • 2602.11964 • Published 4 days ago • 10
Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR • Paper • 2602.05261 • Published 11 days ago • 48
Scaling Embeddings Outperforms Scaling Experts in Language Models • Paper • 2601.21204 • Published 18 days ago • 99