Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation • Paper • 2603.19220 • Published Mar 19 • 69
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR • Paper • 2605.20164 • Published 3 days ago • 2
GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment • Paper • 2605.19577 • Published 3 days ago • 53
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL • Paper • 2605.18703 • Published 4 days ago • 44
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models • Paper • 2605.08472 • Published 14 days ago • 3
Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis • Paper • 2605.14392 • Published 8 days ago • 8
RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards • Paper • 2605.10899 • Published 11 days ago • 74
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents • Paper • 2605.10832 • Published 11 days ago • 21
Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction • Paper • 2605.12070 • Published 10 days ago • 16
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive • Paper • 2605.11518 • Published 10 days ago • 4
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification • Paper • 2605.09269 • Published 12 days ago • 6
DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning • Paper • 2605.10488 • Published 11 days ago • 3