Abstract
SAGE is an on-policy reinforcement learning framework that enhances GRPO by injecting self-hints during training to increase outcome diversity under sparse rewards, improving alignment of large language models.
Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt x, the model samples a compact hint h (e.g., a plan or decomposition) and then generates a solution τ conditioned on (x,h). Crucially, the task reward R(x,τ) is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set h = ∅ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, with average gains of +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct, and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
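The advantage-collapse argument in the abstract can be made concrete with a minimal sketch of GRPO-style group-normalized advantages. The function name below is illustrative, not from the SAGE repository; it only shows why an all-identical reward group yields zero advantages and how a single successful (e.g., hint-conditioned) rollout restores a learning signal.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each rollout's reward by the group
    mean and standard deviation. If every rollout in the group receives the
    same terminal reward (e.g., all 0 under a sparse verifier), every
    numerator is zero and the policy update vanishes."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Sparse-reward failure mode: the whole group fails the verifier.
print(group_relative_advantages([0, 0, 0, 0]))  # [0. 0. 0. 0.] -> no gradient signal

# With hint-conditioned rollouts, at least one success becomes more likely,
# so the group advantages stay non-zero and informative.
print(group_relative_advantages([0, 1, 0, 0]))  # mixed signs -> useful update
```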
Community
RL for LLMs often stalls under sparse rewards — especially with GRPO, where whole rollout groups get identical 0 rewards and learning just… dies.
💡 SAGE fixes this with a simple but powerful idea:
👉 Let the model give itself hints during training.
How it works (see the sketch after this list):
- The model samples a compact hint (plan / decomposition) before solving
- Rewards stay unchanged (same verifier, same objective)
- Hints only reshape sampling, preventing advantage collapse
- At test time? No hints at all. Clean deployment.
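A minimal Python sketch of this train/test asymmetry, assuming a hint sampler, a solution generator, and a verifier reward passed in as placeholder callables; these names are illustrative stand-ins, not the repository's API.

```python
from typing import Callable, List, Tuple

def collect_group(
    sample_hint: Callable[[str], str],
    generate: Callable[[str, str], str],
    verifier_reward: Callable[[str, str], float],
    prompt: str,
    group_size: int = 8,
    train: bool = True,
) -> List[Tuple[str, str, float]]:
    """Collect one GRPO group of rollouts for a single prompt."""
    rollouts = []
    for _ in range(group_size):
        # Training: sample a compact self-hint (plan / decomposition),
        # then generate a solution conditioned on (prompt, hint).
        # Test time: the hint is empty, so deployment uses no privileged input.
        hint = sample_hint(prompt) if train else ""
        solution = generate(prompt, hint)
        # The reward depends only on (prompt, solution): hints reshape the
        # sampling distribution but never change the verifier or objective.
        reward = verifier_reward(prompt, solution)
        rollouts.append((hint, solution, reward))
    return rollouts

# Toy usage with stub callables (illustration only):
demo = collect_group(
    sample_hint=lambda p: "plan: decompose into subproblems",
    generate=lambda p, h: f"answer to {p!r} using hint {h!r}",
    verifier_reward=lambda p, s: 0.0,  # sparse 0/1 terminal verifier reward
    prompt="Solve 12 * 7.",
    group_size=4,
)
```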
🔥 Why it matters:
- Turns dead-end prompts into useful learning signals
- Acts as an adaptive curriculum driven by the model itself
- Stays fully on-policy (no external teachers required)
📊 Results: average gains over GRPO across 6 benchmarks & 3 LLMs:
- +2.0 on Llama-3.2-3B
- +1.2 on Qwen2.5-7B
- +1.3 on Qwen3-4B
Sometimes the best teacher is… yourself 😌
Code: https://github.com/BaohaoLiao/SAGE