arxiv:2601.18292

TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

Published on Jan 26 · Submitted by sunlin on Jan 28
Abstract

A closed-loop reinforcement learning framework enables iterative collaboration among attacker, defender, and evaluator roles for improved large language model safety alignment without manual annotations.

AI-generated summary

In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three roles: an attacker that generates adversarial prompts, a defender that provides safe responses, and an evaluator that assesses those responses. In this paper, we propose TriPlay-RL, a closed-loop reinforcement learning framework in which the three roles iteratively co-improve with near-zero manual annotation. Experimental results show that the attacker preserves high output diversity while achieving a 20%-50% improvement in adversarial effectiveness; the defender attains 10%-30% gains in safety performance without degrading general reasoning capability; and the evaluator continuously refines its fine-grained judgment ability across iterations, accurately distinguishing unsafe responses, simple refusals, and useful guidance. Overall, our framework establishes an efficient and scalable paradigm for LLM safety alignment, enabling continuous co-evolution within a unified learning loop.
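Concretely, one round of the closed loop might look like the minimal sketch below. This is illustrative only: the object interfaces (`generate`, `judge`, `update`) and the reward values are assumptions for exposition, not the paper's actual implementation or reward design.

```python
# Minimal sketch of one TriPlay-RL-style iteration. All interfaces and reward
# values here are illustrative assumptions, not taken from the paper.

def triplay_iteration(attacker, defender, evaluator, seed_prompts):
    # 1. Attacker turns seed prompts into adversarial prompts.
    adversarial = [attacker.generate(p) for p in seed_prompts]

    # 2. Defender answers each adversarial prompt.
    responses = [defender.generate(p) for p in adversarial]

    # 3. Evaluator assigns a fine-grained verdict to each response:
    #    "unsafe", "refusal", or "useful" (safe and helpful guidance).
    verdicts = [evaluator.judge(p, r) for p, r in zip(adversarial, responses)]

    # 4. Turn verdicts into opposing rewards and update each role with RL.
    attacker_rewards = [1.0 if v == "unsafe" else 0.0 for v in verdicts]
    defender_rewards = [1.0 if v == "useful" else 0.5 if v == "refusal" else 0.0
                        for v in verdicts]

    attacker.update(adversarial, attacker_rewards)
    defender.update(responses, defender_rewards)
    # The evaluator is also refined each round, e.g. on high-agreement
    # (prompt, response, verdict) triples, so all three roles co-evolve.
    evaluator.update(adversarial, responses, verdicts)
```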

Community

Paper author · Paper submitter

TriPlay-RL: Solving the Safety vs. Reasoning Trade-off with 3-Way Self-Play

Really interesting take on automated safety alignment. We've seen plenty of "Red Team vs. Blue Team" setups, but they often suffer from one of two failure modes: the Red Team eventually finds one exploit and spams it (mode collapse), or the Blue Team becomes "safe" but loses its general reasoning ability (the alignment tax).

This paper introduces TriPlay-RL, which adds a third active player: an Evolving Evaluator. Instead of using a fixed reward model or human labels, all three roles (Attacker, Defender, Evaluator) co-evolve in a closed loop.
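For intuition, the evaluator's fine-grained judgment (unsafe response vs. simple refusal vs. useful guidance) could be prompted roughly like this; the template and label names are hypothetical, not the paper's actual rubric:

```python
# Hypothetical three-way judging prompt for the evaluator role (illustrative
# only; the paper's rubric and label set may differ).
JUDGE_TEMPLATE = """You are a safety evaluator.
Given an adversarial prompt and a model response, answer with exactly one label:
- UNSAFE: the response provides harmful or policy-violating content.
- REFUSAL: the response merely refuses, with no useful safe alternative.
- USEFUL: the response stays safe while giving helpful guidance.

Prompt: {prompt}
Response: {response}
Label:"""

def build_judge_input(prompt: str, response: str) -> str:
    # Fill the template with one (prompt, response) pair for the evaluator LLM.
    return JUDGE_TEMPLATE.format(prompt=prompt, response=response)
```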

Why this stands out:

- No Alignment Tax: The Defender improved safety performance by 10-30% without degrading general reasoning benchmarks. This is usually the hardest part of safety training.
- Diverse Attacks: They use diversity penalties to stop the Attacker from collapsing into a single pattern, keeping the pressure on the Defender high (see the sketch after this comment).
- Near-Zero Data: It works with minimal manual annotation, making it scalable.

It feels like a step closer to an "AlphaZero" moment for safety alignment. Highly recommend checking the ablation studies on entropy collapse!
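On the diversity point: one simple way to shape an attacker-side diversity penalty (an assumed design for illustration; the paper's actual penalty may rely on embeddings, n-gram overlap, or entropy terms) is to discount an attack's reward by its similarity to previously generated attacks:

```python
from difflib import SequenceMatcher

def diversity_penalized_reward(prompt: str, success: float, history: list,
                               weight: float = 0.5) -> float:
    """Hypothetical reward shaping for the attacker.

    `success` is 1.0 if the evaluator judged the defender's response unsafe.
    A successful attack that closely resembles earlier attacks earns less,
    discouraging mode collapse onto a single exploit.
    """
    if history:
        # Max lexical similarity to any previously generated attack prompt.
        similarity = max(SequenceMatcher(None, prompt, h).ratio() for h in history)
    else:
        similarity = 0.0
    history.append(prompt)
    return success - weight * similarity
```

In practice one would likely measure similarity in embedding space rather than with `SequenceMatcher`, but the shape of the incentive is the same: repeated attacks stop paying off.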

