Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases Paper • 2605.27355 • Published 10 days ago • 7
Hahmdong/RMOOD-llama3.2-3b-it-skywork-doubledatarm-biased100-to-good100 3B • Updated 23 days ago • 19