Models trained for the Obfuscation Atlas paper: Obfuscated Policy, Obfuscated Activations, Blatant Deception, and Honest.
- The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
  Paper • 2602.15515
- taufeeque/mbpp-hardcode
- AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.001-det10-seed1-mbpp_probe
- AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.0001-det10-seed1-mbpp_probe