Interpretable Reward Model via Sparse Autoencoder
Paper β’ 2508.08746 β’ Published β’ 2
How to use Schrieffer/Llama-SARM-4B-PostSAEPretrain with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("feature-extraction", model="Schrieffer/Llama-SARM-4B-PostSAEPretrain", trust_remote_code=True) # Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("Schrieffer/Llama-SARM-4B-PostSAEPretrain", trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained("Schrieffer/Llama-SARM-4B-PostSAEPretrain", trust_remote_code=True)This repository contains the model weights of the AAAI 2026 Oral Paper "Interpretable Reward Model via Sparse Autoencoder".
We release Llama-SARM-4B-PostSAEPretrain, which has an identical architecture to Llama-SARM-4B:
Authors
Shuyi Zhang*, Wei Shi*, Sihang Li*, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wangβ
Code Repository: https://github.com/schrieffer-z/sarm
If you have any questions, please feel free to reach us at shuyizhang@mail.ustc.edu.cn.
If you find our work useful, please cite it as follows.
@article{zhang2025interpretable,
title={Interpretable Reward Model via Sparse Autoencoder},
author={Zhang, Shuyi and Shi, Wei and Li, Sihang and Liao, Jiayi and Liang, Tao and Cai, Hengxing and Wang, Xiang},
journal={arXiv preprint arXiv:2508.08746},
year={2025}
}