Update reward model (best AUROC 0.989, trained on spc-pick-stuff 200 ep) 8a42716 verified binhpham committed 14 days ago