eiffel-tower-llama

Running

fix missing word

by glutamatt HF Staff - opened 8 days ago

←

Files changed (1) hide show

app/src/content/article.mdx CHANGED Viewed

@@ -479,7 +479,7 @@ Another plausible explanation could be that **the selected features are actually
 ### 6.1 Main conclusions
-In this study, we demonstrated the use of sparse autoencoders to steer a lightweight open-source model (Llama 3.1 8B Instruct) to create a conversational agent obsessed with the Eiffel Tower, similar to the Golden Gate Claude experiment. As reported by the AxBench paper, and as can be experienced on Neuronpedia, steering with SAEs is harder initially expected, and finding good steering coefficients is not easy.
 First, we showed that simple improvements like clamping feature activations and using repetition penalty and lower temperature can help significantly. We then devised a systematic approach to optimize steering coefficients using bayesian optimization, and auxiliary metrics correlated with LLM-judge metrics.

 ### 6.1 Main conclusions
+In this study, we demonstrated the use of sparse autoencoders to steer a lightweight open-source model (Llama 3.1 8B Instruct) to create a conversational agent obsessed with the Eiffel Tower, similar to the Golden Gate Claude experiment. As reported by the AxBench paper, and as can be experienced on Neuronpedia, steering with SAEs is harder than initially expected, and finding good steering coefficients is not easy.
 First, we showed that simple improvements like clamping feature activations and using repetition penalty and lower temperature can help significantly. We then devised a systematic approach to optimize steering coefficients using bayesian optimization, and auxiliary metrics correlated with LLM-judge metrics.