dlouapre HF Staff committed
Commit 9bb6809 · 1 Parent(s): 88106ad

Correcting references
app/.astro/settings.json CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e42b78dadef4d029bfbbf8c6baf8980f3a8afad7e35fedc4b7a9681bb39f1fba
+oid sha256:a08fab59f490e8df55c4557923c272aec1a84d2f404f1c68e13fdd5317965436
 size 58
app/src/content/article.mdx CHANGED
@@ -91,7 +91,7 @@ However, those discovered features do not come with labels or meanings, so they
 
 ### 1.2 Neuronpedia
 
-To experience steering a model yourself, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed as a joint effort by Anthropic, EleutherAI, Goodfire AI, Google DeepMind and Decode. Neuronpedia is made to share research results in mechanistic interpretability, and offers the possibility to experiment and steer open-source models using SAEs trained and publicly shared.
+To experience steering a model yourself, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed by Decode, hosting contributions from various companies like Anthropic, EleutherAI, Goodfire AI, Google DeepMind. Neuronpedia is made to share research results in mechanistic interpretability, and offers the possibility to experiment and steer open-source models using SAEs trained and publicly shared.
 
 In this work, we will be using Llama 3.1 8B Instruct, and SAEs from [Finding "misaligned persona" features in open-weight models](https://www.lesswrong.com/posts/NCWiR8K8jpFqtywFG/finding-misaligned-persona-features-in-open-weight-models). Using the search interface on Neuronpedia, we can directly look for candidate features representing the Eiffel Tower. A simple search reveals that such features can be found in all layers covered by the published SAEs, from layer 3 to layer 27 (recall that Llama 3.1 8B has 32 layers).
 <Sidenote>
@@ -248,8 +248,8 @@ To find the optimal coefficient, we performed a sweep over a range of values for
 
 ### 3.1 Steering with nnsight
 
-We used the `nnsight` library to perform the steering and generation.
-This library, developed by NDIF, enables easy monitoring and manipulation of the internal activations of transformer models during generation. Example code is shown in Appendix.
+We used the `nnsight` library to perform the steering and generation [@fiotto2024nnsight].
+[This library](https://nnsight.net), developed by NDIF, enables easy monitoring and manipulation of the internal activations of transformer models during generation. Example code is shown in Appendix.
 
 
 ### 3.2 Range of steering coefficients
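The hunks above describe steering as adding a scaled SAE feature direction to the model's internal activations during generation. A minimal numpy sketch of that operation, independent of the nnsight API and with all names (`resid`, `direction`, `coeff`) hypothetical stand-ins rather than identifiers from the article's code:

```python
import numpy as np

# Hypothetical shapes: residual-stream activations (seq_len, d_model)
# and a unit-norm SAE decoder direction (d_model,).
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
resid = rng.normal(size=(seq_len, d_model))

direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)  # normalize the feature direction

coeff = 10.0  # steering coefficient, the quantity swept in section 3.2

# Steering: add the scaled feature direction at every token position.
steered = resid + coeff * direction

# Each residual vector's projection onto the direction shifts by exactly coeff.
shift = (steered - resid) @ direction
print(np.allclose(shift, coeff))  # True
```

In a library like nnsight this addition would be applied to a chosen layer's output inside a generation trace; the arithmetic itself is just this broadcasted vector addition.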
app/src/content/bibliography.bib CHANGED
@@ -165,3 +165,10 @@
   year={2024}
 }
 
+@article{fiotto2024nnsight,
+  title={NNsight and NDIF: Democratizing access to open-weight foundation model internals},
+  author={Fiotto-Kaufman, Jaden and Loftus, Alexander R and Todd, Eric and Brinkmann, Jannik and Pal, Koyena and Troitskii, Dmitrii and Ripa, Michael and Belfki, Adam and Rager, Can and Juang, Caden and others},
+  journal={arXiv preprint arXiv:2407.14561},
+  year={2024}
+}
+