dlouapre HF Staff committed
Commit 9bb6809 · 1 Parent(s): 88106ad

Correcting references
app/.astro/settings.json CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e42b78dadef4d029bfbbf8c6baf8980f3a8afad7e35fedc4b7a9681bb39f1fba
+oid sha256:a08fab59f490e8df55c4557923c272aec1a84d2f404f1c68e13fdd5317965436
 size 58
app/src/content/article.mdx CHANGED
@@ -91,7 +91,7 @@ However, those discovered features do not come with labels or meanings, so they
 
 ### 1.2 Neuronpedia
 
-To experience steering a model yourself, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed as a joint effort by Anthropic, EleutherAI, Goodfire AI, Google DeepMind and Decode. Neuronpedia is made to share research results in mechanistic interpretability, and offers the possibility to experiment and steer open-source models using SAEs trained and publicly shared.
+To experience steering a model yourself, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed by Decode, hosting contributions from various companies like Anthropic, EleutherAI, Goodfire AI, Google DeepMind. Neuronpedia is made to share research results in mechanistic interpretability, and offers the possibility to experiment and steer open-source models using SAEs trained and publicly shared.
 
 In this work, we will be using Llama 3.1 8B Instruct, and SAEs from [Finding "misaligned persona" features in open-weight models](https://www.lesswrong.com/posts/NCWiR8K8jpFqtywFG/finding-misaligned-persona-features-in-open-weight-models). Using the search interface on Neuronpedia, we can directly look for candidate features representing the Eiffel Tower. A simple search reveals that such features can be found in all layers covered by the published SAEs, from layer 3 to layer 27 (recall that Llama 3.1 8B has 32 layers).
 <Sidenote>
@@ -248,8 +248,8 @@ To find the optimal coefficient, we performed a sweep over a range of values for
 
 ### 3.1 Steering with nnsight
 
-We used the `nnsight` library to perform the steering and generation.
-This library, developed by NDIF, enables easy monitoring and manipulation of the internal activations of transformer models during generation. Example code is shown in Appendix.
+We used the `nnsight` library to perform the steering and generation [@fiotto2024nnsight].
+[This library](https://nnsight.net), developed by NDIF, enables easy monitoring and manipulation of the internal activations of transformer models during generation. Example code is shown in Appendix.
 
 
 ### 3.2 Range of steering coefficients
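The hunks above describe steering as adding a scaled SAE feature direction to the model's internal activations during generation. A minimal numpy sketch of that operation, independent of the nnsight API and with all names (`resid`, `direction`, `coeff`) hypothetical stand-ins rather than identifiers from the article's code:

```python
import numpy as np

# Hypothetical shapes: residual-stream activations (seq_len, d_model)
# and a unit-norm SAE decoder direction (d_model,).
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
resid = rng.normal(size=(seq_len, d_model))

direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)  # normalize the feature direction

coeff = 10.0  # steering coefficient, the quantity swept in section 3.2

# Steering: add the scaled feature direction at every token position.
steered = resid + coeff * direction

# Each residual vector's projection onto the direction shifts by exactly coeff.
shift = (steered - resid) @ direction
print(np.allclose(shift, coeff))  # True
```

In a library like nnsight this addition would be applied to a chosen layer's output inside a generation trace; the arithmetic itself is just this broadcasted vector addition.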
app/src/content/bibliography.bib CHANGED
@@ -165,3 +165,10 @@
   year={2024}
 }
 
+@article{fiotto2024nnsight,
+  title={NNsight and NDIF: Democratizing access to open-weight foundation model internals},
+  author={Fiotto-Kaufman, Jaden and Loftus, Alexander R and Todd, Eric and Brinkmann, Jannik and Pal, Koyena and Troitskii, Dmitrii and Ripa, Michael and Belfki, Adam and Rager, Can and Juang, Caden and others},
+  journal={arXiv preprint arXiv:2407.14561},
+  year={2024}
+}
+