Correcting references
app/.astro/settings.json CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:a08fab59f490e8df55c4557923c272aec1a84d2f404f1c68e13fdd5317965436
 size 58
app/src/content/article.mdx CHANGED
@@ -91,7 +91,7 @@ However, those discovered features do not come with labels or meanings, so they
 
 ### 1.2 Neuronpedia
 
-To experience steering a model yourself, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed
+To experience steering a model yourself, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed by Decode, hosting contributions from companies such as Anthropic, EleutherAI, Goodfire AI, and Google DeepMind. Neuronpedia exists to share mechanistic interpretability research, and lets you experiment with and steer open-source models using publicly shared, pre-trained SAEs.
 
 In this work, we will be using Llama 3.1 8B Instruct, and SAEs from [Finding "misaligned persona" features in open-weight models](https://www.lesswrong.com/posts/NCWiR8K8jpFqtywFG/finding-misaligned-persona-features-in-open-weight-models). Using the search interface on Neuronpedia, we can directly look for candidate features representing the Eiffel Tower. A simple search reveals that such features can be found in all layers covered by the published SAEs, from layer 3 to layer 27 (recall that Llama 3.1 8B has 32 layers).
 <Sidenote>
@@ -248,8 +248,8 @@ To find the optimal coefficient, we performed a sweep over a range of values for
 
 ### 3.1 Steering with nnsight
 
-We used the `nnsight` library to perform the steering and generation.
-This library, developed by NDIF, enables easy monitoring and manipulation of the internal activations of transformer models during generation. Example code is shown in Appendix.
+We used the `nnsight` library to perform the steering and generation [@fiotto2024nnsight].
+[This library](https://nnsight.net), developed by NDIF, enables easy monitoring and manipulation of the internal activations of transformer models during generation. Example code is shown in the Appendix.
 
 
 ### 3.2 Range of steering coefficients
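The article's actual nnsight appendix code is not part of this diff. As a rough illustration of the underlying technique (activation addition during a forward pass), here is a minimal self-contained PyTorch sketch using a forward hook on a toy linear stack; the model size, layer index, coefficient, and steering vector are all placeholders, not the article's Llama 3.1 8B setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a residual stream: 4 "layers" of width 16.
# (Assumption: the article steers Llama 3.1 8B via nnsight; this
# placeholder only demonstrates the activation-addition pattern.)
d_model, n_layers = 16, 4
layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])

# Placeholder feature direction (an SAE decoder row in the real setup).
steering_vector = torch.randn(d_model)
coeff = 4.0          # steering strength; the article sweeps this over a range
target_layer = 2     # which layer's output to modify

def run(x, steer=False):
    """Forward pass, optionally adding coeff * direction at one layer."""
    def hook(module, inputs, output):
        # Returning a tensor from a forward hook replaces the layer's output.
        return output + coeff * steering_vector

    handle = layers[target_layer].register_forward_hook(hook) if steer else None
    with torch.no_grad():
        h = x
        for layer in layers:
            h = layer(h)
    if handle is not None:
        handle.remove()
    return h

x = torch.randn(1, d_model)
baseline, steered = run(x), run(x, steer=True)
print(torch.allclose(baseline, steered))  # False: the intervention changed the output
```

With nnsight, the same idea is expressed by tracing the model and mutating the saved layer output inside a `generate` context, rather than registering hooks by hand.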
app/src/content/bibliography.bib CHANGED
@@ -165,3 +165,10 @@
 year={2024}
 }
 
+@article{fiotto2024nnsight,
+title={NNsight and NDIF: Democratizing access to open-weight foundation model internals},
+author={Fiotto-Kaufman, Jaden and Loftus, Alexander R and Todd, Eric and Brinkmann, Jannik and Pal, Koyena and Troitskii, Dmitrii and Ripa, Michael and Belfki, Adam and Rager, Can and Juang, Caden and others},
+journal={arXiv preprint arXiv:2407.14561},
+year={2024}
+}
+