HobbyLM-SAE

A top-k sparse autoencoder (SAE) for mechanistic interpretability of HobbyLM-Base. It decomposes the residual stream after layer 8 into a sparse, overcomplete dictionary of 12288 features, of which 32 are active per token. Most of the features are human-interpretable; 12257 of them are auto-labeled by their top-activating tokens.
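
As a rough illustration of what "top-k" means here, the sketch below runs a generic top-k SAE forward pass in NumPy on toy-sized random weights. It is not taken from hobbylm/sae.py: the weight layout (W_enc as d_model × n_features, W_dec as n_features × d_model), the subtraction of b_dec before encoding, and the ReLU on the kept activations are all assumptions borrowed from common top-k SAE formulations, and the activation scale recorded in meta.json is not modeled.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Generic top-k SAE forward pass (assumed formulation, not the repo's code).

    x: (n_tokens, d_model) residual-stream activations.
    Returns (z, x_hat): sparse codes and reconstructions.
    """
    # Encoder pre-activations, shape (n_tokens, n_features).
    pre = (x - b_dec) @ W_enc + b_enc
    # Column indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    # Keep only those k entries (clipped at zero); everything else stays 0.
    z = np.zeros_like(pre)
    z[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    # Linear decoder back to the residual stream.
    x_hat = z @ W_dec + b_dec
    return z, x_hat

# Toy sizes for illustration only; the released SAE has 12288 features and k = 32.
rng = np.random.default_rng(0)
n_tokens, d_model, n_features, k = 5, 16, 64, 4
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
b_dec = np.zeros(d_model)
x = rng.normal(size=(n_tokens, d_model))

z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
active_per_token = (z != 0).sum(axis=1)  # at most k per token
```

With the real weights, the nonzero columns of z for a token are the feature indices to look up in labels.json.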

Files

  • sae.safetensors — the SAE weights (W_enc, W_dec, b_enc, b_dec).
  • labels.json — per-feature auto-derived label + example top-activating tokens.
  • meta.json — layer, activation scale, base-model run, and SAE config.

The SAE reconstructs ~97% of the activation variance at L0=32, that is, with 32 active features per token. Reference code and training harness: https://github.com/harishsg993010/HobbyLM (hobbylm/sae.py, training/modal_sae.py). License: Apache-2.0.
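
For readers who want to check a reconstruction figure like this on their own activations, one common definition of the fraction of variance explained is sketched below. Whether the training harness uses exactly this definition is an assumption; the function name and the synthetic data are purely illustrative.

```python
import numpy as np

def fraction_variance_explained(x, x_hat):
    """1 - (residual sum of squares) / (total sum of squares about the per-dimension mean).

    x, x_hat: (n_tokens, d_model) original and reconstructed activations.
    """
    resid = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0, keepdims=True)) ** 2)
    return 1.0 - resid / total

# Synthetic stand-ins for real activations and SAE reconstructions.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 16))
x_hat = x + 0.1 * rng.normal(size=x.shape)  # a reconstruction with small error

fve = fraction_variance_explained(x, x_hat)
fve_perfect = fraction_variance_explained(x, x)
fve_mean_only = fraction_variance_explained(
    x, np.broadcast_to(x.mean(axis=0, keepdims=True), x.shape)
)
```

Under this definition a perfect reconstruction scores 1 and predicting the per-dimension mean scores 0, so ~97% means the residual energy is about 3% of the activations' variance.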
