SSMoELM-it-5L
Scratch Small MoE Language Model — Instruct, 5-Layer Shared — a compressed variant of SSMoELM-it that fits under Scratch's 10 MiB project size limit.
- 40.2M stored / 25.8M active parameters — 5 stored layers, 6 compute passes (layer 3 runs twice)
- 10.2 MB packed weights (1-bit routed experts, 4-bit attention, 3-bit embedding)
- Quality mostly recovered via post-surgery SFT (held-out ppl 50.0 vs 41.5 for the 6-layer original, down from 76.3 before SFT)
Note: The HuggingFace model card may display ~10M parameters and an "8-bit" quantization badge. Both are artifacts of reading the packed
model.safetensors directly. The actual model has 40.2M stored parameters quantized to 1/3/4-bit.
"Scratch" carries two meanings: built for Scratch, trained from scratch.
How it differs from SSMoELM-it
| | SSMoELM-it | SSMoELM-it-5L |
|---|---|---|
| Stored layers | 6 (0–5) | 5 (0,1,2,3,5 — layer 4 dropped) |
| Compute passes | 6 | 6 (layer 3 runs at depths 3 and 4, separate KV caches) |
| Embedding | 4-bit | 3-bit |
| Stored params | 47.0M | 40.2M |
| Active params / pass | 25.8M | 25.8M (unchanged) |
| Packed size | 12.1 MB | 10.2 MB |
| Scratch sb3 | 12.3 MB (over limit) | 9.6 MiB (uploadable) |
| Held-out ppl | 41.5 | 50.0 |
The stored-layer / pass mapping is recorded in the safetensors __metadata__
(layer_ids=[0,1,2,3,5], pass_order=[0,1,2,3,3,5]) and picked up automatically by inference.py.
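To inspect that mapping without loading the model, the safetensors header can be read with the standard library alone. The sketch below is illustrative and is not how inference.py does it: the helper names `read_safetensors_metadata`, `resolve_passes`, and `load_pass_plan` are hypothetical, and it assumes the two metadata values are stored as JSON-encoded strings (safetensors metadata values are strings).

```python
import json
import struct


def read_safetensors_metadata(path):
    """Return the __metadata__ dict from a safetensors file (stdlib only)."""
    with open(path, "rb") as f:
        # The file starts with an unsigned 64-bit little-endian header length,
        # followed by that many bytes of UTF-8 JSON.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header.get("__metadata__", {})


def resolve_passes(layer_ids, pass_order):
    """Map each compute depth to (stored layer id, index into stored layers)."""
    slot = {layer_id: i for i, layer_id in enumerate(layer_ids)}
    return [(layer_id, slot[layer_id]) for layer_id in pass_order]


def load_pass_plan(path):
    meta = read_safetensors_metadata(path)
    # Assumption: both values are JSON-encoded strings such as "[0,1,2,3,5]".
    layer_ids = json.loads(meta["layer_ids"])
    pass_order = json.loads(meta["pass_order"])
    return resolve_passes(layer_ids, pass_order)
```

With the values quoted above, `resolve_passes([0, 1, 2, 3, 5], [0, 1, 2, 3, 3, 5])` yields six entries, and depths 3 and 4 both point at stored layer 3. Each depth would still need its own KV cache, since the same weights see different inputs on the two passes.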
How this variant was made
- Ablation (no training): among 8 drop/repeat configurations, drop layer 4 + repeat layer 3 degraded least (ppl 76.3 with 3-bit embedding).
- Recovery SFT: same data as SSMoELM-it (Dolly-15k + oasst1 EN), lr 1e-5, ~60k cumulative steps with QAT (1/3/4-bit STE). Quality peaks there — longer training collapses generation, so the peak checkpoint was selected by generation quality, not loss.
Model Details
| Property | Value |
|---|---|
| Architecture | Decoder-only Transformer + Sparse MoE FFN, shared-layer |
| Stored params | 40.24M |
| Active params | 25.80M (per forward pass) |
| d_model | 768 |
| Stored layers / passes | 5 / 6 |
| Attention | GQA — 12 heads, kv_heads=3, head_dim=64 |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
| MoE | 8 routed experts + 1 shared expert, top-2 routing |
| d_ff (per expert) | 256 |
| Vocabulary | 8,192 (BPE, byte-fallback, English-optimized) |
| Context length | 2,048 tokens |
| Base model | SSMoELM-it (47M instruct) |
| Framework | MLX (training) / PyTorch (inference) |
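The stored and active parameter counts can be roughly reproduced from the table. The sketch below is a back-of-the-envelope estimate, not the authors' accounting: it assumes no bias terms, three weight matrices per SwiGLU expert (gate, up, down), two RMSNorms per layer plus one final norm, and a tied embedding counted once.

```python
D_MODEL = 768
VOCAB = 8192
N_HEADS, KV_HEADS, HEAD_DIM = 12, 3, 64
D_FF = 256
ROUTED, SHARED, TOP_K = 8, 1, 2
STORED_LAYERS, PASSES = 5, 6

embedding = VOCAB * D_MODEL  # tied with the output head, counted once
# Q and O use all 12 heads; K and V use the 3 KV heads (GQA).
attention = 2 * D_MODEL * N_HEADS * HEAD_DIM + 2 * D_MODEL * KV_HEADS * HEAD_DIM
expert = 3 * D_MODEL * D_FF   # assumed SwiGLU layout: gate, up, down
router = D_MODEL * ROUTED
norms = 2 * D_MODEL           # assumed: pre-attention and pre-FFN RMSNorm
final_norm = D_MODEL

# Stored: every expert of every stored layer.
stored_layer = attention + (ROUTED + SHARED) * expert + router + norms
stored = embedding + STORED_LAYERS * stored_layer + final_norm

# Active: top-2 routed experts plus the shared expert, once per compute pass.
active_layer = attention + (TOP_K + SHARED) * expert + router + norms
active = embedding + PASSES * active_layer + final_norm

print(f"stored ~ {stored / 1e6:.2f}M, active ~ {active / 1e6:.2f}M")
```

Under these assumptions the estimate lands within about 0.01M of both figures in the table. It also shows why the active count is unchanged from the 6-layer model: active parameters scale with the number of passes, which is still 6.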
Quantization Scheme
Same as SSMoELM-Base except the embedding:
| Component | Bits |
|---|---|
| Embedding (tied) | 3-bit (scale = absmax/4, values in [-4,3], nibble-packed) |
| Attention Q/K | 4-bit |
| Attention V/O | 4-bit (layers 0, 5) / 1-bit (layers 1–3) |
| Shared expert | 4-bit |
| Routed experts | 1-bit |
| Router / norms | bf16 |
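The embedding row can be illustrated with a small round trip. This is a minimal sketch under several assumptions that the table does not state: one scale per row, round-to-nearest with clamping, codes stored with a +4 offset, and the even-indexed code in the low nibble. The function names are hypothetical and the real packed layout may differ.

```python
def quantize_3bit(row):
    """Quantize floats to integer codes in [-4, 3] with scale = absmax / 4."""
    absmax = max(abs(x) for x in row)
    scale = absmax / 4 if absmax > 0 else 1.0  # avoid dividing by zero
    codes = [max(-4, min(3, round(x / scale))) for x in row]
    return scale, codes


def pack_nibbles(codes):
    """Pack two codes per byte; each code is offset by +4 into 0..7."""
    offs = [c + 4 for c in codes]
    if len(offs) % 2:
        offs.append(4)  # pad with the code for 0
    return bytes(offs[i] | (offs[i + 1] << 4) for i in range(0, len(offs), 2))


def unpack_nibbles(packed, n):
    """Inverse of pack_nibbles; n is the original number of codes."""
    codes = []
    for b in packed:
        codes.append((b & 0x0F) - 4)  # low nibble first (assumed order)
        codes.append((b >> 4) - 4)
    return codes[:n]


def dequantize(scale, codes):
    return [c * scale for c in codes]


scale, codes = quantize_3bit([1.0, -1.0, 0.25, -0.5, 0.0])
restored = dequantize(scale, unpack_nibbles(pack_nibbles(codes), len(codes)))
```

Because the range is [-4, 3], a value equal to +absmax clamps to code 3 and comes back as 0.75 × absmax, while -absmax is represented exactly. Nibble packing also means each 3-bit code still occupies 4 bits in the packed bytes.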
Quality
Held-out perplexity (FineWeb sample, ctx 256) and greedy chat behavior:
| Model | ppl | Greedy chat |
|---|---|---|
| SSMoELM-it (6-layer) | 41.5 | normal |
| 5L surgery, no training | 76.3 | broken |
| SSMoELM-it-5L (this repo) | 50.0 | recovered |
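For reference, perplexity here is taken to mean the usual exponential of the mean per-token negative log-likelihood. The sketch below shows that definition plus a small helper for reading the table; `perplexity` and `gap_recovered` are hypothetical names, and the evaluation script itself is not part of this card.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural-log probabilities)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


def gap_recovered(before, after, reference):
    """Fraction of the perplexity gap to the reference that was closed."""
    return (before - after) / (before - reference)


# Figures from the table: surgery only, after SFT, 6-layer original.
recovered = gap_recovered(76.3, 50.0, 41.5)
```

With the table's numbers, `recovered` is about 0.76, so the recovery SFT closes roughly three quarters of the perplexity gap opened by the layer surgery.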
Example (greedy, repetition_penalty 1.3):
You: Hello! How are you? Assistant: I am a language model. I can help with any other questions or concerns that may be helpful in the future.
Downstream benchmarks (HellaSwag etc.) were not re-run for this variant; expect scores at or slightly below SSMoELM-it.
Scratch / TurboWarp
Tokenizer
Same as SSMoELM-Base/it: BPE, vocab 8,192, byte fallback, ASCII-optimized.
Special Tokens
| Token | ID | Role |
|---|---|---|
| `<bos>` | 0 | sequence start |
| `<eos>` | 1 | end of sequence |
| `<pad>` | 2 | padding |
| `<\|system\|>` | 3 | system turn |
| `<\|user\|>` | 4 | user turn |
| `<\|assistant\|>` | 5 | assistant turn |
| `<\|eot\|>` | 6 | end of turn |
Chat Template
<bos><|user|>
{user}<|eot|>
<|assistant|>
{response}<|eot|><eos>
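As a plain-string illustration of the template above, a single turn can be rendered as follows. `render_turn` is a hypothetical helper, not part of inference.py (which provides `build_chat_prompt`). It assumes the line breaks shown in the template are literal newlines and that a generation prompt stops right after the assistant tag.

```python
def render_turn(user, response=None):
    """Render one chat turn; omit response to get a generation prompt."""
    prompt = f"<bos><|user|>\n{user}<|eot|>\n<|assistant|>\n"
    if response is None:
        return prompt  # the model continues from here
    return prompt + f"{response}<|eot|><eos>"


example = render_turn("Hello! How are you?", "I am a language model.")
```

When tokenizing such a string, the special tokens need to map to their single IDs (0 to 6) rather than being split into ordinary BPE pieces.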
Usage
Download inference.py and tokenizer.json from this repo. Requires: torch, safetensors, tokenizers.
pip install torch safetensors tokenizers
CLI (interactive chat):
python inference.py --ckpt model.safetensors
Recommended decoding defaults (same as SSMoELM-it):
| Parameter | Value |
|---|---|
| temperature | 0.0 |
| top_k | 1 |
| top_p | 0.9 |
| repetition_penalty | 1.3 |
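With top_k set to 1, only one candidate survives filtering, so these defaults amount to greedy decoding with a repetition penalty. The sketch below uses the commonly seen penalty formulation (divide positive logits of already-generated tokens, multiply negative ones). Whether inference.py uses exactly this rule is an assumption, and the function names are hypothetical.

```python
def apply_repetition_penalty(logits, seen_ids, penalty):
    """Make already-generated token ids less likely (common formulation)."""
    out = list(logits)
    for i in set(seen_ids):
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def greedy_next(logits, seen_ids, penalty=1.3):
    """Pick the argmax after the penalty (what top_k = 1 reduces to)."""
    adjusted = apply_repetition_penalty(logits, seen_ids, penalty)
    return max(range(len(adjusted)), key=adjusted.__getitem__)


first = greedy_next([2.0, 1.9, -1.0], seen_ids=[])
second = greedy_next([2.0, 1.9, -1.0], seen_ids=[0])
```

In this toy example token 0 wins when nothing has been generated yet, but once it has been emitted its logit drops from 2.0 to about 1.54 and token 1 takes over.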
Python:
from inference import load_packed_model, build_chat_prompt
from tokenizers import Tokenizer

model = load_packed_model("model.safetensors")  # layer sharing auto-detected
tok = Tokenizer.from_file("tokenizer.json")

# Build the prompt token ids from the chat template (empty history = first turn)
ids = build_chat_prompt(tok, history=[], user_input="What is photosynthesis?")

# Greedy decoding with the recommended repetition penalty
out = model.generate(
    ids,
    max_new_tokens=200,
    temperature=0.0,
    top_k=1,
    repetition_penalty=1.3,
)
print(tok.decode(out))
Memory: Weights stay in packed uint8 format (10.2 MB). Peak RAM ~16 MB during inference.
License
Apache 2.0