SSMoELM-it-5L

Scratch Small MoE Language Model — Instruct, 5-Layer Shared — a compressed variant of SSMoELM-it that fits under Scratch's 10 MiB project size limit.

40.2M stored / 25.8M active parameters — 5 stored layers, 6 compute passes (layer 3 runs twice)
10.2 MB packed weights (1-bit routed experts, 4-bit attention, 3-bit embedding)
Quality partly recovered via post-surgery SFT (held-out ppl 76.3 → 50.0, vs 41.5 for the 6-layer original)

Note: The HuggingFace model card may display ~10M parameters and an "8-bit" quantization badge. Both are artifacts of reading the packed model.safetensors directly. The actual model has 40.2M stored parameters quantized to 1/3/4-bit.

"Scratch" carries two meanings: built for Scratch, trained from scratch.

How it differs from SSMoELM-it

	SSMoELM-it	SSMoELM-it-5L
Stored layers	6 (0–5)	5 (0,1,2,3,5 — layer 4 dropped)
Compute passes	6	6 (layer 3 runs at depths 3 and 4, separate KV caches)
Embedding	4-bit	3-bit
Stored params	47.0M	40.2M
Active params / pass	25.8M	25.8M (unchanged)
Packed size	12.1 MB	10.2 MB
Scratch sb3	12.3 MB (over limit)	9.6 MiB (uploadable)
Held-out ppl	41.5	50.0

The stored-layer / pass mapping is recorded in the safetensors __metadata__ (layer_ids=[0,1,2,3,5], pass_order=[0,1,2,3,3,5]) and picked up automatically by inference.py.
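As a rough illustration of how a loader might pick up this mapping, the sketch below reads the `__metadata__` block from a safetensors header (an 8-byte little-endian length followed by a JSON header) and expands it into a per-pass plan. It assumes the two values are stored as JSON-encoded strings, since safetensors metadata values are strings. `read_metadata` and `build_pass_plan` are hypothetical helpers, not functions from inference.py.

```python
import json
import struct

def read_metadata(path):
    """Return the __metadata__ dict from a safetensors file header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len))
    return header.get("__metadata__", {})

def build_pass_plan(meta):
    """Expand layer_ids / pass_order into (depth, layer_id, weight_slot) tuples.

    Assumption: both values are JSON-encoded lists stored as strings.
    """
    layer_ids = json.loads(meta["layer_ids"])
    pass_order = json.loads(meta["pass_order"])
    slot_of = {layer_id: slot for slot, layer_id in enumerate(layer_ids)}
    # One entry per compute pass. A KV cache indexed by depth (not by stored
    # layer) keeps the two passes through layer 3 from sharing a cache.
    return [(depth, lid, slot_of[lid]) for depth, lid in enumerate(pass_order)]

# Values quoted in the text above, encoded the way this sketch assumes.
meta = {"layer_ids": "[0,1,2,3,5]", "pass_order": "[0,1,2,3,3,5]"}
plan = build_pass_plan(meta)
for depth, layer_id, slot in plan:
    print(f"depth {depth}: stored layer {layer_id} (weight slot {slot})")
```

With the real file, `build_pass_plan(read_metadata("model.safetensors"))` would yield the same plan if the encoding assumption holds.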

How this variant was made

Ablation (no training): among 8 drop/repeat configurations, drop layer 4 + repeat layer 3 degraded least (ppl 76.3 with 3-bit embedding).
Recovery SFT: same data as SSMoELM-it (Dolly-15k + oasst1 EN), lr 1e-5, ~60k cumulative steps with QAT (1/3/4-bit STE). Quality peaks there — longer training collapses generation, so the peak checkpoint was selected by generation quality, not loss.

Model Details


Architecture	Decoder-only Transformer + Sparse MoE FFN, shared-layer
Stored params	40.24M
Active params	25.80M (per forward pass)
d_model	768
Stored layers / passes	5 / 6
Attention	GQA — 12 heads, kv_heads=3, head_dim=64
Positional encoding	RoPE
Normalization	RMSNorm
Activation	SwiGLU
MoE	8 routed experts + 1 shared expert, top-2 routing
d_ff (per expert)	256
Vocabulary	8,192 (BPE, byte-fallback, English-optimized)
Context length	2,048 tokens
Base model	SSMoELM-it (47M instruct)
Framework	MLX (training) / PyTorch (inference)
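The stored and active parameter counts can be roughly reproduced from the table above. The sketch below is a back-of-the-envelope check, not the author's accounting: it assumes no bias terms, three weight matrices per SwiGLU expert (gate, up, down), two RMSNorms per layer plus one final norm, and a tied embedding counted once. Under those assumptions it agrees with the reported figures to one decimal place, and it shows why active parameters are unchanged: layer 3's weights are counted once in storage but touched in two of the six passes.

```python
# Hyperparameters from the Model Details table.
D_MODEL, N_HEADS, KV_HEADS, HEAD_DIM = 768, 12, 3, 64
D_FF, N_ROUTED, N_SHARED, TOP_K = 256, 8, 1, 2
VOCAB, STORED_LAYERS, PASSES = 8192, 5, 6

# Assumptions (not stated in the card): no biases, 3 matrices per expert,
# 2 RMSNorms per layer + 1 final norm, tied embedding counted once.
attn = (2 * D_MODEL * N_HEADS * HEAD_DIM      # Q and O projections
        + 2 * D_MODEL * KV_HEADS * HEAD_DIM)  # K and V projections (GQA)
expert = 3 * D_MODEL * D_FF
router = D_MODEL * N_ROUTED
norms = 2 * D_MODEL
embedding = VOCAB * D_MODEL
final_norm = D_MODEL

stored_layer = attn + (N_ROUTED + N_SHARED) * expert + router + norms
active_layer = attn + (TOP_K + N_SHARED) * expert + router + norms

stored = STORED_LAYERS * stored_layer + embedding + final_norm
active = PASSES * active_layer + embedding + final_norm
six_layer = 6 * stored_layer + embedding + final_norm

print(f"stored (5 layers): {stored / 1e6:.1f}M")
print(f"active (6 passes): {active / 1e6:.1f}M")
print(f"6-layer original:  {six_layer / 1e6:.1f}M")
```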

Quantization Scheme

Same as SSMoELM-Base except the embedding:

Component	Bits
Embedding (tied)	3-bit (scale = absmax/4, values in [-4,3], nibble-packed)
Attention Q/K	4-bit
Attention V/O	4-bit (layers 0, 5) / 1-bit (layers 1–3)
Shared expert	4-bit
Routed experts	1-bit
Router / norms	bf16
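To make the embedding row concrete, here is a minimal sketch of 3-bit quantization with scale = absmax/4 and codes clamped to [-4, 3], followed by packing two codes per byte. The scale granularity (one scale per list here), the +4 offset, and the low-nibble-first byte order are assumptions for illustration; the actual packed layout in model.safetensors may differ. All function names are hypothetical.

```python
def quantize_3bit(values):
    """Quantize floats to integer codes in [-4, 3] with scale = absmax / 4."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 4 if absmax > 0 else 1.0
    codes = [max(-4, min(3, round(v / scale))) for v in values]
    return codes, scale

def pack_nibbles(codes):
    """Pack two codes per byte (assumed: +4 offset, low nibble first)."""
    padded = list(codes) + ([0] if len(codes) % 2 else [])
    return bytes((padded[i] + 4) | ((padded[i + 1] + 4) << 4)
                 for i in range(0, len(padded), 2))

def unpack_nibbles(packed, n):
    """Inverse of pack_nibbles; n is the original number of codes."""
    codes = []
    for byte in packed:
        codes.append((byte & 0x0F) - 4)
        codes.append((byte >> 4) - 4)
    return codes[:n]

def dequantize(codes, scale):
    return [c * scale for c in codes]

row = [0.8, -0.4, 0.25, -0.8]
codes, scale = quantize_3bit(row)
packed = pack_nibbles(codes)
restored = dequantize(unpack_nibbles(packed, len(row)), scale)
print(codes, scale, len(packed), restored)
```

Because the code range is asymmetric, the largest positive value saturates at code 3 while the largest negative value maps to -4 exactly.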

Quality

Held-out perplexity (FineWeb sample, ctx 256) and greedy chat behavior:

Model	ppl	Greedy chat
SSMoELM-it (6-layer)	41.5	normal
5L surgery, no training	76.3	broken
SSMoELM-it-5L (this repo)	50.0	recovered

Example (greedy, repetition_penalty 1.3):

You: Hello! How are you?
Assistant: I am a language model. I can help with any other questions or concerns that may be helpful in the future.

Downstream benchmarks (HellaSwag etc.) were not re-run for this variant; expect scores at or slightly below SSMoELM-it.

Scratch / TurboWarp

Tokenizer

Same as SSMoELM-Base/it: BPE, vocab 8,192, byte fallback, ASCII-optimized.

Special Tokens

Token	ID	Role
`<bos>`	0	sequence start
`<eos>`	1	end of sequence
`<pad>`	2	padding
`<\|system\|>`	3	system turn
`<\|user\|>`	4	user turn
`<\|assistant\|>`	5	assistant turn
`<\|eot\|>`	6	end of turn

Chat Template

<bos><|user|>
{user}<|eot|>
<|assistant|>
{response}<|eot|><eos>
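The template above can be rendered as plain strings, as in the sketch below. `render_prompt` and `render_example` are hypothetical helpers written for illustration; the repo's own `build_chat_prompt` in inference.py is the supported way to build prompts. The sketch assumes the line breaks in the template are literal newline characters and that the tokenizer maps the special-token strings to their IDs.

```python
def render_prompt(user):
    """Inference-time prompt: stops after the assistant tag so the model continues."""
    return f"<bos><|user|>\n{user}<|eot|>\n<|assistant|>\n"

def render_example(user, response):
    """Full single-turn sample, including the closing <|eot|><eos>."""
    return render_prompt(user) + f"{response}<|eot|><eos>"

print(render_prompt("What is photosynthesis?"))
print(render_example("Hi", "Hello!"))
```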

Usage

Download inference.py, tokenizer.json, and model.safetensors from this repo. Requires: torch, safetensors, tokenizers.

pip install torch safetensors tokenizers

CLI (interactive chat):

python inference.py --ckpt model.safetensors

Recommended decoding defaults (same as SSMoELM-it):

Parameter	Value
`temperature`	`0.0`
`top_k`	`1`
`top_p`	`0.9`
`repetition_penalty`	`1.3`
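With `temperature` 0.0 and `top_k` 1, decoding reduces to greedy argmax, so `top_p` presumably has nothing left to filter and `repetition_penalty` is the only setting that can change the chosen token. The sketch below shows that interaction using a common form of repetition penalty (divide positive logits by the penalty, multiply negative ones). This is an assumption; inference.py may implement the penalty differently. The function names are hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    """Lower the score of every token that has already been generated."""
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def greedy_next(logits, generated_ids, penalty=1.3):
    """Greedy choice (temperature 0.0 / top_k 1) after the penalty is applied."""
    adjusted = apply_repetition_penalty(logits, generated_ids, penalty)
    return max(range(len(adjusted)), key=adjusted.__getitem__)

logits = [2.0, 1.8, -1.0]
first = greedy_next(logits, generated_ids=[])    # plain argmax
second = greedy_next(logits, generated_ids=[0])  # token 0 already emitted
print(first, second)
```

In this toy case the penalty pushes token 0 below token 1 on the second step, which is how the setting discourages loops under greedy decoding.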

Python API:

from inference import load_packed_model, build_chat_prompt
from tokenizers import Tokenizer

model = load_packed_model("model.safetensors")   # layer sharing auto-detected
tok   = Tokenizer.from_file("tokenizer.json")

ids = build_chat_prompt(tok, history=[], user_input="What is photosynthesis?")
out = model.generate(
    ids,
    max_new_tokens=200,
    temperature=0.0,
    top_k=1,
    repetition_penalty=1.3,
)
print(tok.decode(out))

Memory: Weights stay in packed uint8 format (10.2 MB). Peak RAM ~16 MB during inference.

License

Apache 2.0
