SSMoELM-it-5L

Scratch Small MoE Language Model — Instruct, 5-Layer Shared — a compressed variant of SSMoELM-it that fits under Scratch's 10 MiB project size limit.

  • 40.2M stored / 25.8M active parameters — 5 stored layers, 6 compute passes (layer 3 runs twice)
  • 10.2 MB packed weights (1-bit routed experts, 4-bit attention, 3-bit embedding)
  • Quality recovered via post-surgery SFT (held-out ppl 50.0 vs 41.5 for the 6-layer original)

Note: The HuggingFace model card may display ~10M parameters and an "8-bit" quantization badge. Both are artifacts of reading the packed model.safetensors directly. The actual model has 40.2M stored parameters quantized to 1/3/4-bit.

"Scratch" carries two meanings: built for Scratch, trained from scratch.


How it differs from SSMoELM-it

Property              SSMoELM-it            SSMoELM-it-5L
Stored layers         6 (0–5)               5 (0, 1, 2, 3, 5; layer 4 dropped)
Compute passes        6                     6 (layer 3 runs at depths 3 and 4, separate KV caches)
Embedding             4-bit                 3-bit
Stored params         47.0M                 40.2M
Active params / pass  25.8M                 25.8M (unchanged)
Packed size           12.1 MB               10.2 MB
Scratch sb3           12.3 MB (over limit)  9.6 MiB (uploadable)
Held-out ppl          41.5                  50.0

The stored-layer / pass mapping is recorded in the safetensors __metadata__ (layer_ids=[0,1,2,3,5], pass_order=[0,1,2,3,3,5]) and picked up automatically by inference.py.
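
As a rough illustration of how such a mapping can be resolved, the sketch below turns the two metadata fields into a per-pass list of stored-layer slots and allocates one KV cache per pass. It assumes the values are stored as JSON-encoded strings (safetensors metadata is string-valued). The helper `resolve_passes` and the cache layout are hypothetical and are not taken from inference.py.

```python
import json

def resolve_passes(metadata):
    """Map each compute pass to the index of the stored layer it reuses."""
    layer_ids = json.loads(metadata["layer_ids"])    # original layer numbers kept on disk
    pass_order = json.loads(metadata["pass_order"])  # original layer number run at each depth
    slot_of = {layer_id: slot for slot, layer_id in enumerate(layer_ids)}
    return [slot_of[layer_id] for layer_id in pass_order]

# Values quoted from the text above, assumed here to be JSON strings.
metadata = {"layer_ids": "[0,1,2,3,5]", "pass_order": "[0,1,2,3,3,5]"}
slots = resolve_passes(metadata)

# One KV cache per pass, not per stored layer: the two passes through
# layer 3 share weights but see different inputs, so they cannot share a cache.
kv_caches = [{"k": [], "v": []} for _ in slots]

print(slots)           # [0, 1, 2, 3, 3, 4]
print(len(kv_caches))  # 6
```

Stored slot 3 appears twice in the result, which is the "layer 3 runs twice" behavior: five weight sets, six passes, six caches.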

How this variant was made

  1. Ablation (no training): of 8 drop/repeat configurations, dropping layer 4 and repeating layer 3 degraded perplexity least (ppl 76.3 with the 3-bit embedding).
  2. Recovery SFT: same data as SSMoELM-it (Dolly-15k + oasst1 EN), lr 1e-5, ~60k cumulative steps with QAT (1/3/4-bit STE). Quality peaks around that point; longer training collapses generation, so the checkpoint was selected by generation quality rather than by loss.

Model Details

Architecture            Decoder-only Transformer + Sparse MoE FFN, shared-layer
Stored params           40.24M
Active params           25.80M (per forward pass)
d_model                 768
Stored layers / passes  5 / 6
Attention               GQA: 12 heads, kv_heads=3, head_dim=64
Positional encoding     RoPE
Normalization           RMSNorm
Activation              SwiGLU
MoE                     8 routed experts + 1 shared expert, top-2 routing
d_ff (per expert)       256
Vocabulary              8,192 (BPE, byte-fallback, English-optimized)
Context length          2,048 tokens
Base model              SSMoELM-it (47M instruct)
Framework               MLX (training) / PyTorch (inference)
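
The stored and active parameter counts can be approximately reproduced from the dimensions above. The sketch below is a back-of-the-envelope count, not the repo's own accounting: it assumes bias-free linear layers, three matrices per SwiGLU expert (gate, up, down), two RMSNorms per layer plus a final norm, and a tied embedding counted once.

```python
D_MODEL, HEAD_DIM = 768, 64
N_HEADS, N_KV_HEADS = 12, 3
N_ROUTED, N_SHARED, TOP_K = 8, 1, 2
D_FF, VOCAB = 256, 8192
STORED_LAYERS, PASSES = 5, 6

# Attention: Q and O span all heads; K and V span only the KV heads (GQA).
attn = 2 * D_MODEL * N_HEADS * HEAD_DIM + 2 * D_MODEL * N_KV_HEADS * HEAD_DIM
expert = 3 * D_MODEL * D_FF      # SwiGLU: gate, up, down (assumed bias-free)
router = D_MODEL * N_ROUTED
norms = 2 * D_MODEL              # assumed: two RMSNorms per layer
embedding = VOCAB * D_MODEL      # tied with the output head, counted once
final_norm = D_MODEL             # assumed

# Stored: every expert of every stored layer lives on disk.
stored_layer = attn + (N_ROUTED + N_SHARED) * expert + router + norms
# Active: each pass touches only the top-2 routed experts plus the shared one.
active_layer = attn + (TOP_K + N_SHARED) * expert + router + norms

stored = STORED_LAYERS * stored_layer + embedding + final_norm
active = PASSES * active_layer + embedding + final_norm

print(f"stored: {stored / 1e6:.2f}M")
print(f"active: {active / 1e6:.2f}M")
```

Under these assumptions both totals land within about 0.01M of the 40.24M and 25.80M listed above, and replacing STORED_LAYERS with 6 gives roughly the 47.0M of SSMoELM-it. Stored parameters scale with the 5 stored layers, while active parameters scale with the 6 passes, which is why the active count is unchanged from the 6-layer model.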

Quantization Scheme

Same as SSMoELM-Base except the embedding:

Component         Bits
Embedding (tied)  3-bit (scale = absmax/4, values in [-4, 3], nibble-packed)
Attention Q/K     4-bit
Attention V/O     4-bit (layers 0, 5) / 1-bit (layers 1–3)
Shared expert     4-bit
Routed experts    1-bit
Router / norms    bf16
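
A minimal sketch of what the 3-bit embedding row in the table describes: scale = absmax/4, integer codes clamped to [-4, 3], two codes per byte. Several details are assumptions not stated in this card: a per-row scale, an offset-by-4 encoding inside each nibble, and low-nibble-first ordering. The function names are hypothetical, and the packed format actually read by inference.py may differ.

```python
def quantize_3bit(row):
    """Quantize floats to integer codes in [-4, 3] with scale = absmax / 4."""
    absmax = max(abs(x) for x in row)
    scale = absmax / 4 if absmax > 0 else 1.0
    codes = [max(-4, min(3, round(x / scale))) for x in row]
    return codes, scale

def pack_nibbles(codes):
    """Store two codes per byte, one per 4-bit nibble (low nibble first, assumed)."""
    padded = codes + [0] * (len(codes) % 2)  # pad odd-length rows with a zero code
    return bytes((padded[i] + 4) | ((padded[i + 1] + 4) << 4)
                 for i in range(0, len(padded), 2))

def unpack_nibbles(data, count):
    """Inverse of pack_nibbles; count trims the padding code if one was added."""
    codes = []
    for byte in data:
        codes.append((byte & 0x0F) - 4)
        codes.append((byte >> 4) - 4)
    return codes[:count]

row = [1.0, -2.0, 0.5, 4.0, -4.0]
codes, scale = quantize_3bit(row)
packed = pack_nibbles(codes)
restored = [c * scale for c in unpack_nibbles(packed, len(codes))]
```

Because the code range is asymmetric, the largest positive value in a row clips to 3 × scale while the largest negative value is represented exactly. Nibble packing spends 4 bits of storage per 3-bit code, trading a little size for simple byte-aligned unpacking.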

Quality

Held-out perplexity (FineWeb sample, ctx 256) and greedy chat behavior:

Model                      ppl   Greedy chat
SSMoELM-it (6-layer)       41.5  normal
5L surgery, no training    76.3  broken
SSMoELM-it-5L (this repo)  50.0  recovered
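
For readers comparing these numbers, the sketch below uses the standard definition of perplexity as the exponential of the mean per-token negative log-likelihood. The evaluation script behind the table is not shown in this card, so this is the textbook formula rather than the repo's code.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that assigns probability 1/4 to every token has perplexity 4.
uniform = [math.log(1 / 4)] * 3
ppl_uniform = perplexity(uniform)

# The 41.5 -> 50.0 gap from the table, expressed as extra loss per token.
gap_nats = math.log(50.0 / 41.5)
print(f"{ppl_uniform:.3f}, {gap_nats:.3f} nats/token")
```

Read this way, the recovered 5-layer model pays a little under 0.19 nats per token relative to the 6-layer original.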

Example (greedy, repetition_penalty 1.3):

You: Hello! How are you?
Assistant: I am a language model. I can help with any other questions or concerns that may be helpful in the future.

Downstream benchmarks (HellaSwag etc.) were not re-run for this variant; expect scores at or slightly below SSMoELM-it.


Scratch / TurboWarp


Tokenizer

Same as SSMoELM-Base/it: BPE, vocab 8,192, byte fallback, ASCII-optimized.

Special Tokens

Token          ID  Role
<bos>          0   sequence start
<eos>          1   end of sequence
<pad>          2   padding
<|system|>     3   system turn
<|user|>       4   user turn
<|assistant|>  5   assistant turn
<|eot|>        6   end of turn

Chat Template

<bos><|user|>
{user}<|eot|>
<|assistant|>
{response}<|eot|><eos>
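
The template above can be assembled as a plain string, as in the sketch below. The helper `format_turn` is hypothetical and is not the repo's build_chat_prompt; it assumes the line breaks shown in the template are literal newlines and that the tokenizer maps the special-token strings to their IDs.

```python
BOS, EOS = "<bos>", "<eos>"
USER, ASSISTANT, EOT = "<|user|>", "<|assistant|>", "<|eot|>"

def format_turn(user, response=None):
    """Render one user turn; append the response only for a completed exchange."""
    prompt = f"{BOS}{USER}\n{user}{EOT}\n{ASSISTANT}\n"
    if response is None:
        # Generation prompt: stop right after the assistant header.
        return prompt
    return f"{prompt}{response}{EOT}{EOS}"

print(format_turn("What is photosynthesis?"))
```

Leaving `response` unset yields the prompt for generation, and a natural stopping condition is then the first `<|eot|>` or `<eos>` the model emits.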

Usage

Download inference.py, tokenizer.json, and model.safetensors from this repo. Requirements: torch, safetensors, tokenizers.

pip install torch safetensors tokenizers

CLI (interactive chat):

python inference.py --ckpt model.safetensors

Recommended decoding defaults (same as SSMoELM-it):

Parameter           Value
temperature         0.0
top_k               1
top_p               0.9
repetition_penalty  1.3

Python:

from inference import load_packed_model, build_chat_prompt
from tokenizers import Tokenizer

model = load_packed_model("model.safetensors")   # layer sharing auto-detected
tok   = Tokenizer.from_file("tokenizer.json")

ids = build_chat_prompt(tok, history=[], user_input="What is photosynthesis?")
out = model.generate(
    ids,
    max_new_tokens=200,
    temperature=0.0,
    top_k=1,
    repetition_penalty=1.3,
)
print(tok.decode(out))
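
To show what the greedy settings used above amount to, the sketch below applies a repetition penalty and then takes the argmax. With top_k 1 only a single candidate survives, so top_p has no further effect. The penalty rule shown (divide positive logits by the penalty, multiply negative ones) is one common formulation and is an assumption here; the function names are hypothetical, and inference.py may implement the penalty differently.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    """Make already-generated tokens less likely (assumed penalty rule)."""
    out = list(logits)
    for token_id in set(generated_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty
        else:
            out[token_id] *= penalty
    return out

def greedy_next(logits, generated_ids, penalty=1.3):
    """Pick the highest-scoring token after penalizing repeats."""
    penalized = apply_repetition_penalty(logits, generated_ids, penalty)
    return max(range(len(penalized)), key=penalized.__getitem__)

logits = [2.0, 1.9, -1.0]
first = greedy_next(logits, generated_ids=[])    # nothing to penalize yet
second = greedy_next(logits, generated_ids=[0])  # token 0 was already emitted
print(first, second)
```

In this toy case the penalty lowers token 0 from 2.0 to about 1.54, so the second step picks token 1 instead of repeating token 0.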

Memory: Weights stay in packed uint8 format (10.2 MB). Peak RAM ~16 MB during inference.


License

Apache 2.0
