SSMoELM-Base
Scratch Small MoE Language Model — a 1-bit sparse MoE language model designed to run inside Scratch / TurboWarp.
- 47M total / 25.8M active parameters (top-2 sparse routing)
- 12.1 MB packed weights (1-bit routed experts, 4-bit attention & embedding)
Note: The HuggingFace model card may display ~12M parameters and an "8-bit" quantization badge. Both are artifacts of reading the packed model.safetensors directly: weights are stored as uint8 arrays (bit-packed 1-bit or nibble-packed 4-bit values), which HF counts as fewer elements and infers as 8-bit storage. The actual model has 47M parameters quantized to 1-bit and 4-bit.
- Trained from scratch on English web text for 900M tokens
"Scratch" carries two meanings: built for Scratch, trained from scratch.
Model Details
| Property | Value |
|---|---|
| Architecture | Decoder-only Transformer + Sparse MoE FFN |
| Total params | 47.04M |
| Active params | 25.80M (per forward pass) |
| d_model | 768 |
| Layers | 6 |
| Attention | GQA — 12 heads, kv_heads=3, head_dim=64 |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
| MoE | 8 routed experts + 1 shared expert, top-2 routing |
| d_ff (per expert) | 256 |
| Vocabulary | 8,192 (BPE, byte-fallback, English-optimized) |
| Context length | 2,048 tokens |
| Training tokens | 900M |
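
The total and active parameter counts can be cross-checked against the other rows of the table. The following is a minimal sketch, not the author's accounting: it assumes no bias terms, two RMSNorm weight vectors per layer plus one final norm, and an output head tied to the embedding. All variable names are illustrative.

```python
# Back-of-the-envelope parameter count from the Model Details table.
# Assumptions (not stated in the card): no biases, two RMSNorms per layer
# plus one final norm, and an lm_head tied to the embedding.
d_model, n_layers, vocab = 768, 6, 8192
n_heads, n_kv_heads, head_dim = 12, 3, 64
n_routed, n_shared, top_k, d_ff = 8, 1, 2, 256

attn = d_model * n_heads * head_dim * 2        # Q and O projections
attn += d_model * n_kv_heads * head_dim * 2    # K and V projections (GQA)
expert = 3 * d_model * d_ff                    # SwiGLU: gate, up, down
router = d_model * n_routed                    # one logit per routed expert
norms = 2 * d_model                            # two RMSNorm vectors per layer

embedding = vocab * d_model
final_norm = d_model


def total_params(experts_per_layer):
    """Parameter count when `experts_per_layer` experts are counted per layer."""
    per_layer = attn + experts_per_layer * expert + router + norms
    return embedding + n_layers * per_layer + final_norm


total = total_params(n_routed + n_shared)   # all 9 experts
active = total_params(top_k + n_shared)     # top-2 routed + shared
print(f"total  = {total / 1e6:.2f}M")
print(f"active = {active / 1e6:.2f}M")
```

Under these assumptions the script reproduces the 47.04M total and 25.80M active figures listed in the table.
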
Quantization Scheme
QAT (BitNet b1.58 style): the forward pass uses quantized values, while gradients flow through to the bf16 master weights via a straight-through estimator (STE).
| Component | Bits | Notes |
|---|---|---|
| Routed expert gate/up/down | 1-bit | sparse, 2/8 active per token |
| Shared expert gate/up/down | 4-bit | runs on every token |
| Attention Q, K (all layers) | 4-bit | precision-sensitive |
| Attention V, O (layers 0, 5) | 4-bit | boundary layers |
| Attention V, O (layers 1–4) | 1-bit | inner layers |
| Embedding | 4-bit | tied with lm_head |
| Router, RMSNorm | bf16 | not quantized |
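
The bit widths in this table imply a packed payload size that can be estimated by hand. The sketch below is an approximation under stated assumptions: tensor shapes are derived from d_model=768, d_ff=256, 12 query heads, 3 KV heads and head_dim=64, and the per-row fp16 scales, RMSNorm weights and metadata tensors are left out.

```python
# Rough packed-payload size implied by the bit-width table.
# Assumptions: shapes derived from the Model Details table; per-row scales,
# RMSNorm weights and metadata tensors are ignored.
d_model, d_ff, n_layers, vocab = 768, 256, 6, 8192
q_or_o = d_model * 12 * 64      # one Q or O projection
k_or_v = d_model * 3 * 64       # one K or V projection
expert = 3 * d_model * d_ff     # gate + up + down of one expert

one_bit = n_layers * 8 * expert            # routed experts, all layers
one_bit += 4 * (k_or_v + q_or_o)           # V and O in layers 1-4

four_bit = n_layers * expert               # shared expert, all layers
four_bit += n_layers * (q_or_o + k_or_v)   # Q and K in all layers
four_bit += 2 * (k_or_v + q_or_o)          # V and O in layers 0 and 5
four_bit += vocab * d_model                # tied embedding

router_params = n_layers * d_model * 8     # kept in bf16 (2 bytes each)

uint8_elements = one_bit // 8 + four_bit // 2
payload_bytes = uint8_elements + 2 * router_params
print(f"uint8 elements: {uint8_elements / 1e6:.2f}M")
print(f"payload:        {payload_bytes / 1e6:.2f} MB")
```

The estimate comes to roughly 11.8M uint8 elements, which is consistent with the ~12M "parameters" that a tool counting raw array elements would report, and leaves a few hundred kilobytes for scales and metadata within the stated file size.
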
Packed format per tensor:
- 1-bit: `{key}__scale` (fp16, per-row) + `{key}__bin` (uint8 packbits)
- 4-bit: `{key}__scale` (fp16, per-row) + `{key}__int4` (uint8 nibble-packed)
- Metadata: `{key}__shape` (int32) + `{key}__cols` (int32)
Keys use / as separator in safetensors (replace with . for model loading).
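
To make the packed layout concrete, here is a small pure-Python sketch of bit packing and nibble packing. It is illustrative only and may differ from what inference.py does: the bit order (most significant bit first, as in NumPy's default `packbits`), the nibble order (low nibble first), the mapping of +1 to bit 1, and the example key name are all assumptions.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, most significant bit first."""
    padded = bits + [0] * (-len(bits) % 8)   # pad to a whole number of bytes
    out = bytearray()
    for i in range(0, len(padded), 8):
        byte = 0
        for b in padded[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)


def unpack_bits(data, count):
    """Inverse of pack_bits; `count` strips the padding bits."""
    bits = [(byte >> (7 - j)) & 1 for byte in data for j in range(8)]
    return bits[:count]


def pack_nibbles(values):
    """Pack unsigned 4-bit values (0..15), two per byte, low nibble first."""
    padded = values + [0] * (len(values) % 2)
    return bytes(lo | (hi << 4) for lo, hi in zip(padded[0::2], padded[1::2]))


def unpack_nibbles(data, count):
    """Inverse of pack_nibbles; `count` strips a trailing padding nibble."""
    values = []
    for byte in data:
        values += [byte & 0x0F, byte >> 4]
    return values[:count]


def to_model_key(safetensors_key):
    """Convert a '/'-separated file key to a '.'-separated model key."""
    return safetensors_key.replace("/", ".")


# Round trip for five 1-bit weights (+1 stored as 1, -1 stored as 0: assumed).
signs = [+1, -1, -1, +1, +1]
packed = pack_bits([1 if s > 0 else 0 for s in signs])
restored = unpack_bits(packed, len(signs))
```

Because packing pads each row to a byte boundary, the original column count has to be stored somewhere to strip the padding on load; the `{key}__cols` entry presumably serves that purpose.
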
Training
| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu-score-2 (60%) + FineWeb (40%) |
| Tokens | 900M |
| Optimizer | AdamW, betas=(0.9, 0.95), weight_decay=0.1 |
| Learning rate | 3e-4 → 1e-5 cosine; router lr = 1e-4 |
| Warmup | 5% of total steps |
| Gradient clipping | 1.0 |
| Batch size | 64K tokens effective (bs=2, grad_accum=16, ctx=2048) |
| Hardware | Apple M2 (MLX) |
| Throughput | ~3,000 tokens/sec |
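
The schedule in this table can be written out as a short function. This is a sketch under assumptions: the card does not say whether warmup is linear or whether the step count is rounded down, so both are guesses here, and the separate router learning rate is not modeled.

```python
import math

total_tokens = 900_000_000
tokens_per_step = 2 * 16 * 2048             # bs * grad_accum * ctx = 65,536
total_steps = total_tokens // tokens_per_step
warmup_steps = int(0.05 * total_steps)      # 5% of total steps
peak_lr, min_lr = 3e-4, 1e-5


def lr_at(step):
    """Linear warmup (assumed), then cosine decay from peak_lr to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


print(total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

With these numbers the run is about 13.7K optimizer steps, of which roughly 690 are warmup. At the stated ~3,000 tokens/sec, 900M tokens correspond to about 83 hours of training.
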
Benchmark Results (0-shot, 500 samples)
| Task | Shot | Metric | Samples | Random | Score |
|---|---|---|---|---|---|
| HellaSwag | 0-shot | acc_norm | 500 | 25% | 33.4% |
| LAMBADA | 0-shot | acc | 500 | N/A | 13.8% |
| PIQA | 0-shot | acc_norm | 500 | 50% | 53.2% |
| WinoGrande | 0-shot | acc | 500 | 50% | 49.6% |
| ARC-Easy | 0-shot | acc_norm | 500 | 25% | 35.0% |
| ARC-Challenge | 0-shot | acc_norm | 500 | 25% | 21.0% |
| BoolQ | 0-shot | acc | 500 | 50% | 36.2% |
| MMLU (57 tasks avg) | 0-shot | acc | up to 500/task (total 12,173) | 25% | 23.4% |
Expert Routing Statistics
Measured on 136 tokens (8 diverse text samples), top-2 routing. Uniform load = 12.5%.
| Layer | E0 | E1 | E2 | E3 | E4 | E5 | E6 | E7 | CV |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.8% | 7.7% | 11.0% | 15.1% | 18.8% | 18.0% | 11.0% | 9.6% | 0.32 |
| 1 | 12.5% | 11.0% | 11.4% | 13.2% | 16.5% | 9.9% | 12.9% | 12.5% | 0.15 |
| 2 | 10.7% | 18.0% | 16.2% | 10.3% | 9.6% | 11.4% | 11.4% | 12.5% | 0.22 |
| 3 | 10.7% | 6.2% | 14.3% | 8.5% | 11.8% | 7.7% | 22.8% | 18.0% | 0.42 |
| 4 | 12.1% | 16.5% | 10.3% | 10.7% | 14.0% | 18.0% | 8.8% | 9.6% | 0.25 |
| 5 | 18.0% | 15.1% | 8.8% | 12.9% | 9.6% | 9.6% | 15.1% | 11.0% | 0.25 |
CV = coefficient of variation (lower = more balanced). No expert collapse observed.
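
The CV column can be recomputed from the per-expert shares. The sketch below assumes CV is the population standard deviation divided by the mean; the card does not state which estimator was used, but this choice matches the table when applied to the rounded percentages.

```python
from statistics import fmean, pstdev

# Per-expert load (%) copied from the routing table, one row per layer.
load = {
    0: [8.8, 7.7, 11.0, 15.1, 18.8, 18.0, 11.0, 9.6],
    1: [12.5, 11.0, 11.4, 13.2, 16.5, 9.9, 12.9, 12.5],
    2: [10.7, 18.0, 16.2, 10.3, 9.6, 11.4, 11.4, 12.5],
    3: [10.7, 6.2, 14.3, 8.5, 11.8, 7.7, 22.8, 18.0],
    4: [12.1, 16.5, 10.3, 10.7, 14.0, 18.0, 8.8, 9.6],
    5: [18.0, 15.1, 8.8, 12.9, 9.6, 9.6, 15.1, 11.0],
}


def coefficient_of_variation(shares):
    """Population standard deviation divided by the mean (assumed definition)."""
    return pstdev(shares) / fmean(shares)


cv_by_layer = {layer: coefficient_of_variation(s) for layer, s in load.items()}
for layer, cv in cv_by_layer.items():
    print(f"layer {layer}: CV = {cv:.2f}")
```

A perfectly uniform router would give CV = 0, while a router that sent every token to the same two of the eight experts would give a CV of about 1.73.
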
Tokenizer
- BPE, vocabulary size = 8,192
- Byte fallback enabled (no `<unk>`)
- ASCII/English-optimized segmentation
Special Tokens
| Token | ID | Role |
|---|---|---|
| `<bos>` | 0 | sequence start |
| `<eos>` | 1 | end of sequence |
| `<pad>` | 2 | padding |
| `<\|system\|>` | 3 | system turn |
| `<\|user\|>` | 4 | user turn |
| `<\|assistant\|>` | 5 | assistant turn |
| `<\|eot\|>` | 6 | end of turn |
Usage
Download inference.py from this repo. Requires: torch, safetensors, tokenizers.
```shell
pip install torch safetensors tokenizers
```

```python
from inference import load_packed_model
from tokenizers import Tokenizer

model = load_packed_model("model.safetensors")  # 12 MB, no dequantization
tok = Tokenizer.from_file("tokenizer.json")

ids = [0] + tok.encode("The quick brown fox").ids  # 0 = <bos>
out = model.generate(ids, max_new_tokens=100, temperature=0.8)
print(tok.decode(out))
```
CLI:
```shell
python inference.py \
  --ckpt model.safetensors \
  --prompt "The quick brown fox" \
  --max-tokens 100 \
  --temperature 0.8
```
Memory: Weights stay in packed uint8 format (12.2 MB). Each layer is unpacked on the fly during the forward pass, so peak RAM usage is ~18 MB (12 MB stored + ~6 MB for the largest layer unpacked).
License
Apache 2.0