SSMoELM-Base

Scratch Small MoE Language Model — a 1-bit sparse MoE language model designed to run inside Scratch / TurboWarp.

  • 47M total / 25.8M active parameters (top-2 sparse routing)
  • 12.1 MB packed weights (1-bit routed experts and inner-layer attention V/O; 4-bit shared expert, remaining attention, and embedding)
  • Trained from scratch on English web text for 900M tokens

Note: The Hugging Face model card may display ~12M parameters and an "8-bit" quantization badge. Both are artifacts of reading the packed model.safetensors directly: weights are stored as uint8 arrays (bit-packed 1-bit or nibble-packed 4-bit values), so each byte holding eight 1-bit or two 4-bit weights is counted as a single 8-bit element. The actual model has 47M parameters, with weight matrices quantized to 1-bit or 4-bit.

"Scratch" carries two meanings: built for Scratch, trained from scratch.


Model Details

| Property | Value |
|---|---|
| Architecture | Decoder-only Transformer + Sparse MoE FFN |
| Total params | 47.04M |
| Active params | 25.80M (per forward pass) |
| d_model | 768 |
| Layers | 6 |
| Attention | GQA: 12 heads, kv_heads=3, head_dim=64 |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
| MoE | 8 routed experts + 1 shared expert, top-2 routing |
| d_ff (per expert) | 256 |
| Vocabulary | 8,192 (BPE, byte-fallback, English-optimized) |
| Context length | 2,048 tokens |
| Training tokens | 900M |
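
The total and active parameter counts can be reproduced from the dimensions listed above. The sketch below is a back-of-the-envelope check, not code from this repo; it assumes no bias terms, a tied embedding counted once, and two RMSNorm layers per block plus one final RMSNorm.

```python
# Dimensions taken from the Model Details section.
d_model, n_layers, vocab = 768, 6, 8192
n_heads, n_kv_heads, head_dim = 12, 3, 64
d_ff, n_routed, n_shared, top_k = 256, 8, 1, 2

# Assumptions: no biases, tied embedding/lm_head counted once,
# two RMSNorms per layer plus one final RMSNorm.
embed = vocab * d_model
attn = 2 * d_model * n_heads * head_dim       # Q and O projections
attn += 2 * d_model * n_kv_heads * head_dim   # K and V projections
expert = 3 * d_model * d_ff                   # SwiGLU gate, up, down
router = d_model * n_routed
norms = 2 * n_layers * d_model + d_model

always_on = attn + router + n_shared * expert  # per layer, used by every token
total = embed + norms + n_layers * (always_on + n_routed * expert)
active = embed + norms + n_layers * (always_on + top_k * expert)

print(f"total  = {total / 1e6:.2f}M")   # total  = 47.04M
print(f"active = {active / 1e6:.2f}M")  # active = 25.80M
```

Under these assumptions the routed experts account for roughly 28.3M of the 47M parameters, which is why top-2 routing cuts the active count to about 55% of the total.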

Quantization Scheme

Quantization-aware training (QAT, BitNet b1.58 style): the forward pass uses quantized weights, while gradients update the bf16 master weights through a straight-through estimator (STE).
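
For the 1-bit case, one common way to write such a quantizer is shown below. This is an illustrative formulation only: the per-row absolute-mean scale s_i is an assumption (the source states only that scales are per-row), and the actual training code may differ.

```latex
\hat{W}_{ij} = s_i \,\operatorname{sign}(W_{ij}), \qquad
s_i = \frac{1}{n}\sum_{j=1}^{n} \lvert W_{ij} \rvert, \qquad
\frac{\partial \mathcal{L}}{\partial W_{ij}} \approx
\frac{\partial \mathcal{L}}{\partial \hat{W}_{ij}}
```

Here W is the bf16 master weight, \hat{W} is the quantized weight used in the forward pass, and the last relation is the straight-through approximation that treats the quantizer as the identity in the backward pass.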

| Component | Bits | Notes |
|---|---|---|
| Routed expert gate/up/down | 1-bit | sparse, 2/8 active per token |
| Shared expert gate/up/down | 4-bit | runs on every token |
| Attention Q, K (all layers) | 4-bit | precision-sensitive |
| Attention V, O (layers 0, 5) | 4-bit | boundary layers |
| Attention V, O (layers 1–4) | 1-bit | inner layers |
| Embedding | 4-bit | tied with lm_head |
| Router, RMSNorm | bf16 | not quantized |
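
As a rough cross-check of the 12.1 MB figure, the bit widths above can be turned into a byte count for the quantized weight payload. This is an estimate under the same no-bias assumption as the parameter count; it ignores per-row scales, the unquantized router and RMSNorm tensors, shape metadata, and the safetensors header.

```python
d_model, n_layers, vocab = 768, 6, 8192
q_o = d_model * 12 * 64      # one Q or O projection (12 heads x 64)
k_v = d_model * 3 * 64       # one K or V projection (3 KV heads x 64)
expert = 3 * d_model * 256   # gate + up + down of one expert

# (parameter count, bits per weight) for each quantized component
groups = {
    "routed experts":        (n_layers * 8 * expert, 1),
    "shared expert":         (n_layers * expert, 4),
    "attention Q, K":        (n_layers * (q_o + k_v), 4),
    "attention V, O (0, 5)": (2 * (k_v + q_o), 4),
    "attention V, O (1-4)":  (4 * (k_v + q_o), 1),
    "embedding":             (vocab * d_model, 4),
}

payload = {name: n * bits // 8 for name, (n, bits) in groups.items()}
total_bytes = sum(payload.values())

for name, size in payload.items():
    print(f"{name:24s} {size / 1e6:6.2f} MB")
print(f"{'quantized payload':24s} {total_bytes / 1e6:6.2f} MB")
```

The payload comes to about 11.8 MB, so the remaining few hundred kilobytes of the packed file would be scales, unquantized tensors, and metadata.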

Packed format per tensor:

  • 1-bit: {key}__scale (fp16, per-row) + {key}__bin (uint8 packbits)
  • 4-bit: {key}__scale (fp16, per-row) + {key}__int4 (uint8 nibble-packed)
  • Metadata: {key}__shape (int32) + {key}__cols (int32)

Keys use / as separator in safetensors (replace with . for model loading).
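
The packing described above can be sketched with NumPy. This is an illustration of the general technique, not the repo's inference.py: the scale rules (mean absolute value for 1-bit, max/7 for 4-bit), the bit order, the nibble order, and the +8 offset for 4-bit codes are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def pack_1bit(w):
    """Per-row fp16 scale plus sign bits packed 8 per byte (np.packbits)."""
    scale = np.abs(w).mean(axis=1).astype(np.float16)  # assumed scale rule
    bits = (w >= 0).astype(np.uint8)                    # 1 means +1, 0 means -1
    return scale, np.packbits(bits, axis=1)

def unpack_1bit(scale, packed, cols):
    bits = np.unpackbits(packed, axis=1)[:, :cols]      # drop padding bits
    signs = bits.astype(np.float32) * 2.0 - 1.0
    return signs * scale.astype(np.float32)[:, None]

def pack_4bit(w):
    """Per-row fp16 scale plus signed 4-bit codes, two codes per byte."""
    scale = (np.abs(w).max(axis=1) / 7.0).astype(np.float16)  # assumed scale rule
    q = np.clip(np.round(w / scale.astype(np.float32)[:, None]), -8, 7)
    u = (q.astype(np.int8) + 8).astype(np.uint8)        # shift codes to 0..15
    if u.shape[1] % 2:                                  # pad odd widths with code for 0
        u = np.pad(u, ((0, 0), (0, 1)), constant_values=8)
    return scale, (u[:, 0::2] << 4) | u[:, 1::2]        # high nibble first (assumed)

def unpack_4bit(scale, packed, cols):
    u = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    u[:, 0::2] = packed >> 4
    u[:, 1::2] = packed & 0x0F
    q = u[:, :cols].astype(np.float32) - 8.0
    return q * scale.astype(np.float32)[:, None]

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 10)).astype(np.float32)
s1, b1 = pack_1bit(w)
s4, b4 = pack_4bit(w)
print(b1.shape, b4.shape)  # (4, 2) (4, 5)
```

The original column count has to be stored alongside the packed bytes because both layouts pad each row to a whole number of bytes, which is the role the {key}__cols entry plays in the format above.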


Training

| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu-score-2 (60%) + FineWeb (40%) |
| Tokens | 900M |
| Optimizer | AdamW, betas=(0.9, 0.95), weight_decay=0.1 |
| Learning rate | 3e-4 → 1e-5 (cosine decay); router LR = 1e-4 |
| Warmup | 5% of total steps |
| Gradient clipping | 1.0 |
| Batch size | 64K tokens effective (bs=2, grad_accum=16, ctx=2048) |
| Hardware | Apple M2 (MLX) |
| Throughput | ~3,000 tokens/sec |
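
The step count, warmup length, and learning-rate curve implied by these settings can be sketched as follows. The linear warmup shape is an assumption (the source gives only the 5% warmup length), and the separate router learning rate is not modeled.

```python
import math

TOTAL_TOKENS = 900_000_000
TOKENS_PER_STEP = 2 * 16 * 2048            # bs * grad_accum * ctx = 65,536
TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_STEP
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)
LR_MAX, LR_MIN = 3e-4, 1e-5

def lr_at(step):
    """Linear warmup (assumed), then cosine decay from LR_MAX to LR_MIN."""
    if step < WARMUP_STEPS:
        return LR_MAX * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1.0 + math.cos(math.pi * progress))

print(TOTAL_STEPS, WARMUP_STEPS)            # 13732 686
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")

# Wall-clock estimate at the quoted ~3,000 tokens/sec.
print(f"{TOTAL_TOKENS / 3000 / 3600:.1f} hours")
```

At the quoted throughput, 900M tokens correspond to roughly 83 hours of training on the single machine listed above.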

Benchmark Results (0-shot, 500 samples)

| Task | Shot | Metric | Samples | Random baseline | Score |
|---|---|---|---|---|---|
| HellaSwag | 0-shot | acc_norm | 500 | 25% | 33.4% |
| LAMBADA | 0-shot | acc | 500 | N/A | 13.8% |
| PIQA | 0-shot | acc_norm | 500 | 50% | 53.2% |
| WinoGrande | 0-shot | acc | 500 | 50% | 49.6% |
| ARC-Easy | 0-shot | acc_norm | 500 | 25% | 35.0% |
| ARC-Challenge | 0-shot | acc_norm | 500 | 25% | 21.0% |
| BoolQ | 0-shot | acc | 500 | 50% | 36.2% |
| MMLU (57 tasks avg) | 0-shot | acc | up to 500/task (total 12,173) | 25% | 23.4% |

Expert Routing Statistics

Measured on 136 tokens (8 diverse text samples), top-2 routing. Uniform load = 12.5%.

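| Layer | E0 | E1 | E2 | E3 | E4 | E5 | E6 | E7 | CV |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.8% | 7.7% | 11.0% | 15.1% | 18.8% | 18.0% | 11.0% | 9.6% | 0.32 |
| 1 | 12.5% | 11.0% | 11.4% | 13.2% | 16.5% | 9.9% | 12.9% | 12.5% | 0.15 |
| 2 | 10.7% | 18.0% | 16.2% | 10.3% | 9.6% | 11.4% | 11.4% | 12.5% | 0.22 |
| 3 | 10.7% | 6.2% | 14.3% | 8.5% | 11.8% | 7.7% | 22.8% | 18.0% | 0.42 |
| 4 | 12.1% | 16.5% | 10.3% | 10.7% | 14.0% | 18.0% | 8.8% | 9.6% | 0.25 |
| 5 | 18.0% | 15.1% | 8.8% | 12.9% | 9.6% | 9.6% | 15.1% | 11.0% | 0.25 |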

CV = coefficient of variation (lower = more balanced). No expert collapse observed.
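
The CV column can be recomputed from the per-expert load shares as standard deviation divided by mean. The sketch below uses the population standard deviation, which appears to be the convention used here since it reproduces every value in the table.

```python
from statistics import mean, pstdev

# Per-expert load shares (%) for each layer, copied from the table above.
load = {
    0: [8.8, 7.7, 11.0, 15.1, 18.8, 18.0, 11.0, 9.6],
    1: [12.5, 11.0, 11.4, 13.2, 16.5, 9.9, 12.9, 12.5],
    2: [10.7, 18.0, 16.2, 10.3, 9.6, 11.4, 11.4, 12.5],
    3: [10.7, 6.2, 14.3, 8.5, 11.8, 7.7, 22.8, 18.0],
    4: [12.1, 16.5, 10.3, 10.7, 14.0, 18.0, 8.8, 9.6],
    5: [18.0, 15.1, 8.8, 12.9, 9.6, 9.6, 15.1, 11.0],
}

def cv(shares):
    """Coefficient of variation: population std divided by mean."""
    return pstdev(shares) / mean(shares)

for layer, shares in load.items():
    print(layer, f"{cv(shares):.2f}")
```

A perfectly uniform router would give CV = 0, while a fully collapsed layer that sends every token to the same two experts would give a CV of about 1.73 with 8 experts.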


Tokenizer

  • BPE, vocabulary size = 8,192
  • Byte fallback enabled (no <unk>)
  • ASCII/English-optimized segmentation

Special Tokens

Token ID Role
<bos> 0 sequence start
<eos> 1 end of sequence
<pad> 2 padding
<|system|> 3 system turn
<|user|> 4 user turn
<|assistant|> 5 assistant turn
<|eot|> 6 end of turn

Usage

Download inference.py from this repo. Requires: torch, safetensors, tokenizers.

Install:

pip install torch safetensors tokenizers

Python:

from inference import load_packed_model
from tokenizers import Tokenizer

model = load_packed_model("model.safetensors")  # ~12 MB; weights stay packed in memory
tok = Tokenizer.from_file("tokenizer.json")

ids = [0] + tok.encode("The quick brown fox").ids  # 0 = <bos>
out = model.generate(ids, max_new_tokens=100, temperature=0.8)
print(tok.decode(out))

CLI:

python inference.py \
  --ckpt model.safetensors \
  --prompt "The quick brown fox" \
  --max-tokens 100 \
  --temperature 0.8

Memory: Weights stay in packed uint8 format (about 12 MB). Each layer is unpacked on the fly during the forward pass, so peak RAM usage is ~18 MB (12 MB stored + ~6 MB for the largest unpacked layer).


License

Apache 2.0
