Harold v0.7
Harold is a 3.2B-parameter continuous diffusion language model built for edge and IoT deployment. Unlike standard small LLMs, which are dense autoregressive Transformers shrunk down, Harold is designed from first principles for constrained hardware: subquadratic SSM layers, sparse MoE activation, and parallel diffusion decoding.
Developed by Minya AI · GitHub · Apache 2.0
⚠️ Training in progress. Harold v0.7 is currently completing its 100k-iteration pretraining run. Weights will be released upon completion. This card documents architecture and design decisions.
Architecture
Harold v0.7 combines three ideas:
1. Jamba: Hybrid SSM-Transformer Backbone
Three of every four layers use Mamba3 state space models instead of attention; only one in four is standard attention. Because the SSM layers scale linearly with sequence length, most of the network avoids attention's quadratic cost on long sequences, which matters on edge devices where memory bandwidth is the bottleneck.
[Mamba3, Mamba3, Mamba3, Attention] × 10 → 40 layers total
Mamba3 (Lahoti et al., ICLR 2026) improves on Mamba2 with exponential-trapezoidal discretization, complex-valued state updates, and a MIMO formulation.
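
The layer pattern above can be written out as a flat list. This is a minimal sketch: the names `BLOCK_PATTERN`, `N_BLOCKS`, and `build_layer_layout` are hypothetical and do not come from the Harold codebase, and the strings are placeholders rather than real module classes.

```python
# Hypothetical names for illustration; not Harold's actual module classes.
BLOCK_PATTERN = ["mamba3", "mamba3", "mamba3", "attention"]
N_BLOCKS = 10


def build_layer_layout(pattern=BLOCK_PATTERN, n_blocks=N_BLOCKS):
    """Repeat the 4-layer pattern n_blocks times and return the flat layer list."""
    return [kind for _ in range(n_blocks) for kind in pattern]


layout = build_layer_layout()
n_mamba = layout.count("mamba3")
n_attention = layout.count("attention")
print(len(layout), n_mamba, n_attention)
```

With these settings the layout has 40 entries, 30 of them SSM layers and 10 attention layers, matching the Model Details table.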
2. Sparse Mixture of Experts
DeepSeek-style MoE with 2 shared + 16 routed experts, top-2 selection. Harold has 3.2B total parameters but activates ~800M per forward pass: the compute cost of a much smaller model with the capacity of a larger one.
Routing is timestep-conditioned: the router receives [token_repr; timestep_emb], allowing different experts to specialize at different noise levels.
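
As an illustration of timestep-conditioned routing, the sketch below concatenates a token representation with a timestep embedding, scores the 16 routed experts, and keeps the top 2. It is a toy in pure Python under several assumptions: the dimensions, the random router weights, the one-hot timestep embeddings, and the renormalization of the two gate values are all illustrative choices, not details confirmed by this card. The two shared experts are always active and are therefore not routed here.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(token_repr, timestep_emb, router_weights, k=2):
    """Select k routed experts from the concatenated [token_repr; timestep_emb]."""
    router_input = list(token_repr) + list(timestep_emb)
    logits = [sum(w * x for w, x in zip(row, router_input)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: gate values are renormalized over the selected experts.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Toy sizes for illustration only; the real model is far larger.
rng = random.Random(0)
D_TOKEN, D_TIME, N_ROUTED = 8, 4, 16
router_weights = [
    [rng.gauss(0, 1) for _ in range(D_TOKEN + D_TIME)] for _ in range(N_ROUTED)
]
token_repr = [rng.gauss(0, 1) for _ in range(D_TOKEN)]

# Hypothetical embeddings standing in for a high-noise and a low-noise timestep.
early = route_top_k(token_repr, [1.0, 0.0, 0.0, 0.0], router_weights)
late = route_top_k(token_repr, [0.0, 0.0, 0.0, 1.0], router_weights)
print("high-noise step:", early)
print("low-noise step: ", late)
```

Because the timestep embedding is part of the router input, the same token can be sent to different experts at different noise levels.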
3. Continuous Flow Matching Diffusion
Harold does not predict the next token. Instead, it refines an entire sequence from Gaussian noise toward coherent text using x0-prediction Flow Matching with logit-normal timestep sampling. This enables:
- Parallel decoding: all tokens refined simultaneously
- Native infill: fill-in-the-middle without tricks
- Iterative decoding: high-confidence tokens are frozen progressively, reducing unnecessary computation
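
The freezing behaviour in the last bullet can be sketched as a bookkeeping loop. This is not Harold's sampler: `iterative_decode` and `toy_denoise_step` are hypothetical names, the toy denoiser returns random token ids with a confidence that simply grows with the step index, and the real model refines continuous representations rather than a Python list. The defaults of 32 steps and a 0.9 threshold mirror the values used in the Usage section.

```python
import random

N_STEPS = 32
_rng = random.Random(0)


def toy_denoise_step(step, active_positions, vocab_size=32000):
    """Stand-in for the model: confidence rises with the step index plus noise."""
    out = {}
    for i in active_positions:
        confidence = min(1.0, (step + 1) / N_STEPS + _rng.uniform(0.0, 0.3))
        out[i] = (_rng.randrange(vocab_size), confidence)
    return out


def iterative_decode(denoise_step, seq_len, n_steps=N_STEPS, freeze_threshold=0.9):
    """Refine all positions in parallel and freeze each one once it is confident."""
    tokens = [None] * seq_len
    frozen = [False] * seq_len
    refined = 0  # number of (position, step) updates actually performed
    for step in range(n_steps):
        active = [i for i in range(seq_len) if not frozen[i]]
        if not active:
            break  # every position is frozen, so the remaining steps are skipped
        for i, (token_id, confidence) in denoise_step(step, active).items():
            tokens[i] = token_id
            refined += 1
            if confidence >= freeze_threshold:
                frozen[i] = True
    return tokens, frozen, refined


tokens, frozen, refined = iterative_decode(toy_denoise_step, seq_len=16)
print(f"{refined} position updates instead of {16 * N_STEPS}")
```

The count of position updates gives a rough sense of how much work freezing can skip relative to refining every position at every step.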
Model Details
| Property | Value |
|---|---|
| Architecture | Jamba (Mamba3 + Attention) + DeepSeek MoE + Flow Matching |
| Parameters | ~3.20B total / ~800M active |
| d_model | 1792 |
| Layers | 40 (30 Mamba3 + 10 Attention) |
| Attention | GQA 4:1 (28 heads, 7 KV) + MLA (latent dim 224) |
| Attention mask | DSA: local window 256 + global every 64 |
| Mamba3 d_state | 128 |
| MoE | 2 shared SwiGLU + 16 routed GELU (top-2) |
| Max seq len | 4096 (YaRN RoPE scale=4.0) |
| Tokenizer | LLaMA-2 BPE (32,000 vocab) |
| Diffusion | x0-prediction CFM, logit-normal t ~ σ(N(0, 0.5)) |
| Self-conditioning | enabled (p=0.5) |
| CFG | enabled at inference |
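
The Diffusion row can be made concrete with a small sampling sketch. Two assumptions are made here because the table does not settle them: the 0.5 in N(0, 0.5) is read as a standard deviation (it could equally be a variance), and the flow-matching path is taken to be a linear interpolation with t = 0 as pure noise and t = 1 as clean data. The function names are hypothetical.

```python
import math
import random


def sample_timestep(rng, mean=0.0, std=0.5):
    """Logit-normal sample: draw z ~ N(mean, std^2), then squash with a sigmoid."""
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def interpolate(x0, noise, t):
    """Linear path between noise (t=0) and clean data x0 (t=1); assumed convention."""
    return [(1.0 - t) * n + t * x for x, n in zip(x0, noise)]


rng = random.Random(0)
timesteps = [sample_timestep(rng) for _ in range(10_000)]
mean_t = sum(timesteps) / len(timesteps)

x0 = [0.5, -1.0, 2.0]                      # toy clean embedding
noise = [rng.gauss(0, 1) for _ in x0]      # Gaussian noise of the same shape
x_t = interpolate(x0, noise, timesteps[0])  # noised training input at one timestep
print(f"mean t = {mean_t:.3f}, min = {min(timesteps):.3f}, max = {max(timesteps):.3f}")
```

Logit-normal sampling concentrates training on mid-range timesteps and rarely draws values near 0 or 1; with x0-prediction the network is trained to output the clean `x0` from `x_t`.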
Pretraining Dataset Mix
Edge/IoT-oriented mix that prioritizes code and systems content over long-form narrative:
| Dataset | Weight | Purpose |
|---|---|---|
| FineWeb-Edu | 20% | High-quality web text |
| SlimPajama | 15% | Thematic diversity |
| GitHub Code | 25% | General code (30+ languages) |
| The Stack (C/C++/Rust) | 10% | Systems & embedded languages |
| C4 | 10% | Web text |
| Wikipedia EN | 10% | Factual grounding |
| Open-Web-Math | 5% | Mathematical reasoning |
| arXiv | 3% | Technical writing |
| PG-19 | 2% | Long-form coherence |
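
One simple way to realize such a mix is to draw the source of each training sample in proportion to its weight. The sketch below does that with the standard library; it is an assumption about how the weights might be applied, not a description of Harold's data loader, and `sample_sources` is a hypothetical helper.

```python
import random
from collections import Counter

# Weights in percent, copied from the table above.
MIX = {
    "FineWeb-Edu": 20,
    "SlimPajama": 15,
    "GitHub Code": 25,
    "The Stack (C/C++/Rust)": 10,
    "C4": 10,
    "Wikipedia EN": 10,
    "Open-Web-Math": 5,
    "arXiv": 3,
    "PG-19": 2,
}


def sample_sources(n, rng):
    """Draw n dataset names with probability proportional to their mix weight."""
    names = list(MIX)
    weights = [MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


total_weight = sum(MIX.values())
code_share = MIX["GitHub Code"] + MIX["The Stack (C/C++/Rust)"]
counts = Counter(sample_sources(100_000, random.Random(0)))
print(total_weight, code_share, counts.most_common(3))
```

The weights sum to 100, and the two code sources together account for 35% of the mix.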
- Hardware: Vast.ai 8×B200
- Precision: bfloat16
- Optimizer: MuonAdamW
- Learning rate: 1e-4 → 1e-5 cosine, 1000 warmup steps
- Effective batch: 4 × 32 grad accum × 4096 tokens
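
The learning-rate and batch settings above can be sketched as follows. The shape of the warmup (linear), the decay horizon (the 100k iterations mentioned in the training notice), and the choice to reach 1e-5 exactly at the final step are assumptions; the card only gives the endpoints, the cosine shape, and the warmup length. Whether the batch figure is per GPU or global is also not stated.

```python
import math

PEAK_LR, MIN_LR = 1e-4, 1e-5
WARMUP_STEPS, TOTAL_STEPS = 1_000, 100_000


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR (assumed schedule)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# Micro-batch 4 x 32 accumulation steps x 4096 tokens per sequence.
TOKENS_PER_STEP = 4 * 32 * 4096

for step in (0, 999, 50_500, 100_000):
    print(step, f"{lr_at(step):.2e}")
print("tokens per optimizer step:", TOKENS_PER_STEP)
```

Under these assumptions each optimizer step consumes 524,288 tokens, and the learning rate at the midpoint of the decay is the average of the two endpoints, 5.5e-5.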
Benchmarks (Harold v0.6, 1.5B)
Throughput vs equivalent dense Transformer baseline (single GPU, bfloat16):
| Seq Len | Harold tok/s | Transformer tok/s | Speedup |
|---|---|---|---|
| 256 | 1,250 | 1,374 | 0.91× |
| 512 | 2,450 | 2,726 | 0.90× |
| 1024 | 4,826 | 5,426 | 0.89× |
| 2048 | 9,171 | 9,924 | 0.92× |
| 4096 | 14,940 | 13,982 | 1.07× |
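Harold v0.6 trails the dense baseline by roughly 8-11% up to 2048 tokens and overtakes it at 4096, where the linear-time SSM layers begin to pay off. The advantage is expected to grow beyond 4096 tokens, although longer sequences were not measured here. Harold v0.7 with 3.2B params is expected to show a larger advantage due to the increased MoE sparsity.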
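
The Speedup column is the ratio of the two throughput columns, which can be recomputed directly from the table. The helper names below are hypothetical; the numbers are copied from the benchmark rows.

```python
# seq_len -> (Harold tok/s, Transformer tok/s), copied from the table above.
BENCH = {
    256: (1250, 1374),
    512: (2450, 2726),
    1024: (4826, 5426),
    2048: (9171, 9924),
    4096: (14940, 13982),
}


def speedups(bench):
    """Harold throughput divided by baseline throughput, rounded to 2 decimals."""
    return {n: round(harold / base, 2) for n, (harold, base) in bench.items()}


def crossover(bench):
    """Smallest measured sequence length at which Harold is at least as fast."""
    for n in sorted(bench):
        harold, base = bench[n]
        if harold >= base:
            return n
    return None


ratios = speedups(BENCH)
first_win = crossover(BENCH)
print(ratios, first_win)
```

The only measured length where the ratio exceeds 1.0 is 4096 tokens.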
Usage
Weights are not yet released. The following assumes a v0.7 checkpoint is available.
```python
import torch
from transformers import AutoTokenizer

from core.config import get_model_config
from core.model import build_model
from sampler import build_sampler

# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
state = torch.load("harold-v0.7-3B.pt", map_location="cpu", weights_only=False)

# Use the config stored in the checkpoint, falling back to the repository default.
model_cfg = state.get("model_cfg", get_model_config())
model = build_model(model_cfg).cuda().bfloat16()
model.load_state_dict(state["model_state"])
model.eval()

# Harold uses the LLaMA-2 BPE tokenizer, which has no pad token by default.
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# 32 denoising steps; positions with confidence >= 0.9 are frozen; CFG scale 3.0.
sampler = build_sampler(model, n_steps=32, freeze_threshold=0.9, cfg_scale=3.0)
tokens = sampler.generate(batch_size=1, seq_len=256)
print(tokenizer.decode(tokens[0].tolist(), skip_special_tokens=True))
```

```shell
python sampler.py --prompt "Write a C function to read a sensor value" --max_steps 32
python sampler.py --prompt "..." --no_iterative  # uniform denoising
```
Installation
```shell
git clone https://github.com/JHNMACHINE/harold
cd harold
pip install -r requirements.txt
# Mamba3 requires build from source
pip install git+https://github.com/state-spaces/mamba.git --no-build-isolation --no-deps
pip install einops
```
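
After installation, a quick check that the key packages are importable can save a failed run later. This sketch only uses the standard library; the module list is an assumption (`mamba_ssm` is the import name of the state-spaces/mamba package, and `check_environment` is a hypothetical helper), and it does not verify that the installed build actually includes Mamba3 kernels.

```python
import importlib.util

# Assumed import names for the packages installed above.
REQUIRED = ["torch", "einops", "mamba_ssm"]


def check_environment(modules=REQUIRED):
    """Return {module_name: True/False} depending on whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


status = check_environment()
for name, found in status.items():
    print(f"{name:10s} {'ok' if found else 'MISSING'}")
```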
Changelog
| Version | Params | Key changes |
|---|---|---|
| v0.4 | 733M | VP-SDE, GPT-2 tokenizer, pure Transformer |
| v0.5 | 1.25B | Flow Matching, LLaMA-2 tokenizer |
| v0.6 | 1.51B | Jamba (Mamba2), MoE, SFT |
| v0.7 | 3.2B | Mamba3, x0-prediction, iterative decoding, FSDP, edge/IoT focus |
Limitations
- Training in progress: generation quality metrics pending full 100k-step run
- Diffusion latency: requires N forward passes; ~1s on H200 for 256 tokens at 32 steps
- No SFT yet: instruction following planned post-pretraining
- Mamba3: requires building from source, MIMO mode disabled
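
The latency figure in the list above implies a per-step cost and an end-to-end rate, worked out below. This is plain arithmetic on the quoted numbers (about 1 s, 32 steps, 256 tokens) and assumes every step costs the same, which iterative freezing may not preserve; `diffusion_latency` is a hypothetical helper.

```python
def diffusion_latency(total_seconds=1.0, n_steps=32, seq_len=256):
    """Split a total generation time into per-step cost and end-to-end tokens/s."""
    return {
        "seconds_per_step": total_seconds / n_steps,
        "tokens_per_second": seq_len / total_seconds,
    }


estimate = diffusion_latency()
print(f"{estimate['seconds_per_step'] * 1000:.2f} ms per denoising step")
print(f"{estimate['tokens_per_second']:.0f} tokens/s end to end")
```

With the quoted figures this comes to about 31 ms per denoising step and roughly 256 tokens per second for the whole sequence.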
Citation
```bibtex
@article{vecchione2026harold,
  title  = {Harold v0.7: Edge-Optimized Diffusion Language Model with Jamba, Flow Matching and Sparse MoE},
  author = {Vecchione, Jonathan},
  year   = {2026},
  url    = {https://huggingface.co/JHN-MACHINE/harold}
}
```
License
Apache 2.0, see LICENSE
Built in Naples, Italy · minya.ai