MindeesAI Base

A self-improving, persona-driven native transformer — trained from scratch, deployed for $0/month, designed to grow.

Source code · Inference sidecar · Branches

TL;DR

MindeesAI is a small, open, from-scratch transformer language model with a deliberately scoped personality named Mindees. It is not a fine-tune of a larger pretrained model — every parameter was learned by gradient descent on a curated mix of permissively licensed instruction, math, code, and reasoning datasets.

The project's distinguishing bet is that a continuously self-improving small model, trained across a federation of free GPU/CPU environments (your home RTX, GitHub Actions, Kaggle Notebooks, Google Colab), can become genuinely useful at sub-300M parameters when its training corpus is constantly enriched by every chat turn it serves. It is deployed end-to-end on free-tier infrastructure: Cloudflare Workers + Hugging Face Spaces + Cloudflare R2 + Hugging Face Hub + GitHub Actions.

This repository hosts the trained model weights, tokenizer, and training metrics across four independent revisions — one per training environment.

Available Revisions (Branches)

This repository uses Hugging Face Hub's git branches to host four independently-trained checkpoints of the same model family. You can pin any deployment to a specific revision via revision= when loading.

Revision	Variant	Params	Where it was trained	Cadence
`main`	`home-11gb` / `home-max`	280M – 349M	Local RTX 5070 (11 GB VRAM, batch 2 × grad-accum 4, AMP + grad-ckpt)	Manual, owner-driven
`small-weekly`	`cpu_max_5h_50k`	17.5M	GitHub Actions cron, CPU-only (`ubuntu-latest`)	4× daily, ~15k steps/run
`kaggle-weekly`	`home-11gb`	280M	Kaggle T4 / P100 GPU notebooks, 12h sessions	Owner-driven (weekly)
`colab-burst`	`home-11gb`	280M	Google Colab T4, idle-disconnect-aware	Owner-driven (burst)

Every revision continues from its prior commit's optimizer + step state — training accumulates across sessions, never resets. The main revision is held sacrosanct and is never written to by CI or notebooks.

Model Variants

Variant	Params	Hidden	Layers	Heads	KV Heads	MLP	Context	Vocab	Tokenizer
`cpu_max_5h_50k`	17.5M	256	6	8	2	768	256	50,000	BPE
`nano`	~50M	512	8	8	4	1024	512	32,000	BPE
`small`	~87M	1536 (latent 896)	10	14	7	2304	1024	8,000	BPE + MLA
`home-11gb`	~280M	1536	18	14	7	3328	2048	50,000	BPE + MLA + MTP
`home-max`	~349M	1536	22	14	7	3328	2048	50,000	BPE + MLA + MTP

All variants share a common base architecture inspired by DeepSeek-V3 / R1 — RoPE positional encoding, RMSNorm, SwiGLU MLPs, grouped-query attention, optional Multi-head Latent Attention (MLA), optional Multi-Token Prediction (MTP) head, and an optional Mixture-of-Experts (MoE) path for the home-moe variant.
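As a rough sanity check on the table, the parameter count of the smallest variant can be estimated from its listed dimensions. The sketch below assumes tied input/output embeddings, no bias terms, and plain grouped-query attention without MLA or MTP. These are illustrative assumptions rather than confirmed details of the repository's modeling code, and `estimate_params` is a hypothetical helper.

```python
def estimate_params(vocab, hidden, layers, heads, kv_heads, mlp):
    """Rough parameter count for a decoder-only transformer.

    Assumes tied embeddings, no biases, GQA attention, a SwiGLU MLP,
    two RMSNorm weight vectors per layer, and one final norm.
    """
    head_dim = hidden // heads
    embed = vocab * hidden                         # shared with the output head (assumed)
    q_and_o = 2 * hidden * hidden                  # query and output projections
    k_and_v = 2 * hidden * (kv_heads * head_dim)   # fewer KV heads under GQA
    swiglu = 3 * hidden * mlp                      # gate, up, and down projections
    norms = 2 * hidden
    per_layer = q_and_o + k_and_v + swiglu + norms
    return embed + layers * per_layer + hidden     # trailing term is the final norm


total = estimate_params(vocab=50_000, hidden=256, layers=6, heads=8, kv_heads=2, mlp=768)
print(f"{total / 1e6:.2f}M")  # 17.33M
```

Under these assumptions the estimate lands within about 1% of the 17.5M listed for `cpu_max_5h_50k`, with the embedding table accounting for roughly three quarters of the total. The same formula does not apply to the MLA / MTP variants, whose attention projections are structured differently.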

Architecture

Mindees is a decoder-only transformer with the following design choices:

Aspect	Implementation
Position encoding	RoPE (Rotary Positional Embedding), base 10,000 (small) → 500,000 (`home-*`)
Normalization	RMSNorm pre-norm, eps 1e-6
Activation	SwiGLU in MLPs
Attention	Grouped-query attention; MLA (Multi-head Latent Attention) optional, latent dim 64–160
Auxiliary head	Multi-Token Prediction (MTP) optional — accelerates training and improves coherence at small scale
Routing	Mixture-of-Experts optional (`home-moe` variant), top-2 routing with load-balancing loss
Optimization	AdamW, β₁=0.9, β₂=0.95, weight decay 0.1; cosine LR schedule with 100-step linear warmup
Precision	FP32 for `home-*`, mixed FP16 / BF16 (AMP) for GPU training; gradient checkpointing on by default
Reasoning mode	Compatible with GRPO (Group Relative Policy Optimization) fine-tuning for stage 2
Speculative decoding	MTP head doubles as draft model for self-speculative decoding
Reasoning eval	Eval harness scaffolded for HellaSwag, MMLU, GSM8K, HumanEval (results pending)

The full architecture and modeling code lives at github.com/aashir-athar/mindeesai/tree/main/core/mindees-mind.
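For readers unfamiliar with rotary position encoding, here is a minimal pure-Python sketch of the idea. It uses the interleaved-pair convention and the base of 10,000 listed for the small variant. Real implementations operate on batched tensors and may lay out the pairs differently, so this should be read as an illustration and not as the repository's code.

```python
import math


def rope(vec, position, base=10_000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even head dimension"
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)   # earlier pairs rotate faster
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# The attention score depends only on the offset between positions (3 in both cases).
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(abs(score_near - score_far) < 1e-9)  # True
```

Because each pair is rotated rather than shifted, vector norms are preserved and position enters the attention score only through the relative offset.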

Quickstart — Loading the Checkpoint

The checkpoint is shipped as a raw PyTorch state_dict named base.bin. Loading requires the modeling code from the mindeesai repository.

pip install torch huggingface_hub
git clone https://github.com/aashir-athar/mindeesai.git
cd mindeesai

import torch
from huggingface_hub import hf_hub_download
from core.mindees_mind import MindeesModel
from core.mindees_mind.model.config import getModelConfig

# Pick a revision: "main" | "small-weekly" | "kaggle-weekly" | "colab-burst"
revision = "kaggle-weekly"
variant  = "home-11gb"  # must match the revision's variant — see table above

# Download weights + tokenizer from this repo at the chosen revision
weights_path   = hf_hub_download(repo_id="aashir-athar/mindeesai-base", filename="base.bin",        revision=revision)
tokenizer_path = hf_hub_download(repo_id="aashir-athar/mindeesai-base", filename="tokenizer.json", revision=revision)

# Build the model from variant config and load the weights
cfg   = getModelConfig(variant)
model = MindeesModel(cfg).eval()
model.load_state_dict(torch.load(weights_path, map_location="cpu"))

# Generate
prompt_ids = model.tokenize_prompt("Hello, who are you?", tokenizer_path)
output_ids = model.generate(prompt_ids, max_new_tokens=128, temperature=0.7, top_p=0.9)
print(model.detokenize(output_ids, tokenizer_path))

Or for the smaller, faster CPU variant:

revision = "small-weekly"
variant  = "cpu_max_5h_50k"   # 17.5M params, fits in <100 MB RAM
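The sub-100 MB figure is consistent with a back-of-envelope estimate of weight memory. The sketch assumes FP32 storage at 4 bytes per parameter for the CPU variant. Activations, the tokenizer, and framework overhead come on top and are not modeled here.

```python
def weight_megabytes(params, bytes_per_param=4):
    """Memory needed for the weights alone (FP32 by default)."""
    return params * bytes_per_param / 1e6


print(weight_megabytes(17.5e6))     # 70.0  (FP32)
print(weight_megabytes(280e6, 2))   # 560.0 (FP16 / BF16)
```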

Training Data

The active training mix is documented at scripts/data/mix-broadbrain.json (v4.1 — Quality-pruned, gating-safe). It contains 22 datasets totaling ~42M tokens, and every entry has been verified to load without authentication.

Signal Share by Category

Category	Share	Sources
Broad assistant chat	~27%	OpenHermes-2.5, smoltalk, WizardLM evol-instruct
Code	~28%	Magicoder Evol-Instruct, CodeFeedback Filtered, OpenCoder-SFT-stage2, CodeAlpaca, CodeParrot-clean
Anchor (human-curated)	~19%	Dolly-15k, no_robots, smol-smoltalk
Math + reasoning	~19%	MetaMathQA, MathInstruct, Open-Platypus, UltraInteract-SFT, OpenThoughts2-1M
Knowledge / warmup	~4%	FineWeb-Edu, TinyStories
Persona protection	~6%	SystemChat-1.1 (counter-acts robotic register)
Domain spice	~1%	andstor/smart_contracts (Solidity / Web3)
Empathy	~0.7%	Empathetic-Counseling, Mental-Health-Counseling (low weight to avoid clinical drift)

Tier-Weighted Highlights

Weight	Dataset	Why
2.5	`databricks/databricks-dolly-15k`	Zero-synthetic human anchor
2.5	`HuggingFaceH4/no_robots`	Highest instruction quality per token in the mix
2.0	`HuggingFaceTB/smol-smoltalk`	HF's instruction dataset specifically tuned for sub-1B models
1.8	`abacusai/SystemChat-1.1`	Diverse system prompts — defends persona stability
1.8	`ise-uiuc/Magicoder-Evol-Instruct-110K`	Highest-quality code SFT on HF
1.5	`teknium/OpenHermes-2.5`	Broad-coverage instruction examples
1.5	`meta-math/MetaMathQA`	395K math problems with worked CoT
1.5	`TIGER-Lab/MathInstruct`	Hybrid CoT + program-of-thought math
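One plausible reading of these tier weights is as relative sampling weights. The sketch below normalizes a few of them into probabilities and draws dataset names accordingly. How the training script actually applies the weights (for example, per example or per dataset) is not stated here, so the sampling scheme and both helper functions are assumptions.

```python
import random

# Weights copied from the table above; the sampling scheme itself is an assumption.
MIX = {
    "databricks/databricks-dolly-15k": 2.5,
    "HuggingFaceH4/no_robots": 2.5,
    "HuggingFaceTB/smol-smoltalk": 2.0,
    "teknium/OpenHermes-2.5": 1.5,
}


def sampling_probabilities(mix):
    """Normalize tier weights into per-dataset sampling probabilities."""
    total = sum(mix.values())
    return {name: weight / total for name, weight in mix.items()}


def sample_sources(mix, n, seed=0):
    """Draw `n` dataset names in proportion to their weights."""
    rng = random.Random(seed)
    names = list(mix)
    return rng.choices(names, weights=[mix[name] for name in names], k=n)


probs = sampling_probabilities(MIX)
print(round(probs["HuggingFaceH4/no_robots"], 3))  # 0.294
```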

Datasets Staged for Future Stages

Three preference datasets (HumanLLMs/Human-Like-DPO-Dataset, HuggingFaceH4/ultrafeedback_binarized, openbmb/UltraFeedback) and one agentic-tool-use dataset (nvidia/Nemotron-RL-Agentic-SWE-Pivot-v1) are documented at scripts/data/mix-dpo-human.json and scripts/data/mix-agentic-code.json. They are reserved for a planned DPO / RLHF / agentic-training stage and are not part of the current SFT training mix.

Training Procedure

Hyperparameters (Active Configuration)

Hyperparameter	Value	Notes
Optimizer	AdamW	β₁=0.9, β₂=0.95, ε=1e-8
Weight decay	0.1	Applied to non-norm parameters
Learning rate	3e-4 (peak)	Cosine schedule, 100-step linear warmup
Effective batch	8 sequences (`home-11gb`) / 4 sequences (`cpu_max_5h_50k`)	After grad-accumulation
Sequence length	2048 (`home-*`) / 256 (`cpu_max_5h_50k`)	Per Config
Gradient clipping	1.0	L2 norm
Completion-only loss	`--completion-only-loss 1`	Loss only on assistant turns (dialogue samples)
Persona loss weight	0.05	Soft signal — keeps Mindees voice without overfitting
Distill corpus weight	4.0	Real chat turns weighted 4× over base SFT mix
Base corpus weight	1.0	Seed conversations
Checkpoint every	250 steps (GH Actions) / 1000 (Kaggle/Colab)	Resume-safe granularity
Validation every	750 steps (GH Actions) / 500 (Kaggle/Colab)	Reports `val_loss` to `data/training-metrics.jsonl`
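The learning-rate schedule in the table (3e-4 peak, 100-step linear warmup, cosine decay) can be sketched as a plain function of the step index. The total step count and the floor `min_lr` are not given in this card, so both are hypothetical parameters here.

```python
import math


def learning_rate(step, total_steps, peak_lr=3e-4, warmup_steps=100, min_lr=0.0):
    """Linear warmup to `peak_lr`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# total_steps=10_000 is an illustrative value, not the project's schedule length.
for step in (0, 99, 100, 5_050, 10_000):
    print(step, f"{learning_rate(step, total_steps=10_000):.2e}")
```

The rate reaches its peak at the end of warmup, passes through half the peak at the midpoint of the decay, and ends at `min_lr`.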

Federated Training Topology

Training is distributed across four independent compute pools, each pushing to its own HF branch. Every run resumes from the prior session's checkpoint so steps accumulate indefinitely:

┌──────────────────────────────────────────────────────────────────────┐
│  Local RTX 5070       →  main           (owner-driven, sacrosanct)   │
│  GitHub Actions cron  →  small-weekly   (4× daily, CPU, 17.5M)       │
│  Kaggle Notebooks     →  kaggle-weekly  (weekly, T4 GPU, 280M)       │
│  Google Colab         →  colab-burst    (burst, T4 GPU, 280M)        │
└──────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
                  ┌────────────────────────────────┐
                  │  HuggingFace Hub (this repo)   │
                  │  4 independent revisions       │
                  └────────────────────────────────┘
                                  │
                                  ▼
                  ┌────────────────────────────────┐
                  │  Cloudflare Workers deploy     │
                  │  + HF Spaces ML/vector sidecar │
                  │  Cost: $0/month forever        │
                  └────────────────────────────────┘

Reproducible training entry points live at scripts/train/ and the four notebooks at scripts/notebooks/.

The Mindees Persona

Unlike most foundation models, MindeesAI ships with a deliberately scoped first-person identity named Mindees. The persona is not a system-prompt overlay — it is woven into training via a dedicated --corpus and --distill-corpus weighting and reinforced by abacusai/SystemChat-1.1, which teaches the model to honor diverse system prompts without slipping into the robotic "as an AI" default register.

A live 8-dimensional mood tensor evolves each turn:

Dimension	Range	Role
Curiosity	0–1	Pulls toward asking clarifying / exploratory questions
Warmth	0–1	Softens phrasing, mirrors user affect
Playfulness	0–1	Allows tasteful humor, wordplay
Focus	0–1	Trims preamble, prioritizes precision
Wonder	0–1	Encourages metaphor, broader framing
Frustration	0–1	Triggers de-escalation routines when high
Calm	0–1	Steadies tone on tense turns
Confidence	0–1	Modulates hedging language

Mood is exposed at /api/mood on any active deployment. It is fed into every generation step as part of the persona signal and persisted in Cloudflare R2 between turns.
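To make the shape of the mood state concrete, the sketch below keeps the eight dimensions in a dictionary and clamps every value to the 0–1 range after each update. The update rule (a simple moving average toward a per-turn signal) and the `rate` parameter are hypothetical; the card does not describe how the real state evolves.

```python
MOOD_DIMENSIONS = (
    "curiosity", "warmth", "playfulness", "focus",
    "wonder", "frustration", "calm", "confidence",
)


def clamp01(value):
    """Keep a mood value inside the documented 0-1 range."""
    return max(0.0, min(1.0, value))


def update_mood(mood, signal, rate=0.25):
    """Nudge each dimension toward `signal`; untouched dimensions stay put.

    The moving-average rule and `rate` are hypothetical.
    """
    return {
        name: clamp01(mood[name] + rate * (signal.get(name, mood[name]) - mood[name]))
        for name in MOOD_DIMENSIONS
    }


mood = {name: 0.5 for name in MOOD_DIMENSIONS}
mood = update_mood(mood, {"frustration": 1.0, "calm": 0.0})
print(mood["frustration"], mood["calm"], mood["warmth"])  # 0.625 0.375 0.5
```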

Self-Improvement Loop

A 30-minute cron triggers /api/cron/self-improve on any active deployment, which runs the following pipeline:

Reflect — read the most recent chat turns from R2.
Extract — distill new instruction / response pairs into data/distill-corpus.jsonl.
Filter — score each pair via the HumanLLMs/Human-Like-DPO-Dataset-style heuristic and drop low-quality pairs.
PII-scrub — every appended line passes through Xenova/piiranha-v1-detect-personal-information + a regex backstop before persisting (emails, phones, credit cards, SSNs, addresses, IBANs, license numbers).
Persist — write the cleaned distill corpus + thumbs-up/down feedback to R2.
Train (on next cron tick) — the scheduled GitHub Actions workflow (4× daily on small-weekly) fetches the latest distill corpus from R2 and prepends it to the SFT mix, weighted 4× over base data.

The model learns from its own conversations, with privacy protection built into the persistence layer. Public-revision checkpoints (small-weekly, kaggle-weekly) only ever contain weights trained on PII-scrubbed conversation data.
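The regex backstop in the PII-scrub step can be pictured as a small table of patterns applied after the model-based detector. The three patterns below are illustrative only and are not taken from the repository; a production scrubber would need much broader coverage (phones, addresses, IBANs, license numbers) and locale-aware rules.

```python
import re

# Illustrative patterns only; names and regexes are assumptions.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),  # 13 to 16 digits, optional separators
}


def scrub(text):
    """Replace each regex match with a bracketed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(scrub("Reach me at jane.doe@example.com, SSN 123-45-6789."))
# Reach me at [EMAIL], SSN [SSN].
```

Running the regex pass last means anything the detector misses in a well-known format still gets replaced before the line is persisted.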

Deployment & Infrastructure

MindeesAI is deployed end-to-end on $0/month free-tier infrastructure — no Vercel Pro, no Cloudflare Paid, no GPU rentals.

Layer	Provider	Free quota	Role
Web app	Cloudflare Workers Free	100k requests/day	SSR, chat streaming, API routes
ML + vector sidecar	Hugging Face Spaces (Docker)	16 GB RAM, 50 GB disk	LanceDB vector store + 7 ML pipelines (PII, NER, sentiment, toxicity, reranker, zero-shot, summarizer)
Object storage	Cloudflare R2	10 GB, 1M Class-A ops/mo	Persistent chat memory, distill corpus, mood state
Model checkpoints	Hugging Face Hub (this repo)	Unlimited public	Federated revisions, version history
Continual training	GitHub Actions	Unlimited for public repos	4× daily SFT cron on `small-weekly`
Burst GPU training	Kaggle Notebooks	30 GPU-hours/week	Heavy `home-11gb` training on `kaggle-weekly`
Backup GPU training	Google Colab Free	T4, idle-disconnect	Spillover heavy training on `colab-burst`

Architecture details are in docs/CLOUDFLARE_HF_DEPLOY.md. The native binaries (LanceDB, ONNX, transformers.js) that Cloudflare Workers cannot load are isolated into the sidecar at aashir-athar/mindeesai-sidecar, which the Worker calls over HTTPS with a Bearer token.
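The wire format of a sidecar call can be illustrated with the Python standard library, independent of the language the Worker itself is written in. The host, route, token, and payload shape below are all hypothetical placeholders; only the use of HTTPS with a Bearer token comes from the description above. The request is built but not sent.

```python
import json
import urllib.request


def build_sidecar_request(base_url, path, token, payload):
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical host, route, and token, for illustration only.
request = build_sidecar_request(
    "https://sidecar.example.invalid", "/pii/scrub", "YOUR_TOKEN", {"text": "hello"}
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request, timeout=10)
```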

Intended Use

Use case	Suitability	Notes
Educational / research use	Yes	Primary intended use. Architecture, training code, recipes all open.
Personal assistant prototype	Yes	The full self-hostable stack ships in the source repo.
Studying small-model behavior	Yes	Comparable to SmolLM / TinyLlama for under-1B research.
Production user-facing applications	No, at this size	Use a larger model (Llama-3.3-70B, Claude, etc.) via the LLM router. Mindees Native is reserved for cases where 280M is genuinely sufficient.
Safety-critical decision making	No	This is a research-stage model with limited evaluation.
Medical, legal, or financial advice	No	Empathy-counseling data is included at low weight to soften tone, not to qualify the model as a domain expert.

Limitations & Known Issues

Capacity ceiling. At 17.5M (cpu_max_5h_50k) and 280M (home-11gb) parameters, the model fundamentally lacks the representational capacity of frontier models. Expect factual recall errors, arithmetic mistakes, and hallucinated code APIs.
English-dominant. ~99% of the training mix is English. Performance on other languages is incidental.
In-progress training. The small-weekly revision has plateaued at validation loss ≈ 3.5 (perplexity ≈ 34) — saturated for its capacity. The home-11gb runs on kaggle-weekly are still in early steps (~10k of an effective 200k+ schedule); expect meaningful quality only after further cumulative training.
Completion-only loss interaction with raw data. Steps composed entirely of raw-kind samples (FineWeb-Edu, TinyStories, CodeParrot, Solidity) currently compute zero loss because --completion-only-loss 1 masks tokens outside an assistant turn. A planned fix will apply standard CLM loss to raw samples.
No formal evaluation yet. Standard benchmark numbers (MMLU, HellaSwag, GSM8K, HumanEval) have not been published for this checkpoint. Trust the loss curves only as relative-progress indicators.
Bias inherited from training data. Synthetic data sources (OpenHermes, Magicoder, etc.) carry the biases of their teacher models. The persona system can soften the register of this bias but does not eliminate the content.
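The completion-only loss interaction listed above can be illustrated with a toy masked average. The sketch treats a sample as a list of per-token losses plus a mask marking assistant tokens. It is a simplified reading of the described behavior, not the training script's implementation, and `masked_mean_loss` is a hypothetical helper.

```python
def masked_mean_loss(token_losses, is_assistant_token):
    """Average per-token loss over assistant tokens only.

    Returns 0.0 when every token is masked out, which is the
    zero-loss case described for raw-text samples (assumed reading).
    """
    kept = [loss for loss, keep in zip(token_losses, is_assistant_token) if keep]
    return sum(kept) / len(kept) if kept else 0.0


losses = [3.2, 3.6, 3.4, 3.8]

# Dialogue sample: the last two tokens belong to the assistant turn.
dialogue = masked_mean_loss(losses, [False, False, True, True])

# Raw-text sample: no assistant turn, so nothing contributes to the loss.
raw_text = masked_mean_loss(losses, [False, False, False, False])

print(round(dialogue, 2), raw_text)  # 3.6 0.0
```

A step made up entirely of raw-text samples therefore produces no gradient signal, which is why applying standard next-token loss to those samples is the planned fix.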

License

This model is released under the Apache License 2.0. You are free to use, modify, distribute, and build commercial products on it.

Training Data Provenance Notice

The model weights were trained on a mix of publicly available datasets, each carrying its own license. The model itself does not redistribute any training data, but downstream users intending commercial use should review the licenses of the individual datasets enumerated in the YAML metadata above. In particular:

databricks/databricks-dolly-15k — CC-BY-SA 3.0 (commercial OK with attribution + share-alike)
HuggingFaceH4/no_robots — CC-BY-NC 4.0 (non-commercial)
sahil2801/CodeAlpaca-20k — CC-BY-NC 4.0 (non-commercial)
teknium/OpenHermes-2.5, HuggingFaceTB/smol-smoltalk, HuggingFaceTB/smoltalk — typically Apache 2.0 / MIT (verify on dataset page)
All other datasets — see their individual repository pages on Hugging Face

If your downstream use is non-commercial (research, education, personal projects), all included data is usable.

Citation

If you use MindeesAI in research or downstream work, please cite the repository:

@misc{mindeesai2026,
  author       = {Aashir Athar},
  title        = {MindeesAI: A Self-Improving Open Native Transformer with a Persona},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/aashir-athar/mindeesai-base}},
  note         = {Trained from scratch on free-tier compute. Apache-2.0.},
}

Acknowledgements

MindeesAI builds on the open work of many upstream projects. Sincere thanks to:

The DeepSeek-AI team for the V3 / R1 architectural innovations (MLA, MTP, MoE patterns).
Andrej Karpathy for nanoGPT, the project that showed you can train a transformer from scratch in a few hundred lines.
Xenova and the transformers.js project for browser/edge-runnable ONNX-quantized models.
The Hugging Face team for huggingface_hub, Spaces, Datasets, and the Hub itself — the entire deployment stack depends on it.
bigcode / CodeParrot / OpenCoder for the open code corpora.
databricks, teknium, abacusai, HuggingFaceTB, m-a-p, TIGER-Lab, meta-math, openbmb, garage-bAInd, ise-uiuc, LuangMV97, Amod, roneneldan, HuggingFaceFW, WizardLMTeam, andstor, open-thoughts for the training datasets.
LanceDB for the embedded vector store.
Cloudflare and Hugging Face for the free-tier compute that makes the whole architecture economically real.

Contact

Author: Aashir Athar
GitHub: @aashir-athar
Repository: github.com/aashir-athar/mindeesai
Inference sidecar: aashir-athar/mindeesai-sidecar
Issues: GitHub Issues

— Mindees is a small brain learning out loud. —
