MindeesAI Base
A self-improving, persona-driven native transformer — trained from scratch, deployed for $0/month, designed to grow.
TL;DR
MindeesAI is a small, open, from-scratch transformer language model with a deliberately scoped personality named Mindees. It is not a fine-tune of a larger pretrained model — every parameter was learned by gradient descent on a curated mix of permissively licensed instruction, math, code, and reasoning datasets.
The project's distinguishing bet is that a continuously self-improving small model, trained across a federation of free GPU/CPU environments (your home RTX, GitHub Actions, Kaggle Notebooks, Google Colab), can become genuinely useful at sub-300M parameters when its training corpus is constantly enriched by every chat turn it serves. It is deployed end-to-end on free-tier infrastructure: Cloudflare Workers + Hugging Face Spaces + Cloudflare R2 + Hugging Face Hub + GitHub Actions.
This repository hosts the trained model weights, tokenizer, and training metrics across four independent revisions — one per training environment.
Available Revisions (Branches)
This repository uses Hugging Face Hub's git branches to host four independently-trained checkpoints of the same model family. You can pin any deployment to a specific revision via revision= when loading.
| Revision | Variant | Params | Where it was trained | Cadence |
|---|---|---|---|---|
| main | home-11gb / home-max | 280M – 349M | Local RTX 5070 (11 GB VRAM, batch 2 × grad-accum 4, AMP + grad-ckpt) | Manual, owner-driven |
| small-weekly | cpu_max_5h_50k | 17.5M | GitHub Actions cron, CPU-only (ubuntu-latest) | 4× daily, ~15k steps/run |
| kaggle-weekly | home-11gb | 280M | Kaggle T4 / P100 GPU notebooks, 12h sessions | Owner-driven (weekly) |
| colab-burst | home-11gb | 280M | Google Colab T4, idle-disconnect-aware | Owner-driven (burst) |
Every revision continues from its prior commit's optimizer + step state — training accumulates across sessions, never resets. The main revision is held sacrosanct and is never written to by CI or notebooks.
Model Variants
| Variant | Params | Hidden | Layers | Heads | KV Heads | MLP | Context | Vocab | Tokenizer |
|---|---|---|---|---|---|---|---|---|---|
| cpu_max_5h_50k | 17.5M | 256 | 6 | 8 | 2 | 768 | 256 | 50,000 | BPE |
| nano | ~50M | 512 | 8 | 8 | 4 | 1024 | 512 | 32,000 | BPE |
| small | ~87M | 1536 (latent 896) | 10 | 14 | 7 | 2304 | 1024 | 8,000 | BPE + MLA |
| home-11gb | ~280M | 1536 | 18 | 14 | 7 | 3328 | 2048 | 50,000 | BPE + MLA + MTP |
| home-max | ~349M | 1536 | 22 | 14 | 7 | 3328 | 2048 | 50,000 | BPE + MLA + MTP |
All variants share a common base architecture inspired by DeepSeek-V3 / R1 — RoPE positional encoding, RMSNorm, SwiGLU MLPs, grouped-query attention, optional Multi-head Latent Attention (MLA), optional Multi-Token Prediction (MTP) head, and an optional Mixture-of-Experts (MoE) path for the home-moe variant.
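As a rough cross-check of the table, the parameter count of the smallest variant can be estimated from its listed dimensions. The sketch below assumes tied input/output embeddings, bias-free linear layers, plain grouped-query attention (no MLA or MTP), and two RMSNorm weight vectors per layer. None of these assumptions are confirmed by the repository, and `estimate_params` is a hypothetical helper, not part of the codebase.

```python
def estimate_params(vocab, hidden, layers, heads, kv_heads, mlp):
    """Rough decoder-only parameter count under the assumptions stated above."""
    head_dim = hidden // heads
    kv_dim = kv_heads * head_dim
    attn = hidden * hidden           # query projection
    attn += 2 * hidden * kv_dim      # key and value projections (shared within a group)
    attn += hidden * hidden          # output projection
    swiglu = 3 * hidden * mlp        # gate, up, and down projections
    norms = 2 * hidden               # two RMSNorm weight vectors per layer
    embed = vocab * hidden           # tied with the output head (assumption)
    return embed + layers * (attn + swiglu + norms) + hidden  # + final norm


# Dimensions of cpu_max_5h_50k, copied from the table.
total = estimate_params(vocab=50_000, hidden=256, layers=6, heads=8, kv_heads=2, mlp=768)
print(f"{total / 1e6:.2f}M")
```

Under these assumptions the estimate comes to about 17.3M, close to the 17.5M listed for cpu_max_5h_50k. The embedding table alone accounts for 12.8M of that, which is why the 50,000-token vocabulary dominates the smallest variant.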
Architecture
Mindees is a decoder-only transformer with the following design choices:
| Aspect | Implementation |
|---|---|
| Position encoding | RoPE (Rotary Positional Embedding), base 10,000 (small) → 500,000 (home-*) |
| Normalization | RMSNorm pre-norm, eps 1e-6 |
| Activation | SwiGLU in MLPs |
| Attention | Grouped-query attention; MLA (Multi-head Latent Attention) optional, latent dim 64–160 |
| Auxiliary head | Multi-Token Prediction (MTP) optional — accelerates training and improves coherence at small scale |
| Routing | Mixture-of-Experts optional (home-moe variant), top-2 routing with load-balancing loss |
| Optimization | AdamW, β₁=0.9, β₂=0.95, weight decay 0.1; cosine LR schedule with 100-step linear warmup |
| Precision | FP32 for CPU training (cpu_max_5h_50k); mixed FP16 / BF16 (AMP) for GPU training (home-*); gradient checkpointing on by default |
| Reasoning mode | Compatible with GRPO (Group Relative Policy Optimization) fine-tuning for stage 2 |
| Speculative decoding | MTP head doubles as draft model for self-speculative decoding |
| Reasoning eval | Eval harness scaffolded for HellaSwag, MMLU, GSM8K, HumanEval (results pending) |
The full architecture and modeling code lives at github.com/aashir-athar/mindeesai/tree/main/core/mindees-mind.
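Two of the building blocks in the table are small enough to sketch in plain Python. This is an illustration of the general techniques, not the repository's implementation: the function names are hypothetical, and mapping consecutive query heads to one shared key/value head is the common convention, assumed here rather than confirmed.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale x by the reciprocal of its root-mean-square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def kv_head_for(query_head, n_heads, n_kv_heads):
    """Grouped-query attention: consecutive query heads share one key/value head."""
    group_size = n_heads // n_kv_heads
    return query_head // group_size


normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])

# home-11gb lists 14 query heads and 7 KV heads, so each KV head serves 2 query heads.
mapping = [kv_head_for(h, n_heads=14, n_kv_heads=7) for h in range(14)]
```

With 7 KV heads instead of 14, the key/value cache is half the size it would be under standard multi-head attention.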
Quickstart — Loading the Checkpoint
The checkpoint is shipped as a raw PyTorch state_dict named base.bin. Loading requires the modeling code from the mindeesai repository.
pip install torch huggingface_hub
git clone https://github.com/aashir-athar/mindeesai.git
cd mindeesai
import torch
from huggingface_hub import hf_hub_download
from core.mindees_mind import MindeesModel
from core.mindees_mind.model.config import getModelConfig
# Pick a revision: "main" | "small-weekly" | "kaggle-weekly" | "colab-burst"
revision = "kaggle-weekly"
variant = "home-11gb" # must match the revision's variant — see table above
# Download weights + tokenizer from this repo at the chosen revision
weights_path = hf_hub_download(repo_id="aashir-athar/mindeesai-base", filename="base.bin", revision=revision)
tokenizer_path = hf_hub_download(repo_id="aashir-athar/mindeesai-base", filename="tokenizer.json", revision=revision)
# Build the model from variant config and load the weights
cfg = getModelConfig(variant)
model = MindeesModel(cfg).eval()
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
# Generate
prompt_ids = model.tokenize_prompt("Hello, who are you?", tokenizer_path)
output_ids = model.generate(prompt_ids, max_new_tokens=128, temperature=0.7, top_p=0.9)
print(model.detokenize(output_ids, tokenizer_path))
Or for the smaller, faster CPU variant:
revision = "small-weekly"
variant = "cpu_max_5h_50k" # 17.5M params, fits in <100 MB RAM
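Because a mismatched revision and variant would only surface later as a state_dict shape error, a small guard before downloading can fail earlier with a clearer message. The mapping below is copied from the revisions table; the helper name `check_revision_variant` is hypothetical and not part of the mindeesai repository.

```python
# Revision -> variants hosted on that branch, copied from the revisions table.
REVISION_VARIANTS = {
    "main": {"home-11gb", "home-max"},
    "small-weekly": {"cpu_max_5h_50k"},
    "kaggle-weekly": {"home-11gb"},
    "colab-burst": {"home-11gb"},
}


def check_revision_variant(revision: str, variant: str) -> None:
    """Raise ValueError early if the variant is not hosted on the revision."""
    allowed = REVISION_VARIANTS.get(revision)
    if allowed is None:
        raise ValueError(f"unknown revision {revision!r}")
    if variant not in allowed:
        raise ValueError(
            f"variant {variant!r} is not hosted on {revision!r}; "
            f"expected one of {sorted(allowed)}"
        )


check_revision_variant("kaggle-weekly", "home-11gb")  # passes silently
```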
Training Data
The active training mix is documented at scripts/data/mix-broadbrain.json (v4.1 — Quality-pruned, gating-safe). 22 datasets, ~42M tokens total, every entry verified to load without authentication.
Signal Share by Category
| Category | Share | Sources |
|---|---|---|
| Broad assistant chat | ~27% | OpenHermes-2.5, smoltalk, WizardLM evol-instruct |
| Code | ~28% | Magicoder Evol-Instruct, CodeFeedback Filtered, OpenCoder-SFT-stage2, CodeAlpaca, CodeParrot-clean |
| Anchor (human-curated) | ~19% | Dolly-15k, no_robots, smol-smoltalk |
| Math + reasoning | ~19% | MetaMathQA, MathInstruct, Open-Platypus, UltraInteract-SFT, OpenThoughts2-1M |
| Knowledge / warmup | ~4% | FineWeb-Edu, TinyStories |
| Persona protection | ~6% | SystemChat-1.1 (counter-acts robotic register) |
| Domain spice | ~1% | andstor/smart_contracts (Solidity / Web3) |
| Empathy | ~0.7% | Empathetic-Counseling, Mental-Health-Counseling (low weight to avoid clinical drift) |
Tier-Weighted Highlights
| Weight | Dataset | Why |
|---|---|---|
| 2.5 | databricks/databricks-dolly-15k | Zero-synthetic human anchor |
| 2.5 | HuggingFaceH4/no_robots | Highest instruction quality per token in the mix |
| 2.0 | HuggingFaceTB/smol-smoltalk | HF's instruction dataset specifically tuned for sub-1B models |
| 1.8 | abacusai/SystemChat-1.1 | Diverse system prompts; defends persona stability |
| 1.8 | ise-uiuc/Magicoder-Evol-Instruct-110K | Highest-quality code SFT on HF |
| 1.5 | teknium/OpenHermes-2.5 | Broad-coverage instruction examples |
| 1.5 | meta-math/MetaMathQA | 395K math problems with worked CoT |
| 1.5 | TIGER-Lab/MathInstruct | Hybrid CoT + program-of-thought math |
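One way to read the tier weights is as relative sampling probabilities. The sketch below takes that reading, which is an assumption: how the training script actually consumes the weights in mix-broadbrain.json is not documented here. It also covers only the eight highlighted datasets, not the full 22-dataset mix, and the function names are hypothetical.

```python
import random

# Weights copied from the highlights table (a subset of the full mix).
TIER_WEIGHTS = {
    "databricks/databricks-dolly-15k": 2.5,
    "HuggingFaceH4/no_robots": 2.5,
    "HuggingFaceTB/smol-smoltalk": 2.0,
    "abacusai/SystemChat-1.1": 1.8,
    "ise-uiuc/Magicoder-Evol-Instruct-110K": 1.8,
    "teknium/OpenHermes-2.5": 1.5,
    "meta-math/MetaMathQA": 1.5,
    "TIGER-Lab/MathInstruct": 1.5,
}


def sampling_probabilities(weights):
    """Normalize raw weights so they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_datasets(weights, k, seed=0):
    """Draw k dataset names with probability proportional to weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


probs = sampling_probabilities(TIER_WEIGHTS)
batch_sources = sample_datasets(TIER_WEIGHTS, k=5)
```

Under this reading, a weight-2.5 dataset is drawn about 1.67 times as often as a weight-1.5 dataset, regardless of how many rows each contains.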
Datasets Staged for Future Stages
Three preference datasets (HumanLLMs/Human-Like-DPO-Dataset, HuggingFaceH4/ultrafeedback_binarized, openbmb/UltraFeedback) and one agentic-tool-use dataset (nvidia/Nemotron-RL-Agentic-SWE-Pivot-v1) are documented at scripts/data/mix-dpo-human.json and scripts/data/mix-agentic-code.json. They are reserved for a planned DPO / RLHF / agentic-training stage and are not part of the current SFT pretraining.
Training Procedure
Hyperparameters (Active Configuration)
| Hyperparameter | Value | Notes |
|---|---|---|
| Optimizer | AdamW | β₁=0.9, β₂=0.95, ε=1e-8 |
| Weight decay | 0.1 | Applied to non-norm parameters |
| Learning rate | 3e-4 (peak) | Cosine schedule, 100-step linear warmup |
| Effective batch | 8 sequences (home-11gb) / 4 sequences (cpu_max_5h_50k) | After grad-accumulation |
| Sequence length | 2048 (home-*) / 256 (cpu_max_5h_50k) | Per config |
| Gradient clipping | 1.0 | L2 norm |
| Completion-only loss | --completion-only-loss 1 | Loss only on assistant turns (dialogue samples) |
| Persona loss weight | 0.05 | Soft signal — keeps Mindees voice without overfitting |
| Distill corpus weight | 4.0 | Real chat turns weighted 4× over base SFT mix |
| Base corpus weight | 1.0 | Seed conversations |
| Checkpoint every | 250 steps (GH Actions) / 1000 (Kaggle/Colab) | Resume-safe granularity |
| Validation every | 750 (GH Actions) / 500 (Kaggle/Colab) | Reports val_loss to data/training-metrics.jsonl |
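The learning-rate row describes a cosine schedule with a 100-step linear warmup and a 3e-4 peak. A minimal sketch of that shape follows. The total step count and the floor learning rate are not given in the table, so `total_steps` and `min_lr` are placeholder assumptions, and `lr_at` is a hypothetical name.

```python
import math


def lr_at(step, peak_lr=3e-4, warmup_steps=100, total_steps=10_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches the peak.
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at(s) for s in (0, 99, 100, 5_050, 10_000)]
```

Because every run resumes from the prior session's step count, the schedule should be evaluated at the cumulative step, not at a per-session counter that restarts from zero.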
Federated Training Topology
Training is distributed across four independent compute pools, each pushing to its own HF branch. Every run resumes from the prior session's checkpoint so steps accumulate indefinitely:
┌──────────────────────────────────────────────────────────────────────┐
│ Local RTX 5070 → main (owner-driven, sacrosanct) │
│ GitHub Actions cron → small-weekly (4× daily, CPU, 17.5M) │
│ Kaggle Notebooks → kaggle-weekly (weekly, T4 GPU, 280M) │
│ Google Colab → colab-burst (burst, T4 GPU, 280M) │
└──────────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ HuggingFace Hub (this repo) │
│ 4 independent revisions │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ Cloudflare Workers deploy │
│ + HF Spaces ML/vector sidecar │
│ Cost: $0/month forever │
└────────────────────────────────┘
Reproducible training entry points live at scripts/train/ and the four notebooks at scripts/notebooks/.
The Mindees Persona
Unlike most foundation models, MindeesAI ships with a deliberately scoped first-person identity named Mindees. The persona is not a system-prompt overlay — it is woven into training via a dedicated --corpus and --distill-corpus weighting and reinforced by abacusai/SystemChat-1.1, which teaches the model to honor diverse system prompts without slipping into the robotic "as an AI" default register.
A live 8-dimensional mood tensor evolves each turn:
| Dimension | Range | Role |
|---|---|---|
| Curiosity | 0–1 | Pulls toward asking clarifying / exploratory questions |
| Warmth | 0–1 | Softens phrasing, mirrors user affect |
| Playfulness | 0–1 | Allows tasteful humor, wordplay |
| Focus | 0–1 | Trims preamble, prioritizes precision |
| Wonder | 0–1 | Encourages metaphor, broader framing |
| Frustration | 0–1 | Triggers de-escalation routines when high |
| Calm | 0–1 | Steadies tone on tense turns |
| Confidence | 0–1 | Modulates hedging language |
Mood is exposed at /api/mood on any active deployment. It is fed into every generation step as part of the persona signal and persisted in Cloudflare R2 between turns.
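To make the mood tensor concrete, here is a minimal sketch of an 8-dimensional state that moves toward a per-turn signal and stays inside the 0–1 range. The dimension names come from the table. Everything else is an assumption for illustration: the default values, the `rate` parameter, and the moving-average update rule are not described in the source, and the real /api/mood payload may be shaped differently.

```python
from dataclasses import dataclass, fields


@dataclass
class Mood:
    # Default values are illustrative placeholders, not the project's.
    curiosity: float = 0.5
    warmth: float = 0.5
    playfulness: float = 0.5
    focus: float = 0.5
    wonder: float = 0.5
    frustration: float = 0.0
    calm: float = 0.5
    confidence: float = 0.5

    def updated(self, signal: dict, rate: float = 0.2) -> "Mood":
        """Move each dimension a fraction `rate` toward the turn's signal, clamped to [0, 1]."""
        values = {}
        for f in fields(self):
            current = getattr(self, f.name)
            target = signal.get(f.name, current)  # untouched dimensions stay put
            values[f.name] = min(1.0, max(0.0, current + rate * (target - current)))
        return Mood(**values)


mood = Mood().updated({"frustration": 1.0, "calm": 2.0})
```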
Self-Improvement Loop
A 30-minute cron triggers /api/cron/self-improve on any active deployment, which runs the following pipeline:
- Reflect: read the most recent chat turns from R2.
- Extract: distill new instruction / response pairs into data/distill-corpus.jsonl.
- Filter: score each pair via the HumanLLMs/Human-Like-DPO-Dataset-style heuristic and drop low-quality pairs.
- PII-scrub: every appended line passes through Xenova/piiranha-v1-detect-personal-information + a regex backstop before persisting (emails, phones, credit cards, SSNs, addresses, IBANs, license numbers).
- Persist: write the cleaned distill corpus + thumbs-up/down feedback to R2.
- Train (on next cron tick): the scheduled GitHub Actions workflow (4× daily) fetches the latest distill corpus from R2 and prepends it to the SFT mix, weighted 4× over base data.
The model literally learns from its own conversations, with privacy protection baked into the persistence layer. Public-revision checkpoints (small-weekly, kaggle-weekly) only ever contain weights trained on PII-scrubbed conversation data.
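The regex backstop in the PII-scrub step can be pictured as an ordered list of patterns applied after the model-based detector. The sketch below is illustrative only: the patterns are not the repository's, they cover just four of the listed categories (emails, SSNs, credit cards, phones), and the placeholder labels are made up for the example. The piiranha detector itself is not shown.

```python
import re

# Order matters: more specific patterns run before the broad phone pattern.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("CARD", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),
    ("PHONE", re.compile(r"(?<!\w)\+?\d[\d ().-]{7,}\d(?!\w)")),
]


def scrub(text: str) -> str:
    """Replace each matched span with a bracketed category label."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


cleaned = scrub("Reach me at jane.doe@example.com or 555-123-4567.")
```

A backstop like this trades precision for recall: it will occasionally redact harmless digit runs, which is usually the safer failure mode for data that is about to be persisted and trained on.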
Deployment & Infrastructure
MindeesAI is deployed end-to-end on $0/month free-tier infrastructure — no Vercel Pro, no Cloudflare Paid, no GPU rentals.
| Layer | Provider | Free quota | Role |
|---|---|---|---|
| Web app | Cloudflare Workers Free | 100k requests/day | SSR, chat streaming, API routes |
| ML + vector sidecar | Hugging Face Spaces (Docker) | 16 GB RAM, 50 GB disk | LanceDB vector store + 7 ML pipelines (PII, NER, sentiment, toxicity, reranker, zero-shot, summarizer) |
| Object storage | Cloudflare R2 | 10 GB, 1M Class-A ops/mo | Persistent chat memory, distill corpus, mood state |
| Model checkpoints | Hugging Face Hub (this repo) | Unlimited public | Federated revisions, version history |
| Continual training | GitHub Actions | Unlimited for public repos | 4× daily SFT cron on small-weekly |
| Burst GPU training | Kaggle Notebooks | 30 GPU-hours/week | Heavy home-11gb training on kaggle-weekly |
| Backup GPU training | Google Colab Free | T4, idle-disconnect | Spillover heavy training on colab-burst |
Architecture detail at docs/CLOUDFLARE_HF_DEPLOY.md. The native binaries (LanceDB, ONNX, transformers.js) that Cloudflare Workers cannot load are isolated into the sidecar at aashir-athar/mindeesai-sidecar and called over HTTPS + Bearer.
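The "HTTPS + Bearer" call to the sidecar amounts to a JSON POST with an Authorization header. The sketch below builds such a request without sending it. It is written in Python for consistency with the rest of this card, although the Worker itself would issue the call from JavaScript. The base URL, the endpoint path, and the payload shape are all hypothetical placeholders, not the sidecar's real API.

```python
import json
import urllib.request


def build_sidecar_request(base_url: str, path: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST carrying a Bearer token."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{path.lstrip('/')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_sidecar_request(
    "https://example.invalid",  # hypothetical sidecar base URL
    "/pii/scrub",               # hypothetical endpoint path
    token="YOUR_TOKEN",
    payload={"text": "hello"},
)
# To actually send it: urllib.request.urlopen(req, timeout=10)
```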
Intended Use
| Use case | Suitability | Notes |
|---|---|---|
| Educational / research use | Yes | Primary intended use. Architecture, training code, recipes all open. |
| Personal assistant prototype | Yes | The full self-hostable stack ships in the source repo. |
| Studying small-model behavior | Yes | Comparable to SmolLM / TinyLlama for under-1B research. |
| Production user-facing applications | No, at this size | Use a larger model (Llama-3.3-70B, Claude, etc.) via the LLM router. Mindees Native is reserved for cases where 280M is genuinely sufficient. |
| Safety-critical decision making | No | This is a research-stage model with limited evaluation. |
| Medical, legal, or financial advice | No | Empathy-counseling data is included at low weight to soften tone, not to qualify the model as a domain expert. |
Limitations & Known Issues
- Capacity ceiling. At 17.5M (cpu_max_5h_50k) and 280M (home-11gb) parameters, the model fundamentally lacks the representation capacity of frontier models. Expect factual recall errors, arithmetic mistakes, and hallucinated code APIs.
- English-dominant. ~99% of the training mix is English. Performance on other languages is incidental.
- In-progress training. The small-weekly revision has plateaued at validation loss ≈ 3.5 (perplexity ≈ 33), which is saturated for its capacity. The home-11gb runs on kaggle-weekly are still in early steps (~10k of an effective 200k+ schedule); expect meaningful quality only after further cumulative training.
- Completion-only loss interaction with raw data. Steps composed entirely of raw-kind samples (FineWeb-Edu, TinyStories, CodeParrot, Solidity) currently compute zero loss because --completion-only-loss 1 masks tokens outside an assistant turn. A planned fix will apply standard CLM loss to raw samples.
- No formal evaluation yet. Standard benchmark numbers (MMLU, HellaSwag, GSM8K, HumanEval) have not been published for this checkpoint. Trust the loss curves only as relative-progress indicators.
- Bias inherited from training data. Synthetic data sources (OpenHermes, Magicoder, etc.) carry the biases of their teacher models. The persona system can soften the register of this bias but does not eliminate the content.
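The completion-only interaction listed above is easy to see with a toy mask. The sketch below contrasts the behavior described in the limitation with one possible shape of the planned fix. It is an illustration under assumptions: the function names, the per-token role labels, and the "raw" kind flag are hypothetical, and the actual fix has not been published.

```python
def current_mask(token_roles):
    """Behavior described in the limitation: only assistant tokens contribute to the loss."""
    return [1 if role == "assistant" else 0 for role in token_roles]


def loss_mask(token_roles, kind="dialogue", completion_only=True):
    """One possible fix: raw samples fall back to a plain causal-LM mask."""
    if not completion_only or kind == "raw":
        return [1] * len(token_roles)  # every token counts
    return [1 if role == "assistant" else 0 for role in token_roles]


dialogue = ["user", "user", "assistant", "assistant"]
raw = [None, None, None]  # raw text has no assistant turn

masked_out = sum(current_mask(raw))      # no token contributes, so the step has zero loss
fixed = loss_mask(raw, kind="raw")       # all tokens contribute
unchanged = loss_mask(dialogue)          # dialogue samples keep the completion-only mask
```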
License
This model is released under the Apache License 2.0. You are free to use, modify, distribute, and build commercial products on it.
Training Data Provenance Notice
The model weights were trained on a mix of publicly available datasets, each carrying its own license. The model itself does not redistribute any training data, but downstream users intending commercial use should review the licenses of the individual datasets enumerated in the YAML metadata above. In particular:
- databricks/databricks-dolly-15k: CC-BY-SA 3.0 (commercial OK with attribution + share-alike)
- HuggingFaceH4/no_robots: CC-BY-NC 4.0 (non-commercial)
- sahil2801/CodeAlpaca-20k: CC-BY-NC 4.0 (non-commercial)
- teknium/OpenHermes-2.5, HuggingFaceTB/smol-smoltalk, HuggingFaceTB/smoltalk: typically Apache 2.0 / MIT (verify on dataset page)
- All other datasets: see their individual repository pages on Hugging Face
If your downstream use is non-commercial (research, education, personal projects), all included data is usable.
Citation
If you use MindeesAI in research or downstream work, please cite the repository:
@misc{mindeesai2026,
author = {Aashir Athar},
title = {MindeesAI: A Self-Improving Open Native Transformer with a Persona},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/aashir-athar/mindeesai-base}},
note = {Trained from scratch on free-tier compute. Apache-2.0.},
}
Acknowledgements
MindeesAI builds on the open work of many upstream projects. Sincere thanks to:
- The DeepSeek-AI team for the V3 / R1 architectural innovations (MLA, MTP, MoE patterns).
- Andrej Karpathy for nanoGPT, the model that proved you can teach a transformer from scratch in a few hundred lines.
- Xenova and the transformers.js project for browser/edge-runnable ONNX-quantized models.
- The Hugging Face team for huggingface_hub, Spaces, Datasets, and the Hub itself; the entire deployment stack depends on it.
- bigcode / CodeParrot / OpenCoder for the open code corpora.
- databricks, teknium, abacusai, HuggingFaceTB, m-a-p, TIGER-Lab, meta-math, openbmb, garage-bAInd, ise-uiuc, LuangMV97, Amod, roneneldan, HuggingFaceFW, WizardLMTeam, andstor, open-thoughts for the training datasets.
- LanceDB for the embedded vector store.
- Cloudflare and Hugging Face for the free-tier compute that makes the whole architecture economically real.
Contact
- Author: Aashir Athar
- GitHub: @aashir-athar
- Repository: github.com/aashir-athar/mindeesai
- Inference sidecar: aashir-athar/mindeesai-sidecar
- Issues: GitHub Issues
— Mindees is a small brain learning out loud. —