PowerGQA-778M

Custom small language model (~778M parameters) built around a non-standard attention block (GQA + learnable head-mixing + a local depthwise-conv branch), using the Qwen2.5 tokenizer vocabulary.

This is the ckpt_13500 checkpoint from a long SFT recovery sequence ending in a LIMA-style ultra-quality dialogue mix (Capybara, Pure-Dove, NoRobots, Bespoke-Stratos). It is the best dialogue checkpoint produced during the run.

Note: This is an experimental research artifact, not a production assistant. The model is undertrained for its size and is best used as a small teacher for narrow data-generation tasks, not as a free-form chatbot. See the honest evaluation below.

Creator: Asilarknes
Architecture: custom (PowerGQA — see train_powergqa_4090.py)
Tokenizer: Qwen/Qwen2.5-0.5B (vocab 151,665)
Precision: bf16
Context length seen during training: 1024 (most of the run)
License: Apache 2.0

Architecture (high level)

Per layer:
  q,k,v,o = nn.Linear (GQA: 24 query heads, 6 KV heads)
  RMSNorm on Q and K
  RoPE on Q and K
  scores = (Q @ K^T) / sqrt(d_h)
  scores = einsum('bhij,hg->bgij', scores, pre_talk)   # learnable head mix
  causal mask + softmax (bf16)
  attn   = einsum('bhij,hg->bgij', attn,  post_talk)   # learnable head mix
  y      = attn @ V
  y     *= sigmoid(head_gate)
  loc    = depthwise Conv1d(k=5) branch
  out    = o(y) + sigmoid(local_gate) * loc

Block: residual( attn ) + residual( SwiGLU MLP )
Stack: 22 layers, dim=1536, heads=24, kv_heads=6
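
As a sanity check on the "~778M" figure, the stack description can be turned into a rough parameter count. This is only an estimate under assumptions the card does not state: a SwiGLU hidden size of 4096, tied input/output embeddings, and bias-free linear layers. The small extras (norms, head-mixing matrices, gates, the depthwise conv) are left out.

```python
# Rough parameter count from the stack description.
# ASSUMPTIONS (not stated in the card): SwiGLU hidden size 4096, tied
# input/output embeddings, bias-free linears. Norms, head-mixing matrices,
# gates and the depthwise conv are small and omitted.
dim, n_layers, n_heads, n_kv_heads = 1536, 22, 24, 6
vocab_size = 151_665
mlp_hidden = 4096  # assumed, not from the card

head_dim = dim // n_heads                        # per-head width
kv_dim = n_kv_heads * head_dim                   # width of the shared K/V projections
attn_params = 2 * dim * dim + 2 * dim * kv_dim   # q and o, plus k and v
mlp_params = 3 * dim * mlp_hidden                # gate, up and down projections
embed_params = vocab_size * dim                  # counted once (tied embeddings assumed)

total = n_layers * (attn_params + mlp_params) + embed_params
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the estimate lands at about 778M, which is consistent with the model name; the real values live in train_powergqa_4090.py.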

The learnable pre_talk / post_talk matrices that mix attention heads mean this model cannot be loaded with AutoModelForCausalLM. Instantiate it with the included train_powergqa_4090.py instead.
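
To make the head-mixing step concrete, here is a dependency-free sketch of what einsum('bhij,hg->bgij', scores, mix) computes for a single batch item. The function name and the toy matrices are hypothetical and only illustrate the operation; the real implementation and the initialization of pre_talk / post_talk are in train_powergqa_4090.py.

```python
# Loop form of einsum('bhij,hg->bgij', scores, mix) for one batch item.
# scores[h][i][j]: per-head attention map; mix[h][g]: mixing weight from
# input head h to output head g. Names here are illustrative only.
def mix_heads(scores, mix):
    n_in, n_out = len(mix), len(mix[0])
    rows, cols = len(scores[0]), len(scores[0][0])
    return [
        [
            [sum(scores[h][i][j] * mix[h][g] for h in range(n_in))
             for j in range(cols)]
            for i in range(rows)
        ]
        for g in range(n_out)
    ]

# Toy example: 2 heads, each with a 2x2 score map.
scores = [
    [[1.0, 2.0], [3.0, 4.0]],      # head 0
    [[10.0, 20.0], [30.0, 40.0]],  # head 1
]
identity = [[1.0, 0.0], [0.0, 1.0]]
average = [[0.5, 0.5], [0.5, 0.5]]

same = mix_heads(scores, identity)    # identity mix leaves every head unchanged
blended = mix_heads(scores, average)  # each output head is the mean of both inputs
```

With an identity matrix the block reduces to ordinary per-head attention, so the mixing matrices are the only part of this step that standard GQA loaders lack.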

Files in this repo

File	What it is
`PowerGQA-778M.pt`	Model checkpoint (state_dict + cfg + step). 1.5 GB.
`train_powergqa_4090.py`	Full training script — also the architecture definition.
`sample_chat_manual_12300.py`	Reference dialogue sampling script.
`powergqa_identity_sft_static_1000.jsonl`	Identity SFT pack (1k examples, not yet applied to this checkpoint).
`tokenizer.json`, `tokenizer_config.json`, `vocab.json`, `merges.txt`, `special_tokens_map.json`	Qwen2.5 tokenizer files.
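
The listed checkpoint size can be cross-checked against the parameter count. The sketch below assumes the file holds essentially only bf16 weights (2 bytes each) and no optimizer state, which the card does not state explicitly.

```python
# Back-of-envelope checkpoint size, assuming bf16 weights only.
n_params = 778_000_000   # approximate, from the model name
bytes_per_param = 2      # bf16 stores each value in 2 bytes

size_bytes = n_params * bytes_per_param
size_gb = size_bytes / 1e9      # decimal gigabytes
size_gib = size_bytes / 2**30   # binary gibibytes
print(f"{size_gb:.2f} GB / {size_gib:.2f} GiB")
```

Both figures are in the neighborhood of the listed 1.5 GB, so the file is plausibly weights plus a small amount of metadata.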

Usage

pip install torch transformers huggingface_hub
huggingface-cli download Asilarknes/PowerGQA-778M --local-dir powergqa
cd powergqa
python sample_chat_manual_12300.py PowerGQA-778M.pt

To load programmatically:

import sys
import torch
from transformers import AutoTokenizer

sys.path.insert(0, "./")  # make train_powergqa_4090.py importable
import train_powergqa_4090 as tm

# The tokenizer files ship with the repo; the model's vocab size must match them.
tok = AutoTokenizer.from_pretrained("./")
tm.cfg.vocab_size = len(tok)
model = tm.LM().cuda().to(torch.bfloat16)

# weights_only=False unpickles arbitrary objects, so only load files you trust.
state = torch.load("PowerGQA-778M.pt", map_location="cuda", weights_only=False)
model.load_state_dict(state["model"])
model.eval()

prompt = "User: Explain why sleep helps memory.\nAssistant: "
ids = tok.encode(prompt, add_special_tokens=False)
x = torch.tensor([ids], device="cuda", dtype=torch.long)

# Greedy decoding: up to 120 new tokens, feeding at most the last 1024 tokens.
with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    for _ in range(120):
        logits = model(x[:, -1024:])[:, -1, :]  # logits for the next token only
        nxt = torch.argmax(logits, dim=-1, keepdim=True)
        x = torch.cat([x, nxt], dim=1)
        if nxt.item() == tok.eos_token_id:
            break

# Decode only the newly generated tokens, dropping the prompt.
print(tok.decode(x[0][len(ids):].tolist(), skip_special_tokens=True))

Training summary (this checkpoint)

Phase	What	Result
Phase17 (Arena/WildChat/UltraChat/OASST1)	dialogue recovery from earlier MCQ/drill overfit	repetition loops, abandoned
Phase18 (28% reasoning weight)	Bespoke + SmolTalk + OpenHermes + Tulu3 + OpenR1-Math	math improved (5 cups bug fixed once), but `"The answer is X"` formatting leaked everywhere
Phase18b (13% reasoning weight)	rebalanced mix, dialogue heavier	best numeric score 75% (vs 81.7% baseline); first correct ratio on the recipe bug
Phase19 (this ckpt)	LIMA mix: Capybara(45) + Pure-Dove(25) + NoRobots(20) + Bespoke-Stratos(10), lr=1e-5	first time Q8/Q9 produced genuinely helpful structured answers

The checkpoint corresponds to step=13500 in the original training script.
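
The Phase19 mix in the table can be read as per-source sampling weights. The sketch below assumes the numbers in parentheses are percentages of sampled examples (the card does not say whether they count examples or tokens) and uses a hypothetical helper, sample_sources, that is not part of the training script.

```python
import random

# Phase19 mix from the table; the numbers are ASSUMED to be percent sampling weights.
MIX = {"Capybara": 45, "Pure-Dove": 25, "NoRobots": 20, "Bespoke-Stratos": 10}

def sample_sources(n, seed=0):
    """Draw n dataset names in proportion to MIX (hypothetical helper)."""
    rng = random.Random(seed)
    names = list(MIX)
    weights = [MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

draws = sample_sources(10_000)
shares = {name: draws.count(name) / len(draws) for name in MIX}
print(shares)  # empirical shares should sit close to 0.45 / 0.25 / 0.20 / 0.10
```

Sampling by source first and then picking an example within that source keeps the mix stable even when the datasets differ greatly in size.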

Honest evaluation

Tested on 15 hand-written conversational prompts. The model is partially useful:

✅ Q8 "They failed an exam and feel stupid" → "One test does not define you. We can look at what went wrong and make a plan for the next one."
✅ Q9 "4 tasks, 25 minutes, what first?" → "Write down the tasks, pick one urgent or easy task, and start with a small first step instead of trying to do everything at once."
✅ Q3 "Why does sleep help memory?" → coherent one-line answer
🟡 Q12 recipe scaling (2 cups for 4 → 10 people): produces the right formula 2*10/4=5 on some samples, drifts on others
🔴 Q11 transitive comparison (Maya < Leo < Nora): still fails, often loops
🔴 Q5, Q6, Q7, Q13, Q14, Q15: degenerate into repetition or off-topic
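
Because several prompts degenerate into repetition, anyone using the model for data generation will probably want to filter looping outputs. The following is a minimal sketch of one such filter; the function name and the thresholds (ngram=4, max_repeats=3) are hypothetical choices, not something used in the evaluation above.

```python
def has_repetition_loop(token_ids, ngram=4, max_repeats=3):
    """Return True if any n-gram of token ids occurs more than max_repeats times."""
    counts = {}
    for i in range(len(token_ids) - ngram + 1):
        key = tuple(token_ids[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] > max_repeats:
            return True
    return False

looping = [1, 2, 3, 4] * 6   # the same 4-gram six times in a row
varied = list(range(40))     # no n-gram ever repeats

print(has_repetition_loop(looping), has_repetition_loop(varied))
```

It can be applied to the generated ids (for example x[0][len(ids):].tolist() from the loading snippet) before a sample is kept.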

At roughly 320M pretraining tokens, the model is heavily undertrained for its 778M parameters. Treat it as a research artifact and small-scale teacher, not a chat assistant.
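
To put the token budget in perspective, the arithmetic below compares it with the commonly cited heuristic of about 20 training tokens per parameter for compute-optimal training. That reference point is an assumption brought in for illustration; the card does not name a target.

```python
pretrain_tokens = 320e6   # approximate, from the card
n_params = 778e6          # approximate, from the model name

tokens_per_param = pretrain_tokens / n_params

# ASSUMED reference point: ~20 tokens per parameter (a rule of thumb, not a requirement).
rule_of_thumb = 20
target_tokens = rule_of_thumb * n_params
fraction_of_target = pretrain_tokens / target_tokens

print(f"{tokens_per_param:.2f} tokens/param, "
      f"{fraction_of_target:.1%} of a {target_tokens / 1e9:.1f}B-token budget")
```

By that yardstick the model has seen well under one token per parameter, a few percent of what the heuristic would suggest.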

Intended use

Good: narrow Q&A pair generation for distilling into a larger model
Good: studying small-model SFT dynamics (catastrophic forgetting, reasoning-data overfitting)
Bad: general conversation, factual lookup, code generation, multilingual, anything safety-critical

Limitations

Cannot reliably refuse unsafe requests (no DPO/RLHF applied)
English-only training corpus
Code generation was explicitly excluded during the dialogue phases
Long context generalization untested; trained mostly at seq_len=1024

Citation

If you use this artifact:

@misc{powergqa778m,
  title  = {PowerGQA-778M: a custom small LM with learnable head-mixing GQA},
  author = {Asilarknes},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/Asilarknes/PowerGQA-778M}}
}
