PowerGQA-778M
Custom small language model (~778M parameters) built around a non-standard attention block (GQA + learnable head-mixing + a local depthwise-conv branch), using the Qwen2.5 tokenizer vocabulary.
This is the ckpt_13500 checkpoint from a long SFT recovery sequence ending in a LIMA-style ultra-quality dialogue mix (Capybara, Pure-Dove, NoRobots, Bespoke-Stratos). It is the best dialogue checkpoint produced during the run.
Note: This is an experimental research artifact, not a production assistant. The model is undertrained for its size and is best used as a small teacher for narrow data-generation tasks, not as a free-form chatbot. See the honest evaluation below.
- Creator: Asilarknes
- Architecture: custom (PowerGQA; see train_powergqa_4090.py)
- Tokenizer: Qwen/Qwen2.5-0.5B (vocab 151,665)
- Precision: bf16
- Context length seen during training: 1024 (most of the run)
- License: Apache 2.0
Architecture (high level)
Per layer:
q,k,v,o = nn.Linear (GQA: 24 query heads, 6 KV heads)
RMSNorm on Q and K
RoPE on Q and K
scores = (Q @ K^T) / sqrt(d_h)
scores = einsum('bhij,hg->bgij', scores, pre_talk) # learnable head mix
causal mask + softmax (bf16)
attn = einsum('bhij,hg->bgij', attn, post_talk) # learnable head mix
y = attn @ V
y *= sigmoid(head_gate)
loc = depthwise Conv1d(k=5) branch
out = o(y) + sigmoid(local_gate) * loc
Block: residual( attn ) + residual( SwiGLU MLP )
Stack: 22 layers, dim=1536, heads=24, kv_heads=6
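The per-layer pseudocode above can be sketched in PyTorch roughly as follows. This is not the code from train_powergqa_4090.py. The class name HeadMixAttentionSketch, the identity initialisation of pre_talk / post_talk, the weightless RMSNorm, the RoPE base of 10000, the causal left-padding of the convolution, and feeding the block input to the conv branch are all assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def rms_norm(x, eps=1e-6):
    # Weightless RMSNorm over the head dimension (assumption: no learned scale).
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)


def rope(x, base=10000.0):
    # Rotary embedding on (B, H, T, Dh); base=10000 is an assumed default.
    t, dh = x.shape[-2], x.shape[-1]
    inv_freq = base ** (-torch.arange(0, dh, 2, device=x.device, dtype=torch.float32) / dh)
    ang = torch.outer(torch.arange(t, device=x.device, dtype=torch.float32), inv_freq)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)


class HeadMixAttentionSketch(nn.Module):
    def __init__(self, dim=1536, n_heads=24, n_kv_heads=6, kernel=5):
        super().__init__()
        self.h, self.kv, self.dh, self.kernel = n_heads, n_kv_heads, dim // n_heads, kernel
        self.q = nn.Linear(dim, n_heads * self.dh, bias=False)
        self.k = nn.Linear(dim, n_kv_heads * self.dh, bias=False)
        self.v = nn.Linear(dim, n_kv_heads * self.dh, bias=False)
        self.o = nn.Linear(dim, dim, bias=False)
        # Head-mixing matrices start as identity, i.e. plain GQA (assumption).
        self.pre_talk = nn.Parameter(torch.eye(n_heads))
        self.post_talk = nn.Parameter(torch.eye(n_heads))
        self.head_gate = nn.Parameter(torch.zeros(n_heads))
        self.local_gate = nn.Parameter(torch.zeros(1))
        self.conv = nn.Conv1d(dim, dim, kernel, groups=dim, bias=False)  # depthwise

    def forward(self, x):
        b, t, d = x.shape
        q = self.q(x).view(b, t, self.h, self.dh).transpose(1, 2)
        k = self.k(x).view(b, t, self.kv, self.dh).transpose(1, 2)
        v = self.v(x).view(b, t, self.kv, self.dh).transpose(1, 2)
        q, k = rope(rms_norm(q)), rope(rms_norm(k))
        # GQA: each KV head serves h // kv query heads.
        k = k.repeat_interleave(self.h // self.kv, dim=1)
        v = v.repeat_interleave(self.h // self.kv, dim=1)
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.dh)
        scores = torch.einsum("bhij,hg->bgij", scores, self.pre_talk)
        future = torch.ones(t, t, dtype=torch.bool, device=x.device).triu(1)
        attn = scores.masked_fill(future, float("-inf")).softmax(dim=-1)
        attn = torch.einsum("bhij,hg->bgij", attn, self.post_talk)
        y = (attn @ v) * torch.sigmoid(self.head_gate).view(1, -1, 1, 1)
        y = y.transpose(1, 2).reshape(b, t, d)
        # Local branch: left-pad so position i only sees i-4..i (assumed causal).
        loc = self.conv(F.pad(x.transpose(1, 2), (self.kernel - 1, 0))).transpose(1, 2)
        return self.o(y) + torch.sigmoid(self.local_gate) * loc


torch.manual_seed(0)
layer = HeadMixAttentionSketch(dim=64, n_heads=8, n_kv_heads=2)  # small dims for a quick check
x = torch.randn(2, 10, 64)
out = layer(x)
print(out.shape)  # torch.Size([2, 10, 64])
```

Because the head mixing acts across heads and not across positions, masked (future) entries stay zero after post_talk, so the sketch remains causal.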
The learnable pre_talk / post_talk matrices that mix attention heads mean this
model cannot be loaded as a drop-in with AutoModelForCausalLM. To instantiate
it, you need the included train_powergqa_4090.py.
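As a sanity check on the ~778M figure, the stack dimensions can be turned into a rough parameter count. The sketch below counts only the large weight matrices and rests on assumptions not stated in this card: an MLP hidden size of 4096, bias-free linear layers, and input/output embeddings that share one matrix. Under those assumptions the total lands close to 778M, which suggests (but does not confirm) that the real configuration is similar.

```python
# Rough parameter count from the dimensions in this card.
# ASSUMPTIONS (not from the card): mlp_hidden=4096, no biases, tied embeddings.
vocab, dim, layers = 151_665, 1536, 22
heads, kv_heads = 24, 6
mlp_hidden = 4096  # hypothetical; chosen because it reproduces ~778M

head_dim = dim // heads                  # 64
kv_dim = kv_heads * head_dim             # 384

attn = 2 * dim * dim + 2 * dim * kv_dim  # q and o, plus k and v
mlp = 3 * dim * mlp_hidden               # SwiGLU: gate, up, down projections
embed = vocab * dim                      # shared by input and output (assumed)

total = layers * (attn + mlp) + embed
print(f"{total:,}")  # 777,954,816
```

The norms, head-mixing matrices, gates, and depthwise convolution are left out; together they add only a small fraction of a percent.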
Files in this repo
| File | What it is |
|---|---|
| PowerGQA-778M.pt | Model checkpoint (state_dict + cfg + step). 1.5 GB. |
| train_powergqa_4090.py | Full training script, also the architecture definition. |
| sample_chat_manual_12300.py | Reference dialogue sampling script. |
| powergqa_identity_sft_static_1000.jsonl | Identity SFT pack (1k examples, not yet applied to this checkpoint). |
| tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, special_tokens_map.json | Qwen2.5 tokenizer files. |
Usage
pip install torch transformers huggingface_hub
huggingface-cli download Asilarknes/PowerGQA-778M --local-dir powergqa
cd powergqa
python sample_chat_manual_12300.py PowerGQA-778M.pt
To load programmatically:
import sys, torch
sys.path.insert(0, "./") # for train_powergqa_4090.py
import train_powergqa_4090 as tm
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("./")
tm.cfg.vocab_size = len(tok)
model = tm.LM().cuda().to(torch.bfloat16)
state = torch.load("PowerGQA-778M.pt", map_location="cuda", weights_only=False)
model.load_state_dict(state["model"])
model.eval()
prompt = "User: Explain why sleep helps memory.\nAssistant: "
ids = tok.encode(prompt, add_special_tokens=False)
x = torch.tensor([ids], device="cuda", dtype=torch.long)
# Greedy decoding, up to 120 new tokens, with the context cropped to 1024 tokens.
with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    for _ in range(120):
        logits = model(x[:, -1024:])[:, -1, :]
        nxt = torch.argmax(logits, dim=-1, keepdim=True)
        x = torch.cat([x, nxt], dim=1)
        if nxt.item() == tok.eos_token_id:
            break
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(x[0][len(ids):].tolist(), skip_special_tokens=True))
Training summary (this checkpoint)
| Phase | What | Result |
|---|---|---|
| Phase17 (Arena/WildChat/UltraChat/OASST1) | dialogue recovery from earlier MCQ/drill overfit | repetition loops, abandoned |
| Phase18 (28% reasoning weight) | Bespoke + SmolTalk + OpenHermes + Tulu3 + OpenR1-Math | math improved (5 cups bug fixed once), but "The answer is X" formatting leaked everywhere |
| Phase18b (13% reasoning weight) | rebalanced mix, dialogue heavier | best numeric score 75% (vs 81.7% baseline); first correct ratio on the recipe bug |
| Phase19 (this ckpt) | LIMA mix: Capybara(45) + Pure-Dove(25) + NoRobots(20) + Bespoke-Stratos(10), lr=1e-5 | first time Q8/Q9 produced genuinely helpful structured answers |
The checkpoint corresponds to step=13500 in the original training script.
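One way to read the Phase19 mix is as per-example sampling weights. The snippet below is a minimal sketch of that interpretation using only the standard library. The actual data loader in train_powergqa_4090.py may build the mix differently (for example by pre-shuffling fixed-size subsets), and the function name sample_sources is hypothetical.

```python
import random
from collections import Counter

# Weights as listed for Phase19; treated here as relative sampling weights.
PHASE19_MIX = {
    "Capybara": 45,
    "Pure-Dove": 25,
    "NoRobots": 20,
    "Bespoke-Stratos": 10,
}


def sample_sources(n, mix=PHASE19_MIX, seed=0):
    """Draw n dataset names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


counts = Counter(sample_sources(10_000))
for name, weight in PHASE19_MIX.items():
    print(f"{name:16s} target {weight:2d}%  drawn {100 * counts[name] / 10_000:.1f}%")
```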
Honest evaluation
Tested on 15 hand-written conversational prompts. The model is partially useful:
- ✅ Q8 "They failed an exam and feel stupid" → "One test does not define you. We can look at what went wrong and make a plan for the next one."
- ✅ Q9 "4 tasks, 25 minutes, what first?" → "Write down the tasks, pick one urgent or easy task, and start with a small first step instead of trying to do everything at once."
- ✅ Q3 "Why does sleep help memory?" → coherent one-line answer
- 🟡 Q12 recipe scaling (2 cups for 4 → 10 people): produces the right formula 2*10/4=5 on some samples, drifts on others
- 🔴 Q11 transitive comparison (Maya < Leo < Nora): still fails, often loops
- 🔴 Q5, Q6, Q7, Q13, Q14, Q15: degenerate into repetition or off-topic
The model has not seen enough pretraining tokens (~320M) for its size. Treat it as a research artifact and small-scale teacher, not a chat assistant.
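The repetition failures listed above can be flagged automatically when scoring many samples. The helper below is a hypothetical heuristic, not part of this repository: it reports the fraction of word n-grams that repeat an earlier n-gram, and the 0.3 threshold in the usage lines is an arbitrary assumption that would need tuning.

```python
def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams that already appeared earlier in the text."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    seen, repeats = set(), 0
    for gram in grams:
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / len(grams)


ok = "One test does not define you. We can look at what went wrong."
loop = "the answer is the answer is the answer is the answer is"

for sample in (ok, loop):
    score = repeated_ngram_fraction(sample)
    print(f"{score:.2f}", "LOOP?" if score > 0.3 else "ok")
```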
Intended use
- Good: narrow Q&A pair generation for distilling into a larger model
- Good: studying small-model SFT dynamics (catastrophic forgetting, reasoning-data overfitting)
- Bad: general conversation, factual lookup, code generation, multilingual, anything safety-critical
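For the Q&A-pair use case, a generation loop might look like the sketch below. It reuses the "User: ...\nAssistant: " prompt format from the usage example. Here generate is a placeholder for whatever decoding function you wrap around the model, and the function names and the JSONL field names (question, answer) are assumptions, not a format defined by this repository.

```python
import io
import json


def build_prompt(question):
    # Single-turn format taken from the usage example in this card.
    return f"User: {question}\nAssistant: "


def write_qa_pairs(questions, generate, out):
    """Write one JSON object per line; skip empty generations."""
    kept = 0
    for question in questions:
        answer = generate(build_prompt(question)).strip()
        if not answer:
            continue
        out.write(json.dumps({"question": question, "answer": answer}) + "\n")
        kept += 1
    return kept


def fake_generate(prompt):
    # Stand-in for the real model so the sketch runs without a GPU.
    return "" if "skip" in prompt else " placeholder answer "


buffer = io.StringIO()
n = write_qa_pairs(["Why does sleep help memory?", "skip me"], fake_generate, buffer)
print(n, buffer.getvalue(), end="")
```

Passing the generator in as a callable keeps the writing logic independent of the model, so the same loop can be tested with a stub and then run against the real checkpoint.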
Limitations
- Cannot reliably refuse unsafe requests (no DPO/RLHF applied)
- English-only training corpus
- Code generation was explicitly excluded during the dialogue phases
- Long context generalization untested; trained mostly at seq_len=1024
Citation
If you use this artifact:
@misc{powergqa778m,
title = {PowerGQA-778M: a custom small LM with learnable head-mixing GQA},
author = {Asilarknes},
year = {2026},
howpublished = {\url{https://huggingface.co/Asilarknes/PowerGQA-778M}}
}