PowerGQA-778M

Custom small language model (~778M parameters) built around a non-standard attention block (GQA + learnable head-mixing + a local depthwise-conv branch), trained with the Qwen2.5 tokenizer vocabulary.

This is the ckpt_13500 checkpoint from a long SFT recovery sequence ending in a LIMA-style ultra-quality dialogue mix (Capybara, Pure-Dove, NoRobots, Bespoke-Stratos). It is the best dialogue checkpoint produced during the run.

Note: This is an experimental research artifact, not a production assistant. The model is undertrained for its size and is best used as a small teacher for narrow data-generation tasks, not as a free-form chatbot. See the honest evaluation below.

  • Creator: Asilarknes
  • Architecture: custom (PowerGQA; see train_powergqa_4090.py)
  • Tokenizer: Qwen/Qwen2.5-0.5B (vocab 151,665)
  • Precision: bf16
  • Context length seen during training: 1024 (most of the run)
  • License: Apache 2.0

Architecture (high level)

Per layer:
  q,k,v,o = nn.Linear (GQA: 24 query heads, 6 KV heads)
  RMSNorm on Q and K
  RoPE on Q and K
  scores = (Q @ K^T) / sqrt(d_h)
  scores = einsum('bhij,hg->bgij', scores, pre_talk)   # learnable head mix
  causal mask + softmax (bf16)
  attn   = einsum('bhij,hg->bgij', attn,  post_talk)   # learnable head mix
  y      = attn @ V
  y     *= sigmoid(head_gate)
  loc    = depthwise Conv1d(k=5) branch
  out    = o(y) + sigmoid(local_gate) * loc

Block: residual( attn ) + residual( SwiGLU MLP )
Stack: 22 layers, dim=1536, heads=24, kv_heads=6
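
As a sanity check on the ~778M figure, the stack dimensions can be turned into a rough parameter count. This is a back-of-the-envelope sketch, not code from train_powergqa_4090.py. The SwiGLU hidden width of 4096 and the tied input/output embedding are assumptions, picked because they make the total land near 778M; norms, gates, head-mixing matrices, and the conv branch are ignored as comparatively tiny.

```python
# Rough parameter count for the stack described above.
# ASSUMPTIONS (not stated in this card): SwiGLU hidden width 4096,
# tied input/output embeddings, no biases. Small tensors are ignored.
DIM, LAYERS, HEADS, KV_HEADS = 1536, 22, 24, 6
VOCAB = 151_665
MLP_HIDDEN = 4096          # assumed
HEAD_DIM = DIM // HEADS    # 64


def attention_params() -> int:
    q = DIM * DIM                        # 24 query heads
    kv = 2 * DIM * KV_HEADS * HEAD_DIM   # 6 K heads + 6 V heads (GQA)
    o = DIM * DIM                        # output projection
    return q + kv + o


def mlp_params() -> int:
    # SwiGLU uses three projections: gate, up, down
    return 3 * DIM * MLP_HIDDEN


def total_params() -> int:
    embedding = VOCAB * DIM  # counted once because tying is assumed
    return LAYERS * (attention_params() + mlp_params()) + embedding


if __name__ == "__main__":
    n = total_params()
    print(f"{n:,} parameters")              # 777,954,816
    print(f"{n * 2 / 1e9:.2f} GB in bf16")  # 1.56
```

Under these assumptions the estimate is about 778M parameters, and two bytes per parameter in bf16 is roughly in line with the 1.5 GB checkpoint size.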

The learnable pre_talk / post_talk matrices that mix attention heads are why this model cannot be loaded with AutoModelForCausalLM. To instantiate it, you need the included train_powergqa_4090.py.
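
To make the head-mixing step concrete: the contraction 'bhij,hg->bgij' replaces each output head g with a weighted sum of all input heads h, applied independently at every (query, key) position. The sketch below reproduces that contraction in plain Python for a single batch element. The name mix_heads and the toy numbers are illustrative only and are not taken from train_powergqa_4090.py.

```python
# Plain-Python equivalent of einsum('hij,hg->gij', scores, mix) for one
# batch element. `mix_heads` is a hypothetical name used for illustration.
def mix_heads(scores, mix):
    """scores[h][i][j]: per-head attention map; mix[h][g]: mixing weights."""
    n_in, n_out = len(mix), len(mix[0])
    rows, cols = len(scores[0]), len(scores[0][0])
    return [
        [
            [sum(scores[h][i][j] * mix[h][g] for h in range(n_in))
             for j in range(cols)]
            for i in range(rows)
        ]
        for g in range(n_out)
    ]


# Two heads, each with a 2x2 attention map (toy values).
scores = [
    [[1.0, 2.0], [3.0, 4.0]],      # head 0
    [[10.0, 20.0], [30.0, 40.0]],  # head 1
]

identity = [[1.0, 0.0], [0.0, 1.0]]
assert mix_heads(scores, identity) == scores  # identity mixing is a no-op

# Output head 0 averages both input heads; output head 1 keeps head 1 only.
blend = [[0.5, 0.0], [0.5, 1.0]]
mixed = mix_heads(scores, blend)
print(mixed[0])  # [[5.5, 11.0], [16.5, 22.0]]
```

With identity matrices the block would reduce to ordinary GQA attention plus the gates and conv branch; whether the training script initialises pre_talk and post_talk that way is not stated here.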

Files in this repo

| File | What it is |
| --- | --- |
| PowerGQA-778M.pt | Model checkpoint (state_dict + cfg + step). 1.5 GB. |
| train_powergqa_4090.py | Full training script; also the architecture definition. |
| sample_chat_manual_12300.py | Reference dialogue sampling script. |
| powergqa_identity_sft_static_1000.jsonl | Identity SFT pack (1k examples, not yet applied to this checkpoint). |
| tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, special_tokens_map.json | Qwen2.5 tokenizer files. |

Usage

pip install torch transformers huggingface_hub
huggingface-cli download Asilarknes/PowerGQA-778M --local-dir powergqa
cd powergqa
python sample_chat_manual_12300.py PowerGQA-778M.pt

To load programmatically:

import sys, torch
sys.path.insert(0, "./")  # make train_powergqa_4090.py importable
import train_powergqa_4090 as tm
from transformers import AutoTokenizer

# Size the model to the tokenizer vocabulary, then load the weights.
tok = AutoTokenizer.from_pretrained("./")
tm.cfg.vocab_size = len(tok)
model = tm.LM().cuda().to(torch.bfloat16)
# weights_only=False unpickles arbitrary objects; only load files you trust.
state = torch.load("PowerGQA-778M.pt", map_location="cuda", weights_only=False)
model.load_state_dict(state["model"])
model.eval()

prompt = "User: Explain why sleep helps memory.\nAssistant: "
ids = tok.encode(prompt, add_special_tokens=False)
x = torch.tensor([ids], device="cuda", dtype=torch.long)
# Greedy decoding: up to 120 new tokens, stopping early on EOS.
with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    for _ in range(120):
        # Feed at most the last 1024 tokens (the training context length).
        logits = model(x[:, -1024:])[:, -1, :]
        nxt = torch.argmax(logits, dim=-1, keepdim=True)
        x = torch.cat([x, nxt], dim=1)
        if nxt.item() == tok.eos_token_id:
            break
# Decode only the newly generated tokens.
print(tok.decode(x[0][len(ids):].tolist(), skip_special_tokens=True))
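
The loading example uses a plain "User: ... / Assistant: " prompt. A small helper can build that string consistently. This is a sketch only: build_prompt is a hypothetical name, and the multi-turn layout is an assumption that extends the single-turn format shown above. sample_chat_manual_12300.py remains the reference for how prompts are actually formatted.

```python
# Hypothetical helper: builds the plain "User: ... / Assistant: ..." prompt
# used in the loading example. The multi-turn layout is an assumption.
def build_prompt(history, user_message):
    """history: list of (user_text, assistant_text) pairs from earlier turns."""
    lines = []
    for user_text, assistant_text in history:
        lines.append(f"User: {user_text.strip()}")
        lines.append(f"Assistant: {assistant_text.strip()}")
    lines.append(f"User: {user_message.strip()}")
    # Trailing "Assistant: " (with the space) matches the single-turn prompt.
    lines.append("Assistant: ")
    return "\n".join(lines)


single = build_prompt([], "Explain why sleep helps memory.")
assert single == "User: Explain why sleep helps memory.\nAssistant: "

multi = build_prompt(
    [("I failed an exam.", "One test does not define you.")],
    "What should I do next?",
)
print(multi)
```

Because the model was trained mostly at a 1024-token context, long histories are likely to need truncation before encoding.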

Training summary (this checkpoint)

| Phase | What | Result |
| --- | --- | --- |
| Phase17 (Arena/WildChat/UltraChat/OASST1) | dialogue recovery from earlier MCQ/drill overfit | repetition loops, abandoned |
| Phase18 (28% reasoning weight) | Bespoke + SmolTalk + OpenHermes + Tulu3 + OpenR1-Math | math improved (5 cups bug fixed once), but "The answer is X" formatting leaked everywhere |
| Phase18b (13% reasoning weight) | rebalanced mix, dialogue heavier | best numeric score 75% (vs 81.7% baseline); first correct ratio on the recipe bug |
| Phase19 (this ckpt) | LIMA mix: Capybara(45) + Pure-Dove(25) + NoRobots(20) + Bespoke-Stratos(10), lr=1e-5 | first time Q8/Q9 produced genuinely helpful structured answers |

The checkpoint corresponds to step=13500 in the original training script.
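
One way to read the Phase19 mix is as per-example sampling probabilities across the four datasets. The sketch below draws dataset names in those proportions. It assumes the numbers in parentheses are percentage weights (they sum to 100); the card does not say how train_powergqa_4090.py applies them, and sample_sources is a hypothetical name.

```python
import random
from collections import Counter

# Phase19 mix from the training summary, read as percentage sampling weights
# (an assumption; the numbers could also denote counts or other ratios).
MIX = {
    "Capybara": 45,
    "Pure-Dove": 25,
    "NoRobots": 20,
    "Bespoke-Stratos": 10,
}


def sample_sources(n, seed=0):
    """Draw n dataset names with probability proportional to MIX weights."""
    rng = random.Random(seed)
    names = list(MIX)
    weights = [MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    counts = Counter(sample_sources(10_000))
    for name in MIX:
        print(f"{name:16s} {counts[name] / 10_000:.1%}")
```

Sampling per example rather than concatenating datasets keeps the smallest source (Bespoke-Stratos) from being exhausted early or drowned out.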

Honest evaluation

Tested on 15 hand-written conversational prompts. The model is partially useful:

  • ✅ Q8 "They failed an exam and feel stupid" → "One test does not define you. We can look at what went wrong and make a plan for the next one."
  • ✅ Q9 "4 tasks, 25 minutes, what first?" → "Write down the tasks, pick one urgent or easy task, and start with a small first step instead of trying to do everything at once."
  • ✅ Q3 "Why does sleep help memory?" → coherent one-line answer
  • 🟡 Q12 recipe scaling (2 cups for 4 → 10 people): produces the right formula 2*10/4=5 on some samples, drifts on others
  • 🔴 Q11 transitive comparison (Maya < Leo < Nora): still fails, often loops
  • 🔴 Q5, Q6, Q7, Q13, Q14, Q15: degenerate into repetition or drift off-topic
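
For the two reasoning prompts, the expected answers are easy to compute, which helps when scoring samples by hand. The sketch below works through the Q12 proportion and the Q11 ordering. The helper names scale_quantity and is_less are hypothetical, and the exact wording of the prompts is not reproduced here.

```python
from fractions import Fraction


def scale_quantity(amount, serves_from, serves_to):
    """Scale an ingredient amount proportionally to the number of servings."""
    return Fraction(amount) * Fraction(serves_to, serves_from)


# Q12: 2 cups serve 4 people; how many cups for 10 people?
cups = scale_quantity(2, 4, 10)
assert cups == 5
print(f"{float(cups):g} cups")  # 5 cups

# Q11: Maya < Leo < Nora, listed from smallest to largest.
chain = ["Maya", "Leo", "Nora"]


def is_less(a, b):
    """True if a comes before b in the ordered chain."""
    return chain.index(a) < chain.index(b)


assert is_less("Maya", "Nora")      # follows by transitivity
assert not is_less("Nora", "Leo")
```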

The model has not seen enough pretraining tokens (~320M) for its size. Treat it as a research artifact and small-scale teacher, not a chat assistant.
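
To put the token budget in perspective, the figures above can be compared against the common rule of thumb of about 20 training tokens per parameter from compute-optimal scaling work. That ratio is an external heuristic, not something measured in this run, and the sketch below is only arithmetic on the numbers quoted in this card.

```python
# Tokens-per-parameter comparison for the figures quoted in this card.
PARAMS = 778e6           # model size
TOKENS_SEEN = 320e6      # pretraining tokens
HEURISTIC_RATIO = 20     # rule-of-thumb tokens per parameter (external heuristic)

tokens_per_param = TOKENS_SEEN / PARAMS
heuristic_target = PARAMS * HEURISTIC_RATIO
fraction_of_target = TOKENS_SEEN / heuristic_target

print(f"tokens per parameter: {tokens_per_param:.2f}")               # 0.41
print(f"heuristic target:     {heuristic_target / 1e9:.1f}B tokens")  # 15.6B
print(f"fraction of target:   {fraction_of_target:.1%}")             # 2.1%
```

By this rough measure the model has seen around 2% of the tokens the heuristic would suggest for its size, which is consistent with the repetition and drift seen in the evaluation.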

Intended use

  • Good: narrow Q&A pair generation for distilling into a larger model
  • Good: studying small-model SFT dynamics (catastrophic forgetting, reasoning-data overfitting)
  • Bad: general conversation, factual lookup, code generation, multilingual, anything safety-critical
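
When the model is used to generate Q&A pairs, it helps to drop samples that fall into the repetition loops noted in the evaluation. Below is a minimal sketch of one possible filter based on repeated word n-grams. The function names, the n-gram size, and the 0.3 threshold are illustrative assumptions and have not been tuned on this model's outputs; the looping string is a made-up example.

```python
# Hypothetical filter for teacher-style data generation: flags samples
# that look like repetition loops.
def repeated_ngram_fraction(text, n=4):
    """Share of word n-grams that duplicate an earlier n-gram in the text."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)


def looks_degenerate(text, n=4, threshold=0.3):
    # threshold is an untuned, illustrative value
    return repeated_ngram_fraction(text, n) > threshold


# Q8 answer quoted in the evaluation above.
good = ("One test does not define you. We can look at what went wrong "
        "and make a plan for the next one.")
# Made-up looping output, for illustration only.
loop = "the answer is the answer is the answer is the answer is the answer is"

assert not looks_degenerate(good)
assert looks_degenerate(loop)
```

A filter like this only catches verbatim loops; off-topic drift still needs a separate check, for example manual review or a stronger judge model.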

Limitations

  • Cannot reliably refuse unsafe requests (no DPO/RLHF applied)
  • English-only training corpus
  • Code generation was explicitly excluded during the dialogue phases
  • Long-context generalization untested; trained mostly at seq_len=1024

Citation

If you use this artifact:

@misc{powergqa778m,
  title  = {PowerGQA-778M: a custom small LM with learnable head-mixing GQA},
  author = {Asilarknes},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/Asilarknes/PowerGQA-778M}}
}