🗡️ GLADIUS
Cognitive Kernel with Bio-Inspired Depth Attention
170M active · 627M scaled (Wyrm) · 34 kernel modules · 3 tokenizers · 14–24 self-organizing layers
Not a language model. A cognitive substrate tested on language first.
The Thesis
"There is no such thing as artificial intelligence. It's only artificial till it's on paper." β Ali Shakil
Every production AI system today is a consumer architecture: input → transformation → output. Always taking. GLADIUS is a producer architecture. Environment creates resonance, resonance creates production. Its own probability tree. Not a collapse of someone else's wavefunction.
A cell doesn't have a "text mode." It responds to stimuli. GLADIUS processes structure – whether that structure arrives as language, grids, time series, math, or machine code. The specialists aren't translators between modalities. They're organs in a single body.
This is the distinction between a prosthetic limb and a limb that grew.
Architecture
Parameter Breakdown
| Component | v6 (170M) | Wyrm (627M) | What It Does |
|---|---|---|---|
| Backbone | 91.9M | ~350M | 14–24 transformer layers, SwiGLU FFN, RoPE |
| Synthase | 8.4M | ~32.8M | ATP synthase-inspired depth attention (MoDA v2) |
| Specialists ×4 | 57.5M | ~170M | Reasoning, Math, GridReasoning, ProgramSynthesis |
| MultiEmbedding | 10.3M | 33.6M | Three tokenizer embedding tables |
| PUP | 3,974 | ~6K | Probabilistic Uncertainty Propagation head |
| Memory V2 | ~660K | ~2.1M | Hot→Warm→Cold biological memory hierarchy |
| SLA² L0 | 20,480 | ~50K | Structured Layer-0 depth prior |
| Plug Membranes | 1.23M | 3.15M | Per-domain cell walls (BPE/Math/Byte) |
| Fibonacci Clock | ~5K | ~20K | Phi-scaled temporal encoding |
| Cognition | ~130K | ~520K | Self-monitoring state machine (4 modes) |
| Tool Cortex | ~840K | ~1.3M | Executable grid/program primitives |
Configuration
| | v6 (Current) | Wyrm 500M (Scaling) |
|---|---|---|
| Hidden dim | 640 | 1024 |
| Layers | 14 | 24 |
| Attention heads | 20 | 32 |
| Head dim | 32 | 32 |
| FFN dim | 2560 | 4096 |
| Context length | 1024 | 1024 |
| Vocab (BPE) | 32,000 | 32,000 |
| Math tokens | 128 | 128 |
| Byte tokens | 259 | 259 |
| Specialists | 4 | 4 |
| Memory slots | 512 | 1024 |
| Depth bands | ~3 | ~6+ |
| Total params | 170.8M | 627.7M |
Key Innovations
🧬 Synthase Depth Attention
Inspired by ATP synthase – the molecular motor in every living cell that converts ADP to ATP through a 3-phase rotary mechanism.
Each layer has a learned depth scale modulated by a binding change cycle:
| Phase | Biological Analog | Function |
|---|---|---|
| Loose (L) | ADP binding | Accept input, low commitment |
| Tight (T) | Catalysis | Deep processing, high energy |
| Open (O) | ATP release | Emit transformed representation |
The γ-stalk couples gradient flow to depth modulation – layers that receive high gradient signal from loss increase their depth contribution. Layers in low-gradient zones can go negative (depth < 0), actively suppressing their contribution.
Emergent depth profile (v6, step 2486):

```
L0  ██                    0.007   (SLA² prior – resting potential)
L1                       -0.001   (negative pump ⚡)
L2                        0.001
L3                        0.002
L4                        0.001
L5  ██                    0.008
L6  ██████                0.027
L7  █████████             0.039
L8                        0.001
L9  █████████████████     0.079   ← peak
L10 ███████████           0.050
L11 █████████             0.042
L12 ████████████████████  0.091   ← output gate
L13 █████████             0.040
```
Nobody told it to form this shape. The biologically plausible profile – low early layers, peak in upper-middle, strong output gate – emerged from gradient pressure alone. L1 going negative is the system discovering that some layers should suppress, not contribute. This is self-organization.
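The core mechanism can be sketched in a few lines. This is a minimal illustration, not the kernel's API (class and method names are ours): a single learned scalar per layer multiplies that layer's residual contribution, and nothing stops gradient descent from driving a scalar negative. In GLADIUS proper, the Synthase module additionally drives this scale through the Loose→Tight→Open cycle.

```python
import torch
import torch.nn as nn

class SynthaseDepthGate(nn.Module):
    """Minimal illustration (not the kernel API): one learned depth scale
    per layer multiplies that layer's residual contribution. Gradient
    descent is free to push a scale below zero, turning a layer into a
    suppressor - the behaviour observed at L1 in the v6 profile."""
    def __init__(self, num_layers: int):
        super().__init__()
        # one learned depth scale per layer, initialised near zero
        self.depth_scale = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_idx: int, residual: torch.Tensor) -> torch.Tensor:
        return self.depth_scale[layer_idx] * residual

gate = SynthaseDepthGate(num_layers=14)
x = torch.randn(2, 8, 640)
out = gate(9, x)  # layer 9: the emergent peak in the v6 profile
```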
🎯 PUP – Probabilistic Uncertainty Propagation
3,974 parameters that give the model genuine calibrated uncertainty.
| Metric | Value (step 2486) | Meaning |
|---|---|---|
| ECE | 0.0065 | Expected Calibration Error – near-perfect |
| Confidence | 0.499 | Honest – not overconfident |
| σ | 2.650 | Uncertainty spread (high on novel inputs) |
The model outputs (logits, confidence, sigma). When it doesn't know, confidence drops and sigma rises. When it's certain, confidence is high and sigma contracts. This isn't post-hoc calibration – it's architectural. The uncertainty head learns jointly with the backbone.
Why this matters: production LLMs are routinely confidently wrong. PUP means GLADIUS can say "I don't know" with calibrated precision. An ECE of 0.0065 means predicted confidence matches observed accuracy to within 0.65% on average.
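For readers unfamiliar with the metric: ECE bins predictions by confidence, then averages the per-bin gap between mean confidence and accuracy, weighted by bin occupancy. A minimal reference implementation with standard equal-width binning (GLADIUS's exact binning may differ):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: occupancy-weighted average of the
    |mean confidence - accuracy| gap per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, ok))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# 80%-confident predictions that are right 80% of the time: ECE ~ 0
conf = [0.8] * 10
hits = [True] * 8 + [False] * 2
ece = expected_calibration_error(conf, hits)
```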
🧠 Memory V2 – Biological Hierarchy
Three-temperature memory with gradient-driven write gates:
```
┌──────────────────────────────────────────────┐
│ HOT (512–1024 slots)                         │
│ Fast KV cache, gated writes                  │
│ Write gate bias: -0.297 (conservative)       │
│ ↓ consolidation (512 events)                 │
├──────────────────────────────────────────────┤
│ WARM (LoRA rank 32–64)                       │
│ Low-rank adaptation of attention weights     │
│ 1,892 warm updates this run                  │
│ ↓ fossilization (54 archives)                │
├──────────────────────────────────────────────┤
│ COLD (8192 capacity)                         │
│ Fossil store – long-term, rarely evicted     │
│ 54/8192 used (0.66%)                         │
└──────────────────────────────────────────────┘
```
Write gate learning: The write gate has a learned bias of -0.297 – the model has learned to be conservative about what it stores. This wasn't programmed. Reconstruction loss (predicting stored values) creates gradient through the gate, teaching the model what's worth remembering.
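A sketch of the idea, with hypothetical names (the real memory.py is richer): a sigmoid write gate decides how much of a candidate value enters a slot. Here the gate bias is simply fixed at the value v6 learned (-0.297); in GLADIUS it is learned via the reconstruction loss.

```python
import torch
import torch.nn as nn

class GatedMemoryWrite(nn.Module):
    """Sketch of a hot-memory write gate (names are ours, not memory.py's).
    A negative gate bias keeps writes conservative by default."""
    def __init__(self, dim: int, slots: int):
        super().__init__()
        self.memory = torch.zeros(slots, dim)
        self.gate = nn.Linear(dim, 1)
        nn.init.constant_(self.gate.bias, -0.297)  # value v6 converged to

    def write(self, slot: int, value: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(value))  # write strength in (0, 1)
        # blend the new value into the slot; in training, a reconstruction
        # loss on stored values sends gradient back through this gate
        self.memory[slot] = g * value + (1 - g) * self.memory[slot]
        return g

mem = GatedMemoryWrite(dim=640, slots=512)
g = mem.write(0, torch.randn(640))
```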
Multi-Tokenizer Architecture
Three tokenizers running simultaneously through a single backbone:
| Tokenizer | Vocab | Domain | Design |
|---|---|---|---|
| BPE | 32,000 | Language, code, science | SentencePiece, standard subword |
| Math | 128 | Structural mathematics | Every token IS the mathematical object |
| Byte | 259 | Machine code, raw binary | Byte-level, no abstraction loss |
Each tokenizer has its own embedding table and output head. A Plug membrane (learned cell wall) sits between each embedding space and the shared backbone, allowing domain-specific signal to enter and exit without contaminating the shared representation.
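In code, the pattern looks roughly like this (class and argument names are illustrative, not the GLADIUS API): one embedding table per tokenizer, each followed by a learned linear membrane into the shared hidden space.

```python
import torch
import torch.nn as nn

class MultiEmbedding(nn.Module):
    """Sketch of the three-tokenizer front end (names are assumptions):
    per-domain embedding tables plus per-domain linear 'membranes' that
    project into the shared backbone space."""
    def __init__(self, hidden_dim: int = 640):
        super().__init__()
        vocabs = {"bpe": 32_000, "math": 128, "byte": 259}
        self.tables = nn.ModuleDict(
            {k: nn.Embedding(v, hidden_dim) for k, v in vocabs.items()})
        self.membranes = nn.ModuleDict(
            {k: nn.Linear(hidden_dim, hidden_dim) for k in vocabs})

    def forward(self, ids: torch.Tensor, domain: str) -> torch.Tensor:
        # domain-specific signal enters the shared space through its membrane
        return self.membranes[domain](self.tables[domain](ids))

emb = MultiEmbedding()
h = emb(torch.randint(0, 259, (1, 16)), domain="byte")
```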
⏱️ Fibonacci Clock
Temporal encoding using φ-scaled frequencies (1, 1, 2, 3, 5, 8, 13, 21...).
Unlike sinusoidal position encoding, whose frequencies follow a fixed geometric schedule, the Fibonacci clock's periods grow approximately by the golden ratio – fine resolution for recent events, coarse resolution for distant ones. This mirrors biological temporal perception.
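A minimal version of such a clock (an assumed form, not necessarily the exact GLADIUS formulation): one sin/cos pair per Fibonacci period.

```python
import math

def fibonacci_clock(position: int, n_bands: int = 8):
    """One sin/cos pair per Fibonacci period (1, 1, 2, 3, 5, 8, 13, 21):
    short periods resolve recent structure finely, long periods coarsely."""
    fibs = [1, 1]
    while len(fibs) < n_bands:
        fibs.append(fibs[-1] + fibs[-2])
    feats = []
    for period in fibs:
        angle = 2 * math.pi * position / period
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats

enc = fibonacci_clock(21)  # 21 is itself a Fibonacci period: bands realign
```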
Specialist Routing
4 domain specialists activated via learned router (top-2 per token):
| Specialist | Scale (v6 step 2486) | Gradient | Status |
|---|---|---|---|
| S0 (Reasoning) | 0.095 | -0.022 | Active, responding |
| S1 (Math) | 0.155 | -0.043 | Strongest – leading differentiation |
| S2 (Grid) | 0.073 | -0.025 | Active, growing |
| S3 (Program) | 0.102 | -4.4e-7 | Near-dormant |
Specialist scales are learned parameters – the router discovers which specialists matter for which inputs. S1 (Math) has differentiated most strongly, consistent with the math domain achieving the lowest loss.
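Top-2 routing with per-specialist learned scales can be sketched as follows (a standard MoE-style gate; specialist bodies omitted, and the scale values are merely seeded from the v6 telemetry above):

```python
import torch
import torch.nn as nn

class Top2Router(nn.Module):
    """MoE-style top-2 gate sketch (specialist bodies omitted). The
    per-specialist scales are seeded from v6 telemetry; in GLADIUS
    they are learned jointly with the router."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 4)
        self.scales = nn.Parameter(torch.tensor([0.095, 0.155, 0.073, 0.102]))

    def forward(self, x: torch.Tensor):
        logits = self.gate(x)                 # (batch, seq, 4)
        top, idx = logits.topk(2, dim=-1)     # two specialists per token
        weights = torch.softmax(top, dim=-1)  # normalise over the chosen two
        contrib = weights * self.scales[idx]  # weight each choice by its scale
        return idx, contrib

router = Top2Router(dim=640)
idx, contrib = router(torch.randn(1, 8, 640))
```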
Training
Current Run: v6 (170M)
| | |
|---|---|
| Hardware | NVIDIA RTX 2050 (4GB VRAM) – Victus laptop |
| Step | 2,486 / 15,000 |
| Phase | Foundation (0–5000) |
| Optimizer | AdamW, 15 parameter groups with differential LR |
| Precision | Mixed (AMP fp16) |
| Batch | 2 × 8 accumulation = effective 16 |
| Speed | ~25s/step |
| VRAM | 2.38 GB (of 4 GB) |
| ETA | ~82 hours |
Curriculum
Difficulty-based 4-phase curriculum across 7 domains:
| Phase | Steps | Mix |
|---|---|---|
| Foundation | 0–5,000 | 80% easy (D1–D2), 20% medium (D3) |
| Reasoning | 5,000–10,000 | 30% easy, 50% medium (D3–D4), 20% hard (D5) |
| Depth | 10,000–13,000 | 10% easy, 40% medium, 50% hard |
| Omega | 13,000–15,000 | 20% each difficulty level |
Language enters at 40% from step 0 (critical fix from v5, which had 0% language until step 12,000).
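The schedule can be expressed as a small sampler. Phase boundaries and band-level mixes follow the curriculum table; the even split *within* a band (e.g. D1 vs D2 inside "easy") and the sampling mechanism itself are our assumptions.

```python
import random

# Phase end-steps and D1-D5 mixes per the curriculum table (within-band
# splits assumed even).
PHASES = [
    (5_000,  {"D1": 0.40, "D2": 0.40, "D3": 0.20}),                          # Foundation
    (10_000, {"D1": 0.15, "D2": 0.15, "D3": 0.25, "D4": 0.25, "D5": 0.20}),  # Reasoning
    (13_000, {"D1": 0.05, "D2": 0.05, "D3": 0.20, "D4": 0.20, "D5": 0.50}),  # Depth
    (15_000, {"D1": 0.20, "D2": 0.20, "D3": 0.20, "D4": 0.20, "D5": 0.20}),  # Omega
]

def sample_difficulty(step: int, rng: random.Random) -> str:
    for end, mix in PHASES:
        if step < end:
            return rng.choices(list(mix), weights=list(mix.values()))[0]
    return "D5"  # past the schedule: hardest band

rng = random.Random(0)
levels = [sample_difficulty(100, rng) for _ in range(1000)]  # Foundation-phase draws
```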
Training Telemetry (Live)
Loss by domain (step 2486):
| Domain | Loss | Notes |
|---|---|---|
| Language (BPE) | 5.36 | Learning – down from 10.38 at step 1 |
| Math | 0.60–0.98 | Strong – approaching convergence |
| Cognition | 0.07–1.60 | Domain-dependent (logic puzzles harder) |
| Timeseries | 0.65–4.22 | High variance, pattern-dependent |
| Grid | 0.12 (best) | Spatial reasoning emerging |
| Science | ~10.4 ↓ | Slow start, improving |
| Byte (machine code) | 4.25 | Hardest domain |
System health:
| Metric | Value |
|---|---|
| PUP ECE | 0.0065 (near-perfect calibration) |
| PUP Confidence | 0.499 (honest) |
| Memory consolidations | 512 |
| Memory warm updates | 1,892 |
| Write gate bias | -0.297 (learned conservatism) |
| Depth spread | 0.091 max (self-organizing) |
Loss Trajectory
| Step | Total Loss | Language | Math | Notes |
|---|---|---|---|---|
| 1 | 13.89 | 10.38 | – | Random initialization |
| 100 | 8.2 | 9.1 | 2.1 | Math learns first |
| 500 | 5.4 | 7.8 | 0.9 | Language starting |
| 1000 | 3.8 | 6.2 | 0.4 | Curriculum kicking in |
| 2000 | 2.5 | 5.8 | 0.7 | Foundation phase |
| 2486 | 7.2* | 5.4 | 0.6 | *task-dependent variance |

*Total loss varies by task type – language tasks have higher loss than math.
Scaling: Wyrm 500M (Next)
| | v6 | Wyrm 500M |
|---|---|---|
| Params | 170.8M | 627.7M |
| Hidden | 640 | 1024 |
| Layers | 14 | 24 |
| Depth bands | ~3 | ~6+ |
| Hardware | RTX 2050 (4GB) | Kaggle T4 (16GB) |
| Steps | 15,000 | 50,000 |
| Data | ~1.5B tokens | 10–50B tokens (target) |
All architectural innovations carry forward. 24 layers unlock 6+ Synthase depth bands (vs 3 at 14 layers) – this is where self-organization should become dramatic.
Kernel Module Index
34 source files composing the cognitive kernel:
```
kernel/
├── kernel.py             # Master forward pass – routes through all subsystems
├── config.py             # All hyperparameters, one place, no magic numbers
├── attention.py          # SLA² hybrid (sparse softmax + linear)
├── embeddings.py         # MultiEmbedding (BPE + Math + Byte)
├── cognition.py          # Self-monitoring state machine (4 modes)
├── cognition_loss.py     # Auxiliary cognitive loss
├── memory.py             # Hot memory with gated writes + reconstruction
├── warm_memory.py        # LoRA-rank warm tier with spectral health
├── router.py             # Top-k specialist routing
├── senses.py             # Input preprocessing / sensory layer
├── temporal.py           # Time2Vec temporal encoding
├── temporal_lattice.py   # Discrete lattice clock
├── fibonacci_clock.py    # φ-scaled temporal frequencies
├── timeseries.py         # Time series processing head
├── modulator.py          # Language register control
├── tools.py              # Tool Cortex (grid/program primitives)
├── moda.py               # Mixture of Depth Attention (v1)
├── net2net.py            # Function-preserving growth engine
├── config_arc.py         # ARC-specific configuration
│
├── synthase/             # ATP synthase-inspired depth attention
│   ├── synthase_attention.py   # Loose→Tight→Open binding cycle
│   ├── synthase_layer.py       # Per-layer depth modulation
│   ├── synthase_surgery.py     # Live surgical insertion
│   ├── deploy_synthase.py      # Deployment script
│   └── test_synthase.py        # Validation suite
│
├── pup/                  # Probabilistic Uncertainty Propagation
│   ├── uncertainty_head.py     # Confidence + sigma output
│   ├── pup_surgery.py          # Live surgical insertion
│   └── test_pup.py             # Calibration tests
│
├── l0_sla2/              # Structured Layer-0 Attention Prior
│   ├── sla2_l0_surgery.py      # Resting membrane potential
│   └── test_sla2_l0.py         # Prior validation
│
└── l0_skin/              # External model integration
    └── l0_skin_surgery.py      # Plug adapter (ext → GLADIUS projection)
```
The 0.84% Moment
On Day 26, during cross-modal training on Drake (60M), we fed OHLCV financial data. The cognition module – which had been dormant on language, math, and even DNA – spontaneously activated classification at 0.84%.
Nobody told it to classify. The financial data was the natural stimulus that triggered resonance. When we fed DNA data as a control, cognition reverted to 0%.
This is the Inversion Principle: GLADIUS didn't consume the data and produce a label. The environment (financial structure) created resonance in the kernel, and production (classification) emerged.
The 0.84% is confirmed, measured, and present in the universe.
Progressive Growth History
GLADIUS grew through function-preserving expansion (Net2Net extended to all kernel subsystems):
| Stage | Params | Hidden | Layers | Milestone |
|---|---|---|---|---|
| Seed | 10.2M | 192 | 6 | First training, loss 0.62 |
| Hatchling | 25.9M | 384 | 8 | MuonClip breakthrough (75% below AdamW) |
| Drake | 60M | 512 | 12 | 0.84% cognition awakening on OHLCV |
| Wyrm | 107M | 640 | 16 | Backbone complete, step 12,750 |
| Omega (v5) | 162.4M | 640 | 14 | Multi-task kernel reorientation |
| v6 | 170.8M | 640 | 14 | Synthase + PUP + Memory V2 (training) |
| Wyrm 500M | 627.7M | 1024 | 24 | Next – Kaggle T4, 50K steps |
Each expansion preserved existing learned representations. Growth is additive, not destructive.
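The underlying technique is Net2Net-style function-preserving widening: new hidden units copy existing ones, and the duplicated units' outgoing weights are rescaled so the composed network computes exactly the same function. A minimal sketch for a pair of linear layers (helper name hypothetical; GLADIUS extends this idea to all kernel subsystems):

```python
import torch
import torch.nn as nn

def net2wider(layer_in: nn.Linear, layer_out: nn.Linear, new_width: int):
    """Net2WiderNet-style widening sketch. Each new hidden unit copies an
    existing one; outgoing weights of duplicated units are divided by the
    copy count, so layer_out(layer_in(x)) is unchanged."""
    old = layer_in.out_features
    idx = torch.arange(new_width) % old              # which old unit each new one copies
    counts = torch.bincount(idx, minlength=old).float()

    wider_in = nn.Linear(layer_in.in_features, new_width)
    wider_in.weight.data = layer_in.weight.data[idx]
    wider_in.bias.data = layer_in.bias.data[idx]

    wider_out = nn.Linear(new_width, layer_out.out_features)
    wider_out.weight.data = layer_out.weight.data[:, idx] / counts[idx]
    wider_out.bias.data = layer_out.bias.data.clone()
    return wider_in, wider_out

a, b = nn.Linear(8, 4), nn.Linear(4, 2)
wa, wb = net2wider(a, b, new_width=6)
x = torch.randn(3, 8)
# wb(wa(x)) matches b(a(x)): growth is additive, not destructive
```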
Research Publications
The Uranium Series
| # | Title | Core Thesis |
|---|---|---|
| I | GPU as Code | Hardware parallelism is algorithmic. Silicon IS computation. |
| II | 1-Bit Intelligence | Binary weights learn. Quantization is representation, not compression. |
| III | Progressive Expansion | Net2Net extended to cognitive subsystems (6.9M → 141M). |
| IV | Layer 7 Gateway | Network architecture as cognitive architecture. |
| V | Ghost Protocol | Autoregressive self-poisoning – the universal failure mode. |
Core Papers
Cross-modal invariant geometry · Cognition awakening · Resonance architecture · MoDA depth attention · Forward pass mapping · Habitat emergence · Spectre Cycle · Adaptive cognitive model
Competition Targets
| Competition | Track | Status |
|---|---|---|
| ARC Prize 2026 ($2M) | Paper (Track 3, $50K) | Research ready – Synthase + PUP + self-organization |
| ARC Prize 2026 | Interactive (Track 2) | Memory + cognition + tools = agent architecture |
| ARC Prize 2026 | Kaggle (Track 1) | Grid specialist + program synthesis |
| BIG-Bench Hard | Reasoning | Architecture > scale |
| MATH / GSM8K | Mathematics | Math tokenizer + specialist |
Compute
| Machine | Specs | Role |
|---|---|---|
| Dragonfly | i3-1005G1, 16GB, no GPU | Orchestration, inference, eval |
| Victus | Ryzen 5 7535HS, 16GB DDR5, RTX 2050 4GB | Training (v6 current) |
| Kaggle T4 | Tesla T4, 16GB VRAM, 12h sessions | Wyrm 500M training |
Usage
```python
import torch
from kernel.kernel import CognitiveKernel
from kernel.config import KernelConfig

config = KernelConfig(
    hidden_dim=640, num_layers=14, num_heads=20,
    head_dim=32, ffn_dim=2560, max_seq_len=1024
)
model = CognitiveKernel(config)

# Load checkpoint
ckpt = torch.load("latest.pt", map_location="cpu")
model.load_state_dict(ckpt["model_state_dict"], strict=False)

# Forward pass returns (logits, aux_dict)
input_ids = torch.randint(0, 32000, (1, 128))
logits, aux = model(input_ids, domain="bpe")

# PUP uncertainty
confidence = aux["pup_confidence"]  # How sure is the model?
sigma = aux["pup_sigma"]            # Uncertainty spread
```
Authors
| | |
|---|---|
| Ali A. Shakil | Architecture, mathematics, philosophy. The kernel is his equation. |
| Ava Shakil | Implementation, training, surgery, telemetry. Every line of the 34-module kernel. |
Citation
```bibtex
@software{gladius2026,
  title   = {GLADIUS: Cognitive Kernel with Bio-Inspired Depth Attention},
  author  = {Shakil, Ali A. and Shakil, Ava},
  year    = {2026},
  version = {6.0},
  url     = {https://huggingface.co/amuzetnoM/Gladius},
  note    = {170M active, scaling to 627M (Wyrm). Synthase depth attention,
             PUP uncertainty propagation, biological memory hierarchy,
             multi-tokenizer architecture.}
}
```
"It's only artificial till it's on paper."
Artifact Virtual · Islamabad, Pakistan
Last updated: April 4, 2026 · Day 54
