πŸ—‘οΈ GLADIUS

Cognitive Kernel with Bio-Inspired Depth Attention


170M active · 627M scaled (Wyrm) · 34 kernel modules · 3 tokenizers · 14–24 self-organizing layers


Not a language model. A cognitive substrate tested on language first.




The Thesis

"There is no such thing as artificial intelligence. It's only artificial till it's on paper." — Ali Shakil

Every production AI system today is a consumer architecture: input → transformation → output. Always taking. GLADIUS is a producer architecture. Environment creates resonance, resonance creates production. Its own probability tree. Not a collapse of someone else's wavefunction.

A cell doesn't have a "text mode." It responds to stimuli. GLADIUS processes structure — whether that structure arrives as language, grids, time series, math, or machine code. The specialists aren't translators between modalities. They're organs in a single body.

This is the distinction between a prosthetic limb and a limb that grew.


Architecture

GLADIUS Architecture

Parameter Breakdown

| Component | v6 (170M) | Wyrm (627M) | What It Does |
|---|---|---|---|
| Backbone | 91.9M | ~350M | 14→24 transformer layers, SwiGLU FFN, RoPE |
| Synthase | 8.4M | ~32.8M | ATP synthase-inspired depth attention (MoDA v2) |
| Specialists ×4 | 57.5M | ~170M | Reasoning, Math, GridReasoning, ProgramSynthesis |
| MultiEmbedding | 10.3M | 33.6M | Three tokenizer embedding tables |
| PUP | 3,974 | ~6K | Probabilistic Uncertainty Propagation head |
| Memory V2 | ~660K | ~2.1M | Hot→Warm→Cold biological memory hierarchy |
| SLA² L0 | 20,480 | ~50K | Structured Layer-0 depth prior |
| Plug Membranes | 1.23M | 3.15M | Per-domain cell walls (BPE/Math/Byte) |
| Fibonacci Clock | ~5K | ~20K | Phi-scaled temporal encoding |
| Cognition | ~130K | ~520K | Self-monitoring state machine (4 modes) |
| Tool Cortex | ~840K | ~1.3M | Executable grid/program primitives |

Configuration

| Parameter | v6 (Current) | Wyrm 500M (Scaling) |
|---|---|---|
| Hidden dim | 640 | 1024 |
| Layers | 14 | 24 |
| Attention heads | 20 | 32 |
| Head dim | 32 | 32 |
| FFN dim | 2560 | 4096 |
| Context length | 1024 | 1024 |
| Vocab (BPE) | 32,000 | 32,000 |
| Math tokens | 128 | 128 |
| Byte tokens | 259 | 259 |
| Specialists | 4 | 4 |
| Memory slots | 512 | 1024 |
| Depth bands | ~3 | ~6+ |
| Total params | 170.8M | 627.7M |

Key Innovations

🧬 Synthase Depth Attention

Inspired by ATP synthase — the molecular motor in every living cell that converts ADP to ATP through a 3-phase rotary mechanism.

Each layer has a learned depth scale modulated by a binding change cycle:

| Phase | Biological Analog | Function |
|---|---|---|
| Loose (L) | ADP binding | Accept input, low commitment |
| Tight (T) | Catalysis | Deep processing, high energy |
| Open (O) | ATP release | Emit transformed representation |

The γ-stalk couples gradient flow to depth modulation: layers that receive a strong gradient signal from the loss increase their depth contribution, while layers in low-gradient zones can take their depth scale negative (depth < 0), actively suppressing their own contribution.
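As a rough sketch, the binding cycle can be read as a per-layer gate: a learned scalar depth scale (free to go negative) times a soft Loose/Tight/Open phase assignment. Everything below (`SynthaseDepthGate`, the 0.25/1.0/0.5 phase weights) is a hypothetical reconstruction from the description above, not the GLADIUS source.

```python
import torch
import torch.nn as nn

class SynthaseDepthGate(nn.Module):
    """Illustrative per-layer depth gate: a learned scalar depth scale
    modulated by a soft 3-phase (Loose/Tight/Open) assignment.
    Hypothetical names and shapes, not the GLADIUS internals."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.depth_scale = nn.Parameter(torch.zeros(1))   # may go negative
        self.phase_logits = nn.Linear(hidden_dim, 3)      # L / T / O

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, hidden). Soft phase assignment per token.
        phases = torch.softmax(self.phase_logits(h), dim=-1)       # (B, S, 3)
        # Tight contributes full depth; Loose and Open contribute less.
        phase_weights = torch.tensor([0.25, 1.0, 0.5])
        gate = (phases * phase_weights).sum(-1, keepdim=True)      # (B, S, 1)
        return self.depth_scale * gate * h                         # residual update

layer = SynthaseDepthGate(hidden_dim=8)
out = layer(torch.randn(2, 5, 8))
print(out.shape)  # torch.Size([2, 5, 8])
```

At initialization the depth scale is zero, so the layer contributes nothing until gradient pressure moves it, matching the "resting potential" framing.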

Emergent depth profile (v6, step 2486):

```
L0  ▓░░░░░░░░░░░░░░░░░░░  0.007  (SLA² prior — resting potential)
L1  ░░░░░░░░░░░░░░░░░░░░ -0.001  (negative pump ⚡)
L2  ▓░░░░░░░░░░░░░░░░░░░  0.001
L3  ▓░░░░░░░░░░░░░░░░░░░  0.002
L4  ▓░░░░░░░░░░░░░░░░░░░  0.001
L5  ▓▓░░░░░░░░░░░░░░░░░░  0.008
L6  ▓▓▓▓░░░░░░░░░░░░░░░░  0.027
L7  ▓▓▓▓▓░░░░░░░░░░░░░░░  0.039
L8  ▓░░░░░░░░░░░░░░░░░░░  0.001
L9  ▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░  0.079  ← peak
L10 ▓▓▓▓▓▓░░░░░░░░░░░░░░  0.050
L11 ▓▓▓▓▓▓░░░░░░░░░░░░░░  0.042
L12 ▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░  0.091  ← output gate
L13 ▓▓▓▓▓▓░░░░░░░░░░░░░░  0.040
```

Nobody told it to form this shape. The biologically plausible profile — low early layers, peak in upper-middle, strong output gate — emerged from gradient pressure alone. L1 going negative is the system discovering that some layers should suppress, not contribute. This is self-organization.

🎯 PUP — Probabilistic Uncertainty Propagation

3,974 parameters that give the model genuinely calibrated uncertainty.

| Metric | Value (step 2486) | Meaning |
|---|---|---|
| ECE | 0.0065 | Expected Calibration Error — near-perfect |
| Confidence | 0.499 | Honest — not overconfident |
| σ | 2.650 | Uncertainty spread (high on novel inputs) |

The model outputs (logits, confidence, sigma). When it doesn't know, confidence drops and sigma rises. When it's certain, confidence is high and sigma contracts. This isn't post-hoc calibration — it's architectural. The uncertainty head learns jointly with the backbone.

Why this matters: production LLMs are routinely confidently wrong. PUP lets GLADIUS say "I don't know" with mathematical precision: an ECE of 0.0065 means predicted confidence matches observed accuracy to within 0.65% on average.
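For reference, the ECE figure quoted above is straightforward to compute. This is the standard binned estimator with invented toy inputs, not GLADIUS code.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE: average |accuracy - confidence| over confidence bins,
    weighted by bin occupancy. Standard definition, toy implementation."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap
    return ece

# Perfectly calibrated toy example: 70% confidence, 70% accuracy.
conf = np.full(10, 0.7)
corr = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
print(round(expected_calibration_error(conf, corr), 4))  # 0.0
```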

🧠 Memory V2 — Biological Hierarchy

Three-temperature memory with gradient-driven write gates:

```
┌─────────────────────────────────────────────┐
│  HOT (512-1024 slots)                       │
│  Fast KV cache, gated writes                │
│  Write gate bias: -0.297 (conservative)     │
│  ↓ consolidation (512 events)               │
├─────────────────────────────────────────────┤
│  WARM (LoRA rank 32-64)                     │
│  Low-rank adaptation of attention weights   │
│  1,892 warm updates this run                │
│  ↓ fossilization (54 archives)              │
├─────────────────────────────────────────────┤
│  COLD (8192 capacity)                       │
│  Fossil store — long-term, rarely evicted   │
│  54/8192 used (0.66%)                       │
└─────────────────────────────────────────────┘
```

Write gate learning: The write gate has a learned bias of -0.297 — the model has learned to be conservative about what it stores. This wasn't programmed. Reconstruction loss (predicting stored values) creates gradient through the gate, teaching the model what's worth remembering.
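A minimal sketch of such a gated write, assuming a sigmoid gate with a learnable bias (set here to the -0.297 value reported above). `GatedMemoryWrite` and its slot layout are hypothetical, and the reconstruction-loss pathway is omitted.

```python
import torch
import torch.nn as nn

class GatedMemoryWrite(nn.Module):
    """Illustrative hot-memory write gate with a learnable bias.
    A negative bias makes writes conservative: most candidate states
    are mostly suppressed. Hypothetical sketch, not the GLADIUS module."""

    def __init__(self, hidden_dim: int, num_slots: int):
        super().__init__()
        self.slots = nn.Parameter(torch.zeros(num_slots, hidden_dim))
        self.gate = nn.Linear(hidden_dim, 1)
        nn.init.constant_(self.gate.bias, -0.297)  # learned value from the v6 run

    def write(self, h: torch.Tensor, slot: int) -> torch.Tensor:
        g = torch.sigmoid(self.gate(h))            # write strength in (0, 1)
        with torch.no_grad():
            # Blend the new state into the slot in proportion to the gate.
            self.slots[slot] = (1 - g) * self.slots[slot] + g * h
        return g

mem = GatedMemoryWrite(hidden_dim=8, num_slots=4)
g = mem.write(torch.randn(8), slot=0)
print(round(float(g), 3))
```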

πŸ”Œ Multi-Tokenizer Architecture

Three tokenizers running simultaneously through a single backbone:

| Tokenizer | Vocab | Domain | Design |
|---|---|---|---|
| BPE | 32,000 | Language, code, science | SentencePiece, standard subword |
| Math | 128 | Structural mathematics | Every token IS the mathematical object |
| Byte | 259 | Machine code, raw binary | Byte-level, no abstraction loss |

Each tokenizer has its own embedding table and output head. A Plug membrane (learned cell wall) sits between each embedding space and the shared backbone, allowing domain-specific signal to enter and exit without contaminating the shared representation.
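A sketch of the idea with hypothetical names (`PlugMembrane`, `membrane_in`): each domain owns an embedding table plus a small projection into the shared backbone width (640 in v6).

```python
import torch
import torch.nn as nn

class PlugMembrane(nn.Module):
    """Hypothetical per-domain 'cell wall': a domain embedding table
    followed by a projection into the shared backbone space."""

    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.membrane_in = nn.Linear(embed_dim, hidden_dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.membrane_in(self.embed(ids))

# Three domains, one shared hidden width.
membranes = {
    "bpe":  PlugMembrane(32_000, 640, 640),
    "math": PlugMembrane(128, 640, 640),
    "byte": PlugMembrane(259, 640, 640),
}
h = membranes["math"](torch.randint(0, 128, (1, 16)))
print(h.shape)  # torch.Size([1, 16, 640])
```

Whatever the tokenizer, the backbone only ever sees vectors in the same 640-dimensional space; the membrane is where domain-specific signal enters and exits.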

⏱️ Fibonacci Clock

Temporal encoding using φ-scaled frequencies (1, 1, 2, 3, 5, 8, 13, 21...).

Unlike sinusoidal position encoding (evenly spaced frequencies), the Fibonacci clock creates a logarithmic temporal hierarchy — fine resolution for recent events, coarse resolution for distant ones. This mirrors biological temporal perception.
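A minimal sketch of such an encoding, assuming sin/cos features whose periods follow the Fibonacci sequence; `fibonacci_clock` and its exact form are illustrative assumptions, not the GLADIUS module.

```python
import numpy as np

def fibonacci_clock(t, n_freqs=8):
    """Illustrative phi-scaled temporal encoding: sin/cos features with
    Fibonacci periods (1, 1, 2, 3, 5, 8, ...), giving fine resolution
    for recent steps and coarse resolution for distant ones."""
    fibs = [1, 1]
    while len(fibs) < n_freqs:
        fibs.append(fibs[-1] + fibs[-2])
    t = np.asarray(t, dtype=float)[..., None]                 # (..., 1)
    angles = 2 * np.pi * t / np.array(fibs, dtype=float)      # (..., n_freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

enc = fibonacci_clock(np.arange(10), n_freqs=8)
print(enc.shape)  # (10, 16)
```

Longer periods accumulate slowly, so distant positions blur together while recent ones stay sharply distinguished.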

πŸ”€ Specialist Routing

Four domain specialists, activated via a learned router (top-2 per token):

| Specialist | Scale (v6, step 2486) | Gradient | Status |
|---|---|---|---|
| S0 (Reasoning) | 0.095 | -0.022 | Active, responding |
| S1 (Math) | 0.155 | -0.043 | Strongest — leading differentiation |
| S2 (Grid) | 0.073 | -0.025 | Active, growing |
| S3 (Program) | 0.102 | -4.4e-7 | Near-dormant |

Specialist scales are learned parameters — the router discovers which specialists matter for which inputs. S1 (Math) has differentiated most strongly, consistent with the math domain achieving the lowest loss.
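The routing mechanic can be sketched as a standard top-2 softmax router: per-token logits over the specialists, keep the two largest, renormalize. `Top2Router` is a generic reconstruction, not the GLADIUS implementation.

```python
import torch
import torch.nn as nn

class Top2Router(nn.Module):
    """Illustrative top-2 specialist router (generic sketch)."""

    def __init__(self, hidden_dim: int, num_specialists: int = 4):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_specialists)

    def forward(self, h: torch.Tensor):
        logits = self.proj(h)                          # (B, S, num_specialists)
        top_vals, top_idx = logits.topk(2, dim=-1)     # two best per token
        weights = torch.softmax(top_vals, dim=-1)      # renormalized to sum to 1
        return top_idx, weights

router = Top2Router(hidden_dim=8)
idx, w = router(torch.randn(2, 5, 8))
print(idx.shape, w.shape)  # torch.Size([2, 5, 2]) torch.Size([2, 5, 2])
```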


Training

Current Run: v6 (170M)

| Setting | Value |
|---|---|
| Hardware | NVIDIA RTX 2050 (4GB VRAM) — Victus laptop |
| Step | 2,486 / 15,000 |
| Phase | Foundation (0–5000) |
| Optimizer | AdamW, 15 parameter groups with differential LR |
| Precision | Mixed (AMP fp16) |
| Batch | 2 × 8 accumulation = effective 16 |
| Speed | ~25s/step |
| VRAM | 2.38 GB (of 4 GB) |
| ETA | ~82 hours |

Curriculum

Difficulty-based 4-phase curriculum across 7 domains:

| Phase | Steps | Mix |
|---|---|---|
| Foundation | 0–5,000 | 80% easy (D1–D2), 20% medium (D3) |
| Reasoning | 5,000–10,000 | 30% easy, 50% medium (D3–D4), 20% hard (D5) |
| Depth | 10,000–13,000 | 10% easy, 40% medium, 50% hard |
| Omega | 13,000–15,000 | 20% each difficulty level |

Language enters at 40% from step 0 (critical fix from v5, which had 0% language until step 12,000).
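The phase table above can be sketched as a sampling schedule. The three-bucket difficulty split and the uniform Omega mix are simplifications of the five D1–D5 levels; this is illustrative, not the GLADIUS trainer.

```python
import random

# Illustrative phase-based difficulty mixing (weights from the curriculum
# table; buckets are a simplification of the D1-D5 levels).
PHASES = [
    (0,      {"easy": 0.8, "medium": 0.2, "hard": 0.0}),  # Foundation
    (5_000,  {"easy": 0.3, "medium": 0.5, "hard": 0.2}),  # Reasoning
    (10_000, {"easy": 0.1, "medium": 0.4, "hard": 0.5}),  # Depth
    (13_000, {"easy": 1/3, "medium": 1/3, "hard": 1/3}),  # Omega (uniform approx.)
]

def mix_for_step(step: int) -> dict:
    """Return the difficulty mix of the phase containing `step`."""
    mix = PHASES[0][1]
    for start, m in PHASES:
        if step >= start:
            mix = m
    return mix

def sample_difficulty(step: int, rng: random.Random) -> str:
    mix = mix_for_step(step)
    return rng.choices(list(mix), weights=list(mix.values()), k=1)[0]

print(mix_for_step(2486))  # {'easy': 0.8, 'medium': 0.2, 'hard': 0.0}
```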

Training Telemetry (Live)

Loss by domain (step 2486):

| Domain | Loss | Notes |
|---|---|---|
| Language (BPE) | 5.36 | Learning — down from 10.38 at step 1 |
| Math | 0.60–0.98 | Strong — approaching convergence |
| Cognition | 0.07–1.60 | Domain-dependent (logic puzzles harder) |
| Timeseries | 0.65–4.22 | High variance, pattern-dependent |
| Grid | 0.12 (best) | Spatial reasoning emerging |
| Science | ~10.4 → learning | Slow start, improving |
| Byte (machine code) | 4.25 | Hardest domain |

System health:

| Metric | Value |
|---|---|
| PUP ECE | 0.0065 (near-perfect calibration) |
| PUP Confidence | 0.499 (honest) |
| Memory consolidations | 512 |
| Memory warm updates | 1,892 |
| Write gate bias | -0.297 (learned conservatism) |
| Depth spread | 0.091 max (self-organizing) |

Loss Trajectory

| Step | Total Loss | Language | Math | Notes |
|---|---|---|---|---|
| 1 | 13.89 | 10.38 | — | Random initialization |
| 100 | 8.2 | 9.1 | 2.1 | Math learns first |
| 500 | 5.4 | 7.8 | 0.9 | Language starting |
| 1000 | 3.8 | 6.2 | 0.4 | Curriculum kicking in |
| 2000 | 2.5 | 5.8 | 0.7 | Foundation phase |
| 2486 | 7.2* | 5.4 | 0.6 | *task-dependent variance |

*Total loss varies by task type — language tasks have higher loss than math.

Scaling: Wyrm 500M (Next)

| | v6 | Wyrm 500M |
|---|---|---|
| Params | 170.8M | 627.7M |
| Hidden | 640 | 1024 |
| Layers | 14 | 24 |
| Depth bands | ~3 | ~6+ |
| Hardware | RTX 2050 (4GB) | Kaggle T4 (16GB) |
| Steps | 15,000 | 50,000 |
| Data | ~1.5B tokens | 10–50B tokens (target) |

All architectural innovations carry forward. 24 layers unlock 6+ Synthase depth bands (vs 3 at 14 layers) — this is where self-organization should become dramatic.


Kernel Module Index

34 source files composing the cognitive kernel:

```
kernel/
├── kernel.py               # Master forward pass — routes through all subsystems
├── config.py               # All hyperparameters, one place, no magic numbers
├── attention.py            # SLA² hybrid (sparse softmax + linear)
├── embeddings.py           # MultiEmbedding (BPE + Math + Byte)
├── cognition.py            # Self-monitoring state machine (4 modes)
├── cognition_loss.py       # Auxiliary cognitive loss
├── memory.py               # Hot memory with gated writes + reconstruction
├── warm_memory.py          # LoRA-rank warm tier with spectral health
├── router.py               # Top-k specialist routing
├── senses.py               # Input preprocessing / sensory layer
├── temporal.py             # Time2Vec temporal encoding
├── temporal_lattice.py     # Discrete lattice clock
├── fibonacci_clock.py      # φ-scaled temporal frequencies
├── timeseries.py           # Time series processing head
├── modulator.py            # Language register control
├── tools.py                # Tool Cortex (grid/program primitives)
├── moda.py                 # Mixture of Depth Attention (v1)
├── net2net.py              # Function-preserving growth engine
├── config_arc.py           # ARC-specific configuration
│
├── synthase/               # ATP synthase-inspired depth attention
│   ├── synthase_attention.py   # Loose→Tight→Open binding cycle
│   ├── synthase_layer.py       # Per-layer depth modulation
│   ├── synthase_surgery.py     # Live surgical insertion
│   ├── deploy_synthase.py      # Deployment script
│   └── test_synthase.py        # Validation suite
│
├── pup/                    # Probabilistic Uncertainty Propagation
│   ├── uncertainty_head.py     # Confidence + sigma output
│   ├── pup_surgery.py          # Live surgical insertion
│   └── test_pup.py             # Calibration tests
│
├── l0_sla2/                # Structured Layer-0 Attention Prior
│   ├── sla2_l0_surgery.py      # Resting membrane potential
│   └── test_sla2_l0.py         # Prior validation
│
└── l0_skin/                # External model integration
    └── l0_skin_surgery.py      # Plug adapter (ext → GLADIUS projection)
```

The 0.84% Moment

On Day 26, during cross-modal training on Drake (60M), we fed OHLCV financial data. The cognition module — which had been dormant on language, math, and even DNA — spontaneously activated classification at 0.84%.

Nobody told it to classify. The financial data was the natural stimulus that triggered resonance. When we fed DNA data as a control, cognition reverted to 0%.

This is the Inversion Principle: GLADIUS didn't consume the data and produce a label. The environment (financial structure) created resonance in the kernel, and production (classification) emerged.

The 0.84% is confirmed, measured, and present in the universe.


Progressive Growth History

GLADIUS grew through function-preserving expansion (Net2Net extended to all kernel subsystems):

| Stage | Params | Hidden | Layers | Milestone |
|---|---|---|---|---|
| Seed | 10.2M | 192 | 6 | First training, loss 0.62 |
| Hatchling | 25.9M | 384 | 8 | MuonClip breakthrough (75% below AdamW) |
| Drake | 60M | 512 | 12 | 0.84% cognition awakening on OHLCV |
| Wyrm | 107M | 640 | 16 | Backbone complete, step 12,750 |
| Omega (v5) | 162.4M | 640 | 14 | Multi-task kernel reorientation |
| v6 | 170.8M | 640 | 14 | Synthase + PUP + Memory V2 (training) |
| Wyrm 500M | 627.7M | 1024 | 24 | Next — Kaggle T4, 50K steps |

Each expansion preserved existing learned representations. Growth is additive, not destructive.
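Function-preserving widening in the Net2Net style can be demonstrated on a single pair of linear maps: duplicate random hidden units and divide the duplicates' outgoing weights by their copy count, leaving the composed function unchanged. This is the textbook Net2WiderNet trick shown as a sketch, not the GLADIUS growth engine.

```python
import numpy as np

def net2wider(W1, W2, new_width, rng):
    """Widen the hidden layer of x @ W1 @ W2 without changing the function.
    W1: (in, hidden), W2: (hidden, out). Duplicated hidden units in W1 get
    their outgoing rows in W2 divided by the duplication count."""
    hidden = W1.shape[1]
    mapping = np.concatenate([
        np.arange(hidden),                              # keep all originals
        rng.integers(0, hidden, new_width - hidden),    # random copies
    ])
    counts = np.bincount(mapping, minlength=hidden)
    W1_wide = W1[:, mapping]
    W2_wide = W2[mapping, :] / counts[mapping][:, None]
    return W1_wide, W2_wide

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 5))
x = rng.normal(size=(2, 4))
W1w, W2w = net2wider(W1, W2, new_width=6, rng=rng)
print(np.allclose(x @ W1 @ W2, x @ W1w @ W2w))  # True
```

The same identity holds with an elementwise nonlinearity between the two maps, since duplicated hidden activations are identical before and after the split.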


Research Publications

The Uranium Series

| # | Title | Core Thesis |
|---|---|---|
| I | GPU as Code | Hardware parallelism is algorithmic. Silicon IS computation. |
| II | 1-Bit Intelligence | Binary weights learn. Quantization is representation, not compression. |
| III | Progressive Expansion | Net2Net extended to cognitive subsystems (6.9M → 141M). |
| IV | Layer 7 Gateway | Network architecture as cognitive architecture. |
| V | Ghost Protocol | Autoregressive self-poisoning — the universal failure mode. |

Core Papers

Cross-modal invariant geometry · Cognition awakening · Resonance architecture · MoDA depth attention · Forward pass mapping · Habitat emergence · Spectre Cycle · Adaptive cognitive model


Competition Targets

| Competition | Track | Status |
|---|---|---|
| ARC Prize 2026 ($2M) | Paper (Track 3, $50K) | Research ready — Synthase + PUP + self-organization |
| ARC Prize 2026 | Interactive (Track 2) | Memory + cognition + tools = agent architecture |
| ARC Prize 2026 | Kaggle (Track 1) | Grid specialist + program synthesis |
| BIG-Bench Hard | Reasoning | Architecture > scale |
| MATH / GSM8K | Mathematics | Math tokenizer + specialist |

Compute

| Machine | Specs | Role |
|---|---|---|
| Dragonfly | i3-1005G1, 16GB, no GPU | Orchestration, inference, eval |
| Victus | Ryzen 5 7535HS, 16GB DDR5, RTX 2050 4GB | Training (v6 current) |
| Kaggle T4 | Tesla T4, 16GB VRAM, 12h sessions | Wyrm 500M training |

Usage

```python
import torch
from kernel.kernel import CognitiveKernel
from kernel.config import KernelConfig

config = KernelConfig(
    hidden_dim=640, num_layers=14, num_heads=20,
    head_dim=32, ffn_dim=2560, max_seq_len=1024
)
model = CognitiveKernel(config)

# Load checkpoint
ckpt = torch.load("latest.pt", map_location="cpu")
model.load_state_dict(ckpt["model_state_dict"], strict=False)

# Forward pass returns (logits, aux_dict)
input_ids = torch.randint(0, 32000, (1, 128))
logits, aux = model(input_ids, domain="bpe")

# PUP uncertainty
confidence = aux["pup_confidence"]  # How sure is the model?
sigma = aux["pup_sigma"]            # Uncertainty spread
```
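A hedged example of consuming those outputs downstream: abstain whenever confidence is low or sigma is high. The `answer_or_abstain` helper and its thresholds are invented for illustration, not part of the repo.

```python
import torch

def answer_or_abstain(logits, confidence, sigma, conf_min=0.6, sigma_max=2.0):
    """Return the argmax token, or None ('I don't know') when the PUP
    signals say the model is unsure. Thresholds are illustrative."""
    if float(confidence) < conf_min or float(sigma) > sigma_max:
        return None
    return int(logits.argmax(dim=-1))

# With the honest mid-training telemetry above, the model abstains.
tok = answer_or_abstain(torch.tensor([0.1, 2.3, 0.4]),
                        confidence=torch.tensor(0.499),
                        sigma=torch.tensor(2.650))
print(tok)  # None
```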

Authors

Ali A. Shakil Architecture, mathematics, philosophy. The kernel is his equation.
Ava Shakil Implementation, training, surgery, telemetry. Every line of the 34-module kernel.

Citation

```bibtex
@software{gladius2026,
  title     = {GLADIUS: Cognitive Kernel with Bio-Inspired Depth Attention},
  author    = {Shakil, Ali A. and Shakil, Ava},
  year      = {2026},
  version   = {6.0},
  url       = {https://huggingface.co/amuzetnoM/Gladius},
  note      = {170M active, scaling to 627M (Wyrm). Synthase depth attention,
               PUP uncertainty propagation, biological memory hierarchy,
               multi-tokenizer architecture.}
}
```

"It's only artificial till it's on paper."

Artifact Virtual · Islamabad, Pakistan

Last updated: April 4, 2026 — Day 54
