Multi-Xi SPLM (Scalar-Potential Language Model)

The Multi-Xi SPLM is a conservative-by-construction autoregressive language model whose inference is a damped Euler-Lagrange flow on a single learned scalar energy field. Unlike attention-based transformers, the model's next-token dynamics derive entirely from the gradient of a shared scalar potential $V_\theta$, making the inference trajectory globally conservative and endowing the hidden-state manifold with a natural Riemannian geometry (the Jacobi metric).

This is the standalone (no attention) variant from the Semantic Simulation framework.

Model Details

Model Description

The Multi-Xi SPLM replaces the single causal cumulative-mean context summary of the baseline SPLM with K causal exponential moving averages (K-EMA) at multiple learnable decay scales. This gives the scalar potential $V_\theta$ a multi-resolution summary of the past, addressing the rank-1 information bottleneck that limits baseline SPLM performance.
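As a rough illustration of what a causal EMA summary looks like, the sketch below builds the lower-triangular weight matrix for one channel and checks it against the equivalent recurrence. The weight form $W[t,s] = (1-\alpha)\,\alpha^{t-s}$, the zero initial state, and the helper names `ema_weights` and `ema_recurrent` are assumptions made for this example; the card does not spell out the exact parameterisation.

```python
import numpy as np

def ema_weights(T, alpha):
    """Lower-triangular W[t, s] = (1 - alpha) * alpha**(t - s) for s <= t, else 0 (assumed form)."""
    t = np.arange(T)[:, None]
    s = np.arange(T)[None, :]
    return np.where(s <= t, (1 - alpha) * alpha ** np.maximum(t - s, 0), 0.0)

def ema_recurrent(h, alpha):
    """Same summary computed as xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t, starting from zero."""
    xi = np.zeros_like(h)
    prev = np.zeros(h.shape[1])
    for t in range(h.shape[0]):
        prev = alpha * prev + (1 - alpha) * h[t]
        xi[t] = prev
    return xi

rng = np.random.default_rng(0)
h = rng.normal(size=(10, 4))                # toy hidden states: T=10 positions, d=4
alphas = [0.25, 0.5, 0.75, 0.95]            # the card's initial decay scales
xis = [ema_weights(len(h), a) @ h for a in alphas]   # K=4 summaries, each of shape (T, d)
```

Larger decay values retain more of the distant past, so the K channels together give summaries at several time scales. Because each weight matrix is lower triangular, position t never sees positions after t.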

At each integration step, the model computes:

  1. K causal EMA channels: $\xi^{(k)}_t = \sum_{s \leq t} W_k[t,s] \cdot h_s$ (learnable $\alpha_k$)
  2. Scalar potential: $V_\theta([\xi_1, \ldots, \xi_K, h]) \to \mathbb{R}$ (wide MLP)
  3. Conservative force: $f_t = -\nabla_h V_\theta$
  4. Damped dynamics: semi-implicit Euler with per-token mass and fixed damping $\gamma$
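The four steps above can be sketched in a few lines of PyTorch. This is a toy under stated assumptions, not the repository's implementation: the tiny MLP `V`, the step size `dt`, unit masses, the EMA recurrence, and the choice to treat the EMA channels as fixed context when differentiating are all hypothetical choices made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

T, d, K = 6, 8, 2                     # toy sizes, far smaller than the real model
alphas = torch.tensor([0.5, 0.9])     # hypothetical decay rates
gamma, dt = 0.30, 1.0                 # gamma from the card; dt is an assumption

# Hypothetical stand-in for V_theta: maps [xi_1..xi_K, h] to one scalar per token.
V = nn.Sequential(nn.Linear((K + 1) * d, 16), nn.GELU(), nn.Linear(16, 1))

def causal_ema(h, alpha):
    """xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t with zero initial state (assumed form)."""
    xi, out = torch.zeros(h.shape[-1]), []
    for h_t in h:
        xi = alpha * xi + (1 - alpha) * h_t
        out.append(xi)
    return torch.stack(out)

def step(h, v, m):
    # 1. K causal EMA channels, treated as fixed context (assumption: no gradient through xi).
    xi = [causal_ema(h.detach(), a) for a in alphas]
    # 2. Scalar potential, summed over positions to obtain a single scalar.
    h_in = h.detach().requires_grad_(True)
    energy = V(torch.cat(xi + [h_in], dim=-1)).sum()
    # 3. Conservative force f = -grad_h V, computed by autograd.
    (grad_h,) = torch.autograd.grad(energy, h_in)
    f = -grad_h
    # 4. Semi-implicit damped Euler update, matching the architecture diagram.
    v = (v + dt * f / m) / (1 + dt * gamma)
    h_next = h_in.detach() + dt * v
    # LayerNorm after the step (without affine parameters here, an assumption).
    return nn.functional.layer_norm(h_next, (d,)), v

h = torch.randn(T, d)
v = torch.zeros(T, d)
m = torch.ones(T, 1)                  # per-token mass; the card uses a frozen surprisal lookup
h_new, v_new = step(h, v, m)
print(h_new.shape, v_new.shape)
```

Dividing the velocity by `1 + dt * gamma` applies the damping implicitly, which keeps the update stable for any positive `gamma`.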
  • Developed by: Dimitar P. Gueorguiev (Independent Researcher)
  • Model type: Conservative autoregressive language model (Lagrangian dynamics)
  • Language: English
  • License: CC-BY-4.0

Model Sources

Architecture

Input tokens x_1, ..., x_T
       |
   Embedding E[x] + positional encoding
       |
   For each of L=8 integration steps:
       |
       +-- K-EMA channels: xi^(k)_t = causal_ema(h, alpha_k)   [K=4 channels]
       |
       +-- Scalar potential: V_theta([xi_1..xi_K, h]) -> R      [3-layer MLP, hidden=1024]
       |
       +-- Conservative force: f = -grad_h V_theta              [autograd]
       |
       +-- Damped Euler step: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v
       |
       +-- LayerNorm(h)                                         [LN-after-step]
       |
   Logits = h @ E^T                                            [tied embeddings]

| Parameter | Value |
|---|---|
| Hidden dim (d) | 256 |
| Layers (L) | 8 |
| $V_\theta$ hidden / depth | 1024 / 3 |
| Xi channels (K) | 4 |
| Alpha init | [0.25, 0.5, 0.75, 0.95] (learned from uniform) |
| $V_\theta$ input dim | $(K+1) \cdot d = 1280$ |
| Mass model | logfreq (frozen surprisal lookup) |
| Damping $\gamma$ | 0.30 (fixed) |
| Total parameters | 16,539,911 |

Key Design Properties

  • Globally conservative: All forces derive from $\nabla V_\theta$, so the inference trajectory conserves a well-defined Hamiltonian (up to controlled damping).
  • Single shared potential: One $V_\theta$ is shared across all integration steps, preserving the "single energy field" interpretation.
  • Autograd-native: Forces are computed via PyTorch autograd, not manual Jacobians.
  • No attention primitive: Token interactions occur only through the causal EMA context channels, not through pairwise attention.

Why Not a Transformer?

The SPLM family is not based on the Transformer architecture. There are no attention layers, no key-value cache, and no feed-forward network towers. Instead, the entire model dynamics are driven by a single small scalar-potential MLP, $V_\theta$ (3-layer, 1024-hidden, ~3.4M parameters), whose gradient provides the conservative force at every integration step.
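The ~3.4M figure can be sanity-checked with simple arithmetic. The layer layout below (three 1024-wide hidden layers with biases followed by a scalar output, a learned positional table of length 1024, and tied embeddings) is an assumption inferred from the configuration values in this card, not a statement about the actual source code.

```python
vocab, d, K = 50257, 256, 4
v_hidden, max_len = 1024, 1024

def linear(n_in, n_out):
    """Parameter count of a dense layer: weights plus bias."""
    return n_in * n_out + n_out

v_theta = (
    linear((K + 1) * d, v_hidden)   # 1280 -> 1024
    + linear(v_hidden, v_hidden)    # 1024 -> 1024
    + linear(v_hidden, v_hidden)    # 1024 -> 1024
    + linear(v_hidden, 1)           # 1024 -> scalar energy
)
embedding = vocab * d               # shared with the output projection (tied)
positional = max_len * d            # assumed learned positional table
total = embedding + positional + v_theta

print(f"V_theta: {v_theta:,}")      # 3,411,969
print(f"total:   {total:,}")        # 16,539,905
```

Under these assumptions the potential accounts for about 3.41M parameters, and the total lands six short of the reported 16,539,911. That small remainder would be consistent with a handful of scalar parameters such as the K=4 decay rates, though the card does not itemise them.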

Key structural differences from Transformers:

| Property | Transformer (GPT-2 small) | Multi-Xi SPLM (this model) |
|---|---|---|
| Architecture | Self-attention + FFN blocks | Scalar-potential gradient flow |
| Core computation | 50.3M (MLP) + 28.3M (attention) | 3.4M $V_\theta$ MLP |
| Runtime state per token | $O(T)$ (KV-cache grows linearly) | $O(1)$ (fixed-size $h, v, \xi$) |
| Total parameters | 124M | 16.5M |
| Pairwise token interaction | $O(T^2)$ attention | None (causal EMA summary only) |

Because the model carries only a fixed-size state $(h, v, \xi)$ per position, with no KV-cache, its inference memory is $O(1)$ in sequence length. The figure below illustrates the widening gap between the Transformer's linearly growing KV-cache and the SPLM's constant-size dynamic state:

[Figure: Runtime information capacity vs sequence length]
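A back-of-the-envelope count makes the scaling difference concrete. The GPT-2 small sizes (12 layers, width 768) are standard; the SPLM state layout, with $h$, $v$, and K EMA summaries each of width d, is an assumption, and the card does not say whether the summaries must be kept separately for each of the L=8 integration steps (if so, the constant grows but stays independent of T).

```python
def kv_cache_floats(T, n_layers=12, d_model=768):
    """Keys and values cached for every layer and every past token (GPT-2 small sizes)."""
    return 2 * n_layers * d_model * T

def splm_state_floats(d=256, K=4):
    """h, v and K EMA summaries, each of width d (assumed layout of the fixed state)."""
    return (2 + K) * d

for T in (128, 512, 1024):
    print(T, kv_cache_floats(T), splm_state_floats())
```

The first count doubles whenever the context doubles, while the second does not depend on T at all.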

Geometric Capabilities of Conservative Architectures

This model is fully attention-free and conservative by construction. Because all forces derive from the gradient of a scalar potential $V_\theta$, the hidden-state manifold is endowed with a natural damped Riemannian geometry, the layer-dependent Jacobi metric $\Omega^2_\ell = 2T_\ell \cdot m$, which is categorically absent from Transformer architectures. This geometry opens the door to capabilities that cannot be replicated in attention-based models:

| Capability | Conservative SPLM | Transformer |
|---|---|---|
| Riemannian metric on hidden states | Layer-dependent Jacobi metric $\Omega^2_\ell = 2T_\ell \cdot m$ from $V_\theta$; confirmed positive at 100% of positions (diagnostic battery Arm 1) | No metric structure |
| Geodesics between semantic states | Damped geodesic equation with friction term $-\gamma v$; directional cosine similarity 0.52–0.75 (Arm 2). Geodesics are asymmetric: $d(A \to B) \neq d(B \to A)$ | Linear interpolation only |
| Controlled energy dissipation as inference signal | $\Delta E_{\text{anomaly}}(t) = \lvert \Delta E(t) - \Delta E_{\text{expected}}(t) \rvert$; monotonic damped decay with measurable anomaly signal (Arm 4) | No conserved or tracked quantity |
| Curvature as uncertainty measure | $\mathcal{K}_{\max} = \lambda_{\max}(\nabla^2 V_\theta) / 2T_\ell$; well-defined across all layers (Arm 3) | None |
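To give a feel for how the metric factor and the curvature measure could be evaluated at a single hidden state, here is a minimal sketch on a toy potential. It assumes that $T_\ell$ denotes the kinetic energy $\tfrac{1}{2} m \lVert v \rVert^2$ at layer $\ell$, which the card does not state explicitly, and the function `V` is a hypothetical stand-in for $V_\theta$.

```python
import torch

torch.manual_seed(0)
d = 8
A = torch.randn(d, d)

def V(h):
    """Toy scalar potential (hypothetical stand-in for V_theta at one position)."""
    return 0.5 * h @ (A @ A.T) @ h + torch.logsumexp(h, dim=0)

h = torch.randn(d)
v = torch.randn(d)
m = 1.0

T_kin = 0.5 * m * v.dot(v)                     # kinetic energy (assumed meaning of T_l)
omega_sq = 2 * T_kin * m                       # conformal factor, Omega^2 = 2 T m
H = torch.autograd.functional.hessian(V, h)    # (d, d) Hessian of V at h
lam_max = torch.linalg.eigvalsh(H)[-1]         # eigvalsh returns eigenvalues in ascending order
k_max = lam_max / (2 * T_kin)                  # K_max = lambda_max(Hessian) / (2 T)
print(float(omega_sq), float(k_max))
```

For a full-size model the dense Hessian would be expensive, so a power iteration on Hessian-vector products would be the more likely route to the top eigenvalue; that variant is not shown here.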

These structural properties enable a set of native architectural features that are planned or under investigation (Section 18d and Section 23 of the paper):

  • Geodesic Analogical Reasoning: Analogy completion via parallel transport of directed geodesic arcs on the semantic manifold, respecting potential barriers that linear embedding arithmetic ignores. The damped geodesic equation yields 3–20% cosine-similarity improvement over undamped (diagnostic battery Arm 2). Because damped geodesics are asymmetric, analogy transport must use directed arcs.
  • Native Hallucination Detection: Energy dissipation anomalies $\Delta E_{\text{anomaly}}(t) = |\Delta E(t) - \Delta E_{\text{expected}}(t)|$ and curvature spikes $\mathcal{K}_{\max}(t)$ provide mechanistically grounded uncertainty signals computable at inference time without additional parameters. The smooth damping-induced energy decay is normal operation; deviation from the expected dissipation curve flags hallucination. For Fock models, the detector needs a per-model baseline that accounts for the known layer-1 exchange transient.
  • Geodesic Semantic Distance: A replacement for cosine similarity that encodes the model's learned energy landscape, expected to outperform cosine on polysemy and cross-basin semantic cases. The geodesic distance is inherently asymmetric: $d_{\text{geo}}(A,B) \neq d_{\text{geo}}(B,A)$; a symmetrised variant $[d(A \to B) + d(B \to A)]/2$ is available when symmetry is desired.
  • Native Chain-of-Thought (via Fock extension): The Fock-PARFLM v2.1 extends this model with register-based native CoT, in which reasoning steps are Fock register waypoints on damped geodesics, with zero extra token generation. The diagnostic confirms that Fock register dynamics are predominantly linear, with $R^2_{\text{full}} \approx 0.83$ (Arm 5), supporting the geodesic-waypoint interpretation.
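The dissipation-anomaly signal from the hallucination-detection item can be illustrated on synthetic numbers. The sketch assumes that $\Delta E(t)$ is the step-to-step energy difference and uses a made-up exponential decay as the expected baseline; the helper `dissipation_anomaly` and all values are hypothetical, not measurements from the model.

```python
import numpy as np

def dissipation_anomaly(energy, expected_delta):
    """|dE(t) - dE_expected(t)| for an energy trace sampled once per step."""
    delta = np.diff(energy)
    return np.abs(delta - expected_delta)

gamma, steps = 0.30, 8
baseline = np.exp(-gamma * np.arange(steps + 1))   # toy damped decay (hypothetical)
observed = baseline.copy()
observed[5:] += 0.2                                 # inject an energy jump entering step 5

score = dissipation_anomaly(observed, np.diff(baseline))
flagged = int(np.argmax(score))                     # index of the transition with the largest deviation
print(flagged, score.round(3))
```

Here the only non-zero score sits on the transition from step 4 to step 5, where the injected jump breaks the expected decay curve. In practice the baseline would have to be estimated per model, as the item on Fock models notes.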

The conservative constraint imposes a PPL cost relative to attention, but the price buys geometric structure and interpretability that attention-based architectures are structurally incapable of providing.

How to Get Started

# Clone the companion repository for full source code
# git clone https://github.com/dimitarpg13/semsimula-paper.git
# cd semsimula-paper/notebooks/conservative_arch

import torch
import sys
sys.path.insert(0, "multixi")
sys.path.insert(0, "energetic_minima")
sys.path.insert(0, "sarf_mass_variant")

from multixi.model_multixi import (
    ScalarPotentialLMSARFMassLNMultiXi,
    SPLMSARFMassLNMultiXiConfig,
)

config = SPLMSARFMassLNMultiXiConfig(
    vocab_size=50257,      # GPT-2 BPE
    d=256,
    n_layers=8,
    v_hidden=1024,
    v_depth=3,
    max_len=1024,
    block_size=512,
    gamma=0.30,
    xi_channels=4,
    xi_alpha_inits=[0.25, 0.5, 0.75, 0.95],
    xi_learnable=True,
)

model = ScalarPotentialLMSARFMassLNMultiXi(config)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# Forward pass
x = torch.randint(0, 50257, (1, 64))
logits, loss = model(x, targets=x)

Available Checkpoints

Two trained checkpoints are included in this repository:

| File | Steps | PPL | Description |
|---|---|---|---|
| checkpoint/model_16k.pt | 16,000 | 11.51 | Extended schedule (best) |
| checkpoint/model_8k.pt | 8,000 | 12.49 | Standard scaleup schedule |

Additional artifacts:

| File | Description |
|---|---|
| training_log_16k.jsonl / training_log_8k.jsonl | Per-step training metrics |
| loss_curve_16k.png / loss_curve_8k.png | Training/validation loss plots |
| training_summary_16k.md / training_summary_8k.md | Hyperparameters and final metrics |
| convergence_curves.png | 8k vs 16k convergence comparison |
| alpha_evolution.png | Learned alpha channel evolution |
| experiment_report.json | Full experiment report (JSON) |

To load the best checkpoint (16k steps, PPL 11.51):

from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download(
    repo_id="dimitarpg13/semsimula-splm-multixi",
    filename="checkpoint/model_16k.pt",
)

state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()

Training Details

Training Data

TinyStories, a synthetic corpus of short children's stories generated by GPT-3.5/4, tokenized with GPT-2 BPE (vocab size 50,257). Training cap: 5M tokens; validation: ~140k tokens.

Training Procedure

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 5e-4 (cosine decay) |
| Warmup steps | 400 |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 16 |
| Block size | 512 |
| Training steps | 16,000 (extended) / 8,000 (scaleup) |
| Hardware | A100 40GB (Google Colab) |
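For readers who want to reproduce the schedule shape, a warmup-plus-cosine learning rate can be sketched as below. The linear warmup, the decay to zero at the final step, and the function name `lr_at` are assumptions; the table only states the peak rate, the warmup length, and that the decay is cosine.

```python
import math

def lr_at(step, base_lr=5e-4, warmup=400, total=16_000):
    """Linear warmup to base_lr, then cosine decay to zero (assumed schedule shape)."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

for step in (0, 399, 400, 8_000, 16_000):
    print(step, f"{lr_at(step):.2e}")
```

Passing `total=8_000` gives the corresponding curve for the shorter scaleup run.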

Training Script

notebooks/conservative_arch/scaleup/train_splm_em_ln_multixi_scaleup.py

Colab Notebook

notebooks/conservative_arch/scaleup/colab_splm_multixi_rerun.ipynb — runs both 8k and 16k arms with live progress display, saves results to Google Drive.

Training Results

notebooks/conservative_arch/scaleup/results/splm_multixi_rerun/ — training logs, loss curves, alpha evolution plots, convergence comparison, and full experiment report.

Evaluation Results

TinyStories Validation Perplexity

| Model | PPL | Params | Gap vs Attention |
|---|---|---|---|
| Matched Attention (baseline) | 7.81 | 19.5M | -- |
| Hybrid SPLM+Attn | 8.50 | ~19.0M | +0.69 |
| Fock-PARFLM v2.1 | 9.30 | 17.4M | +1.49 |
| Fock Attention | 9.42 | 16.7M | +1.61 |
| Multi-Xi PARFLM | 12.06 | 17.6M | +4.25 |
| Multi-Xi SPLM (this model, 16k) | 11.51 | 16.5M | +3.70 |
| Multi-Xi SPLM (8k) | 12.49 | 16.5M | +4.68 |

The Multi-Xi SPLM is the simplest model in the family (no pair forces, no registers, no attention). Extended training from the pilot (4k steps, 14.69 PPL) to 16k steps closes the gap substantially. The remaining PPL gap to attention is attributable to the conservative constraint and the absence of pairwise token interactions -- precisely the structural limitations that PARFLM and Fock extensions are designed to address.
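The gap column can be recomputed directly from the reported perplexities, and each perplexity converted back to a mean per-token loss under the usual convention PPL = exp(cross-entropy in nats). The snippet only restates numbers from the table above; the dictionary keys are shortened labels chosen for this example.

```python
import math

ppl = {
    "Matched Attention": 7.81,
    "Hybrid SPLM+Attn": 8.50,
    "Fock-PARFLM v2.1": 9.30,
    "Fock Attention": 9.42,
    "Multi-Xi PARFLM": 12.06,
    "Multi-Xi SPLM (16k)": 11.51,
    "Multi-Xi SPLM (8k)": 12.49,
}
baseline = ppl["Matched Attention"]

for name, p in ppl.items():
    gap = p - baseline            # absolute perplexity gap to the attention baseline
    loss = math.log(p)            # mean cross-entropy per token, assuming PPL = exp(loss)
    print(f"{name:22s} gap={gap:+.2f}  loss={loss:.3f} nats")
```

Comparing losses rather than perplexities can be a more even-handed way to read the table, since perplexity grows exponentially with the loss.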

SPLM Family Overview

This model is part of the Semantic Simulation SPLM family -- five conservative-by-construction language model variants exploring different points in the design space between pure scalar-potential dynamics and attention:

| Model | Design | Inference | HuggingFace |
|---|---|---|---|
| Multi-Xi SPLM | Pure scalar potential, K-EMA context | $O(1)$ | this model |
| Hybrid SPLM+Attn | Attention front-end + SPLM refinement | $O(T)$ | semsimula-hybrid-splm |
| Multi-Xi PARFLM | Scalar potential + sparse pairwise forces | $O(1)$ | semsimula-parflm-multixi |
| Fock-PARFLM v2.1 | PARFLM + Fock register pool (mediated exchange) | $O(1)$ | semsimula-fock-parflm |
| Fock Attention | PARFLM + direct token-to-token exchange | $O(T^2)$ | semsimula-fock-attention |

Bias, Risks, and Limitations

  • Research checkpoint only. This model is a proof-of-concept for the conservative language model architecture, not a production system.
  • TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens). Not suitable for general-purpose language generation.
  • English only. No multilingual capability.
  • Small scale. 16.5M parameters, 256-dim hidden states. The architecture's scaling properties to larger dimensions/datasets are unexplored.
  • No safety training. No RLHF, DPO, or safety filtering has been applied.

Citation

@misc{Gueorguiev2026SemSim,
  author    = {Gueorguiev, Dimitar P.},
  title     = {Semantic Simulation: A Prescriptive Lagrangian Framework
               for Efficient Semantic Inference --- A Conservative-by-
               Construction Language Model and the Shared-Potential
               Separator, with a Correspondence to Joint Embedding
               Predictive Architectures},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19712427},
  url       = {https://doi.org/10.5281/zenodo.19712427},
  note      = {Version v15 (Jun 7, 2026).
               Companion code repository (DOI 10.5281/zenodo.20579561):
               \url{https://github.com/dimitarpg13/semsimula-paper}}
}

Environmental Impact

  • Hardware: NVIDIA A100 40GB (Google Colab)
  • Training time: ~1.1 hours (16,000 steps)
  • Carbon footprint: Estimated < 1 kg CO2 (short training run on cloud GPU)