Multi-Xi SPLM (Scalar-Potential Language Model)

The Multi-Xi SPLM is a conservative-by-construction autoregressive language model whose inference is a damped Euler--Lagrange flow on a single learned scalar energy field. Unlike attention-based Transformers, the model's next-token dynamics derive entirely from the gradient of a shared scalar potential $V_\theta$, making the inference trajectory conservative up to the controlled damping and endowing the hidden-state manifold with a natural Riemannian geometry (the Jacobi metric).

This is the standalone (no attention) variant from the Semantic Simulation framework.

Model Details
Architecture
Why Not a Transformer?
Geometric Capabilities of Conservative Architectures
How to Get Started
Training Details
Evaluation Results
SPLM Family Overview
Bias, Risks, and Limitations
Citation
Environmental Impact

Model Details

Model Description

The Multi-Xi SPLM replaces the single causal cumulative-mean context summary of the baseline SPLM with K causal exponential moving averages (K-EMA) at multiple learnable decay scales. This gives the scalar potential $V_\theta$ a multi-resolution summary of the past, addressing the rank-1 information bottleneck that limits baseline SPLM performance.

At each integration step, the model computes:

K causal EMA channels: $\xi^{(k)}_t = \sum_{s \leq t} W_k[t,s] \cdot h_s$, where $W_k$ holds the causal EMA weights of channel $k$, set by a learnable decay $\alpha_k$
Scalar potential: $V_\theta([\xi_1, \ldots, \xi_K, h]) \to \mathbb{R}$ (wide MLP)
Conservative force: $f_t = -\nabla_h V_\theta$
Damped dynamics: semi-implicit Euler with per-token mass and fixed damping $\gamma$
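The K-EMA channels in the first step above can be sketched in a few lines. This is an illustration only: the weight convention $W_k[t,s] = (1-\alpha_k)\,\alpha_k^{t-s}$ (with $\alpha_k$ as the retention factor and a zero initial state) is an assumption, not something the card specifies, and the helper names are hypothetical. Scalars stand in for the $d$-dimensional hidden states.

```python
from typing import List


def causal_ema(h: List[float], alpha: float) -> List[float]:
    """Causal EMA via the recursion xi_t = alpha * xi_{t-1} + (1 - alpha) * h_t.

    Assumed convention (not confirmed by the source): alpha is the retention
    factor and the state starts at zero, so W[t, s] = (1 - alpha) * alpha**(t - s).
    """
    xi, out = 0.0, []
    for h_t in h:
        xi = alpha * xi + (1.0 - alpha) * h_t
        out.append(xi)
    return out


def causal_ema_explicit(h: List[float], alpha: float) -> List[float]:
    """Same quantity written as the explicit weighted sum over s <= t."""
    return [
        sum((1.0 - alpha) * alpha ** (t - s) * h[s] for s in range(t + 1))
        for t in range(len(h))
    ]


alphas = [0.25, 0.5, 0.75, 0.95]   # initial decay scales listed in the card
h = [1.0, 0.0, 0.0, 2.0, -1.0]     # toy scalar "hidden states"
channels = [causal_ema(h, a) for a in alphas]  # K = 4 summaries per position
```

Under this convention each channel needs only its previous value to advance one position, which is what keeps the context summary fixed-size; a small $\alpha_k$ tracks recent tokens while a large one retains a longer history.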

Developed by: Dimitar P. Gueorguiev (Independent Researcher)
Model type: Conservative autoregressive language model (Lagrangian dynamics)
Language: English
License: CC-BY-4.0

Model Sources

Paper: Semantic Simulation: A Prescriptive Lagrangian Framework for Efficient Semantic Inference
Repository: github.com/dimitarpg13/semsimula-paper
Model source code: notebooks/conservative_arch/multixi/model_multixi.py

Architecture

Input tokens x_1, ..., x_T
       |
   Embedding E[x] + positional encoding
       |
   For each of L=8 integration steps:
       |
       +-- K-EMA channels: xi^(k)_t = causal_ema(h, alpha_k)   [K=4 channels]
       |
       +-- Scalar potential: V_theta([xi_1..xi_K, h]) -> R      [3-layer MLP, hidden=1024]
       |
       +-- Conservative force: f = -grad_h V_theta              [autograd]
       |
       +-- Damped Euler step: v += dt*f/m; v /= (1 + dt*gamma); h += dt*v
       |
       +-- LayerNorm(h)                                         [LN-after-step]
       |
   Logits = h @ E^T                                            [tied embeddings]

Parameter	Value
Hidden dim (d)	256
Layers (L)	8
$V_\theta$ hidden / depth	1024 / 3
Xi channels (K)	4
Alpha init	[0.25, 0.5, 0.75, 0.95] (learned from uniform)
$V_\theta$ input dim	$(K+1) \cdot d = 1280$
Mass model	logfreq (frozen surprisal lookup)
Damping $\gamma$	0.30 (fixed)
Total parameters	16,539,911
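The damped update in the diagram (`v += dt*f/m; v /= (1 + dt*gamma); h += dt*v`) can be tried on a toy problem. The sketch below swaps the learned $V_\theta$ for a one-dimensional quadratic potential, which is purely an assumption for illustration; the mass and step size `dt` are likewise illustrative values, and only `gamma = 0.30` comes from the table above.

```python
def damped_step(h, v, force, mass, dt, gamma):
    """One update in the order shown in the architecture diagram."""
    v = v + dt * force(h) / mass      # kick from the conservative force
    v = v / (1.0 + dt * gamma)        # implicit treatment of the damping term
    h = h + dt * v                    # drift with the updated velocity
    return h, v


# Toy stand-in for V_theta (assumption): V(h) = 0.5 * k * h**2, so f = -k * h.
k = 1.0


def potential(h):
    return 0.5 * k * h * h


def force(h):
    return -k * h


def energy(h, v, mass):
    return 0.5 * mass * v * v + potential(h)


mass, dt, gamma = 1.0, 0.1, 0.30      # gamma from the card; mass and dt illustrative
h, v = 1.0, 0.0
e0 = energy(h, v, mass)
h1, v1 = damped_step(h, v, force, mass, dt, gamma)   # a single step, for inspection
for _ in range(200):
    h, v = damped_step(h, v, force, mass, dt, gamma)
e_final = energy(h, v, mass)
print(f"E0={e0:.4f}  E_final={e_final:.6f}")
```

Dividing by `(1 + dt*gamma)` treats the friction term implicitly, so the velocity shrinks by a bounded factor for any positive step size; on this toy potential the total energy decays toward zero rather than being conserved exactly.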

Key Design Properties

Globally conservative: All forces derive from $\nabla V_\theta$ -- the inference trajectory conserves a well-defined Hamiltonian (up to controlled damping).
Single shared potential: One $V_\theta$ is shared across all integration steps, preserving the "single energy field" interpretation.
Autograd-native: Forces are computed via PyTorch autograd, not manual Jacobians.
No attention primitive: Token interactions occur only through the causal EMA context channels, not through pairwise attention.

Why Not a Transformer?

The SPLM family is not based on the Transformer architecture. There are no attention layers, no key-value cache, and no feed-forward network towers. Instead, the entire model dynamics are driven by a single small scalar-potential MLP, $V_\theta$ — 3-layer, 1024-hidden, ~3.4M parameters — whose gradient provides the conservative force at every integration step.
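The quoted ~3.4M figure can be checked with a quick count. The layer layout below (three hidden layers of width 1024 with biases, followed by a scalar readout) is an assumption about how "depth 3" is realised; it is used here because it reproduces the stated size, not because the source confirms it.

```python
d, K = 256, 4                  # hidden dim and number of xi channels (from the card)
v_hidden, v_depth = 1024, 3    # V_theta width and depth (from the card)
vocab_size = 50257             # GPT-2 BPE vocabulary


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


v_in = (K + 1) * d             # 1280 inputs: K context channels plus h
# Assumed layout: v_depth hidden layers of width v_hidden, then a scalar readout.
widths = [v_in] + [v_hidden] * v_depth + [1]
v_params = sum(linear_params(a, b) for a, b in zip(widths, widths[1:]))

embed_params = vocab_size * d  # tied embedding matrix, counted once

print(f"V_theta: {v_params:,}   token embedding: {embed_params:,}")
```

Under these assumptions $V_\theta$ comes to 3,411,969 parameters and the tied token embedding to 12,865,792, so most of the 16.5M total sits in the embedding table rather than in the dynamics.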

Key structural differences from Transformers:

Property	Transformer (GPT-2 small)	Multi-Xi SPLM (this model)
Architecture	Self-attention + FFN blocks	Scalar-potential gradient flow
Core compute parameters	56.7M (MLP) + 28.3M (attention)	3.4M $V_\theta$ MLP
Runtime state per token	$O (T)$ — KV-cache grows linearly	$O (1)$ — fixed-size $h, v, \xi$
Total parameters	124M	16.5M
Pairwise token interaction	$O (T^{2})$ attention	None (causal EMA summary only)

Because the model carries only a fixed-size state $(h, v, \xi)$ per position, with no KV-cache, its inference memory is $O (1)$ in sequence length. The accompanying figure illustrates the widening memory gap between the Transformer's linearly growing KV-cache and the SPLM's constant-size dynamic state.
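A back-of-the-envelope count makes the memory comparison concrete. The sketch assumes the GPT-2 small shape (12 layers, model width 768) for the KV-cache and counts the SPLM state as one $h$, one $v$, and $K$ channels $\xi$ of width $d$; an implementation that keeps a separate set of channels per integration step would multiply the SPLM figure by a constant, which does not change the scaling. The function names are hypothetical.

```python
def kv_cache_floats(seq_len: int, n_layers: int = 12, d_model: int = 768) -> int:
    """Keys and values cached for every layer and every past position."""
    return 2 * n_layers * d_model * seq_len


def splm_state_floats(d: int = 256, K: int = 4) -> int:
    """h, v, and K context channels of width d; independent of sequence length."""
    return (2 + K) * d


for T in (128, 1024, 8192):
    print(f"T={T:5d}  KV-cache={kv_cache_floats(T):>12,}  SPLM state={splm_state_floats():,}")
```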

Geometric Capabilities of Conservative Architectures

This model is fully attention-free and conservative by construction. Because all forces derive from the gradient of a scalar potential $V_\theta$, the hidden-state manifold is endowed with a natural damped Riemannian geometry (the layer-dependent Jacobi metric $\Omega^2_\ell = 2T_\ell \cdot m$), which is categorically absent from Transformer architectures. This geometry opens the door to capabilities that cannot be replicated in attention-based models:

Capability	Conservative SPLM	Transformer
Riemannian metric on hidden states	Layer-dependent Jacobi metric $\Omega^2_\ell = 2T_\ell \cdot m$ from $V_\theta$; confirmed positive at 100% of positions (diagnostic battery Arm 1)	No metric structure
Geodesics between semantic states	Damped geodesic equation with friction term $-\gamma v$; directional cosine similarity 0.52–0.75 (Arm 2). Geodesics are asymmetric: $d(A \to B) \neq d(B \to A)$	Linear interpolation only
Controlled energy dissipation as inference signal	$\Delta E_{\text{anomaly}}(t) = |\Delta E(t) - \Delta E_{\text{expected}}(t)|$; monotonic damped decay with measurable anomaly signal (Arm 4)	No conserved or tracked quantity
Curvature as uncertainty measure	$\mathcal{K}_{\max} = \lambda_{\max}(\nabla^2 V_\theta) / (2T_\ell)$; well-defined across all layers (Arm 3)	None
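The two formulas in the table can be evaluated by hand on a toy state. The sketch below assumes that $T_\ell$ denotes the kinetic energy $\tfrac{1}{2} m \lVert v \rVert^2$ at layer $\ell$ (the card does not define it explicitly) and uses a made-up 2x2 Hessian in place of $\nabla^2 V_\theta$; all function names are hypothetical.

```python
import math


def kinetic_energy(v, mass):
    """T = 0.5 * m * |v|^2 (assumed meaning of T_l in the table)."""
    return 0.5 * mass * sum(x * x for x in v)


def jacobi_conformal_factor(v, mass):
    """Omega^2 = 2 * T * m."""
    return 2.0 * kinetic_energy(v, mass) * mass


def lambda_max_sym2(a, b, c):
    """Largest eigenvalue of the symmetric matrix [[a, b], [b, c]]."""
    return 0.5 * (a + c) + math.sqrt((0.5 * (a - c)) ** 2 + b * b)


def curvature_proxy(hessian, v, mass):
    """K_max = lambda_max(Hessian of V) / (2 * T)."""
    (a, b), (_, c) = hessian
    return lambda_max_sym2(a, b, c) / (2.0 * kinetic_energy(v, mass))


v = [0.3, -0.4]                        # toy velocity, |v|^2 = 0.25
mass = 2.0                             # toy per-token mass
hessian = [[2.0, 1.0], [1.0, 2.0]]     # toy Hessian with eigenvalues 1 and 3

omega_sq = jacobi_conformal_factor(v, mass)
k_max = curvature_proxy(hessian, v, mass)
print(f"Omega^2 = {omega_sq:.3f}   K_max = {k_max:.3f}")
```

With this reading, $\Omega^2_\ell$ is positive whenever the velocity is non-zero, and $\mathcal{K}_{\max}$ grows as the state slows down or the potential becomes more sharply curved.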

These structural properties enable a set of native architectural features that are planned or under investigation (Section 18d and Section 23 of the paper):

Geodesic Analogical Reasoning: Analogy completion via parallel transport of directed geodesic arcs on the semantic manifold, respecting potential barriers that linear embedding arithmetic ignores. The damped geodesic equation yields 3–20% cosine-similarity improvement over undamped (diagnostic battery Arm 2). Because damped geodesics are asymmetric, analogy transport must use directed arcs.
Native Hallucination Detection: Energy dissipation anomalies $\Delta E_{\text{anomaly}}(t) = |\Delta E(t) - \Delta E_{\text{expected}}(t)|$ and curvature spikes $\mathcal{K}_{\max}(t)$ provide mechanistically grounded uncertainty signals computable at inference time without additional parameters. The smooth damping-induced energy decay is normal operation; deviation from the expected dissipation curve flags hallucination. For Fock models, the detector needs a per-model baseline that accounts for the known layer-1 exchange transient.
Geodesic Semantic Distance: A replacement for cosine similarity that encodes the model's learned energy landscape, expected to outperform cosine on polysemy and cross-basin semantic cases. The geodesic distance is inherently asymmetric: $d_{\text{geo}}(A,B) \neq d_{\text{geo}}(B,A)$; a symmetrised variant $[d(A \to B) + d(B \to A)]/2$ is available when symmetry is desired.
Native Chain-of-Thought (via Fock extension): The Fock-PARFLM v2.1 extends this model with register-based native CoT — reasoning steps as Fock register waypoints on damped geodesics, with zero extra token generation. The diagnostic confirms Fock register dynamics are predominantly linear — $R^2_{\text{full}} \approx 0.83$ (Arm 5), supporting the geodesic-waypoint interpretation.
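As a rough illustration of the dissipation-anomaly signal and the symmetrised distance described above, the sketch below applies both formulas to made-up numbers. It assumes $\Delta E(t)$ is the energy change between consecutive steps; the energy trace, the expected-dissipation baseline, and the flagging threshold are all hypothetical and are not taken from the paper or the diagnostic battery.

```python
def energy_anomaly(energies, expected_deltas):
    """|dE(t) - dE_expected(t)| for consecutive steps of an energy trace."""
    deltas = [b - a for a, b in zip(energies, energies[1:])]
    return [abs(d - e) for d, e in zip(deltas, expected_deltas)]


def symmetrised_distance(d_ab, d_ba):
    """[d(A -> B) + d(B -> A)] / 2 for an asymmetric geodesic distance."""
    return 0.5 * (d_ab + d_ba)


energies = [1.00, 0.80, 0.64, 0.70, 0.60]      # made-up energy trace
expected = [-0.20, -0.16, -0.128, -0.1024]     # made-up expected dissipation
anomaly = energy_anomaly(energies, expected)

threshold = 0.1                                # hypothetical flagging threshold
flags = [a > threshold for a in anomaly]
print(flags)
```

Only the third step, where the energy rises instead of decaying, exceeds the threshold here; in practice the baseline and threshold would have to be calibrated per model.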

The conservative constraint imposes a PPL cost relative to attention (11.51 vs. 7.81 for the matched attention baseline on TinyStories validation), but the price buys geometric structure and interpretability that attention-based architectures are structurally incapable of providing.

How to Get Started

# Clone the companion repository for full source code
# git clone https://github.com/dimitarpg13/semsimula-paper.git
# cd semsimula-paper/notebooks/conservative_arch

import torch
import sys
sys.path.insert(0, "multixi")
sys.path.insert(0, "energetic_minima")
sys.path.insert(0, "sarf_mass_variant")

from multixi.model_multixi import (
    ScalarPotentialLMSARFMassLNMultiXi,
    SPLMSARFMassLNMultiXiConfig,
)

config = SPLMSARFMassLNMultiXiConfig(
    vocab_size=50257,      # GPT-2 BPE
    d=256,
    n_layers=8,
    v_hidden=1024,
    v_depth=3,
    max_len=1024,
    block_size=512,
    gamma=0.30,
    xi_channels=4,
    xi_alpha_inits=[0.25, 0.5, 0.75, 0.95],
    xi_learnable=True,
)

model = ScalarPotentialLMSARFMassLNMultiXi(config)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

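# Forward pass on random token ids (smoke test only). With `targets` supplied,
# the model returns the loss alongside the logits.
x = torch.randint(0, 50257, (1, 64))
logits, loss = model(x, targets=x)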

Available Checkpoints

Two trained checkpoints are included in this repository:

File	Steps	PPL	Description
`checkpoint/model_16k.pt`	16,000	11.51	Extended schedule (best)
`checkpoint/model_8k.pt`	8,000	12.49	Standard scaleup schedule

Additional artifacts:

File	Description
`training_log_16k.jsonl` / `training_log_8k.jsonl`	Per-step training metrics
`loss_curve_16k.png` / `loss_curve_8k.png`	Training/validation loss plots
`training_summary_16k.md` / `training_summary_8k.md`	Hyperparameters and final metrics
`convergence_curves.png`	8k vs 16k convergence comparison
`alpha_evolution.png`	Learned alpha channel evolution
`experiment_report.json`	Full experiment report (JSON)

To load the best checkpoint (16k steps, PPL 11.51):

from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download(
    repo_id="dimitarpg13/semsimula-splm-multixi",
    filename="checkpoint/model_16k.pt",
)

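# `model` is the instance constructed in the snippet above; the checkpoint
# stores its weights under the "model_state_dict" key.
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()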

Training Details

Training Data

TinyStories -- a synthetic corpus of short children's stories generated by GPT-3.5/4, tokenized with GPT-2 BPE (vocab size 50,257). Training cap: 5M tokens; validation: ~140k tokens.

Training Procedure

Hyperparameter	Value
Optimizer	AdamW
Learning rate	5e-4 (cosine decay)
Warmup steps	400
Weight decay	0.01
Gradient clipping	1.0
Batch size	16
Block size	512
Training steps	16,000 (extended) / 8,000 (scaleup)
Hardware	A100 40GB (Google Colab)
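The schedule in the table can be sketched as a function of the step index. The linear warmup shape and the decay floor of zero are assumptions (the table gives only the peak rate, "cosine decay", and 400 warmup steps), and the token count assumes one batch of 16 x 512 tokens per optimizer step with no gradient accumulation.

```python
import math


def lr_at(step, peak=5e-4, warmup=400, total=16_000, floor=0.0):
    """Linear warmup to `peak`, then cosine decay to `floor` (both shapes assumed)."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


tokens_per_step = 16 * 512                # batch size x block size
tokens_seen = tokens_per_step * 16_000    # extended schedule
passes = tokens_seen / 5_000_000          # relative to the 5M-token training cap

print(f"lr at step 400: {lr_at(400):.1e}, tokens seen: {tokens_seen:,}, passes: {passes:.1f}")
```

Under these assumptions the extended run processes about 131M tokens, roughly 26 passes over the 5M-token training cap.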

Training Script

notebooks/conservative_arch/scaleup/train_splm_em_ln_multixi_scaleup.py

Colab Notebook

notebooks/conservative_arch/scaleup/colab_splm_multixi_rerun.ipynb — runs both 8k and 16k arms with live progress display, saves results to Google Drive.

Training Results

notebooks/conservative_arch/scaleup/results/splm_multixi_rerun/ — training logs, loss curves, alpha evolution plots, convergence comparison, and full experiment report.

Evaluation Results

TinyStories Validation Perplexity

Model	PPL	Params	Gap vs Attention
Matched Attention (baseline)	7.81	19.5M	--
Hybrid SPLM+Attn	8.50	~19.0M	+0.69
Fock-PARFLM v2.1	9.30	17.4M	+1.49
Fock Attention	9.42	16.7M	+1.61
Multi-Xi PARFLM	12.06	17.6M	+4.25
Multi-Xi SPLM (this model, 16k)	11.51	16.5M	+3.70
Multi-Xi SPLM (8k)	12.49	16.5M	+4.68

The Multi-Xi SPLM is the simplest model in the family (no pair forces, no registers, no attention). Extending training from the pilot (4k steps, 14.69 PPL) to 16k steps lowers validation perplexity to 11.51, narrowing the gap to matched attention to 3.70 PPL. The remaining gap is attributable to the conservative constraint and the absence of pairwise token interactions -- precisely the structural limitations that PARFLM and Fock extensions are designed to address.
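For readers comparing against training logs, the perplexities in the table can be converted back to mean cross-entropy and the gap column recomputed. The sketch assumes the usual definition, perplexity as the exponential of the mean per-token cross-entropy in nats; the dictionary keys are shortened labels, not identifiers from the repository.

```python
import math

ppl = {
    "Matched Attention (baseline)": 7.81,
    "Hybrid SPLM+Attn": 8.50,
    "Fock-PARFLM v2.1": 9.30,
    "Fock Attention": 9.42,
    "Multi-Xi PARFLM": 12.06,
    "Multi-Xi SPLM (16k)": 11.51,
    "Multi-Xi SPLM (8k)": 12.49,
}

baseline = ppl["Matched Attention (baseline)"]
losses = {name: math.log(p) for name, p in ppl.items()}             # nats per token
gaps = {name: round(p - baseline, 2) for name, p in ppl.items()}    # PPL gap

for name in ppl:
    print(f"{name:30s} PPL={ppl[name]:6.2f}  loss={losses[name]:.3f}  gap={gaps[name]:+.2f}")
```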

SPLM Family Overview

This model is part of the Semantic Simulation SPLM family -- five conservative-by-construction language model variants exploring different points in the design space between pure scalar-potential dynamics and attention:

Model	Design	Inference scaling in $T$	HuggingFace
Multi-Xi SPLM	Pure scalar potential, K-EMA context	$O (1)$	this model
Hybrid SPLM+Attn	Attention front-end + SPLM refinement	$O (T)$	semsimula-hybrid-splm
Multi-Xi PARFLM	Scalar potential + sparse pairwise forces	$O (1)$	semsimula-parflm-multixi
Fock-PARFLM v2.1	PARFLM + Fock register pool (mediated exchange)	$O (1)$	semsimula-fock-parflm
Fock Attention	PARFLM + direct token-to-token exchange	$O (T^{2})$	semsimula-fock-attention

Bias, Risks, and Limitations

Research checkpoint only. This model is a proof-of-concept for the conservative language model architecture, not a production system.
TinyStories only. Trained exclusively on synthetic children's stories (~5M tokens). Not suitable for general-purpose language generation.
English only. No multilingual capability.
Small scale. 16.5M parameters, 256-dim hidden states. The architecture's scaling properties to larger dimensions/datasets are unexplored.
No safety training. No RLHF, DPO, or safety filtering has been applied.

Citation

@misc{Gueorguiev2026SemSim,
  author    = {Gueorguiev, Dimitar P.},
  title     = {Semantic Simulation: A Prescriptive Lagrangian Framework
               for Efficient Semantic Inference --- A Conservative-by-
               Construction Language Model and the Shared-Potential
               Separator, with a Correspondence to Joint Embedding
               Predictive Architectures},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19712427},
  url       = {https://doi.org/10.5281/zenodo.19712427},
  note      = {Version v15 (Jun 7, 2026).
               Companion code repository (DOI 10.5281/zenodo.20579561):
               \url{https://github.com/dimitarpg13/semsimula-paper}}
}

Environmental Impact

Hardware: NVIDIA A100 40GB (Google Colab)
Training time: ~1.1 hours (16,000 steps)
Carbon footprint: Estimated < 1 kg CO2 (short training run on cloud GPU)
