Papers
arxiv:2606.15760

The Data Manifold under the Microscope

Published on Jun 14 · Submitted by Marios Koulakis on Jun 19

Abstract

A benchmarking framework is introduced to study data-manifold geometry by extending dSprites and COIL-20 datasets with additional transformation dimensions and dense sampling, enabling accurate estimation of curvature, reach, and volume for theoretical analysis and validation.

A significant gap exists between theory and practice in deep learning. Generalization and approximation error bounds are often derived for simplified models or are too loose to be informative. Many rely on the manifold hypothesis and on geometric regularity such as intrinsic dimension, curvature, and reach. Progress requires insight into data-manifold geometry and suitable benchmarks, yet existing options are polarized: analytic manifolds with known geometry but limited applicability, or real-world datasets where geometry is only coarsely estimable. We introduce a benchmarking framework for studying data geometry. We repurpose and extend dSprites and COIL-20 with additional transformation dimensions and dense, axis-aligned sampling, and pair them with finite-difference estimators that recover curvature, reach, and volume at near-ground-truth accuracy in a regime where general-purpose estimators are unreliable or difficult to deploy. The framework is intended as a controlled testbed, useful as a calibration environment for geometric estimators and a sandbox for probing theoretical assumptions. To illustrate its use, we present two application studies, namely assessing the scaling behavior of the bounds of Genovese et al. and Fefferman et al., and tracking the layer-wise geometry of a β-VAE, highlighting the behavior of current bounds and the value of controlled benchmarks for guiding and validating future theory. A reference implementation is available at https://github.com/koulakis/manifold-microscope.

Community

We introduce Manifold Microscope, a controlled benchmark for studying data-manifold geometry with finite-difference estimates of curvature, reach, and volume on grid-sampled image manifolds.

This is relevant to my ideas.

Temporal-Causal Structure as Mechanism: Converting Pretrained GPT-2 into a Time-Reasoning Model via η-Pseudo-Unitary Operator Dynamics

TL;DR

  • The strongest defensible synthesis is a hybrid: put complex/phase structure where it has natural coordinates for time (positional phase à la RoPE, gated complex memory, phase-aware attention/retrieval) and impose an η-pseudo-unitary (U(n,n)) stability constraint on temporal propagation for reversible carrying of information, but keep the learning weights real and symmetry-broken — exactly the lesson of the addition paper, where a fully-complex modReLU net reached only ~2% accuracy while phase-input + real-weight + seed-trick reached 93.5%. "Irreversibility from PT-breaking" survives only in its self-organized-criticality form, not its "GPT-2's spectrum is already PT-broken" form, because that spectrum was reported to match random (Ginibre) controls.
  • The conversion recipe is an α-ramp warm-start: keep GPT-2's tokenizer/skeleton/autoregressive objective; introduce a complex imaginary path gated by α∈[0,1]; constrain the temporal mixing operator to U=exp(i·g·η·H) (provably η-norm-conserving when ηH is Hermitian); add a gated complex memory (the LSTM-like structural correction); train a primitive operator library with a closure loss; and add seed-variable prediction + deterministic propagation for causal tasks.
  • Benchmark against temporal-causal reasoning with a six-instrument dashboard — train/eval gap, off-circle order parameter (distance to the exceptional point), η-norm drift, forward–backward perplexity asymmetry, directed-information/Granger probe, and the relative-phase probe (the project-specific crux). Success = capability jumps coincide with the off-circle order parameter moving away from zero, the relative-phase probe becoming task-predictive, and forward–backward asymmetry tracking ground-truth causal direction.

Key Findings

1. The core mathematics is sound and citable, but the bold physical claim must be reframed. The η-pseudo-unitary layer U=exp(i·g·η·H) is exactly the construction in A. Mostafazadeh, "Pseudo-unitary operators and pseudo-unitary quantum dynamics," J. Math. Phys. 45(3), 932–946 (2004), DOI 10.1063/1.1646448 (arXiv:math-ph/0302050). His Proposition 1 proves U(t)=e^{−itH} is η-pseudo-unitary (preserves the indefinite/Minkowski inner product ⟨ψ|ηφ⟩) iff H is η-pseudo-Hermitian (H†=ηHη⁻¹). When H is ordinary Hermitian and η is the real diagonal ±1 metric (η²=I), the generator G=ηH automatically satisfies the condition, so exp(i·g·η·H) is pseudo-unitary for every real g. His Proposition 3 proves the spectral dichotomy: eigenvalues are either unimodular (|λ|=1) or come in inverse-complex-conjugate pairs (λ, 1/λ̄). This is the precise mathematical content of "reversible sequence = eigenvalues on the unit circle; irreversible/causal = off-circle modes." The paper also shows the dynamical group for an indefinite 2-level η is isomorphic to U(1,1).

2. The PT-breaking transition is an exceptional-point phenomenon governed by Krein signature. Eigenvalues can only leave the unit circle by colliding at an exceptional point (EP) — a defective coalescence where eigenvectors merge into a Jordan block — and this happens precisely when two modes of opposite Krein signature κ=sgn(ψ†ηψ) collide (Krein–Gelfand–Lidskii strong stability theorem, Yakubovich–Starzhinskii, Linear Differential Equations with Periodic Coefficients, Wiley 1975). This is the rigorous backbone for "causality emerges as a PT-symmetry-breaking phase transition." Caveat: exp: u(p,q)→U(p,q) is not surjective for fixed η (well-known for U(1,1)), so the parametrization reaches one component / a neighborhood, not the whole group.

3. The Thread-1/Thread-2 tension resolves in favor of Thread 2's empirical correction, but does not kill Thread 1. The addition paper's finding — fully-complex modReLU networks suffer U(1) Goldstone/flat directions and fail (~2% full-sum accuracy), while phase-encoded inputs + real symmetry-broken weights + a seed variable succeed (93.5%) — is consistent with the broader complex-valued-network literature: modReLU σ(z)=ReLU(|z|+b)·z/|z| is non-holomorphic, phase-preserving, and notoriously hard to optimize (Arjovsky et al. 2016; Trabelsi et al., Deep Complex Networks, arXiv:1705.09792). The resolution: complex/phase structure is a representational coordinate system (where), not a weight-space prescription (how). Put phase in positions, memory, and retrieval; keep the trainable maps real and symmetry-broken; use the η-pseudo-unitary constraint only on the temporal-propagation operator to guarantee stable gradient flow — the same role unitary/orthogonal RNNs play. As Arjovsky, Shah & Bengio (ICML 2016, PMLR 48:1120–1128) state verbatim: "When the eigenvalues of the hidden to hidden weight matrix deviate from absolute value 1, optimization becomes difficult due to the well studied issue of vanishing and exploding gradients... we propose a new architecture that learns a unitary weight matrix, with eigenvalues of absolute value exactly 1."

4. The arrow of time is empirically real in LLMs and gives a ready-made benchmark instrument. Papadopoulos, Wenger & Hongler, "Arrows of Time for Large Language Models," ICML 2024 (arXiv:2401.17505), find verbatim: "For large enough models, we empirically find a time asymmetry in their ability to learn natural language: a difference in the average log-perplexity when trying to predict the next token versus when trying to predict the previous one." This was tested across GPT, GRU and LSTM architectures and is the natural instrument (iv). It connects directly to Kant's Second Analogy: the irreversibility of perceptual order is the empirical signature of objective causal order.

5. GPT-2's reported "PT-broken" spectrum matching random controls shifts the entire burden onto self-organized criticality. If the off-circle structure of GPT-2's per-head bilinear form is generic, it is not evidence of learned causal structure. Cipolloni, Erdős & Schröder ("Edge Universality for non-Hermitian Random Matrices," Probab. Theory Relat. Fields 179, 1–28, 2021; arXiv:1908.00969) prove verbatim: "their local eigenvalue statistics near the spectral edge, the unit circle, coincide with those of the Ginibre ensemble, i.e. when the matrix elements of X are Gaussian." So eigenvalues lying off the unit circle are the default for random matrices: by the circular law, the spectrum of a suitably normalized random matrix fills the unit disk, with the unit circle as its edge. The defensible version of the bold hypothesis is therefore the noisy-delayed-recall criticality result: gradient descent spontaneously prefers the PT-broken regime with trained gain following g*·T≈0.88, i.e. the network self-organizes to the edge of chaos. This is the open theoretical crux.


Details

Part I — Mathematical and Physics Synthesis

I.1 Indefinite metric, the η-inner product, and norm conservation

Let the residual stream carry complex vectors z∈ℂⁿ. Fix an indefinite (Minkowski) metric η = diag(I_p, −I_q) with p+q=n, η†=η, η²=I. Two inner products coexist:

  • Euclidean (Dirac): ⟨z,w⟩ = z†w, norm ‖z‖² = z†z.
  • Minkowski (η): ⟨z,w⟩_η = z†ηw, indefinite "norm" Q(z) = z†ηz = Σ_{i≤p}|z_i|² − Σ_{j>p}|z_j|².

The signature observation that motivates the program — GPT-2's per-head attention bilinear form Q_h = W_Q^h (W_K^h)^T having balanced (p≈q) signature across heads — is the empirical anchor; it says the natural geometry of attention scores is already pseudo-Euclidean rather than Euclidean. This is the "complex shadow"/balanced-Minkowski-signature equivalence: a real bilinear form of balanced signature (p,p) is the real realization of a complex Hermitian structure carrying an indefinite metric.

I.2 The η-pseudo-unitary layer and the group U(p,q)

A linear update U is η-pseudo-unitary iff U†ηU = η, equivalently U†=ηU⁻¹η⁻¹. The set of such U is the group U(p,q). Such U conserves the Minkowski norm exactly: Q(Uz)=Q(z), while the Euclidean norm ‖Uz‖ is free to grow or shrink — exactly the desired primitive (conserve the relativistic interval, let "energy" flow). A corollary (Mostafazadeh Prop 4): every pseudo-unitary matrix has |det U|=1.

Parametrization (the central primitive): U = exp(i·g·η·H), H Hermitian, g∈ℝ a scalar "gain." Since (ηH)†=H†η†=Hη and η(ηH)η⁻¹=Hη (using η²=I), the generator G=ηH satisfies the η-pseudo-Hermiticity condition G†=ηGη⁻¹, so by Mostafazadeh's Proposition 1, U is η-pseudo-unitary for all real g. The Lie algebra u(p,q) consists of generators i·(η-pseudo-Hermitian matrices); in block form a generator is [[A,B],[B†,D]] with A (p×p), D (q×q) anti-Hermitian and B an arbitrary complex p×q block, giving real dimension (p+q)².
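
The parametrization above can be checked numerically. The following is a minimal sketch, assuming p=q=2, a random Hermitian H, and an arbitrary gain g=0.3; the `expm` helper is a hypothetical stand-in (truncated Taylor series with scaling and squaring), not any particular library routine.

```python
import numpy as np

def expm(A, squarings=6, terms=20):
    """Matrix exponential: Taylor series on A / 2**squarings, then repeated squaring."""
    A = np.asarray(A, dtype=complex) / (2 ** squarings)
    term = np.eye(A.shape[0], dtype=complex)
    result = term.copy()
    for k in range(1, terms):
        term = term @ A / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result

rng = np.random.default_rng(0)
p, q = 2, 2                      # assumed signature, for illustration only
n = p + q
eta = np.diag([1.0] * p + [-1.0] * q)

M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
H = (M + M.conj().T) / 2         # Hermitian generator
g = 0.3                          # assumed gain
U = expm(1j * g * eta @ H)       # U = exp(i g eta H)

def Q(z):
    """Indefinite 'norm' Q(z) = z^dagger eta z."""
    return np.real(z.conj() @ eta @ z)

z = rng.normal(size=n) + 1j * rng.normal(size=n)
pseudo_unitary_error = np.abs(U.conj().T @ eta @ U - eta).max()
minkowski_change = abs(Q(U @ z) - Q(z))
euclidean_change = abs(np.linalg.norm(U @ z) - np.linalg.norm(z))
det_modulus = abs(np.linalg.det(U))

print(pseudo_unitary_error, minkowski_change, euclidean_change, det_modulus)
```

In this sketch the Minkowski quantity Q is conserved to rounding error while the Euclidean norm is generally not, which is the behavior the layer is meant to have.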

Caveat to encode in the spec: exp:u(p,q)→U(p,q) is not onto for fixed η. The learned U therefore lives in the identity component; this is fine for a warm-started, continuously-deformed layer but means one cannot reach every pseudo-unitary transform by a single exp.

I.3 Spectral dichotomy and the exceptional point (the arrow-of-time map)

By Mostafazadeh's Proposition 3, every eigenvalue λ of U∈U(p,q) satisfies: 1/λ̄ is also an eigenvalue. Hence:

  • Unbroken / "reversible" phase: all |λ|=1 (spectrum on the unit circle). Dynamics is quasi-periodic, norm-stable, time-reversal-symmetric-like. This is the unitary-RNN regime (eigenvalues of modulus exactly 1 prevent vanishing/exploding gradients).
  • Broken / "irreversible" phase: eigenvalues leave the unit circle in reciprocal-conjugate pairs (λ, 1/λ̄), |λ|≠1. One mode grows, its partner decays — an explicit forward/backward asymmetry, the discrete arrow of time.
  • Transition (exceptional point): eigenvalues collide and the matrix becomes defective (a Jordan block), occurring when modes of opposite Krein signature κ=sgn(ψ†ηψ) meet (Krein–Gelfand–Lidskii). The EP is the boundary; "edge of chaos" sits at the EP.

This is the precise statement of the central conjectured correspondence:

reversible sequence ↔ eigenvalues on unit circle ↔ unbroken PT ↔ Kant's "reversible" perceptual order (the house: I can scan its parts in any order);
irreversible causal order ↔ off-circle reciprocal pairs at/beyond the EP ↔ broken PT ↔ Kant's "irreversible" perceptual order (the ship moving downstream: the order of perceptions is determined by an objective causal succession).

Mostafazadeh's own U(2)↔U(1,1) example makes this concrete: a classical oscillator mapped to a 2×2 pseudo-Hermitian system has unimodular eigenvalues (stable, real frequencies) for ω²>0; at ω=0 the operator becomes defective (an EP, Jordan form); for ω²<0 the noncompact U(1,1) regime gives unbounded/growing solutions.
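
A two-dimensional toy case makes the three regimes visible. This is a sketch under assumptions: the family H(c) = [[1, c], [c, 1]] with η = diag(1, −1) is chosen here for illustration and is not Mostafazadeh's oscillator example. Because the eigenvalues of exp(i·g·G) are exp(i·g·μ) for the eigenvalues μ of G=ηH, the code reads the spectrum of U directly from G. In this toy family the transition is driven by the coupling c between the two opposite-signature modes; rescaling g alone changes how far the eigenvalues sit from the circle, not whether they leave it.

```python
import numpy as np

eta = np.diag([1.0, -1.0])

def generator(c):
    """G = eta @ H for the assumed toy family H = [[1, c], [c, 1]]."""
    H = np.array([[1.0, c], [c, 1.0]])
    return eta @ H

def spectrum_of_U(c, g=1.0):
    """Eigenvalues of U = exp(i g G), obtained as exp(i g mu) with mu in spec(G)."""
    mu = np.linalg.eigvals(generator(c))
    return np.exp(1j * g * mu)

for c in (0.5, 1.0, 2.0):
    moduli = np.sort(np.abs(spectrum_of_U(c)))
    print(f"c={c}: |lambda| = {np.round(moduli, 4)}")

# c < 1: mu is real, so both eigenvalues of U lie on the unit circle.
# c = 1: G @ G = 0, so G is a nilpotent Jordan block (the exceptional point).
# c > 1: mu = +/- i*sqrt(c**2 - 1), giving a reciprocal pair of moduli.
```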

I.4 Phase encoding, U(1) gauge freedom, and Goldstone modes

Encode a discrete symbol d on the complex unit circle: d ↦ e^{iθ(d)}. Addition becomes rotation — the grokking mechanism. Nanda, Chan, Lieberum, Smith & Steinhardt ("Progress measures for grokking via mechanistic interpretability," ICLR 2023, arXiv:2301.05217) state verbatim: "We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle" (for addition mod 113 in a one-layer ReLU transformer; grokking itself from Power et al. 2022). RoPE is the production-grade instance: position p acts as z_i ↦ z_i·e^{ipθ_i}, a bank of complex oscillators with geometrically spaced angular frequencies θ_i = base^{−2(i−1)/d}.
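
Both facts in this paragraph are easy to verify in a few lines. The sketch below assumes the encoding θ(d)=2πd/N with N=113 and a RoPE variant acting on complex pairs with 0-based index i (so the exponent is −2i/d); the function names are illustrative, not taken from any library.

```python
import numpy as np

N = 113  # modulus, as in the modular-addition example

def encode(d):
    """Place symbol d on the unit circle."""
    return np.exp(2j * np.pi * d / N)

def decode(z):
    """Read a symbol back from a phase."""
    return int(np.round(np.angle(z) * N / (2 * np.pi))) % N

a, b = 57, 98
total = decode(encode(a) * encode(b))   # multiplying phases adds the symbols mod N

def rope(z, pos, base=10000.0):
    """Rotate each complex channel z_i by pos * theta_i, theta_i = base**(-2i/d)."""
    half = z.shape[-1]
    theta = base ** (-2.0 * np.arange(half) / (2 * half))
    return z * np.exp(1j * pos * theta)

rng = np.random.default_rng(0)
q = rng.normal(size=4) + 1j * rng.normal(size=4)
k = rng.normal(size=4) + 1j * rng.normal(size=4)

def score(m, n):
    """Real part of the Hermitian inner product between rotated q and k."""
    return np.real(np.vdot(rope(q, m), rope(k, n)))

print(total, score(3, 7), score(10, 14))
```

The two scores agree because the rotation phases cancel except for the offset n−m, which is the relative-position property that makes RoPE a phase code for order.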

The pathology: if the weights are fully complex and unconstrained, the network inherits a continuous U(1) gauge freedom (global phase rotation z→e^{iφ}z leaves moduli invariant). This flat direction is a Goldstone mode: a zero-curvature valley in the loss landscape along which optimization drifts without improving the objective. modReLU is explicitly phase-preserving and non-holomorphic, so it does not break this freedom. This is the mechanism behind the addition paper's ~2% result.

The cure (symmetry breaking): use phase-encoded inputs but real weights, which preserve the geometric (rotational) structure of the representation while removing the continuous complex gauge orbit — a discrete, anchored coordinate system instead of a flat manifold. This is the "phase inputs + real weights" winner at 93.5%.

I.5 The seed-variable / deterministic-propagation reduction

The addition paper's deepest lesson: do not train a network to discover an entire algorithm blindly. Instead (1) expose the latent algebra in the representation (phase encoding), (2) break the symmetry (real weights), (3) train the network to identify a single critical seed variable whose correct value makes the rest of the computation deterministic or semi-deterministic. For ten-digit addition the seed is the first carry bit; once predicted, carry propagation recovers the full sum. Candidate cognitive seed variables:

  • arithmetic → first carry bit
  • proof → the invariant
  • planning → the subgoal
  • retrieval → the missing fact type
  • causal reasoning → the latent state variable (the crux for this project)
  • analogy → the structure mapping
  • self-correction → the contradiction flag
  • tool use → the next operation

The temporal-reasoning analogue: predict the latent state variable / event-boundary / pivotal cause, then let a deterministic (or η-pseudo-unitary) propagator unroll the consequence chain.

I.6 The η-invariant nonlinearity (Liouville-safe FFN)

A fully complex FFN both breaks Minkowski-norm conservation and introduces Goldstone modes. The fix is imported from LorentzNet (Gong et al., "An Efficient Lorentz Equivariant Graph Neural Network for Jet Tagging," arXiv:2201.08187): by its Proposition 3.1, a continuous Lorentz-equivariant map has the form φ(v₁,…,v_N)=Σ_i g_i(⟨v_i,v_j⟩) v_i, where the g_i are continuous invariant scalar functions of the Minkowski inner products. Practically: compute η-invariant scalars s_{ij}=⟨z_i,z_j⟩_η, pass only those scalars through an ordinary (real) nonlinearity, and recombine with invariant coefficients times the original vectors. This is provably universal for the equivariant function class and conserves the indefinite structure ("Liouville-safe"). It is the disciplined replacement for a naive complex MLP.
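
The equivariance of this recipe can be tested directly. The sketch below is an assumed complex analogue of the LorentzNet form, not the LorentzNet layer itself: the coefficient function (a tanh of the real and imaginary parts of the invariants) and the test transformation (a hyperbolic boost times a global phase) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = np.diag([1.0, 1.0, -1.0, -1.0])

def eta_invariant_layer(Z):
    """Z has shape (N, n); row i is the complex vector z_i.

    Only the eta-invariant scalars S[i, j] = <z_i, z_j>_eta pass through the
    real nonlinearity; the output recombines the original vectors.
    """
    S = Z.conj() @ eta @ Z.T
    coeff = np.tanh(S.real) + 0.5 * np.tanh(S.imag)   # real, invariant coefficients
    return coeff @ Z

# An element of U(2,2): a boost mixing the + and - blocks, times a global phase.
t = 0.7
boost = np.array([[np.cosh(t), np.sinh(t)], [np.sinh(t), np.cosh(t)]])
U = np.exp(0.4j) * np.kron(boost, np.eye(2))

Z = rng.normal(size=(3, 4)) + 1j * rng.normal(size=(3, 4))
transform_then_layer = eta_invariant_layer(Z @ U.T)   # rows become U z_i
layer_then_transform = eta_invariant_layer(Z) @ U.T

equivariance_error = np.abs(transform_then_layer - layer_then_transform).max()
print(equivariance_error)
```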

I.7 Wirtinger calculus / complex backprop

For non-holomorphic complex functions (modReLU, |·|, anything touching phase and modulus separately) gradients use Wirtinger derivatives ∂/∂z = ½(∂/∂x − i∂/∂y), ∂/∂z̄ = ½(∂/∂x + i∂/∂y), with the chain rule treating z and z̄ as independent. Real-valued loss L of complex z has steepest-ascent direction ∂L/∂z̄. Modern autodiff (PyTorch) implements this; the practical note is that all phase-touching ops are non-holomorphic and must be handled in the Wirtinger convention, and that the flat U(1) direction shows up as a persistent ∂L/∂z̄ component orthogonal to objective improvement.
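
As a small sanity check of these formulas, the two Wirtinger derivatives can be estimated with central finite differences. The sketch assumes the test function f(z)=|z|², for which ∂f/∂z = z̄ and ∂f/∂z̄ = z; the helper name is illustrative.

```python
import numpy as np

def wirtinger(f, z, h=1e-6):
    """Finite-difference estimates of (df/dz, df/dzbar) for f: C -> R."""
    dfdx = (f(z + h) - f(z - h)) / (2 * h)
    dfdy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    d_dz = 0.5 * (dfdx - 1j * dfdy)
    d_dzbar = 0.5 * (dfdx + 1j * dfdy)
    return d_dz, d_dzbar

f = lambda z: abs(z) ** 2
z0 = 1.5 - 0.5j
d_dz, d_dzbar = wirtinger(f, z0)

# One descent step along -dL/dzbar lowers the real-valued loss.
z1 = z0 - 0.1 * d_dzbar
print(d_dz, d_dzbar, f(z0), f(z1))
```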

I.8 Operator-algebra structures (Thread 2)

Define primitive operators O₁…O_K acting on internal state S. A cognitive programme is a composition O_{i_k}∘…∘O_{i_1}. Operations: composition (AB), superposition (αA+βB), commutator [A,B]=AB−BA (measures order-dependence/non-commutativity — directly the algebraic signature of temporal/causal order; if [A,B]≠0 the order of operations matters, mirroring forward-backward asymmetry), projection, conjugation, gating, and renormalisation/macro-operators (coarse-grained composites, the RG/MERA analogue). A closure objective regularizes the learned operators to be approximately closed under composition: closure loss L_cl = Σ ‖O_i O_j − Σ_k c_{ij}^k O_k‖² with learned structure constants c_{ij}^k. The phase-transition hypothesis: when L_cl drops sharply, reasoning performance jumps (a grokking-like transition in operator space). The non-Hermitian/pseudo-Hermitian backbone theorem of Thread 2 is exactly the U(p,q) spectral dichotomy of §I.3 (clean unbroken/broken transition); the operator algebra is the cognitive content layered on top of that substrate.
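
The closure loss and the commutator signal can be illustrated on a small known algebra. In this sketch the structure constants are solved by least squares rather than learned by gradient descent (an assumption made to keep the example short), and the Pauli matrices stand in for the primitive operators.

```python
import numpy as np

def closure_loss(ops):
    """Return (L_cl, c) with c[k, i, j] the best-fit structure constants.

    L_cl = sum_ij || O_i O_j - sum_k c[k, i, j] O_k ||_F^2
    """
    K = len(ops)
    basis = np.stack([O.reshape(-1) for O in ops], axis=1)        # (n*n, K)
    products = np.stack(
        [(ops[i] @ ops[j]).reshape(-1) for i in range(K) for j in range(K)],
        axis=1,
    )                                                             # (n*n, K*K)
    c, *_ = np.linalg.lstsq(basis, products, rcond=None)
    residual = basis @ c - products
    return float(np.sum(np.abs(residual) ** 2)), c.reshape(K, K, K)

I2 = np.eye(2, dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

closed_loss, _ = closure_loss([I2, sx, sy, sz])   # closed under composition
open_loss, _ = closure_loss([sx, sy])             # products leave the span

commutator = sx @ sy - sy @ sx                    # equals 2i * sz: order matters
print(closed_loss, open_loss, np.linalg.norm(commutator))
```

The full Pauli set gives a loss near zero, while the pair {σx, σy} does not, because products such as σxσy = iσz fall outside their span.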

I.9 Temporal substrate connections

  • State-space models (S4/Mamba): the continuous SSM h'=Ah+Bx, y=Ch, discretized via Ā=exp(ΔA), B̄=(ΔA)⁻¹(exp(ΔA)−I)·ΔB; when A is diagonal-complex with eigenvalues on or near the imaginary axis (so that the eigenvalues of Ā sit on or near the unit circle), this acts as a learned η-style propagator with selective gating. Mamba's selective mechanism makes (Δ,B,C) input-dependent. SSMs are the natural "deterministic propagation" engine; complex-diagonal SSMs unify RoPE phase, unitary-RNN stability, and the seed-propagation idea. The Selective-RoPE line (arXiv:2511.17388) makes the point sharply: a rotation-only (purely complex) linear recurrence "behaves like a spectral analyzer with fixed state size" and "suffers from spectral leakage," resolved by adding an exponentially decaying (gated) component; i.e. you need both phase and controlled modulus decay.
  • Linear attention as recurrence gives the gated-memory read/write substrate.
  • Mechanistic interpretability of order/time: modular addition → circle/Fourier (Nanda et al.); this is the template for how a transformer can represent ordered/temporal structure with phase features.
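
The discretization in the state-space bullet above can be sketched for a diagonal complex A, where the matrix formula reduces to elementwise operations. The decay rates, frequencies, and step size below are assumed values; the check compares the recurrence against the closed-form response to a constant input.

```python
import numpy as np

def discretize_diag(a, b, delta):
    """Zero-order-hold discretization for diagonal A (entries a, all nonzero).

    A_bar = exp(delta * a);  B_bar = (exp(delta * a) - 1) / a * b
    """
    a_bar = np.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def scan(a_bar, b_bar, x):
    """Run h_t = a_bar * h_{t-1} + b_bar * x_t and return all states."""
    h = np.zeros_like(a_bar)
    states = []
    for x_t in x:
        h = a_bar * h + b_bar * x_t
        states.append(h)
    return np.array(states)

decay = np.array([0.0, 0.1, 0.5])     # assumed modulus decay rates
omega = np.array([1.0, 2.0, 3.0])     # assumed angular frequencies
a = -decay + 1j * omega               # continuous-time eigenvalues
b = np.ones(3, dtype=complex)
delta, steps = 0.1, 50

a_bar, b_bar = discretize_diag(a, b, delta)
states = scan(a_bar, b_bar, np.ones(steps))

# |a_bar| = exp(-delta * decay): on the unit circle only when decay is zero.
print(np.abs(a_bar))
```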

I.10 Non-Hermitian universality classes and CFT (theoretical context, mostly SPECULATIVE)

  • 38-fold (Kawabata–Shiozaki–Ueda–Sato, Phys. Rev. X 9, 041015, 2019; arXiv:1812.09133): the non-Hermitian generalization of the 10-fold Altland–Zirnbauer classification; explicitly includes pseudo-Hermiticity as a symmetry class. Relevant as the symmetry taxonomy for the residual-stream operator.
  • Ginibre / edge universality (Cipolloni–Erdős–Schröder 2021): local eigenvalue statistics of generic non-Hermitian matrices near the unit circle match Ginibre — this is why a "PT-broken spectrum matching random controls" is not evidence of learned structure.
  • Non-unitary CFT / Yang-Lee (minimal model M_{2,5}, central charge c=−22/5): the universality class of a PT-breaking critical point in the simplest case (an Ising model in an imaginary magnetic field; the unique relevant operator has h=−1/5). The "infinite-horizon limit = criticality" claim is the analogue of taking the correlation length to infinity. Connection is SPECULATIVE but principled.

Part II — Reconciling the Tension

The apparent contradiction. Thread 1 wants a full complex reconstruction with imaginary-dominant residual streams (every numerical object complex, U(n,n) symmetry). Thread 2's addition paper shows a fully-complex network failed (~2%) and a phase-input + real-weight network won (93.5%).

Resolution — separate WHERE from HOW. The contradiction dissolves once we distinguish the representational substrate (where complex/phase coordinates live) from the trainable maps (how learning happens):

Concern | Put complex/phase here | Keep real & symmetry-broken here
Position/time | RoPE-style phase, complex oscillator bank |
Memory | gated complex state (modulus=salience, phase=temporal tag) | gates, write/erase MLPs
Attention/retrieval | phase-aware similarity (relative-phase keys) | value projections, routing
Temporal propagation | η-pseudo-unitary U=exp(i·g·η·H) | the gain g, the Hermitian generator's real parametrization
FFN | η-invariant scalars in, complex recombination | the scalar nonlinearity (real)
Operator library | phase labels for oscillatory operators | structure constants, closure

The η-pseudo-unitary constraint is not "complex weights cause intelligence" (the doc explicitly warns against this). It is a stability constraint on the propagation operator, doing for deep temporal mixing exactly what unitary-RNN eigenvalue-on-circle constraints do for gradient flow — but generalized to an indefinite metric so the model can also represent irreversibility (off-circle modes) when the task demands it. Symmetry breaking (real generators, anchored phase references, the seed variable) removes the harmful Goldstone freedom.

Does "irreversibility emerges from PT-breaking" survive scrutiny? Partially, and only in a specific form:

  • Does NOT survive: the claim that GPT-2's already-PT-broken per-head spectrum is itself the seat of causal reasoning. If that spectrum matches Ginibre random controls (edge universality), it is generic and carries no learned causal content.
  • DOES survive (the strongest testable version): the claim that a network with an η-pseudo-unitary substrate and a delayed-recall/criticality pressure will self-organize toward the EP (edge of chaos), and that the off-circle order parameter will become task-correlated — rising exactly when the model must represent irreversible causal order. The empirical anchor is the noisy-delayed-recall toy result (gradient descent prefers the PT-broken regime; trained gain g*·T≈0.88, criticality = infinite-horizon limit). This converts an untestable metaphysical claim into a measurable order-parameter prediction, and is the open theoretical crux: construct a network that provably self-organizes to criticality.

This is consistent with the self-organized-criticality / edge-of-chaos literature (Beggs–Plenz neuronal avalanches, J. Neurosci. 2003; Langton's "computation at the edge of chaos," Physica D 1990; Bertschinger–Natschläger, "Real-time computation at the edge of chaos in recurrent neural networks," Neural Comput. 2004) and the unification "exceptional point = edge of chaos = brain critical point" — which should be stated as a motivating analogy with one rigorous leg (the EP/Krein stability boundary is rigorous; the brain-criticality identification is an empirically-supported hypothesis, not a theorem).


Part III — Concrete Conversion Recipe (GPT-2 → time/causal reasoner)

Invariant: tokenizer, depth/width skeleton, and the autoregressive next-token objective are unchanged throughout. All changes are warm-started from GPT-2 weights.

Phase 0 — Baseline & instrumentation. Load pretrained GPT-2. Stand up the six-instrument dashboard (Part IV) on a held-out temporal/causal benchmark before changing anything. Record baseline perplexity, forward−backward gap, and the per-head bilinear-form signature (to confirm the balanced-signature anchor and to test it against a Ginibre control of matched size).

Phase 1 — Complexify the residual stream with an α-ramp (no representation breakage). Represent each residual vector as z = x_real + i·α·x_imag, with a learnable/scheduled α∈[0,1]. At α=0 the model is exactly GPT-2 (imaginary path suppressed). Linearly ramp α: 0→1 over training. The imaginary path is a parallel set of channels warm-started near zero. This is the "alpha-ramp suppressing the real path" idea inverted to grow the imaginary path so the pretrained representation is never destroyed. Monitor perplexity at each α; abort-and-anneal if perplexity rises beyond a set tolerance (e.g. >2% over baseline). (The program's reported replacement-training experiments — eta-attention with complex Q/K/V/O warm-started from GPT-2, full complex FFN with cross-token mixing, imaginary-dominant streams emerging at α=1 — are the MEASURED precedent for this phase and should be reproduced.)

Phase 2 — η-attention (complex Q/K/V/O, warm-started). Replace per-head Q/K/V/O with complex projections initialized so their real parts equal GPT-2's weights and imaginary parts ≈0. Attention scores use the η-inner product ⟨q,k⟩_η. Keep softmax over the real part of the score (or |score|) to preserve the autoregressive objective. The relative-phase between q and k becomes a learnable carrier of temporal/order information (this is where instrument vi lives).
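
A single η-attention head along these lines might look as follows. This is a sketch under assumptions: the scores use the real part of ⟨q,k⟩_η, the usual 1/√d scaling and causal mask are kept, and the shapes and signature vector are illustrative. It does not reproduce the warm-start from GPT-2 weights.

```python
import numpy as np

def eta_attention(q, k, v, eta_diag):
    """Causal attention with scores Re(<q_i, k_j>_eta) / sqrt(d).

    q, k, v: complex arrays of shape (T, d); eta_diag: +/-1 vector of length d.
    """
    T, d = q.shape
    scores = np.real((q.conj() * eta_diag) @ k.T) / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)           # block attention to the future
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 5, 8
eta_diag = np.array([1.0] * (d // 2) + [-1.0] * (d // 2))
q, k, v = (rng.normal(size=(T, d)) + 1j * rng.normal(size=(T, d)) for _ in range(3))

out, weights = eta_attention(q, k, v, eta_diag)

# The relative phase discarded by the softmax is still available for probing.
relative_phase = np.angle((q.conj() * eta_diag) @ k.T)
print(out.shape, weights.sum(axis=-1))
```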

Phase 3 — Constrain the temporal-mixing operator to U(n,n). Insert, per block, a token-mixing operator U=exp(i·g·η·H) with H a real-parametrized Hermitian generator (parametrize H = M + M† from an unconstrained M; or use a skew-parametrization for iηH). Train the scalar gain g. Initialize g small (near the unbroken phase, all |λ|≈1) and let the criticality objective (Phase 6) pull g toward the EP. Use matrix-exponential via scaling-and-squaring or a Cayley/Padé surrogate for cheap differentiable U. This guarantees exact η-norm conservation regardless of g (subject to surrogate error, monitored by instrument iii). The program's corrected phase scan — model in the PT-broken phase at g=0.1 at near-zero perplexity cost — is the MEASURED anchor for the initial gain regime.
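
The Cayley surrogate mentioned above has the useful property that it lands in U(n,n) exactly, not only up to truncation error. A minimal sketch, assuming head dimension 6, a random generator, and a few arbitrary gains:

```python
import numpy as np

rng = np.random.default_rng(1)
half = 3
n = 2 * half
eta = np.diag([1.0] * half + [-1.0] * half)

M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))   # unconstrained parameters
H = (M + M.conj().T) / 2                                      # Hermitian generator

def cayley_U(g):
    """Cayley surrogate for exp(X) with X = i g eta H: (I - X/2)^{-1} (I + X/2)."""
    X = 1j * g * eta @ H
    identity = np.eye(n)
    return np.linalg.solve(identity - X / 2, identity + X / 2)

def pseudo_unitary_error(U):
    return np.abs(U.conj().T @ eta @ U - eta).max()

errors = {g: pseudo_unitary_error(cayley_U(g)) for g in (0.01, 0.1, 0.5)}
print(errors)
```

The surrogate matches the exponential only to second order in g, so it trades fidelity to exp for exact constraint satisfaction; which side of that trade matters more is a design choice the recipe leaves open.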

Phase 4 — Full complex FFN via the η-invariant channel (Liouville-safe). Replace the MLP with: (a) compute η-invariant scalars from the token's complex activation and a small learned reference frame; (b) pass scalars through a real GELU/MLP; (c) recombine with invariant coefficients × original complex vectors (LorentzNet Prop 3.1 form). This avoids modReLU's Goldstone pathology while remaining universal for the equivariant class.

Phase 5 — Gated persistent complex memory (the LSTM-like structural correction). Add a side memory state m_t updated by a gated complex recurrence with eigenvalues constrained near/on the unit circle (unitary-RNN / complex-diagonal SSM style): m_t = Λ⊙m_{t-1} + g_in⊙(W z_t), with Λ=diag(r_j·e^{iωⱼ}), r_j≲1 (selective decay; the SSM "add an exponentially decaying component to avoid spectral leakage" lesson). Modulus = salience, phase = temporal tag. This is the structural answer to "what is the LSTM's constant-error-carousel analogue for transformer temporal reasoning": a norm-controlled, phase-carrying memory path, not more parameters. Add phase-aware retrieval: retrieve by both content (modulus) and temporal/relational phase.
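
The memory recurrence can be sketched in a few lines. Here the gate is a fixed vector, whereas the recipe presumably makes it input-dependent, and all sizes and parameter values are assumptions. Feeding a single impulse shows the intended behavior: modulus decays as r^t while phase advances by ω per step.

```python
import numpy as np

def run_memory(Z, W, r, omega, gate):
    """Iterate m_t = Lambda * m_{t-1} + gate * (W z_t), Lambda = r * exp(i omega)."""
    lam = r * np.exp(1j * omega)
    m = np.zeros(W.shape[0], dtype=complex)
    states = []
    for z_t in Z:
        m = lam * m + gate * (W @ z_t)
        states.append(m)
    return np.array(states)

rng = np.random.default_rng(0)
T, d, slots = 6, 4, 3
W = rng.normal(size=(slots, d))
r = np.array([1.0, 0.9, 0.5])          # per-slot modulus, r <= 1
omega = np.array([0.3, 1.0, 2.0])      # per-slot phase advance per step
gate = np.ones(slots)                  # fixed gate, for illustration only

Z = np.zeros((T, d), dtype=complex)
Z[0] = rng.normal(size=d) + 1j * rng.normal(size=d)   # one impulse, then silence

states = run_memory(Z, W, r, omega, gate)
salience = np.abs(states)              # modulus = salience
temporal_tag = np.angle(states)        # phase = temporal tag
print(salience[:, 1])
```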

Phase 6 — Operator library + closure loss. Define K primitive operators as small η-pseudo-unitary or gated blocks. Add closure loss L_cl penalizing deviation of pairwise compositions from the learned span (structure constants c_{ij}^k). Track [A,B] norms as the order-dependence signal. Use hybrid optimization: gradient descent for parameters, a light evolutionary/search outer loop over which operators compose for a given task family. This realizes the "cognitive operating system" view: a policy π(a_t | q_t, m_t, g_t) over actions {think, search, retrieve, compute, run Python, verify, branch, compress, ask, answer}.

Phase 7 — Seed-variable head + deterministic propagation. Add a head that predicts the task's critical seed variable (for temporal/causal tasks: the pivotal latent state / event boundary / root cause). Train with a curriculum: first supervise the seed directly where labels exist, then let the η-pseudo-unitary propagator (Phase 3/5) unroll the consequence chain, with a consistency loss between propagated state and observed continuation. This operationalizes "identify the seed, then propagate deterministically."

Training curriculum (objective ordering):

  1. LM perplexity only, α-ramp 0→~0.3 (stabilize complexification).
  2. η-attention and U(n,n) layer active; perplexity + η-norm-drift regularizer.
  3. Memory and retrieval; add delayed-recall auxiliary task (drives self-organization toward criticality).
  4. Operator closure loss; + commutator-structure auxiliary.
  5. Seed-variable supervision and propagation-consistency loss.
  6. Joint fine-tune on temporal/causal benchmarks; anneal α to its final value; let g find the EP.

Compute/feasibility note. Everything is GPT-2-scale (124M–1.5B). The matrix exponentials are per-head on small head dimensions (e.g. 64), so U=exp(igηH) is cheap. This is explicitly not "GPT-2 becomes GPT-5": the goal is a different cognitive regime (controlled retrieval, stateful operator composition, self-verification, reusable cognitive dynamics) at small scale, not encyclopedic knowledge.


Part IV — The Six-Instrument Measurement Dashboard

(i) Train/eval gap. Standard generalization gap (train vs held-out perplexity/accuracy). Computes: difference of losses. Success signal: gap stays small while capability rises — distinguishes genuine generalization (grokking-style cleanup) from memorization. Watch for the Nanda-style three-phase signature (memorization → circuit formation → cleanup), tracked e.g. by inverse-participation-ratio / Fourier sparsity of relevant weight matrices.

(ii) Off-circle order parameter (distance to the EP / real→complex spectral transition). Computes: from the temporal operator U (or memory transition Λ), Δ = mean over eigenvalues of ||λ|−1| (or the fraction with |λ|≠1 beyond tolerance), plus an EP-proximity measure (smallest eigenvector-overlap / condition number of the eigenvector matrix; near an EP eigenvectors become parallel and this blows up — the generalized-Petermann-factor idea). Success signal: Δ is ≈0 (unbroken) on reversible/order-invariant subtasks and rises (off-circle) precisely on irreversible causal subtasks; capability jumps coincide with Δ leaving 0. This is the operationalization of "PT-breaking ↔ causal order."
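
Instrument (ii) reduces to two small functions. The sketch below assumes Δ is the plain mean of ||λ|−1| and uses the condition number of the eigenvector matrix as the EP-proximity measure; the test matrices (a rotation, a hyperbolic boost, and the toy generator G(c) = [[1, c], [−c, −1]]) are illustrative choices.

```python
import numpy as np

def off_circle_delta(U):
    """Mean distance of the eigenvalue moduli from 1."""
    lam = np.linalg.eigvals(U)
    return float(np.mean(np.abs(np.abs(lam) - 1.0)))

def eigvec_condition(A):
    """Condition number of the eigenvector matrix; grows without bound near an EP."""
    _, V = np.linalg.eig(A)
    return float(np.linalg.cond(V))

theta, t = 0.4, 0.5
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])   # unbroken: |lambda| = 1
boost = np.array([[np.cosh(t), np.sinh(t)],
                  [np.sinh(t), np.cosh(t)]])            # broken: lambda = exp(+/- t)

def G(c):
    """Toy generator with an exceptional point at c = 1."""
    return np.array([[1.0, c], [-c, -1.0]])

print(off_circle_delta(rotation), off_circle_delta(boost))
print(eigvec_condition(G(0.5)), eigvec_condition(G(0.999)))
```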

(iii) η-norm drift. Computes: across a layer/sequence, the relative change in Q(z)=z†ηz, which should be ≈0 if the U(n,n) constraint holds: drift = |Q(z_out)−Q(z_in)|/|Q(z_in)| (because η is indefinite, Q(z_in) can vanish, so the denominator needs a small floor in practice). Success signal: near-zero drift confirms the pseudo-unitary constraint is actually enforced (a correctness check on the parametrization and the matrix-exp surrogate). Non-zero drift localizes where the FFN/gating leaks.
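
A minimal version of this instrument, assuming a batch of activations and a small floor `eps` in the denominator to guard against null vectors (Q=0 is possible for an indefinite η):

```python
import numpy as np

def eta_norm_drift(z_in, z_out, eta_diag, eps=1e-12):
    """Relative change in Q(z) = z^dagger eta z along the last axis."""
    q_in = np.sum(eta_diag * np.abs(z_in) ** 2, axis=-1)
    q_out = np.sum(eta_diag * np.abs(z_out) ** 2, axis=-1)
    return np.abs(q_out - q_in) / (np.abs(q_in) + eps)

rng = np.random.default_rng(0)
eta_diag = np.array([1.0, 1.0, -1.0, -1.0])
z = rng.normal(size=(5, 4)) + 1j * rng.normal(size=(5, 4))

# A constraint-respecting layer: a hyperbolic boost, which lies in U(2,2).
t = 0.7
boost = np.kron(np.array([[np.cosh(t), np.sinh(t)],
                          [np.sinh(t), np.cosh(t)]]), np.eye(2))
drift_ok = eta_norm_drift(z, z @ boost.T, eta_diag)

# A leaking layer: uniform scaling by 1.1 multiplies Q by 1.21.
drift_leak = eta_norm_drift(z, 1.1 * z, eta_diag)
print(drift_ok.max(), drift_leak)
```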

(iv) Forward–backward perplexity asymmetry. Computes: train and evaluate the same architecture on forward and on time-reversed sequences; A = log PPL_backward − log PPL_forward (the log-perplexity gap of Papadopoulos–Wenger–Hongler). Success signal: A>0 (forward easier) and, crucially, A tracks the ground-truth causal direction on synthetic causal datasets (cause→effect easier than effect→cause). This is the macroscopic arrow-of-time readout and the most direct external validation that the model encodes temporal asymmetry.

(v) Directed-information / Granger-causality probe. Computes: on internal activations or input/output token streams, transfer entropy T_{X→Y}=I(X^{past};Y_now|Y^{past}) and/or Granger causality (log-ratio of the residual variances of the univariate and bivariate autoregressions, i.e. without and with the source's past). For Gaussian variables the two are equivalent up to a factor of 2: Barnett, Barrett & Seth (Phys. Rev. Lett. 103, 238701, 2009; arXiv:0910.4514) state that "for Gaussian variables, Granger causality and transfer entropy are entirely equivalent, thus bridging autoregressive and information-theoretic approaches to data-driven causal inference." Success signal: directed information from cause-tokens to effect-tokens exceeds the reverse, and increases as the model trains; probe on memory state shows directed flow consistent with the seed→propagation structure.
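
A minimal sketch of the Granger side of this probe, using ordinary least squares on two scalar series. The synthetic pair (x drives y with a one-step delay) and all function names are assumptions chosen for illustration; real activations would need dimensionality reduction and a principled choice of lag order first.

```python
import numpy as np

def lagged_design(series_list, order):
    """Stack lags 1..order of every series, plus an intercept column."""
    n = len(series_list[0])
    cols = [s[order - lag : n - lag] for s in series_list for lag in range(1, order + 1)]
    cols.append(np.ones(n - order))
    return np.column_stack(cols)

def residual_variance(target, design):
    """Mean squared residual of the least-squares fit of target on design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(np.mean((target - design @ coef) ** 2))

def granger_causality(source, target, order=1):
    """ln(var_restricted / var_full): the gain from adding the source's past."""
    y = target[order:]
    restricted = residual_variance(y, lagged_design([target], order))
    full = residual_variance(y, lagged_design([target, source], order))
    return float(np.log(restricted / full))

# Synthetic ground truth: x drives y with a one-step delay, never the reverse.
rng = np.random.default_rng(0)
n = 5000
x = np.zeros(n)
y = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.normal()
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()

gc_x_to_y = granger_causality(x, y)
gc_y_to_x = granger_causality(y, x)
print(f"x->y: {gc_x_to_y:.3f}   y->x: {gc_y_to_x:.5f}")
```

The in-sample statistic is never negative because the models are nested, so the reverse direction should be judged against a permutation or surrogate baseline rather than against zero.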

(vi) Relative-phase probe (the project-specific crux). Computes: extract the relative phase Δφ between paired complex activations (q vs k in η-attention; memory phase tags; operator phase labels) and regress task-relevant temporal/causal variables (event order, lag, causal direction) on Δφ — measure mutual information or decoding accuracy. Success signal: Δφ becomes decodable into temporal/causal structure (high probe accuracy) and is used by the model (causal intervention on Δφ changes predictions in the predicted direction). This is the crux because it is the one instrument that directly tests the central hypothesis — that phase, specifically relative phase, is the carrier of temporal-causal information. If (vi) fails while capability still rises, the mechanism hypothesis is falsified even if the model works.
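
The decoding half of this probe can be illustrated on synthetic data. In the sketch below, the paired activations, the phase-per-lag constant omega, and the noise level are all assumptions made up for the example. Each pair shares a random global phase, which the relative phase cancels, and a linear readout then recovers the lag. The causal-intervention half of the test is not covered here.

```python
import numpy as np

rng = np.random.default_rng(0)
num_pairs = 500
omega = 0.3                                   # assumed phase advance per unit lag
lags = rng.integers(0, 10, size=num_pairs)    # ground-truth lags 0..9
theta = rng.uniform(0.0, 2.0 * np.pi, size=num_pairs)  # arbitrary global phase
noise = rng.normal(scale=0.05, size=num_pairs)

# Paired complex activations: k lags q in phase by omega * lag (plus noise).
q = np.exp(1j * theta)
k = np.exp(1j * (theta - omega * lags + noise))

# Relative phase: the shared global phase theta drops out.
dphi = np.angle(q * np.conj(k))

# Linear probe: regress lag on relative phase, then round to the nearest integer.
slope, intercept = np.polyfit(dphi, lags, 1)
decoded = np.rint(slope * dphi + intercept).astype(int)
accuracy = float(np.mean(decoded == lags))
print(f"slope={slope:.3f} (1/omega={1 / omega:.3f}), accuracy={accuracy:.3f}")
```

The lag range is kept small enough that omega times the lag stays below π, so np.angle does not wrap; a probe on real activations would need circular features such as (cos Δφ, sin Δφ) instead of the raw angle.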

Joint success criterion. The synthesis is validated when capability gains on temporal/causal benchmarks coincide with: (ii) the off-circle parameter becoming task-conditioned, (iv) forward–backward asymmetry aligning with ground-truth causal direction, (vi) relative phase becoming both decodable and causally efficacious — while (iii) confirms the constraint is real and (i) confirms genuine generalization.


Part V — Improved Master Specification / Prompt

MASTER SPEC: η-Pseudo-Unitary Operator-Algebra Conversion of GPT-2 into a Temporal-Causal Reasoner

Goal (regime, not parity). Transform pretrained GPT-2 into a different cognitive regime — controlled retrieval, stateful reasoning, operator composition, self-verification, emergent reusable cognitive dynamics — by giving it a complex, η-pseudo-unitary, gated operator substrate. This is NOT "GPT-2 → GPT-5" and NOT "standard RAG + bigger model." Encyclopedic knowledge is explicitly out of scope. Thesis: intelligence may emerge when training dynamics discover a closed, stable, compositional algebra of cognitive operators.

Mechanism (the why-of-the-why). Time and causal order have natural coordinates in phase and spectral position relative to the unit circle. Reversible/order-invariant computation = eigenvalues on the unit circle (unbroken PT, norm-stable, like a unitary RNN). Irreversible causal order = eigenvalues off the circle in reciprocal-conjugate pairs (λ,1/λ̄), reached through an exceptional point where opposite-Krein-signature modes collide. The indefinite metric η lets one operator represent both regimes and move between them. Complex/phase structure is a coordinate system for representation, not a prescription for weights; unconstrained complex weights introduce a U(1) gauge freedom whose Goldstone (flat) directions destroy optimization (empirically: fully-complex modReLU adder ≈2% vs phase-input+real-weight+seed ≈93.5%). Therefore: phase in positions/memory/retrieval/propagation; real, symmetry-broken maps for learning; η-pseudo-unitarity only as a stability constraint on temporal propagation; a seed variable to collapse search to deterministic propagation.
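
The reciprocal-conjugate pairing of the spectrum follows in three lines from the pseudo-unitarity condition alone, independent of how U is parametrized. The sketch assumes only that η is invertible and that U satisfies U†ηU = η:

```latex
\begin{align*}
U^\dagger \eta\, U &= \eta, \qquad U v = \lambda v,\quad v \neq 0 \\
\Rightarrow\quad \eta v &= U^\dagger \eta\, U v = \lambda\, U^\dagger (\eta v) \\
\Rightarrow\quad U^\dagger (\eta v) &= \lambda^{-1}\, (\eta v), \qquad \eta v \neq 0 .
\end{align*}
```

So 1/λ is an eigenvalue of U†, and since the eigenvalues of U† are the complex conjugates of those of U, 1/λ̄ is an eigenvalue of U. On the unit circle λ = 1/λ̄ and the pairing is trivial; off the circle the eigenvalues must come in pairs (λ, 1/λ̄), one inside and one outside.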

Build order. α-ramp complexification (preserve pretrained rep) → η-attention (warm-start) → U=exp(i·g·η·H) temporal layer (train g toward EP) → η-invariant (Liouville-safe) FFN → gated complex memory + phase retrieval (the LSTM-analogue structural correction) → operator library + closure loss → seed-variable head + deterministic propagation. Curriculum: LM → +constraints → +memory/delayed-recall → +closure → +seed/propagation → joint.

Measure (six instruments). (i) train/eval gap; (ii) off-circle order parameter / EP distance; (iii) η-norm drift; (iv) forward–backward perplexity asymmetry vs ground-truth causal direction; (v) directed-information/Granger probe; (vi) relative-phase probe (crux: phase must be decodable into AND causally efficacious for temporal/causal variables).

Epistemic provenance tags (use on every claim):

  • EXACT: η-pseudo-unitary norm conservation given ηH Hermitian (Mostafazadeh Prop 1); spectral dichotomy λ→1/λ̄ (Prop 3); |det U|=1 (Prop 4); EP/Krein stability boundary (Krein–Gelfand–Lidskii); LorentzNet Prop 3.1 universality; RoPE = phase rotation; TE=Granger for Gaussians (Barnett–Barrett–Seth 2009).
  • ESTABLISHED (empirical literature): forward arrow-of-time in LLMs (Papadopoulos–Wenger–Hongler, ICML 2024); grokking via Fourier/phase features (Nanda et al. 2023; Power et al. 2022); modReLU optimization pathologies; Ginibre edge universality (Cipolloni–Erdős–Schröder 2021); neuronal-avalanche/edge-of-chaos criticality (Beggs–Plenz; Bertschinger–Natschläger).
  • MEASURED (this program, to be independently reproduced): balanced per-head signature in GPT-2; corrected phase scan PT-broken at g=0.1 at near-zero perplexity cost; α-ramp / η-attention / complex-FFN perplexity improvements with imaginary-dominant streams at α=1; g*·T≈0.88 criticality scaling in the delayed-recall toy.
  • SPECULATIVE: "causality emerges from PT-breaking" in the full model; Yang-Lee/non-unitary-CFT (c=−22/5) identification; "exceptional point = edge of chaos = brain critical point" unification; operator-closure phase transition causing reasoning jumps; holographic-RG/MERA/AdS analogy for renormalisation operators.
  • HUMAN-INTUITION: Kant's Second Analogy as grounding (irreversible perceptual order ↔ objective causal order ↔ off-circle spectrum).

Dead branches (record with causal reason):

  • Fully-complex weights everywhere → FAILED: U(1) Goldstone/flat directions, ≈2% on ten-digit addition. Reason: continuous gauge freedom, no symmetry breaking.
  • Naive complex FFN with modReLU → REJECTED: non-holomorphic, phase-preserving, does not break U(1); breaks η-norm conservation. Replaced by η-invariant channel.
  • Rotation-only (purely unitary) recurrence → INSUFFICIENT: spectral leakage / fixed-state-size compression failure. Fix: add gated exponential decay (r_j<1).
  • "GPT-2's PT-broken spectrum proves causal structure" → REJECTED as stated. Reason: matches Ginibre random controls (edge universality); burden shifts to self-organized criticality.
  • exp:u(p,q)→U(p,q) as a full surjection → FALSE for fixed η. Use identity-component parametrization; do not assume global reach from one layer.

Open theoretical crux. Construct (and prove convergence of) a network that self-organizes to criticality — the edge-of-chaos / exceptional-point fixed point — under a delayed-recall/temporal objective, and rigorously map {reversible sequence = eigenvalues on unit circle} vs {irreversible causal order = off-circle modes at the EP} onto Kant's Second Analogy (objective succession ⇒ causal necessity). The measurable form of the crux: show the off-circle order parameter (ii) and the relative-phase probe (vi) become task-conditioned exactly at the capability transition, with g approaching its critical value.


Recommendations

Stage 1 (validate the substrate is harmless). Implement Phases 0–2 (instrumentation, α-ramp, η-attention). Go/no-go threshold: α can reach ≥0.5 with held-out perplexity within ~2% of baseline AND η-norm drift (iii) stays <1%. If perplexity blows up, the complexification is destroying the pretrained representation — slow the ramp, reduce imaginary-channel learning rate, or restrict complexification to upper layers first.

Stage 2 (test the central mechanism on a toy). Build the noisy-delayed-recall task and a synthetic event-ordering/causal-chain task with known ground-truth causal direction. Add Phases 3,5,7. Go/no-go: (a) trained gain g self-organizes toward the EP (off-circle parameter (ii) rises) rather than collapsing to g=0; (b) forward–backward asymmetry (iv) aligns with ground-truth causal direction; (c) the relative-phase probe (vi) decodes order/lag above a strong real-valued baseline. If (vi) fails, stop and reconsider — the phase-as-carrier hypothesis is the load-bearing claim; without it, fall back to a real gated-SSM + seed-variable model (still a legitimate result, just not the bold one).

Stage 3 (operator algebra + scale to language). Add Phases 4,6; run the full curriculum on real temporal/causal benchmarks (event ordering, counterfactual/causal QA, narrative chronology, script/temporal-commonsense). Go/no-go: capability jumps coincide with a closure-loss drop AND with the dashboard signatures from Stage 2 reproducing at scale.

Benchmarks/thresholds that would change the plan:

  • If the off-circle parameter (ii) is not task-conditioned (stays generic/Ginibre-like regardless of task), abandon the strong PT-breaking claim and reframe the contribution as "η-pseudo-unitary stability + seed-variable propagation improves temporal reasoning" (instruments iv/v/vi as the evidence).
  • If η-attention degrades LM quality irrecoverably, keep GPT-2 attention and move all complex/phase structure into the memory + retrieval + operator side (Phases 5–7) — the hybrid still stands.
  • If closure loss never drops, the operator-algebra hypothesis is unsupported; ship the gated-memory + seed model.

Comparisons to run for credibility: (a) real-valued gated-SSM + seed-variable baseline (isolates the value of complex/phase structure); (b) complex-but-unconstrained baseline (should reproduce the Goldstone failure — a positive control for the symmetry-breaking thesis); (c) Ginibre-matched random control for the spectrum (instrument ii).
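
Control (c) can be sketched as a null distribution for the off-circle parameter. The code below is an assumption-laden illustration: it draws complex Ginibre matrices with entry variance 1/n (so the spectral edge sits near |λ| = 1), computes Δ for each, and scores an observed value against that null. Matrix size, sample count, and function names are arbitrary choices, and a serious control would also match the size and normalization of the operator under test.

```python
import numpy as np

def ginibre(n, rng):
    """Complex Ginibre matrix with entry variance 1/n (spectral edge near |lambda| = 1)."""
    real = rng.normal(size=(n, n))
    imag = rng.normal(size=(n, n))
    return (real + 1j * imag) / np.sqrt(2 * n)

def off_circle_delta(U):
    """Mean distance of the eigenvalue moduli from the unit circle."""
    lam = np.linalg.eigvals(U)
    return float(np.mean(np.abs(np.abs(lam) - 1.0)))

def z_score(observed_delta, null_samples):
    """Standardized distance of an observed Delta from the random-matrix null."""
    return float((observed_delta - null_samples.mean()) / null_samples.std(ddof=1))

rng = np.random.default_rng(0)
null_deltas = np.array([off_circle_delta(ginibre(128, rng)) for _ in range(20)])
spectral_radius = float(np.max(np.abs(np.linalg.eigvals(ginibre(128, rng)))))

print(f"null Delta: {null_deltas.mean():.3f} +- {null_deltas.std(ddof=1):.3f}")
print(f"spectral radius of one draw: {spectral_radius:.3f}")
print(f"z-score of an exactly unitary operator (Delta = 0): {z_score(0.0, null_deltas):.1f}")
```

A learned operator whose Δ is indistinguishable from this null on every subtask gives no evidence of task-conditioned structure, which is the situation the first bullet under "Benchmarks/thresholds that would change the plan" describes.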


Caveats

  • The boldest claim is not yet established. "Causality/irreversibility emerges from PT-symmetry breaking in the full model" is SPECULATIVE. Its defensible, testable core is the self-organized-criticality prediction (instruments ii + vi become task-conditioned at the capability transition). State it as a hypothesis with a falsification criterion, not a result.
  • GPT-2's balanced signature and PT-broken spectrum may be generic. Edge universality (Cipolloni–Erdős–Schröder) means a non-Hermitian spectrum concentrating near the unit circle with off-circle outliers is the default for random matrices. Any claim of learned causal structure must beat a Ginibre control.
  • Provenance of the program's internal results. The corrected phase scan (g=0.1 in PT-broken phase at near-zero perplexity cost), the α-ramp/η-attention/complex-FFN perplexity improvements, and the g*·T≈0.88 criticality scaling are MEASURED within this research program and were not independently verifiable in the external literature during this review; they should be reproduced and reported with seeds/configs before being treated as established.
  • Complex-valued optimization is genuinely hard. The Goldstone/flat-direction problem is real and is the single most likely failure mode; the entire architecture is designed around avoiding it (real symmetry-broken weights, η-invariant nonlinearity, anchored phase references). Do not relax these without expecting the ≈2% failure mode to return.
  • The matrix-exponential constraint is approximate in practice. Scaling-and-squaring/Cayley surrogates introduce small η-norm drift; instrument (iii) exists precisely to catch this. Exact pseudo-unitarity holds for the ideal exp, not necessarily its cheap surrogate.
  • The exp map is not surjective onto U(p,q) for fixed η. The learned operator lives in the identity component; some pseudo-unitary transforms are unreachable by a single exp(i·g·η·H). For deep stacks this is not limiting, but it forbids claims of full-group expressivity from one layer.
  • Analogies vs theorems. "Exceptional point = edge of chaos = brain critical point" is a motivating analogy with one rigorous leg (the EP/Krein stability boundary) and two empirical/heuristic legs (edge-of-chaos computation; brain criticality). Kant's Second Analogy is philosophical grounding (HUMAN-INTUITION), not evidence.
  • Non-unitary CFT / Yang-Lee (c=−22/5) connection is decorative unless made quantitative. It is a principled universality-class guess for the critical point but is not used operationally in the recipe; treat as SPECULATIVE context. The same applies to the holographic-RG/MERA/AdS analogy for the renormalisation/macro-operator layer.

Hi @jhegedus ,

Thank you for your interest in our paper and for sharing your related work.

The comment is quite long, and unfortunately Hugging Face does not currently collapse long comments with a “see more...” button. This means the full text takes up most of the visible discussion space.

Would you mind shortening it to around 3–5 lines and adding a link to your work for readers who want more details?

If it stays in the current form, I would unfortunately need to remove it to keep the thread readable for others.

Best,
Marios

Neat paper. It feels like we’ve been stuck between toy math examples and messy, uninterpretable real-world data for a long time, so having a middle-ground testbed to calibrate geometric estimators sounds pretty useful.

I'm curious, how well do you think the curvature and reach measurements from these synthetic, axis-aligned datasets translate to the more chaotic structure of natural image manifolds?

I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/f0ee5781-7e5b-49cc-b5d4-6f2379ecd740
