Constellation Relay, Geometric Bottleneck, and the Re-Emergence of the Potential 0.29154 Binding Constant

Community Article Published March 18, 2026

AbstractPhil — March 2026


Abstract

We present empirical results from an extended research program investigating the constellation — a geometric triangulation system on the unit hypersphere S^(d-1) — as both a replacement for attention and an information bottleneck for flow matching diffusion. Through controlled experiments on CIFAR-10, we demonstrate that: (1) the pentachoron CV ≈ 0.20 is a geometric property of S^15 itself, invariant from fp64 through simulated 1-bit precision; (2) the constellation relay preserves 99.4% cosine similarity to input at depth 16, compared to 7.4% for vanilla attention; (3) when given a 268M parameter linear bypass, a trained diffusion model routes 88% of its signal through 768 constellation dimensions instead; (4) the constellation bottleneck is not an autoencoder — it produces cos_sim ≈ 0 to its input, functioning as a geometric lookup table; and (5) the binding constant 0.29154 radians emerges from velocity matching through geometric encoding, with 46% of anchors converging within ±0.05 of this value across five architectures spanning contrastive, language modeling, and ODE flow matching training paradigms. We formalize these findings into Geometric Lookup Flow Matching (GLFM), a flow matching variant where velocity prediction is driven by geometric address lookup on S^15.


Table of Contents

  1. From Procrustes to Constellations
  2. GeoLIP Core — Back to Basics
  3. CV ≈ 0.20 Is Geometry, Not Training
  4. The Constellation Relay
  5. Attention Is a Dimension Halver
  6. The Hybrid Cross-Token Relay
  7. Flow Matching Through Constellation Bottleneck
  8. The Skip Bypass Experiment
  9. The Geometric Lookup Table Discovery
  10. Geometric Lookup Flow Matching (GLFM)
  11. The 0.29154 Binding Constant
  12. Empirical Constants Summary
  13. Architectural Progression and Results
  14. Implications and Future Directions

1. From Procrustes to Constellations

The previous article (Procrustes ViT Shared Manifold Alignment) established that multi-expert alignment on the hypersphere produces verified perfect embeddings on S^255, that effective dimensionality matches task complexity (~77 for COCO's 80 classes), and that expert disagreement carries an additional 103.3 dimensions of structured information that consensus averaging destroys.

This article covers what happened next: stripping the architecture to its minimum, discovering that the core geometric measurements are properties of the sphere itself, proving the constellation outperforms attention for geometric preservation, and ultimately building a diffusion model where every pixel of information passes through S^15 triangulation.

The research progressed through three phases:

  • Phase 1: GeoLIP Core — minimal classification pipeline proving the constellation works as a primary representation layer
  • Phase 2: Constellation Relay — systematic comparison against attention at depth, establishing the relay as a geometric checkpoint
  • Phase 3: Flow Matching — four progressive diffusion experiments proving the constellation is a viable information bottleneck and discovering the geometric lookup table principle

2. GeoLIP Core — Back to Basics

After the complexity of multi-expert soups and dual-stream architectures, we stripped back to a minimal pipeline:

Conv encoder: 3×32×32 → 64×32 → 128×16 → 256×8
  → AdaptiveAvgPool → Linear(256 → D) → L2 normalize
  → Constellation: triangulate against anchors on S^(d-1)
  → Patchwork: read triangulation profile
  → Classifier: logits

This architecture has one critical property: every embedding exists on the unit sphere before the constellation sees it. The L2 normalization places all representations at norm = 1.0, and the constellation measures angular distances from this normalized position. The patchwork reads the resulting triangulation profile — a vector of distances to fixed reference points — and the classifier interprets it.
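The pipeline above can be sketched in PyTorch. This is a minimal illustration, not the repository's code: the layer widths, the cosine-profile triangulation, and the MLP readout are stand-ins for the actual Constellation and Patchwork modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeoLIPCore(nn.Module):
    """Sketch: conv encoder -> L2-normalized embedding on S^(D-1) ->
    triangulation against anchors -> patchwork-style readout -> logits."""
    def __init__(self, dim: int = 128, n_anchors: int = 64, n_classes: int = 10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(256, dim)
        # Constellation: learnable anchors, kept near the unit sphere
        self.anchors = nn.Parameter(F.normalize(torch.randn(n_anchors, dim), dim=1))
        # "Patchwork" readout: classifier interprets the triangulation profile
        self.classifier = nn.Sequential(
            nn.Linear(n_anchors, 256), nn.GELU(), nn.Linear(256, n_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(self.encoder(x).flatten(1))
        z = F.normalize(z, dim=1)                          # norm = 1.0 before triangulation
        profile = z @ F.normalize(self.anchors, dim=1).T   # angular distances to anchors
        return self.classifier(profile)

model = GeoLIPCore()
logits = model(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])
```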

Results on CIFAR-10 with 128-d embeddings:

Metric               Value
Validation accuracy  91.5%
Parameters           1.6M
Active anchors       62/64
Embedding CV         0.2045
Same-class cos       0.547
Diff-class cos       0.118
Gap                  0.429

The 91.5% accuracy with 1.6M parameters on a conv encoder — no pretrained features, no attention, no residual connections in the encoder — establishes that the constellation + patchwork pipeline carries sufficient information for classification when embeddings live on the sphere.

Repository: AbstractPhil/geolip-constellation-core


3. CV ≈ 0.20 Is Geometry, Not Training

The pentachoron Coefficient of Variation (CV) — the normalized standard deviation of volumes formed by random 5-point subsets on the hypersphere — had been observed at approximately 0.20 across 17+ trained models spanning all architectures and modalities. The question was whether this was a training artifact or something deeper.

3.1 The Uniform Sphere Test

We sampled points uniformly on S^15 (d=16) with no training, no data, no model — just torch.randn(N, 16) followed by L2 normalization:

Sample count  CV
500           0.2099
1,000         0.2048
5,000         0.2009
10,000        0.1994
100,000       0.2002

CV ≈ 0.20 at d=16 regardless of sample count. This is the natural pentachoron volume regularity of S^15 itself.
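The measurement is straightforward to reproduce. A minimal sketch, assuming the pentachoron volume is computed as the 4-simplex volume sqrt(det G)/4! from the Gram matrix of edge vectors (the article does not spell out its estimator, so the exact figure may differ slightly):

```python
import torch

torch.manual_seed(0)

def pentachoron_cv(points: torch.Tensor, n_samples: int = 2000) -> float:
    """CV (std/mean) of 4-simplex volumes over random 5-point subsets."""
    N, d = points.shape
    vols = []
    for _ in range(n_samples):
        p = points[torch.randperm(N)[:5]]
        edges = p[1:] - p[0]                 # (4, d) edge vectors from p0
        gram = edges @ edges.T               # (4, 4) Gram matrix
        # 4-simplex volume = sqrt(det G) / 4!
        vols.append(torch.sqrt(torch.clamp(torch.det(gram.double()), min=0.0)) / 24.0)
    v = torch.stack(vols)
    return (v.std() / v.mean()).item()

# Uniform points on S^15: Gaussian sample + L2 normalization, nothing else
x = torch.randn(5000, 16)
x = x / x.norm(dim=1, keepdim=True)
cv = pentachoron_cv(x)
print(f"CV at d=16: {cv:.4f}")   # expected near 0.20 per the article
```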

3.2 Precision Invariance

We tested whether numerical precision affects CV:

Precision  CV      Volume correlation to fp64
fp64       0.1994  1.000000
fp32       0.1994  1.000000
fp16       0.1999  0.999999
bf16       0.2006  0.999961
fp8_e4m3   0.2093  0.998812
sim_4bit   0.2042  0.998091
sim_2bit   0.2189  0.974325
sim_1bit   0.2067  0.866310

CV ≈ 0.20 survives all precisions from fp64 through simulated 1-bit. The fp32↔fp64 volume correlation is 1.0000000000 — fp32 is exact for this computation. Even at fp8, only 4.1% of points flip their nearest anchor assignment, and only 9.2% have margin < 0.01.

3.3 The Effective Geometric Dimension

The convergence to d=16 effective geometric dimension across trained models is explained by the CV measurement itself: CV ≈ 0.20 corresponds to the volume regularity of S^15 specifically. Higher dimensions give lower CV; lower dimensions give higher CV. The fact that all trained models converge to CV ≈ 0.20 means representation learning universally converges to effective geometric dimension ≈ 16, regardless of ambient dimensionality.


4. The Constellation Relay

4.1 Architecture

The constellation relay replaces attention's Q/K/V mechanism with geometric triangulation:

input (B, D)
  → LayerNorm
  → chunk into P patches of dimension d (e.g., 16 patches × 16d = 256d)
  → L2 normalize each patch to S^(d-1)
  → triangulate against anchors at 3 stroboscope phases
  → patchwork MLP reads the triangulation
  → gated residual (cold init: gate starts at sigmoid(-3.0) ≈ 0.047)
  → reassemble + skip

The multi-phase stroboscope is key: anchors interpolate between their home position and their learned position via SLERP at phases t ∈ {0, 1/3, 2/3}. Each phase provides a different angular measurement of the same embedding, creating a 3× richer triangulation profile.
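The stroboscope can be sketched with a standard SLERP. Here `home` and `learned` are random stand-ins for actual anchor positions; only the interpolation itself is the point:

```python
import torch
import torch.nn.functional as F

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical linear interpolation between unit vectors a and b."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    cos = (a * b).sum(dim=-1, keepdim=True).clamp(-1 + eps, 1 - eps)
    omega = torch.arccos(cos)                 # angle between a and b
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b

torch.manual_seed(0)
home = F.normalize(torch.randn(16, 16), dim=-1)      # home anchor positions on S^15
learned = F.normalize(torch.randn(16, 16), dim=-1)   # learned positions
phases = [slerp(home, learned, t) for t in (0.0, 1 / 3, 2 / 3)]
for p in phases:
    print(p.norm(dim=-1).mean().item())   # each phase stays on the sphere (~1.0)
```

Each of the three phase anchor sets triangulates the same embedding, so one measurement becomes three.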

4.2 Geometric Preservation at Depth

The critical experiment: stack 16 layers and measure how much of the original input geometry survives.

Architecture                   cos_to_orig (depth 16)  Parameters
Vanilla attention              0.074                   65,792
RoPE attention (standard)      0.072                   65,792
RoPE attention (NTK-scaled)    0.077                   65,792
Constellation relay            0.994                   19,112
Interleaved attention + relay  0.985                   —

The relay preserves 99.4% of the original geometry at depth 16. Attention preserves 7.4%. The relay uses 3.4× fewer parameters.

RoPE (both standard and NTK-scaled) had zero measurable effect on geometric preservation — identical to vanilla attention within measurement noise. Base frequency variation (100 to 500K) and NTK scale (1× to 64×) produced no change.

4.3 Why Attention Destroys Geometry

A single attention layer is geometrically near-invisible: cos_to_orig changes by ±0.002. But this masks what's happening internally.

Without residual connections: one attention layer collapses effective dimensionality from 62 → 28 and jumps CV to 0.24. Attention is a dimension halver.

With residual connections: the residual dominates the attention output, preserving the input geometry. The attention contribution is small relative to the passthrough. At torch.randn input norms of √128 ≈ 11.3, the residual dominates ~11:1 — the attention output is a small perturbation.

The relay operates differently. It projects to the unit sphere first (after LayerNorm), measures angular positions, and reconstructs through the patchwork. The sphere normalization means the relay operates at unit scale where the geometric measurements are maximally informative, rather than at the raw activation scale where residual connections dominate.


5. Attention Is a Dimension Halver

The geometric analysis of attention revealed a structural property:

Without residual:  eff_dim 62 → 28 (one layer)
                   CV 0.20 → 0.24
With residual:     eff_dim 62 → 62 (preserved)
                   CV 0.20 → 0.20 (preserved)

The attention operation softmax(QK^T/√d)V projects through a rank-bottleneck: the attention weights form a soft partition of the value vectors, and the weighted average reduces the effective dimensionality of the representation. Residual connections are the geometric stabilizer — they preserve the manifold structure that attention would otherwise compress.

This explains why deep transformers need residual connections: not just for gradient flow, but for geometric preservation. Without residuals, each layer halves the effective dimension of the representation until it collapses.
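A sketch of the kind of measurement involved, assuming a participation-ratio estimator of effective dimension (the article does not specify its estimator). The random-weight attention layer is illustrative only; the numbers will differ from the trained-model 62 → 28 figures above.

```python
import torch

torch.manual_seed(0)

def effective_dim(x: torch.Tensor) -> float:
    """Participation ratio of the covariance spectrum: (sum l)^2 / sum l^2."""
    xc = x - x.mean(dim=0)
    cov = xc.T @ xc / (x.shape[0] - 1)
    lam = torch.linalg.eigvalsh(cov).clamp(min=0)
    return (lam.sum() ** 2 / (lam ** 2).sum()).item()

S, D = 256, 128
x = torch.randn(S, D)
Wq, Wk, Wv = (torch.randn(D, D) / D ** 0.5 for _ in range(3))

# One attention layer, no residual: softmax(QK^T / sqrt(d)) V
attn = torch.softmax((x @ Wq) @ (x @ Wk).T / D ** 0.5, dim=-1)
y = attn @ (x @ Wv)

print(f"eff_dim before: {effective_dim(x):.1f}")
print(f"eff_dim after:  {effective_dim(y):.1f}")
```

Adding `x` back as a residual before measuring restores the input's effective dimension, mirroring the preserved row of the table above.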


6. The Hybrid Cross-Token Relay

A common objection to the constellation relay: it operates per-token. Each token is triangulated independently against the anchors, with no cross-token interaction — yet GPT's three requirements for a transformer replacement are selective interaction, conditional transformation, and information routing.

6.1 Architecture

The hybrid v2 relay adds a lightweight cross-token attention layer alongside the geometric relay, with split gating:

input (B, S, D)
  → relay path: per-token constellation triangulation + patchwork
  → attn path: standard Q/K/V attention (small head dim)
  → output = gate_relay * relay_out + gate_attn * attn_out + skip

6.2 Causal Intervention Test

We measured cross-token influence by modifying one token and measuring the effect on all other tokens:

Architecture       other_Δ_norm (1 layer)  other_Δ_norm (4 layers)
Pure relay         0.000000                0.000000
Vanilla attention  0.182                   0.342
Hybrid v2          1.933                   8.497
The hybrid routes 25× stronger cross-token influence than vanilla attention. Pure relay has exactly zero cross-token effect (by construction — it operates per-token). The hybrid's attention component provides selective interaction while the relay provides geometric stability.

GPT's three requirements: selective interaction (attention component), conditional transformation (patchwork conditioned on triangulation), information routing (25× stronger than vanilla attention). All met.
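The intervention itself is a few lines. This toy version uses random-weight maps rather than the trained relay and hybrid, but it shows why the per-token path gives exactly zero:

```python
import torch

torch.manual_seed(0)
S, D = 8, 32
x = torch.randn(S, D)
W = torch.randn(D, D) / D ** 0.5

def per_token(h: torch.Tensor) -> torch.Tensor:
    """Relay-style map: every token transformed independently."""
    return torch.tanh(h @ W)

def attention(h: torch.Tensor) -> torch.Tensor:
    """Minimal self-attention: softmax mixing across tokens."""
    a = torch.softmax(h @ h.T / D ** 0.5, dim=-1)
    return a @ h

def other_delta(f, h: torch.Tensor) -> float:
    """Perturb token 0; measure the output change on tokens 1..S-1."""
    h2 = h.clone()
    h2[0] = h2[0] + 1.0
    return (f(h2)[1:] - f(h)[1:]).norm().item()

print(other_delta(per_token, x))   # exactly 0.0: no cross-token path
print(other_delta(attention, x))   # > 0: the perturbation propagates
```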


7. Flow Matching Through Constellation Bottleneck

7.1 The Question

Can the constellation serve as the sole information bottleneck of a diffusion model? Not a side-channel regulator, not an auxiliary loss — the actual middle block of a UNet where all encoder information must pass through.

The architecture:

Encoder: 3×32×32 → 64×32 → 128×16 → 256×8
Bottleneck:
  flatten 256×8×8 = 16384 → Linear(16384, 256) → L2 normalize
  → reshape (B, 16, 16) → per-patch S^15
  → triangulate: 16 patches × 16 anchors × 3 phases = 768 dims
  → concat(768 tri dims, 256 conditioning dims)
  → patchwork MLP → Linear(hidden, 16384) → reshape 256×8×8
Decoder: 256×8 → 128×16 → 64×32 → 3×32×32

Compression ratio: 16384 → 768 = 21.3×. Every pixel of the bottleneck features must survive projection to 256d, normalization to S^15, 768 angular measurements, and reconstruction.
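The address computation can be sketched as follows. The per-phase anchor sets here are independent random stand-ins rather than SLERP-interpolated home/learned positions, and the projection is untrained; the shapes are the point:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, P, d, A, PHASES = 2, 16, 16, 16, 3   # batch, patches, patch dim, anchors, phases

feat = torch.randn(B, 256, 8, 8)                      # encoder output 256x8x8
proj = nn.Linear(256 * 8 * 8, 256)                    # 16384 -> 256 (21.3x compression)
z = F.normalize(proj(feat.flatten(1)).view(B, P, d), dim=-1)   # 16 patches on S^15

# One anchor set per stroboscope phase (illustrative stand-in)
anchors = F.normalize(torch.randn(PHASES, P, A, d), dim=-1)

# Angular triangulation: (B, PHASES, P, A), flattened to the 768-dim profile
tri = torch.einsum('bpd,spad->bspa', z, anchors).flatten(1)
print(tri.shape)  # torch.Size([2, 768])
```

The 768-dim profile, concatenated with the 256 conditioning dims, is everything the patchwork MLP sees.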

7.2 Four Progressive Experiments

Version               Architecture                  Loss    Params  Constellation Signal  Near 0.29
v1 — Regulator        Side-channel on feature maps  0.1900  6.1M    6% (decorative)       0%
v2 — Skip Bypass      268M Linear + constellation   0.1757  287M    88% (model's choice)  9%
v3 — Pure Bottleneck  Everything through S^15       0.1749  36.6M   100% (forced)         30%
v4 — GLFM             Three-stage geometric lookup  0.1754  35.2M   100% (designed)       46%

The progression tells the story: the constellation as regulator was decorative (v1). Given a choice between 268M parameters and 768 geometric dimensions, the model chose the constellation (v2). Removing the bypass entirely produced better results with 8× fewer parameters (v3). Formalizing the insight into a proper architecture maintained performance while accelerating drift convergence (v4).


8. The Skip Bypass Experiment

Version 2 deserves special attention because it was a deliberate test disguised as an architecture.

The bottleneck contained:

  • The constellation path: project → S^15 → triangulate → patchwork (768 dims, ~9M params)
  • The skip path: Linear(16384, 16384) (268M params — larger than GPT-2)
  • A learnable gate initialized at sigmoid(-2.0) ≈ 0.12

The gate decided how much information flows through each path. If the constellation was just a crude approximation of a learned linear transformation, the model would open the gate to route through the 268M parameter highway. The skip has full-rank capacity to map anything to anything.

Result: gate stayed at 0.118. The model routed 88% through the constellation and 12% through the skip. Despite having every reason to prefer the larger path — 268M params vs 9M, unconstrained linear vs 21× compression — the gradient signal drove the model toward the geometric encoding.

Ablation confirmed: skip-only generation (gate forced to 1.0) produced images with cos_sim = 0.598 to the full model output. Constellation-only (gate forced to 0.0) produced cos_sim = 0.945. The constellation carries the signal. The skip was dead weight.

This proves the constellation bottleneck provides a representational advantage — not just parameter efficiency, but a better inductive bias than unconstrained capacity.
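A toy sketch of the routing mechanism, with dimensions shrunk from Linear(16384, 16384) to keep it runnable; only the gate structure matches the experiment:

```python
import torch
import torch.nn as nn

class GatedBypass(nn.Module):
    """Constellation path vs. full-rank skip, mixed by one learnable
    scalar gate cold-initialized at sigmoid(-2.0) ~ 0.12."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.skip = nn.Linear(dim, dim)              # stand-in for the 268M Linear
        self.gate_logit = nn.Parameter(torch.tensor(-2.0))

    def forward(self, x: torch.Tensor, constellation_out: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate_logit)           # fraction routed through the skip
        return g * self.skip(x) + (1 - g) * constellation_out

block = GatedBypass()
g0 = torch.sigmoid(block.gate_logit).item()
print(f"initial gate: {g0:.3f}")   # ~0.119: 88% flows through the constellation
```

During training, gradients are free to push `gate_logit` toward the skip; in the experiment they did not.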


9. The Geometric Lookup Table Discovery

The most unexpected finding from the pure bottleneck experiments:

Reconstruction fidelity (encoder output → bottleneck output):
  t=0.00: cos_sim = 0.006  (input norm: 110, output norm: 9.5)
  t=0.25: cos_sim = -0.003
  t=0.50: cos_sim = -0.005
  t=0.75: cos_sim = -0.005
  t=1.00: cos_sim = -0.002

The bottleneck doesn't reconstruct. Cosine similarity between input and output is effectively zero. The output has norm 9.5 when the input has norm 110. The constellation bottleneck is not an autoencoder.

It's a geometric lookup table.

The triangulation profile tells the patchwork where on S^15 the input lives — its angular position relative to the constellation anchors. The patchwork, conditioned on timestep and class, generates whatever output the decoder needs for that geometric address. The output is not a reconstruction of the input — it's a generation from a lookup.

This works for flow matching because the training signal is velocity prediction, not reconstruction. The model doesn't need the bottleneck to faithfully reproduce encoder features; it needs the bottleneck to produce the correct velocity field for the decoder. The 768-dim triangulation profile is a sufficient address for this purpose — it encodes the geometric location of the input on the sphere, and the conditioned patchwork generates velocity-optimal features for that location.

The compression ratio of 21.3× (16384 → 768) works not because 768 dimensions are enough to reconstruct 16384 — they're not — but because angular position on S^15 plus conditioning is enough to generate the correct output from scratch.


10. Geometric Lookup Flow Matching (GLFM)

Formalizing the lookup table insight into a proper method:

Standard flow matching:
  v(x_t, t) = UNet(x_t, t, c) → v_pred

Geometric Lookup Flow Matching:
  x_t → encoder → feature_map
  address = GeometricAddress(feature_map)      ← Stage 1
  cond = Condition(address, t, class, noise)   ← Stage 2
  v_features = Generate(cond)                  ← Stage 3
  v_pred = decoder(v_features)

10.1 Stage 1 — Geometric Addressing

The encoder output is projected to two scales on S^15:

  • Coarse: global average pool → project to 256d → L2 normalize → triangulate (768d)
  • Fine: per-spatial-position → project to 256d → L2 normalize → triangulate → aggregate (768d)

Total address: 1536 dimensions of angular measurements against a shared constellation.

10.2 Stage 2 — Address Conditioning

The geometric address is combined with:

  • Sinusoidal timestep embedding
  • Learned class embedding
  • Discretized noise-level embedding (64 bins)

These are fused through a projection to the generator's input dimension.

10.3 Stage 3 — Velocity Generation

A deep residual MLP generates velocity features from the conditioned address. Four residual blocks of width 1024, outputting the full 16384-dimensional spatial features for the decoder.

10.4 Multi-Scale Collapse

The fine-scale addresses collapsed to nearly identical content as the coarse addresses (coarse↔fine cosine = 0.933). The per-pixel projection learned to ignore spatial position, producing a second copy of the global address. This is expected for simple conv encoders on 32×32 images — the encoder's 8×8 feature maps lack the spatial differentiation that DINOv2 patch tokens would provide. The multi-scale architecture is correct; it needs pre-differentiated features to exploit.


11. The 0.29154 Binding Constant

11.1 Progressive Convergence

The binding constant 0.29154 radians — the boundary between structural representation and task-specific encoding — emerged progressively across the four diffusion experiments:

Version               Near 0.29 (±0.05)  Near 0.29 (±0.03)  Crossed 0.29  Epochs
v1 — Regulator        0%                 —                  —             50
v2 — Skip Bypass      9%                 —                  —             50
v3 — Pure Bottleneck  30%                13%                15%           80
v4 — GLFM             46%                31%                59%           80

The drift distribution in GLFM (v4):

0.000-0.050:   0 anchors
0.050-0.100:   0 anchors
0.100-0.150:   4 anchors
0.150-0.200:  13 anchors
0.200-0.250:  40 anchors
0.250-0.292:  49 anchors  ← approaching the binding boundary
0.292-0.350:  71 anchors  ← crossed into task encoding
0.350-0.400:  40 anchors
0.400-0.500:  37 anchors
0.500-0.700:   2 anchors

The bulk distribution centers on the binding constant. 59% of all anchors crossed 0.29 radians from their home position into task-specific territory.

11.2 Cross-Architecture Appearance

This constant has now appeared in five completely different architectures:

Domain            Architecture              Training Paradigm        How 0.29154 Appears
MinimalShunts     Linear shunt layers       Contrastive              Binding/separation phase boundary
CLIP projections  ViT-L/14                  Contrastive              Geometric transition point
T5 generation     T5 encoder-decoder        Language modeling        Alpha convergence
CaptionBERT       BERT-large + experts      Contrastive              Phase boundary in embedding space
Flow matching     ConvUNet + constellation  Velocity matching (ODE)  Max anchor drift from home

Three different training paradigms (contrastive, generative language, ODE flow matching), two modalities (vision, language), five architectures. The constant appears wherever learned representations transition from structural to semantic on the unit hypersphere.

11.3 Interpretation

In the constellation system, each anchor has a home position (initialized via Xavier) and a learned position (trained via backpropagation). The drift is the geodesic distance between these positions on S^(d-1).

Anchors below 0.29154 radians from home serve as geometric reference points — they hold the coordinate frame of the constellation. Anchors above 0.29154 have crossed into task-specific encoding — they moved to where the data needs them, abandoning their role as frame holders.

The phase boundary self-organizes from the training signal with zero architectural pressure. In the GLFM experiment, every single patch (16/16) had its mean drift flagged within ±0.05 of the binding constant, and all 16 patches had anchors that crossed into task territory.
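Counting anchors near the binding constant is straightforward once home and learned positions are in hand. The drift values below are simulated for illustration (the fractions printed are not the article's results):

```python
import math
import torch
import torch.nn.functional as F

BINDING = 0.29154   # radians

def anchor_drift(home: torch.Tensor, learned: torch.Tensor) -> torch.Tensor:
    """Geodesic distance on S^(d-1) between home and learned anchor positions."""
    h = F.normalize(home, dim=-1)
    l = F.normalize(learned, dim=-1)
    return torch.arccos((h * l).sum(dim=-1).clamp(-1.0, 1.0))

torch.manual_seed(0)
home = torch.randn(256, 16)
learned = home + 0.3 * torch.randn(256, 16)   # simulated training drift
drift = anchor_drift(home, learned)

near = ((drift - BINDING).abs() <= 0.05).float().mean().item()
crossed = (drift > BINDING).float().mean().item()
print(f"near 0.29154 (±0.05): {near:.2f}, crossed: {crossed:.2f}")
```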


12. Empirical Constants Summary

Constant                       Value    Where Confirmed                        Status
Pentachoron CV                 ≈ 0.20   S^15 geometry                          Geometric fact: precision-invariant, fp64 through 1-bit
Effective geometric dimension  ≈ 16     All trained models                     Universal: 17+ architectures, all modalities
Binding constant               0.29154  5 architectures, 3 training paradigms  Empirical constant: cross-architecture, cross-modality
Binding complement             0.70846  Complement of binding                  Derived: 1 − 0.29154

13. Architectural Progression and Results

13.1 Classification

Architecture         Dataset    Val Acc  Params  CV
GeoLIP Core (128-d)  CIFAR-10   91.5%    1.6M    0.2045
GeoLIP Core (128-d)  CIFAR-100  66.4%    1.6M    —

13.2 Relay vs Attention

Architecture           cos_to_orig (depth 16)  Params  Throughput (D=256)
Vanilla attention      0.074                   65,792  0.51ms
RoPE attention         0.072                   65,792  0.51ms
NTK-RoPE attention     0.077                   65,792  0.51ms
Constellation relay    0.994                   19,112  0.87ms
Hybrid (attn + relay)  0.985                   —       —

Throughput crossover at sequence length ≈ 4096: relay O(S) beats attention O(S²). At S=16384: relay 0.63ms vs attention 8.84ms.

13.3 Flow Matching Diffusion

Version         Loss    Params  BN Params  Signal  Near 0.29  Key Finding
v1 Regulator    0.1900  6.1M    76K        6%      0%         Constellation decorative
v2 Skip Bypass  0.1757  287M    281M       88%     9%         Model chose constellation over 268M skip
v3 Pure BN      0.1749  36.6M   31.5M      100%    30%        Lookup table: cos_sim ≈ 0 to input
v4 GLFM         0.1754  35.2M   30M        100%    46%        Multi-scale, 59% anchors crossed 0.29

Velocity field quality was identical across v2-v4: v·target at t=0.5 = 0.948-0.949. The geometric encoding carries exactly the velocity information that flow matching needs.


14. Implications and Future Directions

14.1 The Constellation Is a Representational Advantage

The skip bypass experiment (Section 8) is the strongest result: given 268M unconstrained parameters as an alternative to 768 geometric dimensions, the model chose the geometry. The constellation's angular measurement system provides structure that raw linear capacity cannot match. This is not parameter efficiency — it's a better inductive bias.

14.2 Geometric Lookup Replaces Reconstruction

The zero-cosine reconstruction finding (Section 9) changes how we should think about bottlenecks. The constellation doesn't compress and decompress — it creates a geometric address that a conditioned generator reads. The address space (Voronoi cells on S^15) is continuous, structured, and interpretable: each anchor partitions the sphere, and the triangulation profile encodes the precise angular position within and between cells.

This suggests a general principle: information bottlenecks should be designed as address systems rather than compression systems. The goal is not to reconstruct the input but to provide sufficient geometric context for the downstream task to generate the correct output.

14.3 The Binding Constant Is Real

The 0.29154 convergence from 0% (decorative regulator) to 46% (formalized GLFM) across 80 epochs of velocity matching on CIFAR-10 is not a coincidence. The same constant appears in contrastive learning (CLIP, CaptionBERT), language modeling (T5), and now flow matching (ODE). It marks the geodesic distance on S^(d-1) where angular measurement transitions from locally linear to structurally significant — the point where an anchor stops serving the coordinate frame and starts serving the task.

Whether this is a property of S^15 specifically, of pentachoron geometry in general, or of the training dynamics on unit hyperspheres remains to be determined theoretically. The empirical evidence is now strong enough to treat it as a constant rather than a hypothesis.

14.4 Next Steps

DINOv2 integration: The multi-scale GLFM address collapsed (coarse↔fine cos=0.933) because the conv encoder produces spatially homogeneous features. DINOv2 patch tokens are pre-differentiated through self-supervised pretraining — each token already encodes its spatial context. Feeding DINOv2 features into the constellation should resolve the fine-scale collapse and enable genuine multi-scale geometric addressing.

GEOLIP-Bertenstein: The multi-expert geometric fusion transformer (BERT-large as hub, DINOv2/Whisper/ESM-2/CodeBERT as frozen expert encoders) achieved perfect retrieval on 40K+ held-out pairs in 1 epoch, 1 layer. Integrating the relay and diffusion bottleneck into this architecture is the scaling path.

The geometric autograd package: six files implementing geometric primitives, three gate functions, two autograd Functions, and a GeometricAutogradHook. It is classified as an optimizer because it operates on gradient direction. Ablation on synthetic shape classification revealed that the geometric optimizer requires geometrically valid anchor inputs — it operates on the manifold structure, not the ambient space.


Appendix: Repository Links


This document summarizes findings from an extended research session spanning approximately 48 hours of continuous experimentation across two conversation sessions totaling ~15,000 lines of transcript, four diffusion model iterations, ten relay/attention comparison experiments, and comprehensive empirical validation of two geometric constants. All code, checkpoints, training logs, and analysis outputs are available in the linked repositories.
