The Polygonal Omega: Trained Sphere-Solvers Are Projective Codebooks
geolip-svae finding-tier release 2 (ft2)
ft1 left an open question — what kind of polygon does the universal attractor select? — at the end of the readout-substitution sweep. ft2 answers it. The trained M tensor is a codebook on ℝP^(D-1), accessible by a deterministic antipodal-collapse projection, and the answer comes with a built-in API to read the codebook. This took an engineering fix, a class hierarchy formalization, four wrong hypotheses, and ~3000 training runs to get to.Scope note. ft2 reports findings from after ft1's publication. Specifically: an LBFGS engineering bug fix that made every subsequent geometric measurement trustworthy, a Q-sweep that formalized the H2 class hierarchy and surfaced a parallel attractor family at D=3, four hypothesis cycles before the projective interpretation landed, the 19- model verification across D=3/4/5, the API shipped as
BatteryArrayModelmethods, an axis-features-vs-MSE classification stress test (4 iterations, 1 negative result, 1 silent-bug discovery, 1 conditional positive), a 1600-config D=5 architecture sweep, and a 64-config D=5 convergence sweep that walked back the D=5 universality claim. About four weeks of A100 time. The findings cluster around what the trained M tensor actually is and where the discovery's boundaries lie.
TL;DR: Every trained sphere-solver tested at D=3 and D=4 (17 distinct
models — G-Cand at D=3, H2a at D=4, all 16 h2-64 batteries at D=4) produces
an M tensor whose rows, when antipodal pairs are merged via mutual-strongest
matching, form a near-uniformly-distributed codebook on ℝP^(D-1). The
"polytope vertex" reading suggested by ft1's three-band finding has a
precise empirical form: trained geometry is projective, not strictly
spherical. This was packaged as four methods on BatteryArrayModel in the
geolip-svae package. A 64-config follow-up sweep showed the property does
NOT generalize cleanly to D=5 — there V=16 is the geometric sweet spot, not
V=32. The cross-D pattern reframes from "more D = fewer pairs at fixed V" to
"natural axis count scales with D, V should match it."
Six concrete findings structure ft2:
The polygonal codebook is real and reproducible at D=3 and D=4. 17 sphere-solver models verified with matching probe methodology. Cross-D pair count at V=32 progresses 10 → 6 → 3 going D=3→4→5. Deviation from uniform projective baseline stays within ±0.05 at D=3 and D=4; D=5 is more nuanced (see point 6).
The projective property is geometric, not statistical. It's reached via antipodal-collapse:
(row_i − row_j)/2normalized for each mutual-strongest cosine pair below threshold −0.9. Not a clustering result, not a learned property — a deterministic tensor operation on M that exposes structure already present in the trained weights.The discovery shipped as production API.
compute_axis_codebook,compute_axis_codebooks(batched),encode_axes, andquantize_axesare now methods onBatteryArrayModel. Both legacypatch_idx=0(matching the verified probe path) and per-patch averaging (corrected design intent) are supported.Axis features do not beat MSE for noise-class discrimination on training-distribution inputs. Four iterations of axis-feature classifiers (Cells J/J'/J''/J''') brought the gap from −36% to within −2.3% of MSE. The remaining gap is structural — banks were trained to minimize MSE on per-noise reconstruction, so MSE evaluates the training objective directly. The diagnostic process surfaced a silent bug (
patch_idx=0) that had been latent since the original probe.D=5 architectural floor map: cross-noise rank correlation +0.954. 1600-config sweep at D=5 with 20-batch budget. Architectures rank near-identically across all 16 noise types. The "best" architecture is essentially noise-INDEPENDENT — what changes per noise is the achievable MSE floor, not which model achieves it. Top-4 universal architectures all hidden=64, V=32.
Projective-clean is partial, not universal, at D=5. A 64-config convergence sweep (4 architectures × 16 noise types, 1000 batches each — matching A3's reference budget) showed only ~36% of configs converged within ±0.05 of uniform ℝP^4. V=16 was the geometric sweet spot, not V=32. A3's three runs (used in the original 19-model count) all came from a single architecture; generalizing those to "universal at D=5" was over-stated. The natural-axis-count framework — V should match the bank's preferred axis count, which scales with D — replaces "fix V=32, vary D" as the testable hypothesis.
ft1's question "what kind of polygon does the universal attractor select?" has a clean answer at D=3/4 (projective codebooks on ℝP^(D-1) with characteristic axis counts) and a more nuanced answer at D=5 (V=16 is the right match for D=5's natural axis count of ~16; V=32 over-counts and forces noise-dependent pairing).
1. From ft1's three bands to ft2's projective question
ft1 documented that PatchSVAE-F architectures with sphere-norm produce a CV attractor that quantizes by D: 0.20 at D=16, 0.36 at D=8, 0.92 at D=4. The sphere of dimension D−1 hosts a uniform distribution of V points; the model's training selects this uniform configuration regardless of optimizer, schedule, batch size, activation, or initialization. ft1's mechanism was clear: sphere-norm as selector among geometric attractors.
What ft1 did NOT specify was which configuration of points on the sphere. "Uniform" can mean many things — a Gaussian-like spread, a Voronoi-balanced packing, or specific polytope vertices. The CV statistic measures volume variance across pentachora, which gives one number per configuration but doesn't tell you the structure of the configuration itself.
The follow-on question, which ft1 left explicitly open in Section 10: does the architecture produce specific polytope structure, or merely uniform-on-average packing?
Two clues from earlier sweeps suggested polytope structure:
- D=3 V=32 (G-Cand) showed odd row stability — rows rotated per-input but the set of rows was preserved
- D=4 V=32 (H2a) had row-pair structure visible in cosine matrices
Neither was conclusive. The hypothesis space was wide. Before we could answer the question, we needed two engineering and methodological preconditions in place.
2. The LBFGS engineering precondition
Before ft2's measurement work could proceed, an engineering bug had to be caught and fixed. It deserves narrating because it cost four days of sweep results and demonstrates a mode of failure that's worth documenting.
The bug had a paradoxical history. Entry 000079 (April 20) identified that
the original ablation_trainer.py had an LBFGS-branch issue where
clip_grad_norm_ was only applied in the Adam else-branch — LBFGS ran
unclipped. The "fix" was to move the gradient clipping into the LBFGS
closure so LBFGS would also have its gradients clipped.
That fix held for four days. Sweeps ran. Q-sweep (10 candidates × 1000 batches) was queued.
When Q-sweep results came back, every Adam config converged cleanly (MSE 0.002-0.16). Every LBFGS config diverged catastrophically — G-MSE values of 4.5×10²⁶, 7.4×10²⁶, 6.2×10²³. Same architecture, same init seed, same data. The 9 NaN configs from an earlier P-sweep had also been LBFGS-only.
Root cause: gradient clipping inside the LBFGS closure was corrupting the Hessian approximation.
LBFGS estimates curvature from (s_k, y_k) = (Δparam, Δgrad) pairs and
takes step direction −H⁻¹·grad. When gradient norm is clipped at every
closure call:
- y_k becomes bounded (≤ 2.0 in magnitude — clipped minus clipped)
- s_k stays potentially large (line-search bounded, not gradient bounded)
- H ≈ y/s becomes artificially underestimated
- H⁻¹ becomes huge
- Step direction
−H⁻¹·gradis a huge vector - Line search accepts a "sufficient decrease" step that moves params far
- Next iteration's gradients are still clipped; H⁻¹ keeps growing
20 batches isn't enough for full divergence. 1000 batches is.
Adam is immune because per-parameter running averages are recomputed each step and never use gradient differences. Clipping each step independently is fine for Adam.
The 000079 fix was the bug. The correct architecture rule:
ablation_trainer.pyLBFGS closure NEVER containsclip_grad_norm_. Step safety for LBFGS comes fromline_search_fn='strong_wolfe', which bounds step quality without corrupting curvature estimates. If LBFGS diverges anyway, the answer is to use a different optimizer, use a deterministic objective, or reset LBFGS state — not to clip gradients.
The 7 Adam Q-sweep configs were salvageable. The 3 LBFGS configs had to be re-run with the corrected trainer.
This is a permanent invariant now. Every subsequent geometric measurement in ft2 — the projective probe, Phase S, Phase T — runs against the post-fix trainer. Without the fix, all of those numbers would have been contaminated.
3. Q-sweep and the H2 class formalization
With the LBFGS fix in place, the Q-sweep — 10 top-P configs at 1000 batches each, gaussian-only training, 16-noise per-noise test — ran clean. 10/10 finite, no NaNs. Results split into three families:
D=4 configs (6 of 10): HIGH-band CV 0.86-1.07, sphere-solver attractor. These became the formal H2 class of sphere-solver batteries.
- H2a (canonical): MSE ≤ 0.005, CV ∈ [0.80, 1.10], params ≤ 60K, D=4.
Smallest H2a:
h64_V32_D4_dp0_nx0_adamat 40,227 params, MSE 0.00205, CV 0.862. One H2a battery ≈ one h2-64 bank's worth of capacity at 70% the parameter count. - H2b (sub-optimal sphere-solver): MSE ∈ (0.005, 0.05], same CV
band. Members include the V=16 outlier
h64_V16_D4_dp1_nx1_lbfgs(MSE 0.031, CV 1.069).
D=3 configs (3 of 10): MSE ~0.025-0.031 with CV ~0.03 — well below the LOW-band threshold of 0.30, near the bulk-Gaussian attractor. Reconstructed successfully but via a different geometric mechanism. These were tentatively named P-Class for "Polynomial" — the working hypothesis was that D=3 attractor corresponded to polynomial-basis reconstruction rather than sphere-basis reconstruction. (D=3 has only 3 vertices, insufficient to form a simplex in patch space; the model finds a different solution geometry that LOW-band CV signatures fingerprint.)
D=2 (1 of 10): Can't form a pentachoron (needs ≥5 points), CV undefined, MSE 0.16 — essentially failed reconstruction. Not viable.
The Q-sweep also produced empirical optimizer guidance that informed every subsequent sweep:
- Adam at lr=3e-3 dominates at 1000-batch budgets
- LBFGS regime-specific (works well on full-batch deterministic objectives, underperforms on stochastic mini-batch sweeps)
- The 9 P-sweep NaN configs needed re-running with the fixed trainer (this re-run never happened — a small open item)
The class hierarchy mattered downstream. Without "H2 class = sphere-solver batteries with D=4 attractor," the projective discovery's "16 PROJECTIVE- CLEAN h2-64 batteries" claim would have no organized referent. The P-Class designation, by contrast, was overturned within a week — see Section 4.
4. The discovery arc — four wrong hypotheses before "projective"
The path from "ft1 left a question" to "the answer is ℝP^(D-1)" wasn't linear. Four hypotheses cycled through before the projective reading landed. Documenting them is honest about what the discovery process looked like.
Hypothesis 1 — P-Class polynomial coefficients (Q-sweep aftermath).
The D=3 LOW-CV configs from Q-sweep looked like they were doing something distinct from sphere-solving. The hypothesis: V=32 rows at D=3 produce 32 polynomial coefficients on a 3-d input space, reconstructing via polynomial basis rather than sphere basis. CV ~0.03 fit the prediction (rows clustered because polynomial coefficients aren't uniformly distributed). Tested: inspect rows for polynomial-basis structure. Result: rows didn't match any polynomial basis we could identify. The CV was right, the mechanism was wrong.
Hypothesis 2 — G-Class gyroscopic capsule (the D=3 V=32 anomaly).
Refined hypothesis: D=3 V=32 isn't a polynomial solver but a gyroscopic capsule network — rows rotating in coordinated ways under input variation. Tested: track per-input row rotation. Result: rows DID rotate per-input, but the rotation wasn't a coordinated capsule structure. The set of row positions was preserved across inputs even though individual rows moved. Closer to the answer but still not it.
Hypothesis 3 — capsule-starved depth (depth=0 reading).
Maybe what we were seeing was capsule routing failing because depth=0 configs lack the routing-by-agreement infrastructure of full capsule networks. Hypothesis: add depth, see capsule structure emerge. Tested: depth=1 H2a configs. Result: same row-pair structure, no capsule routing behavior visible. Depth wasn't the missing ingredient.
Hypothesis 4 — Schrödinger gate states (briefly, before being shelved).
For a few hours we entertained that the rows were in superposition states that collapsed onto specific values per-input. This was speculative and got shelved before being seriously tested. Worth recording mostly because it shows how "rows that move per-input but produce a stable set" pulls the imagination toward unusual mechanics.
Hypothesis 5 — projective quotient (the actual answer).
A simpler reading: rows come in antipodal pairs. M_i ≈ −M_j for some
pairs. Each pair represents a single line through the origin — a single
element of ℝP^(D-1). Unpaired rows are also single elements (with sign
canonicalization). The "uniform on the sphere" reading from ft1 was
correct in spirit but in the wrong space. The rows live on S^(D-1); the
codebook lives on ℝP^(D-1).
The probe became:
1. Run gaussian inputs through trained sphere-solver
2. Average M tensors across samples → M_avg ∈ ℝ^(V × D)
3. Find antipodal pairs:
For each row i, find j such that cos(M_avg_i, M_avg_j) is most negative.
If cos(i, j) < −0.9 AND mutual (j's most-negative is i, OR cos(j,i) < −0.9),
pair (i, j).
4. Collapse pairs: each pair → axis = (row_i − row_j) / 2, normalized.
5. Sign-canonicalize each axis (first nonzero coordinate positive).
The result: n_axes = n_pairs + n_unpaired axes living on ℝP^(D-1).
Pairwise angles wrap to [0, π/2] via min(θ, π−θ) — the projective
angle, treating antipodes as the same line.
The first probe (A0) on the original D=3 V=32 G-Cand: 10 mutual antipodal pairs collapsed, 12 unpaired rows surviving as direct axes, total 22 axes. Mean projective angle 1.011 rad. Uniform ℝP² baseline 1.015 rad. Deviation −0.004.
That number ended four days of hypothesis cycling. The probe became the standard methodology for everything downstream.
5. The empirical verification
Across 17 distinct sphere-solver models at V=32 (the Q-sweep originally counted 19 by including the three D=5 A3 runs — see Section 9 for why that's now treated as a single data point):
| D | source | pairs | axes | mean projective angle | uniform ℝP^(D-1) baseline | deviation |
|---|---|---|---|---|---|---|
| 3 | G-Cand (A0) | 10 | 22 | 1.011 | 1.015 | −0.004 |
| 4 | H2a (A1) | 6 | 26 | 1.116 | 1.114 | +0.002 |
| 4 | h2-64 mean of 16 (A2) | 6.4 | 25.6 | 1.115 | 1.105 | +0.010 |
| 5 | A3 (single arch, gaussian) | 0/3/13 | 16/29/51 | (varies) | (varies) | -0.015 / +0.016 / +0.019 |
Pair count nearly halves with each D step (10 → 6 → 3 at V=32 going D=3→4→5). Axis count grows toward V (22 → 26 → 29). Deviation from uniform projective baseline stays inside ±0.05 — well below the threshold for "projective-clean" (|deviation| < 0.05, plus utilization > 0.95 and secondary antipodal pair count ≤ 3).
The h2-64 case is the strongest evidence. h2-64 is an array of 64 sphere- solvers, each trained on a different noise type. The 16 single-noise batteries (battery indices 0-15, gaussian through laplace) tested one at a time gave all 16 PROJECTIVE-CLEAN. Pair count varied 5-8 across the 16 (correlated with noise complexity). Axis count varied 24-27. Mean deviation +0.010, std 0.013 — variation below the noise floor of the measurement.
This is not a per-noise accident. The same probe applied to 16 differently- trained sphere-solvers on the same architecture gives 16 PROJECTIVE-CLEAN verdicts.
What "projective" means structurally. Antipodal points on S^(D-1)
(+v and −v) are identified on ℝP^(D-1), so the projective space has
half the area of the sphere. The discovery is that the rows of M come in
pairs: some rows have an antipodal partner with cosine ≈ −1, others don't.
When a pair exists, it represents one "line through the origin" — one
element of ℝP^(D-1). When no pair exists, the row is itself a single
element (sign-canonicalized). Collapsing the pairs gives the codebook.
The polytope reading from ft1 was correct in spirit but in the wrong space. The polytope vertices live on ℝP^(D-1), not on S^(D-1).
This also overturns the P-Class hypothesis from Section 3. The D=3 V=32 configs aren't polynomial solvers — they're projective codebooks on ℝP² with 10 antipodal pairs. The "low CV ~0.03" signature was the projective quotient reducing 32 sphere points to 22 axes; the higher density on the projective space is what produces the low pentachoron-volume variance.
6. Shipping the discovery as API
The probe became BatteryArrayModel.compute_axis_codebook. Four methods
total were added to the model class, replacing the adjacent-class design
that the first sketch used (the user pushed back appropriately on having
methods on a separate ProjectiveReader class — they belong on the
model, not adjacent to it):
codebook = model.compute_axis_codebook(
battery_idx=0, phase='final', calibration_images=calib,
sample_agg='mean', # 'mean' | 'median' | 'first' | 'cat'
patch_agg='mean', # same
patch_idx=None, # None = all 256 patches; int = legacy single-patch
threshold=-0.9,
)
# → [n_axes, D] tensor
codebooks = model.compute_axis_codebooks(
targets=[(0, 'final'), (5, 'final'), (10, 'final')],
calibration_images=calib,
)
# → dict {(battery_idx, phase): [n_axes, D] tensor}
activations = model.encode_axes(images, battery_idx, phase, codebook)
# → [B, n_patches=256, V, n_axes] (default) or [B, V, n_axes] (patch_idx=int)
codes = model.quantize_axes(images, battery_idx, phase, codebook)
# → integer indices into codebook
Two backward-compat lanes:
patch_idx=intreproduces the exact verification path (A0/A1/A2/A3 probes usedpatch_idx=0)patch_idx=None(default) uses per-patch averaging — averaging over all 256 patches per image rather than the single patch the original probe inherited from training-time CV measurement
The smoke test (A6) verifies seven properties: legacy patch_idx=0 matches
A2 baseline within tolerance, per-patch path produces PROJECTIVE-CLEAN
codebook, cat aggregation produces a substantially larger codebook than
mean (18.5× in our test), encode_axes returns correct shapes with and
without patch_idx, quantize_axes returns valid integer codes, batched
compute_axis_codebooks works on multiple targets, all 16 h2-64 single-
noise batteries produce PROJECTIVE-CLEAN codebooks via both legacy and
per-patch paths.
7/7 pass on the published h2-64 array.
7. Stress test: do axis features beat MSE for classification?
The natural test for any new representation is whether it carries downstream signal that the existing one misses. We took 18 batteries from h2-64 (16 single-noise + pair_04_pink + pair_05_brown), built per- tile features four different ways, and ran the same noise-classification task (16 noise classes + 3 compositional zone classes) through identical MLP architectures.
| Cell | Per-(tile, bank) feature | Per-tile dim | A test acc @ 256 | B test acc @ 256 |
|---|---|---|---|---|
| J | scalar MSE | 18 | 96.6% | 93.1% |
| J' | max-pool axis activations [27 dims] | 486 | 81.6% | 57.5% |
| J'' | V-stats per axis [27×5] | 2430 | 70.1% | 44.8% |
| J''' | per-patch axis stats [27×5] | 2430 | 94.3% | 86.2% |
The first two attempts (J' and J'') lost dramatically. We diagnosed two causes, with the second being structurally more important:
The big one: patch_idx=0. The h2-64 banks have patch_size=4 on
64×64 inputs, producing 256 sub-patches per tile. The original
compute_axis_codebook used only patch 0 — inherited from training-time
CV measurement where one patch suffices for the regularizer. This carried
into A0/A1/A2/A3 (where it didn't matter — those were structural
verification probes, and patch 0 has the same projective structure as
any other patch in a translation-invariant encoder) and then into the
production codebook calibration. Cells J' and J'' were operating on
1/256 of the available spatial signal.
The smaller one: V-pooling. J' applied max-pool over the V dimension
of [V=32, n_axes], collapsing 32 row activations into a single number
per axis. This discards the per-row distribution shape — e.g.
"concentrated on one row" vs "spread across many" — which is the
discriminative signal between input types.
J''' fixed both: codebook calibrated with patch_agg='mean' (averages
over all 256 patches per image during calibration), encode_axes returns
[B, 256, V, n_axes], features computed as joint stats over the 256×32
= 8192 (patch, V-row) plane per axis. Closes 88% of the gap to MSE.
But MSE still wins, consistently 2-8% depending on resolution.
The honest read: for downstream tasks where MSE is naturally available (training-distribution inputs with reconstruction targets), MSE evaluates the training objective directly and is optimally informative. Anything derived from the bank's M tensor is downstream of the MSE-optimized weights and has strictly less information. The remaining 2-8% gap isn't an engineering opportunity — it's structural.
Where axis features may still win, and ft3 should test:
- OOD detection (no MSE recon target available)
- Cross-bank similarity (MSE is per-bank; axes can be compared across banks via Procrustes alignment)
- Few-shot inference (axis quantization → nearest-neighbor classification with no MLP training)
- Compositional input parsing (zone-matte tiles where axis activations preserve sub-tile structure that per-tile MSE collapses)
The discovery is unaffected by these classification results. Whether
axis features beat MSE for classification is a separate question from
whether the projective codebook exists. The Cell-J arc surfaced and
fixed a silent bug (patch_idx=0 in production code) that had been
latent since A0; that cleanup was probably worth the iteration cost.
8. Phase S — 1600-config D=5 architecture sweep
With the projective discovery and API in hand, the natural next step was characterizing D=5 across noise types and architectures. Phase S did this at 20-batch budget — capacity benchmarking, not geometric testing.
The grid: hidden ∈ {4, 8, 16, 32, 64} × V ∈ {2, 4, 8, 16, 32} × D=5 × depth ∈ {0, 1} × n_cross ∈ {0, 1} × noise ∈ {0..15}, all Adam @ lr=3e-3. Each config trains and tests on its own assigned noise type only (16× test cost reduction vs P-sweep's full per-noise generalization measurement). 1600 configs total.
Two findings:
Cross-noise rank correlation: +0.954. Architectures rank near- identically across all 16 noise types (range +0.831 to +1.000 in the 16×16 noise-noise rank correlation matrix). The "best" architecture is essentially noise-INDEPENDENT — what changes per noise is achievable floor MSE, not which model handles it best. Top-4 universal architectures are all hidden=64, V=32 (mean ranks 1.1 to 2.2). depth and n_cross barely matter at 20-batch budget.
Noise difficulty range: 34×. Easiest: pink (best MSE 0.074), brown (0.074), poisson (0.121). Hardest: salt_pepper (2.51), cauchy (1.81), laplace (1.07). Heavy-tailed and impulse-like distributions are intrinsically harder for L2/MSE training — distribution shape, not architectural failure.
The geometry data from S was meaningless. At 20 batches, CV/band- deviation/erank are mid-training snapshots, not converged geometry. Mean CV came back at 0.594 — far from the 0.20-0.23 universal band — but that's not contradicting the discovery. It's measuring optimization in progress.
S's role was prep: it identified the architectures worth running at full convergence budget. That was Phase T.
8.1 Engineering aside: HF rate-limiting
Phase S was the first sweep to upload per-config artifacts to HuggingFace at scale. We had an async upload queue specifically to avoid blocking training (training never waits for HF). What we didn't have was awareness that HF rate-limits commits per repo per unit time.
S submitted 1592 individual upload_folder commits in roughly 30 minutes.
Result: 201 succeeded, 1391 failed with HfHubHTTPError (rate limit).
87% loss rate.
The fix used in T: batch-sync uploads. Submit one upload_folder job
per N completed configs covering the entire output directory. HF Hub does
delta detection — only changed files actually transfer. T's 64 configs / 8
= 8 commits total, no rate limit issues, faster overall.
This is now an engineering invariant. Per-config commits are wrong at sweep volume. Batch-sync covers the full directory at checkpoint cadence.
The S-sweep aggregate JSON was uploaded in one final commit and contains all the analysis-relevant data; the 1391 individual config directories are lost, but the architectural floor map and rank correlations are intact.
9. Phase T — the D=5 walk-back
Phase T was the actual geometric test at D=5. Take the top architectures S identified, run them at A3's full 1000-batch budget across all 16 noise types, measure converged geometry per (architecture × noise) cell.
64 configs = 4 architectures × 16 noises. Architectures chosen from S analysis:
h64_V32_dp0_nx1— S best universal (mean rank 1.1)h64_V32_dp1_nx1— S 3rd best (mean rank 1.9)h64_V16_dp1_nx1— 5th tier, V<32 convergence testh64_V8_dp1_nx0— 10th tier, smaller-V test
The hypothesis being tested: does projective-clean (deviation < 0.05) emerge at D=5 across all noise types and V values, or only for the specific case A3 happened to test?
Answer: NO. Projective-clean is partial at D=5.
23/64 configs (~36%) converged within ±0.05. Mean band deviation +0.067 — four times A3's +0.016 reference. Per-architecture clean rates:
| Architecture | % noises projective-clean |
|---|---|
| h64_V16_dp1_nx1 | 62% (10/16) |
| h64_V32_dp0_nx1 | 50% (8/16) |
| h64_V8_dp1_nx0 | 25% (4/16) |
| h64_V32_dp1_nx1 | 19% (3/16) |
The most striking finding wasn't the failure rate — it was which architecture was best. V=16 was the geometric sweet spot at D=5, not V=32.
9.1 The natural-axis-count framework
Per-V deviation, with the projective-clean band shaded:
| V | Mean dev | p25–p75 | In band? |
|---|---|---|---|
| 8 | +0.115 | [0.07, 0.18] | No |
| 16 | +0.040 | [0.01, 0.07] | Yes (only V whose mean lands inside) |
| 32 | +0.057 | [0.04, 0.07] | Just outside |
This reframes the cross-D pattern from Section 5.
The original framing: "more D = fewer pairs at fixed V=32." We saw 10 pairs at D=3, 6 at D=4, 3 at D=5, all at V=32. The implied story was that D=5 V=32 was the natural endpoint of the trend.
The corrected framing: the natural axis count grows with D, and V should match it for clean projective geometry.
- At D=4, the bank's natural axis count is around 26 (h2-64 mean 25.6). V=32 has 6 pairs — a clean 6-row buffer over capacity. Projective- clean across all 16 noises.
- At D=5, the natural axis count is around 16. V=32 has too many rows; the bank is forced to either pair them up (creating noisy antipodal structure) or spread sub-optimally. Pair structure varies by noise. Different deviations per noise. Partial projective-clean.
- V=16 D=5 lets every row be its own independent axis on ℝP^4. No pairing required. Geometry settles cleanly.
- V=8 D=5 is below natural count. Bank can't span enough of ℝP^4. Geometric under-parameterization (deviation +0.115).
A3's V=32 D=5 gaussian wasn't disproof of projective-clean — it was a
single point that happened to land in the basin despite V over-counting.
Move to V=32 D=5 with a different noise or different cross-attention
setting and you fall out. Phase T tested h64_V32_dp0_nx1 (cross-
attention on, depth 0) and h64_V32_dp1_nx1 (cross-attention on, depth 1).
Neither matched A3's dp0_nx0 exactly. So A3's specific config might
still be a local minimum, but the property is fragile to nearby
architecture choices in a way it isn't at D=4.
9.2 The salt_pepper anomaly: geometry decoupled from reconstruction
In the S-sweep, salt_pepper was the worst noise to fit (best MSE 2.51, ~34× worse than pink). In Phase T, salt_pepper produced the cleanest projective codebooks — 100% of the four tested architectures converged to the projective-clean band on it.
This confirms a structural decoupling: a bank that fails to reconstruct (high MSE) can still produce a clean projective codebook. The codebook reflects how the bank organizes its internal representation; the MSE reflects how well that organization fits the data. The geometry-vs-MSE scatter in T's analysis shows clean (projective-clean band) configurations spanning four orders of magnitude in MSE.
This matters for the next round of axis-feature work. If a bank's MSE is the wrong predictor of codebook quality, then "the worst-fitting bank" might still produce the most useful projective representation for downstream cross-bank analysis. We learned this the hard way in Cell J.
9.3 What survives, what's qualified
Survives the D=5 walk-back:
- D=4 finding stands. 17 distinct sphere-solver models at D=3 and D=4 (G-Cand at D=3 plus H2a + 16 h2-64 batteries at D=4) all PROJECTIVE-CLEAN. Robust across noise types.
- The collapse method works. Mutual-strongest antipodal matching, symmetric merging, sign canonicalization — these are tensor operations, not hypothesis claims. They identify and merge antipodal pairs correctly when present.
- The API ships correctly. Per-patch averaging is the right default. Smoke tests pass on h2-64.
Qualified:
- D=5 universality is partial. 36% of T's configs converged. V=16 is the sweet spot, not V=32. Architecture-sensitive in ways D=4 wasn't.
- A3's three D=5 runs are now retroactively a single-architecture data point. Its +0.016 deviation is real but not generalizable.
- The cross-D pattern reframes from "more D = fewer pairs at fixed V" to "natural axis count scales with D, V should match it."
What was wrong:
- Calling the discovery "universal across D=3/4/5" based on three D=5 runs at one architecture. The cross-D table looked smooth so we didn't push back on the small D=5 sample. Phase T was the proper stress test, and we should have run it before publishing the universality claim.
patch_idx=0was a silent bug for downstream classification use. It carried unchallenged from training-time CV measurement into A0-A3 (where it didn't matter) into production code (where it cost us 88% of the available signal in the J/J' classifier work).
10. Engineering invariants accumulated
ft2's sweeps surfaced three rules worth recording as permanent invariants:
LBFGS Hessian-corruption rule (Section 2). LBFGS closure NEVER
contains clip_grad_norm_. Step safety for LBFGS comes from
line_search_fn='strong_wolfe'. Adam path keeps grad clip — Adam doesn't
use gradient differences. If you find yourself adding gradient clipping
to an LBFGS closure because "it's diverging," stop and re-read entry
000099. The clipping is the bug.
HF rate-limiting at sweep volume (Section 8.1). Per-config commits
fail at sweeps over ~1000 configs. Use batch-sync uploads — one
upload_folder job per N completed configs covering the entire output
directory. HF Hub does delta detection so unchanged files don't re-
upload.
Trajectory entries are dicts. cv_trajectory records each measurement
as {batch, cv, cv_ema, recon}, not flat scalars. Analyzers must extract
'cv' from each entry. We hit this twice — once in the original Phase T
analyzer that crashed on np.array(traj). Mentioning it here so ft3's
analyzers don't make the same mistake.
11. Open questions for ft3
Direct A3-replication test at D=5. Phase T didn't test h64_V32_dp0_nx0 — A3's exact architecture. A focused 16-config sweep at A3's exact arch across all 16 noises would close whether the property is gaussian-specific at that arch or fails universally outside it. Cheap to run.
Is V=16 D=5 robust across architectures, or was h64_V16_dp1_nx1 lucky? T tested only one V=16 architecture at 62% clean. Sweep V=16 × {hidden, depth, n_cross} variants across all 16 noises. If V=16 stays clean across different architectures, it's the actual D=5 sweet spot. If only the dp1_nx1 version works, there's something more specific going on.
What is the natural axis count at D=5 actually? Run the projective probe on T's V=16 vs V=32 codebooks and count actual axes. If V=16 produces ~16 axes (no pairs) and V=32 produces ~16-22 axes (with pairs), the over-counting hypothesis is confirmed.
Cross-D pattern beyond D=5. Need D=6, D=7, D=8 sweeps with V matched to predicted natural axis count, not fixed V=32. The natural-axis-count framework predicts D=6 should want V≈22, D=7 ≈ V≈28, D=8 ≈ V≈34.
The four_quadrant 0% clean failure. Spatially structured noise where no architecture in T converged. Striking — worth a deeper probe.
Cross-noise codebook similarity matrix. With 16 single-noise battery codebooks at D=4 (all PROJECTIVE-CLEAN), we can compute pairwise codebook similarity (Procrustes-aligned cosine) to ask: which noise types share geometric structure? This is an axis-feature application that MSE can't directly answer.
The omega-token connection. ft1 framed omega tokens as the S-vectors from the SVD readout — transferable currency across models. The projective codebook is a different layer of the same architecture (M rows, before SVD). We should probe whether codebook-derived axes serve a similar cross-model alignment role, and how the two representations interact.
12. Artifacts
All Phase S and Phase T deliverables are on HuggingFace under
AbstractPhil/geolip-svae-implicit-solver-experiments:
phaseS_reports/results_phaseS.json— 1600-config aggregate (only the aggregate; the 1391 rate-limit casualties are gone)phaseS_reports/analysis/— markdown summary, JSON metrics, multi- panel PNGphaseT_reports/{64 config dirs}— full weights + reports per config (batch-synced, 8 commits)phaseT_reports/results_phaseT.json— 64-config aggregatephaseT_reports/analysis/— same three-output structure as S
The original 19-model verification (A0/A1/A2/A3 probes) lives under
implicit_solver_reports/:
A0_projective_reprobe.{json,png}— G-Cand at D=3 V=32A1_projective_reprobe_h2a.{json,png}— H2a at D=4 V=32A2_projective_h2_64_singles.{json,png}— all 16 h2-64 batteriesA3_d5_spherical/— the three D=5 reference configs (V=16/32/64)
The package update (compute_axis_codebook etc) shipped in
geolip_svae/arrays/model.py. A6 smoke test in the same repo —
A6_axis_sampling_smoke_test.py, pip-installable via the standard
pip install geolip-svae once the next release goes out.
13. Closing note
ft1 published the universal attractor — a CV signature that quantizes by D, robust to 233 ablation configurations. The "what kind of polygon?" question was left open in Section 10.
ft2 answered it for D=4: the trained sphere-solver IS the polytope codebook, accessible via antipodal-collapse projection onto ℝP^(D-1). Then we extended too far: assuming three D=5 runs at one architecture meant universal-at-D=5, when the proper 64-config stress test showed it's V-dependent and architecture-sensitive at D=5.
The arc from ft1 to here was: an LBFGS engineering bug discovered and
fixed (entry 000099), a Q-sweep that formalized the H2 class hierarchy
and surfaced an ill-fated P-Class hypothesis (000100), four wrong
hypotheses cycled through before the projective interpretation landed
(000101 backstory), 17-model empirical verification at D=3/4 (000101),
package shipping with A6 smoke test (000103-104), classification stress
test that fixed a latent patch_idx=0 bug along the way (000104),
1600-config architectural floor map at D=5 (000105), and 64-config
convergence sweep that walked back the D=5 universality (000106).
The walk-back isn't a failure — it's what stress-testing is for. A clean pattern with a small sample size becomes a TL;DR before it's been exercised. Phase T cost about an hour of A100 time and produced a meaningfully more accurate picture of what the bank does at D=5. ft3 will bring axes 1-7 in Section 11 to the same level of empirical grounding the D=4 result already has.
The discovery is real. The D=4 evidence is robust. The API is shipped. The D=5 framing needs the natural-axis-count refinement, and ft3's first sweep tests that refinement directly.
ft2 emerged over four weeks of A100 time across roughly twenty research sessions. Code and reports are on HuggingFace as described in Section 12. ft3 will extend coverage along Section 11's seven axes, with the natural- axis-count framework as the first testable prediction.

