Large Experiment 2: Fixing AdamW.

Hypothesis:

AdamW's weight decay is degrading performance in these geometric systems, causing a cascade failure in the learned structure.

Preliminary tests show this is not just a possibility; it is most likely the dominant failure mode.

Reason:

The uniform shrinkage applied by weight decay is helpful for many weights and helps align rounded structures, but it simultaneously destroys optimization for rigid structures.

Until now this has proven valuable; now that it has become a hindrance, a new formulation must replace the AdamW limitations.

Experiment 1: Retune AdamW Directly

I'll attempt to tweak AdamW specifically so it does not destroy the geometric shape, disabling weight_decay from this point onward.

The outcomes show there isn't much to tweak in this particular classifier beyond tuning specifics. Introducing more anchors helps, and with that the dims can be reduced. Essentially the anchors act as capacity tuning forks in this variation rather than utilities, which is fine. We can allocate them to a student.
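To make the decay change concrete, here is a minimal sketch of how the exemption could be wired up, assuming hypothetical module names (`backbone`, `anchors`): only non-anchor, non-bias weights keep decay, so the anchor constellation is no longer shrunk toward the origin every step.

```python
import torch

# Hypothetical stand-in model: a backbone plus an anchor constellation
# whose geometry we want AdamW to leave alone.
model = torch.nn.ModuleDict({
    "backbone": torch.nn.Linear(768, 768),
    "anchors": torch.nn.Embedding(30, 768),
})

decay, no_decay = [], []
for name, p in model.named_parameters():
    # Exempt anchors (and biases) from weight decay.
    if "anchors" in name or name.endswith("bias"):
        no_decay.append(p)
    else:
        decay.append(p)

opt = torch.optim.AdamW([
    {"params": decay, "weight_decay": 0.01},
    {"params": no_decay, "weight_decay": 0.0},
], lr=1e-3)
```

The same two-group split also works as a halfway point before disabling decay entirely.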

=================================================================
SWEEP RESULTS
=================================================================

  Config           v_acc  t_acc    gap      cv     Δcv  eq_std  poly curve  star struct
  ------------------------------------------------------------------------------------------
  raw_adam         0.617  0.681 +0.064  1.3917 +1.1917  0.4075  0.39  0.75  0.86  0.61
  proven           0.722  0.706 -0.016  1.3629 +1.1629  0.4157  0.45  0.99  0.93  0.71
  +spread          0.669  0.686 +0.017  1.4491 +1.2491  0.4212  0.41  0.98  0.71  0.72
  +entropy         0.674  0.711 +0.037  1.4945 +1.2945  0.4237  0.42  0.97  0.70  0.74
  +ortho           0.695  0.690 -0.005  1.3454 +1.1454  0.4171  0.40  0.99  0.85  0.72
  +cluster         0.701  0.717 +0.016  1.3034 +1.1034  0.4131  0.44  0.93  0.91  0.70
  +drift           0.709  0.698 -0.012  1.3480 +1.1480  0.4134  0.41  1.00  0.91  0.72
  +spr+ort         0.723  0.698 -0.025  1.3881 +1.1881  0.4224  0.46  0.97  0.94  0.71
  +all_micro       0.694  0.700 +0.007  1.5181 +1.3181  0.4077  0.40  0.97  0.85  0.72

  Best accuracy: +spr+ort (val_acc=0.723)
  Best structure: +entropy (struct=0.737)
  Closest to CV=0.2: +cluster (cv=1.3034, Δ=+1.1034)
  Most equidistant: raw_adam (equi_std=0.4075)
  Most stable CV: raw_adam (cv_std=0.1734)

=================================================================
DONE
=================================================================

The outcomes show that we can definitely influence the system: its deviation can be pushed into an entirely new, currently unoccupied band of CV.

There's a lot to unpack here. The most critical piece is a hyperparameter controlling where on the latent spectrum the model's continuum should sit.
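If such a knob exists, one hedged formulation is a target-CV penalty, assuming CV here means std/mean of the pairwise anchor distances (the experiments may define it differently):

```python
import torch

def cv_loss(anchors: torch.Tensor, target_cv: float = 0.2) -> torch.Tensor:
    """Penalize deviation of the anchor constellation's CV from a target.

    Assumes CV = std/mean of pairwise anchor distances; ``target_cv`` is the
    hypothetical knob choosing where on the CV spectrum the model should sit.
    """
    d = torch.pdist(anchors)                     # all pairwise distances
    cv = d.std() / d.mean().clamp_min(1e-8)      # coefficient of variation
    return (cv - target_cv) ** 2
```

Added to the task loss with a small weight, this pulls the constellation toward the chosen CV band instead of wherever the optimizer drifts.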

Experiment 2: Teacher/Student hierarchy

I've run plenty of genetic experiments; a single student anchored from a teacher should provide a more robust sweep.

=================================================================
COMPARISON
=================================================================

  Config                v_acc  t_acc    gap      cv  poly curve  star struct
  ---------------------------------------------------------------------------
  Raw Adam              0.626  0.679 +0.053  1.3669  0.30  1.00  0.77  0.65
  Teacher               0.645  0.629 -0.017  1.5645  0.38  0.91  0.72  0.71
  Student+entropy       0.698  0.677 -0.021  2.1624  0.39  0.98  0.88  0.73
  Student+same          0.672  0.681 +0.010  1.5182  0.39  1.00  0.75  0.72

  Val accuracy trajectory:
  Epoch    Raw Adam        Teacher         Student+entropy Student+same   
  E1      0.198           0.167           0.164           0.189          
  E5      0.592           0.426           0.509           0.526          
  E10     0.623           0.576           0.614           0.652          
  E15     0.633           0.647           0.670           0.618          
  E20     0.658           0.638           0.630           0.603          
  E25     0.685           0.662           0.665           0.700          
  E30     0.626           0.645           0.698           0.672          

  Teacher→Student anchor drift:
    Mean drift: 0.3036
    Max drift:  0.4506
    Min drift:  0.1311

Genetic inheritance works quite well in this spectrum either way: the anchors can be passed down the chain to the student.
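A minimal sketch of what passing anchors down the chain could look like, assuming the anchors are a plain parameter tensor: the student starts exactly at the teacher's constellation and is only allowed slow drift via a tiny anchor learning rate.

```python
import torch

# Stand-in for a trained teacher's anchor constellation (shapes assumed).
teacher_anchors = torch.randn(30, 768)

# The student inherits the teacher's anchors as its starting constellation.
student_anchors = torch.nn.Parameter(teacher_anchors.clone())

# A tiny learning rate on the anchor group allows limited drift down the
# chain instead of free movement (1e-5 is a hypothetical value).
opt = torch.optim.Adam([{"params": [student_anchors], "lr": 1e-5}])

# Per-anchor drift, measured the same way as the drift stats reported above.
drift = (student_anchors - teacher_anchors).norm(dim=1)
```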

Experiment 3: Dual Teacher/Student Procrustes Duality

A good fit for my particular architectures; it needs Procrustes whitening, centering, and alignment curation. With this the system conforms more cleanly.

This cannot exist on the teachers, but the student must see it.

As expected, the student outperformed the teacher. That matches the results from the captionbert process, and we now have an autograd representation that can produce a reusable state of this.

This demonstrates a practical use of the dual-teacher setup.
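The centering/whitening/alignment step can be sketched with a standard orthogonal Procrustes solve; this is an assumed reconstruction of the alignment stage, not the exact curation used here:

```python
import torch

def procrustes_align(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Rotate B onto A (orthogonal Procrustes) after centering and scaling.

    A, B: (n, d) embedding matrices from two teachers over the same inputs.
    Returns B aligned into A's frame; a consensus can then be averaged.
    """
    A = A - A.mean(0)            # center
    B = B - B.mean(0)
    A = A / A.norm()             # remove scale (Frobenius normalization)
    B = B / B.norm()
    # Optimal rotation R minimizing ||B @ R - A||_F via SVD of B^T A.
    U, _, Vt = torch.linalg.svd(B.T @ A, full_matrices=False)
    return B @ (U @ Vt)
```

With both teachers in one frame, the consensus target is simply the mean of the aligned embeddings, which is what the student distills against in Stage 3.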

=================================================================
DUAL-TEACHER PROCRUSTES CONSENSUS DISTILLATION
=================================================================
  Device: cuda

  Generating data...
  Train: 15,000  Val: 3,000

=================================================================
STAGE 1A: TEACHER A - Raw Adam
=================================================================
  [A] E 1: t=0.073 v=0.200 cv=1.3069
  [A] E10: t=0.612 v=0.613 cv=1.4364
  [A] E20: t=0.655 v=0.590 cv=1.4770
  [A] E30: t=0.690 v=0.699 cv=1.3797

=================================================================
STAGE 1B: TEACHER B - Geometric (+spr+ort)
=================================================================
  [B] E 1: t=0.072 v=0.184 cv=1.4589
  [B] E10: t=0.578 v=0.606 cv=1.5603
  [B] E20: t=0.614 v=0.667 cv=1.5950
  [B] E30: t=0.658 v=0.649 cv=1.8004

=================================================================
STAGE 2: EXTRACT + PROCRUSTES ALIGN
=================================================================
  Teacher A embeddings: torch.Size([15000, 768])
  Teacher B embeddings: torch.Size([15000, 768])
  Raw cos(A, B): 0.4360
  GPA iter 1: delta=0.12673541
  GPA iter 5: delta=0.01321763
  GPA iter 10: delta=0.00224325
  cos(consensus, a): 0.8251
  cos(consensus, b): 0.8226
  Consensus CV: 0.1774
  Consensus anchors: torch.Size([30, 768])
  Teacher A anchors cos: 0.0008
  Teacher B anchors cos: -0.0160

=================================================================
STAGE 3: STUDENT - Consensus distillation + classification
=================================================================
  E 1: t=0.081 v=0.203 cos=0.230 cv=1.1871 rig=4.8/34.4 [polygon=0.04 curve=0.00 star=0.36 structure=0.35]
  E 5: t=0.610 v=0.618 cos=0.451 cv=0.6686 rig=12.9/98.8 [polygon=0.38 curve=0.83 star=0.67 structure=0.70]
  E10: t=0.660 v=0.659 cos=0.550 cv=0.5453 rig=15.5/99.6 [polygon=0.41 curve=0.94 star=0.71 structure=0.72]
  E15: t=0.711 v=0.702 cos=0.625 cv=0.4492 rig=18.7/97.8 [polygon=0.39 curve=0.88 star=0.93 structure=0.76]
  E20: t=0.735 v=0.703 cos=0.671 cv=0.4598 rig=18.8/96.4 [polygon=0.45 curve=1.00 star=0.84 structure=0.70]
  E25: t=0.745 v=0.736 cos=0.693 cv=0.4261 rig=18.3/92.9 [polygon=0.48 curve=1.00 star=0.92 structure=0.73]
  E30: t=0.763 v=0.761 cos=0.704 cv=0.3359 rig=17.9/90.4 [polygon=0.50 curve=0.98 star=0.97 structure=0.76]

=================================================================
FINAL COMPARISON
=================================================================

  Model            v_acc      cv  poly curve  star struct
  -------------------------------------------------------
  Teacher_A        0.699  1.4312  0.42  0.99  0.83  0.72
  Teacher_B        0.649  1.5969  0.38  0.95  0.79  0.66
  Student          0.761  0.3329  0.50  0.98  0.97  0.76

  Student anchor drift from consensus: mean=0.4458 max=0.6453

=================================================================
DONE
=================================================================

Experiment 4: Genetic Hierarchy

Let's make some inbred mutants and see how they behave.

=================================================================
EVOLUTION SUMMARY
=================================================================

  Model        Gen  v_acc      cv  poly curve  star struct
  -------------------------------------------------------
  F0_geo         0  0.663  1.8428  0.36  1.00  0.75  0.72
  F0_raw         0  0.586  1.2137  0.48  0.29  0.91  0.64
  F1_new         1  0.690  0.0000  0.00  0.00  0.00  0.00
  G1_0           1  0.746  0.5866  0.50  0.99  0.93  0.74
  G1_1           1  0.749  0.4433  0.49  1.00  0.95  0.74
  G1_2           1  0.761  0.2822  0.49  1.00  0.99  0.75
  F2_new         2  0.699  0.0000  0.00  0.00  0.00  0.00
  G2_0           2  0.750  0.4110  0.49  0.99  0.94  0.75
  G2_1           2  0.751  0.4170  0.49  0.95  0.97  0.76
  G2_2           2  0.766  0.3079  0.54  0.98  0.96  0.75
  G2_3           2  0.764  0.3613  0.55  0.98  0.97  0.73
  FINAL          3  0.764  0.2954  0.55  1.00  0.96  0.73

  Per-generation averages:
    Gen 0: mean_acc=0.625 best=0.663 n=2
    Gen 1: mean_acc=0.737 best=0.761 n=4
    Gen 2: mean_acc=0.746 best=0.766 n=5
    Gen 3: mean_acc=0.764 best=0.764 n=1

  Consensus CV progression: G1=0.1258 → G2=0.1031 → G3=0.1456

Turns out this is a bit more selective and a bit less jittery than plain genetic inheritance.

Not only that, it's actually not bad.

The polygon gain tells the real story: the inheritance is the geometric structure that didn't collapse in the anchors.

So over multiple generations, the geometric complexity improves as the losses and autograd naturally sharpen the output.
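The generation procedure above can be sketched as a tiny selection loop; `train_student` is a hypothetical callable that trains one child from the current parents and reports its validation accuracy:

```python
def evolve(train_student, parents, n_children=3, n_gens=2):
    """Hypothetical sketch of the generation loop: each generation trains
    several students from the current parents, keeps the best child by
    validation accuracy, and seeds the next generation with it."""
    for _ in range(n_gens):
        children = [train_student(parents) for _ in range(n_children)]
        best = max(children, key=lambda c: c["val_acc"])
        parents = [best]          # elitist selection: best child survives
    return parents[0]
```

This matches the shape of the summary table: per-generation cohorts, with the best performer carried forward as FINAL.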


Large Experimental Conclusion 1:

Experiment 1's patchmaker classifier was invalidly aligned to the anchors; the final run must be rerun.

The current AdamW-based modifier is killing the geometric results.

AdamW is a limiter, not a helper. Adam with trajectory and separation control is more reliable, but still not enough on its own.

Discoveries:

In the common case, cross_entropy loses on every margin when using hypersphere coordinates as embeddings: it is beaten by +12% or more with just the geo losses.

This formula is missing a core component; it cannot yet represent the implications necessary for full encoding cohesion.

Without the geodesic controllers applied by the more advanced controlling agents and losses, the system cannot differentiate useful measures on larger planar structures.

Though I knew that last one.

Reason:

Training the anchor itself along with the bert structure caused a large state of drift, which decoupled many internal learned structures.

This by itself caused the Bert model's internal CV to deform; I will need to roll back the last 2 unfrozen epochs because of it, but I have a backup, so it's fine.

The assessment shows that the rigidity was destroyed and smoothed into a state similar to the hypersphere, meaning the hypersphere's pressure was predominantly applied inside the model through the averaging mechanisms rather than the structure fully preserving the manifold.

This wasn't catastrophic; captionbert is predominantly fine, but the internal damage is extensive and the rollback will cost -1mil samples on the tally total.

Externally you would never know: captionbert looks fine, and the measures are even better than before. Internally, the collapse was extensive.

Many functional systems collapsed into more generic ones, destroying the preserved geometry when things "bloated" too much from the anchor drifting. Natural attenuation seeks equilibrium, and with 5 experts there will never be true equilibrium.

Thus the anchor must be nearly, but not completely, frozen while training the core weights. True euclidean space requires some drift to compensate for capacity differentiation and growth, but this system is unique to the emulation of superposition differentiation, and thus many of the quirks will be...

Unpredictable.

Hypothesis for why:

The structural integrity must remain rigid while being prepared over a smooth surface. Some smoothing must occur to map multiplanar supported systems from multiple adjacent rigid complex associations. However, because the multiplanar rigidity is misaligned by nature, the structure conformed to an invisible "MIDDLEGROUND" differentiation element. This middleground average formed a pooled structure in complete defiance of the anchor and the system, because the anchor was not preserved solidly enough.

Potential Solution:

Control the autograd so the anchor is preserved as the predominant choice, at the risk of instability. This will require multiple tests.
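One hedged way to keep the anchor predominant is a gradient hook that damps anchor updates without freezing them: autograd still flows, only the update magnitude is throttled. The 0.01 factor is an assumption, not a tuned value.

```python
import torch

anchors = torch.nn.Parameter(torch.randn(30, 768))

# Near-frozen anchors: every backward pass multiplies the anchor gradient
# by a small damping factor, allowing slow drift instead of none.
anchors.register_hook(lambda g: g * 0.01)
```

The same effect can be achieved with a separate optimizer group at a tiny learning rate; the hook variant also damps any gradient clipping or momentum statistics computed downstream.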

Experimental Hypothesis:

The euclidean autograd in PyTorch is causing the differential analysis to collapse in the final stages of the MLP, reducing overall capacity and destroying the attenuated geometric anchored structure in a way that is neither beneficial nor helpful toward the geometric goal.

Experiment 1:

Gate-aware autograd interference with autonomous adaptation

This should theoretically compensate for the autograd's tendency to over-smooth complex structures, while preserving those complex structural gains and not ignoring important structural systems within the established anchored CV geometric spectrum.

This could preserve rigidity while allowing a multiplanar smoothing effect, which is native to hypersphere-based architectures that sample rigid positioning from rigid manifolds and map that rigidity onto smooth layered surfaces.
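A minimal sketch of what a gate-aware autograd pass could look like: identity in the forward direction, with the backward gradient scaled by a per-feature gate. The class name and gating policy are assumptions; rigid directions (gate near 1) keep their gradient, smooth directions (gate near 0) are damped.

```python
import torch

class GatedGrad(torch.autograd.Function):
    """Identity forward; backward multiplies the gradient by a gate tensor."""

    @staticmethod
    def forward(ctx, x, gate):
        ctx.save_for_backward(gate)
        return x                      # forward pass is untouched

    @staticmethod
    def backward(ctx, grad_out):
        (gate,) = ctx.saved_tensors
        # Interfere only with the backward signal: damp gradients flowing
        # through "smooth" directions, preserve "rigid" ones.
        return grad_out * gate, None  # no gradient for the gate itself
```

Usage would be `h = GatedGrad.apply(h, gate)` at the layer boundary where the smoothing pressure originates; the gate could itself be derived from per-anchor rigidity estimates.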

Part 1:

Simple benchmark. The outcomes showed differentiation and potential utility, tested on a simple 3-shape synthetic classifier using actual geometric shapes.

=================================================================
COMPARISON
=================================================================

  Metric                      Baseline      Gated
  -----------------------------------------------
  Val accuracy                   0.999      0.998
  Train accuracy                 1.000      0.999
  Overfit gap                    0.001      0.001
  Val CV                        0.7888     0.8755
  Proto similarity              -0.242     -0.253
  CV tri                         0.854      0.721
  CV circle                      0.768      0.669
  CV pentagon                    1.238      1.009

  CV trajectory (std over epochs):
    Baseline: 0.1125
    Gated:    0.1127
    Baseline more stable

  Overfit gap trajectory (mean ± std):
    Baseline: -0.008 ± 0.041
    Gated:    -0.011 ± 0.043

=================================================================
DONE
=================================================================

The differentiation is radical, but the task was trivial. The solution was found strongly enough that it could potentially bypass the problem outright.

Attempt 2: 10 shapes, a mini shape mnist

As the original geometric patchwork system showed, the shapes themselves are in fact classifiable, and not very hard to classify.

=================================================================
COMPARISON
=================================================================

  Metric                      Baseline      Gated
  -----------------------------------------------
  Val accuracy                   0.991      0.989
  Train accuracy                 0.993      0.996
  Overfit gap                    0.002      0.007
  Val CV                        0.7236     0.7476
  Proto similarity              -0.081     -0.085
  CV triangle                    0.474      0.436
  CV circle                      0.644      0.592
  CV pentagon                    0.522      0.559
  CV square                      0.521      0.508
  CV hexagon                     0.778      0.623
  CV star5                       0.464      0.519
  CV star7                       0.721      0.645
  CV octagon                     0.683      0.500
  CV cross                       0.771      0.829
  CV spiral                      1.267      0.914

  CV trajectory (std over epochs):
    Baseline: 0.1144
    Gated:    0.1307
    Baseline more stable

  Overfit gap trajectory (mean ± std):
    Baseline: -0.007 ± 0.056
    Gated:    0.002 ± 0.053

=================================================================
DONE
=================================================================

The outcome was still too trivial: the model found nearly orthogonal solutions at 99.9% accuracy on the validation data.

The CV was meaningless due to the simplicity of the task; it would not yield the implications the result needs.

Attempt 3: 30 shapes - captionbert anchors for embedding vectors

This is considerably more complex. It forces the model to learn the differences rather than simply bypass the losses and funnel.

Attempt 4: experimental hypersphere coordinate embedding

The results are okay and the 30 shapes make the task harder, but the fundamental issue remains: the hypersphere does not conform, with or without the autograd gate. The rigidity is smoothed away instead of coexisting.

  Final constellation:
    Mean cos: 0.0025
    CV:       0.3251
    Rigidity: mean=9.4 max=100.0

  Per-anchor rigidity:
    triangle       : 1.9 █
    square         : 1.9 █
    pentagon       : 1.8 █
    hexagon        : 2.0 █
    heptagon       : 2.0 ██
    octagon        : 2.0 ██
    nonagon        : 2.1 ██
    decagon        : 2.0 █
    dodecagon      : 2.1 ██
    circle         : 2.3 ██
    ellipse        : 1.9 █
    spiral         : 8.2 ████████
    wave           : 20.9 ████████████████████
    crescent       : 100.0 ████████████████████████████████████████████████████████████████████████████████████████████████████
    star3          : 2.3 ██
    star4          : 1.8 █
    star5          : 2.1 ██
    star6          : 2.4 ██
    star7          : 2.7 ██
    star8          : 3.2 ███
    cross          : 2.6 ██
    diamond        : 1.9 █
    arrow          : 1.8 █
    heart          : 1.3 █
    ring           : 1.3 █
    semicircle     : 100.0 ████████████████████████████████████████████████████████████████████████████████████████████████████
    trapezoid      : 1.8 █
    parallelogram  : 1.8 █
    rhombus        : 1.9 █
    chevron        : 1.6 █

This causes cascade bias rather than helpful behavior.

SOMEHOW the rigidity of the circle-derived shapes (crescent, semicircle) is recorded as MAXIMUM rigidity, which may actually be true given how densely a circular contour must be represented.

In that light, you can probably say yes, a circle is represented by a potentially indefinite number of representation points, which is why we measure around one point rather than literally with one.

IT DID IN FACT classify the shape of its own most supported embedding. That is expected; however, I did not expect the rigidity to conform as well.

Maximally rigid, minimally curved, entirely... wrong.

Attempt 4: new understanding - the AI liar paradox interferes with baseline geometric research

The AI systems I'm working through, namely Claude, GPT, and Gemini, are all running in the same rut as we attempt to debug this.

After analyzing the code I noticed that it had literally been turned into a hypersphere analyzer, instead of using representation points to project a utility onto another surface.

The circle constraints make the model internals grind away from what actually happens in the experiments, in favor of an internalized geometric bias, established at a rudimentary level, that simply does not match the actual results.

Simply put, they know something is wrong, and each of them keeps redefining that incorrectness as normality.

I'm attempting to compensate so I can get this next experiment done and then move onward to the next point.

Getting to the bottom of a broken taught theorem isn't on my list here; I need the experiment ready.

Attempt 5: returning to baseline data and running a sweep using the refined autograd.

Since none of the AIs can guess their way out of it and I'm starting to see a pattern, we're going to run a full sweep here using a simple autoregression MLP.

The resulting geometric alignment through the defined autograd will determine the direction of adjustment and the formation of our constant intrinsic barrier for CV, assuming such a barrier CAN exist. I'm starting to think the relational nature of CV is dynamic and may not actually be controllable without weights.

=================================================================
SWEEP RESULTS
=================================================================

  Config           v_acc  t_acc    gap      cv     Δcv  eq_std  poly curve  star struct
  ------------------------------------------------------------------------------------------
  baseline         0.691  0.714 +0.022  1.0222 +0.8222  0.4366  0.41  0.98  0.82  0.72
  tang_50          0.692  0.713 +0.021  1.1747 +0.9747  0.4328  0.41  0.99  0.83  0.72
  tang_100         0.705  0.717 +0.012  1.0296 +0.8296  0.4534  0.42  0.98  0.85  0.73
  equi_low         0.033  0.034 +0.001  0.0000 -0.2000  0.0427  0.00  0.00  0.00  0.10
  equi_med         0.033  0.033 -0.000  0.0000 -0.2000  0.0424  0.00  0.00  0.00  0.10
  equi_high        0.033  0.034 +0.001  0.0000 -0.2000  0.0422  0.00  0.00  0.00  0.10
  sep_low          0.657  0.718 +0.060  1.2410 +1.0410  0.4440  0.42  0.95  0.69  0.71
  sep_high         0.712  0.709 -0.003  1.7948 +1.5948  0.4317  0.45  0.96  0.91  0.71
  equi+sep         0.033  0.034 +0.001  0.0000 -0.2000  0.0429  0.00  0.00  0.00  0.10
  full_gentle      0.033  0.035 +0.001  0.0000 -0.2000  0.0421  0.00  0.00  0.00  0.10
  full_strong      0.033  0.034 +0.001  0.0000 -0.2000  0.0421  0.00  0.00  0.00  0.10
  max              0.033  0.035 +0.001  0.0000 -0.2000  0.0424  0.00  0.00  0.00  0.10

  Best accuracy: sep_high (val_acc=0.712)
  Best structure: tang_100 (struct=0.732)
  Closest to CV=0.2: full_strong (cv=0.0000, Δ=-0.2000)
  Most equidistant: full_gentle (equi_std=0.0421)
  Most stable CV: full_gentle (cv_std=0.0008)

=================================================================
DONE
=================================================================

As you can see, multiple toggles destroy the autograd procedure completely. I've devised a potential solution.

Attempt 6: Updated autograd system with better controls

The last structure used both invalid and old losses, as well as incorrect spectral control of the gradients.

=================================================================
SWEEP RESULTS
=================================================================

  Config           v_acc  t_acc    gap      cv     Δcv  eq_std  poly curve  star struct
  ------------------------------------------------------------------------------------------
  baseline         0.719  0.710 -0.009  1.5066 +1.3066  0.4413  0.53  0.98  0.92  0.64
  cv_only_01       0.548  0.480 -0.068  0.2297 +0.0297  0.6814  0.12  0.93  0.76  0.61
  cv_only_05       0.478  0.472 -0.006  0.1963 -0.0037  0.6698  0.13  0.94  0.50  0.55
  cv_only_10       0.428  0.401 -0.027  0.2638 +0.0638  0.6322  0.11  0.94  0.44  0.45
  tang_50          0.711  0.706 -0.004  1.5108 +1.3108  0.4530  0.52  0.96  0.91  0.64
  tang_100         0.720  0.690 -0.030  1.5335 +1.3335  0.4431  0.52  0.98  0.92  0.65
  tang+cv          0.572  0.542 -0.030  0.4158 +0.2158  0.6602  0.22  0.84  0.78  0.63
  sep_low          0.709  0.723 +0.014  1.6462 +1.4462  0.4423  0.53  0.93  0.90  0.65
  sep_high         0.730  0.716 -0.014  1.6925 +1.4925  0.4530  0.56  0.96  0.94  0.65
  tang+cv+sep      0.552  0.507 -0.045  0.5011 +0.3011  0.5835  0.15  0.95  0.77  0.58
  full_med         0.575  0.540 -0.035  0.3207 +0.1207  0.7342  0.19  0.96  0.79  0.60
  full_strong      0.476  0.410 -0.066  0.2337 +0.0337  0.6569  0.16  0.91  0.50  0.53

  Best accuracy: sep_high (val_acc=0.730)
  Best structure: tang_100 (struct=0.649)
  Closest to CV=0.2: cv_only_05 (cv=0.1963, Δ=-0.0037)
  Most equidistant: baseline (equi_std=0.4413)
  Most stable CV: cv_only_01 (cv_std=0.0700)

=================================================================
DONE
=================================================================

The run is cleaner, but the geometrics are all over the board. Getting closer.

Attempt 7: Tighter constraints and more specific backward control.

With the CV constraint acting as a pulse control on the backward pass, the system must be much more lenient than the loss was.

The loss was an echo; this is a shockwave controller. It is akin to applying frequency-band control: too much CV pressure destroys the model's actual growth.

We don't want to ELIMINATE the CV, we want to curate it. We don't want to trim incorrect branches; we want the system to retain the incorrect branches that are most useful.

With that information, I formatted a more subtle, reduced-power CV sweep using the best tangential and separation settings from the last run.
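One way to read "curate, don't eliminate" in code: make the CV term a dead-band penalty that only fires when the constellation leaves an allowed CV band. The band edges and weight below are hypothetical.

```python
import torch

def banded_cv_penalty(anchors: torch.Tensor,
                      lo: float = 0.15, hi: float = 0.45,
                      weight: float = 0.001) -> torch.Tensor:
    """Pulse-style CV control (a sketch): zero penalty inside [lo, hi],
    quadratic ramp only when the constellation's CV escapes the band.

    Assumes CV = std/mean of pairwise anchor distances.
    """
    d = torch.pdist(anchors)
    cv = d.std() / d.mean().clamp_min(1e-8)
    excess = torch.relu(cv - hi) + torch.relu(lo - cv)   # distance outside band
    return weight * excess ** 2
```

Unlike a hard target, this leaves the optimizer free inside the band, which matches the "echo vs shockwave" framing: the controller only reacts when the CV actually breaks out.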

=================================================================
SWEEP RESULTS
=================================================================

  Config           v_acc  t_acc    gap      cv     Δcv  eq_std  poly curve  star struct
  ------------------------------------------------------------------------------------------
  no_cv            0.721  0.730 +0.009  1.7346 +1.5346  0.4400  0.54  0.99  0.89  0.65
  cv_0.001         0.712  0.706 -0.005  1.5553 +1.3553  0.4291  0.40  0.99  0.91  0.74
  cv_0.005         0.697  0.690 -0.006  1.5407 +1.3407  0.4622  0.38  0.97  0.88  0.73
  cv_0.01          0.640  0.649 +0.010  0.3353 +0.1353  0.6070  0.29  0.97  0.85  0.67
  cv_0.03          0.648  0.632 -0.016  0.2985 +0.0985  0.5909  0.30  0.98  0.85  0.67
  cv_0.06          0.586  0.568 -0.018  0.2331 +0.0331  0.5957  0.24  0.96  0.76  0.61

  Best accuracy: no_cv (val_acc=0.721)
  Best structure: cv_0.001 (struct=0.735)
  Closest to CV=0.2: cv_0.06 (cv=0.2331, Δ=+0.0331)
  Most equidistant: cv_0.001 (equi_std=0.4291)
  Most stable CV: cv_0.06 (cv_std=0.1087)

Less... is more. Next I'll run the same CV with a 0.01 tangent and an increased sep.

I've got a good notion that this could work.
