AbstractPhila
It needs a larger influx of data; I'm going to funnel a few million captions into it before I declare it complete. It's still learning, even though its recall stays perfect, and the CV leaves tons of room for expansion.
This BERT is currently being distilled from five BERT teachers on the Conceptual Captions dataset. The recall accuracy is measured after whitened Procrustes alignment, and the losses are designed to keep that rotation correctly aligned.
The expectation from the smaller prototypes is that this model will reach 100% recall accuracy, aligning specifically to the correct answers via the most optimal teacher opinions in conjunction with all the geometric losses.
No joke, this may be the smallest, cheapest-to-compute, most accurate, and fastest BERT I've trained thus far - and it will be based entirely on five teachers simultaneously feeding opinions through a relay hub.
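For anyone who wants to sanity-check the recall numbers: this is roughly what whitened Procrustes alignment looks like in isolation. The names and shapes below are illustrative, not the trainer's actual code.

```python
import torch

def whiten(x, eps=1e-5):
    """Center and decorrelate embeddings (ZCA-style whitening)."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / (x.shape[0] - 1)
    u, s, _ = torch.linalg.svd(cov)
    return x @ (u @ torch.diag((s + eps).rsqrt()) @ u.T)

def procrustes_rotation(student, teacher):
    """Orthogonal R minimizing ||student @ R - teacher||_F."""
    u, _, vt = torch.linalg.svd(student.T @ teacher)
    return u @ vt

# Hypothetical student/teacher features for a batch of caption embeddings.
student_emb = torch.randn(512, 768)
teacher_emb = torch.randn(512, 768)
s_w, t_w = whiten(student_emb), whiten(teacher_emb)
rotation = procrustes_rotation(s_w, t_w)
aligned = s_w @ rotation   # compare against t_w for recall scoring
```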
The bigG trainer is unstable; I'm ironing out the overflows and NaNs.
It takes nearly an hour per epoch on the bigG trainer, so it's going to be a bit before it's ready. Didn't think this one would be such a problem.
Lost some sleep getting that one set up, but clip_l and clip_g both have memory banks now. G had the most problems and the most errors, and it will require some additional mechanisms to ensure the sequence works correctly as well. Surprisingly enough, though, it can approximate roughly 43% of clip_g's layer sequence through reconstruction, and it reaches over 84.4% ModernBERT accuracy with memory.
I believe the dimensional scaling problem is solved using the correct tokenization differentiation, similar to geolip-bertenstein. This should allow effective direct translation through a uniformly distributed representation rather than a conduit series.
CLIP_G is a big one. The SVD-based scaling system uses a series of alignment paradigms, and those paradigms include padding. The padding essentially only covers ModernBERT's 1024 dims while simultaneously consuming CLIP_G's 1280 dims.
I cannot simply project ModernBERT upward; it will slowly introduce corruption and incorrectness. I cannot simply crush clip_g; it will introduce compounding rounding errors and corruption down the scale, and not produce the correct information.
Which means the geometric state of clip_g is simply more complex to capture, and the geometric structure loss was compensating for those 256 dimensions directly instead of indirectly through inductive learning. That compensation caused a cascading fault of incorrect accumulation - low-hanging triangulation - that built up inside the sequential analysis toolkit. It bled into the memory bank as well, which required a series of tests to repair and 10 full epochs to get close to the clip_l version.
Clip_l is smaller than modernbert, clip_g is larger.
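To make the mismatch concrete, here are the two naive options in miniature - pad ModernBERT's 1024 dims up, or crush CLIP_G's 1280 dims down with SVD. The tensors and sizes are stand-ins, not the real pipeline.

```python
import torch

bert_feats   = torch.randn(4096, 1024)  # hypothetical ModernBERT features
clip_g_feats = torch.randn(4096, 1280)  # hypothetical CLIP_G features

# Option 1: zero-pad ModernBERT up to 1280 dims. The extra 256 dims carry nothing,
# which is where corruption slowly creeps in.
padded = torch.nn.functional.pad(bert_feats, (0, 1280 - 1024))

# Option 2: crush CLIP_G down to 1024 dims via SVD, dropping the smallest singular
# directions - which is where the compounding rounding errors come from.
u, s, vt = torch.linalg.svd(clip_g_feats, full_matrices=False)
crushed = clip_g_feats @ vt[:1024].T   # [4096, 1024]
```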
Bertenstein showed the formula for correct multiscale access, but bertenstein was run on considerably fewer attractors. This sort of memory experiment is substantially more aligned with meshing than with a conduit, and yet I believe I can directly adapt a similar principle and it will work for direct complexity association.
This is a crossover from common machine learning projection practices. We've run into the same island and both came up with similar reasoning as to why. I have a potential solution, but it requires setup and planning. We all use the same machines and the same ships to find the islands; these systems map them similarly and use the information similarly, but we're essentially speaking different dialects of the same outcomes.
Large dimensions overwhelm smaller dimensions, and I believe I have the solution for this. The smaller dimension doesn't necessarily have less information, but it is often treated as though it does by the processes of accumulation. MORE gives more credence to bias, and bias forms more readily through more values and more averaged rounding. Simple in the end, and this system runs along similar principles.
However, geometric structures use anchoring. This is why ModernBERT's projection estimations survive while the direct clip_g sequence learning failed - that and a lack of data; I couldn't simply jam all 32 layers in there, I had to cap it, but that wasn't the core reason. You can learn a single layer and predict that single layer with high accuracy using these anchoring systems with differential anchoring; David ran that on pure linear systems with minimal geometry and it works across many layers.
These systems are analytical differentiation injunctive biases that are not defined by the law of averages; they are defined by the complexity of accumulation through multitier association. This is a much more enriched elemental process, and yet we still ran into the same island without the correct safeguards. I believed the sequential system would correctly accumulate based on the task just by accessing the formatted bank, but I was sorely mistaken, and I apologize for my incorrect assumption.
I will install these safeguards and the sequential system will be more likely to align, but there are no guarantees.
Sequence cosine representation in relation to CLIP_L is forming: I'm distilling the behavior using a distilled memory bank as an adjudicator, with frozen clip-l sequence input data, and with frozen ModernBERT leading the context behavioral adjustment.
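Roughly, the per-token sequence cosine objective against a frozen CLIP-L sequence could be written like this - the function and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def sequence_cosine_loss(student_seq, clip_l_seq):
    """student_seq, clip_l_seq: [B, 77, 768]; loss is 1 - mean per-token cosine."""
    cos = F.cosine_similarity(student_seq, clip_l_seq, dim=-1)  # [B, 77]
    return (1.0 - cos).mean()

student_seq = torch.randn(4, 77, 768, requires_grad=True)  # hypothetical student output
clip_l_seq  = torch.randn(4, 77, 768)                      # frozen CLIP-L targets
loss = sequence_cosine_loss(student_seq, clip_l_seq)
```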
head explode noise
Welp, the sequence will be ready soon. It'll support modified 77-token spaces rather than just a single pooled vector. The entire space will be slightly warped or modified depending on the input. Extended inputs trained cleanly into the sequence with nothing truncated.
https://youtu.be/XOnMNv_oQ4A?si=WoT4TEUkotST4uoB&t=60
There's no earthly way of knowing
Which direction we are going
There's no knowing where we're rowing
Or which way the river's flowing
Is it raining, is it snowing
Is a hurricane a-blowing
Not a speck of light is showing
So the danger must be growing
Are the fires of Hell a-glowing
Is the grisly reaper mowing
Yes, the danger must be growing
For the rowers keep on rowing
And they're certainly not showing
Any signs that they are slowing
I'll be adding a sequence reconstructor and training a potential CLIP sequence reconstruction MSE predictor. Not really certain I can accomplish this in a reasonable amount of time, but maybe.
If it works, it could be pretty powerful.
A little creativity allows me to extend the context window of sd15's unet fairly easily. Beyond the clip boundary, the current system can introduce additional details into the spectrum of the structure as-is.
It's highly unstable, but it can do some interesting things.
More than likely this one isn't worth extending; SDXL, however, has a clip-vit-g that can be extended.
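For context, the usual way people push past the 77-token CLIP boundary (not necessarily what's happening here) is to encode the prompt in 77-token chunks and concatenate the hidden states for the UNet's cross-attention. A rough sketch, with the per-chunk BOS/EOS bookkeeping omitted:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def encode_long_prompt(prompt, chunk_len=77):
    # Tokenize without truncation, encode each 77-token chunk independently,
    # then concatenate the hidden states along the sequence axis.
    ids = tokenizer(prompt, truncation=False, return_tensors="pt").input_ids[0]
    chunks = [ids[i:i + chunk_len] for i in range(0, len(ids), chunk_len)]
    states = []
    with torch.no_grad():
        for chunk in chunks:
            pad = chunk_len - len(chunk)
            chunk = torch.nn.functional.pad(chunk, (0, pad), value=tokenizer.pad_token_id)
            states.append(text_encoder(chunk.unsqueeze(0)).last_hidden_state)
    return torch.cat(states, dim=1)   # [1, 77 * n_chunks, 768]
```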
AbstractPhil/geolip-clip-vit-large-patch14-ctx576
The memory pod is specifically meant to tune everything based on final state pooling, which is fine if you aren't trying to actually use sequential utility.
HOWEVER, there are many elemental biases that present themselves if you attempt to USE the standard sequence of 77 in conjunction with this final pooled state. Even though the standard 77 is predominantly noise past token 10, it still houses considerable amounts of information in terms of utility, so this should be handled carefully. Zero-shot structures are tricky to analyze, especially structures based on attention mechanisms instead of true sequential accumulation. I've noticed I need to watch them for quite a while before the real bugs show up.
As it stands the token pool is essentially [B, 7+8, 768] for pools. This contains a robust and highly complex representation of useful accumulated bidirectional attention data, so it's quite powerful.
I'll build a few prototypes and tap into some papers. I'll either come up with something or a reason why I didn't. The end result will either produce an anchor bank set of tokens [B, 15, 768] for pooling, or ideally [B, 15, 77, 768] - which should expand the CLIP sequence to 1,155 if successful. That doesn't necessarily mean this sequence will be more useful than the [B, 15, 768] version, but it will be representationally valid for the context window expansion.
I wouldn't hold out for a single full-sequence option in a single day; that's a lot of moving parts to analyze, not to mention highly impractical to train with. A smaller dose of this information would be necessary for rapid prototyping, so it'll likely be packaged as such.
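For concreteness, the shapes in question (stand-in tensors, not the actual bank):

```python
import torch

B = 2
pooled_bank = torch.randn(B, 15, 768)      # anchor bank of pooled vectors
full_bank   = torch.randn(B, 15, 77, 768)  # ideal case: a full 77-token sequence per anchor

# Flattening the full bank is what yields the expanded context window mentioned above.
expanded_seq = full_bank.reshape(B, 15 * 77, 768)  # [B, 1155, 768]
```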
Well I spoke too soon. It's ready to play with.
AbstractPhil/geolip-clip-vit-large-patch14-ctx576-seq77
This is starting to get tedious. I'm going to need to start making geofractal routers to save time and form reusable components, which will enable a more reusable, easier-to-load structure that compiles. It'll be a little more annoying to run on other systems until I get things worked out overall, but it's going to be required soon.
This experiment has exposed a series of potential uses for this Procrustes formula hybridized with geometry, and the largest, most useful utility I can think of is to directly encode huge amounts of information into a compacted multishot memory space.
Collapsing huge amounts of tokens into small spaces for high-fidelity relational understanding and use.
So with that thought, I'll be creating a long-term and short-term memory composite for context window expansion, and then giving BERT-Large... a larger context window. Much larger. I can't yet say how much context I can give BERT, as I've tried larger BERTs in the past and they collapse quickly into near-uselessness.
This, however, will hold. It does not collapse; there is no room to collapse. The real questions now are how to design it, which layers to utilize for expanding that structure, the most useful multi-shot spectrum for accessing BERT to pool the encodings, and the most useful methodology for extracting those expected outcomes in a reasonable way... without needing an arm and a leg to train BERT.
So the real problem is cost now, rather than simply tests or experiment potentials: how much will it cost to train BERT, how large can the context window be within that cost, and how many days will it take to train this expanded BERT?
A brief analysis of what I plan to do: essentially, memory is an accumulation of tokens creating a series of points on a geometric manifold, allowing guaranteed, anchored differential accumulation responses. This is akin to allowing high-dimensional representational boundaries in a spectral boundary that exists outside of the current system and is not currently observed in standard short-term or long-term AI paradigms.
Each token is represented as potentially one, a thousand, or 500,000 representative systemic accumulations within BERT - the value depends on the resolution I want to impose. This is the geometric vocabulary's manifold control access, and where the system will live. This isn't additive; this is accumulative geometric differentiation. A far different beast, one that requires a large series of formulas even to state as a theorem.
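The design isn't locked yet, but as a generic illustration of the "collapse many tokens into a small memory" direction: a fixed set of learned memory slots cross-attending over a long token stream. Every name and size below is a placeholder.

```python
import torch
import torch.nn as nn

class MemoryCompressor(nn.Module):
    def __init__(self, dim=1024, num_slots=64, num_heads=8):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # learned memory slots
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                       # tokens: [B, T, dim], T can be huge
        b = tokens.shape[0]
        queries = self.slots.unsqueeze(0).expand(b, -1, -1)
        memory, _ = self.attn(queries, tokens, tokens)  # slots attend over the full stream
        return memory                                # [B, num_slots, dim]

long_context = torch.randn(2, 4096, 1024)            # hypothetical long input
compressed = MemoryCompressor()(long_context)         # [2, 64, 1024]
```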
If this works, the results will be immediate.
It is not production ready yet; there need to be a few upstream and downstream tools meant to consume and process the outputs to create useful representations.
This model will be able to respond with text, use whisper, see with dinolip, code with codebert, and process proteins using esm2_t33_650m_ur50.
Our experts for the prototype are:
google-bert/bert-large-uncased
facebook/dinov2-large
microsoft/codebert-base
openai/whisper-large-v3
facebook/esm2_t33_650M_UR50
Not the smartest text model, but more than enough for this preliminary use-case test setup. Text is predominantly meant to align and orient downstream function; the entire machine is meant to be operated unilaterally as a collective, or independently through individual pair requests via special-token access.
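For reference, a minimal sketch of pulling the experts listed above via transformers. The repo IDs are copied verbatim from the list; the generic AutoModel class is an assumption, and the actual trainer may load them differently.

```python
from transformers import AutoModel

# Repo IDs copied verbatim from the list above; one expert per modality.
expert_ids = {
    "text":     "google-bert/bert-large-uncased",
    "vision":   "facebook/dinov2-large",
    "code":     "microsoft/codebert-base",
    "audio":    "openai/whisper-large-v3",
    "proteins": "facebook/esm2_t33_650M_UR50",
}

experts = {modality: AutoModel.from_pretrained(repo) for modality, repo in expert_ids.items()}
```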
This model will be capable of substantial power and feats as a prototype. It will be capable of seeing and processing differential equations utilizing dinov2 and esm2 data simultaneously, which can be used for downstream analysis - and I WILL use that data to create a more powerful connection between dinov2 tokens, protein tokens, video tokens, code tokens, and audio tokens.
This is the FIRST prototype of this case, and I will introduce video, genetics, shape analysis, pattern recognition processing, and a much more powerful and reusable text model.
The tests show the models can have differential communication through the geolip transformers after procrustes pairwise analysis and pentachoron CV protective measures.
Whitened Procrustes for precalculation and center alignment allows for faster convergence, so that should help too.
The first real prototype with geometric alignment is named:
geolip-bertenstein - a collective of shared, transformer-aligned experts, not a mixture of experts.
AbstractPhil/geolip-procrustes
I encourage EVERYONE who is curious to check my work. Check it, double check it, and triple check it.
These were aligned using COCO and then validated with Flickr. Entirely different datasets. The experts arbitrated and the alignment yielded the correct answers. Preliminary tests show that with almost no alignment requirement, the models can reach 100% R1 retrieval accuracy.
Not to be confused with validation accuracy for a classification model or a text encoder's text response, this allows multispectral communication between entirely different models for direct downstream consumption with almost no training for the chosen models.
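To be explicit about what that R1 number means: after alignment, each text embedding should retrieve its paired image embedding as the top match. Roughly this check, with illustrative names:

```python
import torch
import torch.nn.functional as F

def recall_at_1(text_emb, image_emb):
    """text_emb, image_emb: [N, D], where row i of each is a matched pair."""
    sims = F.normalize(text_emb, dim=-1) @ F.normalize(image_emb, dim=-1).T  # [N, N]
    pred = sims.argmax(dim=-1)                      # best image index for each text
    return (pred == torch.arange(len(text_emb))).float().mean().item()
```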
I have a working Procrustes experiment that learns adjacent manifolds within a reasonable spectrum, and the speed is... well, one epoch on COCO with BERT-Large and DinoV2 is enough to let the models align nearly perfectly. At some scales the experiment shows that the set three epochs aren't quite enough to push R1 to its highest, while many scales align nearly immediately.
These two were an obvious pair to pick: 60% similarity and >90% spectral similarity.
The trainer transfers layers, learns embeddings, and more - all by sticking strictly to geometric boundaries and procrustes informational accumulation within a modulation model's constraints.
I have many experiments to run.
After a very long set of days, with multiple setbacks, I have found a potential direction using a type of modulation attention I haven't named yet, in direct association with transformer structural boundaries.
This attention is essentially based on a form of geometric modulation, gated on differentiation. It's likely one of the building blocks for a replacement for a hard-trained set of weights - formatted instead into one of the first legitimate safety nets built specifically for geometric attenuation.
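Purely as a generic illustration (the real block isn't spelled out here and still has no name): attention gated by a modulation signal derived from a geometric difference might look something like this.

```python
import torch
import torch.nn as nn

class GatedModulationAttention(nn.Module):
    """Illustrative only: self-attention whose output is gated by the
    difference between the input and a geometric reference signal."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x, ref):
        out, _ = self.attn(x, x, x)
        g = self.gate(x - ref)        # gate driven by differentiation against a reference
        return x + g * out

x = torch.randn(2, 77, 768)
ref = torch.randn(2, 77, 768)          # hypothetical geometric reference
y = GatedModulationAttention()(x, ref)
```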
Experiments show a multitude of potential limitations. Addressing them means destroying certain objectives and combining others into new processes, rather than letting the original design sit in concrete. In this structure, everything must conform to the math, not the math to everything else.
The entire concept here is narrowing the problem down into a regressed solution that best reconciles the necessary goals with the smallest possible hardware requirement.
https://huggingface.co/AbstractPhil/procrustes-analysis
You can find my current task-oriented experimentation stored here. As I deconstruct the models into their constituent boundaries, I accumulate a manifest of information and data. This is entirely meant to build that very same geometric structural awareness that models require to be stable.
Across the multitude of analyses I've run, I've discovered multiple very tight bottleneck points that are uniform among models. Some likely form based on the law of averages, and others form... well, they are mostly the same among all models, but not the same for every model, so I'll refer to those as semi-constant. I've found some constant spaces and some constant ranges of points, but I need to test more models, and larger ones.
I must sincerely apologize for not solving this problem quickly.
This will take time. Without the approximator it's going to be considerably slower, but the model I'm beginning to train will provide the approximations in a different way over time. As iterations progress, the system will conform to a huge array of geometric potentials and become capable of predicting them, but it will not be as powerful as the full patchmaker up front, and the training will be slow.
If I can get my hands on a cluster of A100s or H100s for a stretch I'll make a post immediately; until then I must default to the slower process.
I really banked on the smaller version working, but it simply couldn't hold a complex topological shape without the correct boundaries being learnable AND endure entropic decay at the same time. The only way to have a real shot at a full geometric shared language is to make those boundaries learnable across the full spectrum of potentials, or at least more of it than I had allowed.
I'll be further refining my process in the coming days, and I do apologize for prematurely announcing a potential that I have yet to fully explore.
There will be a fully upgraded 38-shape geolip patchwork trained ASAP to fully encompass the Flux 1 AE spectrum, and another trained for the SD15, SDXL, and Flux 2 VAEs as well. These will accommodate DIRECT complex geometric patchwork learning, though not yet at the scale promised. Autoregression is a complex mistress, as many of you know, and I will be spending a great deal of time and compute analyzing all of the information required to build a uniformly useful and powerful autoregression patchwork to use as an invariance for teaching.
The small model did not reach the level of accuracy required by my specifications, so I've defaulted to harvesting information from AI models until I get the comparative bounds required for a useful topology.
It's too small to just finetune with ablation; it'll likely lose a huge percentage of its behavior and become highly unstable in unseen ways.
Not to mention it's multimodal, accepting images AND videos for processing... There's no telling what sort of damage the shared space will take when trained with ablation reinforcement without providing adjacent behavioral supplementation.
0.8B and I are going to be good friends.
I've managed to condense a prototype to a substantially smaller size, but it's not as accurate as the original because the generic topology is more challenging. I'm working it out, though.
I've figured out many new formulas based on the results of the last run, which enable more deterministic projection rather than requiring the learning process to be dispersed among so many different subsystems.
I've also managed to form a 5D deterministic projection scaffold that should enable the entire structure to be even smaller, assuming I can work out the edge cases.
It's considerably cheaper than expected to keep volume valid. This seems like a partial regression for now, but I can improve it a bit before heading back in the original direction. Hopefully it's worth the time spent on the potentially improved, sleeker structure.
The smaller one can handle more shapes - considerably more shapes per scene - at a much higher complexity than voxel association. This has drawbacks, though: these are essentially a gate set for now, and the gates aren't perfect. They CAN find the correct potential, but the subprocessing isn't enabled yet, meaning our little 400k-param set here is powerful, just in a different kind of way.





