SDXL–Aleph (geolip-sdxl-aleph)

Experimental research artifact — Phase 0. This is not a finished general-purpose text-to-image model. It is the first stage of an effort to retrain Stable Diffusion XL around a different text encoder (Qwen in place of CLIP-G) plus an encoder-invariant geometric address, using a rectified-flow objective rather than the original ε-prediction one. It is part of the wider geolip program on geometric, constraint-driven representation learning.

Phase 0 trains a cross-attention LoRA on the SDXL UNet together with a small conditioning front-end, while every large component (UNet base, VAE, CLIP-L, Qwen) stays frozen. CLIP-L is kept real; only the CLIP-G half of the conditioning is replaced.

6/8/2026 Phase 1 prelim complete; Epoch 12 prepared

A bit over 1 million samples have been processed, and the model is about where you'd expect it to be. It took longer than expected; SDXL is a stubborn one. The toy script can be used to play with it. I'll likely spend the day working out the logistics of the ComfyUI playset. It still needs about another week of cooking to reach the calculated epoch 60. As it stands, the model has not fully converged to expectation yet, and much of SDXL was lost in the conversion.

The dataset requires a large infusion of additional data. 86,000 images simply pale in comparison to the LAION flavors that created the Lune dataset, which had around 400k samples.

Early Assessment

Even with the downsides the model is functional; in other words, it WILL GENERATE IMAGES. The quality, utility, and distribution of those images are limited to roughly the 86,000-image dataset, which means the conversion wasn't successful so much as the model was repurposed. The earlier attempts showed the conversion to flow matching would likely have been worse if I had left clip_l and clip_g pooling in there. Their FID and KID scores were substantially worse, and the visibility and quality of those images were, to say the least, bad. This variation produces aesthetic, albeit stubborn, images based on the QWEN extractions.

E12 Conversion Next

Not to mention there are missing elements from this variation such as the David distillation mechanism.

In any case, this model is proceeding on schedule and the first 1 million samples require a full assessment for aleph capacity. MEANING, I need to determine if and where the alephs exist, and this will probably take a couple days to get a baseline.

Statistically the projection weight is above zero, but that doesn't mean the model was forming alephs. They could just be a guidepost system, a scaffold between topics, or potentially a full skeletal structure that binds the system together harmoniously.

The model is partially converged but requires additional training, and the only way to do this is more time and a huge battery of tests performed on the preliminary 12 epochs.

Likely Routes

Trajectory routes are pretty straightforward.

  • I could mass extract SDXL for another 400k laion flavors and poison the model to remind the model where it came from.
    • Cost: 3 days or so and a big headache knowing it'll take a solid day for an epoch.
    • Risk: Putrid data overrides 2 days training.
    • Reward: Activates the laion pretraining with the new flow matching behavior.
  • I could mass extract QWEN for another 400k or so images using laion flavors and subject prompt curation.
    • Teach baseline behaviors like the sky is blue, hair colors, and so on.
    • Cost: 3 days or so, roughly the same amount as SDXL
    • Risk: further pushing away from SDXL pretraining, potentially never surfacing it.
    • Reward: Further converges qwen with the possibility of activating the overweighted SDXL behavior
  • I could train another of the variations to around a million samples and test that one as a comparison
    • Attempt another that may or may not behave better after training
    • Cost: 2 days for another
    • Risk: The same or worse outcomes.
    • Reward: A further along variant with more SDXL active.
  • I can press onward to epoch 60 where the model's 5 million sample point hits, which is a fair saturation hit on the core.
    • Cost: 5 days
    • Risk: The model converges to clip_g and ignores qwen, ends up rigid and difficult to finetune, and requires more data than before.
    • Reward: Fully converged qwen text encoder, useful and acceptable prompt extension capacity.

Likely Side-Effects

  • Rigidity. The model is going to be rigid through growing pains.
  • Difficult. Swapping text encoders is always going to have growing pains.
  • Extended Prompts. With the difficulty comes the reward.
  • Durable Subjects. The subjects themselves will solidify and be more controllable overall.

Difficult decision. The challenges present themselves where I simply need more compute and more time.


6/6/2026 Phase1 - Full Finetune Training Begins

10 epochs at 86k images; ETA 30 hours.

Toy Inference Script Prepared

CFG 2 is looking pretty good on epoch 5. Epoch 10 will be ready by tonight.

image

I've attached a script, test_colab_phase_1.py, which autoloads and preps everything; you just need to run the pip installs in a cell and get it going.

I don't have negative prompts hooked in yet, but it'll happen soon after some testing.

Have fun.

Epoch 2 image 0 samples

75%

CFG1 image

CFG3 image

CFG5 image

Epoch 2 Complete

CFG1 image

CFG3 image

CFG5 image

Epoch 1 image 0 samples

25%

CFG 3 and CFG 5 are likely invalid until the model kicks over to the new CFG dropout orientation.

CFG1 image

CFG3 image

CFG5 image

50%

CFG1 image

75%

CFG1 image

Epoch1 End

CFG 3 and CFG 5 are starting to show validity.

More epochs necessary.

CFG1 image

CFG3 image

CFG5 image

epoch 1 mean loss 0.6046
FID/KID @cfg 1.0: FID 205.3  KID 0.0197  (n=100)
FID/KID @cfg 3.0: FID 190.1  KID 0.0124  (n=100)
FID/KID @cfg 5.0: FID 232.3  KID 0.0447  (n=100)

REGULAR CHECKUPS

Regular training checkups will be applied throughout phase 1.

  • FID score checkups, every epoch - no less than 100 images per cfg.
    • This is quite time consuming, but absolutely mission critical.
  • Common prompt testing.
    • 4 times per epoch, 8 prompts, 3 cfg types to check.
  • CFG tweaking, to ensure we are getting the most out of dropout.
    • SCHEDULED decrease in dropout over time, with high dropout at first
  • SHIFT timing adjusted TWICE through the training
    • Epoch 0 = shift 2 <- sd15-flow-lune native pretrain
    • Epoch 4 = shift 2.25
    • Epoch 8 = shift 2.5
    • Lune trained under SHIFT 2 for a HUGE portion of pretraining
    • LATER Lune was moved to shift 2.5 AFTER coalescence and convergence
    • MUCH LATER Lune finetunes were moved BACK to 2 and 3 as testing agents
      • Shift 2 and Shift 3 both provided utilizable FID offsets after training shift agnostic behavior.
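
The shift schedule listed above can be expressed as a small lookup. This is only a minimal sketch, assuming the shift changes exactly at the start of epochs 4 and 8; the names `SHIFT_SCHEDULE` and `shift_for_epoch` are hypothetical and do not come from the training code.

```python
# Hypothetical helper: maps an epoch index to the timestep shift,
# following the schedule in the list above (0 -> 2.0, 4 -> 2.25, 8 -> 2.5).
SHIFT_SCHEDULE = [(0, 2.0), (4, 2.25), (8, 2.5)]  # (first epoch, shift)


def shift_for_epoch(epoch: int) -> float:
    """Return the shift of the latest schedule entry whose start epoch <= epoch."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    shift = SHIFT_SCHEDULE[0][1]
    for start, value in SHIFT_SCHEDULE:
        if epoch >= start:
            shift = value
    return shift
```

Keeping the schedule in one table makes the two adjustments easy to audit against the FID checkups run at each epoch.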

Phase0 - 9 run lora sweep

Nine LoRA training runs, each using 1,000 images for 100 epochs, determine the best target for flow-matching conversion.

image

This is how we choose our target to flow match convert before we feed any major data.

We then calculate FID scores.

Bug present

CFG isn't calibrated right, so a lower CFG is required to correctly attune the model. The symptoms are similar to what happened to Lune originally before retraining, so it is likely that this model will need a reinfusion.

Bug Solution

When the runs conclude, I will run a sweep every 10 epochs over the 4 image samples at cfg 1.0, cfg 2.0, and cfg 3.0.

Lune took a very long time to figure out CFG after conversion so I expect something similar from our new SDXL model.

I have yet to decide on a name for the weights.

Preliminary results in

And they are UUUUGLYYYY, as expected. I didn't expect CFG to be so rigid, however.

https://huggingface.co/datasets/AbstractPhil/geolip-sdxl-fid-scoring

The most cost-effective measure I can see based on this sweep WAS IN FACT my original choice.

I have a very good track record just guessing.

image

THE TARGET will not be #1, however; we have two viable alternatives, the third on the list being the most promising.

We snap out the CLIP_L and we snap out the CLIP_G POOLED.

RETAINING the clip_g sequence.

This is our phase 1 goal: retrain SDXL to flow matching using qwen in place of clip_l and in place of the CLIP_G pooled vector.


I am fairly certain I know WHY the clip_g sequence failed, and it's because it was projected upward.

Traditionally, dense vectors that are upcast in this manner will comply with the stronger of the two. In effect, this model simply wasn't strong enough to fill the CLIP_G role with the chosen qwen model, even with the aleph scaffolding.

SIMPLY TOO WEAK!

Without CLIP_G's strength, the model entered catastrophic fault. Essentially sequential solidity couldn't fill the bag of tokens role. That's fine, and actually expected.

I have a viable solution for this, and it involves an aleph transformer. For now, we'll train this one!

86,000 images, ~12 epochs. Roughly 1 million samples, full finetune.
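
The sample counts quoted in this card follow from simple arithmetic on the dataset size. A minimal sketch, assuming every epoch is one full pass over all 86,000 images; `samples_seen` is a hypothetical helper name.

```python
# Worked arithmetic for the sample counts quoted in this card.
IMAGES_PER_EPOCH = 86_000  # dataset size stated above


def samples_seen(epochs: int, images_per_epoch: int = IMAGES_PER_EPOCH) -> int:
    """Total training samples after `epochs` full passes over the dataset."""
    return epochs * images_per_epoch


print(samples_seen(12))  # 1032000, the "roughly 1 million" point
print(samples_seen(60))  # 5160000, the "5 million sample point" at epoch 60
```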

Why this exists

A cheaper path — freezing the UNet and training an adapter to impersonate CLIP-G from a different encoder — was measured and ruled out: a pooled text vector transfers across encoder architectures (a rotation lands it in CLIP-G's space at reference quality), but CLIP's per-token sequence does not (cosine caps around 0.5, i.e. texture, not content). So the conditioning here is built from a pooled Qwen representation, and the UNet is allowed to co-adapt to it rather than being asked to read a faithful CLIP imitation.

Separately, the aleph address is computed from the bytes of the caption, not from any text encoder. It is therefore identical across the CLIP-G→Qwen swap — a deterministic, scale- and patch-agnostic geometric scaffold the UNet can lean on while the encoder changes underneath it. It enters as a near-zero-initialised anchor so the model starts well-behaved and learns to use the address as training proceeds.

Architecture

Frozen: SDXL UNet base weights · SDXL VAE · SDXL CLIP-L text encoder · Qwen3.5 (0.8B). Trainable: cross-attention LoRA (attn2, rank 16) + the conditioning front-end (pool_proj, a learned positional table, and a near-zero addr_adapter).

Text representation (Qwen, pooled — the proven causal-LM extraction): the caption is chat-templated, optionally re-described via two-shot generation, then encoded with last-token pooling (the [EOS]-position hidden state that aggregates the sequence under causal masking) — not mean-pooling, which smears that aggregate. This yields one rich pooled vector per caption.
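
To make the pooling distinction concrete, here is a minimal NumPy sketch of last-token pooling next to mean pooling. It assumes right-padded batches with a 0/1 attention mask and at least one real token per row; the function names are illustrative and are not the project's `QwenPooledEncoder`.

```python
import numpy as np


def last_token_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Pick the hidden state at the last real token of each sequence.

    hidden: (B, T, D) final-layer hidden states.
    mask:   (B, T) attention mask, 1 for real tokens, 0 for right padding.
    """
    last_idx = mask.astype(np.int64).sum(axis=1) - 1  # (B,) last real position
    return hidden[np.arange(hidden.shape[0]), last_idx]  # (B, D)


def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Masked average over real tokens, shown only for contrast."""
    m = mask[..., None].astype(hidden.dtype)  # (B, T, 1)
    return (hidden * m).sum(axis=1) / m.sum(axis=1)
```

Under causal masking only the final position has attended to the whole caption, which is why the last-token state is used as the pooled vector here.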

Conditioning assembly (cross-attention encoder_hidden_states, width 2048):

| positions | CLIP-L half (768) | CLIP-G half (1280) |
|---|---|---|
| 0…76 | real CLIP-L penultimate hidden states | pool_proj(qwen_pool) broadcast + learned positional offsets |
| 77…77+N | zeros | addr_adapter(aleph_address) (near-zero init) |

The same pooled-projected vector also supplies the SDXL added-condition text_embeds (1280). Micro-conditioning time_ids are the standard [H, W, 0, 0, H, W].
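
The assembly described above can be sketched in NumPy for a single, unbatched caption. This is an illustration under assumptions, not the project's `SDXLQwenFrontEnd`: the inputs stand in for the outputs of `pool_proj` and `addr_adapter`, the address is assumed to contribute N = 32 rows, and the function name is hypothetical.

```python
import numpy as np

CLIP_L_DIM, CLIP_G_DIM, SEQ_LEN = 768, 1280, 77


def assemble_conditioning(clip_l_seq, pooled_proj, pos_table, addr_tokens,
                          height=1024, width=1024):
    """Build the 2048-wide cross-attention input for one caption.

    clip_l_seq:  (77, 768)  real CLIP-L penultimate hidden states
    pooled_proj: (1280,)    stand-in for pool_proj(qwen_pool)
    pos_table:   (77, 1280) learned positional offsets
    addr_tokens: (N, 1280)  stand-in for addr_adapter(aleph_address)
    """
    # CLIP-G half of the text rows: one pooled vector broadcast over 77 positions.
    g_half = pooled_proj[None, :] + pos_table                    # (77, 1280)
    text_rows = np.concatenate([clip_l_seq, g_half], axis=1)     # (77, 2048)

    # Address rows: zeros in the CLIP-L half, adapter output in the CLIP-G half.
    zeros_l = np.zeros((addr_tokens.shape[0], CLIP_L_DIM), dtype=clip_l_seq.dtype)
    addr_rows = np.concatenate([zeros_l, addr_tokens], axis=1)   # (N, 2048)

    ehs = np.concatenate([text_rows, addr_rows], axis=0)         # (77 + N, 2048)
    text_embeds = pooled_proj                                    # added-condition vector
    time_ids = np.array([height, width, 0, 0, height, width])    # [H, W, 0, 0, H, W]
    return ehs, text_embeds, time_ids
```

With a near-zero `addr_tokens` the extra rows start out almost inert, which matches the near-zero initialisation described for the address adapter.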

Objective (rectified flow, "Lune" recipe): with shift = 2.5,

s  ~ U(0,1);   s' = shift·s / (1 + (shift-1)·s);   t = s'·1000
x_t = (1 - s')·x0 + s'·noise;   target v = noise - x0;   loss = MSE(unet(x_t, t, cond), v)

10% classifier-free-guidance dropout (the full conditioning is zeroed). SDXL VAE scaling is 0.13025. Sampling is an Euler integration of the velocity field from σ=1 to σ=0 — not the default SDXL scheduler.
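
The objective and dropout above can be written out in a few lines of NumPy. This is a minimal sketch of the formulas as stated, not the training script: the function names are hypothetical, the UNet call is left out, and conditioning is assumed to be a (B, T, D) array.

```python
import numpy as np


def shift_time(s, shift=2.5):
    """Map uniform s in [0, 1] to the shifted time s' = shift*s / (1 + (shift-1)*s)."""
    return shift * s / (1.0 + (shift - 1.0) * s)


def make_training_pair(x0, noise, s, shift=2.5):
    """Return (x_t, t, v) for a batch of latents x0 and noise of shape (B, ...)."""
    s_shift = shift_time(s, shift)                        # (B,)
    w = s_shift.reshape(-1, *([1] * (x0.ndim - 1)))       # broadcast over latent dims
    x_t = (1.0 - w) * x0 + w * noise                      # interpolate data -> noise
    v = noise - x0                                        # velocity target
    t = s_shift * 1000.0                                  # timestep fed to the UNet
    return x_t, t, v


def drop_conditioning(cond, rng, p=0.1):
    """Zero the full conditioning of each sample with probability p (CFG dropout)."""
    keep = rng.random(cond.shape[0]) >= p                 # (B,) True = keep
    return cond * keep[:, None, None]


def flow_loss(pred, target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((pred - target) ** 2))
```

For example, `shift_time(0.5)` with shift 2.5 is 1.25 / 1.75 ≈ 0.714, so uniform draws are pushed toward the noisy end of the trajectory.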

Training data

AbstractPhil/sdxl-qwen-phase0 — SDXL-self-generated (image, caption, aleph_address) triples. The SDXL render is the flow-matching target (x0 source); the student learns to reproduce it from the new conditioning. See that dataset's card for construction details.

Training procedure

| Setting | Value |
|---|---|
| Optimizer | AdamW (β=(0.9, 0.999), wd 0.01); to be replaced with pure Adam in a later pass |
| Learning rate | 1e-4 with linear warmup (LoRA + front-end are trained from scratch) |
| Grad clip | 1.0 |
| Precision | fp32 (bf16 optional) |
| LoRA | rank 16, α 16, targets attn2.{to_q,to_k,to_v,to_out.0} |
| Resolution | 1024×1024 (128×128 latent) |

Latents, CLIP-L sequences, and Qwen-pooled vectors are precomputed once and cached; the trainable parts then run on the cached frozen features.

How to use

This is a checkpoint of LoRA + front-end weights, not a standard diffusers pipeline. Loading and sampling require the front-end module and the rectified-flow sampler that ship with the training code. Sketch:

# 1. load frozen SDXL UNet, apply the cross-attn LoRA, load LoRA weights
# 2. instantiate SDXLQwenFrontEnd, load its state_dict
# 3. load frozen Qwen (rich-pooled encoder), CLIP-L, VAE, and the aleph model
# 4. for a prompt:
#      qwen_pool  = QwenPooledEncoder.encode([prompt])      # rich, last-token pooled
#      clip_l_seq = CLIP-L penultimate hidden states        # real, 77×768
#      address    = aleph address from the caption bytes    # (32, 128)
#      ehs, text_embeds = frontend(qwen_pool, clip_l_seq, address)
# 5. Euler-integrate the velocity field (σ: 1→0) with added_cond_kwargs, then VAE-decode

(The QwenPooledEncoder, SDXLQwenFrontEnd, and the sampler are defined in the project's Phase-0 training script.)
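
Step 5 can be illustrated with a generic Euler integrator. This is a sketch under assumptions rather than the project's sampler: it uses a uniform σ grid (the card does not state the sampling schedule), and `velocity_fn(x, t, conditioned)` is a hypothetical callable standing in for the UNet with and without conditioning.

```python
import numpy as np


def euler_sample(velocity_fn, noise, steps=30, cfg=1.0):
    """Integrate dx/dσ = v from σ=1 (pure noise) down to σ=0 (data).

    velocity_fn(x, t, conditioned) -> array shaped like x, with t = σ * 1000.
    """
    sigmas = np.linspace(1.0, 0.0, steps + 1)  # assumed uniform grid
    x = noise.copy()
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        t = sigma * 1000.0
        v = velocity_fn(x, t, True)
        if cfg != 1.0:
            # Classifier-free guidance: extrapolate from the unconditional velocity.
            v_uncond = velocity_fn(x, t, False)
            v = v_uncond + cfg * (v - v_uncond)
        x = x + (sigma_next - sigma) * v       # σ decreases, so the step is negative
    return x
```

Because the training target is v = noise - x0, a perfect model has a constant velocity along each path and the integration lands on x0. In a real run the resulting latent would be divided by the VAE scaling factor (0.13025) before decoding.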

Status, limitations, intended use

  • Research only. Phase 0 is a feasibility stage; expect rough or incomplete generations. It is not a drop-in SDXL replacement and does not use the standard SDXL sampler.
  • The open question Phase 0 answers is qualitative: do samples show content (the pooled Qwen signal + the aleph address actually steering the denoise) or merely texture? Content advances the program to a full-UNet Phase 1; texture sends the conditioning design back for revision.
  • Only the CLIP-G→Qwen swap is made here; CLIP-L is still the original encoder.
  • The aleph address is a surface-form reconstruction code, not a semantic embedding (see the dataset card) — it contributes geometric structure, not meaning.
  • Inherits the biases and failure modes of SDXL, Qwen, and the caption source.

Licensing

The UNet/VAE/CLIP components and any weights derived from them are governed by the CreativeML Open RAIL++-M license of SDXL base, and use is subject to that license's restrictions. The Qwen text encoder carries its own license (Apache-2.0 for the Qwen3 line); the aleph model comes from AbstractEyes/geolip-svae. Confirm each component's terms before redistribution or deployment.

Acknowledgements / context

Built on Stability AI's Stable Diffusion XL and the Qwen text models. The rectified-flow recipe follows the Lune SD1.5 flow-matching lineage; the aleph address comes from the geolip-svae spectral-VAE work. Part of the geolip program (geofractal towers/routers, geolip-core constellation, geovocab) on parameter-efficient geometric learning.

Maintainer: AbstractPhil. Status: active research, Phase 0.
