Gravity-2

Experimental research model by squ11z1.

A 3B reasoning model in which standard scaled dot-product attention is replaced by a physically motivated "gravity attention" and then adapted with LoRA. This card documents a stage-1 proof of mechanism.

The experiment

Transformer attention scores tokens by alignment — the dot product q·k. Gravity-2 asks a different question: what if tokens attended by proximity instead? We replace the score with an inverse-square law borrowed from gravitation — each token is pulled toward others that are close in query/key space, weighted by a learnable per-head "mass":

                         M_h²
score(i, j)  =  ─────────────────────          →   softmax_j( score )
                  ‖q_i − k_j‖²  +  ε
  • M_h = softplus(gravity_mass_log[h]) — one learnable mass per query head (16 / layer), initialised at 0.5; softplus keeps it strictly positive.
  • ‖q_i − k_j‖² — squared L2 distance, computed stably as ‖q‖² + ‖k‖² − 2·q·k.
  • ε = 0.1 — softening length; prevents the q → k singularity.
  • The raw gravity scores are then passed through the usual softmax (see Limitations).
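
The bullets above can be put together in a few lines. What follows is a minimal single-head sketch in plain Python, not the released implementation: the function names and the `mass_log` argument are illustrative, and it assumes the score is used directly as the softmax logit with no extra temperature or scaling, which the card does not state either way. The squared distance uses the expansion from the second bullet and is clamped at zero to absorb rounding error.

```python
import math

def softplus(x):
    # Numerically safe log(1 + e^x); always strictly positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sq_dist(q, k):
    # ||q||^2 + ||k||^2 - 2*q.k, clamped at 0 to absorb rounding error.
    qq = sum(a * a for a in q)
    kk = sum(b * b for b in k)
    qk = sum(a * b for a, b in zip(q, k))
    return max(qq + kk - 2.0 * qk, 0.0)

def gravity_attention_weights(queries, keys, mass_log, eps=0.1):
    """Causal weights for one head: softmax_j(M^2 / (||q_i - k_j||^2 + eps))."""
    mass = softplus(mass_log)  # M_h = softplus(raw parameter)
    weights = []
    for i, q in enumerate(queries):
        # Causal mask: token i only sees keys 0..i.
        scores = [mass ** 2 / (sq_dist(q, keys[j]) + eps) for j in range(i + 1)]
        top = max(scores)  # subtract the max before exp, as in a standard softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (len(keys) - i - 1)
        weights.append(row)
    return weights

# Toy inputs (hypothetical values): the last query sits closest to the second key.
queries = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.1]]
keys = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
weights = gravity_attention_weights(queries, keys, mass_log=0.0)
```

Each row of `weights` sums to 1, and in the last row the nearest key receives the largest share. A real implementation would be batched over heads and tokens on tensors; the loop form is only meant to make the formula easy to read.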

Why it's interesting

  • Different inductive bias. Dot-product attention rewards directional alignment; inverse-distance rewards locality in the learned embedding geometry — a metric prior rather than an inner-product one.
  • Interpretable per-head masses. Each head learns a scalar "mass" controlling how sharply it concentrates — a compact, inspectable knob (see figures/04_mass_heatmap.png).
  • A bridge to physics-style sparsity. An inverse-square field is naturally local, which later stages (pruning / QUBO, "Gravity-6") aim to exploit for structured sparsity.
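
To make the first point concrete, here is a small toy comparison (all vectors are made up for illustration, and the mass is simply set to the stated initial value 0.5). One key points in exactly the same direction as the query but lies far away; the other is slightly off-axis but very close. The two score functions disagree about which key wins.

```python
def dot_score(q, k):
    # Standard attention logit (unscaled): rewards directional alignment and norm.
    return sum(a * b for a, b in zip(q, k))

def gravity_score(q, k, mass=0.5, eps=0.1):
    # Inverse-square score: rewards proximity in query/key space.
    d2 = sum((a - b) ** 2 for a, b in zip(q, k))
    return mass ** 2 / (d2 + eps)

q = [1.0, 0.0]
candidates = {
    "aligned_far": [3.0, 0.0],  # same direction as q, but far away
    "nearby": [1.0, 0.1],       # slightly off-axis, but very close to q
}

dot_scores = {name: dot_score(q, k) for name, k in candidates.items()}
grav_scores = {name: gravity_score(q, k) for name, k in candidates.items()}

dot_winner = max(dot_scores, key=dot_scores.get)
grav_winner = max(grav_scores, key=grav_scores.get)
print(dot_winner, grav_winner)  # prints: aligned_far nearby
```

The dot product prefers the long, aligned key (3.0 vs 1.0), while the inverse-square score prefers the nearby one (about 2.27 vs about 0.06).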

Architecture

Qwen2-3B class: 36 layers, hidden size 2048, 16 query heads / 2 KV heads (GQA, group size 8), head_dim 128. The 2 KV heads are repeat_kv-expanded to 16 before the distance computation, so each query head gets its own mass. Integration goes through the transformers-5.x AttentionInterface (a registered "gravity" op plus eager causal-mask reuse): RoPE, KV-cache, and masking are left to the framework; only the score function changes.
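
The head bookkeeping can be sketched without any tensors. The snippet below is illustrative only: the constant names are made up, string labels stand in for the real key tensors, and the mass count assumes every one of the 36 layers is patched, which the card implies but does not state outright.

```python
NUM_LAYERS = 36
HIDDEN = 2048
NUM_Q_HEADS = 16
NUM_KV_HEADS = 2
HEAD_DIM = 128
GROUP_SIZE = NUM_Q_HEADS // NUM_KV_HEADS  # 8 query heads share one KV head

def repeat_kv(kv_heads, n_rep):
    # Repeat each KV head n_rep times in place, so the expanded list lines up
    # one-to-one with the query heads (kv0 x8, then kv1 x8).
    return [head for head in kv_heads for _ in range(n_rep)]

def kv_head_for(query_head):
    # Which KV head a given query head reads from after expansion.
    return query_head // GROUP_SIZE

expanded = repeat_kv(["kv0", "kv1"], GROUP_SIZE)

# One mass per query head (the setup described here) vs. one per KV group
# (the alternative granularity raised under the limitations).
masses_per_query_head = NUM_LAYERS * NUM_Q_HEADS  # 576
masses_per_kv_group = NUM_LAYERS * NUM_KV_HEADS   # 72
```

Query heads 0 to 7 read the first KV head and heads 8 to 15 the second, yet each of the 16 keeps its own mass, so heads that share identical keys can still concentrate differently.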

Results

[Figures: loss, masses, grad, heatmap, aer, concept]

Honest limitations

  • Not "pure" gravity. The inverse-square scores are renormalised by a softmax on top (softmax_j(M²/(d²+ε))). Without it training was unstable, but it means this is a distance-biased softmax attention, not a literal gravitational field — the normalisation reintroduces global competition between keys.
  • MHA → GQA transfer is an open question. The mechanism was first prototyped on MHA (1 KV head per query head). Here it runs on GQA by repeat_kv-expanding 2 KV heads to 16 and giving each query head its own mass; whether this is the right granularity (vs. one mass per KV group) is unresolved and may matter for convergence.
  • Loading requires the patch (below). GGUF builds run standard attention, not gravity (llama.cpp has no kernel for M²/(‖q−k‖²+ε)) — the *.gguf files are format placeholders and produce incorrect output.
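
The first limitation can be written out explicitly. This is a sketch that assumes the gravity score is used directly as the softmax logit under a causal mask, with no additional temperature or scaling factor (the card does not say whether one is applied):

```latex
s_{ij} = \frac{M_h^2}{\lVert q_i - k_j \rVert^2 + \varepsilon},
\qquad
a_{ij} = \frac{\exp(s_{ij})}{\sum_{l \le i} \exp(s_{il})},
\qquad
0 < s_{ij} \le \frac{M_h^2}{\varepsilon}
\;\Longrightarrow\;
\frac{a_{ij}}{a_{il}} = \exp(s_{ij} - s_{il}) < \exp\!\left(\frac{M_h^2}{\varepsilon}\right)
```

Under that assumption the logits are bounded, so within one row no key can outweigh another by more than a factor of exp(M_h²/ε). For example, with M_h = 0.5 and ε = 0.1 the bound is exp(2.5) ≈ 12.2; a head can only sharpen beyond that by learning a larger mass.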

Loading (requires the gravity patch)

python load_gravity2.py   # from_pretrained -> patch_qwen_with_gravity -> load gravity_mass_log.pt

Weights are LoRA-merged into the base but were trained under gravity scoring; loading them under vanilla attention gives garbage. config.json ships _attn_implementation="eager" only so the checkpoint loads — the patch switches it to gravity.

License & attribution

Released under the MIT License. This is a derivative work of WeiboAI/VibeThinker-3B (the base model for the experiment), which is distributed under the MIT License; that license is inherited here and the original authors are credited accordingly.

Model size: 3B params (Safetensors). Tensor types: F32, BF16.