HPC Quantizer — Shor-Optimized

GGUF quantization powered by Shor's algorithm.

HPC is probably a breakthrough in model compression, utilizing a quantization pipeline derived from Shor's factoring algorithm. It compresses large language models (like Gemma 4) by mapping quantization candidates to a quantum-inspired state space.

Instead of independently rounding each weight block or running iterative belief propagation, HPC encodes scale candidates as Z₆ complex amplitudes on a constraint graph and applies the Griffiths-Niu sequential measurement protocol. This uses the same IDFT + feed-forward + collapse/back-action loop that extracts periods in Shor's algorithm — finding globally optimal scale configurations where quantization noise is rotated away from the transformer's reasoning dimensions.

1. Why Shor's Algorithm?

The previous HPC engine used iterative belief propagation (BP) to find optimal scale configurations. BP converges to the element-wise MSE minimum — the scale configuration that minimizes total Σ (w_original - w_quantized)². This produces the lowest possible RMSE, but at 2 bits per weight, the noise floor still slightly bleeds into reasoning-critical dimensions.

Shor's Griffiths-Niu measurement protocol replaces BP with a fundamentally different optimization strategy:

Feature	Belief Propagation (v2)	Shor's Measurement (v3)
Mechanism	Iterative message-passing (200+ rounds)	Single-pass sequential measurement
Convergence	May oscillate or get stuck	Exact marginals, no iteration
Inter-block coordination	Local messages only	Global conditioning via collapse back-action
Error metric	Element-wise MSE (isotropic)	D₆ vesica gate (anisotropic)
RMSE	Lower	Slightly higher
Reasoning fidelity	Good	Significantly better

The key insight: RMSE measures the wrong thing. Standard RMSE treats every weight dimension equally. But during matrix multiplication, some error dimensions propagate through the computation graph and destroy reasoning, while others cancel out and are invisible. Shor's measurement finds configurations where block-to-block errors are anti-correlated along the computation path — they cancel during matmul even though each individual block has slightly higher error.

2. Performance & Benchmarks

Gemma 4 26B-A4B-it MoE (25.8B params)

Quantization	Size	Fits 12 GB?	Method
BF16	48.5 GB	❌	—
Q8_0	~27 GB	❌	Round-to-nearest
Q4_K_M	16.8 GB	❌	Round-to-nearest
IQ3_K_XXS	~12 GB	⚠️	Unsloth
HPC·Shor	10.2 GB	✅	Griffiths-Niu measurement

Gemma 4 E2B-it (4.65B params)

Model	Size	BPW	PPL	Speed
BF16 (original)	8.67 GB	16.00	154.0	4.2 t/s
ggml Q2_K + iMatrix	2.77 GB	5.12	89.1	14.0 t/s
HPC Q2_K + Q4_0·Shor	1.44 GB	~3.0	129.6	18.1 t/s

Reasoning Benchmarks (Gemma 4 31B, Q2_K·Shor, 12.5 GB)

25 Horses combinatorial proof: ✅ (7 races, complete elimination)
Hindley-Milner type inference: ✅ (correct let-polymorphism)
Arto Inkala "World's Hardest Sudoku": ✅ (AC-3 + backtracking)
Diagnose 3 non-obvious bugs in C: ✅ (first attempt)
Tarjan's bridge-finding algorithm: ✅ (correct > vs >= distinction)

3. Quantum-Inspired Mechanics (Shor Pipeline)

Standard quantization picks scales independently per block. Shor-powered quantization treats scale selection as a global optimization problem where the measurement of each block conditions all remaining blocks through quantum-inspired back-action.

The domain mapping from Shor's integer factoring to HPC Quantization:

Shor's Factoring	HPC Quantization
Oracle phase `2π × d × cₖ / N`	Boltzmann amplitude from candidate error
Period `r`	Optimal scale configuration
QFT interference peaks at `r`	IDFT6 interference peaks at optimal RMSE
Semi-classical feed-forward	Phase correction from measured blocks
Born measurement → period bits	Born measurement → scale candidate selection
Collapse + entanglement	Collapse + back-action into neighbor amplitudes

By utilizing the IDFT6 inside the coherent sum, the algorithm creates constructive interference at the optimal RMSE configuration, similarly to how Shor's QFT creates interference at the correct period.

4. Q2_K and Q4_0 Promotion Strategies

The quantizer automatically assigns precision tiers:

Q4_0·Shor — Attention projections (Q/K/V/O) — 16 candidate scales, 24-beam search.
Q2_K·Shor — FFN, MLP, expert weights — 36 candidate (d, dmin) pairs, dual-quhit graph, 24-beam search.
Preserved — Embeddings, norms, router/gate weights (kept as-is in high precision).

Tied Embeddings

If no separate output weight tensor exists, token_embd.weight doubles as the LM head. HPC automatically detects tied embeddings and promotes them to Q4_0 to preserve generation logic, ensuring that output tokens remain accurate despite the extreme compression of the feed-forward layers.

5. Sub-Block Refinement and Beam Search

The HPC process executes in multiple phases to guarantee globally coherent scaling parameters:

Phase 1: Greedy Seed & WLS Refinement: Computes reference scale and min per 256-weight superblock.
Phase 2: Candidate Generation via D₆ Vesica Scoring: Instead of MSE, weights are scored with the D₆ Vesica gate (penalizing DC/summed errors 4x more than AC/wave errors).
Phase 3: Sequential Measurement: Builds an HPCGraph encoding candidates as Boltzmann amplitudes. Connects blocks via CZ phase gates and runs Griffiths-Niu measurement (IDFT6, Born Rule, Collapse).
Phase 4: 24-Beam Hensel Search: Maintains 24 parallel configuration beams across the tensor, branching candidates evaluated via triality-weighted scoring.
Phase 5: Sub-Block Shor Refinement: A second, smaller Shor sequential measurement over a 16-node graph corresponding to the 16 sub-blocks within each 256-weight superblock.

6. Prerequisites and Build Instructions

Before you can quantize models, you must build the Shor-optimized HPC C engine.

Dependencies (Ubuntu/Debian)

sudo apt install gcc libgmp-dev libmpfr-dev python3 python3-numpy

You will also need llama.cpp built from source for iMatrix generation:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --target llama-imatrix -j$(nproc)

Build the HPC Engine

Navigate to the LLM-distributed directory and compile:

make -f makefile.quantize

Verify the build generated libhexstate_q2k.so. The Python requantizer (hexstate_requantize.py) auto-detects this library to enable Shor optimization.

7. The Quantization Pipeline (End-to-End)

Step A: Convert Model to BF16 GGUF Use llama.cpp to convert your source HuggingFace model to BF16.

python3 llama.cpp/convert_hf_to_gguf.py /path/to/model/     --outfile Model-BF16.gguf     --outtype bf16

Step B: Generate Importance Matrix (iMatrix)

HPC includes a native C engine for generating importance matrices (imatrix) used in aggressive LLM weight quantization (Q2_K and below). The engine replaces the standard llama.cpp calibration pipeline with an HPC-graph-accelerated tokenizer and forward pass that produces structurally superior importance data.

Component files:

File	Description
`LLM/hexstate_quantize.c`	C engine: HPC BPE tokenizer, graph-based forward pass, quantization kernels
`LLM/generate_imatrix.py`	Orchestrator: GGUF loading, weight dequantization, C-bridge, imatrix output
`LLM/calibration_data.txt`	Calibration corpus (~12.8M characters)

Why It Works: Global Geometric Tokenization

The core innovation is the HPC BPE tokenizer — a byte-pair encoding engine that operates on the HPCGraph substrate without regex word boundaries. This seemingly simple architectural choice has profound consequences for quantization quality.

The Standard Approach: Artificial Isolation

Standard tokenizers (tiktoken, SentencePiece, HuggingFace) apply a regex pre-split before BPE:

"Hello world" → regex → ["Hello", " world"] → BPE per word → [15496, 1917]

The regex fence means:

Merges cannot cross word boundaries. The characters "o " (letter-o + space) can never form a token.
Each word is tokenized in isolation. The token for "Hello" at position 0 is mathematically independent of any text at position 10,000.
Token boundaries are locally determined. Changing text on page 500 cannot affect how page 1 is tokenized.

When a standard tokenizer produces 4,096 tokens for imatrix calibration, those tokens are a shallow, context-free sample — an arbitrary window of generic subwords that exercises only the activation patterns present in that one fragment of text.

The HPC Approach: Unrestricted Graph Contraction

The HPC BPE tokenizer treats the entire calibration corpus as a single, continuous phase graph:

"Hello world" → 11 sites, each CZ-coupled to its neighbor → global merge competition

Graph Construction — Each character becomes a site in an HPCGraph. Adjacent sites are connected by CZ edges, encoding pair structure as phase entanglement. For a 12.8M character corpus, this creates a graph with 12,799,706 sites and 12,799,705 CZ edges.
Global Merge Competition — Each BPE pass scans every alive position across the entire graph, finds the lowest-rank merge pair that exists anywhere in the 12.8M character sequence, and contracts all instances simultaneously. This is not local — a merge decision at position 10,000,000 competes with and can preempt merge decisions at position 100.
Cascade Propagation — When a merge at position i contracts sites i and i+1 into a single token, site i's neighbor changes. The new adjacent pair (merged_token, next_neighbor) may have a very low rank, causing it to fire in the next pass — which creates another new pair, propagating further. Without regex boundaries, these cascades propagate freely across spaces, punctuation, and line breaks, threading through the entire document's character geometry.
Phase Contraction — Each merge is simultaneously a graph contraction on the HPCGraph. The merged site's local quhit amplitude is updated to a sharp basis state encoding the new token ID (best_merged % D), which mathematically severs the entanglement from the consumed CZ edge. The graph tracks the full contraction history.

After 21,576 passes (Mistral) or 24,447 passes (Qwen), the 12.8M character graph contracts to ~3.5M–5.2M surviving sites. Each surviving token is not a generic subword — it is the fixed point of a global contraction over the entire document's character geometry.

Tokens as Global Geometric Frequencies

In standard NLP, a token represents a word. In the HPC tokenizer, a token represents a global geometric frequency — a structural alignment that was carved out by 20,000+ rounds of competition across the full corpus.

The key consequence: the first 4,096 tokens of the output already contain the structural dependencies of the entire 12.8M character document. This is because:

The merge rules (pre-trained vocabulary) act as a fixed geometric frame — a coordinate system for projecting character sequences into token space.
The absence of regex boundaries means the projection is unrestricted — merges resolve cross-word, cross-line, and cross-paragraph structures that regex-split tokenizers are blind to.
The specific token IDs and their boundary placements at any position are determined by 20,000+ passes of global competition where every merge decision was influenced by every position in the 12.8M character sequence.
You cannot know how the first 500 characters will be tokenized until you have evaluated the last 500 characters. The tokenization of position 0 is a function of the entire document.

The Result: Surgical Quantization at 2 Bits

When these globally-informed tokens are fed through the HPC forward pass for importance collection, the resulting E[x²] statistics are structurally representative of the full corpus — even from a single 4,096-token chunk. The cross-layer Belief Propagation then smooths these statistics via residual stream coupling, producing an importance matrix that precisely identifies the load-bearing weights.

Empirical validation: Mistral-7B-Instruct-v0.3 quantized to Q2_K (87.5% weight compression, 14.5 GB → 3 GB) using a single HPC-calibrated chunk produces:

Flawless English grammar and complex vocabulary
Correct factual retrieval from context ("What color was John's suit?" → "Neon green.")
Coherent multi-sentence reasoning structure

Standard llama.cpp imatrix calibration at Q2_K typically requires hundreds of chunks (500K+ tokens) to avoid catastrophic degradation. The HPC pipeline achieves superior results with one chunk because the tokenizer has already done the work of compressing the entire document's structure into that chunk.

# Generate HPC importance matrix
python3 LLM/generate_imatrix.py \
    model.gguf calibration_data.txt \
    -o imatrix.dat --chunks 5 --verbose

5 Chunks is the 'sweet spot' for retaining most model intelligence I've found.

Step C: Quantize with HPC Execute the re-quantizer with your newly generated BF16 GGUF and iMatrix.

python3 hexstate_requantize.py     Model-BF16.gguf     Model-Q2_K-HexState.gguf     --keep-metadata     --imatrix imatrix.gguf

This automatically routes the attention layers to Q4_0 and FFN/MLP layers to Q2_K using the Shor measurement graph.

8. Inference & Runtime Configuration

For correct operation, it is highly recommended to use appropriate chat templates and specific configuration flags in llama.cpp to prevent context length bugs.

Download Correct Chat Template:

curl -L -o chat_template.jinja "https://huggingface.co/google/gemma-4-26B-A4B-it/raw/main/chat_template.jinja"

Run llama-server:

llama-server     -m Model-Q2_K-HexState.gguf     -ngl 0     -c 4096     --host 0.0.0.0 --port 8989     --jinja --chat-template-file chat_template.jinja     --cache-ram 0 -ctxcp 1

Recommended Sampling Settings:

Temperature: 0.3–0.4 (Lower reduces sampling noise at low BPW)
Top_k: 20–30 (Narrow sampling for coherence)
Top_p: 0.8–0.85 (Cuts noisy long tail)
Repeat_penalty: 1.15–1.2 (Prevents self-correction loops)

9. Fidelity Classification & Troubleshooting

The quantizer reports a fidelity rating based on total RMSE across all quantized tensors:

Rating	RMSE Threshold	Icon
ULTRA	≤ 1e-04	★★★★
HIGH	≤ 3e-04	★★★☆
GOOD	≤ 1e-03	★★☆☆
STANDARD	> 1e-03	★☆☆☆

Note: Due to anisotropic error shaping, a "GOOD" Shor-quantized model will typically outperform a "HIGH" BP-quantized model on reasoning tasks.

Common Issues

Arabic/Korean characters in output: The embedded chat template is broken. Use --chat-template-file with the correct Jinja template.
RAM usage keeps growing: Use --cache-ram 0 -ctxcp 1 with llama.cpp to manage sliding window attention.
RMSE is higher than standard Q2_K: This is intentional. The D₆ vesica gate trades total RMSE for computation-aligned error minimization.
libhexstate_q2k.so not found: Make sure to compile the C engine using make -f makefile.quantize.

How the HPC Engine Makes Global Error Decisions

The HPC (Holographic Phase Contraction) engine takes a fundamentally different approach to LLM weight quantization. Instead of minimizing local error block-by-block, it frames scale selection as a global quantum-inspired optimization problem. It maps quantization candidates to a constraint graph and evaluates them using the Griffiths-Niu Sequential Measurement protocol derived from Shor's factoring algorithm.

Here is a step-by-step breakdown of how the engine makes global error decisions, directly referencing the implementation in hexstate_quantize.c.

1. Candidate Generation and D₆ Vesica Scoring

Before any global optimization occurs, the engine needs candidates. For each block of weights, it generates candidate scales (d and dmin) around a baseline least-squares optimum.

Instead of scoring these candidates using standard Mean Squared Error (MSE), it uses the D₆ Vesica Gate.

Code Citation (`hexstate_quantize.c`, lines ~1420–1432)

/* Decompose into vesica (DC) and wave (AC) components */
float vesica_err = 0.0f, wave_err = 0.0f;
for (int p = 0; p < half_g; p++) {
    float v = e_cur[p] + e_cur[p + half_g];
    float w_wave = e_cur[p] - e_cur[p + half_g];
    float w_avg = (w_cur[p] + w_cur[p + half_g]) * 0.5f;
    vesica_err += v * v * w_avg;
    wave_err += w_wave * w_wave * w_avg;
}
/* Triality weighting: penalize vesica 4×, wave 1×. */
err += 0.5f * (4.0f * vesica_err + 1.0f * wave_err);

Why this matters: vesica_err represents errors that sum together during matrix multiplication, propagating through the network and destroying reasoning. wave_err represents errors that naturally cancel out. By penalizing vesica_err by 4x, the engine heavily biases towards candidates whose errors cancel out during inference, even if their total Euclidean distance (MSE) from the original weights is larger.

2. Graph Construction and Boltzmann Amplitudes

The selected candidates and their Vesica errors are grouped into 6 bins (representing the 6 states of a $Z_6$ "quhit" or quantum digit). The engine constructs an HPCGraph mapping each block to one or more quhits.

The errors are transformed into "Boltzmann amplitudes"—representing the likelihood of selecting each state.

Code Citation (`hexstate_quantize.c`, lines ~1502–1506)

for (int ci = 0; ci < Q4_N_CAND; ci++) {
    int qi = Q4_CAND_TO_QUHIT[ci];
    amp_re[qi] += exp(-(double)(agg_errors[ci] - min_err) /
                      (2.0 * (double)temperature));
}

3. The Griffiths-Niu Sequential Measurement

This is where the global coordination happens. The function shor_measure_graph (lines 1166–1314) executes a sequential measurement MSB to LSB. This replaces standard Belief Propagation with a deterministic evaluation that creates massive global correlation.

For each block $k$ being measured:

A. Neighbor Contribution (Entanglement)

It evaluates how neighboring blocks influence block $k$ by projecting their current amplitudes across the graph edges.

// hexstate_quantize.c: lines 1217-1221
sr += lr*wr - li*wi;
si += lr*wi + li*wr;

B. Feed-Forward Phase Correction

It applies a phase shift based on the outcomes of all blocks measured before it, a signature trait of the semi-classical QFT used in Shor's algorithm.

// hexstate_quantize.c: lines 1181-1184
double power = 36.0;
for (int64_t j = k + 1; j < n_sites; j++) {
    theta_k += (double)measured_out[j] / power;
    power *= 6.0;
}

C. IDFT6 and Constructive Interference

It runs an Inverse Discrete Fourier Transform (IDFT6). Because the neighbor influence ($C_k$) was baked into the amplitudes before the IDFT, the IDFT acts as a coherence filter. It produces constructive interference peaks precisely at the scale candidate that creates the best global configuration.

// hexstate_quantize.c: lines 1256-1261
double angle = 2.0 * 3.14159265358979323846 * d * v / 6.0;
double er = cos(angle), ei = sin(angle);
sum_re += alpha_re[d]*er - alpha_im[d]*ei;
sum_im += alpha_re[d]*ei + alpha_im[d]*er;

D. Measurement and Back-Action (The "Magic Pointer")

Once an optimal state is selected (using argmax on the squared amplitudes via the Born Rule), the engine calls shor_collapse_site.

This function creates the global error decision. It doesn't just lock in block $k$'s choice; it propagates the decision to all unmeasured neighbors.

// hexstate_quantize.c: lines 1120-1121
double old_re = pq->edge_re[d], old_im = pq->edge_im[d];
pq->edge_re[d] = old_re * w_re - old_im * w_im;
pq->edge_im[d] = old_re * w_im + old_im * w_re;

This back-action fundamentally alters the amplitudes of adjacent blocks, conditioning their future quantization choices on the choice just made for block $k$. This creates anti-correlated quantization noise across the entire tensor, ensuring that when the matrix is multiplied against an activation vector, the localized errors cancel each other out.

4. Beam Search Refinement

Because the measurement is performed left-to-right (MSB to LSB), the sequence is prone to greedy failure. The HPC engine circumvents this using a 24-Beam Hensel Search.

Instead of accepting the single path created by the graph collapse, it maintains 24 parallel quantization paths (Q4_N_BEAMS). It evaluates candidate extensions by dividing the Shor probability (from the graph marginals) by the local normalized error, advancing the 24 best global configurations simultaneously.

Code Citation (`hexstate_quantize.c`, lines ~1581–1582)

double ext_err = beams[b].acc_error + cand_errors[blk][c];
extensions[n_ext].score = cand_score[c] / (ext_err + 1e-15);

Summary

The global error decision is not an iterative smoothing process (like BP). It is an exact evaluation of quantum interference. By encoding quantization errors into complex amplitudes, passing them through an IDFT6, and forcing neighbors to condition their subsequent scale selections via wave-collapse back-action, the engine ensures that the chosen scales globally cancel each other's destructive matrix-multiplication errors.

10. License

The quantizer code is part of the HPC project (MIT). Quantized models inherit the license of the base model (e.g., Gemma Terms of Use).

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support