- HPC Quantizer β Shor-Optimized
- 1. Why Shor's Algorithm?
- 2. Performance & Benchmarks
- 3. Quantum-Inspired Mechanics (Shor Pipeline)
- 4. Q2_K and Q4_0 Promotion Strategies
- 5. Sub-Block Refinement and Beam Search
- 6. Prerequisites and Build Instructions
- 7. The Quantization Pipeline (End-to-End)
- 8. Inference & Runtime Configuration
- 9. Fidelity Classification & Troubleshooting
- 1. Why Shor's Algorithm?
- How the HPC Engine Makes Global Error Decisions
HPC Quantizer β Shor-Optimized
GGUF quantization powered by Shor's algorithm.
HPC is probably a breakthrough in model compression, utilizing a quantization pipeline derived from Shor's factoring algorithm. It compresses large language models (like Gemma 4) by mapping quantization candidates to a quantum-inspired state space.
Instead of independently rounding each weight block or running iterative belief propagation, HPC encodes scale candidates as Zβ complex amplitudes on a constraint graph and applies the Griffiths-Niu sequential measurement protocol. This uses the same IDFT + feed-forward + collapse/back-action loop that extracts periods in Shor's algorithm β finding globally optimal scale configurations where quantization noise is rotated away from the transformer's reasoning dimensions.
1. Why Shor's Algorithm?
The previous HPC engine used iterative belief propagation (BP) to find optimal scale configurations. BP converges to the element-wise MSE minimum β the scale configuration that minimizes total Ξ£ (w_original - w_quantized)Β². This produces the lowest possible RMSE, but at 2 bits per weight, the noise floor still slightly bleeds into reasoning-critical dimensions.
Shor's Griffiths-Niu measurement protocol replaces BP with a fundamentally different optimization strategy:
| Feature | Belief Propagation (v2) | Shor's Measurement (v3) |
|---|---|---|
| Mechanism | Iterative message-passing (200+ rounds) | Single-pass sequential measurement |
| Convergence | May oscillate or get stuck | Exact marginals, no iteration |
| Inter-block coordination | Local messages only | Global conditioning via collapse back-action |
| Error metric | Element-wise MSE (isotropic) | Dβ vesica gate (anisotropic) |
| RMSE | Lower | Slightly higher |
| Reasoning fidelity | Good | Significantly better |
The key insight: RMSE measures the wrong thing. Standard RMSE treats every weight dimension equally. But during matrix multiplication, some error dimensions propagate through the computation graph and destroy reasoning, while others cancel out and are invisible. Shor's measurement finds configurations where block-to-block errors are anti-correlated along the computation path β they cancel during matmul even though each individual block has slightly higher error.
2. Performance & Benchmarks
Gemma 4 26B-A4B-it MoE (25.8B params)
| Quantization | Size | Fits 12 GB? | Method |
|---|---|---|---|
| BF16 | 48.5 GB | β | β |
| Q8_0 | ~27 GB | β | Round-to-nearest |
| Q4_K_M | 16.8 GB | β | Round-to-nearest |
| IQ3_K_XXS | ~12 GB | β οΈ | Unsloth |
| HPCΒ·Shor | 10.2 GB | β | Griffiths-Niu measurement |
Gemma 4 E2B-it (4.65B params)
| Model | Size | BPW | PPL | Speed |
|---|---|---|---|---|
| BF16 (original) | 8.67 GB | 16.00 | 154.0 | 4.2 t/s |
| ggml Q2_K + iMatrix | 2.77 GB | 5.12 | 89.1 | 14.0 t/s |
| HPC Q2_K + Q4_0Β·Shor | 1.44 GB | ~3.0 | 129.6 | 18.1 t/s |
Reasoning Benchmarks (Gemma 4 31B, Q2_KΒ·Shor, 12.5 GB)
- 25 Horses combinatorial proof: β (7 races, complete elimination)
- Hindley-Milner type inference: β (correct let-polymorphism)
- Arto Inkala "World's Hardest Sudoku": β (AC-3 + backtracking)
- Diagnose 3 non-obvious bugs in C: β (first attempt)
- Tarjan's bridge-finding algorithm: β
(correct
>vs>=distinction)
3. Quantum-Inspired Mechanics (Shor Pipeline)
Standard quantization picks scales independently per block. Shor-powered quantization treats scale selection as a global optimization problem where the measurement of each block conditions all remaining blocks through quantum-inspired back-action.
The domain mapping from Shor's integer factoring to HPC Quantization:
| Shor's Factoring | HPC Quantization |
|---|---|
Oracle phase 2Ο Γ d Γ cβ / N |
Boltzmann amplitude from candidate error |
Period r |
Optimal scale configuration |
QFT interference peaks at r |
IDFT6 interference peaks at optimal RMSE |
| Semi-classical feed-forward | Phase correction from measured blocks |
| Born measurement β period bits | Born measurement β scale candidate selection |
| Collapse + entanglement | Collapse + back-action into neighbor amplitudes |
By utilizing the IDFT6 inside the coherent sum, the algorithm creates constructive interference at the optimal RMSE configuration, similarly to how Shor's QFT creates interference at the correct period.
4. Q2_K and Q4_0 Promotion Strategies
The quantizer automatically assigns precision tiers:
- Q4_0Β·Shor β Attention projections (Q/K/V/O) β 16 candidate scales, 24-beam search.
- Q2_KΒ·Shor β FFN, MLP, expert weights β 36 candidate (d, dmin) pairs, dual-quhit graph, 24-beam search.
- Preserved β Embeddings, norms, router/gate weights (kept as-is in high precision).
Tied Embeddings
If no separate output weight tensor exists, token_embd.weight doubles as the LM head. HPC automatically detects tied embeddings and promotes them to Q4_0 to preserve generation logic, ensuring that output tokens remain accurate despite the extreme compression of the feed-forward layers.
5. Sub-Block Refinement and Beam Search
The HPC process executes in multiple phases to guarantee globally coherent scaling parameters:
- Phase 1: Greedy Seed & WLS Refinement: Computes reference scale and min per 256-weight superblock.
- Phase 2: Candidate Generation via Dβ Vesica Scoring: Instead of MSE, weights are scored with the Dβ Vesica gate (penalizing DC/summed errors 4x more than AC/wave errors).
- Phase 3: Sequential Measurement: Builds an HPCGraph encoding candidates as Boltzmann amplitudes. Connects blocks via CZ phase gates and runs Griffiths-Niu measurement (IDFT6, Born Rule, Collapse).
- Phase 4: 24-Beam Hensel Search: Maintains 24 parallel configuration beams across the tensor, branching candidates evaluated via triality-weighted scoring.
- Phase 5: Sub-Block Shor Refinement: A second, smaller Shor sequential measurement over a 16-node graph corresponding to the 16 sub-blocks within each 256-weight superblock.
6. Prerequisites and Build Instructions
Before you can quantize models, you must build the Shor-optimized HPC C engine.
Dependencies (Ubuntu/Debian)
sudo apt install gcc libgmp-dev libmpfr-dev python3 python3-numpy
You will also need llama.cpp built from source for iMatrix generation:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --target llama-imatrix -j$(nproc)
Build the HPC Engine
Navigate to the LLM-distributed directory and compile:
make -f makefile.quantize
Verify the build generated libhexstate_q2k.so. The Python requantizer (hexstate_requantize.py) auto-detects this library to enable Shor optimization.
7. The Quantization Pipeline (End-to-End)
Step A: Convert Model to BF16 GGUF
Use llama.cpp to convert your source HuggingFace model to BF16.
python3 llama.cpp/convert_hf_to_gguf.py /path/to/model/ --outfile Model-BF16.gguf --outtype bf16
Step B: Generate Importance Matrix (iMatrix)
HPC includes a native C engine for generating importance matrices (imatrix) used in aggressive LLM weight quantization (Q2_K and below). The engine replaces the standard llama.cpp calibration pipeline with an HPC-graph-accelerated tokenizer and forward pass that produces structurally superior importance data.
Component files:
| File | Description |
|---|---|
LLM/hexstate_quantize.c |
C engine: HPC BPE tokenizer, graph-based forward pass, quantization kernels |
LLM/generate_imatrix.py |
Orchestrator: GGUF loading, weight dequantization, C-bridge, imatrix output |
LLM/calibration_data.txt |
Calibration corpus (~12.8M characters) |
Why It Works: Global Geometric Tokenization
The core innovation is the HPC BPE tokenizer β a byte-pair encoding engine that operates on the HPCGraph substrate without regex word boundaries. This seemingly simple architectural choice has profound consequences for quantization quality.
The Standard Approach: Artificial Isolation
Standard tokenizers (tiktoken, SentencePiece, HuggingFace) apply a regex pre-split before BPE:
"Hello world" β regex β ["Hello", " world"] β BPE per word β [15496, 1917]
The regex fence means:
- Merges cannot cross word boundaries. The characters
"o "(letter-o + space) can never form a token. - Each word is tokenized in isolation. The token for
"Hello"at position 0 is mathematically independent of any text at position 10,000. - Token boundaries are locally determined. Changing text on page 500 cannot affect how page 1 is tokenized.
When a standard tokenizer produces 4,096 tokens for imatrix calibration, those tokens are a shallow, context-free sample β an arbitrary window of generic subwords that exercises only the activation patterns present in that one fragment of text.
The HPC Approach: Unrestricted Graph Contraction
The HPC BPE tokenizer treats the entire calibration corpus as a single, continuous phase graph:
"Hello world" β 11 sites, each CZ-coupled to its neighbor β global merge competition
Graph Construction β Each character becomes a site in an
HPCGraph. Adjacent sites are connected by CZ edges, encoding pair structure as phase entanglement. For a 12.8M character corpus, this creates a graph with 12,799,706 sites and 12,799,705 CZ edges.Global Merge Competition β Each BPE pass scans every alive position across the entire graph, finds the lowest-rank merge pair that exists anywhere in the 12.8M character sequence, and contracts all instances simultaneously. This is not local β a merge decision at position 10,000,000 competes with and can preempt merge decisions at position 100.
Cascade Propagation β When a merge at position
icontracts sitesiandi+1into a single token, sitei's neighbor changes. The new adjacent pair(merged_token, next_neighbor)may have a very low rank, causing it to fire in the next pass β which creates another new pair, propagating further. Without regex boundaries, these cascades propagate freely across spaces, punctuation, and line breaks, threading through the entire document's character geometry.Phase Contraction β Each merge is simultaneously a graph contraction on the HPCGraph. The merged site's local quhit amplitude is updated to a sharp basis state encoding the new token ID (
best_merged % D), which mathematically severs the entanglement from the consumed CZ edge. The graph tracks the full contraction history.
After 21,576 passes (Mistral) or 24,447 passes (Qwen), the 12.8M character graph contracts to ~3.5Mβ5.2M surviving sites. Each surviving token is not a generic subword β it is the fixed point of a global contraction over the entire document's character geometry.
Tokens as Global Geometric Frequencies
In standard NLP, a token represents a word. In the HPC tokenizer, a token represents a global geometric frequency β a structural alignment that was carved out by 20,000+ rounds of competition across the full corpus.
The key consequence: the first 4,096 tokens of the output already contain the structural dependencies of the entire 12.8M character document. This is because:
- The merge rules (pre-trained vocabulary) act as a fixed geometric frame β a coordinate system for projecting character sequences into token space.
- The absence of regex boundaries means the projection is unrestricted β merges resolve cross-word, cross-line, and cross-paragraph structures that regex-split tokenizers are blind to.
- The specific token IDs and their boundary placements at any position are determined by 20,000+ passes of global competition where every merge decision was influenced by every position in the 12.8M character sequence.
- You cannot know how the first 500 characters will be tokenized until you have evaluated the last 500 characters. The tokenization of position 0 is a function of the entire document.
The Result: Surgical Quantization at 2 Bits
When these globally-informed tokens are fed through the HPC forward pass for importance collection, the resulting E[xΒ²] statistics are structurally representative of the full corpus β even from a single 4,096-token chunk. The cross-layer Belief Propagation then smooths these statistics via residual stream coupling, producing an importance matrix that precisely identifies the load-bearing weights.
Empirical validation: Mistral-7B-Instruct-v0.3 quantized to Q2_K (87.5% weight compression, 14.5 GB β 3 GB) using a single HPC-calibrated chunk produces:
- Flawless English grammar and complex vocabulary
- Correct factual retrieval from context ("What color was John's suit?" β "Neon green.")
- Coherent multi-sentence reasoning structure
Standard llama.cpp imatrix calibration at Q2_K typically requires hundreds of chunks (500K+ tokens) to avoid catastrophic degradation. The HPC pipeline achieves superior results with one chunk because the tokenizer has already done the work of compressing the entire document's structure into that chunk.
# Generate HPC importance matrix
python3 LLM/generate_imatrix.py \
model.gguf calibration_data.txt \
-o imatrix.dat --chunks 5 --verbose
5 Chunks is the 'sweet spot' for retaining most model intelligence I've found.
Step C: Quantize with HPC Execute the re-quantizer with your newly generated BF16 GGUF and iMatrix.
python3 hexstate_requantize.py Model-BF16.gguf Model-Q2_K-HexState.gguf --keep-metadata --imatrix imatrix.gguf
This automatically routes the attention layers to Q4_0 and FFN/MLP layers to Q2_K using the Shor measurement graph.
8. Inference & Runtime Configuration
For correct operation, it is highly recommended to use appropriate chat templates and specific configuration flags in llama.cpp to prevent context length bugs.
Download Correct Chat Template:
curl -L -o chat_template.jinja "https://huggingface.co/google/gemma-4-26B-A4B-it/raw/main/chat_template.jinja"
Run llama-server:
llama-server -m Model-Q2_K-HexState.gguf -ngl 0 -c 4096 --host 0.0.0.0 --port 8989 --jinja --chat-template-file chat_template.jinja --cache-ram 0 -ctxcp 1
Recommended Sampling Settings:
- Temperature: 0.3β0.4 (Lower reduces sampling noise at low BPW)
- Top_k: 20β30 (Narrow sampling for coherence)
- Top_p: 0.8β0.85 (Cuts noisy long tail)
- Repeat_penalty: 1.15β1.2 (Prevents self-correction loops)
9. Fidelity Classification & Troubleshooting
The quantizer reports a fidelity rating based on total RMSE across all quantized tensors:
| Rating | RMSE Threshold | Icon |
|---|---|---|
| ULTRA | β€ 1e-04 | β β β β |
| HIGH | β€ 3e-04 | β β β β |
| GOOD | β€ 1e-03 | β β ββ |
| STANDARD | > 1e-03 | β βββ |
Note: Due to anisotropic error shaping, a "GOOD" Shor-quantized model will typically outperform a "HIGH" BP-quantized model on reasoning tasks.
Common Issues
- Arabic/Korean characters in output: The embedded chat template is broken. Use
--chat-template-filewith the correct Jinja template. - RAM usage keeps growing: Use
--cache-ram 0 -ctxcp 1with llama.cpp to manage sliding window attention. - RMSE is higher than standard Q2_K: This is intentional. The Dβ vesica gate trades total RMSE for computation-aligned error minimization.
- libhexstate_q2k.so not found: Make sure to compile the C engine using
make -f makefile.quantize.
How the HPC Engine Makes Global Error Decisions
The HPC (Holographic Phase Contraction) engine takes a fundamentally different approach to LLM weight quantization. Instead of minimizing local error block-by-block, it frames scale selection as a global quantum-inspired optimization problem. It maps quantization candidates to a constraint graph and evaluates them using the Griffiths-Niu Sequential Measurement protocol derived from Shor's factoring algorithm.
Here is a step-by-step breakdown of how the engine makes global error decisions, directly referencing the implementation in hexstate_quantize.c.
1. Candidate Generation and Dβ Vesica Scoring
Before any global optimization occurs, the engine needs candidates. For each block of weights, it generates candidate scales (d and dmin) around a baseline least-squares optimum.
Instead of scoring these candidates using standard Mean Squared Error (MSE), it uses the Dβ Vesica Gate.
Code Citation (hexstate_quantize.c, lines ~1420β1432)
/* Decompose into vesica (DC) and wave (AC) components */
float vesica_err = 0.0f, wave_err = 0.0f;
for (int p = 0; p < half_g; p++) {
float v = e_cur[p] + e_cur[p + half_g];
float w_wave = e_cur[p] - e_cur[p + half_g];
float w_avg = (w_cur[p] + w_cur[p + half_g]) * 0.5f;
vesica_err += v * v * w_avg;
wave_err += w_wave * w_wave * w_avg;
}
/* Triality weighting: penalize vesica 4Γ, wave 1Γ. */
err += 0.5f * (4.0f * vesica_err + 1.0f * wave_err);
Why this matters: vesica_err represents errors that sum together during matrix multiplication, propagating through the network and destroying reasoning. wave_err represents errors that naturally cancel out. By penalizing vesica_err by 4x, the engine heavily biases towards candidates whose errors cancel out during inference, even if their total Euclidean distance (MSE) from the original weights is larger.
2. Graph Construction and Boltzmann Amplitudes
The selected candidates and their Vesica errors are grouped into 6 bins (representing the 6 states of a $Z_6$ "quhit" or quantum digit). The engine constructs an HPCGraph mapping each block to one or more quhits.
The errors are transformed into "Boltzmann amplitudes"βrepresenting the likelihood of selecting each state.
Code Citation (hexstate_quantize.c, lines ~1502β1506)
for (int ci = 0; ci < Q4_N_CAND; ci++) {
int qi = Q4_CAND_TO_QUHIT[ci];
amp_re[qi] += exp(-(double)(agg_errors[ci] - min_err) /
(2.0 * (double)temperature));
}
3. The Griffiths-Niu Sequential Measurement
This is where the global coordination happens. The function shor_measure_graph (lines 1166β1314) executes a sequential measurement MSB to LSB. This replaces standard Belief Propagation with a deterministic evaluation that creates massive global correlation.
For each block $k$ being measured:
A. Neighbor Contribution (Entanglement)
It evaluates how neighboring blocks influence block $k$ by projecting their current amplitudes across the graph edges.
// hexstate_quantize.c: lines 1217-1221
sr += lr*wr - li*wi;
si += lr*wi + li*wr;
B. Feed-Forward Phase Correction
It applies a phase shift based on the outcomes of all blocks measured before it, a signature trait of the semi-classical QFT used in Shor's algorithm.
// hexstate_quantize.c: lines 1181-1184
double power = 36.0;
for (int64_t j = k + 1; j < n_sites; j++) {
theta_k += (double)measured_out[j] / power;
power *= 6.0;
}
C. IDFT6 and Constructive Interference
It runs an Inverse Discrete Fourier Transform (IDFT6). Because the neighbor influence ($C_k$) was baked into the amplitudes before the IDFT, the IDFT acts as a coherence filter. It produces constructive interference peaks precisely at the scale candidate that creates the best global configuration.
// hexstate_quantize.c: lines 1256-1261
double angle = 2.0 * 3.14159265358979323846 * d * v / 6.0;
double er = cos(angle), ei = sin(angle);
sum_re += alpha_re[d]*er - alpha_im[d]*ei;
sum_im += alpha_re[d]*ei + alpha_im[d]*er;
D. Measurement and Back-Action (The "Magic Pointer")
Once an optimal state is selected (using argmax on the squared amplitudes via the Born Rule), the engine calls shor_collapse_site.
This function creates the global error decision. It doesn't just lock in block $k$'s choice; it propagates the decision to all unmeasured neighbors.
// hexstate_quantize.c: lines 1120-1121
double old_re = pq->edge_re[d], old_im = pq->edge_im[d];
pq->edge_re[d] = old_re * w_re - old_im * w_im;
pq->edge_im[d] = old_re * w_im + old_im * w_re;
This back-action fundamentally alters the amplitudes of adjacent blocks, conditioning their future quantization choices on the choice just made for block $k$. This creates anti-correlated quantization noise across the entire tensor, ensuring that when the matrix is multiplied against an activation vector, the localized errors cancel each other out.
4. Beam Search Refinement
Because the measurement is performed left-to-right (MSB to LSB), the sequence is prone to greedy failure. The HPC engine circumvents this using a 24-Beam Hensel Search.
Instead of accepting the single path created by the graph collapse, it maintains 24 parallel quantization paths (Q4_N_BEAMS). It evaluates candidate extensions by dividing the Shor probability (from the graph marginals) by the local normalized error, advancing the 24 best global configurations simultaneously.
Code Citation (hexstate_quantize.c, lines ~1581β1582)
double ext_err = beams[b].acc_error + cand_errors[blk][c];
extensions[n_ext].score = cand_score[c] / (ext_err + 1e-15);
Summary
The global error decision is not an iterative smoothing process (like BP). It is an exact evaluation of quantum interference. By encoding quantization errors into complex amplitudes, passing them through an IDFT6, and forcing neighbors to condition their subsequent scale selections via wave-collapse back-action, the engine ensures that the chosen scales globally cancel each other's destructive matrix-multiplication errors.
10. License
The quantizer code is part of the HPC project (MIT). Quantized models inherit the license of the base model (e.g., Gemma Terms of Use).