---
language:
- en
pipeline_tag: text-generation
tags:
- qwen
- mixed-precision
- singularity-principle
- bitsandbytes
---

🌌 Qwen3.5-9B-Singularity-Max

This is a hybrid mixed-precision (INT8 + FP16) model based on the Singularity Principle (P3 Trace-class Admissibility Law). By diagnosing spectral collapse in the neural network, we preserved the 48 most critical topological cores in pure FP16 while compressing the stable background space to INT8, roughly halving VRAM usage.

🏆 Key Achievement: The PPL Inversion Paradox

Standard 8-bit quantization typically degrades a model's logical reasoning, increasing its perplexity (PPL). However, by strictly protecting the "Singularity Cores" identified by the P3 scanner, this model exhibits a groundbreaking phenomenon: perplexity actually improved.

  • Original FP16 PPL: 14.539
  • Singularity-Max PPL: 14.471 (螖 -0.068, Improved!)

Compressing the background manifold into INT8 acted as a form of Noise Regularization, while the 48 FP16 cores perfectly prevented non-normal transient growth.
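The core/background split can be illustrated with a toy symmetric INT8 quantizer that skips a protected set of rows. This is a simplified stand-in, not the P3 scanner itself; all names below are illustrative:

```python
# Toy illustration of hybrid quantization: most rows are rounded to INT8,
# while a protected set of "core" rows is kept at full precision.

def quantize_row_int8(row):
    """Symmetric per-row INT8 quantization: scale to [-127, 127], round, rescale."""
    scale = max(abs(x) for x in row) / 127.0 or 1.0
    return [round(x / scale) * scale for x in row]

def hybrid_quantize(matrix, core_rows):
    """Keep rows in `core_rows` untouched (full precision); quantize the rest."""
    return [row[:] if i in core_rows else quantize_row_int8(row)
            for i, row in enumerate(matrix)]

def max_error(a, b):
    """Largest element-wise deviation between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

weights = [[0.013, -0.42, 0.07], [1.5, -2.0, 0.3], [0.001, 0.002, -0.003]]
hybrid = hybrid_quantize(weights, core_rows={1})

# The protected core row is bit-exact; background rows carry small rounding noise.
assert hybrid[1] == weights[1]
print("max rounding error:", max_error(weights, hybrid))
```

The small rounding noise on the background rows is exactly the perturbation the card interprets as implicit regularization, while the protected rows contribute no error at all.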

Furthermore, the Time-To-First-Token (TTFT) was accelerated by 2.6x:

  • FP16 TTFT: 3.213s
  • Singularity-Max TTFT: 1.230s
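TTFT is simply the delay until the first generated token arrives. A minimal, model-agnostic way to measure it is to time the first item of a token stream; `fake_token_stream` below is a stand-in for a real streamer (e.g. `transformers.TextIteratorStreamer`), so the numbers it produces are synthetic:

```python
import time

def time_to_first_token(stream):
    """Return (ttft_seconds, tokens): time until the first item, plus all items."""
    start = time.perf_counter()
    it = iter(stream)
    first = next(it)                # blocks until the first token is produced
    ttft = time.perf_counter() - start
    return ttft, [first, *it]

def fake_token_stream():
    """Stand-in for a model's token streamer: slow first token, fast rest."""
    time.sleep(0.05)                # simulates the prefill / first forward pass
    yield "Hello"
    for tok in [",", " world", "!"]:
        time.sleep(0.005)
        yield tok

ttft, tokens = time_to_first_token(fake_token_stream())
print(f"TTFT: {ttft:.3f}s, tokens: {tokens}")
```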

🚀 Quick Start (No custom scripts required!)

You don't need any complex .so files or custom kernels. The hybrid architecture (INT8 background + 48 FP16 cores) is permanently baked into the config.json and will automatically load onto your GPU using the standard Hugging Face transformers library.
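A serialized hybrid setup of this kind typically appears in `config.json` as a bitsandbytes-style `quantization_config` block. The snippet below is purely illustrative of that shape (the module names are hypothetical placeholders, not copied from this repo):

```json
{
  "quantization_config": {
    "quant_method": "bitsandbytes",
    "load_in_8bit": true,
    "llm_int8_skip_modules": ["model.layers.0.self_attn.q_proj", "..."]
  }
}
```

Modules listed under `llm_int8_skip_modules` are left in higher precision while everything else is loaded in INT8, which is how a "48 FP16 cores + INT8 background" split can be expressed without custom kernels.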

This 9B model stably runs on limited hardware (e.g., dual Kaggle T4s) with a peak VRAM footprint of ~10.98 GB.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SingularityPrinciple/Qwen3.5-9B-Singularity-Max"

# The hybrid architecture (INT8 background + 48 FP16 cores) loads automatically
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Use model.device so this also works when device_map places layers for you
inputs = tokenizer("Explain the Singularity Principle in plain language.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```