---
language:
- en
pipeline_tag: text-generation
tags:
- qwen
- mixed-precision
- singularity-principle
- bitsandbytes
---

🌌 Qwen3.5-9B-Singularity-Max

This is a hybrid mixed-precision (INT8 + FP16) model based on the Singularity Principle (P3 Trace-class Admissibility Law). By diagnosing spectral collapse in the neural network, we preserved the 48 most critical topological cores in pure FP16 while compressing the stable background space to INT8, roughly halving VRAM usage.

🏆 Key Achievement: The PPL Inversion Paradox

Standard 8-bit quantization typically degrades a model's logical reasoning, increasing its perplexity (PPL). However, by strictly protecting the "Singularity Cores" identified by the P3 scanner, this model exhibits a groundbreaking phenomenon: perplexity actually improved.

  • Original FP16 PPL: 14.539
  • Singularity-Max PPL: 14.471 (螖 -0.068, Improved!)

Compressing the background manifold into INT8 acted as a form of Noise Regularization, while the 48 FP16 cores perfectly prevented non-normal transient growth.
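The core/background split can be illustrated with a toy symmetric INT8 quantizer that skips a protected set of rows. This is a simplified stand-in, not the P3 scanner itself; all names below are illustrative:

```python
# Toy illustration of hybrid quantization: most rows are rounded to INT8,
# while a protected set of "core" rows is kept at full precision.

def quantize_row_int8(row):
    """Symmetric per-row INT8 quantization: scale to [-127, 127], round, rescale."""
    scale = max(abs(x) for x in row) / 127.0 or 1.0
    return [round(x / scale) * scale for x in row]

def hybrid_quantize(matrix, core_rows):
    """Keep rows in `core_rows` untouched (full precision); quantize the rest."""
    return [row[:] if i in core_rows else quantize_row_int8(row)
            for i, row in enumerate(matrix)]

def max_error(a, b):
    """Largest element-wise deviation between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

weights = [[0.013, -0.42, 0.07], [1.5, -2.0, 0.3], [0.001, 0.002, -0.003]]
hybrid = hybrid_quantize(weights, core_rows={1})

# The protected core row is bit-exact; background rows carry small rounding noise.
assert hybrid[1] == weights[1]
print("max rounding error:", max_error(weights, hybrid))
```

The small rounding noise on the background rows is exactly the perturbation the card interprets as implicit regularization, while the protected rows contribute no error at all.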

Furthermore, the Time-To-First-Token (TTFT) was accelerated by 2.6x:

  • FP16 TTFT: 3.213s
  • Singularity-Max TTFT: 1.230s
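TTFT is simply the delay until the first generated token arrives. A minimal, model-agnostic way to measure it is to time the first item of a token stream; `fake_token_stream` below is a stand-in for a real streamer (e.g. `transformers.TextIteratorStreamer`), so the numbers it produces are synthetic:

```python
import time

def time_to_first_token(stream):
    """Return (ttft_seconds, tokens): time until the first item, plus all items."""
    start = time.perf_counter()
    it = iter(stream)
    first = next(it)                # blocks until the first token is produced
    ttft = time.perf_counter() - start
    return ttft, [first, *it]

def fake_token_stream():
    """Stand-in for a model's token streamer: slow first token, fast rest."""
    time.sleep(0.05)                # simulates the prefill / first forward pass
    yield "Hello"
    for tok in [",", " world", "!"]:
        time.sleep(0.005)
        yield tok

ttft, tokens = time_to_first_token(fake_token_stream())
print(f"TTFT: {ttft:.3f}s, tokens: {tokens}")
```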

🚀 Quick Start (No custom scripts required!)

You don't need any complex .so files or custom kernels. The hybrid architecture (INT8 background + 48 FP16 cores) is permanently baked into the config.json and will automatically load onto your GPU using the standard Hugging Face transformers library.
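A serialized hybrid setup of this kind typically appears in `config.json` as a bitsandbytes-style `quantization_config` block. The snippet below is purely illustrative of that shape (the module names are hypothetical placeholders, not copied from this repo):

```json
{
  "quantization_config": {
    "quant_method": "bitsandbytes",
    "load_in_8bit": true,
    "llm_int8_skip_modules": ["model.layers.0.self_attn.q_proj", "..."]
  }
}
```

Modules listed under `llm_int8_skip_modules` are left in higher precision while everything else is loaded in INT8, which is how a "48 FP16 cores + INT8 background" split can be expressed without custom kernels.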

This 9B model stably runs on limited hardware (e.g., dual Kaggle T4s) with a peak VRAM footprint of ~10.98 GB.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SingularityPrinciple/Qwen3.5-9B-Singularity-Max"

# The hybrid architecture (INT8 background + 48 FP16 cores) loads automatically
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Use model.device so this also works when device_map places layers for you
inputs = tokenizer("Explain the Singularity Principle in plain language.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```