🚀 Introducing FINAL-Bench Quantum — an open, neutral benchmark that finally puts quantum-computing methods on one fair yardstick.
Quantum results are notoriously hard to compare. The same "logical error rate" or "query fidelity" means very different things depending on the code, noise model, hardware, and shot count. FINAL-Bench Quantum fixes that: five events judged under identical, published protocols, where every number is labeled as either measured here or quoted from a source.
Five events: ① QEC Decoder ② Optimization (Max-Cut) ③ VQE ④ QRAM ⑤ Quantum Simulation
The rules are simple and strict: ✅ Track A (measured here, with 95% confidence intervals) is kept separate from Track B (quoted from papers, not directly comparable). 🔬 Simulation and real hardware are clearly distinguished, and no quantum-advantage claims are made. 🌍 Methods from Google, IBM, NVIDIA, USTC, Riverlane and more sit side by side, with origin flags and author credits. 📤 Anyone can submit their own method via the Submit tab for review and listing.
Already on the board: real IBM Heron r2 measurements (repetition-code distance boundary, 29–175× error reduction from d3 to d5), a real-chip QRAM query fidelity of 0.92, and H₂ VQE at chemical accuracy — always labeled honestly as simulation vs hardware.
A leaderboard is only useful if you can trust it, so neutrality is the whole point: strong competitors stay in even when they beat the host, sources are quoted faithfully, and a simulation is never rounded up into a hardware claim.
Darwin-60B-DUO: Two SOTAs, One Endpoint — 88.38% on GPQA Diamond 🚀
We're excited to release Darwin-60B-DUO, the Darwin family's first DUO model. Take two domain-verified specialists, hide them behind a single OpenAI-compatible endpoint, and let a router decide which one (or both) answers. You see one model, one API — but get the best of both.
The number that matters: on the full 198-question GPQA Diamond, Darwin-60B-DUO hits 88.38%. The constituents alone land at 69.70% (Darwin-28B-REASON) and 77.27% (AWAXIS-Think-31B); a naive cascade only reaches 83.84%. The DUO clears them all. Two small specialists, intelligently routed, beat one big generalist on cost and quality. Both are independently verified — Darwin-28B-REASON is #3 on the HF GPQA Diamond leaderboard, AWAXIS-Think-31B is #1 on Korea's national K-AI Leaderboard (MSIT).
The brains is a Hybrid-A router picking one of five strategies on the fly. Korean → AWAXIS, English/STEM → Darwin (single-backend, ~70% of traffic at 1× cost). When a Korean answer needs rigorous English reasoning, split_refine fires — Darwin drafts, AWAXIS polishes; MCQ/short-answer runs both with self-consistency + cross-verify. Net effective cost: only ~1.3× a single 30B model.
The part the community will care about: the gateway is model-agnostic and Apache-2.0. Point it at any two OpenAI-compatible backends and you've got a DUO in minutes — teach router.py when to use which, and parallel calls, response merging, and routing transparency via _duo_route are handled for you. Fork it and tell us what you built.
Painless deploy: docker compose up for both vLLM backends + gateway; FP8 ~30GB colocates on a single B200/H100. One git clone (~120GB). Text-only for now, streaming in v1.1. Two SOTAs, one endpoint. Come build your own on the Community tab.
🧬 Darwin Family: Zero Gradient Steps, GPQA Diamond 88.89%
How far can we push LLM reasoning *without* training?
Our team at VIDRAFT submitted this paper to Daily Papers yesterday, and it's currently #3. Huge thanks to everyone who upvoted — sharing the core ideas below.
Darwin Family is a training-free evolutionary merging framework. By recombining the weight spaces of existing LLM checkpoints — with zero gradient-based training — it reaches frontier-level reasoning.
- 🏆 Darwin-28B-Opus: GPQA Diamond 88.89% - 💸 Zero gradient steps — not a single B200 or H200 hour needed - 🧬 Consistent gains across 4B → 35B scale - 🔀 Cross-architecture breeding between Transformer and Mamba families - 🔁 Stable recursive multi-generation evolution
#Three Core Mechanisms
① 14-dim Adaptive Merge Genome — fine-grained recombination at both component level (Attention / FFN / MLP / LayerNorm / Embedding) and block level, expanding the prior evolutionary-merge search space.
② MRI-Trust Fusion — we diagnose each layer's reasoning contribution via an **MRI (Model Reasoning Importance)** signal and fuse it with evolutionary search through a **learnable trust parameter**. Trust the diagnostic too much and search collapses; ignore it and search becomes inefficient — Darwin learns the balance from data.
③ Architecture Mapper — weight-space breeding across heterogeneous families. Attention × SSM crossover actually works.
Why It Matters > Diagnose latent capabilities already encoded in open checkpoints, > and recombine them — no gradients required.
This Space is a fork of the brilliant Eliahu/Model-Atlas, the official demo of "Charting and Navigating Hugging Face's Model Atlas" (Horwitz et al., arXiv 2503.10633). Their pre-computed HF model graph is the foundation of every node and edge you see, and we are deeply grateful for its open release.
The original atlas is a static snapshot of early 2025. Model Galaxy turns it into a living, multimodal map. We injected the 2026 trending originals that did not exist when the atlas was frozen — DeepSeek-V4, Hy3-preview, GLM-5.1, Kimi-K2, gpt-oss, Nemotron-3 Super / Nano / Omni, Hermes-4.3, Qwen3-Coder-Next, Llama-3.3, Granite-4.1, plus the latest multimodal releases (FLUX.2, ERNIE-Image, HunyuanImage / Video, LTX-2.3, Wan2.2, Kokoro-82M, VoxCPM2, Voxtral-TTS, whisper-v3-turbo, Gemma-4, Qwen3-Omni, Phi-4-mm) — each with proper base_model lineage edges.
We also added the complete VIDRAFT Darwin family ontology: 120 nodes covering Darwin Core, AETHER, every brand variant (Rogue, AWAXIS, TenOS, Warecube), NOESIS-Darwin multimodal extensions, and 40+ community quantizations — the most complete Darwin lineage view anywhere.
The name "Galaxy" is now literal: our three injected clusters are re-laid out as logarithmic spiral galaxies, with bigger models near the bright cores and quantizations scattering to the outer arms — just like real star mass distribution. A top-right toggle switches between Galaxy mode (deep-space gradient with 220 animated stars) and Atlas mode (clean white panels for reports). A 15-second progress bar narrates the render, and per-modality / per-company colors make every cluster legible at a glance.
Final scale: 22,480 nodes in the default Modalities atlas, 137,324 in the Large NLP atlas, and a 277-node compact Darwin + Trending view for instant exploration. Feedback and PRs welcome.