Darwin-27B-Opus: 86.9% on GPQA Diamond, World #5, Zero Training
We are excited to share Darwin-27B-Opus, a 27B model that achieved 86.9% on GPQA Diamond, ranking #5 globally on the HuggingFace leaderboard, without a single gradient update.
How? Darwin breeds pretrained models through evolutionary FFN crossbreeding. The father (Qwen3.5-27B) provides the reasoning architecture; the mother (Claude 4.6 Opus Reasoning Distilled) contributes structured chain-of-thought knowledge. CMA-ES (Covariance Matrix Adaptation Evolution Strategy) automatically discovers optimal per-layer blending ratios, with no human tuning required.
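To make the idea of per-layer FFN blending concrete, here is a minimal sketch, assuming plain linear interpolation and a made-up tensor naming scheme (`layers.<index>.<ffn|attn>.weight`). The function names and the toy weights are illustrative only, and the CMA-ES search that would choose `layer_ratios` is not shown.

```python
from typing import Dict, List

Tensor = List[float]  # flat list standing in for a weight tensor


def blend(father: Tensor, mother: Tensor, ratio: float) -> Tensor:
    """Linear interpolation: ratio=0 keeps the father, ratio=1 takes the mother."""
    return [(1.0 - ratio) * f + ratio * m for f, m in zip(father, mother)]


def crossbreed_ffn(
    father: Dict[str, Tensor],
    mother: Dict[str, Tensor],
    layer_ratios: List[float],
) -> Dict[str, Tensor]:
    """Blend FFN tensors with one ratio per layer; copy all other tensors from the father."""
    child: Dict[str, Tensor] = {}
    for name, f_tensor in father.items():
        # Hypothetical naming scheme: "layers.<index>.<ffn|attn>.weight"
        parts = name.split(".")
        layer_idx, block = int(parts[1]), parts[2]
        if block == "ffn":
            child[name] = blend(f_tensor, mother[name], layer_ratios[layer_idx])
        else:
            child[name] = list(f_tensor)
    return child


# Toy two-layer "models" (values are invented for illustration).
father = {
    "layers.0.attn.weight": [1.0, 1.0],
    "layers.0.ffn.weight": [0.0, 2.0],
    "layers.1.ffn.weight": [4.0, 4.0],
}
mother = {
    "layers.0.attn.weight": [9.0, 9.0],
    "layers.0.ffn.weight": [2.0, 0.0],
    "layers.1.ffn.weight": [0.0, 8.0],
}
child = crossbreed_ffn(father, mother, layer_ratios=[0.5, 0.25])
print(child)
```

In this toy setup the attention tensor is inherited unchanged from the father, while each FFN tensor moves toward the mother by its layer's ratio.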
The result surpasses the original Qwen3.5-27B (85.5%), GLM-5.1 (744B, 86.2%), and Qwen3.5-122B (86.6%). A 27B model outperforms a 744B one with zero training, zero data, one GPU, and about 2 hours.
We also confirmed hybrid vigor on Korean benchmarks: Darwin-27B-KR (a second-generation offspring) surpassed both parents on CLIcK, winning 7 out of 11 categories. The evolutionary optimizer independently assigned 93% of FFN from the Korean-specialized mother while preserving 93% of attention from the reasoning-specialized father, supporting our core principle: FFN carries knowledge, attention carries reasoning.
Public release: 300+ community derivatives and 120K+ downloads in 10 days.
We're releasing Darwin-4B-David, the first second-generation model in the Darwin Opus family. By evolving an already-evolved model, it achieves 85.0% on GPQA Diamond, surpassing its original ancestor (58.6%) and even gemma-4-31B (84.3%), with just 4.5B parameters.
Second-Generation Evolution
Most merges start from a base model and produce a single offspring. Darwin-4B-David breaks this pattern. The Father (Darwin-4B-Opus) was already evolved from gemma-4-E4B-it with Claude Opus reasoning distillation, making it a Gen-1 model. The Mother (DavidAU's DECKARD-Expresso-Universe) brings Unsloth deep tuning across 5 in-house datasets, with thinking mode on by default. Crossbreeding these two produced the first Gen-2 Darwin model.
Darwin V6's Model MRI scanned both parents across all 42 layers, assigning independent optimal ratios per layer. The Mother's creativity and Korean language hotspot (layers 22-25, weight 0.95) was maximally absorbed, while the Father's reasoning core (layers 30-40, weight 0.48) was preserved. This is "Merge = Evolve" applied recursively: evolution of evolution.
Benchmarks
Darwin-4B-David scores 85.0% on GPQA Diamond (+26.4 percentage points over the original 58.6%), evaluated generatively with maj@8 (8 generations per question, majority vote), Epoch AI prompt format, thinking mode enabled, on 50 sampled questions. On ARC-Challenge (25-shot, loglikelihood), both models score 64.93%, which is expected, as loglikelihood scoring doesn't capture thinking-mode reasoning differences.
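For readers unfamiliar with maj@8, the scoring rule can be sketched as follows. This is a generic illustration of majority voting over sampled answers, not the evaluation harness used for the numbers above; the function names and the toy answers are assumptions.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among the sampled generations."""
    return Counter(answers).most_common(1)[0][0]


def maj_at_k_accuracy(samples: List[List[str]], gold: List[str]) -> float:
    """Fraction of questions whose majority-voted answer matches the gold answer."""
    correct = sum(majority_vote(s) == g for s, g in zip(samples, gold))
    return correct / len(gold)


# Two toy multiple-choice questions, 8 sampled answers each (invented data).
samples = [
    ["B", "B", "A", "B", "C", "B", "B", "D"],  # majority: B
    ["A", "C", "C", "A", "C", "C", "D", "C"],  # majority: C
]
gold = ["B", "A"]
accuracy = maj_at_k_accuracy(samples, gold)
print(accuracy)
```

Here the first question is counted as correct and the second as wrong, even though the second contains two correct samples, because only the voted answer is scored.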
Why This Matters
gemma-4-31B (30.7B) scores 84.3%. Darwin-4B-David surpasses it at 1/7th the size, with no training and no RL, just 45 minutes of MRI-guided DARE-TIES on one H100. The name "David" honors Mother creator DavidAU and evokes David vs. Goliath.
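The post names DARE-TIES without describing it. As a rough reminder of the DARE step only (randomly drop entries of the delta between a tuned model and its base, then rescale the survivors), here is a minimal sketch. The TIES part (trimming and sign election) is omitted, how the Model MRI guides the merge is not shown, and the function names are illustrative rather than taken from Darwin.

```python
import random
from typing import List


def dare_delta(
    base: List[float],
    tuned: List[float],
    drop_rate: float,
    rng: random.Random,
) -> List[float]:
    """Drop each delta entry with probability drop_rate; rescale kept entries by 1/(1-drop_rate)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_rate)
    sparse = []
    for b, t in zip(base, tuned):
        delta = t - b
        keep = rng.random() >= drop_rate
        sparse.append(delta * scale if keep else 0.0)
    return sparse


def apply_delta(base: List[float], delta: List[float], weight: float = 1.0) -> List[float]:
    """Add a (possibly sparsified) delta back onto the base weights."""
    return [b + weight * d for b, d in zip(base, delta)]


# Toy weights (invented for illustration).
base = [1.0, 2.0, 3.0, 4.0]
tuned = [1.5, 2.5, 2.0, 5.0]
rng = random.Random(0)  # fixed seed so the drop pattern is reproducible
sparse = dare_delta(base, tuned, drop_rate=0.5, rng=rng)
merged = apply_delta(base, sparse)
print(sparse, merged)
```

The rescaling keeps the expected value of each delta entry unchanged, which is why a large share of entries can be dropped.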
Darwin V6: Diagnostic-Guided Evolutionary Model Merging
We are releasing Darwin-31B-Opus, a reasoning-enhanced model merging Google's Gemma-4-31B-it and TeichAI's Claude Opus Distill using the Darwin V6 engine.
Conventional merging tools (mergekit, etc.) apply a single ratio to all tensors. Set ratio=0.5 and all 1,188 tensors blend identically, with no distinction between which tensors matter for reasoning versus coding.
Darwin V6 diagnoses both parents at the tensor level before merging. It measures Shannon entropy, standard deviation, and L2 norm for every tensor, then passes 5 diagnostic probes (REASONING, CODE, MATH, KNOWLEDGE, LANGUAGE) through the model to determine layer-wise functional importance. Each of the 1,188 tensors receives an independent optimal ratio.
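The three static measurements can be sketched for a single tensor as below. How Darwin V6 actually computes entropy is not stated here, so this sketch assumes a histogram-based Shannon entropy (in bits) over a fixed number of equal-width bins; the function name and the bin count are also assumptions.

```python
import math
from typing import Dict, List


def tensor_stats(values: List[float], bins: int = 16) -> Dict[str, float]:
    """Histogram Shannon entropy (bits), population standard deviation, and L2 norm."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    l2_norm = math.sqrt(sum(v * v for v in values))

    # Assumed entropy definition: bucket the values into equal-width bins.
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # A constant tensor has zero width; everything goes into the first bin.
        idx = 0 if width == 0 else min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    entropy = -sum((c / n) * math.log2(c / n) for c in counts if c)

    return {"entropy": entropy, "std": std, "l2_norm": l2_norm}


stats = tensor_stats([0.0, 1.0, 2.0, 3.0], bins=4)
print(stats)
```

With four values spread evenly over four bins, the entropy reaches its maximum of 2 bits, while a constant tensor would give 0.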
combined = static(entropy/std/norm) × 0.4 + probe(cosine_distance) × 0.6
final_ratio = mri_ratio × mri_trust + genome_ratio × (1 - mri_trust)
When one parent is overwhelmingly superior for a tensor (ratio < 0.15 or > 0.85), Darwin transplants it directly without interpolation. The mri_trust parameter itself is optimized by CMA-ES evolutionary search, so optimal transplant intensity is determined automatically. After merging, a Health Check compares the child against both parents layer-by-layer to detect interference or function loss.
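The ratio formula and the transplant rule can be put together in a short sketch. Several details are assumptions: the ratio is treated as the mother's share (so a ratio below 0.15 copies the father and one above 0.85 copies the mother), interpolation is linear, and the mapping from the combined diagnostic score to `mri_ratio` is left out because it is not specified here. Function names are illustrative.

```python
from typing import List


def combined_score(static_score: float, probe_score: float) -> float:
    """Weighted mix of the static and probe diagnostics (0.4 / 0.6)."""
    return 0.4 * static_score + 0.6 * probe_score


def final_ratio(mri_ratio: float, genome_ratio: float, mri_trust: float) -> float:
    """Blend the MRI-suggested ratio with the evolved genome ratio."""
    return mri_ratio * mri_trust + genome_ratio * (1.0 - mri_trust)


def merge_tensor(
    father: List[float],
    mother: List[float],
    ratio: float,
    low: float = 0.15,
    high: float = 0.85,
) -> List[float]:
    """Transplant one parent outright at extreme ratios; otherwise interpolate.

    Assumption: ratio is the mother's share of the merged tensor.
    """
    if ratio < low:
        return list(father)
    if ratio > high:
        return list(mother)
    return [(1.0 - ratio) * f + ratio * m for f, m in zip(father, mother)]


father = [0.0, 10.0]
mother = [10.0, 0.0]

blended = merge_tensor(father, mother, final_ratio(0.9, 0.5, 0.75))        # ratio near 0.8
transplanted = merge_tensor(father, mother, final_ratio(0.95, 0.9, 0.5))   # ratio above 0.85
print(blended, transplanted)
```

Raising `mri_trust` pulls the final ratio toward the diagnostic suggestion, which in turn pushes more tensors past the transplant thresholds.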
Try it now: FINAL-Bench/Gemma-4-Multi
Two Models, One Space
Switch between both Gemma 4 variants in a single interface:
Gemma 4 26B-A4B: MoE with 128 experts, only 3.8B active params. 95% of the 31B's quality at ~8x faster inference. AIME 88.3%, GPQA 82.3%.
Gemma 4 31B: Dense 30.7B. Best quality in the Gemma 4 family. AIME 89.2%, GPQA 84.3%, Codeforces 2150. Arena open-model top 3.
Features
Vision: Upload images for analysis, OCR, chart reading, document parsing
Thinking Mode: Toggle chain-of-thought reasoning with Gemma 4's native <|channel> thinking tokens
System Prompts: 6 presets (General, Code, Math, Creative, Translate, Research) or write your own
Streaming: Real-time token-by-token response via ZeroGPU
Apache 2.0: Fully open, no restrictions
Technical Details
Built with the dev build of transformers (5.5.0.dev0) for full Gemma 4 support, including multimodal apply_chat_template, variable-resolution image processing, and native thinking mode. Runs on HF ZeroGPU with @spaces.GPU, so no dedicated GPU is needed. Both models support a 256K context window and 140+ languages out of the box.