---
license: apache-2.0
tags:
- auron
- chimera
- gdn
- ouroboros
- hybrid-architecture
language:
- en
thumbnail: auron_banner.png
---
![Auron](auron_banner.png)
# Auron-510M
**Auron** is a family of Chimera hybrid GDN-Attention language models with Ouroboros weight sharing.
**Paper:** [Auron: Depth-Efficient Language Models via Hybrid Recurrent-Attention Weight Sharing](https://github.com/Fy-/Auron/blob/master/Auron_chimera_topology_paper.pdf)
**Code:** [github.com/Fy-/Auron](https://github.com/Fy-/Auron)
## Architecture
- **Type:** Chimera (ChimeraConfig)
- **Dim:** 1536
- **Layers:** 16 virtual
- **Params:** 510,217,280 (510M)
- **Vocab:** 151936 (Qwen 3 tokenizer)
- **Context:** 2048 tokens
- **Topology:** 4 unique bottom + 4×3 shared top
- **GDN:Attn ratio:** 3:1 (every 4th layer is attention)
- **Virtual equivalent:** ~1,020,434,560 params
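
The layer counts above can be checked with a small sketch. It is illustrative only and does not use the Auron code: it assumes "4×3 shared top" means one block of 4 shared layers applied 3 times, and that "every 4th layer" means virtual positions 4, 8, 12 and 16. The names `layer_plan`, `bottomN` and `sharedN` are hypothetical.

```python
# Illustrative layout only; not taken from the Auron repository.
N_VIRTUAL = 16      # "Layers: 16 virtual"
UNIQUE_BOTTOM = 4   # "4 unique bottom"
SHARED_BLOCK = 4    # assumption: the shared block holds 4 layers...
LOOPS = 3           # ...and is applied 3 times ("4×3 shared top")


def layer_plan():
    """Return (virtual_index, kind, weight_id) for each virtual layer."""
    plan = []
    for v in range(N_VIRTUAL):
        # Assumption: every 4th layer (1-indexed) is attention, the rest GDN.
        kind = "attn" if (v + 1) % 4 == 0 else "gdn"
        if v < UNIQUE_BOTTOM:
            weight_id = f"bottom{v}"  # unique weights
        else:
            # Shared weights are reused on every pass through the block.
            weight_id = f"shared{(v - UNIQUE_BOTTOM) % SHARED_BLOCK}"
        plan.append((v, kind, weight_id))
    return plan


if __name__ == "__main__":
    for v, kind, weight_id in layer_plan():
        print(f"virtual layer {v:2d}: {kind:4s} weights={weight_id}")
```

Under these assumptions, 8 distinct weight sets serve 16 virtual layers (12 GDN, 4 attention), which matches the listed virtual equivalent being exactly 2 × 510,217,280.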
## Training Curves
![Training Curves](training_curves.png)
## Training
- **Step:** 249,000
- **Data:** Mixed (75% FineWeb-Edu, 18% StarCoder, 5% FineMath, 2% UltraChat)
- **Optimizer:** Muon + AdamW (decoupled embedding LR)
- **Schedule:** WSD (Warmup-Stable-Decay)
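
As a rough illustration of the data mix and the WSD schedule listed above, the sketch below draws a data source by the stated percentages and computes a warmup-stable-decay learning rate. It is not the training code: the function names, the linear warmup and decay shapes, and every step count and learning rate passed to `wsd_lr` are assumptions, and whether the percentages apply per document or per token is not stated here.

```python
import random

# Percentages from the list above.
DATA_MIX = {"FineWeb-Edu": 75, "StarCoder": 18, "FineMath": 5, "UltraChat": 2}


def pick_source(rng):
    """Draw one data source name with probability proportional to DATA_MIX."""
    names = list(DATA_MIX)
    return rng.choices(names, weights=[DATA_MIX[n] for n in names], k=1)[0]


def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Warmup-Stable-Decay learning rate (linear warmup and decay assumed)."""
    if step < warmup_steps:
        # Warmup: ramp linearly up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable: hold at peak_lr.
        return peak_lr
    # Decay: move linearly from peak_lr to min_lr, then stay at min_lr.
    t = min(step - warmup_steps - stable_steps, decay_steps) / decay_steps
    return peak_lr + (min_lr - peak_lr) * t


if __name__ == "__main__":
    rng = random.Random(0)
    print([pick_source(rng) for _ in range(5)])
    # Hypothetical values, chosen only to show the three phases.
    for step in (0, 500, 5000, 9500, 10000):
        print(step, wsd_lr(step, peak_lr=1e-3, warmup_steps=1000,
                           stable_steps=8000, decay_steps=1000))
```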
## Usage
```bash
git clone https://github.com/Fy-/Auron && cd Auron && rye sync
```
```python
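from ouro import load_model, generate

# Load the checkpoint; returns the model, its tokenizer, and the device in use.
model, tokenizer, device = load_model("nyxia/Auron-510M")

# Generate a continuation of the prompt.
generate(model, tokenizer, device, "The history of")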
```
## Sampling
Defaults: temperature `T=0.7`, `top_k=20`, `top_p=0.9`, repetition penalty `rep_pen=1.0`, presence penalty `presence_pen=1.5`. Ouroboros weight sharing requires a presence penalty of at least 1.5 to prevent attractor wells.
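
To show how these defaults interact, here is a minimal pure-Python sketch of one sampling step. It does not reproduce the `ouro` implementation. The function names are hypothetical, and it assumes the presence penalty is additive (a flat amount subtracted from the logit of any token already generated). The repetition penalty is left out because its form is not specified here.

```python
import math
import random


def filter_and_normalize(logits, generated, temperature=0.7, top_k=20,
                         top_p=0.9, presence_penalty=1.5):
    """Return {token_id: probability} after penalty, temperature, top-k, top-p."""
    # Presence penalty (assumed additive): penalize every token seen so far once.
    seen = set(generated)
    adjusted = [x - presence_penalty if i in seen else x
                for i, x in enumerate(logits)]
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in adjusted]
    # Top-k: keep the k highest-scoring token ids, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    order = order[:top_k]
    # Numerically stable softmax over the surviving tokens.
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i, w in zip(order, weights):
        p = w / total
        kept.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {i: p / norm for i, p in kept}


def sample_next(logits, generated, rng=random, **kwargs):
    """Sample one token id from the filtered distribution."""
    dist = filter_and_normalize(logits, generated, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]  # toy 4-token vocabulary
    print(filter_and_normalize(toy_logits, generated=[0]))
    print(sample_next(toy_logits, generated=[0], rng=random.Random(0)))
```

In the toy call, token 0 has the highest raw logit but has already been generated, so the penalty moves it below token 1. This is the push away from repeated tokens that the presence penalty provides.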
## Links
- **Paper:** [Auron: Depth-Efficient Language Models via Hybrid Recurrent-Attention Weight Sharing](https://github.com/Fy-/Auron/blob/master/Auron_chimera_topology_paper.pdf)
- **Code:** [github.com/Fy-/Auron](https://github.com/Fy-/Auron)
- **Models:** [huggingface.co/nyxia](https://huggingface.co/nyxia)
Built by [Florian Gasquez](https://fyx.jp) ([@nyxia](https://huggingface.co/nyxia)). Part of the [Soulkyn](https://soulkyn.com) project.