Improve model card
#1
by nielsr HF Staff - opened
README.md CHANGED
@@ -1,22 +1,31 @@
 ---
-license: apache-2.0
 datasets:
 - HuggingFaceFW/fineweb-edu
+license: apache-2.0
 metrics:
 - accuracy
 - perplexity
+pipeline_tag: text-generation
 ---
+
+# RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
+
+Official checkpoints for **RAT+**, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning.
+
+- **Paper:** [RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference](https://huggingface.co/papers/2602.18196)
+- **Repository:** [https://github.com/wimh966/rat-plus](https://github.com/wimh966/rat-plus)
+
 ## Description
-
+RAT+ bridges pretraining architectures and flexible inference. A single RAT+ model is pretrained densely once and can then be flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models.
+
+This repository contains checkpoints for 1B (100BT), 7B (100BT), and 3B (200BT) models.
 
 **Note that the models under the pretrain/ directory should not be used directly for evaluation.**
-These models need to undergo the resolution adaptation phase with the corresponding inference dilation size, local size, and initial size.
-Since there can be various configurations, we provide only one example, D=16 and W=256, under the adapt/ directory.
-For other configurations, we leave the adaptation to the readers. This resolution adaptation process is fast and stable: 1B tokens are sufficient for all pretrained models; we used a simple optimization scheme and found it to work well, and we observed that other optimization hyperparameters also work well.
+These models need to undergo the resolution adaptation phase with the corresponding inference dilation size, local size, and initial size. Since there can be various configurations, we provide only one example, D=16 and W=256, under the adapt/ directory. For other configurations, we leave the adaptation to the readers. This resolution adaptation process is fast and stable: 1B tokens are sufficient for all pretrained models.
 
 ## Citation
 If you find it useful, please consider citing the paper:
-```
+```bibtex
 @article{wei2026rat+,
 title={RAT+: Train Dense, Infer Sparse--Recurrence Augmented Attention for Dilated Inference},
 author={Wei, Xiuying and Gulcehre, Caglar},