barpitf/ratplus

Text Generation


Improve model card

by nielsr HF Staff - opened 28 days ago

base: refs/heads/main ← from: refs/pr/1


README.md +15 -6


@@ -1,22 +1,31 @@
 ---
-license: apache-2.0
 datasets:
 - HuggingFaceFW/fineweb-edu
 metrics:
 - accuracy
 - perplexity
 ---
 ## Description
-Models trained from [RAT+ Paper](https://arxiv.org/pdf/2602.18196). Datasets we release the 100BT tokenized version. For the 200BT version, it's tokenized from fineweb_edu_350B raw dataset. Readers can use our code for tokenization.
 **Note that the models under the pretrain/ directory should not be used directly for evaluation.**
-These models need to undergo the resolution adaptation phase with the corresponding inference dilation size, local size, and initial size.
-Since there can be various configurations, we provide only one example, D=16 and W=256, under the adapt/ directory.
-For other configurations, we leave the adaptation to the readers. This resolution adaptation process is fast and stable: 1B tokens are sufficient for all pretrained models; we used a simple optimization scheme and found it to work well, and we observed that other optimization hyperparameters also work well.
 ## Citation
 If you find it useful, please consider citing the paper:
-```
 @article{wei2026rat+,
   title={RAT+: Train Dense, Infer Sparse--Recurrence Augmented Attention for Dilated Inference},
   author={Wei, Xiuying and Gulcehre, Caglar},

 ---
 datasets:
 - HuggingFaceFW/fineweb-edu
+license: apache-2.0
 metrics:
 - accuracy
 - perplexity
+pipeline_tag: text-generation
 ---
+# RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
+Official checkpoints for **RAT+**, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning.
+- **Paper:** [RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference](https://huggingface.co/papers/2602.18196)
+- **Repository:** [https://github.com/wimh966/rat-plus](https://github.com/wimh966/rat-plus)
 ## Description
+RAT+ bridges pretraining architectures and flexible inference. A single RAT+ model is pretrained densely once and can then be flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models.
+This repository contains checkpoints for 1B (100BT), 7B (100BT), and 3B (200BT) models.
 **Note that the models under the pretrain/ directory should not be used directly for evaluation.**
+These models need to undergo the resolution adaptation phase with the corresponding inference dilation size, local size, and initial size. Since there can be various configurations, we provide only one example, D=16 and W=256, under the adapt/ directory. For other configurations, we leave the adaptation to the readers. This resolution adaptation process is fast and stable: 1B tokens are sufficient for all pretrained models.
 ## Citation
 If you find it useful, please consider citing the paper:
+```bibtex
 @article{wei2026rat+,
   title={RAT+: Train Dense, Infer Sparse--Recurrence Augmented Attention for Dilated Inference},
   author={Wei, Xiuying and Gulcehre, Caglar},