Text Generation

Improve model card

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +15 -6
README.md CHANGED
@@ -1,22 +1,31 @@
  ---
- license: apache-2.0
  datasets:
  - HuggingFaceFW/fineweb-edu
+ license: apache-2.0
  metrics:
  - accuracy
  - perplexity
+ pipeline_tag: text-generation
  ---
+
+ # RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
+
+ Official checkpoints for **RAT+**, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning.
+
+ - **Paper:** [RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference](https://huggingface.co/papers/2602.18196)
+ - **Repository:** [https://github.com/wimh966/rat-plus](https://github.com/wimh966/rat-plus)
+
  ## Description
- Models trained from [RAT+ Paper](https://arxiv.org/pdf/2602.18196). Datasets we release the 100BT tokenized version. For the 200BT version, it's tokenized from fineweb_edu_350B raw dataset. Readers can use our code for tokenization.
+ RAT+ bridges pretraining architectures and flexible inference. A single RAT+ model is pretrained densely once and can then be flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models.
+
+ This repository contains checkpoints for 1B (100BT), 7B (100BT), and 3B (200BT) models.

  **Note that the models under the pretrain/ directory should not be used directly for evaluation.**
- These models need to undergo the resolution adaptation phase with the corresponding inference dilation size, local size, and initial size.
- Since there can be various configurations, we provide only one example, D=16 and W=256, under the adapt/ directory.
- For other configurations, we leave the adaptation to the readers. This resolution adaptation process is fast and stable: 1B tokens are sufficient for all pretrained models; we used a simple optimization scheme and found it to work well, and we observed that other optimization hyperparameters also work well.
+ These models need to undergo the resolution adaptation phase with the corresponding inference dilation size, local size, and initial size. Since there can be various configurations, we provide only one example, D=16 and W=256, under the adapt/ directory. For other configurations, we leave the adaptation to the readers. This resolution adaptation process is fast and stable: 1B tokens are sufficient for all pretrained models.

  ## Citation
  If you find it useful, please consider citing the paper:
- ```
+ ```bibtex
  @article{wei2026rat+,
  title={RAT+: Train Dense, Infer Sparse--Recurrence Augmented Attention for Dilated Inference},
  author={Wei, Xiuying and Gulcehre, Caglar},
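
The model card names an inference configuration by a dilation size D and a local window size W (the provided example is D=16, W=256) without showing what such a sparsity pattern looks like. The sketch below is only an illustration of a generic causal "dilated plus local window" attention mask, not the RAT+ implementation: the function name `dilated_local_mask`, the mapping of D to `dilation` and W to `window`, and the exact inclusion rule are assumptions, and the recurrence component and the "initial size" setting mentioned in the card are not modeled. The paper and the linked repository remain the reference for the actual definition.

```python
def dilated_local_mask(seq_len: int, dilation: int, window: int) -> list[list[bool]]:
    """Build a causal attention mask; mask[i][j] is True if query i may attend to key j.

    Assumed rule (illustrative, not taken from the RAT+ code):
      - key j must not be in the future of query i (causal),
      - and either j lies within the last `window` positions up to and including i,
      - or the distance i - j is a multiple of `dilation`.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            in_window = i - j < window
            on_stride = (i - j) % dilation == 0
            row.append(causal and (in_window or on_stride))
        mask.append(row)
    return mask


def keys_per_query(mask: list[list[bool]]) -> list[int]:
    """Count how many keys each query position attends to."""
    return [sum(row) for row in mask]


if __name__ == "__main__":
    # Small toy sizes so the pattern is readable; the card's example uses D=16, W=256.
    toy = dilated_local_mask(seq_len=8, dilation=4, window=2)
    for row in toy:
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions, the number of keys per query grows roughly as `window + position / dilation` instead of linearly with position, which is the kind of saving a sparse inference configuration is after.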