Datadog/Toto-2.0-313m

@@ -1,4 +1,9 @@
 ---
 tags:
 - time-series-forecasting
 - foundation-models
@@ -9,64 +14,65 @@ tags:
 - observability
 - safetensors
 - pytorch_model_hub_mixin
-license: apache-2.0
-pipeline_tag: time-series-forecasting
 thumbnail: https://web-assets.dd-static.net/42588/1778691695-toto-2-hero.png
 model-index:
 - name: Toto-2.0-313m
   results:
-    - task:
-        type: time-series-forecasting
-      dataset:
-        name: BOOM
-        type: BOOM
-      metrics:
-        - name: CRPS
-          type: CRPS
-          value: 0.351
-        - name: MASE
-          type: MASE
-          value: 0.585
-      source:
-        name: BOOM 💥 Observability Time-Series Forecasting Leaderboard
-        url: https://huggingface.co/spaces/Datadog/BOOM
-    - task:
-        type: time-series-forecasting
-      dataset:
-        name: GIFT-Eval
-        type: GIFT-Eval
-      metrics:
-        - name: CRPS
-          type: CRPS
-          value: 0.481
-        - name: MASE
-          type: MASE
-          value: 0.703
-      source:
-        name: GIFT-Eval Time Series Forecasting Leaderboard
-        url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
-    - task:
-        type: time-series-forecasting
-      dataset:
-        name: TIME
-        type: TIME
-      metrics:
-        - name: CRPS
-          type: CRPS
-          value: 0.535
-        - name: MASE
-          type: MASE
-          value: 0.642
-      source:
-        name: TIME Benchmark Leaderboard
-        url: https://huggingface.co/spaces/Real-TSF/TIME-leaderboard
 ---
 # Toto-2.0-313m
-Toto (Time Series Optimized Transformer for [Observability](https://www.datadoghq.com/knowledge-center/observability/)) is a family of time series foundation models for multivariate forecasting developed by [Datadog](https://www.datadoghq.com/). Toto 2.0 is the current generation, featuring u-μP-scaled transformers ranging from 4m to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family.
-The family sets a new state of the art on three forecasting benchmarks: [BOOM](https://huggingface.co/spaces/Datadog/BOOM), our observability benchmark; [GIFT-Eval](https://huggingface.co/spaces/Salesforce/GIFT-Eval), the standard general-purpose benchmark; and the recent contamination-resistant [TIME](https://arxiv.org/abs/2602.12147) benchmark.
 ## 📊 Performance
@@ -114,7 +120,7 @@ For more examples, see the [Quick Start notebook](https://github.com/DataDog/tot
 ## 💾 Available Checkpoints
-All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latency is forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100.
 | Model | Params | Weights (fp32) | Latency | Recommended for |
 |:---:|:---:|:---:|---|---|
@@ -136,7 +142,7 @@ All five Toto 2.0 sizes share the same training recipe; pick a size based on you
 <figure>
 <img src="assets/architecture.png" alt="Overview of the Toto 2.0 architecture.">
-<figcaption>A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds <b>contiguous patch masking (CPM)</b> for single-pass parallel decoding, a <b>quantile output head</b> trained with pinball loss, a robust arcsinh input scaler, residual MLP patch projections, and is trained with NorMuon. See the <a href="#-additional-resources">technical report</a> for details.</figcaption>
 </figure>
 ## 🔗 Additional Resources
@@ -160,4 +166,4 @@ All five Toto 2.0 sizes share the same training recipe; pick a size based on you
       primaryClass={cs.LG},
       url={https://arxiv.org/abs/2605.20119},
 }
-```

 ---
+license: apache-2.0
+pipeline_tag: time-series-forecasting
+library_name: pytorch
+datasets:
+- Datadog/BOOM
 tags:
 - time-series-forecasting
 - foundation-models
 - observability
 - safetensors
 - pytorch_model_hub_mixin
+- u-mup
 thumbnail: https://web-assets.dd-static.net/42588/1778691695-toto-2-hero.png
 model-index:
 - name: Toto-2.0-313m
   results:
+  - task:
+      type: time-series-forecasting
+    dataset:
+      name: BOOM
+      type: BOOM
+    metrics:
+    - type: CRPS
+      value: 0.351
+      name: CRPS
+    - type: MASE
+      value: 0.585
+      name: MASE
+    source:
+      url: https://huggingface.co/spaces/Datadog/BOOM
+      name: BOOM 💥 Observability Time-Series Forecasting Leaderboard
+  - task:
+      type: time-series-forecasting
+    dataset:
+      name: GIFT-Eval
+      type: GIFT-Eval
+    metrics:
+    - type: CRPS
+      value: 0.481
+      name: CRPS
+    - type: MASE
+      value: 0.703
+      name: MASE
+    source:
+      url: https://huggingface.co/spaces/Salesforce/GIFT-Eval
+      name: GIFT-Eval Time Series Forecasting Leaderboard
+  - task:
+      type: time-series-forecasting
+    dataset:
+      name: TIME
+      type: TIME
+    metrics:
+    - type: CRPS
+      value: 0.535
+      name: CRPS
+    - type: MASE
+      value: 0.642
+      name: MASE
+    source:
+      url: https://huggingface.co/spaces/Real-TSF/TIME-leaderboard
+      name: TIME Benchmark Leaderboard
 ---
 # Toto-2.0-313m
+Toto (Time Series Optimized Transformer for [Observability](https://www.datadoghq.com/knowledge-center/observability/)) is a family of time series foundation models for multivariate forecasting developed by [Datadog](https://www.datadoghq.com/).
+This model, **Toto-2.0-313m**, was presented in the paper [Toto 2.0: Time Series Forecasting Enters the Scaling Era](https://huggingface.co/papers/2605.20119) by Emaad Khwaja, Chris Lettieri, Gerald Woo, Eden Belouadah, Marc Cenac, Guillaume Jarry, Enguerrand Paquin, Xunyi Zhao, Viktoriya Zhukov, Othmane Abou-Amal, Chenghao Liu, Ameet Talwalkar, and David Asker.
+Toto 2.0 is a generation of u-μP-scaled transformers ranging from 4M to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family. The family sets a new state of the art on three forecasting benchmarks: [BOOM](https://huggingface.co/spaces/Datadog/BOOM), our observability benchmark; [GIFT-Eval](https://huggingface.co/spaces/Salesforce/GIFT-Eval), the standard general-purpose benchmark; and the recent contamination-resistant [TIME](https://arxiv.org/abs/2602.12147) benchmark.
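For readers unfamiliar with the MASE metric reported in the model-index metadata, the following is a minimal sketch of the textbook definition: the forecast's mean absolute error divided by the in-sample error of a seasonal-naive forecast. The function name `mase` and the `season` parameter are illustrative assumptions, not part of the Toto codebase, and the leaderboards may aggregate or normalize scores differently across datasets.

```python
from typing import Sequence


def mase(
    history: Sequence[float],
    actual: Sequence[float],
    forecast: Sequence[float],
    season: int = 1,  # hypothetical parameter: seasonal lag of the naive baseline
) -> float:
    """Mean absolute scaled error (textbook form, not the leaderboard code)."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(history) <= season:
        raise ValueError("history must be longer than the seasonal lag")

    # Mean absolute error of the forecast over the horizon.
    forecast_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

    # In-sample mean absolute error of the seasonal-naive forecast.
    naive_errors = [
        abs(history[i] - history[i - season]) for i in range(season, len(history))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("history is constant at this lag; MASE is undefined")

    return forecast_mae / scale
```

A value below 1 means the forecast beats the naive baseline on average, which is the sense in which lower MASE is better.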
 ## 📊 Performance
 ## 💾 Available Checkpoints
+All five Toto 2.0 sizes share the same training recipe. Latency is forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100.
 | Model | Params | Weights (fp32) | Latency | Recommended for |
 |:---:|:---:|:---:|---|---|
 <figure>
 <img src="assets/architecture.png" alt="Overview of the Toto 2.0 architecture.">
+<figcaption>A decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. Toto 2.0 adds <b>contiguous patch masking (CPM)</b> for single-pass parallel decoding, a <b>quantile output head</b> trained with pinball loss, a robust arcsinh input scaler, and residual MLP patch projections; it is trained with NorMuon. See the <a href="https://arxiv.org/abs/2605.20119">technical report</a> for details.</figcaption>
 </figure>
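The caption mentions a quantile head trained with pinball loss and an arcsinh input scaler. The sketch below illustrates the standard definitions of both, using only the Python standard library. All function names are hypothetical, and the `center` and `scale` arguments are assumptions: the caption does not say how Toto 2.0 computes its scaling statistics, so this is not the model's actual implementation.

```python
import math
from typing import Sequence


def pinball_loss(y: float, q_pred: float, tau: float) -> float:
    """Pinball (quantile) loss for one target and one predicted tau-quantile."""
    diff = y - q_pred
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return max(tau * diff, (tau - 1.0) * diff)


def mean_pinball_loss(
    y: float, q_preds: Sequence[float], taus: Sequence[float]
) -> float:
    """Average pinball loss over a set of quantile levels for one target."""
    if len(q_preds) != len(taus):
        raise ValueError("need one prediction per quantile level")
    return sum(pinball_loss(y, q, t) for q, t in zip(q_preds, taus)) / len(taus)


def arcsinh_scale(x: float, center: float = 0.0, scale: float = 1.0) -> float:
    """Compress large magnitudes: near-linear around zero, logarithmic in the tails.

    center and scale are assumed, illustrative statistics.
    """
    return math.asinh((x - center) / scale)


def arcsinh_unscale(z: float, center: float = 0.0, scale: float = 1.0) -> float:
    """Invert arcsinh_scale to map model-space values back to the data scale."""
    return math.sinh(z) * scale + center
```

With `tau = 0.9`, predicting 8 for a target of 10 costs far more than predicting 12, which is what pushes the 0.9-quantile output above most observed values.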
 ## 🔗 Additional Resources
       primaryClass={cs.LG},
       url={https://arxiv.org/abs/2605.20119},
 }
+```