Toto-2.0-2.5B

Toto (Time Series Optimized Transformer for Observability) is a family of time series foundation models for multivariate forecasting developed by Datadog. Toto 2.0 is the current generation, featuring u-ΞΌP-scaled transformers ranging from 4m to 2.5B parameters, all trained from a single recipe. Forecast quality improves reliably with parameter count across the family.

The family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark.

πŸ“Š Performance

Pareto frontier on BOOM and GIFT-Eval
Every Toto 2.0 size sits on or near the Pareto frontier on both BOOM and GIFT-Eval. The three largest sizes rank first, second, and third among foundation models by CRPS rank on GIFT-Eval. On TIME, Toto 2.0 sizes take the top three spots on every metric, ahead of every other external foundation model evaluated.

⚑ Quick Start

Inference code is available on GitHub.

Installation

pip install "toto-2 @ git+https://github.com/DataDog/toto.git#subdirectory=toto2"

Inference Example

import torch
from toto2 import Toto2Model

model = Toto2Model.from_pretrained("Datadog/Toto-2.0-2.5B")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()

# (batch, n_variates, time_steps)
target = torch.randn(1, 1, 512, device=device)
target_mask = torch.ones_like(target, dtype=torch.bool)  # no missing values in this example
series_ids = torch.zeros(1, 1, dtype=torch.long, device=device)  # one id per variate: (batch, n_variates)

# Returns quantiles of shape (9, batch, n_variates, horizon)
# Quantile levels: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
quantiles = model.forecast(
    {"target": target, "target_mask": target_mask, "series_ids": series_ids},
    horizon=96,
    decode_block_size=768,
    has_missing_values=False,
)
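
Point forecasts and prediction intervals fall out of the returned tensor by indexing the quantile axis (indices follow the quantile ordering in the comment above):

median = quantiles[4]                      # 0.5 quantile: point forecast, (batch, n_variates, horizon)
lower, upper = quantiles[0], quantiles[8]  # 0.1 / 0.9 quantiles: a central 80% prediction interval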

For more examples, see the Quick Start notebook and GluonTS integration notebook.

πŸ’Ύ Available Checkpoints

All five Toto 2.0 sizes share the same training recipe; pick a size based on your accuracy/latency budget. Latency is forward-pass time for a 1,024-step single-pass forecast at batch size 8 on a single A100.

| Model | Params | Weights (fp32) | Latency | Recommended for |
|---|---|---|---|---|
| Toto‑2.0‑4m | 4m | 16 MB | ~3.8 ms | Edge / CPU deployment; tightest latency or memory budgets. |
| Toto‑2.0‑22m | 22m | 84 MB | ~5.0 ms | Efficient default; matches or beats Toto 1.0 quality with ~7Γ— fewer parameters. |
| Toto‑2.0‑313m | 313m | 1.2 GB | ~15.4 ms | Strong general-purpose checkpoint; top-3 foundation model on GIFT-Eval. |
| Toto‑2.0‑1B | 1B | 3.9 GB | ~20.9 ms | Best quality/cost tradeoff for production workloads. |
| Toto‑2.0‑2.5B | 2.5B | 9.1 GB | ~36.2 ms | Highest accuracy; #1 foundation model on every benchmark. |
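
Every size loads the same way; only the repository id changes. The 22m id below is an assumption that each size is published under the same Datadog/ namespace on the Hub:

model = Toto2Model.from_pretrained("Datadog/Toto-2.0-22m")  # assumed Hub id; substitute any size from the table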

✨ Key Features

  • Zero-Shot Forecasting: Forecast without fine-tuning on your specific time series.
  • Multi-Variate Support: Efficiently process multiple variables using alternating time/variate attention.
  • Probabilistic Predictions: Generate point forecasts and uncertainty estimates via a quantile output head.
  • Decoder-Only Architecture: Support for variable prediction horizons and context lengths.
  • u-ΞΌP Scaling: A single training recipe transfers cleanly across all five sizes (4m β†’ 2.5B).

πŸ—οΈ Architecture

Overview of the Toto 2.0 architecture.
Toto 2.0 is a decoder-only patched transformer whose attention layers alternate between time-axis (causal) and variate-axis (full) views of the input. On top of this backbone, Toto 2.0 adds contiguous patch masking (CPM) for single-pass parallel decoding, a quantile output head trained with pinball loss, a robust arcsinh input scaler, and residual MLP patch projections, and it is trained with the NorMuon optimizer. See the technical report for details.
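
To make the alternating-attention idea concrete, here is a minimal, illustrative PyTorch sketch. It is not the actual implementation; layer names, normalization, and residual details are assumptions, and the real design is described in the technical report:

import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """Illustrative only: causal attention along the time axis, full attention along the variate axis."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.variate_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_variates, n_patches, dim)
        b, v, t, d = x.shape
        # Time axis: fold variates into the batch and attend causally over patches.
        xt = x.reshape(b * v, t, d)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        xt = xt + self.time_attn(xt, xt, xt, attn_mask=causal, need_weights=False)[0]
        x = xt.reshape(b, v, t, d)
        # Variate axis: fold time into the batch; no mask, so variates attend to each other fully.
        xv = x.transpose(1, 2).reshape(b * t, v, d)
        xv = xv + self.variate_attn(xv, xv, xv, need_weights=False)[0]
        return xv.reshape(b, t, v, d).transpose(1, 2)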

πŸ”— Additional Resources

See the GitHub repository for inference code, the technical report for architecture and training details, and the Quick Start and GluonTS integration notebooks for worked examples.

πŸ“– Citation

(citation coming soon)