Papers
arxiv:2603.26017

QuitoBench: A High-Quality Open Time Series Forecasting Benchmark

Published on Mar 27
· Submitted by Hang Yu on Apr 2

Abstract

QuitoBench addresses the lack of large-scale time series benchmarks by introducing a regime-balanced dataset with eight TSF regimes, revealing that foundation models outperform deep learning at long contexts while scaling data provides greater benefits than scaling model size.

AI-generated summary

Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce QuitoBench, a regime-balanced benchmark for time series forecasting with coverage across eight trend × seasonality × forecastability (TSF) regimes, designed to capture forecasting-relevant properties rather than application-defined domain labels. The benchmark is built upon Quito, a billion-scale time series corpus of application traffic from Alipay spanning nine business domains. Benchmarking 10 models from deep learning, foundation models, and statistical baselines across 232,200 evaluation instances, we report four key findings: (i) a context-length crossover where deep learning models lead at short context (L = 96) but foundation models dominate at long context (L ≥ 576); (ii) forecastability is the dominant difficulty driver, producing a 3.64× MAE gap across regimes; (iii) deep learning models match or surpass foundation models with 59× fewer parameters; and (iv) scaling the amount of training data provides substantially greater benefit than scaling model size for both model families. These findings are validated by strong cross-benchmark and cross-metric consistency. Our open-source release enables reproducible, regime-aware evaluation for time series forecasting research.
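To make the eight TSF regimes concrete: a 2 × 2 × 2 grid over thresholded trend strength, seasonality strength, and forecastability would yield eight cells. The sketch below is an illustrative assignment using common proxies (variance explained by a linear fit, variance explained by a mean seasonal profile, and 1 minus normalized spectral entropy); the actual scores, thresholds, and period used by QuitoBench are not specified here and the 0.5 cutoffs are assumptions.

```python
import numpy as np

def regime_label(y, period=24, thr=0.5):
    """Assign a series to one of eight trend x seasonality x forecastability
    cells. All three scores and the 0.5 threshold are illustrative proxies,
    not the paper's definitions."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))

    # Trend strength: share of variance explained by a linear fit.
    trend = np.polyval(np.polyfit(t, y, 1), t)
    detrended = y - trend
    trend_strength = max(0.0, 1 - detrended.var() / y.var())

    # Seasonality strength: share of detrended variance explained
    # by the mean seasonal profile over full periods.
    n = len(y) // period * period
    profile = detrended[:n].reshape(-1, period).mean(axis=0)
    seasonal = np.tile(profile, n // period)
    resid = detrended[:n] - seasonal
    seas_strength = max(0.0, 1 - resid.var() / detrended[:n].var())

    # Forecastability proxy: 1 - normalized spectral entropy
    # (a concentrated spectrum suggests a more predictable series).
    spec = np.abs(np.fft.rfft(y - y.mean())) ** 2
    p = spec / spec.sum()
    entropy = -(p[p > 0] * np.log(p[p > 0])).sum() / np.log(len(p))
    forecastability = 1 - entropy

    return (trend_strength > thr, seas_strength > thr, forecastability > thr)
```

A strongly trended, strongly seasonal series would land in a (True, True, ·) cell, while white noise falls into the (False, False, ·) cells; balancing samples across all eight cells is what "regime-balanced" would mean under this reading.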

Community

Paper submitter

A large-scale, high-quality time series dataset with interesting findings!

The regime-balanced sampling in QuitoBench is a clever way to push models toward handling intrinsic data properties rather than relying on domain labels. One big question I have is how stable those eight TSF regimes are under nonstationary shifts, or when trend and seasonality drift over time. An ablation that reweights or re-samples by real-world frequencies, or that tests a rolling regime assignment, could reveal whether the relative strengths of foundation vs. deep learning models persist under drift. The arxivlens breakdown helped me parse the method details; it's a nice companion to the paper and clarifies the regime construction and the dense rolling-window evaluation. It would also be helpful to see a targeted ablation isolating regime coverage from simply adding more data, to confirm where the gains really come from.
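For readers unfamiliar with the dense rolling-window protocol mentioned above, the idea is to slide a fixed-length context window across each series, forecast the next horizon, and average the error over all windows. The sketch below is a minimal illustration with a seasonal-naive baseline; the window lengths, stride, and baseline are assumptions for demonstration, not the paper's exact evaluation setup.

```python
import numpy as np

def rolling_mae(series, forecast_fn, context_len=96, horizon=24, stride=24):
    """Dense rolling-window evaluation: slide a context window over the
    series, forecast the next `horizon` points, and average the MAE.
    Window sizes and stride here are illustrative."""
    series = np.asarray(series, dtype=float)
    errors = []
    start = 0
    while start + context_len + horizon <= len(series):
        ctx = series[start:start + context_len]
        truth = series[start + context_len:start + context_len + horizon]
        pred = forecast_fn(ctx, horizon)
        errors.append(np.mean(np.abs(pred - truth)))
        start += stride
    return float(np.mean(errors))

def seasonal_naive(ctx, horizon, period=24):
    # Repeat the last full season of the context as the forecast.
    last = ctx[-period:]
    reps = int(np.ceil(horizon / period))
    return np.tile(last, reps)[:horizon]
```

On a perfectly periodic series this baseline scores near-zero MAE, which makes it a useful sanity check before plugging in a learned `forecast_fn`.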


Get this paper in your agent:

hf papers read 2603.26017
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0


Datasets citing this paper 2

Spaces citing this paper 0


Collections including this paper 1