---
tags:
- forecasting
- time-series
- zero-shot
- foundation-models
datasets:
- thuml/UTSD
- autogluon/chronos_datasets
- Salesforce/GiftEvalPretrain
---
Code: https://github.com/vilhess/PatchFM
# A tutorial on how to build a Foundation Model for Univariate Time Series Forecasting
A concise, reproducible recipe for training a transformer-based, patch-to-patch forecasting model for univariate time series. The approach mirrors Large Language Model (LLM) practice (next-token → next-patch) while remaining lightweight and practical compared to a classic LLM.
## Highlights
- Next-patch prediction objective (autoregressive, causal)
- Patch-based representation of time series (tokens ↔ patches)
- Causal masking self-attention with RoPE (relative positions)
- RevIN (Reversible Instance Normalization)
- SwiGLU feed-forward networks
- Autoregressive multi-quantile decoding [Moirai 2.0](https://arxiv.org/pdf/2511.11698)
- KV-cache for efficient long-horizon inference
- Flip-equivariance during inference (optional) [Reverso](https://arxiv.org/pdf/2602.17634v1)
## Quick Start
### from source code
1. Clone the repository and install dependencies
```bash
git clone https://github.com/vilhess/PatchFM
cd PatchFM
pip install -r requirements.txt
```
2. Run inference with a pretrained model from the Hugging Face Hub
```python
import torch
from configs import PatchFMConfig
from model import Forecaster
# --- Instantiate model ---
config = PatchFMConfig(load_from_hub=True)
model = Forecaster(config)
# --- Inference ---
forecast_horizon = 64
seq = torch.randn(1, 1024) # (batch, time)
pred_median, pred_quantiles = model(seq, forecast_horizon=forecast_horizon, quantiles=[0.1, 0.5, 0.9], flip_equivariance=True) # (batch, time), (batch, time, quantiles)
```
### from pip package
1. Install the package from PyPI
```bash
pip install patchfm
```
2. Run inference with a pretrained model from the Hugging Face Hub
```python
import torch
from patchfm import Forecaster, PatchFMConfig
# --- Instantiate model (same as above) ---
config = PatchFMConfig(load_from_hub=True)
model = Forecaster(config)
# --- Inference ---
forecast_horizon = 64
seq = torch.randn(1, 1024) # (batch, time)
pred_median, pred_quantiles = model(seq, forecast_horizon=forecast_horizon, quantiles=[0.1, 0.5, 0.9], flip_equivariance=True) # (batch, time), (batch, time, quantiles)
```
We provide an extended quick start example in [notebooks/tutorial.ipynb](./notebooks/tutorial.ipynb).
If you don't have suitable hardware, you can also run the extended quick start example in Google Colab.
## Method (TL;DR)
- Patching: Split a context signal of length $w$ into $P_{num} = w / P_{len}$ patches of length $P_{len}$ (a minimal sketch follows this list).
- Causal RevIN: Normalize input signal and denormalize outputs to the original scale without statistics leakage.
- Architecture: Input residual MLP → stacked Transformer blocks (MHA + SwiGLU FFN, pre-norm, residual) → $|\mathcal{Q}|$ output heads mapping back to patch space.
- Positional encoding: Rotary Position Embeddings (RoPE) applied to queries/keys.
- Training: Multi-quantile (pinball) loss across positions, elements, and quantiles $\mathcal{Q}$.
- Inference: Predict next patch; roll out autoregressively for long horizons.
- KV-cache: during inference, cache keys/values to avoid redundant computations.
- Flip-equivariance: during inference, additionally run the model on the sign-flipped input and combine both predictions to improve robustness (at the cost of doubling the batch size).
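For concreteness, here is a minimal sketch of the patching step with the sizes used throughout this tutorial ($P_{len} = 32$, context $w = 1024$); the exact tensor layout is an assumption for illustration:
```python
import torch

patch_len = 32              # P_len
seq = torch.randn(1, 1024)  # (batch, time), time = w

# Split the context into P_num = w / P_len non-overlapping patches of length P_len.
patches = seq.unfold(-1, patch_len, patch_len)
print(patches.shape)        # torch.Size([1, 32, 32]) -> (batch, P_num, P_len)
```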
## Problem Formulation
Given context patches $x_{p_1}, \ldots, x_{p_n}$, predict the next patch $x_{p_{i+1}}$ at every position $i \in \{1, \ldots, n\}$ using only past patches (causality). The model outputs quantiles $\{\hat{x}_{p_{i+1}}^{(q)} : q \in \mathcal{Q}\}$, with the median ($q = 0.5$) as the point forecast.
## Loss: Multi-Quantile (Pinball)
For residual $u = x - \hat{x}^{(q)}$:
$$\rho_q(u) = \begin{cases} q\,u, & u \ge 0,\\ (q-1)\,u, & u < 0. \end{cases}$$
Aggregate over positions, patch elements, and quantiles.
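A minimal sketch of this loss in PyTorch (tensor shapes are assumptions for illustration; the library's training code may organize dimensions differently):
```python
import torch

def pinball_loss(pred, target, quantiles):
    """pred: (..., |Q|), target: (...), quantiles: list of floats in (0, 1)."""
    q = torch.tensor(quantiles, device=pred.device)  # (|Q|,)
    u = target.unsqueeze(-1) - pred                  # residual u = x - x_hat^(q)
    loss = torch.maximum(q * u, (q - 1) * u)         # rho_q(u), elementwise
    return loss.mean()                               # aggregate over positions, patch elements, quantiles

quantiles = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
pred = torch.randn(2, 32, 32, len(quantiles))        # (batch, P_num, P_len, |Q|)
target = torch.randn(2, 32, 32)                      # (batch, P_num, P_len)
print(pinball_loss(pred, target, quantiles))
```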
## Architecture
- Input MLP: $\mathbb{R}^{P_{len}} \to \mathbb{R}^{dim}$ residual 2-layer MLP (ReLU)
- Multi-Head Attention: causal mask, RoPE; queries/keys/values per head
- FFN: SwiGLU (SiLU-gated), pre-norm + residual (sketched after this list)
- Output heads: |Q| linear maps $\mathbb{R}^{dim} \to \mathbb{R}^{P_{len}}$ (one per quantile)
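To make the block structure concrete, here is a sketch of a pre-norm SwiGLU feed-forward layer as listed above (the hidden size and the choice of LayerNorm are assumptions, not the exact PatchFM hyperparameters):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SiLU-gated feed-forward block with pre-norm and a residual connection."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)  # gate branch
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)    # value branch
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)  # back to model dim

    def forward(self, x):
        h = self.norm(x)                           # pre-norm
        h = F.silu(self.w_gate(h)) * self.w_up(h)  # SwiGLU gating
        return x + self.w_down(h)                  # residual

x = torch.randn(1, 32, 2048)  # (batch, patches, dim)
print(SwiGLUFFN(dim=2048, hidden_dim=4096)(x).shape)  # torch.Size([1, 32, 2048])
```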
### Model Details
- Patch size: 32
- Max context: 32 patches (1024 steps)
- Forecast horizon: 32 steps per forward pass
- Quantiles $\mathcal{Q}$: {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}
- Layers: 6
- Attention heads: 64 (head dim 32)
- Model dim: 2048
- Parameters: ~300M
## Inference
- Single step: predict next patch ($P_{len}$ values)
- Long-horizon: append prediction to context and repeat (optionally drop oldest patch to keep window fixed)
- Flip-equivariance [Reverso](https://arxiv.org/pdf/2602.17634v1): optionally run the model on the sign-flipped input as well and combine both predictions to improve robustness, at the cost of doubling the batch size (a usage sketch follows the formula):
$$y = \frac{1}{2} \left( f(x) - f(-x) \right)$$
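Written against the Quick Start call signature, the combination amounts to the sketch below; the `flip_equivariance=True` flag performs this in a single call, so the manual version (and the single-quantile call) is only illustrative:
```python
# Assumes `model`, `seq`, and `forecast_horizon` from the Quick Start example.
pred_pos, _ = model(seq, forecast_horizon=forecast_horizon, quantiles=[0.5])
pred_neg, _ = model(-seq, forecast_horizon=forecast_horizon, quantiles=[0.5])  # sign-flipped input
pred = 0.5 * (pred_pos - pred_neg)  # y = (f(x) - f(-x)) / 2
```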
### Autoregressive Inference with Quantile Forecasting ([Moirai 2.0](https://arxiv.org/pdf/2511.11698v1))
During autoregressive inference, the model generates forecasted values patch by patch. At each time step, the predicted patch is fed back into the model as input for the next step. This iterative process continues until the desired forecast horizon is reached.
When performing quantile forecasting, the situation becomes more complex. Instead of producing a single patch per step, the model outputs multiple patches corresponding to different quantiles (e.g., 0.1, 0.5, 0.9). Since the model expects a single patch for the next time step, it is not straightforward to feed all quantile predictions back into the model simultaneously.
A common workaround is to feed only the median prediction (the 0.5 quantile) back into the model at each step. While this approach preserves the autoregressive structure, it discards the uncertainty information captured by the other quantiles.
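A sketch of this median-feedback rollout is shown below; it reuses the Quick Start call signature, but the one-patch-per-call behavior and the context trimming are assumptions about how such a loop could be written, not the library's internal code:
```python
import torch

def median_rollout(model, context, horizon, patch_len=32, max_context=1024):
    """Roll out patch by patch, feeding the median prediction back as context."""
    preds = []
    total = 0
    while total < horizon:
        ctx = context[:, -max_context:]                       # keep the context window fixed
        pred_median, _ = model(ctx, forecast_horizon=patch_len, quantiles=[0.5])
        preds.append(pred_median)                             # (batch, patch_len)
        context = torch.cat([context, pred_median], dim=-1)   # feed the median back
        total += patch_len
    return torch.cat(preds, dim=-1)[:, :horizon]
```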
An alternative approach is **autoregressive multi-quantile decoding**, as proposed in [Moirai 2.0](https://arxiv.org/pdf/2511.11698v1). This method enables consistent autoregressive generation while preserving the full predictive distribution across quantiles. However, it is computationally more expensive than the median-only approach as it requires duplicating the context for each quantile.
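A compact sketch of the idea, assuming a hypothetical `predict_one_patch(context, q)` helper that returns one next patch at quantile `q`; each quantile keeps its own copy of the context and is advanced with its own prediction, which is why compute grows with $|\mathcal{Q}|$:
```python
import torch

def multi_quantile_rollout(predict_one_patch, context, horizon, quantiles, patch_len=32):
    """Autoregressive multi-quantile decoding (Moirai 2.0 style), illustrative only."""
    contexts = {q: context.clone() for q in quantiles}  # duplicated context per quantile
    preds = {q: [] for q in quantiles}
    n_steps = (horizon + patch_len - 1) // patch_len
    for _ in range(n_steps):
        for q in quantiles:
            patch = predict_one_patch(contexts[q], q)                # (batch, patch_len)
            preds[q].append(patch)
            contexts[q] = torch.cat([contexts[q], patch], dim=-1)    # quantile q feeds its own track
    return {q: torch.cat(p, dim=-1)[:, :horizon] for q, p in preds.items()}
```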
*Figures: classic autoregressive inference vs. autoregressive multi-quantile decoding.*