Synthefy Tabular
Synthefy Tabular is a tabular foundation model for regression via in-context learning (ICL). Given a few labeled rows as context, it predicts on new query rows in a single forward pass, with no task-specific training or fine-tuning. The model is trained entirely on synthetic data.
- Repository: https://github.com/Synthefy/synthefy-tabular
- Library: pip install synthefy-tabular
- Checkpoint: synthefy-tabular.pt (this repo)
- Parameters: ~5.9M
- License: Apache-2.0
Results
Mean and median R² of the base model across 96 regression tasks from three public benchmark suites (single H200, up to 50K context rows per dataset):
| Suite | Datasets | Mean R² | Median R² |
|---|---|---|---|
| TabArena | 13 | 0.8117 | 0.8757 |
| TALENT | 72 | 0.7569 | 0.8802 |
| OpenML | 11 | 0.6373 | 0.5856 |
| Overall | 96 | 0.7506 | 0.8702 |
Large-N / long-context tables (common in TabArena) are the current focus of the large-table training stages. These numbers are reproducible end-to-end with one command — see Reproducing these numbers.
Thinking is an inference-time reasoning extension that improves these numbers further. Details are forthcoming.
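As a quick consistency check on the table, the overall mean R² should equal the dataset-weighted average of the per-suite means. The sketch below recomputes it from the published rows. The r2_score helper is a plain-Python stand-in for the standard definition, included only to make the metric concrete; it is not the benchmark's evaluation code.

```python
from statistics import mean

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# (number of datasets, mean R²) per suite, copied from the table above
suites = {"TabArena": (13, 0.8117), "TALENT": (72, 0.7569), "OpenML": (11, 0.6373)}

n_total = sum(n for n, _ in suites.values())
overall_mean = sum(n * r2 for n, r2 in suites.values()) / n_total
print(n_total, round(overall_mean, 4))  # 96 0.7506
```

The medians cannot be combined this way: the overall median needs the 96 per-dataset scores, which the summary table does not list.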
Usage
pip install synthefy-tabular
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from synthefy_tabular import SynthefyTabularRegressor
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = SynthefyTabularRegressor() # downloads these weights from the Hub on first use
model.fit(X_train, y_train) # "fit" just stores the labeled rows as context
pred = model.predict(X_test) # predictions in a single forward pass, no training
The regressor uses a GPU when one is available and falls back to CPU otherwise. A one-shot helper skips the object entirely:
from synthefy_tabular import predict
pred = predict(X_train, y_train, X_test, task="regression")
predict follows the TabPFNRegressor.predict contract: pass output_type="mean"
(default), "median", or "mode" to choose the point estimate drawn from the model's
predictive distribution.
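To make the three point estimates concrete, here is a minimal sketch of how each could be read off a vector of predicted quantiles. It is an illustration, not the library's implementation: the evenly spaced levels k/1000, the point_estimate helper, and the gap-based mode estimate are all assumptions, and a log-normal distribution stands in for one row of model output.

```python
import math
from statistics import NormalDist

N_QUANTILES = 999
# Assumption: quantile levels are evenly spaced at k/1000 for k = 1..999.
levels = [k / (N_QUANTILES + 1) for k in range(1, N_QUANTILES + 1)]

def point_estimate(quantiles, output_type="mean"):
    """Collapse sorted quantile predictions for one row into a single number."""
    if output_type == "mean":
        # Averaging evenly spaced quantiles approximates the distribution's mean.
        return sum(quantiles) / len(quantiles)
    if output_type == "median":
        return quantiles[len(quantiles) // 2]  # the 0.5 quantile
    if output_type == "mode":
        # Density is highest where adjacent quantiles sit closest together.
        gaps = [b - a for a, b in zip(quantiles, quantiles[1:])]
        i = gaps.index(min(gaps))
        return 0.5 * (quantiles[i] + quantiles[i + 1])
    raise ValueError(f"unknown output_type: {output_type!r}")

# Hypothetical stand-in for one row of output: quantiles of a right-skewed log-normal.
standard_normal = NormalDist()
quantiles = [math.exp(0.5 * standard_normal.inv_cdf(p)) for p in levels]

estimates = {kind: point_estimate(quantiles, kind) for kind in ("mean", "median", "mode")}
print(estimates)  # for a right-skewed distribution: mode < median < mean
```

On symmetric predictive distributions the three estimates nearly coincide. They separate on skewed targets, which is when the choice of output_type matters.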
To run from a local checkpoint instead of the Hub default, pass a path:
model = SynthefyTabularRegressor(model_path="path/to/checkpoint.pt")
This checkpoint is public: the first inference call downloads and caches it
automatically, with no token and no access request. A Hugging Face token (read scope)
is only worth setting if you hit anonymous download rate limits — provide it via
export HF_TOKEN=hf_..., hf auth login, or SynthefyTabularRegressor(token="hf_...").
How it works
Architecture
A FeaturesTransformer (~5.9M parameters) that alternates two kinds of attention:
- Feature attention learns relationships between columns.
- Sample attention learns relationships between rows (context and query).
Predictions are made by in-context learning: they condition on labeled context rows, with no gradient updates at inference.
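The alternation between the two attention types can be sketched in plain Python. This is a toy under stated assumptions, not the FeaturesTransformer itself: it uses a single head with identity projections, omits residual connections, normalization, and feed-forward blocks, and assumes every row attends only to the labeled context rows. The helper names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys):
    """Single-head dot-product attention; keys double as values (identity projections)."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)])
    return out

def feature_attention(table):
    # Within each row, every column embedding attends to all columns of that row.
    return [attend(row, row) for row in table]

def sample_attention(table, n_context):
    # Within each column, every row attends to the labeled context rows only.
    n_rows, n_cols = len(table), len(table[0])
    out = [[None] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        column = [table[r][c] for r in range(n_rows)]
        mixed = attend(column, column[:n_context])
        for r in range(n_rows):
            out[r][c] = mixed[r]
    return out

# table[row][column] is an embedding vector: 3 context rows, then 1 query row.
table = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[1.0, 1.0], [0.5, -0.5]],
    [[0.5, 0.5], [-1.0, 0.2]],  # query row
]
hidden = sample_attention(feature_attention(table), n_context=3)
```

Because the keys come only from context rows in this sketch, changing the query row leaves the context representations untouched. That is one common way to keep predictions for different query rows independent of each other.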
Key config: 16 transformer layers, embed_dim 128, hidden 384, 2 heads, the v2-lite
block (SwiGLU + RMSNorm + pre-norm), features grouped in pairs (features_per_group=2),
with column-specific y-aware feature attention. Features are encoded with RBF
embeddings; missing values are handled natively via learned mask embeddings. The
regression head predicts a full distribution over 999 quantiles (pinball loss).
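For readers unfamiliar with the pinball (quantile) loss named above, the following sketch shows the standard definition for one target and one quantile level, and checks on a toy sample that minimizing it at level 0.5 recovers the median. The function names are illustrative and are not taken from the training code.

```python
def pinball_loss(y_true, y_pred, tau):
    """Pinball (quantile) loss for one target at quantile level tau in (0, 1)."""
    diff = y_true - y_pred
    return max(tau * diff, (tau - 1.0) * diff)

def mean_pinball_loss(y_true, quantile_preds, levels):
    """Average the loss over all predicted quantile levels for one row."""
    total = sum(pinball_loss(y_true, q, t) for q, t in zip(quantile_preds, levels))
    return total / len(levels)

# For a high quantile, under-predicting costs more than over-predicting by the same margin.
under = pinball_loss(y_true=3.0, y_pred=2.0, tau=0.9)  # tau * 1.0
over = pinball_loss(y_true=1.0, y_pred=2.0, tau=0.9)   # (1 - tau) * 1.0

# At tau = 0.5 the loss is half the absolute error, so its minimizer is the median.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
best = min(sample, key=lambda q: sum(pinball_loss(y, q, 0.5) for y in sample))
print(best)  # 3.0, unaffected by the outlier at 100.0
```

Training one head against many levels at once, as in mean_pinball_loss, pushes each output toward its own quantile, which is how a single forward pass can describe a full predictive distribution.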
Synthetic data
The model never sees real data during training. Its capability comes from a diverse synthetic data generator covering real-world tabular regimes:
- Structural Causal Models (SCM): hierarchical DAGs with 8 edge-function types (MLP, decision tree, piecewise-linear, polynomial, periodic, RBF, log/exp, conv1d).
- Regression priors: 9 target families (dense/sparse linear, GAM, interactions, random MLP, random tree, radial/RBF, Fourier features, chained trigonometric).
- Realism augmentations: discretized features, noise features, correlated blocks, structural missingness, label noise.
- Learnability filter: an ExtraTrees signal-quality filter rejects unlearnable datasets so training compute is spent on learnable tasks.
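The SCM prior in the list above can be illustrated with a deliberately small generator. It is a sketch under assumptions, not the project's data pipeline: it uses scalar toy versions of four of the named edge-function types, a flat rather than hierarchical DAG, and a hypothetical sample_scm_dataset helper.

```python
import math
import random

# Four of the eight edge-function types named above, as toy scalar maps.
EDGE_FUNCTIONS = {
    "piecewise_linear": lambda v: max(0.0, v),
    "polynomial": lambda v: v * v,
    "periodic": math.sin,
    "rbf": lambda v: math.exp(-v * v),
}

def sample_scm_dataset(n_rows, n_nodes, seed):
    """Draw one synthetic regression table from a random structural causal model."""
    rng = random.Random(seed)
    names = sorted(EDGE_FUNCTIONS)
    # Node j may only depend on nodes i < j, so the graph is acyclic by construction.
    parents = [
        [(i, rng.uniform(-1.0, 1.0), rng.choice(names)) for i in range(j) if rng.random() < 0.5]
        for j in range(n_nodes)
    ]
    rows = []
    for _ in range(n_rows):
        values = []
        for j in range(n_nodes):
            # Root nodes are pure noise; child nodes add small noise to their parents' signal.
            value = rng.gauss(0.0, 1.0 if not parents[j] else 0.1)
            for i, weight, name in parents[j]:
                value += weight * EDGE_FUNCTIONS[name](values[i])
            values.append(value)
        rows.append(values)
    # Use the last node as the regression target and the rest as features.
    X = [row[:-1] for row in rows]
    y = [row[-1] for row in rows]
    return X, y

X, y = sample_scm_dataset(n_rows=100, n_nodes=6, seed=0)
```

When the target node happens to draw no parents, y is pure noise. That is the kind of dataset a learnability filter is meant to reject before it consumes training compute.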
Training runs entirely on synthetic data and trains to completion — there is no real-data validation in the loop, so no benchmark data is needed to train and no eval signal influences checkpoint selection. See the training guide for the full curriculum recipe.
Intended use & limitations
- Intended for small-to-medium tabular regression where in-context learning is attractive (no per-task training).
- Limitations: the model currently trails the best baselines on large-N / long-context TabArena datasets, and dense O(N²) sample attention bounds the practical context size. Very large tables are the focus of the large-table training stages.
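A back-of-the-envelope count shows why quadratic sample attention limits context size. The sketch assumes every row attends to every other row and counts pairwise scores per layer and per feature group, using the 2 heads from the config. Fused attention kernels may avoid storing all of these scores at once, but the number of score computations still grows the same way.

```python
def attention_score_count(n_rows, n_heads=2):
    """Pairwise attention scores for dense attention over n_rows rows."""
    return n_rows * n_rows * n_heads

for n_rows in (1_000, 10_000, 50_000):
    print(f"{n_rows:>6} rows -> {attention_score_count(n_rows):,} scores")

# Doubling the number of rows quadruples the work.
ratio = attention_score_count(20_000) / attention_score_count(10_000)
```

At the 50K-row context used in the benchmark runs, this comes to 5 billion scores per layer and feature group, 2,500 times the count at 1K rows.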
Citation
@software{synthefy_tabular_2026,
title = {Synthefy Tabular: A Tabular Foundation Model Trained on Synthetic Data},
author = {Synthefy},
year = {2026},
url = {https://github.com/Synthefy/synthefy-tabular}
}
License