from __future__ import annotations

TITLE = """<h1 align="center" id="space-title">TabArena Leaderboard for Predictive Machine Learning on IID Tabular Data</h1>"""

INTRODUCTION_TEXT = """
TabArena is a living benchmark system for predictive machine learning on tabular data.
The goal of TabArena and its leaderboard is to assess the peak performance of
model-specific pipelines.

Expand the boxes below to learn more about the datasets, models, metrics, and reference pipelines.
You can find more details and links to additional resources in the `About` section below.
"""

OVERVIEW_DATASETS = """
The leaderboard is based on a manually curated collection of
51 tabular classification and regression datasets for independent and identically distributed
(IID) data, spanning the small to medium data regime. The datasets were carefully
curated to represent various real-world predictive machine learning use cases.

**Subsets:** We present results for various subsets of the datasets based on tasks and dataset size. Select your
subset of interest from the tabs above the leaderboard.
"""

OVERVIEW_MODELS = """
The focus of the leaderboard is on model-specific pipelines. Each pipeline
is evaluated with its default hyperparameter configuration, with tuned hyperparameters, and as an
ensemble of tuned configurations. Each model is implemented in a tested real-world pipeline that was
optimized to get the most out of the model by the maintainers of TabArena and, where
possible, together with the authors of the model.

**Verified Models:** Some models were contributed and evaluated, but their implementations have
not been verified by the original authors or maintainers of the model. We indicate whether the implementation
of a model is verified in an extra column in the leaderboard. Results for unverified and recent models should
be interpreted with more caution. We count an older, stable model (such as XGBoost, LightGBM,
CatBoost, Random Forests, or baselines) as verified as long as its implementation is verified by
the maintainers of TabArena.
"""

OVERVIEW_METRICS = """
The leaderboards are ranked based on Elo. We present several additional
metrics. See `About` for more information on the metrics.

**Imputation:** We also present results with imputation. The `Imputed` tab presents all results where we impute the
performance for models that cannot run on all datasets due to task or dataset size constraints. In general, imputation
understates a model's performance, penalizing it for not being able to run on all datasets.

**Repeats:** We also present results for TabArena-Lite, where we run each experiment only once per dataset instead of
multiple times. By selecting the `Lite` tab, you can see results for TabArena-Lite. Results for TabArena-Lite are
less reliable than for `All Repeats` but often provide a good proxy for the overall performance while being much
cheaper to compute.
"""

OVERVIEW_REF_PIPE = """
The leaderboard includes reference pipelines, which are applied
independently of the tuning protocol and constraints we constructed for models within TabArena.
A reference pipeline aims to represent the performance quickly achievable by a
practitioner on a dataset. The current reference pipeline is the predictive machine
learning system AutoGluon. AutoGluon represents an ensemble pipeline across various model
types and thus provides a reference for model-specific pipelines.
"""

ABOUT_TEXT = r"""
### Extended Overview of TabArena (References / Papers)
We introduce TabArena and provide an overview of TabArena-v0.1.1 in our paper: https://tabarena.ai/paper-tabular-ml-iid-study.
Moreover, you can find a presentation of TabArena-v0.1.1 here: https://www.youtube.com/watch?v=mcPRMcJHW2Y

### Using TabArena for Benchmarking
To compare your own methods to the pre-computed results for all models on the leaderboard,
you can use the TabArena framework. For examples of how to use TabArena for benchmarking,
please see https://tabarena.ai/code-examples

### Contributing to the Leaderboard; Contributing Models
For guidelines on how to contribute your model to TabArena, or the results of your model
to the official leaderboard, please see the appendix of our paper: https://tabarena.ai/paper-tabular-ml-iid-study.

### Contributing Data
For anything related to the datasets used in TabArena, please see https://tabarena.ai/data-tabular-ml-iid-study

---

### Leaderboard Documentation
The leaderboard is ranked by Elo and includes several other metrics. Here is a short
description of these metrics:

#### Elo
We evaluate models using the Elo rating system. Elo is a
pairwise comparison-based rating system where each model's rating predicts its expected
win probability against others, with a 400-point Elo gap corresponding to a 10 to 1
(91\%) expected win rate. We calibrate 1000 Elo to the performance of our default
random forest configuration across all figures, and perform bootstrapping
to obtain 95\% confidence intervals. Elo scores are computed using ROC AUC for binary
classification, log-loss for multiclass classification, and RMSE for regression.
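
For illustration, a minimal sketch of the expected win probability implied by an Elo gap
(not the exact leaderboard implementation):

```python
# Illustrative sketch: expected win probability of model A over model B
# under the standard Elo model; not the exact leaderboard code.
def elo_expected_win_rate(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

elo_expected_win_rate(1400, 1000)  # 400-point gap -> ~0.91 (10 to 1 odds)
```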

#### Improvability
We introduce improvability as a metric that measures, in percent, how much lower the error
of the best method is than that of the current method on a dataset. This is then averaged over
datasets. Formally, for a single dataset i, improvability is (err_i - besterr_i)/err_i * 100\%,
where err_i is the error of the current method and besterr_i is the lowest error of any method on that dataset.
Improvability is always between 0\% and 100\%.
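
For illustration, a minimal sketch of this computation with hypothetical inputs
(not the exact leaderboard code):

```python
# Illustrative sketch: improvability of one method, averaged over datasets.
# `errors[i]` is the method's error on dataset i; `best_errors[i]` is the
# lowest error of any method on dataset i (hypothetical inputs).
def improvability(errors: list[float], best_errors: list[float]) -> float:
    per_dataset = [(err - best) / err * 100 for err, best in zip(errors, best_errors)]
    return sum(per_dataset) / len(per_dataset)
```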

#### Score
Following TabRepo, we compute a normalized score to provide an additional relative
comparison. We linearly rescale the error such that the best method has a normalized
score of 1 and the median method has a normalized score of 0. Scores below zero
are clipped to zero. These scores are then averaged across datasets.
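
For illustration, a minimal sketch of the normalized score on a single dataset with hypothetical inputs
(not the exact leaderboard code, including how ties between best and median errors are handled):

```python
# Illustrative sketch: normalized score of one method on one dataset.
# `err` is the method's error; `best_err` and `median_err` are the best and
# median errors over all methods on that dataset (hypothetical inputs).
def normalized_score(err: float, best_err: float, median_err: float) -> float:
    if median_err == best_err:  # degenerate case: best and median coincide
        return 1.0 if err <= best_err else 0.0
    score = (median_err - err) / (median_err - best_err)
    return max(score, 0.0)  # clip scores below zero to zero
```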

#### Average Rank
Ranks of methods are computed on each dataset (lower is better) and averaged.

#### Harmonic Rank
We compute the harmonic mean of ranks across datasets. The harmonic mean of ranks,
1/((1/N) * sum(1/rank_i for i in range(N))), more strongly favors methods having very
low ranks on some datasets. It therefore favors methods that are sometimes very good
and sometimes very bad over methods that are always mediocre, as the former are more
likely to be useful in conjunction with other methods.
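
For illustration, a minimal sketch of the average and harmonic rank aggregations with hypothetical inputs
(not the exact leaderboard code):

```python
# Illustrative sketch: aggregate a method's per-dataset ranks (lower is better).
def average_rank(ranks: list[float]) -> float:
    return sum(ranks) / len(ranks)

def harmonic_rank(ranks: list[float]) -> float:
    # Harmonic mean 1/((1/N) * sum(1/rank_i)); dominated by the lowest ranks.
    return 1.0 / (sum(1.0 / r for r in ranks) / len(ranks))
```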

---

### Contact
For most inquiries, please open issues in the relevant GitHub repository or here on
HuggingFace.
For any other inquiries related to TabArena, please reach out to: mail@tabarena.ai

### Core Maintainers
The current core maintainers of TabArena are:
[Nick Erickson](https://github.com/Innixma),
[Lennart Purucker](https://github.com/LennartPurucker/),
[Andrej Tschalzev](https://github.com/atschalz),
[David Holzmüller](https://github.com/dholzmueller)
"""

CITATION_BUTTON_LABEL = (
    "If you use TabArena or the leaderboard in your research, please cite the following:"
)
CITATION_BUTTON_TEXT = r"""@inproceedings{erickson2025tabarena,
  title = {TabArena: A Living Benchmark for Machine Learning on Tabular Data},
  author = {Erickson, Nick and Purucker, Lennart and Tschalzev, Andrej and Holzm{\"u}ller, David and Desai, Prateek Mutalik and Salinas, David and Hutter, Frank},
  booktitle = {Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS)},
  year = {2025},
  url = {https://arxiv.org/abs/2506.16791}
}
"""

VERSION_HISTORY_BUTTON_TEXT = """
**Current Version: TabArena-v0.1.2.1**

The following details updates to the leaderboard (date format is YYYY/MM/DD):
* 2025/12/18-v0.1.2.1:
    * Make tuning trajectories start from the default configuration.
    * UI improvements and more user-friendly explanations.
* 2025/11/22-v0.1.2: Add newest version of the TabArena leaderboard for NeurIPS 2025.
    * New UI and new leaderboard subsets for different dataset sizes, tasks, and imputation + general polish.
    * Some metrics have been refactored and made more stable (see GitHub for details).
    * Updated reference pipeline to include AutoGluon v1.4 with the extreme preset.
    * Updated existing models: RealMLP, TabDPT, EBM
    * Add new verified models: Mitra, xRFM, RealTabPFN-v2.5
    * Add new unverified models: TabFlex, BetaTabPFN, LimiX
* 2025/06/13-v0.1.1: Add data for all subsets and re-runs on GPU; add leaderboards for subsets;
  new overview; add figures to the leaderboards.
* 2025/05-v0.1.0: Initialization of the TabArena-v0.1 leaderboard.

Old leaderboards (with major changes) can be found at:
* TabArena-v0.1 and TabArena-v0.1.1: https://huggingface.co/spaces/TabArena-Legacy/TabArena-v0.1.1
"""