MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources Paper • 2509.25531 • Published Sep 29, 2025 • 8
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution Paper • 2510.08697 • Published Oct 9, 2025 • 36
Bridging the Data Provenance Gap Across Text, Speech and Video Paper • 2412.17847 • Published Dec 19, 2024 • 10
Consent in Crisis: The Rapid Decline of the AI Data Commons Paper • 2407.14933 • Published Jul 20, 2024 • 14
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions Paper • 2406.15877 • Published Jun 22, 2024 • 48
Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order Paper • 2404.00399 • Published Mar 30, 2024 • 42
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning Paper • 2402.06619 • Published Feb 9, 2024 • 56
BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing Paper • 2206.15076 • Published Jun 30, 2022 • 5
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model Paper • 2211.05100 • Published Nov 9, 2022 • 35
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset Paper • 2303.03915 • Published Mar 7, 2023 • 7