itsnotsplat's Collections

Pretraining

updated 17 days ago

This is general pretraining data for training a model from scratch, totaling roughly 5.37 trillion tokens.


  • ronantakizawa/github-top-code

    Viewer • Updated about 1 month ago • 1.12M • 1.9k • 121

  • HuggingFaceFW/fineweb-edu

    Viewer • Updated Jul 11, 2025 • 3.5B • 291k • 997

  • openbmb/UltraData-Math

    Viewer • Updated Feb 20 • 181M • 48.5k • 264

  • nick007x/github-code-2025

    Viewer • Updated Oct 15, 2025 • 147M • 5.49k • 116

  • angie-chen55/python-github-code

    Viewer • Updated May 31, 2022 • 7.23M • 1.24k • 37

  • tiiuae/falcon-refinedweb

    Viewer • Updated Jun 20, 2023 • 968M • 15.2k • 897

  • nick007x/arxiv-papers

    Viewer • Updated Oct 14, 2025 • 2.55M • 5.41k • 179

  • hoskinson-center/proof-pile

    Viewer • Updated Aug 19, 2023 • 363k • 1.66k • 63
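To get a rough sense of how the listed datasets compare in size, the sketch below parses the abbreviated counts from the entries above and computes each dataset's share of the total. It assumes that the first figure after the update date on each entry is the row count reported by the dataset viewer; that reading is an assumption, and the names `ROW_COUNTS`, `parse_count`, and `row_shares` are hypothetical helpers introduced only for illustration. Row counts are not token counts, so these shares say nothing definite about the ~5.37 trillion token figure.

```python
# Counts copied from the collection listing. ASSUMPTION: the first figure
# after the update date on each entry is the dataset's row count.
ROW_COUNTS = {
    "ronantakizawa/github-top-code": "1.12M",
    "HuggingFaceFW/fineweb-edu": "3.5B",
    "openbmb/UltraData-Math": "181M",
    "nick007x/github-code-2025": "147M",
    "angie-chen55/python-github-code": "7.23M",
    "tiiuae/falcon-refinedweb": "968M",
    "nick007x/arxiv-papers": "2.55M",
    "hoskinson-center/proof-pile": "363k",
}

# Multipliers for the abbreviated suffixes used in the listing.
_SUFFIX = {"k": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text: str) -> int:
    """Convert an abbreviated count such as '1.12M' into an integer."""
    text = text.strip()
    if text and text[-1] in _SUFFIX:
        return round(float(text[:-1]) * _SUFFIX[text[-1]])
    return int(text)


def row_shares(counts: dict[str, str]) -> tuple[int, dict[str, float]]:
    """Return the total row count and each dataset's fraction of it."""
    parsed = {name: parse_count(value) for name, value in counts.items()}
    total = sum(parsed.values())
    return total, {name: rows / total for name, rows in parsed.items()}


if __name__ == "__main__":
    total, shares = row_shares(ROW_COUNTS)
    print(f"total rows (approx.): {total:,}")
    # Largest share first.
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{share:7.2%}  {name}")
```

Because the listed figures are rounded, the result is only approximate. Under the stated assumption, the two web-crawl datasets would account for most of the rows, which may be worth keeping in mind when deciding how to weight code and math sources during sampling.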