This dataset is designed to post-train Metareasoning agents: agents whose job is to decide quickly (and, importantly, cheaply) whether a request warrants launching a full reasoning job or can be handled by a plain completions call.
The generation notebook (linked in the dataset) is open source and pretty well generalized, if I do say so myself, so you can use it to make your own Metareasoning datasets.
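For a feel of what "metareasoning" means here, below is a minimal sketch of the routing decision this kind of agent is post-trained to make. Every name in it (`route_query`, `cheap_complete`, `full_reason`, the toy classifier) is a hypothetical placeholder, not an API from the dataset or the notebook.

```python
# Minimal sketch of the metareasoning routing decision (hypothetical names throughout).

def route_query(query: str, classify) -> str:
    """Cheaply decide whether `query` deserves a full reasoning run.

    `classify` is any inexpensive call (e.g. a small model post-trained on a
    metareasoning dataset) that returns "reason" or "complete".
    """
    verdict = classify(
        "Answer with exactly 'reason' or 'complete': does the following "
        "request need multi-step reasoning?\n\n" + query
    ).strip().lower()
    return full_reason(query) if verdict == "reason" else cheap_complete(query)


# Placeholder backends -- swap in your actual reasoning / completion calls.
def full_reason(query: str) -> str:
    return "<expensive reasoning answer>"

def cheap_complete(query: str) -> str:
    return "<cheap completion answer>"


# Example with a toy classifier that routes proof-style requests to reasoning.
print(route_query("Prove that 17 is prime.",
                  classify=lambda p: "reason" if "prove" in p.lower() else "complete"))
```

The whole point is that the classifier call is far cheaper than the reasoning call it sometimes avoids, so even a small post-trained router pays for itself quickly.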
Shoutout to @onekq for his inspiring comment on this topic.
Meta Researchers: How many compute hours should we use to train Llama 3.1? Mr. Zuck: Yes!
The good folks at @AIatMeta did not just release the models; they also published a detailed 92-page paper on their findings and on the technical aspects of the models and their training process!
Generally, we just gobble up the weights and forget about the compute infrastructure used to train these models.
Here are some interesting findings about the computing infrastructure of Llamas:
- Llama 1 and 2 models were trained on @Meta's AI Research SuperCluster. Llama 3 was migrated to Meta's production clusters!
- That's 16,000 H100 GPUs, each with a 700W TDP and 80GB of HBM3, arranged on Meta's Grand Teton AI server platform.
- What about storing checkpoints? They used Tectonic, Meta's distributed file system, with capacity reaching 240 PB and peak throughput of 7 TB/s.
- Meta's mad lads saved each GPU's model state, ranging from 1 MB to 4 GB per GPU, for recovery and debugging.
If this sounds big, well, they also document the humongous challenges that come with it:
- During a 54-day snapshot of pre-training, there were 466 job interruptions.
- About 78% of unexpected interruptions were attributed to confirmed or suspected hardware issues. Mostly GPUs!
- Saving all checkpoints is cool until you do it for the 405B-parameter model. The bursty nature of checkpoint writes, essential for state-saving during training, periodically saturated the storage fabric and hurt performance (see the back-of-envelope sketch after this list).
- With all this, effective training time, measured as the time spent on useful training divided by the elapsed time, was still higher than 90%.
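Here's a quick back-of-envelope sketch of why those last two points bite. The 16-bytes-per-parameter figure is my own assumption for mixed-precision training state, not a number from the paper; the parameter count, peak throughput, 54-day window, and >90% figure come from the thread above.

```python
# Back-of-envelope sketch; BYTES_PER_PARAM is an assumption, the rest are from the thread.

PARAMS = 405e9                 # Llama 3.1 405B parameter count
BYTES_PER_PARAM = 16           # assumed: bf16 weights + grads, fp32 master weights + Adam moments
PEAK_TB_PER_S = 7              # Tectonic peak throughput reported in the paper

state_tb = PARAMS * BYTES_PER_PARAM / 1e12   # ~6.5 TB of training state per checkpoint
burst_s = state_tb / PEAK_TB_PER_S           # ~1 s even at peak throughput -- and every
                                             # rank flushes at once, so bursts can still
                                             # saturate the shared storage fabric

DAYS = 54
EFFECTIVE = 0.90                             # paper reports >90% effective training time
useful_days = DAYS * EFFECTIVE               # ~48.6 days of useful training
overhead_days = DAYS - useful_days           # under ~5.4 days lost to the 466 interruptions,
                                             # restarts, checkpointing, and other overhead

print(f"state ~ {state_tb:.1f} TB, burst ~ {burst_s:.1f} s at peak throughput")
print(f"useful ~ {useful_days:.1f} days, overhead <= {overhead_days:.1f} days")
```

Even sub-second-to-second bursts at peak throughput add up when thousands of ranks checkpoint simultaneously, which is exactly the saturation behavior described above.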
I think this is the stuff movies are made of!