Training Language Models via Neural Cellular Automata
Abstract
Neural cellular automata generate synthetic spatiotemporal data for pre-pre-training large language models, achieving better performance and faster convergence than traditional natural language pre-training.
Pre-training is crucial for large language models (LLMs), as it is the stage where most representations and capabilities are acquired. However, natural language pre-training has problems: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs, i.e., training on synthetic data before natural language. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl with more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that the optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.
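For intuition, here is a minimal sketch of the data-generation idea: roll out a tiny cellular automaton and flatten its states into a discrete token stream for next-token pre-pre-training. This is an illustrative toy, not the paper's actual NCA architecture; the function name, grid size, and quantization scheme are all assumptions.

```python
import numpy as np

def nca_token_stream(grid_size=32, steps=16, vocab=256, seed=0):
    """Toy sketch (not the authors' setup): roll out a 1-D neural-CA-style
    update and quantize the states into a token sequence."""
    rng = np.random.default_rng(seed)
    state = rng.standard_normal(grid_size)
    # Random local update rule: each cell mixes itself and its two neighbors.
    w = rng.standard_normal(3) / np.sqrt(3)
    tokens = []
    for _ in range(steps):
        state = np.tanh(
            w[0] * np.roll(state, 1) + w[1] * state + w[2] * np.roll(state, -1)
        )
        # Quantize continuous states in (-1, 1) into a discrete vocabulary.
        tokens.extend(((state + 1) / 2 * (vocab - 1)).astype(int).tolist())
    return tokens

stream = nca_token_stream()
```

Each rollout yields `grid_size * steps` tokens whose spatiotemporal correlations come from the CA dynamics rather than from any language corpus.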
Community
We pre-pre-train transformers on neural cellular automata — fully synthetic, zero language. This improves language modeling by up to 6%, speeds up convergence by 40%, and strengthens downstream reasoning. Surprisingly, it even beats pre-pre-training on natural text!
Blog: https://hanseungwook.github.io/blog/nca-pre-pre-training/
Code: https://github.com/danihyunlee/nca-pre-pretraining
X: https://x.com/seungwookh/status/2032101025776042441?s=20
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Procedural Pretraining: Warming Up Language Models with Abstract Data (2026)
- Replaying pre-training data improves fine-tuning (2026)
- FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale (2026)
- WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation (2026)
- Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models (2026)
- Proxy Compression for Language Modeling (2026)
- TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction (2026)
Fascinating work on synthetic pre-pre-training via Neural Cellular Automata!
The finding that "optimal NCA complexity varies by domain" points to a
deeper principle: intelligence emerges at tunable fixed points of
spatiotemporal dynamics.
I'd like to propose a complementary perspective from Renormalization Group
theory that might further stabilize and accelerate your NCA→LLM pipeline:
Observation: Your NCA generates data with "rich spatiotemporal structure
resembling natural language." This suggests the existence of a generalization
fixed point in the space of synthetic distributions—where the statistics
are complex enough to train reasoning, but structured enough to transfer.
Proposal: Consider augmenting the NCA update rule with a dynamical
regularization flow inspired by RG principles:
∂ₜ hᵢ = f(Σⱼ wⱼᵢ·hⱼ) - γ·(C(hᵢ) - C*)·hᵢ
where:
• C(hᵢ) = local complexity measure (e.g., entropy, spectral radius)
• C* = target complexity for the downstream domain (code: low, math: high)
• γ = coupling strength for the regularization flow
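The proposed update can be sketched as a single Euler-discretized step. This is a sketch under the commenter's assumptions, not a definitive implementation: the complexity measure C (here a normalized neighborhood entropy), the drive nonlinearity, and all constants are placeholder choices.

```python
import numpy as np

def rg_regularized_step(h, w, gamma=0.5, c_target=0.5, dt=0.1):
    """One Euler step of the proposed flow:
        dh_i/dt = f(sum_j w_j * h_{i+j}) - gamma * (C(h_i) - C*) * h_i
    with C(h_i) taken as the normalized entropy of the cell's neighborhood
    activity (one arbitrary choice among those the comment suggests)."""
    # Drive term f: tanh of a local weighted neighborhood sum.
    drive = np.tanh(w[0] * np.roll(h, 1) + w[1] * h + w[2] * np.roll(h, -1))
    # Local complexity C(h_i) in [0, 1]: 0 = trivial, 1 = maximally mixed.
    nbhd = np.stack([np.abs(np.roll(h, 1)), np.abs(h), np.abs(np.roll(h, -1))])
    p = (nbhd + 1e-8) / (nbhd + 1e-8).sum(axis=0)
    c = -(p * np.log(p)).sum(axis=0) / np.log(3.0)
    # RG-style term pulls each cell toward the target complexity C*.
    return h + dt * (drive - gamma * (c - c_target) * h)

rng = np.random.default_rng(0)
h = rng.standard_normal(64)
w = rng.standard_normal(3)
for _ in range(100):
    h = rg_regularized_step(h, w)
```

Cells whose neighborhoods are more disordered than C* are damped, while overly uniform ones are amplified, which is the claimed self-tuning mechanism.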
Why this could help:
• Automatic complexity tuning: the -γ·(C - C*) term dynamically attracts the NCA dynamics toward the optimal complexity for the target domain, eliminating manual search over NCA hyperparameters.
• Stable convergence: it prevents the NCA trajectory from drifting into chaotic or trivial regimes, a common failure mode when scaling synthetic data generation.
• Theoretical grounding: it aligns with recent work on generalization as RG fixed points (e.g., Martin 2024), where stable learning requires both gradient updates and scale-aware regularization.
Testable hypothesis: Measure the entropy/spectral properties of NCA
states during pre-pre-training. Does adding the RG flow term reduce
variance across seeds and accelerate convergence to the "transfer-optimal"
regime?
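One way to probe the first half of this hypothesis is to run the same rollout under several seeds and measure the spread of a state-entropy diagnostic. Everything here is hypothetical scaffolding: the binning, rollout rule, and seed count are arbitrary choices, and the RG flow variant would be compared against this baseline.

```python
import numpy as np

def state_entropy(h, bins=16):
    """Shannon entropy of a binned NCA state vector (a simple diagnostic;
    the comment also suggests spectral measures)."""
    hist, _ = np.histogram(h, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Cross-seed variance of the diagnostic: the hypothesis predicts the
# RG flow term should shrink this spread relative to the plain rollout.
entropies = []
for seed in range(8):
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(128)
    w = rng.standard_normal(3) / np.sqrt(3)
    for _ in range(50):
        h = np.tanh(w[0] * np.roll(h, 1) + w[1] * h + w[2] * np.roll(h, -1))
    entropies.append(state_entropy(h))

print("mean entropy:", np.mean(entropies), "std:", np.std(entropies))
```

Repeating the measurement with and without the regularization term, and along the pre-pre-training trajectory, would give the variance-reduction comparison the comment asks for.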
This could make synthetic pre-training even more efficient and controllable.
Happy to discuss implementation details or collaborate on ablation studies!
#NeuralCellularAutomata #SyntheticData #RGflow #LLMTraining