PonderLM-2-Pythia-1.4b

A Pythia-1.4b architecture pretrained with PonderLM-2, the method introduced in "PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space" (ICML 2026 Spotlight).

TL;DR. Chain-of-Thought scales test-time compute by generating extra tokens. PonderLM-2 applies the same idea during pretraining, but in continuous space: before predicting each next token, the model first emits a few latent thoughts (extra last-hidden-state vectors) and feeds them back into itself. Result: this 1.4B model, trained on 300B Pile tokens, beats vanilla Pythia-2.8B at equal inference FLOPs, both on language modelling and on a range of downstream tasks.

   vanilla:      x₁ ──► x₂ ──► x₃ ──► x₄

   PonderLM-2:   x₁ ──► z₁ ──► x₂ ──► z₂ ──► x₃ ──► z₃ ──► x₄ ──► z₄
                       z_i = latent thought emitted before predicting x_{i+1}
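
The layout in the diagram can be sketched in a few lines of plain Python. This is only an illustration of the interleaving and of the rough cost it implies, not the model's actual forward pass (that lives in the custom modeling code shipped with the checkpoint). The function names, the `k` parameter, and the "about 2 FLOPs per parameter per forward pass" rule of thumb are assumptions introduced here. The diagram shows one latent thought per token, so `k=1` is used; the extra attention cost of the longer sequence is ignored.

```python
def interleave(tokens, k=1):
    """Return the processed sequence: each token followed by k latent-thought slots."""
    seq = []
    for i, tok in enumerate(tokens, start=1):
        seq.append(tok)
        # Hypothetical labels: z1, z2, ... for k == 1, z1.1, z1.2, ... otherwise.
        seq.extend(f"z{i}" if k == 1 else f"z{i}.{j}" for j in range(1, k + 1))
    return seq


def flops_per_token(n_params, k=0):
    """Rule-of-thumb cost: ~2 FLOPs per parameter per pass, (1 + k) passes per token."""
    return 2 * n_params * (1 + k)


print(interleave(["x1", "x2", "x3", "x4"]))
# ['x1', 'z1', 'x2', 'z2', 'x3', 'z3', 'x4', 'z4']

# Under these assumptions, a 1.4B model with one latent thought per token
# costs about as much per generated token as a vanilla 2.8B model.
print(flops_per_token(1_400_000_000, k=1) == flops_per_token(2_800_000_000, k=0))
# True
```

Under this approximation, the comparison against Pythia-2.8B in the TL;DR is a like-for-like comparison of inference cost rather than of parameter count.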

Usage

The model ships with a custom modeling_gpt_neox.py that runs the pondering forward pass. Loading via AutoModelForCausalLM requires trust_remote_code=True:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

ckpt = "zeng123/PonderLM-2-Pythia-1.4b"
tok = AutoTokenizer.from_pretrained(ckpt)
# trust_remote_code=True is needed so the custom pondering forward pass is loaded.
model = AutoModelForCausalLM.from_pretrained(
    ckpt,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda()

# No trailing space in the prompt: BPE tokenizers attach the space to the next word.
prompt = "The mitochondria is"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=64,
    use_cache=True,
)
print(tok.decode(out[0], skip_special_tokens=True))

Model details

Architecture: GPT-NeoX (Pythia family)
Parameters: 1.4B
Hidden size: 2048
Layers: 24
Attention heads: 16
Context length: 2048
Vocabulary: 50,304
Tokenizer: GPT-NeoX BPE (same as Pythia)
Precision: BF16
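
As a rough sanity check, the figures above can be combined into a back-of-the-envelope parameter count. The 12 × layers × hidden² estimate for the transformer blocks and the assumption of separate (untied) input and output embedding matrices are standard approximations supplied here, not numbers taken from this model card. Biases and layer norms are ignored, so the result is only approximate.

```python
hidden, layers, heads, vocab = 2048, 24, 16, 50304

head_dim = hidden // heads
# Per layer: ~4 * d^2 for attention projections + ~8 * d^2 for the MLP (assumed 4x expansion).
block_params = 12 * layers * hidden ** 2
# Assumption: input and output embeddings are separate matrices.
embed_params = 2 * vocab * hidden
total = block_params + embed_params

print(head_dim)                 # 128
print(f"{total / 1e9:.2f} B")   # 1.41 B
```

The estimate lands close to the stated 1.4B.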

Citation

@article{zeng2025ponderlm,
  title={{PonderLM-2}: Pretraining {LLM} with Latent Thoughts in Continuous Space},
  author={Zeng, Boyi and Li, He and Song, Shixiang and Wang, Yixuan and Wang, Zitong and He, Ziwei and Wang, Xinbing and Lin, Zhouhan},
  journal={arXiv preprint arXiv:2509.23184},
  year={2025}
}

Acknowledgements

Built on top of the Pythia training stack and LLaMA-Factory. The PonderLM baseline implementation is adapted from LUMIA-Group/PonderingLM.
