The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context
Abstract
StateLM enables language models to actively manage their own memory and context through internal reasoning loops and memory tools, significantly improving performance on long-document tasks and chat memory challenges.
In the world of Harry Potter, when Dumbledore's mind is overburdened, he extracts memories into a Pensieve to revisit later. In the world of AI, we already possess the Pensieve itself: mature databases and retrieval systems. Yet our models inexplicably lack the "wand" to operate it, remaining like a Dumbledore without agency, passively accepting a manually engineered context as their entire memory. This work finally places the wand in the model's hand. We introduce StateLM, a new class of foundation models endowed with an internal reasoning loop for managing their own state. We equip the model with a suite of memory tools, such as context pruning, document indexing, and note-taking, and train it to use these tools actively. By learning to dynamically engineer its own context, the model breaks free from the architectural prison of a fixed window. Experiments across model sizes demonstrate StateLM's effectiveness in diverse scenarios. On long-document QA tasks, StateLMs consistently outperform standard LLMs at every scale; on the chat memory task, they achieve absolute accuracy improvements of 10% to 20% over standard LLMs. On the deep research task BrowseComp-Plus, the gap becomes even more pronounced: StateLM achieves up to 52% accuracy, whereas standard LLM counterparts hover around 5%. Ultimately, our approach shifts LLMs from passive predictors to state-aware agents, where reasoning becomes a stateful and manageable process.
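The abstract describes the mechanism but the page includes no code, so here is a minimal, hypothetical sketch of what such a reasoning loop could look like. The `Memory` class, the tool names (`prune`, `note`, `recall`), and the `model` callable are all illustrative assumptions, not StateLM's actual interface.

```python
# Hypothetical sketch of a StateLM-style loop: the model emits either an
# ordinary reasoning step or a memory action, and the harness applies memory
# actions to the model's own working context. All names here are assumptions.

from dataclasses import dataclass, field

@dataclass
class Memory:
    context: list[str] = field(default_factory=list)   # in-window context
    notes: dict[str, str] = field(default_factory=dict)  # out-of-window store

    def prune(self, index: int) -> None:
        """Free a span the model has judged stale (context pruning)."""
        del self.context[index]

    def take_note(self, key: str, text: str) -> None:
        """Persist a summary outside the context window (note-taking)."""
        self.notes[key] = text

    def recall(self, key: str) -> None:
        """Bring a stored note back into the working context."""
        self.context.append(self.notes[key])

def run(model, task: str, max_steps: int = 32) -> str | None:
    mem = Memory(context=[task])
    for _ in range(max_steps):
        # `model` is assumed to map the current context to an (action, arg)
        # pair, e.g. ("prune", 2), ("note", ("doc1", "summary...")),
        # ("recall", "doc1"), ("think", "..."), or ("answer", "...").
        action, arg = model(mem.context)
        if action == "answer":
            return arg
        elif action == "prune":
            mem.prune(arg)
        elif action == "note":
            mem.take_note(*arg)
        elif action == "recall":
            mem.recall(arg)
        else:  # plain reasoning step appended to the working context
            mem.context.append(arg)
    return None  # step budget exhausted
```

The design point the abstract emphasizes is that the model, not the harness designer, decides when to prune or take notes, so context management becomes part of the learned policy rather than fixed pre-processing.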
Community
It’s time to evolve from context engineering to Model-as-Context-Engineer.
By equipping LLMs with an intrinsic free() operation, effectively handing them the wand to master their own memory, we take a decisive step closer to AGI. This work presents the first agent architecture that generalizes remarkably well across Long-Document QA, Multi-Turn Dialogue, and Deep Search, proving that sustainable intelligence isn't just about remembering everything, but about forgetting accurately.
Great work!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Structured Reasoning for Large Language Models (2026)
- Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers (2026)
- LLMs Can't Play Hangman: On the Necessity of a Private Working Memory for Language Agents (2026)
- Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards (2026)
- AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts (2026)
- MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models (2026)
- UMEM: Unified Memory Extraction and Management Framework for Generalizable Memory (2026)
Masterpiece!