arxiv:2606.14667

Memento: Reconstruct to Remember for Consistent Long Video Generation

Published on Jun 12

Submitted by Longbin Ji on Jun 16

BAIDU

Authors:

Shuohuan Wang ,

Abstract

Memento is a subject-reconstruction-guided framework that improves long-form video generation by preserving recurring subjects through memory-based reconstruction and dual-query mechanisms.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.
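The dual-query idea in the abstract can be illustrated with a small retrieval sketch. This is not the authors' implementation: the abstract does not specify how queries are scored or how memory is stored, so the cosine-similarity scoring, the top-k selection, and the names `dual_query_retrieve`, `k_identity`, `k_context`, and `recent_window` are all assumptions made for illustration. The sketch only shows the separation of roles: one query searches the whole memory bank for identity evidence, while the other looks only at recent entries for short-range continuity.

```python
import numpy as np


def cosine_scores(query, keys):
    """Cosine similarity between one query vector and each row of `keys`."""
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    return k @ q


def dual_query_retrieve(memory, identity_query, context_query,
                        k_identity=2, k_context=2, recent_window=4):
    """Hypothetical dual-query lookup over a memory bank of feature rows.

    memory:          (n, d) array, one row per stored keyframe, oldest first.
    identity_query:  (d,) vector; searches the WHOLE memory (long-range).
    context_query:   (d,) vector; searches only the last `recent_window` rows.
    Returns two index arrays into `memory`: (identity_idx, context_idx).
    """
    n = memory.shape[0]

    # Long-range branch: rank every stored entry by similarity to the
    # identity query, so early appearances of a subject stay reachable.
    id_scores = cosine_scores(identity_query, memory)
    identity_idx = np.argsort(-id_scores)[:k_identity]

    # Short-range branch: restrict the search to the most recent entries,
    # then map the local ranks back to global memory indices.
    start = max(0, n - recent_window)
    ctx_scores = cosine_scores(context_query, memory[start:])
    context_idx = start + np.argsort(-ctx_scores)[:k_context]

    return identity_idx, context_idx
```

Keeping the two branches separate means a subject last seen many shots ago can still be retrieved by the identity query, even when the short-context window contains only unrelated recent frames. A learned, attention-based mechanism would replace the fixed cosine scoring used here.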

Community

robingg1 (paper submitter), about 3 hours ago

What if long-video subject drift is not caused by insufficient memory, but by the fact that the model is never required to truly remember?

Memento introduces a simple principle: if a memory bank really preserves a subject, the model should be able to reconstruct that subject from memory alone. By turning subject preservation into an explicit and verifiable objective, Memento achieves state-of-the-art long-term consistency across shots and scenes.
