🥯 BAGEL-World-model
An agentic, data-centric framework for producing large-scale interleaved Visual Question–Visual Answering (VQ-VA) data.
The BAGEL-World framework outputs high-quality VQ-VA data via the following steps:
🔄 Preprocessing
Filters noisy web-interleaved data and classifies it into design- and knowledge-related documents.
🤖 Agentic Pipeline for VQ-VA Data Creation
1. Retriever selects image pairs containing non-trivial transformations from interleaved documents that can serve as the basis for free-form questions.
2. Instruction Generator writes a natural-language question about one image so that the other image serves as the correct answer.
3. Filterer removes low-quality triplets ⟨Question Image, Question Text, Answer Image⟩.
4. Rewriter increases instruction diversity by producing multiple variants of the original questions.
5. Reasoner generates a language-based chain-of-thought explanation describing how the source image should be transformed to obtain the target image.
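The five stages above can be sketched as a simple composition of functions. This is only an illustrative outline, not the framework's actual code: `Sample`, `run_pipeline`, and every `toy_*` stage are hypothetical names, the stages are trivial string stubs standing in for the real agents, and keeping the original question alongside its rewritten variants is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical container for one candidate VQ-VA item."""
    question_image: str      # path or ID of the source image
    answer_image: str        # path or ID of the target image
    question: str = ""
    reasoning: str = ""


def run_pipeline(
    document_images: List[str],
    retrieve: Callable[[List[str]], Iterable[Tuple[str, str]]],
    generate: Callable[[str, str], str],
    keep: Callable[[Sample], bool],
    rewrite: Callable[[str], List[str]],
    reason: Callable[[Sample], str],
) -> List[Sample]:
    """Chain the stages in the order listed: retrieve, generate, filter, rewrite, reason."""
    results: List[Sample] = []
    for q_img, a_img in retrieve(document_images):            # 1. Retriever
        base = Sample(q_img, a_img, generate(q_img, a_img))    # 2. Instruction Generator
        if not keep(base):                                     # 3. Filterer
            continue
        # 4. Rewriter (assumption: the original question is kept as well)
        for variant in [base.question, *rewrite(base.question)]:
            sample = replace(base, question=variant)
            # 5. Reasoner
            results.append(replace(sample, reasoning=reason(sample)))
    return results


# Toy stand-ins for the real agents, for demonstration only.
def toy_retrieve(images: List[str]) -> List[Tuple[str, str]]:
    return list(zip(images, images[1:]))       # adjacent images as candidate pairs

def toy_generate(q_img: str, a_img: str) -> str:
    return f"What does {q_img} look like after the change?"

def toy_keep(sample: Sample) -> bool:
    return sample.question_image != sample.answer_image   # drop trivial pairs

def toy_rewrite(question: str) -> List[str]:
    return [f"Please show: {question}"]

def toy_reason(sample: Sample) -> str:
    return f"Transform {sample.question_image} into {sample.answer_image}."


samples = run_pipeline(
    ["a.png", "b.png", "b.png"],
    toy_retrieve, toy_generate, toy_keep, toy_rewrite, toy_reason,
)
for s in samples:
    print(s)
```

Passing the stages in as callables keeps the sketch independent of any particular model or API; in practice each stub would be replaced by a call to the corresponding agent.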
Finally, the framework outputs interleaved quadruplets:
- 🧠 Question Image
- 💬 Visual Question / Instruction
- 🔍 Reasoning Trace
- 🎨 Answer Image
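One possible way to store such a quadruplet is a small record serialized as JSON Lines. The sketch below is an assumption for illustration only: the class name `VQVARecord`, its field names, and the use of image paths rather than raw pixels are hypothetical and do not reflect the framework's released data format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class VQVARecord:
    """Hypothetical record mirroring the four output components, in order."""
    question_image: str    # path or ID of the question image
    question: str          # visual question / instruction text
    reasoning_trace: str   # chain-of-thought describing the transformation
    answer_image: str      # path or ID of the answer image


def to_jsonl_line(record: VQVARecord) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False)


def from_jsonl_line(line: str) -> VQVARecord:
    """Parse a JSON line back into a record."""
    return VQVARecord(**json.loads(line))


record = VQVARecord(
    question_image="images/0001_q.png",
    question="What would this room look like with the lights turned off?",
    reasoning_trace="Darken the scene and remove the lamp glow.",
    answer_image="images/0001_a.png",
)
line = to_jsonl_line(record)
restored = from_jsonl_line(line)
print(line)
```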
Stay tuned for updates and examples!