🥯 BAGEL-World-model

An agentic, data-centric framework for producing large-scale interleaved Visual Question–Visual Answering (VQ-VA) data.


The BAGEL-World framework outputs high-quality VQ-VA data via the following steps:

🔄Preprocessing

Filters noisy web-interleaved data and classifies it into design- and knowledge-related documents.

🤖Agentic Pipeline for VQ-VA Data Creation

1. Retriever selects image pairs containing non-trivial transformations from interleaved documents that can serve as the basis for free-form questions.

2. Instruction Generator writes a natural-language question about one image so that the other image serves as the correct answer.

3. Filterer removes low-quality triplets ⟨Question Image, Question Text, Answer Image⟩.

4. Rewriter increases instruction diversity by producing multiple variants of the original questions.

5. Reasoner generates a language-based chain-of-thought explanation describing how the source image should be transformed to obtain the target image.
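
As a rough illustration of how the five stages above could be chained, here is a minimal sketch in Python. It is not the framework's actual code: the `Sample` class, `run_pipeline`, and the five stage callables (`retrieve`, `generate`, `keep`, `rewrite`, `reason`) are hypothetical names chosen for this example, and images are represented as plain string identifiers. In practice each callable would presumably wrap a model call.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical record carried through the pipeline."""
    question_image: str               # path or ID of the source image
    answer_image: str                 # path or ID of the target image
    question: Optional[str] = None    # filled in by the Instruction Generator
    reasoning: Optional[str] = None   # filled in by the Reasoner


def run_pipeline(
    documents: Iterable[object],
    retrieve: Callable[[object], Iterable[Tuple[str, str]]],
    generate: Callable[[str, str], str],
    keep: Callable[[Sample], bool],
    rewrite: Callable[[str], Iterable[str]],
    reason: Callable[[Sample], str],
) -> List[Sample]:
    """Chain the five stages; every stage is an assumed, pluggable callable."""
    results: List[Sample] = []
    for doc in documents:
        # 1. Retriever: pick candidate (question image, answer image) pairs.
        for q_img, a_img in retrieve(doc):
            # 2. Instruction Generator: write a question answered by a_img.
            draft = Sample(q_img, a_img, question=generate(q_img, a_img))
            # 3. Filterer: drop low-quality triplets.
            if not keep(draft):
                continue
            # 4. Rewriter: emit one sample per question variant.
            for variant in rewrite(draft.question):
                sample = replace(draft, question=variant)
                # 5. Reasoner: attach a chain-of-thought explanation.
                results.append(replace(sample, reasoning=reason(sample)))
    return results
```

Passing the stages in as callables keeps the sketch independent of any particular model or API; whether the rewritten variants include the original question is left to the `rewrite` callable.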

Finally, the framework outputs interleaved quadruplets:

  • 🧠 Question Image
  • 💬 Visual Question / Instruction
  • 🔍 Reasoning Trace
  • 🎨 Answer Image
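
To make the quadruplet layout concrete, the sketch below stores one record as a JSON line. The on-disk format of the released data is not specified here, so the `VQVAQuadruplet` class, its field names, and the JSON Lines encoding are all assumptions made for illustration; images are referenced by file path rather than embedded.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class VQVAQuadruplet:
    """Hypothetical container mirroring the four listed components."""
    question_image: str    # path to the question image (assumed representation)
    question: str          # visual question / instruction
    reasoning_trace: str   # chain-of-thought explanation
    answer_image: str      # path to the answer image (assumed representation)


def to_jsonl_line(record: VQVAQuadruplet) -> str:
    """Serialize one quadruplet as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False)


def from_jsonl_line(line: str) -> VQVAQuadruplet:
    """Parse a JSON line back into a quadruplet."""
    return VQVAQuadruplet(**json.loads(line))
```

A record written with `to_jsonl_line` can be read back with `from_jsonl_line`, which makes it straightforward to stream a large dataset one line at a time.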

Stay tuned for updates and examples!
