Instructions to use internlm/ETCHR-FLUX.2-klein-9B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Diffusers
How to use internlm/ETCHR-FLUX.2-klein-9B with Diffusers:
pip install -U diffusers transformers accelerate
import torch from diffusers import DiffusionPipeline from diffusers.utils import load_image # switch to "mps" for apple devices pipe = DiffusionPipeline.from_pretrained("internlm/ETCHR-FLUX.2-klein-9B", dtype=torch.bfloat16, device_map="cuda") prompt = "Turn this cat into a dog" input_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png") image = pipe(image=input_image, prompt=prompt).images[0] - Diffusion Single File
How to use internlm/ETCHR-FLUX.2-klein-9B with Diffusion Single File:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- Notebooks
- Google Colab
- Kaggle
ETCHR-FLUX.2-klein-9B
๐Paper | ๐ Homepage | ๐คETCHR-FLUX.2-klein-9B Model | ๐คETCHR SFT-400K Dataset | ๐คETCHR GRPO-10K Dataset | ๐คDL3DV-2K Benchmark
ETCHR-FLUX.2-klein-9B is a novel question-conditioned, reasoning-aware image editor designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models. By decoupling the specialized image editor from the downstream understanding model, ETCHR bridges the critical bottleneck where a purely textual chain of thought fails in fine-grained focus or complex spatial transformations.
๐ข News
- ๐ [2026/05/22] We have released the training and evaluation code of ETCHR.
- ๐ [2026/05/21] We have released the ETCHR-FLUX.2-klein-9B Model, ETCHR-SFT-400K Dataset and ETCHR GRPO-10K Dataset.
๐ Overview
We are thrilled to introduce ETCHR (Editing To Clarify and Harness Reasoning), a novel question-conditioned, reasoning-aware image editor built on FLUX.2-klein-base-9B designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models (MLLMs). By decoupling the specialized image editor from the downstream understanding model, ETCHR bridges the critical bottleneck where a purely textual chain of thought fails in fine-grained focus or complex spatial transformations.
๐ก Highlights
- ๐ฅ Decoupled & Plug-and-Play: ETCHR functions as a separate module, allowing it to assist diverse downstream MLLMs (such as Qwen3-VL-8B, Gemini-3.1-Flash-Lite, or Kimi K2.5) without requiring any task-specific fine-tuning on the understanding models themselves.
- ๐ฅ Naturally Reflective Pipeline: Introduces an Edit-Verify-Reason inference mechanism where the understanding model filters out noisy or flawed edits, reverting safely to the original image when verification fails.
๐ Results
We evaluate ETCHR across five distinct task families spanning fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding. Across all evaluated backbones, ETCHR consistently yields major improvements in Pass@1 accuracy:
๐ ๏ธ Evaluation
Prepare your environment:
git clone https://github.com/InternLM/ETCHR.git
conda create -n ETCHR python==3.11
conda activate ETCHR
cd RL/Pref-GRPO
bash env_setup.sh fastvideo
pip install "vllm>=0.11.0"
pip install qwen-vl-utils==0.0.14
We Provide an example code running ETCHR on DL3DV-2K Benchmark in Evaluation/inference_dl3dv.py, you can start the evaluation with the following two steps:
Step 1: start a VLLM server for an understanding model (eg. Qwen3-VL-8B, Kimi K2.5, ...).
cd Evaluation
bash launch_vllm.sh
Step 2: Run ETCHR atop any understanding model
python inference_dl3dv.py
Cases
ETCHR can assist with a broad spectrum of understanding tasks, including fine-grained perception, chart reasoning, maze navigation, jigsaw puzzles, and 3D spatial understanding.
๐ License
Our work is based on FLUX.2-klein-base-9B, so please follow FLUX Non-Commercial License.
โ๏ธCitation
If you find this project useful, please kindly cite:
@article{zhang2026etchr,
title={ETCHR: Editing To Clarify and Harness Reasoning},
author={Beichen Zhang, Yuhong Liu, Jinsong Li, Yuhang Zang, Jiaqi Wang, Dahua Lin},
journal={arXiv preprint arXiv:2605.23897},
year={2026}
}
โค๏ธ Acknowledgement
The base model is FLUX.2-klein-base-9B, a powerful image-to-image model.
The work is built upon DiffSynth-Studio and Pref-GRPO, two excellent codebases for Diffusion models training!
- Downloads last month
- 112