
VisionEncoder

Hosted artifacts (derived data + trained checkpoints) for the VisionEncoder research project.

Training code + full reproduction guide: https://github.com/xiaomoguhz/VisionEncoder

The repo is organized into three top-level folders.

data/ - current (V9.x) reproduction data (~6.5G)

| Path | Content |
| --- | --- |
| data/vmllm_cached/qwen3vit/ | S2 cached_dataset arrow (image/video, 10pct + full); fed directly to stage-2 |
| data/ms-swift-data/ | sampled sharegpt jsonl (10pct + full) |
| data/llava_video/ | V9 decode-probed good_manifest for the video path |
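
Before training on the sampled JSONL files under data/ms-swift-data/, it can help to check that a download is intact and see which fields each record carries. The following is a minimal sketch using only the Python standard library. It assumes nothing about the record schema beyond "one JSON value per line", and the function name `summarize_jsonl` is hypothetical, not part of the project's code.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Count records and tally the top-level keys seen in a JSONL file.

    Assumes one JSON value per non-blank line; no particular schema is required.
    """
    n_records = 0
    key_counts = Counter()
    with Path(path).open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            n_records += 1
            if isinstance(record, dict):
                # Tally which top-level fields appear, and how often.
                key_counts.update(record.keys())
    return n_records, dict(key_counts)
```

Calling `summarize_jsonl` on one of the downloaded files returns the record count and a per-key tally, which makes truncated downloads or inconsistent records easy to spot. The actual file names inside data/ms-swift-data/ are not listed here, so substitute whatever the download produces.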

ckpts/ - ready-made 4B MLLM inference weights

| Path | Content |
| --- | --- |
| ckpts/4b_stock | 4B stock baseline (raw Qwen3.5 ViT, skips declip), checkpoint-505, 9.5G |
| ckpts/4b_v9_1 | 4B V9.1 (V-JEPA 2.1 video self-distill), checkpoint-505, 9.5G |

Download either checkpoint and pass it straight to evaluation (see section 4, "MLLM evaluation", in the GitHub README) to skip the declip, S1, and S2 stages.
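
Each checkpoint is listed at 9.5G, so a quick size check on the downloaded folder can catch an interrupted transfer before evaluation starts. This is a rough sketch, not a project tool: the helper names are hypothetical, and it reports binary GiB, which may differ slightly from the unit behind the "9.5G" figure.

```python
from pathlib import Path


def dir_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def format_gib(n_bytes):
    """Render a byte count in GiB with one decimal place."""
    return f"{n_bytes / 1024**3:.1f} GiB"
```

For example, `format_gib(dir_size_bytes("ckpts/4b_v9_1"))` should come out close to the listed size once the download has finished.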

legacy/ - historical assets (~368G)

Products of earlier research lines, not needed to reproduce the current main line: declip_siglip2/spatial_align, kd_mllm, self_refine, video_mllm_swift (old SigLIP2 / image-only S1+S2 ckpts), and old ViT-family arrow caches.

Download

# current dev data
huggingface-cli download xiaomoguhzz/VisionEncoder --include "data/*" --local-dir .
# ready-made 4B MLLM ckpt (eval directly)
huggingface-cli download xiaomoguhzz/VisionEncoder --include "ckpts/4b_v9_1/*" --local-dir .
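
The same selective download can be scripted from Python with `snapshot_download` from the `huggingface_hub` package, whose `allow_patterns` argument plays the role of `--include`. The sketch below assumes `huggingface_hub` is installed. The helpers `allow_patterns_for` and `download` are hypothetical names introduced for illustration.

```python
def allow_patterns_for(target):
    """Turn a repo folder such as "data" or "ckpts/4b_v9_1" into a glob list."""
    return [target.rstrip("/") + "/*"]


def download(target, local_dir="."):
    """Fetch only the files under `target` from the Hub repo into local_dir."""
    # Imported lazily so allow_patterns_for works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="xiaomoguhzz/VisionEncoder",
        allow_patterns=allow_patterns_for(target),
        local_dir=local_dir,
    )
```

Under these assumptions, `download("data")` mirrors the first command above and `download("ckpts/4b_v9_1")` mirrors the second.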

Related
