VisionEncoder
Hosted artifacts (derived data + trained checkpoints) for the VisionEncoder research project.
Training code + full reproduction guide: https://github.com/xiaomoguhz/VisionEncoder
The repo is organized into three top-level folders.
data/: current (V9.x) reproduction data (~6.5G)
| Path | Content |
|---|---|
| data/vmllm_cached/qwen3vit/ | S2 cached_dataset arrow (image/video, 10pct + full); fed directly to stage-2 |
| data/ms-swift-data/ | sampled sharegpt jsonl (10pct + full) |
| data/llava_video/ | V9 decode-probed good_manifest for the video path |
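After downloading the data/ folder, a quick sanity check of one of the JSONL files can help confirm the download is intact. The sketch below is a minimal example, not part of the project's tooling: the helper name `summarize_jsonl` is hypothetical, and it makes no assumption about the record schema beyond one JSON value per line.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path, limit=1000):
    """Return (records_read, top_level_key_counts) for the first `limit` records.

    Hypothetical helper for illustration; it only assumes one JSON value per line.
    """
    key_counts = Counter()
    records_read = 0
    with Path(path).open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if isinstance(record, dict):
                key_counts.update(record.keys())
            records_read += 1
            if records_read >= limit:
                break
    return records_read, dict(key_counts)
```

Point it at any file under data/ms-swift-data/ (the exact file names are not listed here, so substitute the one you downloaded) and compare the reported keys with what the training code expects.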
ckpts/: ready-made 4B MLLM inference weights
| Path | Content |
|---|---|
| ckpts/4b_stock | 4B stock baseline (raw Qwen3.5 ViT, skips declip), checkpoint-505, 9.5G |
| ckpts/4b_v9_1 | 4B V9.1 (V-JEPA 2.1 video self-distill), checkpoint-505, 9.5G |
Download either checkpoint and feed it straight to evaluation (see the GitHub README, section 4, "MLLM evaluation"). This skips the declip, S1, and S2 stages.
legacy/: historical assets (~368G)
Artifacts from earlier lines of work, not needed to reproduce the current main line: declip_siglip2/spatial_align, kd_mllm, self_refine, video_mllm_swift (old SigLIP2 / image-only S1+S2 ckpts), and old ViT-family arrow caches.
Download
```bash
# current dev data
huggingface-cli download xiaomoguhzz/VisionEncoder --include "data/*" --local-dir .

# ready-made 4B MLLM ckpt (eval directly)
huggingface-cli download xiaomoguhzz/VisionEncoder --include "ckpts/4b_v9_1/*" --local-dir .
```
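The same selective downloads can likely be done from Python with `huggingface_hub.snapshot_download`, whose `allow_patterns` argument plays the role of `--include`. The following is a minimal sketch under that assumption: the subset names (`data`, `ckpt_v9_1`, `ckpt_stock`) and the helpers `download_kwargs` and `fetch` are hypothetical, introduced only for illustration.

```python
from typing import Dict, List

REPO_ID = "xiaomoguhzz/VisionEncoder"

# Hypothetical subset names mapped to the glob patterns used in the CLI commands
# and the ckpts/ table above.
SUBSETS: Dict[str, List[str]] = {
    "data": ["data/*"],
    "ckpt_v9_1": ["ckpts/4b_v9_1/*"],
    "ckpt_stock": ["ckpts/4b_stock/*"],
}


def download_kwargs(subset: str, local_dir: str = ".") -> dict:
    """Build the keyword arguments for snapshot_download for one subset."""
    if subset not in SUBSETS:
        raise KeyError(f"unknown subset {subset!r}; choose from {sorted(SUBSETS)}")
    return {
        "repo_id": REPO_ID,
        "allow_patterns": SUBSETS[subset],
        "local_dir": local_dir,
    }


def fetch(subset: str, local_dir: str = ".") -> str:
    """Download one subset and return the local folder path."""
    # Imported lazily so the planning helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(subset, local_dir))


# Example (downloads roughly 9.5G, so it is left commented out):
# fetch("ckpt_v9_1")
```

Keeping the patterns in one table makes it harder to pull the ~368G legacy/ folder by accident, since only the listed subsets can be requested.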
Related
- Code + reproduction guide: https://github.com/xiaomoguhz/VisionEncoder