KV-Control (T-Concat v4 backbone)
Sparse-keyframe, multi-joint controllable text-to-motion generation. The repository at github.com/CHDTevior/KV-Control contains the full training and inference code.
What is here
| Path | Content | Size |
|---|---|---|
| base_t_concat_v4/model/net_best_fid.tar | Pre-trained T-Concat v4 masked-transformer base (the paper main backbone, Ep 400) | 168 MB |
| kv_control/model/net_best_top3.tar | Cross multi-joint KV-Control adapter; paper Tab 4 multi-joint block (net_best_top3 @ Ep 6000, control=cross) | 520 MB |
| kv_control_trajectory/model/net_best_kps.tar | Single-joint pelvis KV-Control adapter; paper Tab 4 headline row (net_best_kps @ Ep 6000, control=trajectory) | 520 MB |
| vqvae/net_best_fid.pth | Part-aware VQ-VAE tokenizer (128 codes × 6 parts) | 236 MB |
| vqvae/skeleton_partition.json | Skeleton partition for the part-aware VQ | 1 KB |
| stats/{mean,std}.npy | Normalization stats matching the released VQ | 4 KB |
| clip/ViT-B-32.pt | OpenAI CLIP ViT-B/32 visual + text encoder | 336 MB |
| t2m/Comp_v6_KLD005/opt.txt + meta/ | Frozen evaluation encoder config & stats | 3 KB |
| t2m/text_mot_match/model/finest.tar | Pre-trained text-motion eval encoder (Guo et al., 2022) | 235 MB |
| t2m/length_estimator/model/finest.tar | Pre-trained motion-length predictor | 1.7 MB |
| aux/body_models/ | SMPL neutral mesh + face / J_regressor (SMPL license) | 234 MB |
| aux/glove/ | Vocab files for the length estimator | 10 MB |
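
After downloading, it can help to confirm that every file listed above is actually present. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and it assumes the local directory mirrors the relative paths in the table under a single root that you pass in. Directory-only entries (meta/, aux/) are left out because their contents are not listed here.

```python
from pathlib import Path

# Relative paths copied from the table above. Treating them as children of a
# single local root directory is an assumption, not documented repo behavior.
EXPECTED_FILES = [
    "base_t_concat_v4/model/net_best_fid.tar",
    "kv_control/model/net_best_top3.tar",
    "kv_control_trajectory/model/net_best_kps.tar",
    "vqvae/net_best_fid.pth",
    "vqvae/skeleton_partition.json",
    "stats/mean.npy",
    "stats/std.npy",
    "clip/ViT-B-32.pt",
    "t2m/Comp_v6_KLD005/opt.txt",
    "t2m/text_mot_match/model/finest.tar",
    "t2m/length_estimator/model/finest.tar",
]


def missing_files(root, expected=EXPECTED_FILES):
    """Return the expected relative paths that are not regular files under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]
```

Calling `missing_files("checkpoints")` returns an empty list when everything is in place; any returned path points at a file that the download did not produce under that root.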
How to use
git clone https://github.com/CHDTevior/KV-Control.git
cd KV-Control
bash scripts/download_checkpoints.sh # populates checkpoints/ and aux/ (glove/, body_models/)
Refer to the GitHub README for installation and quick-start commands.
Checkpoint provenance & expected metrics
Both released KV-Control adapters are evaluated with the paper M3 hybrid
protocol on the HumanML3D test split (Stage-1 dynamic TTT each_iter=35 --ttt_dynamic T=10; Stage-2 600-step embedding opt; cfg=3.25,
--cond_drop_prob 0.0 --pred_num_batch 16 --seed 3407):
| Checkpoint | --control | Paper row | Expected (5r mean) |
|---|---|---|---|
| kv_control/model/net_best_top3.tar | cross | Tab 4 multi-joint | KPS ≈ 0.80 cm (best 0.71) |
| kv_control_trajectory/model/net_best_kps.tar | trajectory | Tab 4 headline | KPS ≈ 0.40 cm, FID ≈ 0.065, Top-3 ≈ 0.799 |
The single-joint pelvis row is the paper headline; the cross checkpoint is the
multi-joint result. They come from two separate fine-tuning runs (pelvis vs
cross), both on the same frozen base_t_concat_v4 backbone. See the GitHub
README §3 for the exact reproduction commands. scripts/sanity_check_equivalence.py
regenerates one designed trajectory and reports KPS (≈ 1.7 cm on that
hand-crafted 6-joint sample); it is an install smoke test, not a benchmark
or an external-reference diff.
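
To compare your own evaluation runs against the expected values, you can average the per-run metrics and check them against the table. The sketch below is illustrative only: the function name `compare_to_expected`, the dictionary layout, and the 10% relative tolerance are assumptions rather than anything defined by the paper or the repository, and the table values themselves are approximate.

```python
from statistics import mean

# Approximate expected means copied from the table above (KPS in cm).
EXPECTED = {
    "cross": {"KPS": 0.80},
    "trajectory": {"KPS": 0.40, "FID": 0.065, "Top-3": 0.799},
}


def compare_to_expected(control, runs, rel_tol=0.10):
    """Average per-run metrics and check them against the expected values.

    runs is a list of dicts, one per evaluation run, e.g. {"KPS": 0.41}.
    rel_tol is an arbitrary illustrative tolerance, not a published figure.
    Returns {metric: (run_mean, within_tolerance)}.
    """
    report = {}
    for name, target in EXPECTED[control].items():
        avg = mean(run[name] for run in runs)
        report[name] = (avg, abs(avg - target) <= rel_tol * target)
    return report
```

Pass the value used for --control as the first argument and one dictionary per run as the second; a metric flagged False has a mean outside the chosen tolerance and is worth rechecking against the protocol settings listed above.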
Licenses
- Our weights (base_t_concat_v4, kv_control, vqvae, stats): MIT.
- CLIP ViT-B/32: released by OpenAI under MIT.
- SMPL body model under aux/body_models/: original SMPL license (research-only).
- Text-motion eval encoder / length estimator under t2m/: re-distributed from the HumanML3D / Guo et al. 2022 release for reproducibility.
Citation
@article{kvcontrol2026,
title = {KV-Control: Sparse-Keyframe Multi-Joint Text-to-Motion Generation},
author = {... (under review) ...},
year = {2026},
}