Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction
Abstract
A novel text-motion retrieval approach uses a joint-angle-based motion representation and Vision-Transformer-compatible pseudo-images to achieve superior performance with interpretable fine-grained alignments.
Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods use a dual-encoder framework that compresses motion and text into global embeddings, discarding fine-grained local correspondences, and thus reducing accuracy. Additionally, these global-embedding methods offer limited interpretability of the retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image, compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late interaction mechanism, and enhance it with Masked Language Modeling regularization to foster robust, interpretable text-motion alignment. Extensive experiments on HumanML3D and KIT-ML show that our method outperforms state-of-the-art text-motion retrieval approaches while offering interpretable fine-grained correspondences between text and motion. The code is available in the supplementary material.
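To make the late-interaction step concrete, below is a minimal sketch of ColBERT-style MaxSim scoring between text-token and motion-patch embeddings, assuming both live in a shared space and are L2-normalized; the function names, shapes, and cosine-similarity formulation are illustrative assumptions, not the authors' implementation.

```python
import torch

def maxsim_score(text_tokens: torch.Tensor, motion_patches: torch.Tensor) -> torch.Tensor:
    """
    text_tokens:    (T, D) L2-normalized text-token embeddings
    motion_patches: (P, D) L2-normalized motion-patch embeddings
    Returns a scalar text-motion relevance score.
    """
    # Cosine similarity between every text token and every motion patch: (T, P)
    sim = text_tokens @ motion_patches.T
    # Each text token keeps its best-matching motion patch; sum over tokens.
    return sim.max(dim=1).values.sum()

# Rank a gallery of motions for one text query (higher score = better match).
def rank_motions(text_tokens, gallery):  # gallery: list of (P_i, D) tensors
    scores = torch.stack([maxsim_score(text_tokens, m) for m in gallery])
    return scores.argsort(descending=True)
```

Because each text token is matched to a single best patch, the per-token argmax locations directly expose which motion patches support which words, which is the source of the interpretability claimed above.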
Community
This paper introduces a novel approach to text-motion retrieval by addressing the limitations of traditional global-embedding paradigms. Most existing methods compress 3D human motion sequences and textual descriptions into single global vectors, which is computationally efficient but inevitably discards fine-grained local correspondences. To overcome this information bottleneck, the authors propose replacing global embeddings with a structurally grounded, fine-grained late interaction mechanism.
The core innovation relies on explicitly decoupling local joint movements from the body's global trajectory using a joint-angle-based motion representation. These joint-level features are mapped into a structured 224×224 pseudo-image, making them compatible with pre-trained Vision Transformers. For retrieval, the framework uses MaxSim, a token-wise late interaction operator that matches text tokens to motion patches, further stabilized by Masked Language Modeling regularization on the text encoder.
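As a rough illustration of the pseudo-image idea, the sketch below lays a joint-angle sequence out on a (time × joint) grid and resizes it to the ViT input resolution; the grid layout, per-channel normalization, and bilinear resize are assumptions made for illustration and may differ from the paper's exact construction.

```python
import torch
import torch.nn.functional as F

def motion_to_pseudo_image(joint_angles: torch.Tensor, size: int = 224) -> torch.Tensor:
    """
    joint_angles: (T, J, 3) per-frame, per-joint angle features
                  (e.g. axis-angle), already decoupled from the root trajectory.
    Returns a (3, size, size) pseudo-image ready for a pre-trained ViT.
    """
    # Treat the 3 angle components as image channels: (3, T, J)
    img = joint_angles.permute(2, 0, 1)
    # Normalize each channel to [0, 1] so it behaves like pixel intensities.
    lo = img.amin(dim=(1, 2), keepdim=True)
    hi = img.amax(dim=(1, 2), keepdim=True)
    img = (img - lo) / (hi - lo + 1e-8)
    # Resize the (time x joint) grid to the ViT input resolution.
    img = F.interpolate(img.unsqueeze(0), size=(size, size),
                        mode="bilinear", align_corners=False)
    return img.squeeze(0)

# Example: 120 frames, 22 joints (HumanML3D skeleton), 3 angle components.
pseudo = motion_to_pseudo_image(torch.randn(120, 22, 3))  # -> (3, 224, 224)
```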
Extensive experiments conducted on the HumanML3D and KIT-ML datasets demonstrate that this approach consistently outperforms state-of-the-art text-motion retrieval baselines. Furthermore, the token-to-patch matching produces highly interpretable correspondence maps that transparently align textual semantics with specific body joints and temporal phases. This level of transparency and granularity provides a robust foundation for downstream applications like language-driven motion generation and localized editing.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning (2026)
- MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment (2026)
- Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval (2026)
- Language-Guided Transformer Tokenizer for Human Motion Generation (2026)
- Multi-Perspective Subimage CLIP with Keyword Guidance for Remote Sensing Image-Text Retrieval (2026)
- Temporal consistency-aware text-to-motion generation (2026)
- ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion (2026)