ALM2Vec-FT


ALM2Vec is a universal audio embedding model for retrieval, derived from a pretrained large audio–language model (LALM). Instead of being optimized only for audio–caption matching like conventional contrastive dual-encoders, it transfers the audio understanding, instruction-following, and reasoning abilities of LALMs into a single unified embedding space that works across audio domains, task types, and user intents.

Its key feature is instruction-aware retrieval: a natural-language instruction guides the embedding, so the same audio can be encoded differently for different needs. This supports:

  • Instruction-aware retrieval — focus the embedding on a specific aspect of the audio.
  • Text ↔ audio retrieval — bidirectional matching between audio and text.
  • Audio question answering — match an audio query plus a question against candidate answers.
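As a rough sketch of the instruction-aware idea, the helper below encodes the same audio-plus-text queries once per instruction and returns one set of embeddings per instruction. It reuses the `model.encode` keyword arguments shown in the Example section of this card; the helper name `encode_under_instructions` is hypothetical and not part of the ALM2Vec release, and any instruction strings passed to it are the caller's own choice.

```python
from typing import Any, Dict, List


def encode_under_instructions(
    model: Any,
    audio: List[str],
    text: List[str],
    instructions: List[str],
    device: str = "cuda",
) -> Dict[str, Any]:
    """Encode the same queries once per instruction.

    Assumes `model.encode` accepts the keyword arguments used in the
    Example section (text, audio, task, instruction, normalize, device).
    """
    embeddings = {}
    for instruction in instructions:
        # Same audio and text each time; only the instruction changes.
        embeddings[instruction] = model.encode(
            text=text,
            audio=audio,
            task="query",
            instruction=instruction,
            normalize=True,
            device=device,
        )
    return embeddings
```

With a loaded model, passing two different instructions yields two sets of embeddings for the same clips, which can then be compared against different candidate sets.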

ALM2Vec achieves competitive results on standard audio and speech retrieval benchmarks while adding these controllable retrieval capabilities. See the project page for interactive demos.

This repository hosts the fine-tuned checkpoint, built on MiDashengLM.

Requirements: transformers>=4.52, torch, and safetensors, plus torchaudio for non-WAV audio. Loading the model requires a GPU (the weights are about 31 GB) and trust_remote_code=True.
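To sanity-check the roughly 31 GB weight figure quoted above, the back-of-the-envelope sketch below multiplies a parameter count by the bytes per parameter. The parameter count of about 8 billion is an assumption used for illustration, and the helper name `weight_size_gb` is hypothetical. The estimate covers weights only; activations and audio features need extra memory, and whether this checkpoint behaves well in half precision is not stated in this card.

```python
# Bytes per element for a few common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_size_gb(num_params: float, dtype: str = "float32") -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: about 8 billion parameters; the exact count may differ.
ASSUMED_PARAMS = 8e9
print(f"float32:  ~{weight_size_gb(ASSUMED_PARAMS, 'float32'):.0f} GB")
print(f"bfloat16: ~{weight_size_gb(ASSUMED_PARAMS, 'bfloat16'):.0f} GB")
```

Under that assumption the float32 estimate lands close to the quoted figure, which is why a large-memory GPU is needed for the example in the next section.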

Example

import torch
from transformers import AutoModel, AutoTokenizer

# Choose a checkpoint: fine-tuned (FT) or pretrained (PT)
repo_id = "cara-ai/ALM2Vec-FT"
# repo_id = "cara-ai/ALM2Vec-PT"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True, torch_dtype=torch.float32).cuda()
model.eval()

QUERY_INSTRUCTION = "Based on the question asked in the text query and context in the audio query, retrieve the relevant text document associated with that question."
DOC_INSTRUCTION = "Represent the user's input."

query_text = ["What is the gender of speaker in this audio?"] * 4

# The example clips are fetched from the ALM2Vec-PT repository.
remote_prefix = "https://huggingface.co/cara-ai/ALM2Vec-PT/resolve/main/example/"
query_audio = [
    remote_prefix + "en_male_music.wav",
    remote_prefix + "en_male.wav",
    remote_prefix + "en_female_music.wav",
    remote_prefix + "en_female.wav",
]

doc_text = [
    "male",
    "female",
]

query_embeddings = model.encode(
    text=query_text,
    audio=query_audio,
    task="query",
    instruction=QUERY_INSTRUCTION,
    normalize=True,
    device="cuda",
)
doc_embeddings = model.encode(
    text=doc_text,
    task="document",
    instruction=DOC_INSTRUCTION,
    normalize=True,
    device="cuda",
)

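# Query-to-document similarity: one row per audio query, one column per candidate answer.
similarity = query_embeddings @ doc_embeddings.T
print(similarity)

# Query-to-query similarity: how close the four audio queries are to one another.
similarity = query_embeddings @ query_embeddings.T
print(similarity)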
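To turn the query-to-document similarity matrix into an answer per query, one option is to pick the highest-scoring candidate in each row. The sketch below does this in plain Python; the helper name `top_match` is hypothetical, and the scores in the demo are placeholders for illustration, not real model output.

```python
from typing import List, Sequence, Tuple


def top_match(
    similarity_rows: Sequence[Sequence[float]],
    doc_labels: Sequence[str],
) -> List[Tuple[str, float]]:
    """Return (best label, score) for each query row."""
    results = []
    for row in similarity_rows:
        # Index of the highest-scoring candidate for this query.
        best_index = max(range(len(row)), key=lambda j: row[j])
        results.append((doc_labels[best_index], row[best_index]))
    return results


# Placeholder scores for illustration only, not real model output.
example_rows = [[0.62, 0.31], [0.28, 0.57]]
print(top_match(example_rows, ["male", "female"]))
```

Assuming `encode` returns a torch tensor or NumPy array (both provide `.tolist()`), the same helper could be applied to the example as `top_match((query_embeddings @ doc_embeddings.T).tolist(), doc_text)`.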

Results

ALM2Vec-FT is the checkpoint hosted in this repository; ALM2Vec-PT is the pretrained variant. In every table, bold marks the best score and underline the second best.

Text–audio retrieval — AudioCaps

Method           T→A R@1  T→A R@5  T→A R@10  A→T R@1  A→T R@5  A→T R@10
LAION-CLAP       36.1     71.8     83.9      46.8     82.9     90.7
MS-CLAP          15.4     47.2     64.5      32.0     66.0     79.2
WavCaps-CLAP-PT  39.7     74.5     86.1      51.7     82.3     90.6
WavCaps-CLAP-FT  42.2     76.5     87.1      54.6     85.2     92.4
JINA-Embed.-v5   20.4     50.3     64.4      23.1     52.7     67.2
ALM2Vec-PT       40.0     74.5     85.9      43.8     74.3     86.5
ALM2Vec-FT       43.2     78.0     87.8      55.5     80.0     88.2

Text–audio retrieval — Clotho

Method           T→A R@1  T→A R@5  T→A R@10  A→T R@1  A→T R@5  A→T R@10
LAION-CLAP       16.1     38.3     51.1      22.7     48.5     60.8
MS-CLAP          15.6     38.9     51.4      22.1     48.9     62.0
WavCaps-CLAP-PT  19.5     45.2     58.2      23.4     50.9     63.4
WavCaps-CLAP-FT  19.7     45.7     59.4      26.9     52.6     64.9
JINA-Embed.-v5   9.2      23.9     35.0      10.5     24.7     34.3
ALM2Vec-PT       19.2     43.4     55.7      17.9     39.4     52.2
ALM2Vec-FT       24.8     52.9     65.8      27.9     52.7     66.3

Speech retrieval — LibriSQA

Method        T→S R@1  T→S R@5  T→S R@10  S→T R@1  S→T R@5  S→T R@10
LAION-CLAP †  0.0      0.1      0.8       0.1      0.2      0.6
Whisper+BGE   83.7     93.3     94.9      85.2     93.4     95.3
CLSR          85.0     93.4     95.0      85.5     94.0     95.6
ALM2Vec-PT    43.7     64.5     72.8      11.2     24.9     34.1
ALM2Vec-FT    84.7     94.1     95.8      86.0     95.2     97.2

Audio understanding — MMAU-mini (accuracy)

Method              Overall  Music  Sound  Speech
GPT-4o Audio ‡      60.8     63.2   64.6   56.3
Gemini 2.5 Pro ‡    71.6     75.1   71.5   68.3
Qwen2.5-Omni ‡      71.5     65.9   78.1   70.6
Audio Flamingo 3 ‡  73.1     76.9   66.1   73.9
ALM2Vec-PT          66.3     62.3   78.7   58.0
ALM2Vec-FT          63.0     61.7   74.8   52.6

† LAION-CLAP is not trained for speech and effectively fails on LibriSQA; shown for reference.
‡ Generative large audio–language models, listed as reference upper bounds rather than directly comparable retrieval baselines.
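
The R@k columns in the retrieval tables report Recall@k. A minimal sketch of one common definition follows: a query counts as a hit if any of its relevant items appears among the k highest-scoring candidates, and the score is the percentage of queries that hit. The function name `recall_at_k`, the set-based relevance format, and the toy scores are illustrative assumptions; the exact evaluation protocol behind the reported numbers is not described in this card.

```python
from typing import Sequence, Set


def recall_at_k(
    similarity_rows: Sequence[Sequence[float]],
    relevant: Sequence[Set[int]],
    k: int,
) -> float:
    """Percentage of queries with at least one relevant item in the top k."""
    hits = 0
    for row, relevant_ids in zip(similarity_rows, relevant):
        # Candidate indices sorted by descending similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if relevant_ids.intersection(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity_rows)


# Toy scores for illustration only: 3 queries, 4 candidates.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # relevant item 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # relevant item 1 is ranked second
    [0.5, 0.6, 0.7, 0.1],  # relevant item 3 is ranked last
]
toy_relevant = [{0}, {1}, {3}]
print(recall_at_k(toy_scores, toy_relevant, k=1))
print(recall_at_k(toy_scores, toy_relevant, k=2))
```

Using a set of relevant indices per query covers datasets where one audio clip has several reference captions, since any one of them in the top k counts as a hit.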

Citation

If you find this work useful, please consider citing:

@article{ALM2Vec2026,
  title={ALM2Vec: Learning Audio Embeddings for Universal
        Audio Retrieval with Large Audio-Language Models},
  author={TBD},
  journal={arXiv preprint arXiv:TBD},
  year={2026}
}

Acknowledgement

ALM2Vec is built on MiDashengLM and further trained for universal audio retrieval. We thank MiDashengLM and its underlying Dasheng audio encoder for their open-source contributions.
