ALM2Vec-FT

ALM2Vec is a universal audio embedding model for retrieval, derived from a pretrained large audio–language model (LALM). Instead of being optimized only for audio–caption matching like conventional contrastive dual-encoders, it transfers the audio understanding, instruction-following, and reasoning abilities of LALMs into a single unified embedding space that works across audio domains, task types, and user intents.

Its key feature is instruction awareness: a natural-language instruction guides the embedding, so the same audio can be encoded differently for different needs. The model supports:

Instruction-aware retrieval — focus the embedding on a specific aspect of the audio.
Text ↔ audio retrieval — bidirectional matching between audio and text.
Audio question answering — match an audio query plus a question against candidate answers.
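As a rough illustration of instruction-aware encoding, the sketch below encodes the same inputs once per instruction. It reuses only the `model.encode(...)` keyword arguments that appear in the Example section of this README. The helper name `encode_per_instruction` and both instruction strings are hypothetical, and whether `encode` accepts audio without accompanying text is an assumption (pass `text=` to mirror the documented call exactly).

```python
# Hypothetical instruction strings, for illustration only.
INSTRUCTIONS = {
    "speaker": "Focus on who is speaking in the audio.",
    "background": "Focus on the background sounds in the audio.",
}


def encode_per_instruction(model, audio, instructions, text=None, device="cuda"):
    """Encode the same audio once per instruction.

    `model` is assumed to expose the `encode(...)` keyword arguments used in
    this README's example; nothing else about its API is assumed.
    Returns a dict mapping each instruction name to its embeddings.
    """
    embeddings = {}
    for name, instruction in instructions.items():
        kwargs = dict(
            audio=audio,
            task="query",
            instruction=instruction,
            normalize=True,
            device=device,
        )
        if text is not None:
            # Only forward text when the caller supplies it.
            kwargs["text"] = text
        embeddings[name] = model.encode(**kwargs)
    return embeddings
```

With a loaded model, a call such as `encode_per_instruction(model, query_audio, INSTRUCTIONS, text=query_text)` would return one set of embeddings per instruction, which can then be compared against the same candidate documents.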

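On the benchmarks reported below, ALM2Vec-FT reaches the highest R@1 in both retrieval directions on AudioCaps and Clotho, and on LibriSQA it matches or exceeds the strongest listed speech baseline (CLSR) in every column except T→S R@1 (84.7 vs. 85.0), while adding these controllable retrieval capabilities. See the project page for interactive demos.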

This repository hosts the fine-tuned checkpoint, built on MiDashengLM.

Requirements: transformers>=4.52, torch, and safetensors, plus torchaudio for non-WAV audio. A GPU is required (the weights are ~31 GB), and the model must be loaded with trust_remote_code=True.
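The package requirements can be checked before downloading the weights. The following is a minimal sketch using only the standard library; the helper names `version_tuple` and `requirement_problems` are made up for this example, and it checks package versions only, not GPU availability or free memory.

```python
import re
from importlib import metadata

# Minimums taken from the Requirements line; None means "any version".
REQUIRED = {"transformers": (4, 52), "torch": None, "safetensors": None}


def version_tuple(version):
    """Turn '4.52.0.dev0' into (4, 52, 0), stopping at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def requirement_problems(required=REQUIRED):
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if minimum is not None and version_tuple(installed) < minimum:
            wanted = ".".join(str(n) for n in minimum)
            problems.append(f"{name} {installed} is older than {wanted}")
    return problems


if __name__ == "__main__":
    for problem in requirement_problems():
        print(problem)
```

torchaudio is left out of `REQUIRED` because it is only needed for non-WAV audio; add it to the dictionary if your inputs are in other formats.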

Example

import torch
from transformers import AutoModel, AutoTokenizer

# Choose the fine-tuned (FT) or the pretrained (PT) checkpoint.
repo_id = "cara-ai/ALM2Vec-FT"
# repo_id = "cara-ai/ALM2Vec-PT"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True, torch_dtype=torch.float32).cuda()
model.eval()

QUERY_INSTRUCTION = "Based on the question asked in the text query and context in the audio query, retrieve the relevant text document associated with that question."
DOC_INSTRUCTION = "Represent the user's input."

query_text = ["What is the gender of speaker in this audio?"] * 4

# Example clips are fetched from the ALM2Vec-PT repository.
remote_prefix = "https://huggingface.co/cara-ai/ALM2Vec-PT/resolve/main/example/"
query_audio = [
    remote_prefix + "en_male_music.wav",
    remote_prefix + "en_male.wav",
    remote_prefix + "en_female_music.wav",
    remote_prefix + "en_female.wav",
]

doc_text = [
    "male",
    "female",
]

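# Queries: each audio clip is paired with the question text.
query_embeddings = model.encode(
    text=query_text,
    audio=query_audio,
    task="query",
    instruction=QUERY_INSTRUCTION,
    normalize=True,
    device="cuda",
)
# Documents: the text-only candidate answers.
doc_embeddings = model.encode(
    text=doc_text,
    task="document",
    instruction=DOC_INSTRUCTION,
    normalize=True,
    device="cuda",
)

# Query-to-document scores: one row per query, one column per document.
similarity = query_embeddings @ doc_embeddings.T
print(similarity)

# Query-to-query scores, to compare the four audio queries with each other.
query_similarity = query_embeddings @ query_embeddings.T
print(query_similarity)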
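To read the query-to-document matrix from the first print, each row can be ranked and its best-scoring document reported. The sketch below does this in plain Python on nested lists. The helper names are hypothetical, and the toy scores are made up for illustration; they are not model output.

```python
def rank_documents(similarity_row):
    """Return document indices sorted from most to least similar."""
    return sorted(
        range(len(similarity_row)),
        key=lambda j: similarity_row[j],
        reverse=True,
    )


def top1_labels(similarity, labels):
    """Pick the best-scoring document label for every query row."""
    return [labels[rank_documents(row)[0]] for row in similarity]


# Made-up scores for illustration only: 4 queries x 2 documents.
toy_similarity = [
    [0.62, 0.31],
    [0.58, 0.35],
    [0.29, 0.66],
    [0.33, 0.61],
]
print(top1_labels(toy_similarity, ["male", "female"]))
# prints ['male', 'male', 'female', 'female']
```

With real embeddings, the query-to-document matrix would first need converting to nested lists, for example with `.tolist()` if `encode` returns a torch tensor or NumPy array (an assumption; the return type is not stated here).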

Results

ALM2Vec-FT is the checkpoint hosted in this repository; ALM2Vec-PT is the pretrained variant. In every table, bold marks the best score and underline the second best.

Text–audio retrieval — AudioCaps

Method	T→A R@1	T→A R@5	T→A R@10	A→T R@1	A→T R@5	A→T R@10
LAION-CLAP	36.1	71.8	83.9	46.8	82.9	90.7
MS-CLAP	15.4	47.2	64.5	32.0	66.0	79.2
WavCaps-CLAP-PT	39.7	74.5	86.1	51.7	82.3	90.6
WavCaps-CLAP-FT	42.2	76.5	87.1	54.6	85.2	92.4
JINA-Embed.-v5	20.4	50.3	64.4	23.1	52.7	67.2
ALM2Vec-PT	40.0	74.5	85.9	43.8	74.3	86.5
ALM2Vec-FT	43.2	78.0	87.8	55.5	80.0	88.2
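The R@K columns in these tables report Recall@K: the percentage of queries for which a correct item appears among the top K retrieved candidates. The sketch below shows one common way to compute it from a similarity matrix. It assumes a query counts as a hit if any of its relevant candidates is in the top K (useful when an audio clip has several reference captions); the exact evaluation protocol behind the reported numbers is not described here, so treat this as illustrative. The function name and toy scores are made up.

```python
def recall_at_k(similarity, relevant, ks=(1, 5, 10)):
    """Compute Recall@K in percent.

    similarity[i][j] is the score of candidate j for query i.
    relevant[i] is the set of candidate indices that are correct for query i.
    """
    if not similarity:
        raise ValueError("similarity must contain at least one query row")
    hits = {k: 0 for k in ks}
    for row, gold in zip(similarity, relevant):
        # Candidate indices ordered from highest to lowest score.
        ranking = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        for k in ks:
            if gold.intersection(ranking[:k]):
                hits[k] += 1
    total = len(similarity)
    return {k: 100.0 * hits[k] / total for k in ks}


# Made-up scores for illustration only: 3 queries x 4 candidates.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # correct candidate 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # correct candidate 1 is ranked second
    [0.5, 0.6, 0.1, 0.7],  # correct candidate 2 is ranked last
]
toy_relevant = [{0}, {1}, {2}]
toy_recall = recall_at_k(toy_scores, toy_relevant, ks=(1, 2))
print({k: round(v, 1) for k, v in toy_recall.items()})
# prints {1: 33.3, 2: 66.7}
```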

Text–audio retrieval — Clotho

Method	T→A R@1	T→A R@5	T→A R@10	A→T R@1	A→T R@5	A→T R@10
LAION-CLAP	16.1	38.3	51.1	22.7	48.5	60.8
MS-CLAP	15.6	38.9	51.4	22.1	48.9	62.0
WavCaps-CLAP-PT	19.5	45.2	58.2	23.4	50.9	63.4
WavCaps-CLAP-FT	19.7	45.7	59.4	26.9	52.6	64.9
JINA-Embed.-v5	9.2	23.9	35.0	10.5	24.7	34.3
ALM2Vec-PT	19.2	43.4	55.7	17.9	39.4	52.2
ALM2Vec-FT	24.8	52.9	65.8	27.9	52.7	66.3

Speech retrieval — LibriSQA

Method	T→S R@1	T→S R@5	T→S R@10	S→T R@1	S→T R@5	S→T R@10
LAION-CLAP †	0.0	0.1	0.8	0.1	0.2	0.6
Whisper+BGE	83.7	93.3	94.9	85.2	93.4	95.3
CLSR	85.0	93.4	95.0	85.5	94.0	95.6
ALM2Vec-PT	43.7	64.5	72.8	11.2	24.9	34.1
ALM2Vec-FT	84.7	94.1	95.8	86.0	95.2	97.2

Audio understanding — MMAU-mini (accuracy)

Method	Overall	Music	Sound	Speech
GPT-4o Audio ‡	60.8	63.2	64.6	56.3
Gemini 2.5 Pro ‡	71.6	75.1	71.5	68.3
Qwen2.5-Omni ‡	71.5	65.9	78.1	70.6
Audio Flamingo 3 ‡	73.1	76.9	66.1	73.9
ALM2Vec-PT	66.3	62.3	78.7	58.0
ALM2Vec-FT	63.0	61.7	74.8	52.6

† LAION-CLAP is not trained for speech and effectively fails on LibriSQA; it is shown for reference only.
‡ Generative large audio–language models, listed as reference points rather than directly comparable retrieval baselines.

Citation

If you find this work useful, please consider citing:

@article{ALM2Vec2026,
  title={ALM2Vec: Learning Audio Embeddings for Universal
        Audio Retrieval with Large Audio-Language Models},
  author={TBD},
  journal={arXiv preprint arXiv:TBD},
  year={2026}
}

Acknowledgement

ALM2Vec is built on MiDashengLM and further trained for universal audio retrieval. We thank the authors of MiDashengLM and of its underlying Dasheng audio encoder for their open-source contributions.