---
license: cc-by-4.0
language:
- en
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: LCO-Embedding/LCO-Embedding-Omni-7B
tags:
- audio
- speech
- emotion
- clap
- contrastive
- voice
- sentence-transformers
---
# VoiceCLAP-Large
Voice-text contrastive embedding model — the larger of the two anchors
released with [VoiceNet](https://huggingface.co/VoiceNet).
VoiceCLAP-Large is a **single-tower** model: a rank-16 LoRA finetune of
[LCO-Embedding-Omni-7B](https://huggingface.co/LCO-Embedding/LCO-Embedding-Omni-7B)
(Qwen2.5-Omni-Thinker-7B backbone with a sentence-transformer
last-token-pooling head) trained with the symmetric InfoNCE loss. The audio
and text embeddings are produced by the same backbone — the modality is
determined by what is fed in via the multimodal chat template.
| | |
| --- | --- |
| Architecture | single-tower Omni-Embedding (Qwen2.5-Omni-Thinker-7B + ST last-token-pool) |
| Adaptation | rank-16 LoRA (alpha 32, dropout 0.05), merged into the released weights |
| Joint embedding | 3 584-d, L2-normalised |
| Loss | symmetric InfoNCE (all-gather negatives) |
| Total parameters | ~7 B (full merged model) |
| Epochs | 1 |
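
For reference, here is a minimal single-process sketch of the symmetric InfoNCE objective; the all-gather of negatives across devices is omitted, and the `temperature` value is illustrative, not one reported for this model.

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (audio, text) embeddings.

    Both inputs are (batch, dim) and assumed L2-normalised. In distributed
    training the negatives would come from an all-gather across devices;
    this sketch uses only the in-batch negatives.
    """
    logits = audio_emb @ text_emb.T / temperature        # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_a2t = F.cross_entropy(logits, targets)          # audio -> text
    loss_t2a = F.cross_entropy(logits.T, targets)        # text -> audio
    return (loss_a2t + loss_t2a) / 2
```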
## Training data
Trained for **1 epoch** on the open `voiceclap_10_safe` mixture (9 datasets)
used in the VoiceNet paper:
- `emolia-balanced-5M-subset` (annotated subset of [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset))
- `laions_got_talent_clean_with_captions`
- `majestrino-data`
- `synthetic_vocal_bursts`
- `improved_synthetic_vocal_bursts`
- `ears`
- `expresso`
- `voxceleb1`
- `voxceleb2`
All clips carry dense vocal-style captions derived with
`MOSS-Audio-8B-Thinking`, covering emotions, speaking-style attributes, and
demographics.
## Standalone load example
The model uses the SentenceTransformer multimodal API. Both
`sentence-transformers` and `transformers` are on PyPI; the example below
additionally uses `soundfile` to read the audio file.
```python
import soundfile as sf
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("VoiceNet/voiceclap-large", trust_remote_code=True)

# Text embedding (3 584-d, L2-normalised)
text_emb = model.encode(["a calm and steady voice"])

# Audio embedding: pass a dict with raw samples and sampling rate.
arr, sr = sf.read("clip.wav")
audio_emb = model.encode([{"array": arr, "sampling_rate": sr}])

# Cosine similarity (embeddings are already L2-normalised)
print((audio_emb @ text_emb.T).item())
```
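A typical use is zero-shot ranking of candidate style captions against a
clip. The sketch below continues from the snippet above; the captions are
made-up examples.

```python
# Rank candidate style captions against the clip embedded above.
captions = [
    "a calm and steady voice",
    "an excited, fast-talking speaker",
    "a whispering, hesitant voice",
]
cand_emb = model.encode(captions)             # (3, 3584), L2-normalised
scores = (audio_emb @ cand_emb.T).flatten()   # cosine similarities
for caption, score in sorted(zip(captions, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {caption}")
```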
For convenience, the LoRA adapter is also shipped under `adapter/` so it can
be reapplied to other LCO-Embedding-Omni-7B forks; note that the merged
`model.safetensors` already contains it.
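
If the shipped adapter is a standard PEFT checkpoint (an assumption; inspect
`adapter/` to confirm), reapplying it to a fresh copy of the base model could
look roughly like this:

```python
from peft import PeftModel
from transformers import AutoModel

# Load the base backbone, then attach the LoRA adapter shipped in adapter/.
# The repo/subfolder layout here is an assumption based on this card.
base = AutoModel.from_pretrained(
    "LCO-Embedding/LCO-Embedding-Omni-7B", trust_remote_code=True
)
lora_model = PeftModel.from_pretrained(
    base, "VoiceNet/voiceclap-large", subfolder="adapter"
)
merged = lora_model.merge_and_unload()  # bake the LoRA deltas into the base weights
```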
## Citation
If you use this model, please cite the VoiceNet paper.