---
license: mit
library_name: mlx
tags:
- mlx
- audio
- speech
- feature-extraction
- contentvec
- hubert
- voice-conversion
- rvc
datasets:
- librispeech_asr
language:
- en
pipeline_tag: feature-extraction
---

# MLX ContentVec / HuBERT Base

MLX-converted weights for the ContentVec/HuBERT base model, optimized for Apple Silicon.

This model extracts speaker-agnostic semantic features from audio and is primarily used as the feature-extraction backbone for [RVC (Retrieval-based Voice Conversion)](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI).

## Model Details

- **Architecture**: HuBERT Base (12 transformer layers)
- **Parameters**: ~90M
- **Input**: 16 kHz mono audio
- **Output**: 768-dimensional features (~50 frames/second)
- **Framework**: [MLX](https://github.com/ml-explore/mlx)
- **Format**: SafeTensors (float32)

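The ~50 frames/second figure follows from HuBERT's convolutional front end, which downsamples 16 kHz audio by a total stride of 320 samples (one frame per 20 ms). A rough way to estimate the output length for a given clip (an approximation; the exact count also depends on the convolutional receptive field):

```python
def approx_num_frames(num_samples: int, hop: int = 320) -> int:
    """Approximate the number of HuBERT output frames for 16 kHz input.

    The convolutional feature extractor has a total stride of 320
    samples at 16 kHz, i.e. one frame per 20 ms (~50 frames/second).
    """
    return num_samples // hop

# One second of 16 kHz audio -> ~50 frames of 768-dim features
print(approx_num_frames(16000))  # 50
```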
## Usage

```python
import mlx.core as mx
import librosa
from mlx_contentvec import ContentvecModel

# Load model
model = ContentvecModel(encoder_layers_1=0)
model.load_weights("contentvec_base.safetensors")
model.eval()

# Load audio at 16 kHz mono
audio, sr = librosa.load("input.wav", sr=16000, mono=True)
source = mx.array(audio).reshape(1, -1)

# Extract features
result = model(source)
features = result["x"]  # Shape: (1, num_frames, 768)
```

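The per-frame features can be pooled into a single clip-level embedding when a fixed-size vector is needed. A minimal NumPy sketch (the pooling step is our illustration, not part of the model's API):

```python
import numpy as np

def mean_pool(features: np.ndarray) -> np.ndarray:
    """Average (1, num_frames, 768) features into one 768-dim vector."""
    return features.mean(axis=(0, 1))

# Stand-in for model output: 49 frames of 768-dim features
features = np.random.randn(1, 49, 768).astype(np.float32)
embedding = mean_pool(features)
print(embedding.shape)  # (768,)
```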
## Installation

```bash
pip install git+https://github.com/example/mlx-contentvec.git
```

## Download Weights

```python
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download(
    repo_id="lexandstuff/mlx-contentvec",
    filename="contentvec_base.safetensors",
)
```

## Validation

These weights produce outputs numerically equivalent to the original PyTorch implementation, matching within float32 tolerance:

| Metric | Value |
|--------|-------|
| Max absolute difference | 7.3e-6 |
| Cosine similarity | 1.000000 |

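Both metrics can be reproduced with a simple NumPy comparison; a sketch, assuming you have the MLX and PyTorch feature tensors converted to arrays:

```python
import numpy as np

def max_abs_diff(a: np.ndarray, b: np.ndarray) -> float:
    """Largest element-wise absolute difference between two tensors."""
    return float(np.abs(a - b).max())

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two tensors, flattened to vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical tensors: zero difference, similarity of ~1.0
x = np.random.randn(1, 49, 768)
print(max_abs_diff(x, x))  # 0.0
print(cosine_similarity(x, x))
```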
## Source Weights

Converted from [hubert_base.pt](https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt) (MD5: `b76f784c1958d4e535cd0f6151ca35e4`).

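Before converting, you can check a downloaded checkpoint against the MD5 above. A minimal sketch using the standard library (the file path is illustrative):

```python
import hashlib

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 in 1 MiB chunks and return the hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage:
# assert md5_of_file("hubert_base.pt") == "b76f784c1958d4e535cd0f6151ca35e4"
```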
## Use Cases

- **Voice Conversion**: Feature extraction for the RVC pipeline
- **Content-Based Retrieval**: Speaker-agnostic audio embeddings
- **Speech Analysis**: Semantic feature extraction

## Citation

```bibtex
@inproceedings{qian2022contentvec,
  title={ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers},
  author={Qian, Kaizhi and Zhang, Yang and Gao, Heting and Ni, Junrui and Lai, Cheng-I and Cox, David and Hasegawa-Johnson, Mark and Chang, Shiyu},
  booktitle={International Conference on Machine Learning},
  year={2022}
}

@article{hsu2021hubert,
  title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
  author={Hsu, Wei-Ning and others},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2021}
}
```

## License

MIT