MULE (PyTorch) — matteospanio/mule

Pretrained weights for an unofficial PyTorch port of MULE (Musicset Unsupervised Large Embedding), the SF-NFNet-F0 music-audio representation model from SiriusXM/Pandora:

Supervised and Unsupervised Learning of Audio Representations for Music Understanding, M. C. McCallum, F. Korzeniowski, S. Oramas, F. Gouyon, A. F. Ehmann. ISMIR 2022. https://arxiv.org/abs/2210.03799

These weights were converted, not retrained: they were transferred from the original TensorFlow/Keras model.keras into the PyTorch implementation and verified to be numerically equivalent (end-to-end clip-embedding cosine similarity of 0.9999999 against the original pipeline; ONNX backbone max-abs difference < 1e-6; 62.35 M parameters).
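For readers who want to run a similar check on their own conversion, here is a minimal sketch of the two metrics mentioned above (cosine similarity and max-abs difference). It is not the verification script used for this port; the function names and the toy vectors are hypothetical, and real embeddings would be 1728-d.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


# Toy stand-ins for a reference embedding and a ported embedding (hypothetical values).
reference = [0.5, -1.0, 2.0]
ported = [0.5, -1.0, 2.0000001]

similarity = cosine_similarity(reference, ported)
worst_gap = max_abs_diff(reference, ported)
```

Cosine similarity ignores overall scale, so pairing it with a max-abs check catches a port whose outputs point the same way but differ in magnitude.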

Library / code: https://github.com/matteospanio/mule-torch

⚠️ Unofficial. This is an independent community port from TensorFlow to PyTorch. It is not affiliated with, endorsed by, or maintained by SiriusXM, Pandora, or the original authors. All credit for the model goes to them.

Files

| File | What |
| --- | --- |
| model.safetensors | Full model state dict (SF-NFNet-F0 backbone + mel filterbank buffer), ~267 MB. |
| config.json | Architecture + frontend + slicing constants (rebuilds MuleConfig). |
| backbone.onnx | Self-contained ONNX export of the backbone ((N,1,96,300) log-mel slice → (N,1728)), opset 17, dynamic batch. ~252 MB. |

Usage

pip install mule-torch          # or: pip install git+https://github.com/matteospanio/mule-torch

import torch
from mule_torch import MuleModel

# Downloads these weights from the Hub by default.
model = MuleModel.from_pretrained()                 # == from_pretrained(hf_repo="matteospanio/mule")
waveform = torch.randn(1, 16000 * 10)               # (B, T) mono @ 16 kHz, in [-1, 1]
emb = model(waveform)                               # (B, 1728)

ONNX (backbone only)

The full waveform→embedding path produces a data-dependent number of slices (one roughly every 2 s of audio), so the ONNX export covers only the backbone: one standardized 96×300 log-mel slice in, one 1728-d embedding out. Compute the mel front-end and slicing in PyTorch or on the host, then run the slices through backbone.onnx:

import numpy as np
import onnxruntime as ort

# `slices`: standardized log-mel slices from your own front-end, shape (N, 1, 96, 300).
sess = ort.InferenceSession("backbone.onnx", providers=["CPUExecutionProvider"])
emb = sess.run(None, {"mel_slice": slices.astype(np.float32)})[0]   # (N, 1728)
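The snippet above expects a batch shaped (N, 1, 96, 300). As a rough sketch, assuming you already have a list of standardized 96×300 log-mel slices, the batch could be assembled as follows. The helper name `to_backbone_batch` is hypothetical and not part of mule-torch, and the zero arrays are placeholders rather than real features; the helper only handles layout and dtype, not the standardization itself.

```python
import numpy as np


def to_backbone_batch(mel_slices):
    """Stack (96, 300) log-mel slices into a float32 (N, 1, 96, 300) batch."""
    batch = np.stack([np.asarray(s, dtype=np.float32) for s in mel_slices])
    if batch.ndim != 3 or batch.shape[1:] != (96, 300):
        raise ValueError(f"expected slices of shape (96, 300), got {batch.shape[1:]}")
    # Insert the single channel axis the backbone input layout calls for.
    return np.ascontiguousarray(batch[:, np.newaxis, :, :])


# Placeholder data standing in for real standardized log-mel slices.
dummy_slices = [np.zeros((96, 300)) for _ in range(4)]
batch = to_backbone_batch(dummy_slices)
```

The resulting array can be passed directly as the `mel_slice` input, with no further `astype` call needed.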

Input convention

16 kHz mono waveform in [-1, 1]. The model computes a 96-band log-mel spectrogram, slices it into 96×300 windows every ~2 s, runs the backbone, and mean-pools the per-slice 1728-d embeddings into one vector per clip.
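The slicing and pooling steps described above can be sketched in plain Python. The 300-frame window comes from the 96×300 slice shape; the 200-frame hop assumes a 10 ms mel frame period (so ~2 s is 200 frames), which is an assumption here and not a value read from config.json. Dropping trailing partial windows is also an assumption, and both function names are hypothetical.

```python
from typing import List, Sequence

WINDOW_FRAMES = 300   # slice width, from the 96x300 slice shape in the text
HOP_FRAMES = 200      # ASSUMPTION: ~2 s hop at an assumed 10 ms frame period


def slice_starts(num_frames: int, window: int = WINDOW_FRAMES, hop: int = HOP_FRAMES) -> List[int]:
    """Start indices of every full window; trailing partial windows are dropped."""
    if num_frames < window:
        return []
    return list(range(0, num_frames - window + 1, hop))


def mean_pool(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-slice embeddings element-wise into one clip-level vector."""
    if not embeddings:
        raise ValueError("need at least one slice embedding")
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


# About 10 s of audio under the assumed 10 ms frame period.
starts = slice_starts(1000)

# Two toy 2-d "embeddings" standing in for 1728-d backbone outputs.
clip_vector = mean_pool([[1.0, 2.0], [3.0, 6.0]])
```

Under these assumptions a 1000-frame spectrogram yields slices starting at frames 0, 200, 400 and 600, which is why the number of backbone calls depends on clip length.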

The original AudioFile reader scales PCM16 samples by 1/2^16, i.e. into [-0.5, 0.5), rather than by the conventional 1/2^15. Conventional [-1, 1] audio tracks the original closely but is not bit-identical, because the log10(10000·x+1) mel compression is non-linear.
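To see why a constant input gain does not simply shift the compressed features, here is a small illustration using the log10(10000·x+1) compression quoted above. It treats x as a mel-band value that scales linearly with the gain and uses a factor of 0.5; both are assumptions for illustration (a power spectrogram would scale by the square of the gain), and the function names are hypothetical.

```python
import math


def compress(x: float) -> float:
    """Log compression quoted in the text: log10(10000 * x + 1)."""
    return math.log10(10000.0 * x + 1.0)


def gap(x: float, scale: float = 0.5) -> float:
    """Change in the compressed value when the input is multiplied by `scale`."""
    return compress(x) - compress(scale * x)


quiet_gap = gap(1e-5)   # low-energy bin, where the +1 offset dominates
loud_gap = gap(1.0)     # high-energy bin, where the curve is close to a pure log
```

With a pure logarithm the gap would be a constant log10(2) ≈ 0.301 for every bin. Here it is close to that value for loud bins but much smaller for quiet ones, so the two scalings differ by more than a uniform offset.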

License

These weights are a derivative of the original MULE weights, which Pandora/SiriusXM released under CC BY-NC 4.0, and they inherit that non-commercial license. The mule-torch source code is GPL-3.0-only. Please cite McCallum et al. (2022).

@inproceedings{mccallum2022mule,
  title     = {Supervised and Unsupervised Learning of Audio Representations for Music Understanding},
  author    = {McCallum, Matthew C. and Korzeniowski, Filip and Oramas, Sergio and Gouyon, Fabien and Ehmann, Andreas F.},
  booktitle = {Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR)},
  year      = {2022},
  url       = {https://arxiv.org/abs/2210.03799}
}