whisper-large-v3-ca-rapnic-down-full
Model Description
"whisper-large-v3-ca-rapnic-down-full" is an acoustic model for automatic recognition of Catalan speech from speakers with Down syndrome. It is based on the model "BSC-LT/whisper-large-v3-ca-punctuated-3370h" and was finetuned on audio recordings of individuals with Down syndrome; see Rapnic Example for more details on the dataset. The base model was itself based on "openai/whisper-large-v3" and finetuned with a combination of Catalan data from Common Voice 17.0 (2,659 hours) and 710 hours of data released by the Projecte AINA from Barcelona, Spain, totalling 3,369 hours and 53 minutes. The model is intended to output text in the lowercase Catalan alphabet, including spaces.
Finetuning
For this model, a full finetuning was performed on 70% of the available data (Down syndrome speakers only). To avoid training on poor data, we trained a preliminary model and used its WER results to filter out speakers whose mean WER exceeded 80%. This resulted in better metrics for speakers both under and over the WER threshold.
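The speaker filter described above can be sketched as follows. This is a minimal illustration and not the authors' actual script: it assumes the mean is taken over utterance-level WER values per speaker, and the field names (`speaker`, `reference`, `prediction`), the helper functions, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def speakers_to_keep(rows, threshold=0.8):
    """Return the speakers whose mean utterance-level WER is at most `threshold`."""
    per_speaker = defaultdict(list)
    for row in rows:
        per_speaker[row["speaker"]].append(
            word_error_rate(row["reference"], row["prediction"])
        )
    return {spk for spk, wers in per_speaker.items() if mean(wers) <= threshold}


# Hypothetical predictions from a preliminary model (illustrative data only).
rows = [
    {"speaker": "spk_a", "reference": "bon dia a tothom", "prediction": "bon dia a tothom"},
    {"speaker": "spk_a", "reference": "com estàs", "prediction": "com està"},
    {"speaker": "spk_b", "reference": "fins demà", "prediction": "fin de ma"},
]
kept = speakers_to_keep(rows)
print(kept)
```

In this toy data, `spk_a` has a mean WER of 0.25 and is kept, while `spk_b` has a WER of 1.5 and is dropped from the training pool.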
Evaluation
To evaluate the model, we set apart 20% of the available data, making sure that no transcription was present in both the training and testing sets. The WER results (in %) of running inference on our test set, filtered by speaker mean WER threshold, were the following:
| Training ↓ / Evaluation → | 0.5 | 0.6 | 0.7 | 0.8 | None |
|---|---|---|---|---|---|
| 0.8 | 14.42 | 15.28 | 15.73 | 15.73 | 23.41 |
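One way to guarantee that no transcription appears in both the training and testing sets is to split by unique transcription text rather than by recording. The sketch below illustrates that idea under stated assumptions: the `text` field, the `split_by_transcription` helper, and the `rest` partition are hypothetical, and the fractions apply to unique transcriptions, so row-level proportions are only approximate.

```python
import random
from collections import defaultdict


def split_by_transcription(rows, train_frac=0.7, test_frac=0.2, seed=0):
    """Assign whole transcription groups to a split so that no text is shared."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["text"]].append(row)

    # Shuffle the unique transcriptions reproducibly, then cut by fraction.
    texts = sorted(groups)
    random.Random(seed).shuffle(texts)
    n_train = round(len(texts) * train_frac)
    n_test = round(len(texts) * test_frac)
    parts = {
        "train": texts[:n_train],
        "test": texts[n_train:n_train + n_test],
        "rest": texts[n_train + n_test:],  # remaining share, use not specified here
    }
    return {name: [r for t in chosen for r in groups[t]] for name, chosen in parts.items()}


# Illustrative data: 10 distinct prompts, each recorded 3 times.
rows = [{"text": f"frase {i % 10}", "speaker": f"spk_{i % 3}"} for i in range(30)]
splits = split_by_transcription(rows)

train_texts = {r["text"] for r in splits["train"]}
test_texts = {r["text"] for r in splits["test"]}
print(len(splits["train"]), len(splits["test"]), train_texts.isdisjoint(test_texts))
```

Because every recording of a given prompt lands in the same partition, the test set measures performance on sentences the model has not seen during finetuning.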
How to Get Started with the Model
Installation
In order to use this model, you may install datasets and transformers:
Create a virtual environment:
python -m venv /path/to/venv
Activate the environment:
source /path/to/venv/bin/activate
Install the modules:
pip install datasets transformers
For Inference
In order to transcribe audio in Catalan using this model, you can follow this example:
# Install prerequisites
pip install torch
pip install datasets
pip install 'transformers[torch]'
pip install evaluate
pip install jiwer
# This code works with GPU
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset, Audio
from evaluate import load
import re
def remove_punctuation(text):
    text = text.replace("’", "'")  # normalize apostrophe
    return re.sub(r"[.,;:!?¿¡«»()\[\]{}]", "", text)
# Load the processor and model.
MODEL_NAME = "CLiC-UB/whisper-large-v3-ca-rapnic-down-full"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to("cuda")
ds = load_dataset("CLiC-UB/rapnic-example", split="train")
ds = ds.filter(lambda x: x["disorder"] == "Síndrome de Down")
# drop rows with zero duration
ds = ds.filter(lambda x: x["original_duration"] != 0)
# Downsample to 16kHz
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
# Process the dataset
def map_to_pred(row):
    audio = row["audio"]
    input_features = processor(
        audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt"
    ).input_features
    row["reference"] = remove_punctuation(row["prompt"].lower())
    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]
    transcription = processor.decode(predicted_ids)
    row["prediction"] = remove_punctuation(transcription.lower())
    return row
# Do the evaluation
result = ds.map(map_to_pred)
# Compute the overall WER now.
wer = load("wer")
WER = 100 * wer.compute(references=result["reference"], predictions=result["prediction"])
print(WER)
# Test result:
# 21.03194103194103
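Since the WER above is computed after lowercasing and stripping punctuation from both references and predictions, it can help to check that normalization in isolation. The snippet below reuses the `remove_punctuation` function from the example; the sample sentences are made up for illustration.

```python
import re


def remove_punctuation(text):
    text = text.replace("’", "'")  # normalize apostrophe
    return re.sub(r"[.,;:!?¿¡«»()\[\]{}]", "", text)


# Made-up Catalan sentences (not taken from the dataset).
samples = [
    "L’Anna diu: «Bon dia!»",
    "Vas al col·legi, oi?",
]
normalized = [remove_punctuation(s.lower()) for s in samples]
for line in normalized:
    print(line)
# l'anna diu bon dia
# vas al col·legi oi
```

Note that the straight apostrophe and the middle dot of "l·l" are kept, because neither is in the character class being removed.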
Additional Information
Contact
For further information, please send an email to gr.clic@ub.edu.
License