
whisper-large-v3-ca-rapnic-all-pissa


Model Description

The "whisper-large-v3-ca-rapnic-all-pissa" is an acoustic model suitable for Automatic Disordered Speech Recognition in Catalan based on the model "BSC-LT/whisper-large-v3-ca-punctuated-3370h". It was mainly finetuned on audios of individuals with Down syndrome and Cerebral palsy, see Rapnic Example for more details on the dataset. At the same time, the original model was based on "openai/whisper-large-v3", with a combination of Catalan data from Common Voice 17.0 (2,659 hours) and 710 hours of data released by the Projecte AINA from Barcelona, Spain. Totalling 3369 hours and 53 minutes. The model is intended to produce lowercase Catalan alphabet including spaces.

Finetuning

For this model, PiSSA fine-tuning (on all layers) was performed on 70% of the available data. To avoid training on poor data, we trained a preliminary model and used its WER results to filter out speakers whose mean WER exceeded 80%. This resulted in better metrics both for speakers under and over the WER threshold.
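As a rough sketch, a PiSSA adapter over all linear layers can be configured with the peft library by setting init_lora_weights="pissa". The rank, alpha, and target-module choices below are illustrative assumptions, not the exact values used for this model:

from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained(
    "BSC-LT/whisper-large-v3-ca-punctuated-3370h"
)

# PiSSA initializes the adapter from the principal singular values and
# vectors of the base weights; r and lora_alpha are hypothetical here.
config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules="all-linear",  # adapt all linear layers
    init_lora_weights="pissa",
)
peft_model = get_peft_model(base, config)
peft_model.print_trainable_parameters()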

Evaluation

To evaluate the model, we set apart 20% of the available data, making sure that no transcription was present in both the training and test sets. The WER results of running inference on our test set, filtered according to per-speaker mean WER thresholds, were the following (a sketch of the per-speaker filtering follows the table):

| Training ↓ / Evaluation → | 0.5 | 0.6 | 0.7 | 0.8 | None |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0.8 | 19.26 | 21.24 | 22.46 | 23.16 | 30.15 |
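The per-speaker filtering could be implemented along the lines of the sketch below; the toy rows stand in for the preliminary model's actual output, whose exact format is not published:

from collections import defaultdict
from evaluate import load

wer_metric = load("wer")

# Toy stand-in for the preliminary model's output: one
# (speaker_id, reference, prediction) triple per utterance.
rows = [
    ("spk1", "bon dia", "bon dia"),
    ("spk1", "com estàs", "com estàs"),
    ("spk2", "fins demà", "demà demà demà"),
]

per_speaker = defaultdict(lambda: {"refs": [], "preds": []})
for spk, ref, pred in rows:
    per_speaker[spk]["refs"].append(ref)
    per_speaker[spk]["preds"].append(pred)

# Keep only speakers whose mean WER stays at or below the threshold.
threshold = 0.8
kept = {
    spk
    for spk, d in per_speaker.items()
    if wer_metric.compute(references=d["refs"], predictions=d["preds"]) <= threshold
}
print(kept)  # {'spk1'}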

How to Get Started with the Model

Installation

In order to use this model, you first need to install datasets and transformers:

Create a virtual environment:

python -m venv /path/to/venv

Activate the environment:

source /path/to/venv/bin/activate

Install the modules:

pip install datasets transformers 

For Inference

In order to transcribe audio in Catalan using this model, you can follow this example:

# Install the prerequisites first:
pip install torch peft datasets 'transformers[torch]' evaluate jiwer

# Note: this example requires a CUDA GPU.

from peft import PeftModel
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
)
from datasets import load_dataset, Audio
from evaluate import load
import re
import torch


def remove_punctuation(text):
    text = text.replace("’", "'")  # normalize apostrophe
    return re.sub(r"[.,;:!?¿¡«»()\[\]{}]", "", text)


model_name = "BSC-LT/whisper-large-v3-ca-punctuated-3370h"
adapters_name = "CLiC-UB/whisper-large-v3-ca-rapnic-all-pissa"
save_path = "merged_model_whisper_large_v3_ca_rapnic_all_pissa"

model = WhisperForConditionalGeneration.from_pretrained(
    model_name,
    device_map="cuda",
)
model = PeftModel.from_pretrained(model, model_id=adapters_name)
model = model.merge_and_unload()
print("Loaded model with PiSSA weights")

# Optionally save the merged model to disk
# model.save_pretrained(save_path)
# print(f"Model saved to {save_path}")


# Transcribe example dataset with the merged model
processor = WhisperProcessor.from_pretrained(model_name)

ds = load_dataset("CLiC-UB/rapnic-example", split="train")

# drop rows with zero duration
ds = ds.filter(lambda x: x["original_duration"] != 0)

# Downsample to 16kHz
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))


# Process the dataset
def map_to_pred(row):
    audio = row["audio"]
    input_features = processor(
        audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt"
    ).input_features
    row["reference"] = remove_punctuation(row["prompt"].lower())

    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]

    transcription = processor.decode(predicted_ids, skip_special_tokens=True)
    row["prediction"] = remove_punctuation(transcription.lower())

    return row


# Do the evaluation
result = ds.map(map_to_pred)

# Compute the overall WER now.
wer = load("wer")
WER = 100 * wer.compute(references=result["reference"], predictions=result["prediction"])
print(WER)
# Test result:
# 29.55354697343194
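To transcribe a single local recording instead of the example dataset, a minimal sketch reusing the model and processor loaded above could look like this (audio.wav is a hypothetical path, and librosa is an extra dependency installed with pip install librosa):

import librosa
import torch

# Load and resample the recording to the 16 kHz rate Whisper expects.
speech, sr = librosa.load("audio.wav", sr=16_000)

input_features = processor(
    speech, sampling_rate=sr, return_tensors="pt"
).input_features

with torch.no_grad():
    predicted_ids = model.generate(input_features.to("cuda"))[0]

print(processor.decode(predicted_ids, skip_special_tokens=True))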

Additional Information

Contact

For further information, please send an email to gr.clic@ub.edu.

License

CC BY-NC 4.0
