	---
	license: mit
	base_model:
	- facebook/wav2vec2-large-robust
	- aadel4/Wav2vec_Classroom
	pipeline_tag: automatic-speech-recognition
	library_name: transformers
	language: en
	tags:
	- audio
	- automatic-speech-recognition
	- wav2vec2
	---
	## Model Card: Wav2vec_Classroom_WSP_FT

	### Model Overview
	- Model Name: Wav2vec_Classroom_WSP_FT
	- Version: 1.0
	- Developed By: Ahmed Adel Attia (University of Maryland)
	- Date: 2025

	Description:
	Wav2vec_Classroom_WSP_FT is an automatic speech recognition (ASR) model trained specifically for classroom speech transcription using a weakly supervised pretraining (WSP) approach. The model first undergoes supervised pretraining on weakly transcribed classroom data (NCTE-Weak) and is then fine-tuned using a small amount of human-verified gold-standard data (NCTE-Gold). This methodology allows the model to generalize well despite the scarcity of precisely transcribed classroom speech.

	This model is adapted from [Wav2vec-Classroom](https://huggingface.co/aadel4/Wav2vec_Classroom), which was trained using continued pretraining (CPT) on large-scale unlabeled classroom speech data. The adaptation involves further fine-tuning to leverage weak transcriptions before final refinement on high-quality annotations.

	This model was originally trained with the fairseq library and then ported to Hugging Face Transformers.

	For best results, decode the model's output with n-gram LM beam search. Our best results used [this 5-gram LM](https://drive.google.com/drive/u/0/folders/1yAFXcbozqDUFZu-hnnzFP_8SAzDYT2JJ), which we trained on classroom speech text.
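The decoding setup described above can be sketched roughly as follows. This is a minimal sketch, not the author's pipeline: it assumes the `transformers`, `torch`, and `pyctcdecode` (with `kenlm`) packages are installed, that the repository ships a processor with a CTC tokenizer, and that the 5-gram LM has been downloaded as an ARPA or KenLM binary file whose casing matches the model vocabulary. `model_id` and `lm_path` are placeholders supplied by the caller.

```python
def labels_from_vocab(vocab):
    """Order tokens by their integer id so that label i matches logit column i."""
    return [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]


def transcribe_with_lm(waveform, model_id, lm_path, beam_width=100):
    """Decode a 16 kHz mono waveform with CTC beam search plus an n-gram LM.

    waveform: 1-D float array sampled at 16 kHz.
    model_id: Hub repo id or local path of the model (placeholder).
    lm_path:  path to the downloaded n-gram LM file (placeholder).
    """
    # Imported lazily so the helper above stays usable without these packages.
    import torch
    from pyctcdecode import build_ctcdecoder
    from transformers import AutoModelForCTC, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForCTC.from_pretrained(model_id).eval()

    inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]  # shape: (frames, vocab_size)

    # Assumption: the tokenizer vocabulary lines up with the logit columns.
    labels = labels_from_vocab(processor.tokenizer.get_vocab())
    decoder = build_ctcdecoder(labels, kenlm_model_path=lm_path)
    return decoder.decode(logits.cpu().numpy(), beam_width=beam_width)
```

A call would look like `transcribe_with_lm(waveform, model_id, "path/to/lm.arpa")`, where the path is hypothetical. The beam width and the LM weights (left at the `pyctcdecode` defaults here) are worth tuning on held-out classroom audio.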

	Use Case:
	- Speech-to-text transcription for classroom environments.
	- Forced alignment of transcripts with audio to provide character- and word-level boundaries.
	- Educational research and analysis of classroom discourse.
	- Low-resource ASR applications where gold-standard labels are limited.
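For the forced-alignment use case, one common approach with a CTC model is a Viterbi search over the known transcript. The sketch below is a generic illustration in plain Python, not the author's tooling: it assumes you already have per-frame log-probabilities from the model and the transcript as token ids, and the function names are hypothetical. wav2vec 2.0 emits one frame per 20 ms of 16 kHz audio, which is the default used to convert frame indices to seconds.

```python
import math


def ctc_forced_align(log_probs, targets, blank=0):
    """Viterbi-align `targets` to frames.

    log_probs: list of T frames, each a list of per-token log-probabilities.
    targets:   list of token ids for the known transcript.
    Returns a list of length T holding the index into `targets` that each
    frame is aligned to, or None where the frame is aligned to blank.
    """
    T = len(log_probs)
    # Interleave blanks: [blank, y1, blank, y2, ..., yL, blank].
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    S = len(ext)
    neg_inf = -math.inf
    score = [[neg_inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][blank]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            # Skipping a blank is allowed only between two different tokens.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: score[t - 1][p])
            if score[t - 1][best] == neg_inf:
                continue  # state not reachable at this frame
            score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
            back[t][s] = best
    # A valid path ends in the final blank or the final token.
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    if score[T - 1][s] == neg_inf:
        raise ValueError("not enough frames to align the transcript")
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = s
        s = back[t][s]
    # Odd states in the extended sequence are tokens, even states are blanks.
    return [(p - 1) // 2 if p % 2 == 1 else None for p in path]


def token_spans(frame_to_token, frame_sec=0.02):
    """Collapse a frame alignment into (token_index, start_sec, end_sec) tuples."""
    spans = {}
    for t, idx in enumerate(frame_to_token):
        if idx is None:
            continue
        start, _ = spans.get(idx, (t, t))
        spans[idx] = (start, t + 1)
    return [(i, s * frame_sec, e * frame_sec) for i, (s, e) in sorted(spans.items())]
```

Word-level boundaries follow by merging the spans of the characters between word delimiters.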

	### Model Details
	Architecture: wav2vec 2.0-based model fine-tuned with fairseq

	Training Data:
	- NCTE-Weak: 5000 hours of weak transcriptions from the NCTE dataset.
	- NCTE-Gold: 13 hours of manually transcribed classroom recordings.

	Training Strategy:
	1. Weakly Supervised Pretraining (WSP): The model is first trained using NCTE-Weak transcripts, which contain alignment errors and omissions but provide useful weak supervision.
	2. Precise Fine-tuning: The pretrained model is fine-tuned on NCTE-Gold, ensuring it adapts to high-quality transcriptions.

	### Evaluation Results
	Word Error Rate (WER) comparison on NCTE and MPT test sets:

	| Training Data | NCTE WER | MPT WER |
	|--------------|----------|---------|
	| Baseline (TEDLIUM-trained ASR) | 55.82 / 50.56 | 55.11 / 50.50 |
	| NCTE-Weak only | 36.23 / 32.30 | 50.84 / 46.09 |
	| NCTE-Gold only | 21.12 / 16.47 | 31.52 / 27.93 |
	| Self-training | 17.45 / 15.09 | 27.42 / 26.24 |
	| NCTE-WSP-ASR (NCTE-Weak → NCTE-Gold) | 16.54 / 13.51 | 25.07 / 23.70 |
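For readers who want to reproduce this kind of comparison on their own data, here is a minimal sketch of how WER is conventionally computed (word-level edit distance divided by the number of reference words), together with a helper for relative WER reduction. The function names are hypothetical, and text normalization (casing, punctuation) is left to the caller; the card does not specify the scoring script that was used.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return 100.0 * prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# One substitution and one deletion against five reference words.
print(word_error_rate("the teacher asks a question", "the teacher ask question"))

# First NCTE figure in the table: NCTE-Gold only vs. NCTE-WSP-ASR.
print(round(relative_reduction(21.12, 16.54), 1))
```

On the first NCTE figure in each cell, moving from NCTE-Gold only (21.12) to NCTE-WSP-ASR (16.54) is a relative reduction of about 21.7%.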

	### Limitations
	- The model relies on weak supervision, so transcription quality depends on the balance between weak and gold-standard data.
	- Classroom noise, overlapping speech, and spontaneous interactions may still lead to recognition errors.
	- The model was trained specifically on elementary math classrooms and may not generalize well to other educational settings without further adaptation.

	### Usage Request
	If you use the NCTE-WSP-ASR model in your research, please acknowledge this work and cite the original paper submitted to Interspeech 2025.

	For inquiries or collaborations, don't hesitate to contact me at aadel@umd.edu or ahmadadelattia@gmail.com