--- |
|
|
license: cc-by-nc-4.0 |
|
|
language: |
|
|
- ca |
|
|
base_model: |
|
|
- projecte-aina/stt_ca-es_conformer_transducer_large |
|
|
tags: |
|
|
- automatic-speech-recognition |
|
|
- NeMo |
|
|
model-index: |
|
|
- name: stt_ca-es_conformer_transducer_large-rapnic-down |
|
|
results: |
|
|
- task: |
|
|
name: Automatic Speech Recognition |
|
|
type: automatic-speech-recognition |
|
|
dataset: |
|
|
name: Rapnic (Test) |
|
|
type: CLiC-UB/rapnic-example |
|
|
split: test |
|
|
args: |
|
|
language: ca |
|
|
metrics: |
|
|
- name: WER |
|
|
type: wer |
|
|
value: 30.78 |
|
|
--- |
|
|
# FFT for Down Syndrome: NVIDIA Conformer-Transducer Large (ca-es) |
|
|
|
|
|
## Table of Contents |
|
|
<details> |
|
|
<summary>Click to expand</summary>
|
|
|
|
|
- [FFT for Down Syndrome: NVIDIA Conformer-Transducer Large (ca-es)](#fft-for-down-syndrome-nvidia-conformer-transducer-large-ca-es) |
|
|
- [Table of Contents](#table-of-contents) |
|
|
- [Summary](#summary) |
|
|
- [Model Description](#model-description) |
|
|
- [Finetuning](#finetuning)
|
|
- [Evaluation](#evaluation) |
|
|
- [Installation](#installation) |
|
|
- [For Inference](#for-inference) |
|
|
- [Additional Information](#additional-information) |
|
|
- [Contact](#contact) |
|
|
- [License](#license) |
|
|
|
|
|
</details> |
|
|
|
|
|
## Summary |
|
|
|
|
|
The "stt_ca-es_conformer_transducer_large-rapnic-down" is an acoustic model based on ["projecte-aina/stt_ca-es_conformer_transducer_large"](https://huggingface.co/projecte-aina/stt_ca-es_conformer_transducer_large), fine-tuned for Catalan Automatic Speech Recognition of Down syndrome speech.
|
|
The base model is itself derived from ["NVIDIA/stt_es_conformer_transducer_large"](https://huggingface.co/nvidia/stt_es_conformer_transducer_large/).
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model was created for speakers with Down syndrome and transcribes text in the lowercase Catalan alphabet, including spaces.
|
|
It was mainly fine-tuned on audio from the Rapnic dataset; see [Rapnic Example](https://huggingface.co/datasets/CLiC-UB/rapnic-example) for more details on the dataset.
|
|
See the [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer) for complete architecture details.
|
|
|
|
|
## Finetuning
|
|
For this model, full fine-tuning was performed on 70% of the available data.
|
|
To avoid training on poor-quality data, we trained a preliminary model and used its WER results to filter out speakers whose mean WER exceeded 80%.
|
|
This filtering improved metrics both for speakers under and for speakers over the WER threshold.
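The speaker-filtering step described above can be sketched as follows. This is a hypothetical illustration, not the actual training code: the function name, data layout, and scores are assumptions, and the per-utterance WER values are presumed to come from the preliminary model.

```python
# Hypothetical sketch of the speaker-filtering step: given per-utterance WER
# scores from a preliminary model, drop speakers whose mean WER exceeds the
# 0.8 (80%) threshold used for this model.

def filter_speakers(wer_by_speaker, threshold=0.8):
    """Return {speaker: mean WER} for speakers whose mean WER is <= threshold."""
    kept = {}
    for speaker, wers in wer_by_speaker.items():
        mean_wer = sum(wers) / len(wers)
        if mean_wer <= threshold:
            kept[speaker] = mean_wer
    return kept

# Made-up scores for three speakers:
scores = {
    "spk01": [0.25, 0.40, 0.31],  # mean 0.32  -> kept
    "spk02": [0.90, 0.85, 0.95],  # mean 0.90  -> filtered out
    "spk03": [0.70, 0.75],        # mean 0.725 -> kept
}
print(sorted(filter_speakers(scores)))  # ['spk01', 'spk03']
```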
|
|
### Evaluation |
|
|
To evaluate the model, we set apart 20% of the available data, ensuring that no transcription appeared in both the training and test sets.
|
|
The WER results of running inference on our test set (Down syndrome speakers only), filtered according to speaker mean WER thresholds, were the following:
|
|
|
|
| Training WER filter ↓ / Evaluation WER filter → | 0.5 | 0.6 | 0.7 | 0.8 | None |
|---|---|---|---|---|---|
| 0.8 | 21.80 | 23.06 | 23.83 | 23.83 | 30.78 |
|
|
|
|
|
|
|
|
## Installation |
|
|
|
|
|
To use this model, install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest PyTorch version. |
|
|
```bash
pip install "nemo_toolkit[all]"
```
|
|
|
|
|
|
|
|
## For Inference |
|
|
To transcribe impaired speech in Catalan using this model, you can follow this example: |
|
|
|
|
|
|
|
|
```python
import nemo.collections.asr as nemo_asr

# Paths to the downloaded .nemo checkpoint and to the audio file to transcribe
model_path = "stt_ca-es_conformer_transducer_large-rapnic-down.nemo"
audio_path = "audio.wav"

# Restore the fine-tuned Conformer-Transducer model and transcribe the audio
nemo_asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(model_path)
transcription = nemo_asr_model.transcribe([audio_path])[0].text
print(transcription)
```
|
|
|
|
|
## Additional Information |
|
|
|
|
|
### Contact |
|
|
For further information, please send an email to <gr.clic@ub.edu>. |
|
|
|
|
|
### License |
|
|
|
|
|
[CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en) |