---
language:
- en
datasets:
- mozilla-foundation/common_voice_13_0
- facebook/voxpopuli
- LIUM/tedlium
- librispeech_asr
- fisher_corpus
- WSJ-0
metrics:
- wer
pipeline_tag: automatic-speech-recognition
model-index:
- name: tbd
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 2.5
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 5.6
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 6.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 7.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Mozilla Common Voice 13.0
      type: mozilla-foundation/common_voice_13_0
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 12.1
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: FLEURS
      type: google/fleurs
      split: test
      args:
        language: en_us
    metrics:
    - type: wer
      value: 6.8
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Switchboard
      type: unk
      split: eval2000
      args:
        language: en
    metrics:
    - type: wer
      value: 6.8
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Wall Street Journal
      type: unk
      split: eval92
      args:
        language: en
    metrics:
    - type: wer
      value: 1.3
      name: Test WER
---
# DeCRED-base

This is a **174M-parameter encoder-decoder E-Branchformer model** trained with a decoder-centric regularisation technique on 6,000 hours of open-source, normalised English data.
It achieves Word Error Rates (WERs) comparable to `openai/whisper-medium` across multiple datasets with about a quarter of the parameters.

Architecture details, training hyperparameters, and a description of the proposed technique will be added soon.

*Disclaimer: The model currently produces insertions on utterances that contain only silence, because it was not trained on such data. A fix will be added soon.*
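For readers who want to reproduce or sanity-check WER figures like the ones quoted above, here is a minimal sketch of the metric in plain Python. It assumes whitespace tokenisation and that both strings are already normalised; the function name `word_error_rate` is illustrative and is not part of this repository. The values in the model metadata are presumably percentages, so multiply the returned fraction by 100 to compare.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # WER is undefined for an empty reference (e.g. a silence-only utterance).
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {100 * wer:.1f}%")
```

Because the denominator is the reference length, inserted words on a silence-only utterance cannot be expressed as a WER at all, which is why such cases are usually reported separately as insertion counts.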
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class to transcribe audio files of arbitrary length.
```python
from transformers import pipeline

model_id = "BUT-FIT/DeCRED-base"
pipe = pipeline("automatic-speech-recognition", model=model_id, feature_extractor=model_id, trust_remote_code=True)

# In newer versions of transformers (>4.31.0), there is a bug in the pipeline inference type.
# The warning can be ignored.
pipe.type = "seq2seq"

# Run beam search decoding with joint CTC-attention scorer
result_beam = pipe("audio.wav")

# Run greedy decoding without joint CTC-attention scorer
pipe.model.generation_config.ctc_weight = 0.0
pipe.model.generation_config.num_beams = 1
result_greedy = pipe("audio.wav")
```
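To give some intuition for what `ctc_weight` controls, the sketch below shows the interpolation commonly used in hybrid CTC/attention decoding: each hypothesis is ranked by a weighted sum of its decoder (attention) log-probability and its CTC log-probability. This is an illustration under assumptions, not the model's actual implementation: the custom code in this repository may combine the scores differently, real decoders score CTC prefixes step by step during beam search rather than whole hypotheses, and the function names and probabilities here are made up.

```python
import math


def joint_score(attention_logprob: float, ctc_logprob: float, ctc_weight: float) -> float:
    """Weighted sum of decoder and CTC log-probabilities for one hypothesis."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return (1.0 - ctc_weight) * attention_logprob + ctc_weight * ctc_logprob


def best_hypothesis(hypotheses: dict, ctc_weight: float) -> str:
    """Return the hypothesis text with the highest joint score."""
    return max(
        hypotheses,
        key=lambda text: joint_score(
            hypotheses[text]["attention"], hypotheses[text]["ctc"], ctc_weight
        ),
    )


# Hypothetical scores: the decoder slightly prefers the wrong word,
# while the CTC branch clearly prefers the right one.
hypotheses = {
    "the cat sad": {"attention": math.log(0.40), "ctc": math.log(0.10)},
    "the cat sat": {"attention": math.log(0.35), "ctc": math.log(0.30)},
}

print(best_hypothesis(hypotheses, ctc_weight=0.0))  # attention only: "the cat sad"
print(best_hypothesis(hypotheses, ctc_weight=0.3))  # joint scoring: "the cat sat"
```

With `ctc_weight = 0.0` the CTC term drops out entirely, which matches the greedy configuration shown above, where decoding relies on the attention decoder alone.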
## Citation

If you use [DeCRED](https://arxiv.org/abs/2410.17437) in your research, please cite the following paper:

```bibtex
@misc{polok2024improvingautomaticspeechrecognition,
      title={Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models},
      author={Alexander Polok and Santosh Kesiraju and Karel Beneš and Lukáš Burget and Jan Černocký},
      year={2024},
      eprint={2410.17437},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2410.17437},
}
```