FastConformer Quran Arabic ASR
A fine-tuned NVIDIA FastConformer Hybrid Large model for Quranic Arabic speech recognition, achieving 0.14% Word Error Rate on the tarteel-ai/everyayah validation set.
This model supports both offline transcription (full bilateral context, highest accuracy) and real-time streaming (causal local attention, cache-aware frame-by-frame inference).
Model Details
| Property | Value |
|---|---|
| Base model | nvidia/stt_ar_fastconformer_hybrid_large_pcd_v1.0 |
| Architecture | EncDecHybridRNNTCTCBPE (FastConformer-Large) |
| Parameters | 114.6M |
| Encoder layers | 18 × FastConformer blocks |
| Tokenizer | SentencePiece BPE, 1024 tokens |
| Sample rate | 16 kHz, mono |
| Val WER (offline) | 0.0014 (0.14%) |
| Dataset | tarteel-ai/everyayah |
| Framework | NVIDIA NeMo |
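For readers unfamiliar with the metric, the WER values in this card can be read as word-level edit distance divided by the number of reference words. The sketch below is a minimal, dependency-free illustration of that definition; the function name `word_error_rate` is hypothetical and this is not the evaluation code used for the reported numbers (validation WER is usually aggregated over the whole set rather than averaged per utterance).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # reference word missing from hypothesis
                    curr[j - 1] + 1,     # extra word in hypothesis
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because tashkeel is part of each word, a single wrong diacritic counts as a full word substitution under this definition.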
Training
Fine-tuned using a 3-phase progressive unfreezing strategy on a single NVIDIA RTX 4070 Ti (12 GB):
| Phase | Layers unfrozen | Steps | LR | Val WER |
|---|---|---|---|---|
| Phase 1 | Top 3 encoder + decoder | 2000 | 5e-5 | 0.0038 |
| Phase 2 | Upper half (layers 9–17) + decoder | 3000 | 1e-4 | 0.0018 |
| Phase 3 | All layers | 2500 | 5e-5 | 0.0014 |
Progressive unfreezing prevents catastrophic forgetting of the base model's Arabic speech representations while allowing the full model to adapt to Quranic phonetics, tajweed rules, and recitation style.
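The phase schedule can be sketched as a small helper that decides which encoder layers are trainable in each phase. This is an illustration only, not the training script: the names `PHASE_START_LAYER`, `trainable_layers`, and `apply_phase` are hypothetical, layers are assumed to be 0-indexed (so "top 3" is taken to mean layers 15 to 17), and the decoder is assumed to stay trainable throughout, as the table indicates.

```python
NUM_ENCODER_LAYERS = 18  # from the Model Details table

# Assumed mapping: first trainable encoder layer (0-indexed) for each phase.
PHASE_START_LAYER = {1: 15, 2: 9, 3: 0}


def trainable_layers(phase: int) -> list[int]:
    """Indices of the encoder layers that are unfrozen in the given phase."""
    return list(range(PHASE_START_LAYER[phase], NUM_ENCODER_LAYERS))


def apply_phase(encoder_layers, phase: int) -> None:
    """Freeze or unfreeze each layer's parameters for the given phase.

    `encoder_layers` is any sequence of objects exposing `.parameters()`,
    for example the `torch.nn.Module` blocks of an encoder.
    """
    unfrozen = set(trainable_layers(phase))
    for index, layer in enumerate(encoder_layers):
        for param in layer.parameters():
            param.requires_grad = index in unfrozen
```

Under these assumptions, calling `apply_phase(model.encoder.layers, 1)` before phase 1 would leave only the top three encoder blocks trainable; the optimizer would then be rebuilt with the learning rate listed for that phase.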
Training data: tarteel-ai/everyayah, a diverse multi-reciter dataset of complete Quranic recitations at multiple audio qualities, covering all 114 surahs across dozens of reciters.
Usage
Installation
pip install "nemo_toolkit[asr]"
Offline transcription (recommended for files)
The .nemo file is saved with full bilateral attention context, so transcribe() works out of the box with no configuration required.
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    "mohammed/fastconformer-quran-ar"
)
model.eval()
# Transcribe a .wav file (16kHz mono)
result = model.transcribe(["recitation.wav"])
print(result[0].text)
# e.g. "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"
Real-time streaming
The model supports cache-aware streaming inference via NeMo's cache_aware_stream_step().
The key loading sequence (order matters):
import torch
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf, open_dict
model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    "mohammed/fastconformer-quran-ar"
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# Step 1: reset conv padding to symmetric (safety check before mode switch)
for layer in model.encoder.layers:
    if hasattr(layer, "conv") and hasattr(layer.conv, "conv"):
        conv = layer.conv.conv
        ks = conv.kernel_size[0] if isinstance(conv.kernel_size, tuple) else conv.kernel_size
        conv.padding = ((ks - 1) // 2, (ks - 1) // 2)
# Step 2: switch to causal local attention for streaming
model.change_attention_model(
    self_attention_model="rel_pos_local_attn",
    att_context_size=[128, 0],  # 128 frames lookback (~10s), fully causal
)
with open_dict(model.cfg):
    model.cfg.encoder.conv_context_size = "causal"
# Step 3: causal conv padding
for layer in model.encoder.layers:
    if hasattr(layer, "conv") and hasattr(layer.conv, "conv"):
        conv = layer.conv.conv
        ks = conv.kernel_size[0] if isinstance(conv.kernel_size, tuple) else conv.kernel_size
        conv.padding = (ks - 1, 0)
# Step 4: greedy decoder
decoding_cfg = OmegaConf.structured(model.cfg.decoding)
OmegaConf.set_struct(decoding_cfg, False)
decoding_cfg.strategy = "greedy"
decoding_cfg.greedy.max_symbols = 10
decoding_cfg.greedy.use_cuda_graph_decoder = False # incompatible with streaming
model.change_decoding_strategy(decoding_cfg)
model.eval()
# Streaming step (one iteration shown): incoming 80ms PCM int16 frames are
# assumed to be buffered into one 1600ms float chunk before this call
cache_last_channel, cache_last_time = None, None
chunk_samples = 16000 * 1600 // 1000 # 1600ms chunk
audio_chunk = torch.zeros(1, chunk_samples, device=device) # replace with real audio
audio_len = torch.tensor([chunk_samples], device=device)
with torch.no_grad():
    processed, processed_len = model.preprocessor(
        input_signal=audio_chunk, length=audio_len
    )
    encoded, encoded_len, cache_last_channel, cache_last_time, _ = (
        model.encoder.cache_aware_stream_step(
            processed_signal=processed,
            processed_signal_length=processed_len,
            cache_last_channel=cache_last_channel,
            cache_last_time=cache_last_time,
            keep_all_outputs=False,
        )
    )
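The numbers in the snippet above (128 frames of lookback, roughly 10 s, and a 1600 ms chunk) follow from the encoder frame duration. The arithmetic below is a sketch under two assumptions that are not stated in this card: a 10 ms feature hop (NeMo's usual default) and 8x encoder subsampling (the usual FastConformer setting). The actual number of encoder frames per chunk may differ slightly because of padding.

```python
SAMPLE_RATE = 16_000    # Hz, from the Model Details table
HOP_MS = 10             # assumption: 10 ms feature hop
SUBSAMPLING = 8         # assumption: 8x encoder subsampling
CHUNK_MS = 1600         # chunk length used in the streaming snippet
LOOKBACK_FRAMES = 128   # left context from att_context_size=[128, 0]

frame_ms = HOP_MS * SUBSAMPLING                  # duration of one encoder frame
chunk_samples = SAMPLE_RATE * CHUNK_MS // 1000   # audio samples per chunk
frames_per_chunk = CHUNK_MS // frame_ms          # encoder frames per chunk
lookback_s = LOOKBACK_FRAMES * frame_ms / 1000   # attention lookback in seconds

print(frame_ms, chunk_samples, frames_per_chunk, lookback_s)
# prints: 80 25600 20 10.24
```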
For a complete streaming implementation with microphone input, silence detection, word callbacks, and a FastAPI WebSocket server, see the companion script in the repository files.
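As a rough idea of the buffering such a script needs, the sketch below accumulates raw int16 PCM bytes and releases fixed-size float chunks that could be fed to the streaming step. It is not the companion script: `ChunkBuffer` and `push` are hypothetical names, and the input is assumed to be mono, 16 kHz, native-endian int16 with a whole number of samples per call.

```python
from array import array

CHUNK_SAMPLES = 16_000 * 1600 // 1000  # 25600 samples = 1600 ms at 16 kHz


class ChunkBuffer:
    """Accumulates int16 PCM bytes and returns fixed-size float chunks."""

    def __init__(self, chunk_samples: int = CHUNK_SAMPLES):
        self.chunk_samples = chunk_samples
        self._samples = array("h")  # signed 16-bit integers

    def push(self, pcm_bytes: bytes) -> list[list[float]]:
        """Add PCM bytes; return every complete chunk, scaled to [-1.0, 1.0)."""
        self._samples.frombytes(pcm_bytes)
        chunks = []
        while len(self._samples) >= self.chunk_samples:
            head = self._samples[: self.chunk_samples]
            del self._samples[: self.chunk_samples]
            chunks.append([sample / 32768.0 for sample in head])
        return chunks

    def pending(self) -> int:
        """Number of buffered samples that do not yet fill a chunk."""
        return len(self._samples)
```

Each returned chunk could then be turned into a tensor (for example `torch.tensor(chunk).unsqueeze(0)`) and passed through the preprocessor and encoder step, carrying the cache tensors from one chunk to the next.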
Qualitative Examples
The following are exact reference vs. predicted outputs from the validation set. The model transcribed these word-for-word correctly, including full diacritisation (tashkeel):
| Reference | Predicted |
|---|---|
| وَهُوَ الَّذِي جَعَلَ لَكُمُ اللَّيْلَ لِبَاسًا وَالنَّوْمَ سُبَاتًا وَجَعَلَ النَّهَارَ نُشُورًا | Perfect |
| الزَّانِي لَا يَنْكِحُ إِلَّا زَانِيَةً أَوْ مُشْرِكَةً وَالزَّانِيَةُ لَا يَنْكِحُهَا إِلَّا زَانٍ أَوْ مُشْرِكٌ وَحُرِّمَ ذَلِكَ عَلَى الْمُؤْمِنِينَ | Perfect |
| إِلَّا مَنْ تَابَ وَآمَنَ وَعَمِلَ عَمَلًا صَالِحًا فَأُولَئِكَ يُبَدِّلُ اللَّهُ سَيِّئَاتِهِمْ حَسَنَاتٍ وَكَانَ اللَّهُ غَفُورًا رَحِيمًا | Perfect |
| إِذْ قَالَ لِأَبِيهِ وَقَوْمِهِ مَا تَعْبُدُونَ | Perfect |
| يَوْمَ لَا يَنْفَعُ مَالٌ وَلَا بَنُونَ | Perfect |
| إِذْ قَالَ لَهُمْ أَخُوهُمْ هُودٌ أَلَا تَتَّقُونَ | Perfect |
| أَتَبْنُونَ بِكُلِّ رِيعٍ آيَةً تَعْبَثُونَ | Perfect |
| فَنَجَّيْنَاهُ وَأَهْلَهُ أَجْمَعِينَ | Perfect |
| فَقَرَأَهُ عَلَيْهِمْ مَا كَانُوا بِهِ مُؤْمِنِينَ | Perfect |
| وَأَنْذِرْ عَشِيرَتَكَ الْأَقْرَبِينَ | Perfect |
| الَّذِينَ يُقِيمُونَ الصَّلَاةَ وَيُؤْتُونَ الزَّكَاةَ وَهُمْ بِالْآخِرَةِ هُمْ يُوقِنُونَ | Perfect |
These span multiple surahs (Al-Furqan, An-Nur, Ash-Shu'ara, As-Saffat) and include some of the most phonetically demanding ayahs in the Quran: long compound sentences, rare vocabulary (نُشُورًا، سُبَاتًا), emphatic consonants, and precise tashkeel on every word.
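Since the model outputs fully diacritised text, it can be useful to separate errors in the letters from errors in tashkeel alone. One possible approach, sketched below with the hypothetical helpers `strip_tashkeel` and `same_letters`, is to drop Unicode combining marks before comparing. This is an assumption about how one might post-process the output, not part of the model; it also removes other combining signs such as Quranic annotation marks.

```python
import unicodedata


def strip_tashkeel(text: str) -> str:
    """Remove combining marks (Unicode category Mn), which include Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def same_letters(reference: str, predicted: str) -> bool:
    """True when the two strings agree once diacritics are ignored."""
    return strip_tashkeel(reference) == strip_tashkeel(predicted)
```

If `same_letters(reference, predicted)` is true but the raw strings differ, the mismatch is confined to diacritics.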
Intended Use & Limitations
Intended use:
- Quranic recitation transcription and verification
- Tajweed learning applications
- Ayah identification from audio
- Recitation correction apps (compare hypothesis against reference ayah)
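For the recitation-correction use case, comparing the hypothesis against the reference ayah can be done with a word-level diff. The sketch below uses Python's standard `difflib`; the function name `recitation_diff` is hypothetical, and how an application presents the differences is left open.

```python
from difflib import SequenceMatcher


def recitation_diff(reference: str, hypothesis: str):
    """Return (tag, reference_words, hypothesis_words) for each mismatch.

    `tag` is one of difflib's opcodes: "replace", "delete" (word missing from
    the recitation) or "insert" (extra word in the recitation).
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    return [
        (tag, ref_words[i1:i2], hyp_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

An empty result means the transcription matched the reference word for word; otherwise each entry points at the words a learner skipped, added, or mispronounced (as far as the transcription can tell).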
Limitations:
- Optimised specifically for Quranic Arabic; performance on Modern Standard Arabic or dialectal Arabic will be significantly lower than the base model's
- Best results on clean, single-speaker recitation audio at 16kHz
- The streaming mode introduces ~1.6s of latency per chunk due to the encoder's minimum chunk size requirement
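Because the model expects 16 kHz mono input, it may help to check a file before transcribing it. The sketch below uses Python's standard `wave` module, which only reads uncompressed PCM WAV files; `check_wav` is a hypothetical helper and resampling is left to the caller.

```python
import wave


def check_wav(path: str, expected_rate: int = 16_000) -> list[str]:
    """Return a list of format problems; an empty list means the file is fine."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != 1:
        problems.append(f"{channels} channels found, expected mono")
    return problems
```

Files that fail the check can be converted with any audio tool that resamples to 16 kHz and downmixes to one channel.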
Citation
If you use this model, please cite this model and the dataset:
@misc{fastconformer-quran-ar,
author = {Mohammed},
title = {FastConformer Quran Arabic ASR},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/mohammed/fastconformer-quran-ar}
}
@misc{everyayah,
author = {Tarteel AI},
title = {EveryAyah: A Quranic Recitation Dataset},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/tarteel-ai/everyayah}
}