FastConformer Quran Arabic ASR

A fine-tuned NVIDIA FastConformer Hybrid Large model for Quranic Arabic speech recognition, achieving 0.14% Word Error Rate on the tarteel-ai/everyayah validation set.

This model supports both offline transcription (full bilateral context, highest accuracy) and real-time streaming (causal local attention, cache-aware frame-by-frame inference).


Model Details

| Property | Value |
|---|---|
| Base model | nvidia/stt_ar_fastconformer_hybrid_large_pcd_v1.0 |
| Architecture | EncDecHybridRNNTCTCBPE (FastConformer-Large) |
| Parameters | 114.6M |
| Encoder layers | 18 × FastConformer blocks |
| Tokenizer | SentencePiece BPE, 1024 tokens |
| Sample rate | 16 kHz, mono |
| Val WER (offline) | 0.0014 (0.14%) |
| Dataset | tarteel-ai/everyayah |
| Framework | NVIDIA NeMo |
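
For readers unfamiliar with the metric, word error rate is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal illustration that assumes whitespace tokenisation and no text normalisation. The `wer` helper is hypothetical and is not the evaluation script used for this model.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far
    # (one fewer than the current row) and the first j hyp words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion (ref word missing from hyp)
                    curr[j - 1] + 1,     # insertion (extra hyp word)
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

Read this way, a WER of 0.0014 corresponds to roughly one word error per 700 reference words. Corpus-level WER is normally computed from total errors over total reference words rather than as a mean of per-utterance scores.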

Training

Fine-tuned using a 3-phase progressive unfreezing strategy on a single NVIDIA RTX 4070 Ti (12 GB):

| Phase | Layers unfrozen | Steps | LR | Val WER |
|---|---|---|---|---|
| Phase 1 | Top 3 encoder + decoder | 2000 | 5e-5 | 0.0038 |
| Phase 2 | Upper half (layers 9–17) + decoder | 3000 | 1e-4 | 0.0018 |
| Phase 3 | All layers | 2500 | 5e-5 | 0.0014 |

Progressive unfreezing prevents catastrophic forgetting of the base model's Arabic speech representations while allowing the full model to adapt to Quranic phonetics, tajweed rules, and recitation style.
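
The phase schedule can be sketched as a small helper that freezes every encoder layer and then re-enables gradients for the layers belonging to the current phase. This is an illustration only, not the training script used here. The layer indices assume an 18-layer encoder numbered 0 to 17, and `PHASE_LAYERS`, `set_trainable` and `apply_phase` are hypothetical names.

```python
from typing import Iterable, Sequence

# Assumed schedule for an 18-layer encoder (indices 0..17), read off the phase table.
PHASE_LAYERS = {
    1: range(15, 18),  # top 3 encoder layers
    2: range(9, 18),   # upper half of the encoder
    3: range(0, 18),   # all encoder layers
}


def set_trainable(params: Iterable, trainable: bool) -> None:
    """Toggle gradient tracking on parameters (objects with a .requires_grad flag)."""
    for param in params:
        param.requires_grad = trainable


def apply_phase(encoder_layers: Sequence, other_modules: Sequence, phase: int) -> None:
    """Freeze all encoder layers, then unfreeze the ones listed for `phase`.

    `other_modules` (for example the decoder) stay trainable in every phase.
    """
    unfrozen = set(PHASE_LAYERS[phase])
    for index, layer in enumerate(encoder_layers):
        set_trainable(layer.parameters(), index in unfrozen)
    for module in other_modules:
        set_trainable(module.parameters(), True)
```

With a NeMo hybrid model this might be called as `apply_phase(model.encoder.layers, [model.decoder, model.joint], phase=1)` before building the optimizer for each phase. Which non-encoder modules to pass, and how the subsampling front end is treated, are choices this sketch leaves open.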

Training data: tarteel-ai/everyayah, a diverse multi-reciter dataset of complete Quranic recitations at multiple audio qualities, covering all 114 surahs across dozens of reciters.


Usage

Installation

pip install "nemo_toolkit[asr]"

Offline transcription (recommended for files)

The .nemo file is saved with full bilateral attention context, so transcribe() works out of the box with no configuration required.

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    "mohammed/fastconformer-quran-ar"
)
model.eval()

# Transcribe a .wav file (16kHz mono)
result = model.transcribe(["recitation.wav"])
print(result[0].text)
# e.g. "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"
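
Because the model expects 16 kHz mono input, it can help to validate files before passing them to transcribe(). The following is a minimal sketch using only the standard library. `check_wav` is a hypothetical helper, and it only handles uncompressed PCM WAV files that the `wave` module can open.

```python
import wave


def check_wav(path: str, expected_rate: int = 16000) -> None:
    """Raise ValueError unless `path` is a mono PCM WAV file at `expected_rate` Hz."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate or channels != 1:
        raise ValueError(
            f"{path}: expected {expected_rate} Hz mono, "
            f"got {rate} Hz with {channels} channel(s)"
        )
```

Calling `check_wav("recitation.wav")` before `model.transcribe([...])` surfaces format problems early. Files that fail the check would need resampling or downmixing with an audio tool of your choice.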

Real-time streaming

The model supports cache-aware streaming inference via NeMo's cache_aware_stream_step(). The loading steps below must be run in the order shown:

import torch
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf, open_dict

model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    "mohammed/fastconformer-quran-ar"
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Step 1: reset conv padding to symmetric (safety check before mode switch)
for layer in model.encoder.layers:
    if hasattr(layer, "conv") and hasattr(layer.conv, "conv"):
        conv = layer.conv.conv
        ks = conv.kernel_size[0] if isinstance(conv.kernel_size, tuple) else conv.kernel_size
        conv.padding = ((ks - 1) // 2, (ks - 1) // 2)

# Step 2: switch to causal local attention for streaming
model.change_attention_model(
    self_attention_model="rel_pos_local_attn",
    att_context_size=[128, 0],   # 128 frames lookback (~10s), fully causal
)
with open_dict(model.cfg):
    model.cfg.encoder.conv_context_size = "causal"

# Step 3: causal conv padding (left side only)
for layer in model.encoder.layers:
    if hasattr(layer, "conv") and hasattr(layer.conv, "conv"):
        conv = layer.conv.conv
        ks = conv.kernel_size[0] if isinstance(conv.kernel_size, tuple) else conv.kernel_size
        conv.padding = (ks - 1, 0)

# Step 4: greedy decoder
decoding_cfg = OmegaConf.structured(model.cfg.decoding)
OmegaConf.set_struct(decoding_cfg, False)
decoding_cfg.strategy = "greedy"
decoding_cfg.greedy.max_symbols = 10
decoding_cfg.greedy.use_cuda_graph_decoder = False  # incompatible with streaming
model.change_decoding_strategy(decoding_cfg)
model.eval()

# One streaming step: feed a 1600 ms chunk of float audio (convert int16 PCM to float first)
cache_last_channel, cache_last_time = None, None
chunk_samples = 16000 * 1600 // 1000  # 1600 ms chunk = 25,600 samples

audio_chunk = torch.zeros(1, chunk_samples, device=device)  # replace with real audio
audio_len   = torch.tensor([chunk_samples], device=device)

with torch.no_grad():
    processed, processed_len = model.preprocessor(
        input_signal=audio_chunk, length=audio_len
    )
    encoded, encoded_len, cache_last_channel, cache_last_time, _ = (
        model.encoder.cache_aware_stream_step(
            processed_signal=processed,
            processed_signal_length=processed_len,
            cache_last_channel=cache_last_channel,
            cache_last_time=cache_last_time,
            keep_all_outputs=False,
        )
    )
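
The example above uses a zero tensor as a stand-in for real audio. If the source is raw 16-bit PCM, for example from a microphone or a WebSocket, it has to be converted to floating point and cut into fixed-size chunks first. The sketch below is one way to do that with the standard library. It assumes little-endian int16 samples at 16 kHz, and `pcm16_to_float` and `iter_chunks` are hypothetical helper names.

```python
import sys
from array import array
from typing import Iterator, List

SAMPLE_RATE = 16000
CHUNK_MS = 1600
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 25,600 samples per chunk


def pcm16_to_float(pcm: bytes) -> List[float]:
    """Decode little-endian 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    samples = array("h")
    samples.frombytes(pcm)  # raises ValueError on an odd number of bytes
    if sys.byteorder == "big":
        samples.byteswap()  # input is assumed little-endian
    return [sample / 32768.0 for sample in samples]


def iter_chunks(samples: List[float], size: int = CHUNK_SAMPLES) -> Iterator[List[float]]:
    """Yield fixed-size chunks, zero-padding the final one if it is short."""
    for start in range(0, len(samples), size):
        chunk = samples[start:start + size]
        yield chunk + [0.0] * (size - len(chunk))
```

Each chunk can then be turned into the `(1, chunk_samples)` input with something like `torch.tensor(chunk, device=device).unsqueeze(0)`. In a real pipeline NumPy would usually be faster than Python lists.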

For a complete streaming implementation with microphone input, silence detection, word callbacks, and a FastAPI WebSocket server, see the companion script in the repository files.


Qualitative Examples

The following reference and predicted outputs are taken from the validation set. The model transcribed each of them word-for-word correctly, including full diacritisation (tashkeel):

| Reference | Predicted |
|---|---|
| وَهُوَ الَّذِي جَعَلَ لَكُمُ اللَّيْلَ لِبَاسًا وَالنَّوْمَ سُبَاتًا وَجَعَلَ النَّهَارَ نُشُورًا | ✅ Perfect |
| الزَّانِي لَا يَنْكِحُ إِلَّا زَانِيَةً أَوْ مُشْرِكَةً وَالزَّانِيَةُ لَا يَنْكِحُهَا إِلَّا زَانٍ أَوْ مُشْرِكٌ وَحُرِّمَ ذَلِكَ عَلَى الْمُؤْمِنِينَ | ✅ Perfect |
| إِلَّا مَنْ تَابَ وَآمَنَ وَعَمِلَ عَمَلًا صَالِحًا فَأُولَئِكَ يُبَدِّلُ اللَّهُ سَيِّئَاتِهِمْ حَسَنَاتٍ وَكَانَ اللَّهُ غَفُورًا رَحِيمًا | ✅ Perfect |
| إِذْ قَالَ لِأَبِيهِ وَقَوْمِهِ مَا تَعْبُدُونَ | ✅ Perfect |
| يَوْمَ لَا يَنْفَعُ مَالٌ وَلَا بَنُونَ | ✅ Perfect |
| إِذْ قَالَ لَهُمْ أَخُوهُمْ هُودٌ أَلَا تَتَّقُونَ | ✅ Perfect |
| أَتَبْنُونَ بِكُلِّ رِيعٍ آيَةً تَعْبَثُونَ | ✅ Perfect |
| فَنَجَّيْنَاهُ وَأَهْلَهُ أَجْمَعِينَ | ✅ Perfect |
| فَقَرَأَهُ عَلَيْهِمْ مَا كَانُوا بِهِ مُؤْمِنِينَ | ✅ Perfect |
| وَأَنْذِرْ عَشِيرَتَكَ الْأَقْرَبِينَ | ✅ Perfect |
| الَّذِينَ يُقِيمُونَ الصَّلَاةَ وَيُؤْتُونَ الزَّكَاةَ وَهُمْ بِالْآخِرَةِ هُمْ يُوقِنُونَ | ✅ Perfect |

These span multiple surahs (Al-Furqan, An-Nur, Ash-Shu'ara, As-Saffat) and include phonetically demanding ayahs: long compound sentences, rare vocabulary (نُشُورًا، سُبَاتًا), emphatic consonants, and precise tashkeel on every word.
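
Since the outputs are fully diacritised, it can be useful to tell a tashkeel-only mismatch apart from a mismatch in the underlying letters. The sketch below does this with Unicode combining classes. It is an illustration under stated assumptions rather than the scoring used for this card: `strip_tashkeel` and `classify_word` are hypothetical helpers, and NFC normalisation is applied so that mark order (for example shadda before or after fatha) does not count as a difference.

```python
import unicodedata


def strip_tashkeel(text: str) -> str:
    """Remove combining marks (harakat, tanwin, shadda, sukun, dagger alif)."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def classify_word(reference: str, predicted: str) -> str:
    """Label a word pair as 'exact', 'diacritics_only' or 'different'."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", predicted)
    if ref == hyp:
        return "exact"
    if strip_tashkeel(ref) == strip_tashkeel(hyp):
        return "diacritics_only"
    return "different"
```

A "Perfect" row in the table corresponds to every word pair being classified as `exact`.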


Intended Use & Limitations

Intended use:

  • Quranic recitation transcription and verification
  • Tajweed learning applications
  • Ayah identification from audio
  • Recitation correction apps (compare hypothesis against reference ayah)
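
For the recitation-correction use case, comparing the hypothesis against the reference ayah amounts to a word-level alignment. One possible sketch uses Python's `difflib`. `diff_words` is a hypothetical helper, and a production app would likely also normalise the text and weight tashkeel differences separately.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def diff_words(reference: str, hypothesis: str) -> List[Tuple[str, List[str], List[str]]]:
    """Align two transcripts word by word and return only the mismatching spans.

    Each item is (operation, reference_words, hypothesis_words), where operation is
    'replace', 'delete' (words missing from the hypothesis) or 'insert' (extra words).
    """
    ref, hyp = reference.split(), hypothesis.split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    return [
        (tag, ref[i1:i2], hyp[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

An empty result means the recitation matched the reference exactly. Otherwise each tuple points at the words a learner should revisit.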

Limitations:

  • Optimised specifically for Quranic Arabic; performance on Modern Standard Arabic or dialectal Arabic is expected to be significantly lower than the base model's
  • Best results on clean, single-speaker recitation audio at 16kHz
  • The streaming mode introduces ~1.6s of latency per chunk due to the encoder's minimum chunk size requirement
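
The timing figures quoted in this card can be cross-checked with a little arithmetic. The sketch below assumes 80 ms per encoder frame (a 10 ms feature hop with 8x subsampling), which is consistent with the card's statement that 128 frames of lookback is about 10 s. The constant names and `chunk_stats` are hypothetical.

```python
SAMPLE_RATE_HZ = 16000
FRAME_MS = 80          # assumed encoder frame duration (10 ms hop x 8x subsampling)
CHUNK_MS = 1600        # chunk length used in the streaming example
LOOKBACK_FRAMES = 128  # left context in att_context_size=[128, 0]


def chunk_stats(chunk_ms: int = CHUNK_MS) -> dict:
    """Derive sample, frame and latency figures for a given chunk length."""
    return {
        "samples_per_chunk": SAMPLE_RATE_HZ * chunk_ms // 1000,
        "encoder_frames_per_chunk": chunk_ms // FRAME_MS,
        "lookback_seconds": LOOKBACK_FRAMES * FRAME_MS / 1000,
        "min_latency_seconds": chunk_ms / 1000,
    }
```

Under these assumptions a 1600 ms chunk is 25,600 samples and 20 encoder frames. The first word of a chunk cannot be emitted until the whole chunk has been captured, which is where the ~1.6 s figure comes from.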

Citation

If you use this model, please cite this model and the dataset:

@misc{fastconformer-quran-ar,
  author    = {Mohammed},
  title     = {FastConformer Quran Arabic ASR},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/mohammed/fastconformer-quran-ar}
}
@misc{everyayah,
  author    = {Tarteel AI},
  title     = {EveryAyah: A Quranic Recitation Dataset},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/tarteel-ai/everyayah}
}