Instructions to use Reza2kn/Shenava-Rizeh-0.9 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- NeMo
How to use Reza2kn/Shenava-Rizeh-0.9 with NeMo:
import nemo.collections.asr as nemo_asr asr_model = nemo_asr.models.ASRModel.from_pretrained("Reza2kn/Shenava-Rizeh-0.9") transcriptions = asr_model.transcribe(["file.wav"]) - Notebooks
- Google Colab
- Kaggle
Shenava-Rizeh 0.9 β Persian Cache-Aware Streaming ASR (32M)
A tiny (32M-param) cache-aware, multi-latency streaming Persian (Farsi) ASR model β FastConformer-Hybrid (RNNT + CTC), 16 kHz. Built for fully offline, on-device, real-time captioning (WebGPU / WASM / NeMo), part of the VisualEars project (SLT 2026).
One model serves the entire latencyβaccuracy curve (0 / 80 / 480 / 1040 ms) β pick your operating point at inference time, no re-training. Its larger sibling is Shenava-Koochik-0.9 (114M).
π Results β Golden6669 (held-out gold Persian eval)
Evaluated on Reza2kn/visualears-golden-6669 (6,669 clips, official Persian normalizer), RNNT head:
att_context_size |
Latency | WER | CER | WER_bf |
|---|---|---|---|---|
[70, 0] |
0 ms (real-time) | 11.08% | 3.17% | 9.68% |
[70, 1] |
80 ms | 10.85% | 3.08% | 9.43% |
[70, 6] |
480 ms | 10.56% | 2.93% | 9.14% |
[70, 13] |
1040 ms | 10.46% | 2.89% | 9.03% |
The curve is nearly flat β only 0.62 pp WER from 0 β 1040 ms. You get near-best accuracy at zero lookahead: a 32M model doing 11.08% WER / 3.17% CER at true real-time, fully on-device. For reference, the previous-generation single-latency fa32M scored 17.40% (and could not run below its trained latency).
Flatness holds per-condition
The latency penalty is uniform across acoustic conditions β low-latency does not fray on the hard far-field/obstructed tail (Golden6669 is deliberately ~94% non-clean):
| Condition | n | 0 ms WER | 1040 ms WER | Ξ |
|---|---|---|---|---|
| clean | 400 | 13.38 | 12.62 | +0.76 |
| obstructed | 4,335 | 10.87 | 10.20 | +0.67 |
| far-field | 1,934 | 11.24 | 10.77 | +0.47 |
Far-field has the smallest 0β1040 ms gap. (clean scores worst here β a quirk of Golden6669's small 400-clip clean slice, not a streaming effect.)
WER_bf = boundary-forgiven WER (utterances perfect modulo Persian word-spacing conventions counted correct).
How it was trained
- Base:
nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms(English cache-aware streaming), persianized by swapping in a Persian BPE-1024 tokenizer and reinitializing the decoder + joint (encoder kept). - Multi-latency:
att_context_size = [[70,13],[70,6],[70,1],[70,0]](chunked_limited) β one checkpoint covers 0 / 80 / 480 / 1040 ms. - Phase A: ~7,386 h / 3.66M clips of cleaned, teacher-pseudo-labeled Persian (
Reza2kn/visualears-persian-pseudo-asr). - Phase B: gold-anchor fine-tune on 355 human-verified gold + active-learning corrections.
- Trajectory: random-init decoder β Phase A 17.94% β Phase B 10.46% (@1040 ms).
Usage (NeMo)
from nemo.collections.asr.models import ASRModel
m = ASRModel.restore_from("shenava-rizeh-0.9.nemo").cuda().eval()
m.encoder.set_default_att_context_size([70, 0]) # 0 ms (real-time); or [70,13] for best WER
print(m.transcribe(["clip.wav"])[0].text)
[70,0]=0 ms Β· [70,1]=80 ms Β· [70,6]=480 ms Β· [70,13]=1040 ms (1 encoder frame = 80 ms, FastConformer subsampling 8).
Notes
- Version 0.9 β Phase B on 355 human-verified gold. v1.0 follows a larger active-learning gold round (the 6K worst-disagreement clips now under review on Argilla).
- The CTC head is the deployment head for real-time/WebGPU; RNNT is the higher-accuracy offline/rescorer head.
- Larger streaming sibling:
Shenava-Koochik-0.9(114M). Offline flagship:shenava-fa-fastconformer-115m(7.29%).
Citation
@misc{shenava_rizeh_2026,
title = {Shenava-Rizeh: Persian Cache-Aware Streaming ASR (32M)},
author = {Sayar, Reza},
year = {2026},
howpublished = {Hugging Face},
url = {https://huggingface.co/Reza2kn/Shenava-Rizeh-0.9}
}
- Downloads last month
- -