FastConformer-Quran CoreML (Offline)
iOS and macOS deployment of Muno459/fastconformer-quran, offline (full-utterance) variant. Apple-silicon native, full fp16 precision (no integer reduction), Neural Engine specialized via multi-function fixed-shape entry points.
For the cache-aware streaming variant, see Muno459/fastconformer-quran-coreml-streaming.
Why multi-function
Apple's Neural Engine only runs kernels it pre-compiled at install time. Anything with a runtime-dynamic output shape (e.g., variable T → variable T_out via subsampling) gets refused and falls back to GPU/CPU. The workaround: a multi-function mlprogram with one entry point per audio-length bucket, all sharing the same set of weights. Each entry point has fully fixed input AND output shapes so ANE pre-compiles a perfect kernel for it. The caller picks the function name at runtime based on padded audio length.
Result: one .mlpackage (~257 MB, weights deduped), 7 ANE-specialized entry points, no shape-probing noise on first launch.
Models
| File | Purpose | Size |
|---|---|---|
| fastconformer-quran-offline.mlpackage | ASR encoder + CTC head, 7 fixed-shape entry points | ~257 MB |
| pronunciation-head.mlpackage | Pronunciation head v7 | ~5 MB |
Both are pure CTC end-to-end.
Entry points
| Function name | Audio length | Input shape | Output logprobs shape |
|---|---|---|---|
| predict_T80 | 0.8 s | (1, 80, 80) | (1, 10, 1025) |
| predict_T200 | 2 s | (1, 80, 200) | (1, 25, 1025) |
| predict_T400 | 4 s | (1, 80, 400) | (1, 50, 1025) |
| predict_T800 | 8 s (default) | (1, 80, 800) | (1, 100, 1025) |
| predict_T1600 | 16 s | (1, 80, 1600) | (1, 200, 1025) |
| predict_T2400 | 24 s | (1, 80, 2400) | (1, 300, 1025) |
| predict_T4800 | 48 s | (1, 80, 4800) | (1, 600, 1025) |
For audio longer than 48 s, split into chunks of ≤ 48 s. The smart-chunked pattern below handles this automatically.
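The bucket routing implied by the table can be sketched in plain Swift. This is a minimal illustration, not part of the released package: the names `BucketChoice`, `chooseBucket` and `splitIntoMaxBucketRanges` are hypothetical, the frame count is rounded up (the quick-start code rounds down), and the splitter cuts at fixed 48 s boundaries without overlap, which may split a word.

```swift
import Foundation

// Bucket sizes in mel frames, copied from the entry-point table.
let bucketFrames = [80, 200, 400, 800, 1600, 2400, 4800]
let hopSamples = 160   // 10 ms hop at 16 kHz
let subsampling = 8    // output has T / 8 frames

struct BucketChoice: Equatable {
    let frames: Int            // padded input length T
    let functionName: String   // e.g. "predict_T800"
    let outputFrames: Int      // T / 8
}

/// Smallest bucket that fits `sampleCount` samples, or nil if the audio
/// exceeds the largest (48 s) bucket and must be split first.
func chooseBucket(sampleCount: Int) -> BucketChoice? {
    let frames = (sampleCount + hopSamples - 1) / hopSamples  // ceiling division (assumption)
    guard let t = bucketFrames.first(where: { $0 >= frames }) else { return nil }
    return BucketChoice(frames: t, functionName: "predict_T\(t)", outputFrames: t / subsampling)
}

/// Consecutive sample ranges, each no longer than the largest bucket.
func splitIntoMaxBucketRanges(sampleCount: Int) -> [Range<Int>] {
    let maxSamples = bucketFrames.last! * hopSamples  // 768 000 samples = 48 s
    return stride(from: 0, to: sampleCount, by: maxSamples).map {
        $0..<min($0 + maxSamples, sampleCount)
    }
}
```

For example, one second of audio (16 000 samples) is 100 frames and lands on `predict_T200`, while anything above 768 000 samples returns nil and has to be split before routing.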
Quick start (Swift, Core ML)
The function picker is MLModelConfiguration.functionName (not MLPredictionOptions), and it's set at model load time, so each function lives behind its own loaded MLModel. Cache one MLModel per bucket and route to it based on padded audio length. Requires iOS 18 / macOS 15.
```swift
import CoreML

let BUCKETS = [80, 200, 400, 800, 1600, 2400, 4800]

@available(iOS 18.0, macOS 15.0, *)
actor QuranASR {
    private let mlpackageURL: URL
    private var models: [Int: MLModel] = [:]

    init(mlpackageURL: URL) {
        self.mlpackageURL = mlpackageURL
    }

    // Lazily load and cache one MLModel per bucket.
    private func model(for T: Int) throws -> MLModel {
        if let m = models[T] { return m }
        let cfg = MLModelConfiguration()
        cfg.computeUnits = .all              // ANE + GPU + CPU
        cfg.functionName = "predict_T\(T)"   // pick the bucket's entry point at load time
        let m = try MLModel(contentsOf: mlpackageURL, configuration: cfg)
        models[T] = m
        return m
    }

    // computeLogMel, ctcCollapse and sentencePieceDecode are app-side helpers not shown here.
    // Audio longer than the largest bucket (48 s) must be chunked by the caller.
    func transcribe(_ audioSamples: [Float]) throws -> String {
        let actualFrames = audioSamples.count / 160   // 10 ms hop at 16 kHz
        let T = BUCKETS.first { $0 >= actualFrames } ?? BUCKETS.last!
        let paddedSamples = audioSamples + Array(repeating: 0.0, count: max(0, T*160 - audioSamples.count))
        let features = computeLogMel(paddedSamples)   // (1, 80, T), Float16

        let asr = try model(for: T)
        let input = try MLDictionaryFeatureProvider(dictionary: ["audio_signal": features])
        let out = try asr.prediction(from: input)
        let logprobs = out.featureValue(for: "logprobs")!.multiArrayValue!   // (1, T/8, 1025)

        // Trim output to actual frames, then decode
        let validOutFrames = (actualFrames + 7) / 8
        let tokenIds = ctcCollapse(logprobs, validFrames: validOutFrames)
        return sentencePieceDecode(tokenIds, model: "tokenizer.model")
    }
}
```
iOS 17 fallback (default entry point only)
On iOS 17 / macOS 14, MLModelConfiguration.functionName doesn't exist, so loading the .mlpackage always uses the default function predict_T800 (8 s bucket). Always pad audio to 800 mel frames (128 000 samples at 16 kHz), trim the output to ceil(actualFrames / 8) frames, then apply CTC collapse. Verses up to 8 s work as-is; longer audio needs the smart-chunked pattern below.
```swift
import CoreML

actor QuranASRDefault {
    private let asr: MLModel

    init(mlpackageURL: URL) throws {
        let cfg = MLModelConfiguration()
        cfg.computeUnits = .all
        // No functionName set: uses default `predict_T800`
        asr = try MLModel(contentsOf: mlpackageURL, configuration: cfg)
    }

    func transcribe(_ audioSamples: [Float]) throws -> String {
        let actualFrames = audioSamples.count / 160
        let T = 800
        let paddedSamples = audioSamples + Array(repeating: 0.0, count: max(0, T*160 - audioSamples.count))
        let features = computeLogMel(paddedSamples)   // (1, 80, 800), Float16

        let input = try MLDictionaryFeatureProvider(dictionary: ["audio_signal": features])
        let out = try asr.prediction(from: input)
        let logprobs = out.featureValue(for: "logprobs")!.multiArrayValue!   // (1, 100, 1025)

        let validOutFrames = (actualFrames + 7) / 8
        let tokenIds = ctcCollapse(logprobs, validFrames: validOutFrames)
        return sentencePieceDecode(tokenIds, model: "tokenizer.model")
    }
}
```
Smart chunked inference (live UX)
For a "live" UX while the user recites, run the offline model on overlapping audio chunks and stitch the transcripts: find the longest suffix of the existing transcript that matches a prefix of the new one, then append only the words after that match.
```swift
import AVFoundation
import CoreML

actor StreamingTranscriber {
    private let asr: QuranASR
    private let chunkSeconds = 8.0
    private let overlapSeconds = 1.5
    private let sampleRate = 16000

    private var buffer: [Float] = []
    private var lastEmittedEnd = 0
    private var transcript = ""

    init(asr: QuranASR) {
        self.asr = asr
    }

    func push(_ samples: [Float]) async throws -> String {
        buffer.append(contentsOf: samples)
        let chunkSamples = Int(chunkSeconds * Double(sampleRate))
        let overlapSamples = Int(overlapSeconds * Double(sampleRate))

        while buffer.count - lastEmittedEnd >= chunkSamples {
            let start = max(0, lastEmittedEnd - overlapSamples)
            let end = start + chunkSamples
            let chunk = Array(buffer[start..<end])
            let text = try await asr.transcribe(chunk)
            transcript = mergeWithOverlap(existing: transcript, new: text)
            lastEmittedEnd = end - overlapSamples / 2
        }
        return transcript
    }

    // Drop the longest run of words shared by the end of `existing` and the start of `new`.
    private func mergeWithOverlap(existing: String, new: String) -> String {
        let exWords = existing.split(separator: " ")
        let newWords = new.split(separator: " ")
        for k in stride(from: min(exWords.count, newWords.count), through: 1, by: -1) {
            let suffix = exWords.suffix(k)
            let prefix = newWords.prefix(k)
            if Array(suffix) == Array(prefix) {
                return (exWords + newWords.dropFirst(k)).joined(separator: " ")
            }
        }
        return existing + " " + new
    }
}
```
Each transcribe() call always lands on the predict_T800 entry point (8 s chunks fit the typical ayah).
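To see which audio spans the loop in push() actually sends to the model, the window arithmetic can be replayed without Core ML. The sketch below assumes the same constants as StreamingTranscriber (8 s chunks, 1.5 s overlap, 16 kHz); `chunkWindows` is a hypothetical helper written only for this illustration.

```swift
/// Replays the windowing logic of push() for a buffer of `totalSamples`.
/// Defaults mirror 8 s chunks and 1.5 s overlap at 16 kHz.
func chunkWindows(totalSamples: Int,
                  chunkSamples: Int = 128_000,
                  overlapSamples: Int = 24_000) -> [Range<Int>] {
    var windows: [Range<Int>] = []
    var lastEmittedEnd = 0
    while totalSamples - lastEmittedEnd >= chunkSamples {
        let start = max(0, lastEmittedEnd - overlapSamples)
        let end = start + chunkSamples
        windows.append(start..<end)
        lastEmittedEnd = end - overlapSamples / 2
    }
    return windows
}

// 20 s of audio yields two windows; the second starts 36 000 samples
// (2.25 s) before the first one ends.
print(chunkWindows(totalSamples: 320_000))
```

Two things fall out of this replay. Consecutive windows share 2.25 s of audio rather than 1.5 s, because the end marker is pulled back by half the overlap before the full overlap is subtracted again. And audio after the last full window stays untranscribed until more samples arrive, so an app would likely want a final flush when the recitation stops.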
Precision
fp16 throughout, no integer reduction. Apple's Neural Engine is natively fp16, so fp16 weights and activations are the accuracy-preserving choice on-device. No int8 / int4 used anywhere in this release.
Cross-checked against the canonical ONNX baseline at T=500: max abs diff logprobs 3.2 × 10⁻⁵, encoder_output 1.5 × 10⁻⁶. Effectively identical to upstream.
FP16 overflow fix
The vanilla FP16 export produced all-NaN logprobs on real audio. Root cause: RelPositionalEncoding multiplies the pre-encode output (peaks ~5 200) by xscale = √512 ≈ 22.6, producing values up to 117 924, about 1.8× the FP16 maximum of 65 504. In FP16 this becomes +inf, and inf × 0 (from the attention mask) cascades NaN through all 17 Conformer layers.
Fix: pos_enc.forward is patched to clamp(x, −2400, 2400) before the xscale multiply. 2 400 × 22.6 ≈ 54 240, safely inside FP16. The clamp is baked into the traced graph as a MIL clip op; it has no measurable effect on transcription accuracy for normal audio (mel features do not legitimately reach ±2 400 after per-utterance normalization).
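The arithmetic behind the overflow and the clamp can be checked with a few lines of Swift. This is only a numeric illustration in Float32 against the FP16 limit, not the patched PyTorch code; the function names are hypothetical.

```swift
let fp16Max: Float = 65_504                   // largest finite FP16 value
let xscale: Float = Float(512).squareRoot()   // about 22.63

/// True when x * xscale would exceed the finite FP16 range.
func overflowsFP16AfterScale(_ x: Float) -> Bool {
    abs(x * xscale) > fp16Max
}

/// Clamp first, then scale, mirroring the order of operations in the fix.
func clampThenScale(_ x: Float, limit: Float = 2_400) -> Float {
    min(max(x, -limit), limit) * xscale
}

print(overflowsFP16AfterScale(5_200))    // true: the unclamped peak overflows
print(clampThenScale(5_200) < fp16Max)   // true: the clamped value stays finite
print((Float.infinity * 0).isNaN)        // true: inf × 0 is NaN, which then propagates
```

Values inside the clamp range pass through unchanged apart from the scaling, which is why typical inputs are unaffected.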
Feature extraction
The model expects 80-channel log-mel features computed identically to NVIDIA's NeMo FilterbankFeatures default:
- 16 kHz sample rate
- 25 ms window (400 samples) with Hann
- 10 ms hop (160 samples)
- 512-point FFT
- 80 mel bins, mel_floor = 1e-5
- Per-utterance mean and variance normalization per channel
A pure-Swift implementation will be ~200 lines using Accelerate for the FFT. The exact Python reference is in tajweed/aligner.py on the main repo.
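As one small piece of that pipeline, the last step in the list (per-channel mean and variance normalization) might look like the sketch below. It operates on a plain `[[Float]]` of shape 80 × T. The N − 1 variance denominator and the small constant added to the standard deviation are assumptions about the reference behavior and should be checked against tajweed/aligner.py; `normalizePerChannel` is a hypothetical name.

```swift
/// Normalizes each mel channel to zero mean and unit variance over time.
/// `mel` is indexed as [channel][frame].
func normalizePerChannel(_ mel: [[Float]], eps: Float = 1e-5) -> [[Float]] {
    return mel.map { (channel: [Float]) -> [Float] in
        // A channel with fewer than two frames has no meaningful variance.
        guard channel.count > 1 else { return channel.map { _ in Float(0) } }
        let n = Float(channel.count)
        let mean: Float = channel.reduce(Float(0), +) / n
        let sumSq: Float = channel.reduce(Float(0)) { $0 + ($1 - mean) * ($1 - mean) }
        let std: Float = (sumSq / (n - 1)).squareRoot() + eps   // assumed: unbiased std plus eps
        return channel.map { ($0 - mean) / std }
    }
}
```

The Accelerate framework (vDSP) could replace the reductions for speed; the scalar version is kept here so the arithmetic is easy to compare with the Python reference.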
Tokenizer
tokenizer.model is a SentencePiece BPE model with 1,024 pieces plus 1 blank (id 1024). For iOS, use a SentencePiece Swift port or implement BPE decoding manually (~50 lines).
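The manual decoding route can be sketched as follows, assuming the piece table has already been extracted from tokenizer.model into an array indexed by token ID (parsing the model file itself is not shown). SentencePiece marks word boundaries with U+2581 (▁), so decoding amounts to concatenating pieces and turning that marker into spaces. `detokenize` and the tiny vocabulary in the usage lines are hypothetical, and byte-fallback pieces are not handled.

```swift
import Foundation

/// Joins SentencePiece pieces into text. IDs outside the table
/// (for example the CTC blank, id 1024) are skipped.
func detokenize(_ ids: [Int], pieces: [String]) -> String {
    let joined = ids
        .compactMap { pieces.indices.contains($0) ? pieces[$0] : nil }
        .joined()
    return joined
        .replacingOccurrences(of: "\u{2581}", with: " ")
        .trimmingCharacters(in: .whitespaces)
}

// Toy vocabulary for illustration only; the real table has 1,024 pieces.
let toyPieces = ["\u{2581}بسم", "\u{2581}الله", "\u{2581}ال", "رحمن"]
print(detokenize([0, 1, 2, 3], pieces: toyPieces))
```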
Decoding pipeline:
- Argmax over logprobs per frame to get a sequence of token IDs
- Trim to ceil(actual_audio_samples / 160 / 8) output frames (the rest are padding)
- CTC collapse: merge consecutive identical IDs, then remove blanks (id 1024)
- SentencePiece decode to final Arabic text
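The first three steps can be combined into one greedy pass. The sketch below takes the log-probabilities as a plain `[[Float]]` indexed [frame][token]; copying them out of the MLMultiArray is left to the caller. `greedyCTCDecode` is a hypothetical name, not a function shipped with the model.

```swift
/// Greedy CTC decode: per-frame argmax, trim to the valid frames,
/// merge repeats, then drop blanks.
func greedyCTCDecode(logprobs: [[Float]], validFrames: Int, blankId: Int = 1024) -> [Int] {
    var tokens: [Int] = []
    var previous = blankId
    for frame in logprobs.prefix(max(0, validFrames)) {
        // Argmax over the vocabulary axis for this frame.
        guard let best = frame.indices.max(by: { frame[$0] < frame[$1] }) else { continue }
        // Emit only when the label changes and is not the blank.
        if best != previous && best != blankId {
            tokens.append(best)
        }
        previous = best
    }
    return tokens
}
```

Tracking the previous label, including blanks, is what keeps a genuinely repeated token: the frame sequence a, a, blank, a decodes to a, a, whereas removing blanks before merging repeats would wrongly yield a single a.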
License
Apache 2.0. Same license as the upstream Muno459/fastconformer-quran and NVIDIA FastConformer-Hybrid.
Citation
```bibtex
@misc{fastconformer-quran-coreml-offline-2026,
  title  = {FastConformer-Quran CoreML (Offline): on-device Quranic ASR for iOS},
  author = {Anon},
  year   = {2026},
  url    = {https://huggingface.co/Muno459/fastconformer-quran-coreml-offline},
}
```