FastConformer-Quran CoreML (Offline)

iOS and macOS deployment of Muno459/fastconformer-quran, offline (full-utterance) variant. Apple-silicon native, full fp16 precision (no integer reduction), Neural Engine specialized via multi-function fixed-shape entry points.

For the cache-aware streaming variant, see Muno459/fastconformer-quran-coreml-streaming.

Why multi-function

Apple's Neural Engine only runs kernels it pre-compiled at install time. Anything with a runtime-dynamic output shape (e.g., variable T → variable T_out via subsampling) gets refused and falls back to GPU/CPU. The workaround: a multi-function mlprogram with one entry point per audio-length bucket, all sharing the same set of weights. Each entry point has fully fixed input and output shapes, so the ANE can pre-compile a kernel specialized for it. The caller picks the function name at runtime based on padded audio length.

Result: one .mlpackage (~257 MB, weights deduped), 7 ANE-specialized entry points, no shape-probing noise on first launch.

Models

| File | Purpose | Size |
|---|---|---|
| fastconformer-quran-offline.mlpackage | ASR encoder + CTC head, 7 fixed-shape entry points | ~257 MB |
| pronunciation-head.mlpackage | Pronunciation head v7 | ~5 MB |

Both are pure CTC end-to-end.

Entry points

| Function name | Audio length | Input shape | Output logprobs shape |
|---|---|---|---|
| predict_T80 | 0.8 s | (1, 80, 80) | (1, 10, 1025) |
| predict_T200 | 2 s | (1, 80, 200) | (1, 25, 1025) |
| predict_T400 | 4 s | (1, 80, 400) | (1, 50, 1025) |
| predict_T800 | 8 s (default) | (1, 80, 800) | (1, 100, 1025) |
| predict_T1600 | 16 s | (1, 80, 1600) | (1, 200, 1025) |
| predict_T2400 | 24 s | (1, 80, 2400) | (1, 300, 1025) |
| predict_T4800 | 48 s | (1, 80, 4800) | (1, 600, 1025) |

For audio longer than 48 s, split into chunks of ≤ 48 s. The smart-chunked pattern below handles this automatically.
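To make the bucket table concrete, here is a minimal sketch of how a caller might map a raw sample count to an entry point. The `EntryPoint` type and `entryPoint(forSampleCount:)` function are hypothetical names introduced for illustration, not part of the released package. The sketch assumes a 160-sample hop and an 8× subsampling factor, as implied by the shapes in the table, and it rounds the frame count up so that a partial final hop never exceeds the chosen bucket.

```swift
// Hypothetical helper (not part of the released package): maps a raw sample
// count to the smallest entry point that fits it, per the bucket table.
let bucketFrames = [80, 200, 400, 800, 1600, 2400, 4800]
let hopSamples = 160        // 10 ms hop at 16 kHz
let subsampling = 8         // output frames = input frames / 8

struct EntryPoint: Equatable {
    let functionName: String
    let inputFrames: Int
    let outputFrames: Int
}

func entryPoint(forSampleCount n: Int) -> EntryPoint? {
    // Round up so a partial final hop still counts as a frame.
    let frames = (n + hopSamples - 1) / hopSamples
    guard let t = bucketFrames.first(where: { $0 >= frames }) else {
        return nil   // longer than 48 s: the caller has to chunk first
    }
    return EntryPoint(functionName: "predict_T\(t)",
                      inputFrames: t,
                      outputFrames: t / subsampling)
}
```

Returning `nil` for audio above the largest bucket keeps the "split into chunks" decision explicit at the call site instead of silently truncating.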

Quick start (Swift, Core ML)

The function picker is MLModelConfiguration.functionName (not MLPredictionOptions), and it is set at model load time, so each function lives behind its own loaded MLModel. Cache one MLModel per bucket and route to it based on padded audio length. Requires iOS 18 / macOS 15.

import CoreML

let BUCKETS = [80, 200, 400, 800, 1600, 2400, 4800]

@available(iOS 18.0, macOS 15.0, *)
actor QuranASR {
    private let mlpackageURL: URL
    private var models: [Int: MLModel] = [:]

    init(mlpackageURL: URL) {
        self.mlpackageURL = mlpackageURL
    }

    private func model(for T: Int) throws -> MLModel {
        if let m = models[T] { return m }
        let cfg = MLModelConfiguration()
        cfg.computeUnits = .all              // ANE + GPU + CPU
        cfg.functionName = "predict_T\(T)"   // pick the bucket's entry point at load time
        let m = try MLModel(contentsOf: mlpackageURL, configuration: cfg)
        models[T] = m
        return m
    }

    func transcribe(_ audioSamples: [Float]) throws -> String {
        let actualFrames = audioSamples.count / 160   // 10 ms hop at 16 kHz
        let T = BUCKETS.first { $0 >= actualFrames } ?? BUCKETS.last!
        let paddedSamples = audioSamples + Array(repeating: 0.0, count: max(0, T*160 - audioSamples.count))

        let features = computeLogMel(paddedSamples)   // (1, 80, T), Float16
        let asr = try model(for: T)
        let input = try MLDictionaryFeatureProvider(dictionary: ["audio_signal": features])
        let out = try asr.prediction(from: input)
        let logprobs = out.featureValue(for: "logprobs")!.multiArrayValue!   // (1, T/8, 1025)

        // Trim output to actual frames, then decode
        let validOutFrames = (actualFrames + 7) / 8
        let tokenIds = ctcCollapse(logprobs, validFrames: validOutFrames)
        return sentencePieceDecode(tokenIds, model: "tokenizer.model")
    }
}

iOS 17 fallback (default entry point only)

On iOS 17 / macOS 14, MLModelConfiguration.functionName doesn't exist, so loading the .mlpackage without setting it uses the default function predict_T800 (8 s bucket). Always pad audio to 800 mel frames (128 000 samples at 16 kHz) and keep only the first ceil(actualFrames / 8) output frames before CTC collapse. Verses up to 8 s work as-is; longer audio needs the smart-chunked pattern below.

import CoreML

actor QuranASRDefault {
    private let asr: MLModel

    init(mlpackageURL: URL) throws {
        let cfg = MLModelConfiguration()
        cfg.computeUnits = .all
        // No functionName set → uses default `predict_T800`
        asr = try MLModel(contentsOf: mlpackageURL, configuration: cfg)
    }

    func transcribe(_ audioSamples: [Float]) throws -> String {
        let actualFrames = audioSamples.count / 160
        let T = 800
        let paddedSamples = audioSamples + Array(repeating: 0.0, count: max(0, T*160 - audioSamples.count))
        let features = computeLogMel(paddedSamples)  // (1, 80, 800), Float16
        let input = try MLDictionaryFeatureProvider(dictionary: ["audio_signal": features])
        let out = try asr.prediction(from: input)
        let logprobs = out.featureValue(for: "logprobs")!.multiArrayValue!  // (1, 100, 1025)
        let validOutFrames = (actualFrames + 7) / 8
        let tokenIds = ctcCollapse(logprobs, validFrames: validOutFrames)
        return sentencePieceDecode(tokenIds, model: "tokenizer.model")
    }
}

Smart chunked inference (live UX)

For a "live" UX while the user recites, run the offline model on overlapping audio chunks and stitch transcripts with longest-suffix-of-existing / prefix-of-new dedupe.

import AVFoundation
import CoreML

@available(iOS 18.0, macOS 15.0, *)   // matches QuranASR, which this actor stores
actor StreamingTranscriber {
    private let asr: QuranASR
    private let chunkSeconds = 8.0
    private let overlapSeconds = 1.5
    private let sampleRate = 16000
    private var buffer: [Float] = []
    private var lastEmittedEnd = 0
    private var transcript = ""

    init(asr: QuranASR) {
        self.asr = asr
    }

    func push(_ samples: [Float]) async throws -> String {
        buffer.append(contentsOf: samples)
        let chunkSamples = Int(chunkSeconds * Double(sampleRate))
        let overlapSamples = Int(overlapSeconds * Double(sampleRate))

        while buffer.count - lastEmittedEnd >= chunkSamples {
            let start = max(0, lastEmittedEnd - overlapSamples)
            let end = start + chunkSamples
            let chunk = Array(buffer[start..<end])
            let text = try await asr.transcribe(chunk)
            transcript = mergeWithOverlap(existing: transcript, new: text)
            lastEmittedEnd = end - overlapSamples / 2
        }
        return transcript
    }

    private func mergeWithOverlap(existing: String, new: String) -> String {
        let exWords = existing.split(separator: " ")
        let newWords = new.split(separator: " ")
        // Nothing to stitch if either side is empty (avoids a stray leading or trailing space).
        if exWords.isEmpty { return new }
        if newWords.isEmpty { return existing }
        // Find the longest suffix of the existing transcript that equals a prefix of the new one.
        for k in stride(from: min(exWords.count, newWords.count), through: 1, by: -1) {
            if Array(exWords.suffix(k)) == Array(newWords.prefix(k)) {
                return (exWords + newWords.dropFirst(k)).joined(separator: " ")
            }
        }
        return existing + " " + new
    }
}

Each transcribe() call always lands on the predict_T800 entry point (8 s chunks fit the typical ayah).

Precision

fp16 throughout, no integer reduction. Apple's Neural Engine is natively fp16, so fp16 weights and activations are the accuracy-preserving choice on-device. No int8 / int4 used anywhere in this release.

Cross-checked against the canonical ONNX baseline at T=500: max abs diff logprobs 3.2 × 10⁻⁵, encoder_output 1.5 × 10⁻⁶. Effectively identical to upstream.

FP16 overflow fix

The vanilla FP16 export produced all-NaN logprobs on real audio. Root cause: RelPositionalEncoding multiplies the pre-encode output (peaks ~5 200) by xscale = √512 ≈ 22.6, producing values up to 117 924, about 1.8× the FP16 maximum of 65 504. In FP16 this becomes +inf, and inf × 0 (from the attention mask) cascades NaN through all 17 Conformer layers.

Fix: pos_enc.forward is patched to clamp(x, −2400, 2400) before the xscale multiply. 2 400 × 22.6 ≈ 54 240, safely inside FP16. The clamp is baked into the traced graph as a MIL clip op; it has no measurable effect on transcription accuracy for normal audio (mel features do not legitimately reach ±2 400 after per-utterance normalization).
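The overflow arithmetic can be checked directly. The sketch below is only a worked example: it does the multiplication in Float (fp32) so the values stay finite and can be compared against the FP16 limit. The peak value of 5 200 is the approximate figure quoted above, and the variable names are illustrative.

```swift
// Worked check of the overflow arithmetic, done in Float (fp32) so the
// numbers stay finite and can be compared against the FP16 limit.
let fp16Max: Float = 65_504                 // largest finite FP16 value
let xscale = Float(512).squareRoot()        // about 22.63
let peak: Float = 5_200                     // approximate pre-encode peak quoted in the text
let clampLimit: Float = 2_400

// Without the clamp the product exceeds the FP16 range.
let unclamped = peak * xscale

// With the clamp applied before the multiply, the product stays in range.
let clamped = min(max(peak, -clampLimit), clampLimit) * xscale

print(unclamped > fp16Max)   // true: this would overflow to +inf in FP16
print(clamped < fp16Max)     // true: representable in FP16
```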

Feature extraction

The model expects 80-channel log-mel features computed identically to NVIDIA's NeMo FilterbankFeatures default:

  • 16 kHz sample rate
  • 25 ms window (400 samples) with Hann
  • 10 ms hop (160 samples)
  • 512-point FFT
  • 80 mel bins, mel_floor = 1e-5
  • Per-utterance mean and variance normalization per channel

A pure-Swift implementation will be ~200 lines using Accelerate for the FFT. The exact Python reference is in tajweed/aligner.py on the main repo.
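As an illustration of the last step in the list, here is a minimal sketch of per-utterance, per-channel normalization. `normalizePerChannel` is a hypothetical helper, not code from the repository. Two details are assumptions that should be checked against the Python reference: the standard deviation uses an N−1 denominator, and a small epsilon is added to it to avoid division by zero. The sketch also normalizes over every frame it is given, so whether padded frames should be excluded is left to the reference.

```swift
// Hypothetical helper: per-utterance mean/variance normalization per mel channel.
// `features[c][t]` is the log-mel value for channel c at frame t.
// Assumptions (verify against the Python reference): N-1 denominator for the
// standard deviation, and a small epsilon added to it.
func normalizePerChannel(_ features: [[Float]], epsilon: Float = 1e-5) -> [[Float]] {
    return features.map { (channel: [Float]) -> [Float] in
        // A single frame has no variance; return zeros instead of dividing by zero.
        guard channel.count > 1 else {
            return [Float](repeating: 0, count: channel.count)
        }
        let n = Float(channel.count)
        let mean = channel.reduce(0, +) / n
        let sumSq = channel.reduce(Float(0)) { $0 + ($1 - mean) * ($1 - mean) }
        let std = (sumSq / (n - 1)).squareRoot() + epsilon
        return channel.map { ($0 - mean) / std }
    }
}
```

A production version would more likely operate on a flat buffer with Accelerate (vDSP) rather than nested arrays; the nested form is used here only for readability.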

Tokenizer

tokenizer.model is a SentencePiece BPE model with 1,024 pieces plus 1 blank (id 1024). For iOS, use a SentencePiece Swift port or implement BPE decoding manually (~50 lines).
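For the manual route, the final text assembly is small once the piece table is available. The sketch below assumes the piece strings have already been loaded from tokenizer.model into an array indexed by token ID (loading is not shown) and that the model uses SentencePiece's usual "▁" (U+2581) word-boundary marker. `decodePieces` is a hypothetical name, and byte-fallback pieces are not handled.

```swift
// Hypothetical helper: turns token IDs into text given the piece table.
// `pieces[id]` is assumed to hold the piece string for that ID, loaded
// elsewhere from tokenizer.model.
func decodePieces(_ ids: [Int], pieces: [String]) -> String {
    // Concatenate the pieces, skipping any ID outside the table (e.g. blank).
    let joined = ids
        .filter { $0 >= 0 && $0 < pieces.count }
        .map { pieces[$0] }
        .joined()

    // Replace the word-boundary marker with a space, working on scalars so a
    // combining mark right after the marker cannot hide it.
    var scalars = String.UnicodeScalarView()
    for s in joined.unicodeScalars {
        scalars.append(s == "\u{2581}" ? " " : s)
    }

    // The first piece usually starts with the marker; drop the leading space.
    return String(String(scalars).drop(while: { $0 == " " }))
}
```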

Decoding pipeline:

  1. Argmax over logprobs per frame to get a sequence of token IDs
  2. Trim to ceil(actual_audio_samples / 160 / 8) output frames (the rest are padding)
  3. CTC collapse: merge consecutive identical IDs, then remove blanks (id 1024)
  4. SentencePiece decode to final Arabic text
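Steps 1 to 3 can be sketched as a single greedy pass. `greedyCTCDecode` is a hypothetical helper and is not the `ctcCollapse` function referenced in the Swift examples; it assumes the logprobs have already been copied out of the MLMultiArray into a `[frame][vocab]` array. Repeats are merged before blanks are dropped, so a blank between two identical IDs keeps both.

```swift
// Hypothetical helper mirroring steps 1-3 of the decoding pipeline.
// `logprobs[t][v]` is the log-probability of token v at output frame t.
func greedyCTCDecode(_ logprobs: [[Float]], validFrames: Int, blankId: Int = 1024) -> [Int] {
    var tokens: [Int] = []
    var previous = -1   // sentinel: no frame seen yet

    // Step 2: only the first `validFrames` frames carry real audio.
    for frame in logprobs.prefix(max(0, validFrames)) {
        guard !frame.isEmpty else { continue }

        // Step 1: argmax over the vocabulary for this frame.
        var best = 0
        for v in 1..<frame.count where frame[v] > frame[best] {
            best = v
        }

        // Step 3: emit only when the ID changes and is not the blank.
        if best != previous && best != blankId {
            tokens.append(best)
        }
        previous = best
    }
    return tokens
}
```

The resulting IDs are what step 4 passes to the SentencePiece decoder.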

License

Apache 2.0. Same license as the upstream Muno459/fastconformer-quran and NVIDIA FastConformer-Hybrid.

Citation

@misc{fastconformer-quran-coreml-offline-2026,
  title  = {FastConformer-Quran CoreML (Offline): on-device Quranic ASR for iOS},
  author = {Anon},
  year   = {2026},
  url    = {https://huggingface.co/Muno459/fastconformer-quran-coreml-offline},
}