ByT5-small — Sindhi Spelling Correction

A fine-tuned version of google/byt5-small for automatic spelling correction in Sindhi (Arabic script). The model takes a misspelled Sindhi sentence as input and outputs the corrected version.

Model Details

Field	Details
Base model	google/byt5-small
Model type	Encoder-Decoder (Seq2Seq)
Language	Sindhi (sd) — Arabic script
Task	Spelling correction (text2text-generation)
Parameters	299,637,760
License	Apache 2.0

Why ByT5?

ByT5 operates directly on raw UTF-8 bytes instead of a learned subword vocabulary. This suits Sindhi because:

No out-of-vocabulary issues with rare characters or diacritics
Naturally handles Arabic-script diacritics (zabar, zer, pesh, shadda)
Character-level corrections are learned directly from bytes
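To make the byte-level view concrete, the sketch below reproduces ByT5-style ids without loading the tokenizer. It assumes the id layout used by the ByT5 tokenizer in Transformers (ids 0, 1, 2 reserved for pad, EOS, and unk, so byte b becomes id b + 3); the helper name `byt5_style_ids` is made up for this illustration.

```python
# Sketch of ByT5-style byte tokenization, written without loading the tokenizer.
# Assumption: byte value b maps to id b + 3, because ids 0, 1, 2 are reserved
# for pad, EOS, and unk in the ByT5 tokenizer.

BYTE_OFFSET = 3  # number of reserved special ids before the byte range
EOS_ID = 1       # end-of-sequence id appended to every input


def byt5_style_ids(text: str) -> list[int]:
    """Return ByT5-style input ids for `text`, including the trailing EOS."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


word = "سنڌي"  # the word "Sindhi" in Arabic script
ids = byt5_style_ids(word)

# Each Arabic-script letter occupies 2 bytes in UTF-8, so the id sequence is
# about twice as long as the character count, plus one EOS id.
print(len(word), "characters ->", len(word.encode("utf-8")), "bytes ->", len(ids), "ids")
```

Because every possible byte already has an id, a rare letter or an unexpected diacritic can never fall outside the vocabulary; the cost is that sequences are longer than with a subword tokenizer.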

Training Details

Dataset

Source: fahadqazi/Sindhi-Misspelled-Sentences
Training rows: 60,000 pairs
Validation rows: 3,000 pairs
Test rows: 3,000 pairs
Format: incorrect sentence → correct sentence pairs

Hyperparameters

Parameter	Value
Max input length	128 bytes
Max target length	128 bytes
Batch size (effective)	32 (per-device batch 2 × 16 gradient-accumulation steps)
Epochs	5
Optimizer	Adafactor
Warmup steps	500
Gradient clipping	1.0
Precision	bf16
Gradient checkpointing	Yes
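The table above implies a fixed number of optimizer steps, which can be worked out directly. The sketch below is only arithmetic over the stated values. The key names mirror `Seq2SeqTrainingArguments` parameters as an assumption, since the card does not show the training script, and the learning rate is left out because it is not stated. It also assumes a single GPU, as listed under Training Environment.

```python
import math

# Values copied from the hyperparameter table. Key names follow
# Seq2SeqTrainingArguments as an assumption; the actual script is not shown.
config = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 16,
    "num_train_epochs": 5,
    "optim": "adafactor",
    "warmup_steps": 500,
    "max_grad_norm": 1.0,
    "bf16": True,
    "gradient_checkpointing": True,
}
train_rows = 60_000  # training pairs, from the Dataset section

# One optimizer step consumes batch_size * accumulation_steps examples.
effective_batch = (
    config["per_device_train_batch_size"] * config["gradient_accumulation_steps"]
)
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * config["num_train_epochs"]
warmup_fraction = config["warmup_steps"] / total_steps

print(effective_batch, steps_per_epoch, total_steps, round(warmup_fraction, 3))
# prints: 32 1875 9375 0.053
```

Under these assumptions the 500 warmup steps cover roughly the first 5% of training.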

Training Environment

GPU: NVIDIA RTX (8GB VRAM)
Framework: PyTorch + HuggingFace Transformers
Training time: ~3 hours

Evaluation Results

Evaluated on a held-out test set of 3,000 Sindhi sentence pairs with Unicode NFC normalization applied before scoring.

Metric	Score
Character Error Rate (CER)	0.0447
Exact Match	0.2897

What these numbers mean

CER 0.0447: about 4.5 character edits (insertions, deletions, or substitutions) per 100 reference characters are needed to turn the model output into the reference, i.e. roughly 95.5% character-level accuracy.
Exact Match 0.29: 29% of test sentences are corrected perfectly (every character matches). This metric is strict: a single wrong diacritic in a long sentence counts as a failure.
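The two metrics can be sketched in a few lines of plain Python. This is a minimal illustration, not the scoring script used for the numbers above, which the card does not include. It assumes a corpus-level CER (total edit distance divided by total reference characters); a per-sentence average would give slightly different values. The function names are hypothetical.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


def score(predictions: list[str], references: list[str]) -> tuple[float, float]:
    """Return (corpus-level CER, exact-match rate) after NFC normalization."""
    preds = [unicodedata.normalize("NFC", p) for p in predictions]
    refs = [unicodedata.normalize("NFC", r) for r in references]
    total_edits = sum(edit_distance(p, r) for p, r in zip(preds, refs))
    total_chars = sum(len(r) for r in refs)
    exact = sum(p == r for p, r in zip(preds, refs))
    return total_edits / total_chars, exact / len(refs)


# Toy example: one substitution across 7 reference characters, one of two
# sentences reproduced exactly.
cer, em = score(["abcd", "xyz"], ["abcf", "xyz"])
print(f"CER={cer:.4f}  EM={em:.2f}")
```

NFC normalization matters for Arabic-script text because the same letter-plus-diacritic combination can be stored as different code point sequences; without it, visually identical strings could count as errors.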

Per-epoch training curve

Epoch	CER	Exact Match
1	0.0601	0.1797
2	0.0518	0.2383
3	0.0479	0.2600
4	0.0463	0.2747
5	0.0456	0.2807
Test	0.0447	0.2897

How to Use

Basic inference

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

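# Placeholder repo id from the original card; the Hub page hosting this card
# lists the model as DanishMahdi/snd_spell_corrector.
model_name = "your-username/byt5-sindhi-spell-correction"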

tokenizer = AutoTokenizer.from_pretrained(model_name)
model     = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def correct_sindhi(text: str) -> str:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
    )
    outputs = model.generate(
        **inputs,
        max_length=128,
        num_beams=4,          # beam search for better quality
        early_stopping=True,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# example
misspelled = "اسن يور ۽ اتر امريع ۾ بعد ۾ سروص سروس سيںٽر قائم ڪا آحں"
corrected  = correct_sindhi(misspelled)
print("Input    :", misspelled)
print("Corrected:", corrected)

Batch inference

import torch

# Reuses `tokenizer` and `model` loaded in the basic inference example.

def correct_batch(texts: list[str], batch_size: int = 8) -> list[str]:
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            max_length=128,
            truncation=True,
            padding=True,
        )
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=128,
                num_beams=4,
            )
        decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        results.extend(decoded)
    return results

Limitations

Max input length is 128 bytes. Sindhi letters in the Arabic script take 2 bytes each in UTF-8 and spaces take 1, so this is roughly 60–70 characters including spaces. Longer sentences will be truncated.
Exact match is 29%: the model fixes most characters but often does not reproduce the reference sentence exactly.
Trained on synthetic errors — real-world human typos may differ from the error patterns in the training data.
Diacritics are hard — Sindhi has many optional diacritical marks; the model may inconsistently add or remove them.
No semantic understanding: the model corrects spelling at the character/byte level and does not model sentence meaning.
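Since the length limit is counted in bytes rather than characters, it can help to check inputs before sending them to the model. The sketch below is one possible workaround, not something evaluated in this card: it greedily packs whitespace-separated words into chunks that fit the 128-byte limit, assuming the tokenizer spends one position on the EOS id. The helper names `fits` and `split_by_bytes` are hypothetical, and correcting chunks independently may cost some accuracy near chunk boundaries.

```python
MAX_BYTES = 128  # max input length from the hyperparameter table
EOS_BYTES = 1    # assumption: the tokenizer appends one EOS id per input


def fits(text: str, budget: int = MAX_BYTES - EOS_BYTES) -> bool:
    """True if `text` encodes to at most `budget` UTF-8 bytes."""
    return len(text.encode("utf-8")) <= budget


def split_by_bytes(text: str, budget: int = MAX_BYTES - EOS_BYTES) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most `budget` bytes.

    A single word longer than the budget is kept as its own chunk and
    would still be truncated by the tokenizer.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if fits(candidate, budget) or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


long_text = " ".join(["سنڌي"] * 40)  # 40 repeated words, well over 128 bytes
chunks = split_by_bytes(long_text)
print(len(long_text.encode("utf-8")), "bytes split into", len(chunks), "chunks")
```

The chunks could then be passed to `correct_batch` from the batch inference example and rejoined with spaces.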

Intended Use

Sindhi text preprocessing pipelines
OCR post-correction for Sindhi documents
Input correction for Sindhi keyboard/typing tools
Data cleaning for Sindhi NLP datasets

Out-of-Scope Use

Other languages (model is Sindhi-specific)
Grammatical error correction (this is spelling only)
Sentences longer than 128 UTF-8 bytes, roughly 60–70 characters (truncation applies)

Citation

If you use this model in your research, please cite:

@misc{byt5-sindhi-spell-correction,
  title     = {ByT5-small Fine-tuned for Sindhi Spelling Correction},
  author    = {Danish},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/your-username/byt5-sindhi-spell-correction}
}

Acknowledgements

Base model: google/byt5-small
Dataset: fahadqazi/Sindhi-Misspelled-Sentences
Training framework: HuggingFace Transformers
