fahadqazi/Sindhi-Misspelled-Sentences
A fine-tuned version of google/byt5-small for automatic spelling correction in Sindhi (Arabic script). The model takes a misspelled Sindhi sentence as input and outputs the corrected sentence.
| Field | Details |
|---|---|
| Base model | google/byt5-small |
| Model type | Encoder-Decoder (Seq2Seq) |
| Language | Sindhi (sd), Arabic script |
| Task | Spelling correction (text2text-generation) |
| Parameters | 299,637,760 |
| License | Apache 2.0 |
ByT5 operates directly on raw UTF-8 bytes and has no learned subword vocabulary, so any Sindhi character sequence, including a misspelled one, can be encoded without out-of-vocabulary tokens.
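To make the byte-level view concrete, the sketch below counts the UTF-8 bytes of a Sindhi word and checks it against the 128-token input limit listed in the training configuration. The helper names `utf8_length` and `fits_without_truncation` are hypothetical, and reserving one position for an end-of-sequence token is an assumption about how the tokenizer fills the budget.

```python
MAX_INPUT_TOKENS = 128  # max input length from the training configuration


def utf8_length(text: str) -> int:
    """Number of UTF-8 bytes in `text` (one byte-level token per byte)."""
    return len(text.encode("utf-8"))


def fits_without_truncation(text: str, reserve_eos: bool = True) -> bool:
    """True if `text` fits in the input budget.

    Assumption: one position is reserved for an end-of-sequence token.
    """
    budget = MAX_INPUT_TOKENS - (1 if reserve_eos else 0)
    return utf8_length(text) <= budget


# The word "Sindhi" in Sindhi script, written with escapes to keep the
# code points explicit: U+0633 U+0646 U+068C U+064A.
word = "\u0633\u0646\u068c\u064a"

print(len(word))          # characters: 4
print(utf8_length(word))  # bytes: 8, since each of these letters takes 2 bytes
print(fits_without_truncation(word))
```

Letters in the Arabic Unicode block take two bytes each in UTF-8, so a 128-byte limit corresponds to roughly 64 Sindhi letters, somewhat more once single-byte spaces are counted.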

The model was trained on incorrect sentence → correct sentence pairs with the following hyperparameters:

| Parameter | Value |
|---|---|
| Max input length | 128 bytes |
| Max target length | 128 bytes |
| Batch size (effective) | 32 (2 × 16 grad accum) |
| Epochs | 5 |
| Optimizer | Adafactor |
| Warmup steps | 500 |
| Gradient clipping | 1.0 |
| Precision | bf16 |
| Gradient checkpointing | Yes |
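
The training script itself is not shown here, so the following is only a sketch of how the table above could map onto keyword arguments of `Seq2SeqTrainingArguments` in the `transformers` library. Reading "2 × 16" as a per-device batch size of 2 with 16 gradient accumulation steps on a single device is an assumption, as is the choice of keyword names; `effective_batch_size` is a hypothetical helper.

```python
# Hyperparameters from the table, keyed by the Seq2SeqTrainingArguments
# keyword names they most plausibly correspond to (an assumption).
training_kwargs = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 16,
    "num_train_epochs": 5,
    "optim": "adafactor",
    "warmup_steps": 500,
    "max_grad_norm": 1.0,          # gradient clipping
    "bf16": True,                  # requires hardware with bf16 support
    "gradient_checkpointing": True,
}


def effective_batch_size(kwargs: dict, num_devices: int = 1) -> int:
    """Examples contributing to each optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(training_kwargs))  # 2 * 16 * 1 = 32
```

The dictionary could be unpacked as `Seq2SeqTrainingArguments(output_dir=..., **training_kwargs)`, where the output directory is left for the user to choose.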
Evaluated on a held-out test set of 3,000 Sindhi sentence pairs with Unicode NFC normalization applied before scoring.
| Metric | Score |
|---|---|
| Character Error Rate (CER) | 0.0447 |
| Exact Match | 0.2897 |

Scores after each training epoch, followed by the final test-set scores:

| Epoch | CER | Exact Match |
|---|---|---|
| 1 | 0.0601 | 0.1797 |
| 2 | 0.0518 | 0.2383 |
| 3 | 0.0479 | 0.2600 |
| 4 | 0.0463 | 0.2747 |
| 5 | 0.0456 | 0.2807 |
| Test | 0.0447 | 0.2897 |
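
For readers who want to compute comparable numbers, here is a minimal sketch of CER and exact match with NFC normalization applied before scoring. It is not the author's evaluation script: aggregating CER over the whole corpus (total edits divided by total reference characters) rather than averaging per sentence is an assumption, and the function names are hypothetical.

```python
import unicodedata


def nfc(text: str) -> str:
    """Apply Unicode NFC normalization."""
    return unicodedata.normalize("NFC", text)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(predictions: list[str], references: list[str]) -> float:
    """Corpus-level character error rate after NFC normalization."""
    edits = sum(edit_distance(nfc(p), nfc(r))
                for p, r in zip(predictions, references))
    chars = sum(len(nfc(r)) for r in references)
    return edits / chars


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions identical to their reference after NFC."""
    hits = sum(nfc(p) == nfc(r) for p, r in zip(predictions, references))
    return hits / len(references)


preds = ["abc", "kitten"]
refs = ["abc", "sitting"]
print(cer(preds, refs))          # 3 edits over 10 reference characters
print(exact_match(preds, refs))  # 1 of 2 sentences matches exactly
```

A CER near 0.045 alongside an exact match near 0.29 is consistent with outputs that are mostly right at the character level while still differing from the reference somewhere in most sentences.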

Single-sentence inference:

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # Placeholder repository ID; replace with the actual model repository.
    model_name = "your-username/byt5-sindhi-spell-correction"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


    def correct_sindhi(text: str) -> str:
        """Return the corrected version of a single Sindhi sentence."""
        # Inputs longer than 128 byte-level tokens are truncated.
        inputs = tokenizer(
            text,
            return_tensors="pt",
            max_length=128,
            truncation=True,
        )
        outputs = model.generate(
            **inputs,
            max_length=128,
            num_beams=4,  # beam search for better quality
            early_stopping=True,
        )
        return tokenizer.decode(outputs[0], skip_special_tokens=True)


    # Example
    misspelled = "اسن يور ۽ اتر امريع ۾ بعد ۾ سروص سروس سيںٽر قائم ڪا آحں"
    corrected = correct_sindhi(misspelled)
    print("Input    :", misspelled)
    print("Corrected:", corrected)

Batch inference:

    import torch


    def correct_batch(texts: list[str], batch_size: int = 8) -> list[str]:
        """Correct a list of sentences in mini-batches.

        Uses the tokenizer and model loaded for single-sentence inference.
        """
        results = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i : i + batch_size]
            inputs = tokenizer(
                batch,
                return_tensors="pt",
                max_length=128,
                truncation=True,
                padding=True,  # pad to the longest sentence in the batch
            )
            with torch.no_grad():
                outputs = model.generate(
                    **inputs,
                    max_length=128,
                    num_beams=4,
                )
            decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
            results.extend(decoded)
        return results
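
Because inputs are truncated at 128 tokens, text beyond that point is silently dropped. One possible workaround, sketched below, is to split long text at whitespace into pieces that fit the byte budget before correcting them. `split_by_bytes` is a hypothetical helper and not part of this model or of `transformers`; the default of 127 bytes assumes one position is taken by an end-of-sequence token. The model was trained on sentence pairs, so quality on arbitrary fragments may differ.

```python
def split_by_bytes(text: str, max_bytes: int = 127) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most
    `max_bytes` UTF-8 bytes. Runs of whitespace collapse to one space."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single word longer than the budget is kept whole and
            # would still be truncated by the tokenizer.
            current = word
    if current:
        chunks.append(current)
    return chunks


print(split_by_bytes("aa bb cc dd", max_bytes=5))  # ['aa bb', 'cc dd']
```

The resulting chunks can be passed to `correct_batch` and the outputs joined with spaces.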
If you use this model in your research, please cite:

    @misc{byt5-sindhi-spell-correction,
      title     = {ByT5-small Fine-tuned for Sindhi Spelling Correction},
      author    = {Danish},
      year      = {2025},
      publisher = {HuggingFace},
      url       = {https://huggingface.co/your-username/byt5-sindhi-spell-correction}
    }