# GaroOCR
An OCR model for the Garo (`grt_Latn`) language, fine-tuned from `microsoft/Florence-2-base-ft` on images of Garo text.
Developed by MWire Labs, Shillong, Meghalaya; part of an ongoing effort to build foundational AI for Northeast Indian languages.
## Model Details
| Property | Value |
|----------|-------|
| Base model | `microsoft/Florence-2-base-ft` |
| Parameters | 231M |
| Language | Garo (Achik) |
| Task | OCR (image → text) |
| Training samples | 80,000 |
| Epochs | 5 |
| Character accuracy | 93.13% |
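The character-accuracy figure above can be reproduced under the common assumption that it is computed as 1 − CER, where CER is the Levenshtein edit distance between prediction and reference divided by the reference length (the card does not state the exact metric, so treat this as a sketch, not the evaluation script):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert/delete/substitute, cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def character_accuracy(references, predictions):
    """1 - CER, micro-averaged over the whole evaluation set."""
    errors = sum(levenshtein(r, p) for r, p in zip(references, predictions))
    total = sum(len(r) for r in references)
    return 1.0 - errors / total
```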
## Training Setup
- Hardware: NVIDIA A40 (48GB)
- Precision: bfloat16
- Batch size: 4 (effective 16 with gradient accumulation)
- Learning rate: 3e-4 with cosine scheduler
- Max label length: 128 tokens
- Task prompt: `<OCR>` (Florence-2 uppercase task token)
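For reference, the effective batch size and the cosine schedule above can be sketched in plain Python. The original run's warmup behaviour is not stated, so this assumes a plain cosine decay from 3e-4 to 0:

```python
import math

BASE_LR = 3e-4
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 4  # 4 x 4 gives the effective batch of 16
EFFECTIVE_BATCH = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS

def cosine_lr(step: int, total_steps: int, base_lr: float = BASE_LR) -> float:
    """Cosine decay from base_lr at step 0 down to 0 at total_steps."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))
```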
## Usage
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

# Load the processor and model (trust_remote_code is required for Florence-2)
processor = AutoProcessor.from_pretrained("MWirelabs/garo-ocr", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "MWirelabs/garo-ocr",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda()

# Prepare the image and the <OCR> task prompt
image = Image.open("your_image.png").convert("RGB")
inputs = processor(text="<OCR>", images=image, return_tensors="pt")
inputs = {k: v.cuda() for k, v in inputs.items()}
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

# Generate and decode the recognized text
with torch.no_grad():
    generated = model.generate(
        pixel_values=inputs["pixel_values"],
        input_ids=inputs["input_ids"],
        max_new_tokens=128,
    )
text = processor.tokenizer.decode(generated[0], skip_special_tokens=True)
print(text)
```
Note: use `transformers==4.38.2` for compatibility (`pip install transformers==4.38.2`).
## Limitations
- Max reliable output length is ~128 tokens
- Part of MWire Labs' mono-language series; a multilingual NE-OCR model covering more Northeast Indian languages is in development
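Because output is capped at roughly 128 tokens, long pages may need to be split into horizontal strips and OCR'd strip by strip, then the results concatenated. A minimal sketch of the crop-box computation (the 400 px strip height and 40 px overlap are arbitrary assumptions; pass each box to `Image.crop` before running the model):

```python
def strip_boxes(width: int, height: int, strip_height: int = 400, overlap: int = 40):
    """Return (left, upper, right, lower) crop boxes covering a page top to bottom.

    Consecutive strips overlap slightly so a text line cut at a strip
    boundary appears whole in at least one strip.
    """
    boxes = []
    top = 0
    while top < height:
        bottom = min(top + strip_height, height)
        boxes.append((0, top, width, bottom))
        if bottom == height:
            break
        top = bottom - overlap
    return boxes
```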