---
library_name: transformers
license: cc-by-nc-4.0
tags:
- nllb
- uzs
- Southern Uzbek
- Afghani Uzbek
language:
- en
- uz
- uzs
base_model: facebook/nllb-200-distilled-600M
pipeline_tag: translation
datasets:
- tahrirchi/lutfiy
---

# Lutfiy: Southern Uzbek Machine Translation Model

This repository contains an initial machine translation model for the Southern Uzbek language, developed as part of the research paper "Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek".

## Model details

| Model | Tokenizer Length | Parameter Count |
|-------|------------|-------------------|
| [`lutfiy`](https://huggingface.co/tahrirchi/lutfiy) | 256,204 | 615M |

**Common attributes:**

- **Base Model:** [nllb-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
- **Languages:** Southern Uzbek, Northern Uzbek, English

## Intended uses & limitations

These models are designed for machine translation tasks involving the Southern Uzbek language. They can be used to translate between Southern Uzbek, Northern Uzbek, and English.

### How to use

You can use these models with the Transformers library. Here's a quick example:

#### Install the `lutfiy` library for fixing ZWNJ (zero-width non-joiner) characters

```bash
pip install lutfiy
```

```python
from lutfiy import fix_zwnj
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model from the Hugging Face Hub
model_ckpt = "tahrirchi/lutfiy"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

# Example translation: Northern Uzbek (Latin script) to Southern Uzbek (Arabic script)
input_text = "O'zbekiston kelajagi buyuk davlatdir."
tokenizer.src_lang = "uzn_Latn"
tokenizer.tgt_lang = "uzs_Arab"

# Tokenize the input, generate the translation, and decode it back to text
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Post-process ZWNJ characters in the output before printing
print(fix_zwnj(translated_text))
# اۉزبېکستان کېلهجگی بویوک دولت دیر. 
```

## Training data

The models were trained on a parallel corpus of approximately 40,000 sentence pairs, including:

- Northern Uzbek - Southern Uzbek (37,415 pairs)
- English - Southern Uzbek (2,579 pairs)

The dataset is available [here](https://huggingface.co/datasets/tahrirchi/dilmash).

## Training procedure

For full details of the training procedure, please refer to [our paper](https://arxiv.org/abs/2508.14586).

## Citation

If you use these models in your research, please cite our paper:

```bibtex
@misc{mamasaidov2025fillinggapuzbekcreating,
      title={Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek},
      author={Mukhammadsaid Mamasaidov and Azizullah Aral and Abror Shopulatov and Mironshoh Inomjonov},
      year={2025},
      eprint={2508.14586},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14586},
}
```

## Contacts

We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Southern Uzbek.

For questions about further development or issues with the dataset, please contact m.mamasaidov@tahrirchi.uz or a.shopulatov@tahrirchi.uz.
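
A note on the ZWNJ post-processing mentioned in the usage instructions: ZWNJ is the Unicode character U+200C (ZERO WIDTH NON-JOINER), an invisible format character that Arabic-script orthographies use to break cursive joining inside a word. Because it is invisible, it can help to inspect translated strings explicitly. The sketch below is not the implementation of `fix_zwnj`, whose internals are not described in this card; it only shows, using the Python standard library, how one might check whether a string contains ZWNJ. The helper names `count_zwnj` and `show_invisible` are hypothetical, and the position of the ZWNJ in the sample word is assumed purely for illustration.

```python
import unicodedata

ZWNJ = "\u200c"  # U+200C ZERO WIDTH NON-JOINER


def count_zwnj(text: str) -> int:
    """Return how many ZWNJ characters the text contains."""
    return text.count(ZWNJ)


def show_invisible(text: str) -> str:
    """Replace Unicode format characters (category Cf) with a visible <U+XXXX> marker."""
    return "".join(
        f"<U+{ord(ch):04X}>" if unicodedata.category(ch) == "Cf" else ch
        for ch in text
    )


# Illustrative sample: the ZWNJ position here is an assumption, not model output.
sample = "کېله" + ZWNJ + "جگی"

print(unicodedata.name(ZWNJ))   # ZERO WIDTH NON-JOINER
print(count_zwnj(sample))       # 1
print(show_invisible(sample))   # the ZWNJ appears as <U+200C>
```

Running `show_invisible` on the raw model output and on the result of `fix_zwnj` is one way to see what the post-processing step changed.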