Persian-Phi: A Cross-Lingual Adapted Small Language Model
Persian-Phi is a 3.8B-parameter language model adapted from Microsoft's Phi-3 Mini to support Persian. Unlike standard multilingual models, Persian-Phi demonstrates how a high-capability, predominantly English model can be transferred effectively to a low-resource language through a resource-efficient curriculum-learning pipeline.
It features an extended tokenizer, embedding alignment via a "warm-up" stage, and continual pre-training on filtered Persian corpora, achieving competitive performance on the Open Persian LLM Leaderboard.
- 📄 Preprint: Link to Arxiv/Preprint
- 💻 Google Colab Demo: Open In Colab
Model Details
Model Description
Persian-Phi was developed to address the scarcity of high-quality LLMs for the Persian language without requiring the massive computational resources typically needed to train multilingual models from scratch.
The model follows a three-stage training pipeline:
- Tokenizer Extension: The base LLaMA-2 tokenizer was extended with 5,000 new Persian-specific tokens (a minimal sketch follows this list).
- Warm-up: A "warm-up" phase on a Persian translation of the TinyStories dataset to align the new embeddings and mitigate catastrophic forgetting.
- Continual Pre-training (CPT) & SFT: Training on the Targoman Large Persian Corpus (TLPC) and Wikipedia, followed by instruction tuning on bilingual datasets.
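The extension and warm-up stages hinge on resizing the base model's vocabulary before any Persian training. Below is a minimal sketch of the tokenizer-extension step, assuming the Hugging Face transformers API; the Persian tokens listed are purely illustrative stand-ins, and the released checkpoint already ships with the extended tokenizer.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# In the actual pipeline, 5,000 Persian-specific tokens were added;
# the three below are hypothetical examples.
new_persian_tokens = ["سلام", "کتاب", "دانشگاه"]
tokenizer.add_tokens(new_persian_tokens)

# Grow the embedding matrix (and tied LM head) to cover the new vocabulary.
# The freshly initialized rows are then aligned during the warm-up stage.
model.resize_token_embeddings(len(tokenizer))
```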
- Model type: Causal Language Model (Phi-3 architecture)
- Language(s) (NLP): Persian (Farsi), English
- License: Apache-2.0
- Finetuned from model: microsoft/Phi-3-mini-4k-instruct
Model Sources
- Paper: Link to Arxiv/Preprint
- Demo: Google Colab
Uses
Direct Use
The model is designed for:
- Persian Text Generation: Creative writing, summarization, and content generation.
- Question Answering: Answering queries in Persian across various domains.
- Cross-Lingual Tasks: Translating simple concepts between English and Persian.
Bias, Risks, and Limitations
- Hallucination: The model may generate plausible-sounding but factually incorrect information.
How to Get Started with the Model
You can use the model directly with the Hugging Face transformers library:
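A minimal generation example (the prompt and sampling settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amirakhlaghiqqq/PersianPhi"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the training precision
    device_map="auto",
)

# Phi-3 is instruction-tuned, so format the prompt with the chat template.
messages = [{"role": "user", "content": "پایتخت ایران کجاست؟"}]  # "What is the capital of Iran?"
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```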
Training Details
Training Data
The model was trained on a mix of filtered and curated datasets:
- Pre-training (CPT):
- Targoman Large Persian Corpus (TLPC): A subset of 12.4M documents, rigorously filtered for quality and deduplicated with MinHash (see the sketch after this list).
- Persian Wikipedia: High-quality encyclopedic text used during continual pre-training.
- Translated TinyStories: A Persian translation of the TinyStories dataset, used during the warm-up stage to align the new token embeddings.
- Supervised Fine-Tuning (SFT):
- Bactrian-X: ~63k Persian instruction pairs.
- Aya Dataset: 50k English samples to retain English capabilities.
- TED2020: 30k bilingual pairs for translation alignment.
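The MinHash deduplication mentioned above can be pictured with the datasketch library. This is a hedged sketch rather than the paper's implementation; the shingle size and similarity threshold are illustrative.

```python
from datasketch import MinHash, MinHashLSH

def doc_minhash(text: str, num_perm: int = 128, shingle: int = 5) -> MinHash:
    """Hash character shingles of a document into a MinHash signature."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - shingle + 1, 1)):
        m.update(text[i:i + shingle].encode("utf-8"))
    return m

corpus = ["متن نمونه اول", "متن نمونه دوم", "متن نمونه اول"]  # toy documents
lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard threshold for "near-duplicate"

kept = []
for doc_id, text in enumerate(corpus):
    sig = doc_minhash(text)
    if not lsh.query(sig):          # no sufficiently similar document seen yet
        lsh.insert(str(doc_id), sig)
        kept.append(doc_id)
# kept == [0, 1]: the exact repeat of document 0 is filtered out
```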
Training Procedure
The training used Parameter-Efficient Fine-Tuning (PEFT) with LoRA to keep computational costs low; a configuration sketch follows the hyperparameter list below.
Preprocessing
- Tokenizer: Extended LLaMA-2 tokenizer with 5,000 new Persian tokens.
- Filtering: Applied heuristics (word length, symbol ratio, stop words) and safety filtering (profanity checks).
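A hedged sketch of what such heuristic filters could look like; the thresholds and word lists below are illustrative, not the values used in the paper.

```python
# Illustrative Persian stop words and a placeholder profanity list (hypothetical).
STOP_WORDS = {"و", "در", "به", "از", "که", "این", "را"}
PROFANITY: set[str] = set()  # would hold a curated Persian profanity list

def keep_document(text: str) -> bool:
    words = text.split()
    if not words:
        return False
    # Word-length heuristic: average word length within a plausible range.
    avg_len = sum(len(w) for w in words) / len(words)
    if not 2 <= avg_len <= 12:
        return False
    # Symbol-ratio heuristic: reject documents dominated by non-alphanumeric symbols.
    symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    if symbols / len(text) > 0.2:
        return False
    # Stop-word heuristic: natural running text contains common function words.
    if sum(1 for w in words if w in STOP_WORDS) < 2:
        return False
    # Safety filter: drop documents containing profanity.
    if any(w in PROFANITY for w in words):
        return False
    return True
```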
Training Hyperparameters
- Technique: LoRA (Rank 64 for CPT, Rank 32 for SFT) on attention and feed-forward layers.
- Full Fine-tuning: Applied to Embeddings and LM Head.
- Optimizer: 8-bit AdamW.
- Precision: Mixed-precision (bfloat16).
- Scheduler: Cosine with warm-up.
- Batch Size: Effective batch size of 64.
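Putting these pieces together, here is a hedged sketch of the CPT-stage configuration, assuming the Hugging Face peft and transformers APIs. Module names follow the Phi-3 architecture; the learning rate, LoRA alpha, and warm-up ratio are illustrative, as the card does not report them.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

lora_config = LoraConfig(
    r=64,                                   # rank 64 for CPT (32 for SFT)
    lora_alpha=128,                         # illustrative
    target_modules=["qkv_proj", "o_proj",   # attention layers
                    "gate_up_proj", "down_proj"],  # feed-forward layers
    modules_to_save=["embed_tokens", "lm_head"],   # fully fine-tuned, per the card
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

args = TrainingArguments(
    output_dir="persian-phi-cpt",
    bf16=True,                              # mixed precision
    optim="adamw_bnb_8bit",                 # 8-bit AdamW
    lr_scheduler_type="cosine",             # cosine schedule with warm-up
    warmup_ratio=0.03,                      # illustrative
    learning_rate=2e-4,                     # illustrative
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,          # 4 x 8 x 2 GPUs = effective batch of 64
)
```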
Speeds, Sizes, Times
- Hardware: 2x NVIDIA RTX 3090 (24GB).
- Training Duration: Approximately 12 days.
Evaluation
Results
The model was evaluated on the Open Persian LLM Leaderboard. Despite being significantly smaller than many competitors (3.8B parameters), Persian-Phi achieves competitive results, trailing only the 8B state-of-the-art model.
| Benchmark | Score | Notes |
|---|---|---|
| Part MC | 30.56 | Outperforms Maral-7B and PersianMind |
| ARC Easy | 64.65 | Competitive with Llama-2 based models |
| ARC Challenge | 51.00 | Strong reasoning capability |
| MMLU Pro | 17.18 | Limited by context window (2k) |
| AUT MC | 43.98 | Consistent performance |
Refer to Table 2 in the technical report for full comparisons.
Acknowledgments
We thank the Part AI Research Center for providing the GPU resources that made this research possible.