Persian-Phi: A Cross-Lingual Adapted Small Language Model

Persian-Phi is a 3.8B-parameter language model adapted from Microsoft's Phi-3 Mini to support the Persian language. Unlike standard multilingual models, Persian-Phi demonstrates how a high-capability monolingual English model can be transferred effectively to a low-resource language using a resource-efficient curriculum-learning pipeline.

It features an extended tokenizer, embedding alignment via a "warm-up" stage, and continual pre-training on filtered Persian corpora, achieving competitive performance on the Open Persian LLM Leaderboard.

Model Details

Model Description

Persian-Phi was developed to address the scarcity of high-quality LLMs for the Persian language without requiring the massive computational resources typically needed to train multilingual models from scratch.

The model follows a three-stage training pipeline:

  1. Tokenizer Extension: The LLaMA-2 tokenizer was extended with 5,000 new Persian-specific tokens (a minimal sketch appears below).
  2. Warm-up: A "warm-up" phase using a translated Tiny Stories dataset to align new embeddings and mitigate catastrophic forgetting.
  3. Continual Pre-training (CPT) & SFT: Training on the Targoman Large Persian Corpus (TLPC) and Wikipedia, followed by instruction tuning on bilingual datasets.
  • Model type: Causal Language Model (Phi-3 architecture)
  • Language(s) (NLP): Persian (Farsi), English
  • License: Apache-2.0
  • Finetuned from model: microsoft/Phi-3-mini-4k-instruct
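
For illustration, here is a minimal sketch of stage 1 of the pipeline above (tokenizer extension and embedding resize) using transformers. The token list is a tiny placeholder for the 5,000 added tokens, and initialization of the new embedding rows is left to the library default, which may differ from the authors' exact procedure.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Placeholder for the 5,000 Persian-specific tokens added in the real pipeline.
new_persian_tokens = ["سلام", "کتاب", "دانشگاه"]
num_added = tokenizer.add_tokens(new_persian_tokens)

# Grow the input/output embedding matrices so the new token ids have rows.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; new vocab size = {len(tokenizer)}")
```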

Uses

Direct Use

The model is designed for:

  • Persian Text Generation: Creative writing, summarization, and content generation.
  • Question Answering: Answering queries in Persian across various domains.
  • Cross-Lingual Tasks: Translating simple concepts between English and Persian.

Bias, Risks, and Limitations

  • Hallucination: The model may generate plausible-sounding but factually incorrect information.

How to Get Started with the Model

You can use the model directly with the Hugging Face transformers library:
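
The snippet below is a minimal sketch: the Persian prompt and generation settings are illustrative, and it assumes the repository ships full merged weights with the standard Phi-3 chat template. If the checkpoint is published as a LoRA adapter instead, load it with the peft library on top of microsoft/Phi-3-mini-4k-instruct.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amirakhlaghiqqq/PersianPhi"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # weights are stored in bfloat16
    device_map="auto",
)

# Persian prompt: "Explain the history of the Persian language in simple terms."
messages = [{"role": "user", "content": "تاریخچه زبان فارسی را به زبان ساده توضیح بده."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```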

Training Details

Training Data

The model was trained on a mix of filtered and curated datasets:

  1. Pre-training (CPT):
    • Targoman Large Persian Corpus (TLPC): A subset of 12.4M documents, rigorously filtered for quality and deduplicated with MinHash (a minimal deduplication sketch follows this list).
    • Persian Wikipedia: Included alongside TLPC during continual pre-training.
    • Translated Tiny Stories: A Persian translation of the Tiny Stories dataset, used in the embedding warm-up stage.
  2. Supervised Fine-Tuning (SFT):
    • Bactrian-X: ~63k Persian instruction pairs.
    • Aya Dataset: 50k English samples to retain English capabilities.
    • TED2020: 30k bilingual pairs for translation alignment.
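
The following is a minimal near-duplicate-removal sketch in the spirit of the MinHash deduplication mentioned above, using the datasketch library; the shingling scheme, number of permutations, and Jaccard threshold are assumptions, not the reported pipeline.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from word-level shingles (kept simple for brevity)."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

corpus = [
    "این یک سند نمونه است.",         # "This is a sample document."
    "این یک سند نمونه است.",         # exact duplicate, should be dropped
    "متن کاملاً متفاوتی در اینجا.",   # "A completely different text here."
]

lsh = MinHashLSH(threshold=0.8, num_perm=128)   # Jaccard threshold is an assumption
kept = []
for i, doc in enumerate(corpus):
    sig = minhash_of(doc)
    if lsh.query(sig):          # near-duplicate of an already-kept document
        continue
    lsh.insert(f"doc-{i}", sig)
    kept.append(doc)

print(len(kept))  # -> 2
```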

Training Procedure

The training utilized Parameter-Efficient Fine-Tuning (PEFT) with LoRA to keep computational costs low.

Preprocessing

  • Tokenizer: Extended LLaMA-2 tokenizer with 5,000 new Persian tokens.
  • Filtering: Applied heuristics (word length, symbol ratio, stop words) and safety filtering (profanity checks).
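
An illustrative version of such heuristic filtering is sketched below; the thresholds, stop-word list, and symbol definition are hypothetical and not the values used to build the training corpus.

```python
import re

PERSIAN_STOP_WORDS = {"و", "در", "به", "از", "که"}   # tiny sample, not the full list

def keep_document(text: str) -> bool:
    """Return True if a document passes simple word-length, symbol-ratio, and stop-word checks."""
    words = text.split()
    if not words:
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    symbol_ratio = len(re.findall(r"[^\w\s]", text)) / max(len(text), 1)
    has_stop_word = any(w in PERSIAN_STOP_WORDS for w in words)
    # Hypothetical cut-offs, chosen only to make the example concrete.
    return 2 <= mean_word_len <= 12 and symbol_ratio < 0.2 and has_stop_word

docs = ["این یک سند نمونه است که در آن کلمات فارسی وجود دارد.", "@@@###!!!"]
print([keep_document(d) for d in docs])   # -> [True, False]
```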

Training Hyperparameters

  • Technique: LoRA (rank 64 for CPT, rank 32 for SFT) on attention and feed-forward layers (see the configuration sketch after this list).
  • Full Fine-tuning: Applied to Embeddings and LM Head.
  • Optimizer: 8-bit AdamW.
  • Precision: Mixed-precision (bfloat16).
  • Scheduler: Cosine with warm-up.
  • Batch Size: Effective batch size of 64.
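
A minimal sketch of how this configuration could be expressed with peft and transformers. The target-module names follow Phi-3's fused projection layout, and values not stated in this card (LoRA alpha, per-device batch size, warm-up ratio) are assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

lora_cfg = LoraConfig(
    r=64,                                                    # rank 32 for the SFT stage
    lora_alpha=128,                                          # assumed, not reported
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],  # attention + FFN
    modules_to_save=["embed_tokens", "lm_head"],             # trained in full
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

args = TrainingArguments(
    output_dir="persian-phi-cpt",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # 2 GPUs x 4 x 8 = effective batch size of 64
    optim="adamw_bnb_8bit",          # 8-bit AdamW
    bf16=True,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,               # assumed warm-up fraction
)
```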

Speeds, Sizes, Times

  • Hardware: 2x NVIDIA RTX 3090 (24GB).
  • Training Duration: Approximately 12 days.

Evaluation

Results

The model was evaluated on the Open Persian LLM Leaderboard. Despite being significantly smaller than many competitors (3.8B parameters), Persian-Phi achieves competitive results, trailing only the 8B state-of-the-art model.

| Metric | Score | Comparison |
|---|---|---|
| Part MC | 30.56 | Outperforms Maral-7B and PersianMind |
| ARC Easy | 64.65 | Competitive with Llama-2 based models |
| ARC Challenge | 51.00 | Strong reasoning capability |
| MMLU Pro | 17.18 | Limited by context window (2k) |
| AUT MC | 43.98 | Consistent performance |

Refer to Table 2 in the technical report for full comparisons.

Acknowledgments

We thank the Part AI Research Center for providing the GPU resources that made this research possible.
