Persian-Phi: A Cross-Lingual Adapted Small Language Model

Persian-Phi is a 3.8B-parameter language model adapted from Microsoft's Phi-3 Mini to support the Persian language. Unlike standard multilingual models, Persian-Phi demonstrates how a high-capability monolingual English model can be transferred effectively to a low-resource language using a resource-efficient curriculum-learning pipeline.

It features an extended tokenizer, embedding alignment via a "warm-up" stage, and continual pre-training on filtered Persian corpora, achieving competitive performance on the Open Persian LLM Leaderboard.

Model Details

Model Description

Persian-Phi was developed to address the scarcity of high-quality LLMs for the Persian language without requiring the massive computational resources typically needed to train multilingual models from scratch.

The model follows a three-stage training pipeline:

  1. Tokenizer Extension: The LLaMA-2 tokenizer was extended with 5,000 new Persian-specific tokens (a minimal sketch appears below).
  2. Warm-up: A "warm-up" phase using a translated Tiny Stories dataset to align new embeddings and mitigate catastrophic forgetting.
  3. Continual Pre-training (CPT) & SFT: Training on the Targoman Large Persian Corpus (TLPC) and Wikipedia, followed by instruction tuning on bilingual datasets.
  • Model type: Causal Language Model (Phi-3 architecture)
  • Language(s) (NLP): Persian (Farsi), English
  • License: Apache-2.0
  • Finetuned from model: microsoft/Phi-3-mini-4k-instruct
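
For illustration, here is a minimal sketch of stage 1 of the pipeline above (tokenizer extension and embedding resize) using transformers. The token list is a tiny placeholder for the 5,000 added tokens, and initialization of the new embedding rows is left to the library default, which may differ from the authors' exact procedure.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Placeholder for the 5,000 Persian-specific tokens added in the real pipeline.
new_persian_tokens = ["سلام", "کتاب", "دانشگاه"]
num_added = tokenizer.add_tokens(new_persian_tokens)

# Grow the input/output embedding matrices so the new token ids have rows.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; new vocab size = {len(tokenizer)}")
```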

Uses

Direct Use

The model is designed for:

  • Persian Text Generation: Creative writing, summarization, and content generation.
  • Question Answering: Answering queries in Persian across various domains.
  • Cross-Lingual Tasks: Translating simple concepts between English and Persian.

Bias, Risks, and Limitations

  • Hallucination: The model may generate plausible-sounding but factually incorrect information.

How to Get Started with the Model

You can use the model directly with the Hugging Face transformers library:
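
The snippet below is a minimal sketch: the Persian prompt and generation settings are illustrative, and it assumes the repository ships full merged weights with the standard Phi-3 chat template. If the checkpoint is published as a LoRA adapter instead, load it with the peft library on top of microsoft/Phi-3-mini-4k-instruct.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amirakhlaghiqqq/PersianPhi"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # weights are stored in bfloat16
    device_map="auto",
)

# Persian prompt: "Explain the history of the Persian language in simple terms."
messages = [{"role": "user", "content": "تاریخچه زبان فارسی را به زبان ساده توضیح بده."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```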

Training Details

Training Data

The model was trained on a mix of filtered and curated datasets:

  1. Pre-training (CPT):
    • Targoman Large Persian Corpus (TLPC): A subset of 12.4M documents, rigorously filtered for quality and deduplicated with MinHash (a minimal deduplication sketch follows this list).
    • Persian Wikipedia: Included alongside TLPC during continual pre-training.
    • Translated Tiny Stories: A Persian translation of the Tiny Stories dataset, used in the embedding warm-up stage.
  2. Supervised Fine-Tuning (SFT):
    • Bactrian-X: ~63k Persian instruction pairs.
    • Aya Dataset: 50k English samples to retain English capabilities.
    • TED2020: 30k bilingual pairs for translation alignment.
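
The following is a minimal near-duplicate-removal sketch in the spirit of the MinHash deduplication mentioned above, using the datasketch library; the shingling scheme, number of permutations, and Jaccard threshold are assumptions, not the reported pipeline.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from word-level shingles (kept simple for brevity)."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

corpus = [
    "این یک سند نمونه است.",         # "This is a sample document."
    "این یک سند نمونه است.",         # exact duplicate, should be dropped
    "متن کاملاً متفاوتی در اینجا.",   # "A completely different text here."
]

lsh = MinHashLSH(threshold=0.8, num_perm=128)   # Jaccard threshold is an assumption
kept = []
for i, doc in enumerate(corpus):
    sig = minhash_of(doc)
    if lsh.query(sig):          # near-duplicate of an already-kept document
        continue
    lsh.insert(f"doc-{i}", sig)
    kept.append(doc)

print(len(kept))  # -> 2
```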

Training Procedure

The training utilized Parameter-Efficient Fine-Tuning (PEFT) with LoRA to keep computational costs low.

Preprocessing

  • Tokenizer: Extended LLaMA-2 tokenizer with 5,000 new Persian tokens.
  • Filtering: Applied heuristics (word length, symbol ratio, stop words) and safety filtering (profanity checks).
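
An illustrative version of such heuristic filtering is sketched below; the thresholds, stop-word list, and symbol definition are hypothetical and not the values used to build the training corpus.

```python
import re

PERSIAN_STOP_WORDS = {"و", "در", "به", "از", "که"}   # tiny sample, not the full list

def keep_document(text: str) -> bool:
    """Return True if a document passes simple word-length, symbol-ratio, and stop-word checks."""
    words = text.split()
    if not words:
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    symbol_ratio = len(re.findall(r"[^\w\s]", text)) / max(len(text), 1)
    has_stop_word = any(w in PERSIAN_STOP_WORDS for w in words)
    # Hypothetical cut-offs, chosen only to make the example concrete.
    return 2 <= mean_word_len <= 12 and symbol_ratio < 0.2 and has_stop_word

docs = ["این یک سند نمونه است که در آن کلمات فارسی وجود دارد.", "@@@###!!!"]
print([keep_document(d) for d in docs])   # -> [True, False]
```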

Training Hyperparameters

  • Technique: LoRA (rank 64 for CPT, rank 32 for SFT) on attention and feed-forward layers (see the configuration sketch after this list).
  • Full Fine-tuning: Applied to Embeddings and LM Head.
  • Optimizer: 8-bit AdamW.
  • Precision: Mixed-precision (bfloat16).
  • Scheduler: Cosine with warm-up.
  • Batch Size: Effective batch size of 64.
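
A minimal sketch of how this configuration could be expressed with peft and transformers. The target-module names follow Phi-3's fused projection layout, and values not stated in this card (LoRA alpha, per-device batch size, warm-up ratio) are assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

lora_cfg = LoraConfig(
    r=64,                                                    # rank 32 for the SFT stage
    lora_alpha=128,                                          # assumed, not reported
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],  # attention + FFN
    modules_to_save=["embed_tokens", "lm_head"],             # trained in full
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

args = TrainingArguments(
    output_dir="persian-phi-cpt",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # 2 GPUs x 4 x 8 = effective batch size of 64
    optim="adamw_bnb_8bit",          # 8-bit AdamW
    bf16=True,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,               # assumed warm-up fraction
)
```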

Speeds, Sizes, Times

  • Hardware: 2x NVIDIA RTX 3090 (24GB).
  • Training Duration: Approximately 12 days.

Evaluation

Results

The model was evaluated on the Open Persian LLM Leaderboard. Despite being significantly smaller than many competitors (3.8B parameters), Persian-Phi achieves competitive results, trailing only the 8B state-of-the-art model.

| Metric | Score | Comparison |
|---|---|---|
| Part MC | 30.56 | Outperforms Maral-7B and PersianMind |
| ARC Easy | 64.65 | Competitive with Llama-2 based models |
| ARC Challenge | 51.00 | Strong reasoning capability |
| MMLU Pro | 17.18 | Limited by context window (2k) |
| AUT MC | 43.98 | Consistent performance |

Refer to Table 2 in the technical report for full comparisons.

Acknowledgments

We thank the Part AI Research Center for providing the GPU resources that made this research possible.
