---
license: cc-by-nc-4.0
language:
- fa
tags:
- masked-language-modeling
- feature-extraction
- large-scale-dataset
- Persian
- dataset_size:72.9B
- no-next-sentence-prediction
pipeline_tag: fill-mask
extra_gated_description: >-
  You agree not to use the model to conduct experiments that cause harm to
  human subjects.
extra_gated_fields:
  Full Name: text
  Organization (University): text
  Email address: text
  Country: country
  Could you briefly explain the purpose of using the dataset?: text
  I agree to use this dataset for non-commercial use ONLY: checkbox
---
# Persian Masked Language Model (MLM)

This model is a **Masked Language Model (MLM)** trained on a **72.9-billion-token corpus** of Persian text, making it one of the largest and most comprehensive models pre-trained exclusively for Persian. It is designed to improve **language-understanding tasks** and to provide high-quality contextual embeddings for a wide range of Persian NLP applications.

- **Our Paper:** [Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization](https://arxiv.org/abs/2501.04858)

## Model Details

### Model Description

- **Model Type:** Masked Language Model (MLM)
- **Base Model:** XLM-RoBERTa Large
- **Objective:** Predict randomly masked tokens within input sequences
- **Training Corpus Size:** 72.9 billion tokens
- **Maximum Sequence Length:** 512 tokens
- **Special Feature:** No Next Sentence Prediction (NSP) task

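The masked-token objective listed above can be sketched in a few lines. The 15% masking rate and the 80/10/10 replacement scheme below are the standard BERT/RoBERTa recipe, assumed here for illustration (this card does not state the exact ratios used), and the token ids are stand-ins:

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=0):
    """Select ~mask_prob of the positions and build MLM labels.

    Labels are -100 (ignored by the loss) everywhere except the
    positions the model must reconstruct.
    """
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                            # predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id                    # 80%: replace with the mask token
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
            # remaining 10%: keep the original token
    return inputs, labels

ids = list(range(1000, 1100))  # stand-in token ids
masked_ids, labels = mask_tokens(ids, mask_id=4, vocab_size=250002)
```

In RoBERTa-style pre-training the mask positions are re-sampled each time a sequence is seen (dynamic masking), so this function would be applied per batch rather than once over the corpus.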
## Training Details

### Training Configuration

- **Hardware:** 8 NVIDIA A800 GPUs
- **Duration:** One week
- **Optimization Framework:** DeepSpeed (Stage 0)
- **Training Parameters:**
  - **Learning Rate:** 5e-5
  - **Maximum Sequence Length:** 512 tokens
  - **Precision:** FP16 (mixed precision)

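A DeepSpeed configuration consistent with these settings might look like the fragment below. This is an illustrative reconstruction from the values stated on this card (fp16, ZeRO stage 0, a per-GPU batch of 30 with 2 accumulation steps from the hyperparameters section), not the actual file used:

```json
{
  "train_micro_batch_size_per_gpu": 30,
  "gradient_accumulation_steps": 2,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 0 }
}
```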
### Corpus

The model was pre-trained on a large-scale corpus of Persian text collected from diverse sources, ensuring broad language coverage and contextual diversity:

- Web-crawled data
- Academic articles and books
- Persian Wikipedia
- Religious texts
- Social media platforms

The data underwent extensive preprocessing, including deduplication and noise removal, to ensure high-quality training data.

## Usage

The model can be used for various **downstream NLP tasks** in Persian, including:

- Text classification
- Named entity recognition
- Question answering
- Semantic search
- Contextual embedding generation

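For the feature-extraction and semantic-search use cases, the encoder can be loaded without the MLM head and sentence vectors built by pooling token embeddings. A minimal sketch follows: `your_model_id` is a placeholder for the model's Hub id, and mean pooling is one common choice, not a method prescribed by this card:

```python
import torch
from transformers import AutoModel, AutoTokenizer

def mean_pool(last_hidden_state, attention_mask):
    """Average the hidden states of real (non-padding) tokens."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

def embed(sentences, model_id="your_model_id"):
    """Return one fixed-size vector per sentence."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)  # encoder only, no MLM head
    inputs = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)
    return mean_pool(hidden, inputs.attention_mask)

# embeddings = embed(["این یک جمله است.", "جملهٔ دیگری برای مقایسه."])
```

The resulting vectors can be compared with cosine similarity for semantic search; CLS-token pooling is an equally valid alternative to mean pooling.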
### Example Usage

This model can be loaded and used with the 🤗 Transformers library:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("your_model_id")
model = AutoModelForMaskedLM.from_pretrained("your_model_id")

# XLM-RoBERTa uses <mask> (not [MASK]); tokenizer.mask_token stays model-agnostic
text = f"این یک {tokenizer.mask_token} جدید است."
inputs = tokenizer(text, return_tensors="pt")

# Predict the masked token
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the highest-scoring prediction at the masked position
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 30 (per device)
- eval_batch_size: 8 (per device)
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 480
- total_eval_batch_size: 64
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1.0
- mixed_precision_training: Native AMP

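The reported total batch size is the product of the per-device batch size, the number of devices, and the gradient-accumulation steps; as a quick consistency check:

```python
train_batch_size = 30             # per device
num_devices = 8
gradient_accumulation_steps = 2

total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
print(total_train_batch_size)     # → 480, matching the reported value
```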
### Framework versions

- Transformers 4.47.0.dev0
- PyTorch 2.4.1+cu121
- Datasets 3.0.2
- Tokenizers 0.20.1

## Citation

If you find this model helpful, please cite the following paper.

**BibTeX:**
```bibtex
@misc{hosseinbeigi2025advancingretrievalaugmentedgenerationpersian,
  title={Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization},
  author={Sara Bourbour Hosseinbeigi and Sina Asghari and Mohammad Ali Seif Kashani and Mohammad Hossein Shalchian and Mohammad Amin Abbasi},
  year={2025},
  eprint={2501.04858},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.04858}
}
```