---
license: cc-by-nc-4.0
language:
- fa
tags:
- masked-language-modeling
- feature-extraction
- large-scale-dataset
- Persian
- dataset_size:72.9B
- no-next-sentence-prediction
pipeline_tag: fill-mask
extra_gated_description: >-
  You agree not to use the model to conduct experiments that cause harm to
  human subjects.
extra_gated_fields:
  Full Name: text
  Organization (University): text
  Email address: text
  Country: country
  Could you briefly explain the purpose of using the dataset?: text
  I agree to use this dataset for non-commercial use ONLY: checkbox
---
# Persian Masked Language Model (MLM)

This model is a **Masked Language Model (MLM)** trained on a **72.9-billion-token corpus** of Persian text, making it one of the largest and most comprehensive models pre-trained exclusively for Persian. It is designed to improve **language-understanding tasks** and to provide high-quality contextual embeddings for a wide range of Persian NLP applications.

- **Our Paper:** [Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization](https://arxiv.org/abs/2501.04858)

## Model Details

### Model Description

- **Model Type:** Masked Language Model (MLM)
- **Base Model:** XLM-RoBERTa Large
- **Objective:** Predict randomly masked tokens within input sequences
- **Training Corpus Size:** 72.9 billion tokens
- **Maximum Sequence Length:** 512 tokens
- **Special Feature:** No Next Sentence Prediction (NSP) task

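The masked-token objective listed above can be sketched in a few lines. The 15% masking rate and the 80/10/10 replacement scheme below are the standard BERT/RoBERTa recipe, assumed here for illustration (this card does not state the exact ratios used), and the token ids are stand-ins:

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=0):
    """Select ~mask_prob of the positions and build MLM labels.

    Labels are -100 (ignored by the loss) everywhere except the
    positions the model must reconstruct.
    """
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                            # predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id                    # 80%: replace with the mask token
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
            # remaining 10%: keep the original token
    return inputs, labels

ids = list(range(1000, 1100))  # stand-in token ids
masked_ids, labels = mask_tokens(ids, mask_id=4, vocab_size=250002)
```

In RoBERTa-style pre-training the mask positions are re-sampled each time a sequence is seen (dynamic masking), so this function would be applied per batch rather than once over the corpus.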
## Training Details

### Training Configuration

- **Hardware:** 8 NVIDIA A800 GPUs
- **Duration:** One week
- **Optimization Framework:** DeepSpeed (Stage 0)
- **Training Parameters:**
  - **Learning Rate:** 5e-5
  - **Maximum Sequence Length:** 512 tokens
  - **Precision:** FP16 (mixed precision)

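A DeepSpeed configuration consistent with these settings might look like the fragment below. This is an illustrative reconstruction from the values stated on this card (fp16, ZeRO stage 0, a per-GPU batch of 30 with 2 accumulation steps from the hyperparameters section), not the actual file used:

```json
{
  "train_micro_batch_size_per_gpu": 30,
  "gradient_accumulation_steps": 2,
  "fp16": { "enabled": true },
  "zero_optimization": { "stage": 0 }
}
```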
### Corpus

The model was pre-trained on a large-scale corpus of Persian text collected from diverse sources, ensuring broad language coverage and contextual diversity:

- Web-crawled data
- Academic articles and books
- Persian Wikipedia
- Religious texts
- Social media platforms

The data underwent extensive preprocessing, including deduplication and noise removal, to ensure high-quality training data.

## Usage

The model can be used for various **downstream NLP tasks** in Persian, including:

- Text classification
- Named entity recognition
- Question answering
- Semantic search
- Contextual embedding generation

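For the feature-extraction and semantic-search use cases, the encoder can be loaded without the MLM head and sentence vectors built by pooling token embeddings. A minimal sketch follows: `your_model_id` is a placeholder for the model's Hub id, and mean pooling is one common choice, not a method prescribed by this card:

```python
import torch
from transformers import AutoModel, AutoTokenizer

def mean_pool(last_hidden_state, attention_mask):
    """Average the hidden states of real (non-padding) tokens."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

def embed(sentences, model_id="your_model_id"):
    """Return one fixed-size vector per sentence."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)  # encoder only, no MLM head
    inputs = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)
    return mean_pool(hidden, inputs.attention_mask)

# embeddings = embed(["این یک جمله است.", "جملهٔ دیگری برای مقایسه."])
```

The resulting vectors can be compared with cosine similarity for semantic search; CLS-token pooling is an equally valid alternative to mean pooling.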
### Example Usage

This model can be loaded and used with the 🤗 Transformers library:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("your_model_id")
model = AutoModelForMaskedLM.from_pretrained("your_model_id")

# XLM-RoBERTa uses <mask> (not [MASK]); tokenizer.mask_token stays model-agnostic
text = f"این یک {tokenizer.mask_token} جدید است."
inputs = tokenizer(text, return_tensors="pt")

# Predict the masked token
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the highest-scoring prediction at the masked position
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 30 (per device)
- eval_batch_size: 8 (per device)
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 480
- total_eval_batch_size: 64
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1.0
- mixed_precision_training: Native AMP

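The reported total batch size is the product of the per-device batch size, the number of devices, and the gradient-accumulation steps; as a quick consistency check:

```python
train_batch_size = 30             # per device
num_devices = 8
gradient_accumulation_steps = 2

total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
print(total_train_batch_size)     # → 480, matching the reported value
```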
### Framework versions

- Transformers 4.47.0.dev0
- PyTorch 2.4.1+cu121
- Datasets 3.0.2
- Tokenizers 0.20.1

## Citation

If you find this model helpful, please cite the following paper.

**BibTeX:**
```bibtex
@misc{hosseinbeigi2025advancingretrievalaugmentedgenerationpersian,
  title={Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization},
  author={Sara Bourbour Hosseinbeigi and Sina Asghari and Mohammad Ali Seif Kashani and Mohammad Hossein Shalchian and Mohammad Amin Abbasi},
  year={2025},
  eprint={2501.04858},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.04858}
}
```