Arabic Named Entity Recognition with LoRA Fine-tuning
A fine-tuned BERT model for Arabic Named Entity Recognition (NER) using Low-Rank Adaptation (LoRA) on the AraBERT v2 base model.
Model Details
Model Description
This model is a fine-tuned version of aubmindlab/bert-base-arabertv2 for token classification tasks, specifically Arabic Named Entity Recognition. The model uses LoRA (Low-Rank Adaptation) for efficient fine-tuning, making it parameter-efficient while maintaining high performance.
The model can identify four types of entities in Arabic text:
- PER (Person): Names of people
- ORG (Organization): Companies, institutions, and government bodies
- LOC (Location): Cities, countries, and geographical locations
- MISC (Miscellaneous): Other named entities
- Developed by: Diaa Essam
- Model type: Token Classification (NER)
- Language(s) (NLP): Arabic (ar)
- License: MIT
- Finetuned from model: aubmindlab/bert-base-arabertv2
Model Sources
- Repository: Kaggle notebook
- Base Model: aubmindlab/bert-base-arabertv2
- Dataset: iSemantics/conllpp-ner-ar
Uses
Direct Use
The model can be directly used for Arabic Named Entity Recognition tasks without additional fine-tuning. It's suitable for:
- Extracting named entities from Arabic news articles
- Information extraction from Arabic documents
- Arabic text analysis and understanding
- Building Arabic NLP pipelines (see the example below)
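For example, once the LoRA adapter has been merged into the base model, the result can be dropped into a standard Transformers token-classification pipeline. This is a minimal sketch rather than released code: the id2label mapping mirrors the id_to_tag dictionary used later in this card, and aggregation_strategy="simple" is an illustrative choice.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from peft import PeftModel

# Label map assumed to match the id_to_tag mapping used in this card
id2label = {0: "B-LOC", 1: "B-MISC", 2: "B-ORG", 3: "B-PER",
            4: "I-LOC", 5: "I-MISC", 6: "I-ORG", 7: "I-PER", 8: "O"}

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2")
base_model = AutoModelForTokenClassification.from_pretrained(
    "aubmindlab/bert-base-arabertv2",
    num_labels=9,
    id2label=id2label,
    label2id={v: k for k, v in id2label.items()},
)

# Merge the LoRA adapter into the base weights for plain inference
model = PeftModel.from_pretrained(base_model, "[checkpoint path]").merge_and_unload()

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # group B-/I- subword predictions into entity spans
)
print(ner("محمد يعمل في شركة جوجل في القاهرة"))
```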
Downstream Use
The model can be further fine-tuned for:
- Domain-specific NER tasks (medical, legal, financial Arabic text)
- Custom entity types beyond the standard four categories
- Transfer learning for related Arabic NLP tasks
Out-of-Scope Use
- Non-Arabic text (the model is trained exclusively on Arabic)
- Sentiment analysis or other non-NER tasks
- Real-time applications requiring sub-millisecond latency without optimization
Bias, Risks, and Limitations
- The model's performance may vary across different Arabic dialects and writing styles
- Entity recognition accuracy depends on text quality and may be lower for informal or dialectal Arabic
- The model may underperform on domain-specific jargon not present in the training data
- MISC entities have lower representation in the training data, though class weighting was applied to mitigate this
Recommendations
Users should be aware of the model's limitations regarding:
- Dialectal variations in Arabic text
- Domain-specific terminology
- The need for post-processing to handle multi-word entities correctly (see the sketch after this list)
- Potential biases in the training data reflecting the source corpus
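A minimal sketch of that post-processing step, operating on the (token, IOB2 label) pairs produced by the inference snippet in the next section; the helper name group_entities is illustrative, not part of the released code.

```python
def group_entities(tagged_tokens):
    """Merge (token, IOB2 label) pairs into (entity_text, entity_type) spans."""
    entities, current_tokens, current_type = [], [], None
    for token, label in tagged_tokens:
        if label.startswith("B-"):
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:  # "O" tag, or an I- tag that does not continue the current entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

# Example: group_entities([("شركة", "B-ORG"), ("جوجل", "I-ORG"), ("تعمل", "O")])
# -> [("شركة جوجل", "ORG")]
```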
How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from peft import PeftModel
import torch

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2")
base_model = AutoModelForTokenClassification.from_pretrained(
    "aubmindlab/bert-base-arabertv2",
    num_labels=9
)

# Load the LoRA adapter and merge it into the base model
model = PeftModel.from_pretrained(base_model, "[checkpoint path]")
model = model.merge_and_unload()

# Prepare input ("Mohammed works at Google in Cairo")
text = "محمد يعمل في شركة جوجل في القاهرة"
tokens = text.split()
inputs = tokenizer(
    tokens,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True
)

# Predict
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# Map predictions back to words (keep only the first subword of each word)
id_to_tag = {0: 'B-LOC', 1: 'B-MISC', 2: 'B-ORG', 3: 'B-PER',
             4: 'I-LOC', 5: 'I-MISC', 6: 'I-ORG', 7: 'I-PER', 8: 'O'}
word_ids = inputs.word_ids(batch_index=0)
results = []
previous_word_idx = None
for idx, word_idx in enumerate(word_ids):
    if word_idx is not None and word_idx != previous_word_idx:
        pred_label = id_to_tag[predictions[0][idx].item()]
        results.append((tokens[word_idx], pred_label))
    previous_word_idx = word_idx
print(results)
```
Training Details
Training Data
The model was trained on the iSemantics/conllpp-ner-ar dataset, which is an Arabic adaptation of the CoNLL++ NER dataset. The dataset contains Arabic text annotated with four entity types (PER, ORG, LOC, MISC) using the IOB2 tagging scheme.
Training data composition:
- Training samples: Combined train + validation sets for final training
- Test samples: Held-out test set for evaluation
- Entity types: 4 (PER, ORG, LOC, MISC) with 9 labels including IOB tags and 'O'
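A minimal sketch of loading the dataset and building the combined training split with the Hugging Face datasets library; the split names and the ner_tags feature layout follow the usual CoNLL-style convention and are assumptions to verify against the dataset card.

```python
from datasets import load_dataset, concatenate_datasets

dataset = load_dataset("iSemantics/conllpp-ner-ar")
print(dataset)  # expected splits: train / validation / test

# Combine train + validation for the final training run; keep test held out
full_train = concatenate_datasets([dataset["train"], dataset["validation"]])

# Inspect the tag set (assumes a CoNLL-style `ner_tags` ClassLabel feature)
label_names = dataset["train"].features["ner_tags"].feature.names
print(label_names)
```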
Training Procedure
Preprocessing
- Text tokenization using AraBERT v2 tokenizer
- Token alignment for subword tokenization
- Maximum sequence length: 128 tokens
- Special tokens and subword continuations labeled with -100 (ignored in loss calculation)
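The alignment step described above is commonly implemented as a function like the one below: only the first subword of each word keeps its label, while special tokens and subword continuations receive -100. This is a generic sketch rather than the exact training script, and it assumes the dataset exposes `tokens` and `ner_tags` columns.

```python
def tokenize_and_align_labels(examples, tokenizer, max_length=128):
    """Tokenize pre-split words and align IOB2 labels to subword tokens."""
    tokenized = tokenizer(
        examples["tokens"],
        is_split_into_words=True,
        truncation=True,
        max_length=max_length,
    )
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)                   # special tokens ([CLS], [SEP])
            elif word_idx != previous_word_idx:
                label_ids.append(word_labels[word_idx])  # first subword keeps the label
            else:
                label_ids.append(-100)                   # subword continuations ignored in loss
            previous_word_idx = word_idx
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized
```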
Training Hyperparameters
LoRA Configuration:
- LoRA rank (r): 32
- LoRA alpha: 64
- LoRA dropout: 0.05
- Target modules: query, value, key, dense layers
- Trainable parameters: 3.7987% of total model parameters
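In the peft API, the configuration above corresponds roughly to the following sketch; the exact target_modules strings depend on the BERT submodule names and are an assumption here.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForTokenClassification

base_model = AutoModelForTokenClassification.from_pretrained(
    "aubmindlab/bert-base-arabertv2", num_labels=9
)

lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["query", "key", "value", "dense"],  # assumed module name suffixes
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reported as ~3.8% of all parameters
```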
Training Arguments:
- Training regime: fp16 mixed precision (when GPU available)
- Learning rate: 1e-4
- Batch size: 32 (training), 64 (evaluation)
- Number of epochs: 70
- Weight decay: 0.01
- Warmup ratio: 0.15
- Optimizer: AdamW
- LR scheduler: Cosine
- Label smoothing: 0.1
- Gradient accumulation steps: 1
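A rough TrainingArguments equivalent of the settings above (AdamW is the Trainer default optimizer); the output directory is a placeholder and argument names follow recent Transformers releases.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="arabert-ner-lora",   # placeholder output path
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    num_train_epochs=70,
    weight_decay=0.01,
    warmup_ratio=0.15,
    lr_scheduler_type="cosine",
    label_smoothing_factor=0.1,
    gradient_accumulation_steps=1,
    fp16=True,                       # mixed precision when a GPU is available
)
```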
Class Weighting: Custom weighted loss function applied with the following weights:
- B-LOC, I-LOC, B-ORG, I-ORG, B-PER, I-PER: 1.0
- B-MISC, I-MISC: 2.5 (increased to handle class imbalance)
- O (outside): 0.5 (decreased to focus on entities)
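One common way to apply such per-class weights is to subclass Trainer and override compute_loss with a weighted cross-entropy. The sketch below illustrates that pattern; the weight order is assumed to follow the label ids used in the inference example (B-LOC, B-MISC, B-ORG, B-PER, I-LOC, I-MISC, I-ORG, I-PER, O).

```python
import torch
from transformers import Trainer

# Weight order assumed to match label ids 0..8
CLASS_WEIGHTS = torch.tensor([1.0, 2.5, 1.0, 1.0, 1.0, 2.5, 1.0, 1.0, 0.5])

class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=CLASS_WEIGHTS.to(logits.device),
            ignore_index=-100,   # skip special tokens and subword continuations
            label_smoothing=0.1, # smoothing folded into the custom loss here
        )
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```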
Speeds, Sizes, Times
- Training time: approximately 20-30 minutes (0.33-0.5 hours) with GPU acceleration
- Model size: Base model + LoRA adapter (~550MB total)
- Throughput: Varies by hardware; optimized for GPU inference
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was evaluated on the test split of the iSemantics/conllpp-ner-ar dataset.
Metrics
Evaluation metrics computed using the seqeval library:
- Precision: Proportion of predicted entities that are correct
- Recall: Proportion of true entities that are identified
- F1 Score: Harmonic mean of precision and recall
- Accuracy: Token-level accuracy
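A typical compute_metrics function for this setup, using seqeval directly; it is a standard pattern rather than the exact evaluation script, and reuses the id_to_tag mapping from the usage example above.

```python
import numpy as np
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

id_to_tag = {0: 'B-LOC', 1: 'B-MISC', 2: 'B-ORG', 3: 'B-PER',
             4: 'I-LOC', 5: 'I-MISC', 6: 'I-ORG', 7: 'I-PER', 8: 'O'}

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Keep only positions with a real label (drop -100 for special tokens/subwords)
    true_tags = [[id_to_tag[l] for l in label_row if l != -100]
                 for label_row in labels]
    pred_tags = [[id_to_tag[p] for p, l in zip(pred_row, label_row) if l != -100]
                 for pred_row, label_row in zip(predictions, labels)]
    return {
        "precision": precision_score(true_tags, pred_tags),
        "recall": recall_score(true_tags, pred_tags),
        "f1": f1_score(true_tags, pred_tags),
        "accuracy": accuracy_score(true_tags, pred_tags),
    }
```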
Results
Overall Performance:
- F1 Score: 0.8506
- Precision: 0.8398
- Recall: 0.8617
- Accuracy: 0.9460
Per-Entity Performance: Results are strong across all four entity types:
- PER (Person) entities
- LOC (Location) entities
- ORG (Organization) entities
- MISC (Miscellaneous) entities, whose scores were improved by the class weighting described above
See the detailed classification report in the training logs for complete per-class metrics.
Technical Specifications
Model Architecture and Objective
- Base Architecture: BERT (Bidirectional Encoder Representations from Transformers)
- Specific Model: AraBERT v2 base
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- Task: Token Classification (Named Entity Recognition)
- Output: 9-class classification (B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, I-MISC, O)
Compute Infrastructure
Hardware
- GPU-accelerated training (CUDA-enabled)
- Optimized for modern GPUs (tested on various CUDA-compatible devices)
Software
- Framework: PyTorch with Transformers and PEFT libraries
- Key Libraries:
- transformers (Hugging Face)
- peft (Parameter-Efficient Fine-Tuning)
- seqeval (evaluation metrics)
- datasets (Hugging Face)
Framework Versions
- PEFT: 0.16.0
- Transformers: 4.x
- PyTorch: 2.x
- Python: 3.8+
Citation
If you use this model, please cite:
BibTeX:
```bibtex
@misc{arabic-ner-lora,
  author    = {Diaa Eldin Essam Zaki},
  title     = {Arabic Named Entity Recognition with LoRA Fine-tuning},
  year      = {2025},
  publisher = {HuggingFace},
}
```
Glossary
- NER: Named Entity Recognition - the task of identifying and classifying named entities in text
- LoRA: Low-Rank Adaptation - a parameter-efficient fine-tuning method that adds trainable rank decomposition matrices
- IOB2: Inside-Outside-Beginning tagging scheme for sequence labeling
- AraBERT: Arabic BERT model pre-trained on large Arabic corpora
- Token Classification: Assigning a label to each token in a sequence
Model Card Authors
Diaa Essam
Model Card Contact