Arabic Named Entity Recognition with LoRA Fine-tuning
A fine-tuned BERT model for Arabic Named Entity Recognition (NER) using Low-Rank Adaptation (LoRA) on the AraBERT v2 base model.
Model Details
Model Description
This model is a fine-tuned version of aubmindlab/bert-base-arabertv2 for token classification tasks, specifically Arabic Named Entity Recognition. The model uses LoRA (Low-Rank Adaptation) for efficient fine-tuning, making it parameter-efficient while maintaining high performance.
The model can identify four types of entities in Arabic text:
- PER (Person): Names of people
- ORG (Organization): Companies, institutions, and government bodies
- LOC (Location): Cities, countries, and geographical locations
- MISC (Miscellaneous): Other named entities
- Developed by: Diaa Essam
- Model type: Token Classification (NER)
- Language(s) (NLP): Arabic (ar)
- License: MIT
- Finetuned from model: aubmindlab/bert-base-arabertv2
Model Sources
- Repository: Kaggle notebook
- Base Model: aubmindlab/bert-base-arabertv2
- Dataset: iSemantics/conllpp-ner-ar
Uses
Direct Use
The model can be directly used for Arabic Named Entity Recognition tasks without additional fine-tuning. It's suitable for:
- Extracting named entities from Arabic news articles
- Information extraction from Arabic documents
- Arabic text analysis and understanding
- Building Arabic NLP pipelines (see the example below)
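For example, once the LoRA adapter has been merged into the base model, the result can be dropped into a standard Transformers token-classification pipeline. This is a minimal sketch rather than released code: the id2label mapping mirrors the id_to_tag dictionary used later in this card, and aggregation_strategy="simple" is an illustrative choice.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from peft import PeftModel

# Label map assumed to match the id_to_tag mapping used in this card
id2label = {0: "B-LOC", 1: "B-MISC", 2: "B-ORG", 3: "B-PER",
            4: "I-LOC", 5: "I-MISC", 6: "I-ORG", 7: "I-PER", 8: "O"}

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2")
base_model = AutoModelForTokenClassification.from_pretrained(
    "aubmindlab/bert-base-arabertv2",
    num_labels=9,
    id2label=id2label,
    label2id={v: k for k, v in id2label.items()},
)

# Merge the LoRA adapter into the base weights for plain inference
model = PeftModel.from_pretrained(base_model, "[checkpoint path]").merge_and_unload()

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # group B-/I- subword predictions into entity spans
)
print(ner("محمد يعمل في شركة جوجل في القاهرة"))
```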
Downstream Use
The model can be further fine-tuned for:
- Domain-specific NER tasks (medical, legal, financial Arabic text)
- Custom entity types beyond the standard four categories
- Transfer learning for related Arabic NLP tasks
Out-of-Scope Use
- Non-Arabic text (the model is trained exclusively on Arabic)
- Sentiment analysis or other non-NER tasks
- Real-time applications requiring sub-millisecond latency without optimization
Bias, Risks, and Limitations
- The model's performance may vary across different Arabic dialects and writing styles
- Entity recognition accuracy depends on text quality and may be lower for informal or dialectal Arabic
- The model may underperform on domain-specific jargon not present in the training data
- MISC entities have lower representation in the training data, though class weighting was applied to mitigate this
Recommendations
Users should be aware of the model's limitations regarding:
- Dialectal variations in Arabic text
- Domain-specific terminology
- The need for post-processing to handle multi-word entities correctly (see the sketch after this list)
- Potential biases in the training data reflecting the source corpus
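A minimal sketch of that post-processing step, operating on the (token, IOB2 label) pairs produced by the inference snippet in the next section; the helper name group_entities is illustrative, not part of the released code.

```python
def group_entities(tagged_tokens):
    """Merge (token, IOB2 label) pairs into (entity_text, entity_type) spans."""
    entities, current_tokens, current_type = [], [], None
    for token, label in tagged_tokens:
        if label.startswith("B-"):
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:  # "O" tag, or an I- tag that does not continue the current entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

# Example: group_entities([("شركة", "B-ORG"), ("جوجل", "I-ORG"), ("تعمل", "O")])
# -> [("شركة جوجل", "ORG")]
```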
How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from peft import PeftModel
import torch

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2")
base_model = AutoModelForTokenClassification.from_pretrained(
    "aubmindlab/bert-base-arabertv2",
    num_labels=9
)

# Load the LoRA adapter and merge it into the base model
model = PeftModel.from_pretrained(base_model, "[checkpoint path]")
model = model.merge_and_unload()

# Prepare input ("Mohammed works at Google in Cairo")
text = "محمد يعمل في شركة جوجل في القاهرة"
tokens = text.split()
inputs = tokenizer(
    tokens,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True
)

# Predict
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# Map predictions back to words (keep only the first subword of each word)
id_to_tag = {0: 'B-LOC', 1: 'B-MISC', 2: 'B-ORG', 3: 'B-PER',
             4: 'I-LOC', 5: 'I-MISC', 6: 'I-ORG', 7: 'I-PER', 8: 'O'}
word_ids = inputs.word_ids(batch_index=0)
results = []
previous_word_idx = None
for idx, word_idx in enumerate(word_ids):
    if word_idx is not None and word_idx != previous_word_idx:
        pred_label = id_to_tag[predictions[0][idx].item()]
        results.append((tokens[word_idx], pred_label))
    previous_word_idx = word_idx
print(results)
```
Training Details
Training Data
The model was trained on the iSemantics/conllpp-ner-ar dataset, which is an Arabic adaptation of the CoNLL++ NER dataset. The dataset contains Arabic text annotated with four entity types (PER, ORG, LOC, MISC) using the IOB2 tagging scheme.
Training data composition:
- Training samples: Combined train + validation sets for final training
- Test samples: Held-out test set for evaluation
- Entity types: 4 (PER, ORG, LOC, MISC) with 9 labels including IOB tags and 'O'
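A minimal sketch of loading the dataset and building the combined training split with the Hugging Face datasets library; the split names and the ner_tags feature layout follow the usual CoNLL-style convention and are assumptions to verify against the dataset card.

```python
from datasets import load_dataset, concatenate_datasets

dataset = load_dataset("iSemantics/conllpp-ner-ar")
print(dataset)  # expected splits: train / validation / test

# Combine train + validation for the final training run; keep test held out
full_train = concatenate_datasets([dataset["train"], dataset["validation"]])

# Inspect the tag set (assumes a CoNLL-style `ner_tags` ClassLabel feature)
label_names = dataset["train"].features["ner_tags"].feature.names
print(label_names)
```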
Training Procedure
Preprocessing
- Text tokenization using AraBERT v2 tokenizer
- Token alignment for subword tokenization
- Maximum sequence length: 128 tokens
- Special tokens and subword continuations labeled with -100 (ignored in loss calculation)
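The alignment step described above is commonly implemented as a function like the one below: only the first subword of each word keeps its label, while special tokens and subword continuations receive -100. This is a generic sketch rather than the exact training script, and it assumes the dataset exposes `tokens` and `ner_tags` columns.

```python
def tokenize_and_align_labels(examples, tokenizer, max_length=128):
    """Tokenize pre-split words and align IOB2 labels to subword tokens."""
    tokenized = tokenizer(
        examples["tokens"],
        is_split_into_words=True,
        truncation=True,
        max_length=max_length,
    )
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)                   # special tokens ([CLS], [SEP])
            elif word_idx != previous_word_idx:
                label_ids.append(word_labels[word_idx])  # first subword keeps the label
            else:
                label_ids.append(-100)                   # subword continuations ignored in loss
            previous_word_idx = word_idx
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized
```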
Training Hyperparameters
LoRA Configuration:
- LoRA rank (r): 32
- LoRA alpha: 64
- LoRA dropout: 0.05
- Target modules: query, value, key, dense layers
- Trainable parameters: 3.7987% of total model parameters
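In the peft API, the configuration above corresponds roughly to the following sketch; the exact target_modules strings depend on the BERT submodule names and are an assumption here.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForTokenClassification

base_model = AutoModelForTokenClassification.from_pretrained(
    "aubmindlab/bert-base-arabertv2", num_labels=9
)

lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["query", "key", "value", "dense"],  # assumed module name suffixes
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reported as ~3.8% of all parameters
```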
Training Arguments:
- Training regime: fp16 mixed precision (when GPU available)
- Learning rate: 1e-4
- Batch size: 32 (training), 64 (evaluation)
- Number of epochs: 70
- Weight decay: 0.01
- Warmup ratio: 0.15
- Optimizer: AdamW
- LR scheduler: Cosine
- Label smoothing: 0.1
- Gradient accumulation steps: 1
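A rough TrainingArguments equivalent of the settings above (AdamW is the Trainer default optimizer); the output directory is a placeholder and argument names follow recent Transformers releases.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="arabert-ner-lora",   # placeholder output path
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    num_train_epochs=70,
    weight_decay=0.01,
    warmup_ratio=0.15,
    lr_scheduler_type="cosine",
    label_smoothing_factor=0.1,
    gradient_accumulation_steps=1,
    fp16=True,                       # mixed precision when a GPU is available
)
```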
Class Weighting: Custom weighted loss function applied with the following weights:
- B-LOC, I-LOC, B-ORG, I-ORG, B-PER, I-PER: 1.0
- B-MISC, I-MISC: 2.5 (increased to handle class imbalance)
- O (outside): 0.5 (decreased to focus on entities)
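One common way to apply such per-class weights is to subclass Trainer and override compute_loss with a weighted cross-entropy. The sketch below illustrates that pattern; the weight order is assumed to follow the label ids used in the inference example (B-LOC, B-MISC, B-ORG, B-PER, I-LOC, I-MISC, I-ORG, I-PER, O).

```python
import torch
from transformers import Trainer

# Weight order assumed to match label ids 0..8
CLASS_WEIGHTS = torch.tensor([1.0, 2.5, 1.0, 1.0, 1.0, 2.5, 1.0, 1.0, 0.5])

class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=CLASS_WEIGHTS.to(logits.device),
            ignore_index=-100,   # skip special tokens and subword continuations
            label_smoothing=0.1, # smoothing folded into the custom loss here
        )
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```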
Speeds, Sizes, Times
- Training time: approximately 20-30 minutes (0.33-0.5 hours) with GPU acceleration
- Model size: Base model + LoRA adapter (~550MB total)
- Throughput: Varies by hardware; optimized for GPU inference
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was evaluated on the test split of the iSemantics/conllpp-ner-ar dataset.
Metrics
Evaluation metrics computed using the seqeval library:
- Precision: Proportion of predicted entities that are correct
- Recall: Proportion of true entities that are identified
- F1 Score: Harmonic mean of precision and recall
- Accuracy: Token-level accuracy
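A typical compute_metrics function for this setup, using seqeval directly; it is a standard pattern rather than the exact evaluation script, and reuses the id_to_tag mapping from the usage example above.

```python
import numpy as np
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

id_to_tag = {0: 'B-LOC', 1: 'B-MISC', 2: 'B-ORG', 3: 'B-PER',
             4: 'I-LOC', 5: 'I-MISC', 6: 'I-ORG', 7: 'I-PER', 8: 'O'}

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Keep only positions with a real label (drop -100 for special tokens/subwords)
    true_tags = [[id_to_tag[l] for l in label_row if l != -100]
                 for label_row in labels]
    pred_tags = [[id_to_tag[p] for p, l in zip(pred_row, label_row) if l != -100]
                 for pred_row, label_row in zip(predictions, labels)]
    return {
        "precision": precision_score(true_tags, pred_tags),
        "recall": recall_score(true_tags, pred_tags),
        "f1": f1_score(true_tags, pred_tags),
        "accuracy": accuracy_score(true_tags, pred_tags),
    }
```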
Results
Overall Performance:
- F1 Score: 0.8506
- Precision: 0.8398
- Recall: 0.8617
- Accuracy: 0.9460
Per-Entity Performance: Results are strong across all four entity types:
- PER (Person) entities
- LOC (Location) entities
- ORG (Organization) entities
- MISC (Miscellaneous) entities, whose scores were improved by the class weighting described above
See the detailed classification report in the training logs for complete per-class metrics.
Technical Specifications
Model Architecture and Objective
- Base Architecture: BERT (Bidirectional Encoder Representations from Transformers)
- Specific Model: AraBERT v2 base
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- Task: Token Classification (Named Entity Recognition)
- Output: 9-class classification (B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, I-MISC, O)
Compute Infrastructure
Hardware
- GPU-accelerated training (CUDA-enabled)
- Optimized for modern GPUs (tested on various CUDA-compatible devices)
Software
- Framework: PyTorch with Transformers and PEFT libraries
- Key Libraries:
- transformers (Hugging Face)
- peft (Parameter-Efficient Fine-Tuning)
- seqeval (evaluation metrics)
- datasets (Hugging Face)
Framework Versions
- PEFT: 0.16.0
- Transformers: 4.x
- PyTorch: 2.x
- Python: 3.8+
Citation
If you use this model, please cite:
BibTeX:
```bibtex
@misc{arabic-ner-lora,
  author    = {Diaa Eldin Essam Zaki},
  title     = {Arabic Named Entity Recognition with LoRA Fine-tuning},
  year      = {2025},
  publisher = {HuggingFace},
}
```
Glossary
- NER: Named Entity Recognition - the task of identifying and classifying named entities in text
- LoRA: Low-Rank Adaptation - a parameter-efficient fine-tuning method that adds trainable rank decomposition matrices
- IOB2: Inside-Outside-Beginning tagging scheme for sequence labeling
- AraBERT: Arabic BERT model pre-trained on large Arabic corpora
- Token Classification: Assigning a label to each token in a sequence
Model Card Authors
Diaa Essam
Model Card Contact