Darija2Vec SOTA (300D)

Darija2Vec-SOTA-300D is a high-performance Word2Vec embedding model specifically engineered for the Moroccan dialect (Darija). Developed as a State-of-the-Art (SOTA) resource, it addresses the unique challenges of Moroccan Arabic NLP, particularly the heavy use of code-switching and diverse orthographic scripts (Arabic and Latin/Arabizi).

🌟 Key Technical Innovations (SOTA)

Unlike standard embeddings that treat different scripts as separate languages, this model implements a Script Unification Pipeline:

  • Script Unification: Systematic mapping of high-frequency Arabizi terms to their Arabic-script equivalents (e.g., ana → أنا, ghadi → غادي). Merging the two spellings pools the frequency statistics for each core semantic concept instead of splitting them across scripts.
  • English Noise Filtering: A custom heuristic filter was used to purge English segments often found in bilingual datasets like DODa, ensuring the semantic space is purely Darija-centric.
  • High-Dimensionality (300D): Trained with a 300-dimension Skip-gram architecture to capture complex Moroccan morphological and semantic nuances.
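The first two steps above can be sketched as follows. The mapping entries and the filtering heuristic are illustrative assumptions; the model's actual lookup table and filter rules are not published with this card.

```python
# Illustrative sketch of the Script Unification and English-noise-filtering
# steps. Mapping entries and the heuristic are assumptions, not the real pipeline.

ARABIZI_TO_ARABIC = {
    "ana": "أنا",     # "I"
    "ghadi": "غادي",  # future marker ("going to")
}

def unify_script(tokens):
    """Map known Arabizi tokens to their Arabic-script spelling; leave the rest."""
    return [ARABIZI_TO_ARABIC.get(t.lower(), t) for t in tokens]

def looks_english(sentence):
    """Crude heuristic: a pure-ASCII sentence with none of the Arabizi
    digit-letters (3, 7, 9 stand in for Arabic phonemes) is likely English."""
    return sentence.isascii() and not any(d in sentence for d in "379")

print(unify_script(["ana", "ghadi", "nemchi"]))  # ['أنا', 'غادي', 'nemchi']
print(looks_english("the quick brown fox"))      # True  -> filtered out
print(looks_english("ana 3yan bezzaf"))          # False -> kept as Darija
```

A real filter would combine several signals (dictionary lookups, character n-gram language ID), but the digit-letter cue alone already separates most English segments from Latin-script Darija.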

📊 Model Specifications

Parameter          Configuration
-----------------  -------------------------------
Model Type         Word2Vec Skip-gram (sg=1)
Vector Dimensions  300
Window Size        7 (optimized for Darija syntax)
Corpus Size        ~317,141 unique sentences
Min Word Count     5
Training Epochs    15

📥 Dataset Sources

The model was trained on a consolidated corpus combining the best available public resources:

  1. Darija Open Dataset (DODa): Recursive scan of translated sentences.
  2. Goud.ma News: For formal and journalistic Darija vocabulary.
  3. Atlasia/Bounhar Sentiment: For authentic social media and conversational data.
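The consolidation step itself is not detailed on this card. A minimal sketch of merging the sources and dropping duplicate sentences (consistent with the "~317,141 unique sentences" figure above) might look like this; the sentence lists are placeholders, not the actual dataset files.

```python
# Hypothetical merge-and-deduplicate step; the inputs below are
# placeholders, not the actual DODa / Goud.ma / Atlasia data.
def consolidate(*sources):
    """Merge sentence lists, keeping the first occurrence of each sentence."""
    seen, corpus = set(), []
    for sentences in sources:
        for s in sentences:
            key = s.strip()
            if key and key not in seen:
                seen.add(key)
                corpus.append(key)
    return corpus

doda = ["واش مزيان", "ana ghadi l dar"]
goud = ["واش مزيان", "خبر جديد"]      # first sentence is a duplicate
print(len(consolidate(doda, goud)))  # 3
```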

💻 Usage (Gensim)

from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

# Download the SOTA vectors
repo_id = "halimbahae/Darija2Vec-SOTA-300D"
vector_file = hf_hub_download(repo_id=repo_id, filename="darija2vec_sota_vectors.txt")

# Load into Gensim
wv = KeyedVectors.load_word2vec_format(vector_file, binary=False)

# Explore similarities
print(wv.most_similar("مزيان", topn=5))    # "mezyan" -- good/nice
print(wv.most_similar("طوموبيل", topn=5))  # "tomobil" -- car