Darija2Vec SOTA (300D)

Darija2Vec-SOTA-300D is a high-performance Word2Vec embedding model specifically engineered for the Moroccan dialect (Darija). Developed as a State-of-the-Art (SOTA) resource, it addresses the unique challenges of Moroccan Arabic NLP, particularly the heavy use of code-switching and diverse orthographic scripts (Arabic and Latin/Arabizi).

🌟 Key Technical Innovations (SOTA)

Unlike standard embeddings that treat different scripts as separate languages, this model implements a Script Unification Pipeline:

  • Script Unification: Systematic mapping of high-frequency Arabizi terms to their Arabic-script equivalents (e.g., ana → أنا, ghadi → غادي). Merging the two spellings pools the frequency statistics for each core semantic concept instead of splitting them across scripts.
  • English Noise Filtering: A custom heuristic filter was used to purge English segments often found in bilingual datasets like DODa, ensuring the semantic space is purely Darija-centric.
  • High-Dimensionality (300D): Trained with a 300-dimension Skip-gram architecture to capture complex Moroccan morphological and semantic nuances.
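The first two steps above can be sketched as follows. The mapping entries and the filtering heuristic are illustrative assumptions; the model's actual lookup table and filter rules are not published with this card.

```python
# Illustrative sketch of the Script Unification and English-noise-filtering
# steps. Mapping entries and the heuristic are assumptions, not the real pipeline.

ARABIZI_TO_ARABIC = {
    "ana": "أنا",     # "I"
    "ghadi": "غادي",  # future marker ("going to")
}

def unify_script(tokens):
    """Map known Arabizi tokens to their Arabic-script spelling; leave the rest."""
    return [ARABIZI_TO_ARABIC.get(t.lower(), t) for t in tokens]

def looks_english(sentence):
    """Crude heuristic: a pure-ASCII sentence with none of the Arabizi
    digit-letters (3, 7, 9 stand in for Arabic phonemes) is likely English."""
    return sentence.isascii() and not any(d in sentence for d in "379")

print(unify_script(["ana", "ghadi", "nemchi"]))  # ['أنا', 'غادي', 'nemchi']
print(looks_english("the quick brown fox"))      # True  -> filtered out
print(looks_english("ana 3yan bezzaf"))          # False -> kept as Darija
```

A real filter would combine several signals (dictionary lookups, character n-gram language ID), but the digit-letter cue alone already separates most English segments from Latin-script Darija.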

📊 Model Specifications

Parameter          Configuration
-----------------  -------------------------------
Model Type         Word2Vec Skip-gram (sg=1)
Vector Dimensions  300
Window Size        7 (optimized for Darija syntax)
Corpus Size        ~317,141 unique sentences
Min Word Count     5
Training Epochs    15

📥 Dataset Sources

The model was trained on a consolidated corpus combining the best available public resources:

  1. Darija Open Dataset (DODa): Recursive scan of translated sentences.
  2. Goud.ma News: For formal and journalistic Darija vocabulary.
  3. Atlasia/Bounhar Sentiment: For authentic social media and conversational data.
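The consolidation step itself is not detailed on this card. A minimal sketch of merging the sources and dropping duplicate sentences (consistent with the "~317,141 unique sentences" figure above) might look like this; the sentence lists are placeholders, not the actual dataset files.

```python
# Hypothetical merge-and-deduplicate step; the inputs below are
# placeholders, not the actual DODa / Goud.ma / Atlasia data.
def consolidate(*sources):
    """Merge sentence lists, keeping the first occurrence of each sentence."""
    seen, corpus = set(), []
    for sentences in sources:
        for s in sentences:
            key = s.strip()
            if key and key not in seen:
                seen.add(key)
                corpus.append(key)
    return corpus

doda = ["واش مزيان", "ana ghadi l dar"]
goud = ["واش مزيان", "خبر جديد"]      # first sentence is a duplicate
print(len(consolidate(doda, goud)))  # 3
```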

💻 Usage (Gensim)

from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

# Download the SOTA vectors
repo_id = "halimbahae/Darija2Vec-SOTA-300D"
vector_file = hf_hub_download(repo_id=repo_id, filename="darija2vec_sota_vectors.txt")

# Load into Gensim
wv = KeyedVectors.load_word2vec_format(vector_file, binary=False)

# Explore similarities
print(wv.most_similar("مزيان", topn=5))    # "mezyan" -- good/nice
print(wv.most_similar("طوموبيل", topn=5))  # "tomobil" -- car