# Darija2Vec SOTA (300D)
Darija2Vec-SOTA-300D is a high-performance Word2Vec embedding model specifically engineered for the Moroccan dialect (Darija). Developed as a State-of-the-Art (SOTA) resource, it addresses the unique challenges of Moroccan Arabic NLP, particularly the heavy use of code-switching and diverse orthographic scripts (Arabic and Latin/Arabizi).
## Key Technical Innovations (SOTA)
Unlike standard embeddings that treat different scripts as separate languages, this model implements a Script Unification Pipeline:
- Script Unification: Systematic mapping of high-frequency Arabizi terms to their Arabic-script equivalents (e.g., ana → أنا, ghadi → غادي). This doubles the statistical density for core semantic concepts, since tokens previously split across two scripts now share one entry.
- English Noise Filtering: A custom heuristic filter purges the English segments often found in bilingual datasets like DODa, ensuring the semantic space is purely Darija-centric.
- High-Dimensionality (300D): Trained with a 300-dimension Skip-gram architecture to capture complex Moroccan morphological and semantic nuances.
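The script-unification step can be sketched as a simple token-level substitution pass run over the corpus before training. The mapping table below is illustrative only; the model's actual Arabizi lexicon is not published in this card.

```python
# Minimal sketch of the Script Unification idea: replace known
# high-frequency Arabizi tokens with their Arabic-script equivalents
# so both spellings contribute to the same word vector.
# NOTE: this two-entry table is a hypothetical stand-in for the
# model's real lexicon.
ARABIZI_TO_ARABIC = {
    "ana": "أنا",     # "I"
    "ghadi": "غادي",  # future marker, "going to"
}

def unify_script(sentence: str) -> str:
    """Map known Arabizi tokens to Arabic script; leave others untouched."""
    tokens = sentence.lower().split()
    return " ".join(ARABIZI_TO_ARABIC.get(t, t) for t in tokens)

print(unify_script("ana ghadi ndir dak chi"))  # أنا غادي ndir dak chi
```

Unmapped Latin tokens pass through unchanged, so the step is safe to apply to mixed-script sentences.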
## Model Specifications
| Parameter | Configuration |
|---|---|
| Model Type | Word2Vec Skip-gram (sg=1) |
| Vector Dimensions | 300 |
| Window Size | 7 (optimized for Darija syntax) |
| Corpus Size | ~317,141 unique sentences |
| Min Word Count | 5 |
| Training Epochs | 15 |
## 📥 Dataset Sources
The model was trained on a consolidated corpus combining the best available public resources:
- Darija Open Dataset (DODa): Recursive scan of translated sentences.
- Goud.ma News: For formal and journalistic Darija vocabulary.
- Atlasia/Bounhar Sentiment: For authentic social media and conversational data.
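Consolidating these sources into the "unique sentences" figure in the spec table implies a deduplication pass across datasets. A minimal sketch of that step, with placeholder sentence lists standing in for the actual dataset loaders:

```python
# Illustrative consolidation: merge sentences from the three sources and
# drop exact duplicates. The lists below are hypothetical samples, not
# the real DODa / Goud.ma / Atlasia loaders.
doda = ["ana ghadi", "wach nta mzyan"]
goud = ["ana ghadi", "lakhbar dyal lyoum"]
atlasia = ["wach nta mzyan", "hadchi zwin"]

seen = set()
corpus = []
for sentence in doda + goud + atlasia:
    key = sentence.strip().lower()
    if key not in seen:  # keep first occurrence, drop repeats
        seen.add(key)
        corpus.append(sentence)

print(len(corpus))  # 4 unique sentences out of 6
```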
## 💻 Usage (Gensim)
```python
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

# Download the SOTA vectors
repo_id = "halimbahae/Darija2Vec-SOTA-300D"
vector_file = hf_hub_download(repo_id=repo_id, filename="darija2vec_sota_vectors.txt")

# Load into Gensim
wv = KeyedVectors.load_word2vec_format(vector_file, binary=False)

# Explore similarities
print(wv.most_similar("مزيان", topn=5))    # "mzyan" (good)
print(wv.most_similar("طوموبيل", topn=5))  # "tomobil" (car)
```