Omar Kamali PRO
omarkamali
AI & ML interests
NLP & LLMs for low resource languages.
Recent Activity
Updated a dataset about 11 hours ago: omarkamali/wikipedia-monthly
Posted an update 1 day ago:
You're probably training on outdated Wikipedia data right now and don't know it. 💡
In June last year, a friend from the Moroccan Wikipedia community slid into my DMs: "Are you using the current version? The official dataset is severely outdated. We added so many articles nowhere to be found on HuggingFace."
He was right. I was running a 2023 snapshot. In 2025. The official Wikipedia dataset, the one hundreds of labs and researchers grab by default without a second thought, was frozen in time.
• For English, that's 700,000 missing articles.
• For Moroccan Arabic, 30% of the language's entire Wikipedia.
• For 31 other languages, there was literally no text corpus at all until recently.
I could've shrugged and moved on. Instead, I spent the next few months building an automated monthly pipeline for 340+ languages on my personal laptop, nearly killing it several times in the process (100% disk, frozen screen, the works).
Nous Research trained Hermes 4 on it. INRIA cited it. It's now three years ahead of what most people are training on.
Here's the full story of how I built Wikipedia Monthly 👇
https://omarkamali.com/blog/wikipedia-monthly-pipeline
Updated a model 5 days ago: wikilangs/hu