English — Wikilangs Models
Open-source tokenizers, n-gram & Markov language models, vocabulary stats, and word embeddings trained on English Wikipedia by Wikilangs.
🌐 Language Page · 🎮 Playground · 📊 Full Research Report
Language Samples
Example sentences drawn from the English Wikipedia corpus:
Alexander V may refer to: Alexander V of Macedon (died 294 BCE) Antipope Alexander V Alexander V of Imereti
Alfonso IV may refer to: Alfonso IV of León (924–931) Afonso IV of Portugal Alfonso IV of Aragon Alfonso IV of Ribagorza Alfonso IV d'Este Duke of Modena and Regg
Anastasius I or Anastasios I may refer to: Anastasius I Dicorus (–518), Roman emperor Anastasius I of Antioch (died 599), Patriarch of Antioch Pope Anastasius I (died 401), pope
Angula may refer to: Aṅgula, a measure equal to a finger's breadth Eel, a biological order of fish Nahas Angula, former Prime Minister of Namibia Helmut Angula See also Angul (disambiguation)
Two antipopes used the regnal name Victor IV: Antipope Victor IV Antipope Victor IV
Quick Start
Load the Tokenizer
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("en_tokenizer_32k.model")
text = "Albrecht Achilles may refer to: Albrecht III Achilles, Elector of Brandenburg Al"
tokens = sp.EncodeAsPieces(text)
ids = sp.EncodeAsIds(text)
print(tokens) # subword pieces
print(ids) # integer ids
# Decode back
print(sp.DecodeIds(ids))
Tokenization examples
Sample 1: Albrecht Achilles may refer to: Albrecht III Achilles, Elector of Brandenburg Al…
| Vocab | Tokens | Count |
|---|---|---|
| 8k | ▁alb recht ▁ach illes ▁may ▁refer ▁to : ▁alb recht … (+27 more) | 37 |
| 16k | ▁alb recht ▁ach illes ▁may ▁refer ▁to : ▁alb recht … (+26 more) | 36 |
| 32k | ▁albrecht ▁achilles ▁may ▁refer ▁to : ▁albrecht ▁iii ▁achilles , … (+17 more) | 27 |
| 64k | ▁albrecht ▁achilles ▁may ▁refer ▁to : ▁albrecht ▁iii ▁achilles , … (+16 more) | 26 |
Sample 2: Alexander V may refer to: Alexander V of Macedon (died 294 BCE) Antipope Alexand…
| Vocab | Tokens | Count |
|---|---|---|
| 8k | ▁alexander ▁v ▁may ▁refer ▁to : ▁alexander ▁v ▁of ▁maced … (+20 more) | 30 |
| 16k | ▁alexander ▁v ▁may ▁refer ▁to : ▁alexander ▁v ▁of ▁macedon … (+18 more) | 28 |
| 32k | ▁alexander ▁v ▁may ▁refer ▁to : ▁alexander ▁v ▁of ▁macedon … (+15 more) | 25 |
| 64k | ▁alexander ▁v ▁may ▁refer ▁to : ▁alexander ▁v ▁of ▁macedon … (+15 more) | 25 |
Sample 3: Two antipopes used the regnal name Victor IV: Antipope Victor IV Antipope Victor…
| Vocab | Tokens | Count |
|---|---|---|
| 8k | ▁two ▁antip op es ▁used ▁the ▁reg nal ▁name ▁victor … (+8 more) | 18 |
| 16k | ▁two ▁antip opes ▁used ▁the ▁reg nal ▁name ▁victor ▁iv … (+7 more) | 17 |
| 32k | ▁two ▁antip opes ▁used ▁the ▁regnal ▁name ▁victor ▁iv : … (+6 more) | 16 |
| 64k | ▁two ▁antipopes ▁used ▁the ▁regnal ▁name ▁victor ▁iv : ▁antipope … (+5 more) | 15 |
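The Count column shows that larger vocabularies need fewer pieces for the same text. A minimal sketch of that comparison follows, assuming "compression" means characters per token (this page does not define the metric). The helpers `chars_per_token` and `detokenize` are hypothetical names, and the pieces are only the visible prefix of Sample 3 up to `▁victor`, so the ratios will not match the corpus-level figures.

```python
def chars_per_token(text: str, pieces: list[str]) -> float:
    """Average number of characters of `text` covered by each piece."""
    if not pieces:
        raise ValueError("pieces must be non-empty")
    return len(text) / len(pieces)


def detokenize(pieces: list[str]) -> str:
    """Rebuild plain text from SentencePiece-style pieces (U+2581 marks a space)."""
    return "".join(pieces).replace("\u2581", " ").strip()


# Visible prefix of Sample 3 for each vocab size, copied from the table above.
prefix_pieces = {
    "8k": "▁two ▁antip op es ▁used ▁the ▁reg nal ▁name ▁victor".split(),
    "16k": "▁two ▁antip opes ▁used ▁the ▁reg nal ▁name ▁victor".split(),
    "32k": "▁two ▁antip opes ▁used ▁the ▁regnal ▁name ▁victor".split(),
    "64k": "▁two ▁antipopes ▁used ▁the ▁regnal ▁name ▁victor".split(),
}

ratios = {}
for vocab, pieces in prefix_pieces.items():
    text = detokenize(pieces)
    ratios[vocab] = chars_per_token(text, pieces)
    print(f"{vocab}: {len(pieces)} pieces, {ratios[vocab]:.2f} chars/token")
```

With a loaded tokenizer, the same function can be applied to `sp.EncodeAsPieces(text)` for any input text.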
Load Word Embeddings
from gensim.models import KeyedVectors
# Aligned embeddings (cross-lingual, mapped to English vector space)
wv = KeyedVectors.load("en_embeddings_128d_aligned.kv")
similar = wv.most_similar("word", topn=5)
for word, score in similar:
    print(f" {word}: {score:.3f}")
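gensim's `most_similar` ranks neighbours by cosine similarity. The sketch below reproduces that ranking in plain Python so the scores are easier to interpret. The three-dimensional `toy_vectors` are invented for illustration and are not values from the released embeddings; `cosine_similarity` and `rank_neighbours` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_neighbours(query, vectors, topn=5):
    """Return the `topn` words closest to `query`, excluding the query itself."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Made-up vectors, for illustration only.
toy_vectors = {
    "king": [0.9, 0.1, 0.0],
    "queen": [0.8, 0.2, 0.1],
    "eel": [0.0, 0.1, 0.9],
}

neighbours = rank_neighbours("king", toy_vectors, topn=2)
for word, score in neighbours:
    print(f" {word}: {score:.3f}")
```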
Load N-gram Model
import pyarrow.parquet as pq
df = pq.read_table("en_3gram_word.parquet").to_pandas()
print(df.head())
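The column layout of the parquet file is not described on this page, so inspect `df.columns` before relying on specific names. To make the idea of a word 3-gram table concrete, here is a small sketch that counts trigrams over two of the sample sentences above. It assumes lowercasing and whitespace splitting, which may differ from how the Wikilangs pipeline builds its models; `word_ngrams` is a hypothetical helper.

```python
from collections import Counter


def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Lowercase, split on whitespace, and return every run of n consecutive words."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


samples = [
    "Alexander V may refer to: Alexander V of Macedon (died 294 BCE) "
    "Antipope Alexander V Alexander V of Imereti",
    "Two antipopes used the regnal name Victor IV: Antipope Victor IV Antipope Victor IV",
]

counts = Counter()
for sentence in samples:
    counts.update(word_ngrams(sentence, 3))

# Punctuation stays attached to words because only whitespace is split on.
for ngram, count in counts.most_common(3):
    print(" ".join(ngram), count)
```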
Models Overview
| Category | Assets |
|---|---|
| Tokenizers | BPE at 8k, 16k, 32k, 64k vocab sizes |
| N-gram models | 2 / 3 / 4 / 5-gram (word & subword) |
| Markov chains | Context 1–5 (word & subword) |
| Embeddings | 32d, 64d, 128d — mono & aligned |
| Vocabulary | Full frequency list + Zipf analysis |
| Statistics | Corpus & model statistics JSON |
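To illustrate what "context" means for the Markov chains, the following sketch builds a context-2 word chain from one sample sentence (lowercased and stripped of punctuation by hand) and predicts the most frequent follower. It is an assumption about the general technique, not the format or training procedure of the released models; `build_markov` and `predict_next` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_markov(words, context_size):
    """Map each run of `context_size` words to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(words) - context_size):
        context = tuple(words[i:i + context_size])
        table[context][words[i + context_size]] += 1
    return table


def predict_next(table, context):
    """Return the most frequent follower of `context`, or None if it was never seen."""
    followers = table.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


words = (
    "two antipopes used the regnal name victor iv "
    "antipope victor iv antipope victor iv"
).split()

table = build_markov(words, context_size=2)
print(predict_next(table, ["antipope", "victor"]))
print(predict_next(table, ["not", "seen"]))
```

A longer context makes each prediction more specific but leaves more contexts unseen, which is why the chains are offered at several context sizes.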
Metrics Summary
| Component | Model | Key Metric | Value |
|---|---|---|---|
| Tokenizer | 8k BPE | Compression | 3.84x |
| Tokenizer | 16k BPE | Compression | 4.22x |
| Tokenizer | 32k BPE | Compression | 4.51x |
| Tokenizer | 64k BPE | Compression | 4.70x 🏆 |
| N-gram | 2-gram (subword) | Perplexity | 257 🏆 |
| N-gram | 2-gram (word) | Perplexity | 386,225 |
| N-gram | 3-gram (subword) | Perplexity | 2,180 |
| N-gram | 3-gram (word) | Perplexity | 4,093,782 |
| N-gram | 4-gram (subword) | Perplexity | 12,758 |
| N-gram | 4-gram (word) | Perplexity | 14,465,722 |
| N-gram | 5-gram (subword) | Perplexity | 55,700 |
| N-gram | 5-gram (word) | Perplexity | 12,820,936 |
| Markov | ctx-1 (subword) | Predictability | 0.0% |
| Markov | ctx-1 (word) | Predictability | 6.2% |
| Markov | ctx-2 (subword) | Predictability | 46.4% |
| Markov | ctx-2 (word) | Predictability | 48.3% |
| Markov | ctx-3 (subword) | Predictability | 45.8% |
| Markov | ctx-3 (word) | Predictability | 75.9% |
| Markov | ctx-4 (subword) | Predictability | 36.8% |
| Markov | ctx-4 (word) | Predictability | 89.2% 🏆 |
| Vocabulary | full | Size | 1,867,537 |
| Vocabulary | full | Zipf R² | 0.9862 |
| Embeddings | mono_32d | Isotropy | 0.7693 🏆 |
| Embeddings | mono_64d | Isotropy | 0.7388 |
| Embeddings | mono_128d | Isotropy | 0.6687 |
📊 Full ablation study, per-model breakdowns, and interpretation guide →
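The Zipf R² row can be read as how well word frequency follows a straight line against rank on log-log axes. The sketch below assumes the value is the coefficient of determination of an ordinary least-squares fit of log frequency on log rank; the exact procedure used by the pipeline is not stated on this page. `zipf_r_squared` is a hypothetical helper and the frequencies are synthetic.

```python
import math


def zipf_r_squared(frequencies):
    """Fit log(freq) = a + slope * log(rank) by least squares; return (slope, R^2)."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if syy == 0:
        raise ValueError("all frequencies are equal; R^2 is undefined")
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared


# Synthetic frequencies that follow Zipf's law exactly: freq = 1000 / rank.
ideal = [1000 / rank for rank in range(1, 51)]
slope, r_squared = zipf_r_squared(ideal)
print(f"slope={slope:.3f}, R^2={r_squared:.4f}")
```

Real frequency lists deviate from the ideal line at the head and tail of the distribution, so values slightly below 1 are expected.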
About
Trained on wikipedia-monthly, a dataset of monthly Wikipedia snapshots covering 300+ languages.
A project by Wikilangs · Maintainer: Omar Kamali · Omneity Labs
Citation
@misc{wikilangs2025,
author = {Kamali, Omar},
title = {Wikilangs: Open NLP Models for Wikipedia Languages},
year = {2025},
doi = {10.5281/zenodo.18073153},
publisher = {Zenodo},
url = {https://huggingface.co/wikilangs},
institution = {Omneity Labs}
}
Links
- 🌐 wikilangs.org
- 🌍 Language page
- 🎮 Playground
- 🤗 HuggingFace models
- 📊 wikipedia-monthly dataset
- 👤 Omar Kamali
- 🤝 Sponsor: Featherless AI
License: MIT — free for academic and commercial use.
Generated by Wikilangs Pipeline · 2026-03-03 22:59:51
