English — Wikilangs Models

Open-source tokenizers, n-gram & Markov language models, vocabulary stats, and word embeddings trained on English Wikipedia by Wikilangs.

🌐 Language Page · 🎮 Playground · 📊 Full Research Report

Language Samples

Example sentences drawn from the English Wikipedia corpus:

Alexander V may refer to: Alexander V of Macedon (died 294 BCE) Antipope Alexander V Alexander V of Imereti

Alfonso IV may refer to: Alfonso IV of León (924–931) Afonso IV of Portugal Alfonso IV of Aragon Alfonso IV of Ribagorza Alfonso IV d'Este Duke of Modena and Regg

Anastasius I or Anastasios I may refer to: Anastasius I Dicorus (–518), Roman emperor Anastasius I of Antioch (died 599), Patriarch of Antioch Pope Anastasius I (died 401), pope

Angula may refer to: Aṅgula, a measure equal to a finger's breadth Eel, a biological order of fish Nahas Angula, former Prime Minister of Namibia Helmut Angula See also Angul (disambiguation)

Two antipopes used the regnal name Victor IV: Antipope Victor IV Antipope Victor IV

Quick Start

Load the Tokenizer

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("en_tokenizer_32k.model")

text = "Albrecht Achilles may refer to: Albrecht III Achilles, Elector of Brandenburg Al"
tokens = sp.EncodeAsPieces(text)
ids    = sp.EncodeAsIds(text)

print(tokens)  # subword pieces
print(ids)     # integer ids

# Decode back
print(sp.DecodeIds(ids))
Tokenization Examples

Sample 1: Albrecht Achilles may refer to: Albrecht III Achilles, Elector of Brandenburg Al…

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | ▁alb recht ▁ach illes ▁may ▁refer ▁to : ▁alb recht … (+27 more) | 37 |
| 16k | ▁alb recht ▁ach illes ▁may ▁refer ▁to : ▁alb recht … (+26 more) | 36 |
| 32k | ▁albrecht ▁achilles ▁may ▁refer ▁to : ▁albrecht ▁iii ▁achilles , … (+17 more) | 27 |
| 64k | ▁albrecht ▁achilles ▁may ▁refer ▁to : ▁albrecht ▁iii ▁achilles , … (+16 more) | 26 |

Sample 2: Alexander V may refer to: Alexander V of Macedon (died 294 BCE) Antipope Alexand…

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | ▁alexander ▁v ▁may ▁refer ▁to : ▁alexander ▁v ▁of ▁maced … (+20 more) | 30 |
| 16k | ▁alexander ▁v ▁may ▁refer ▁to : ▁alexander ▁v ▁of ▁macedon … (+18 more) | 28 |
| 32k | ▁alexander ▁v ▁may ▁refer ▁to : ▁alexander ▁v ▁of ▁macedon … (+15 more) | 25 |
| 64k | ▁alexander ▁v ▁may ▁refer ▁to : ▁alexander ▁v ▁of ▁macedon … (+15 more) | 25 |

Sample 3: Two antipopes used the regnal name Victor IV: Antipope Victor IV Antipope Victor…

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | ▁two ▁antip op es ▁used ▁the ▁reg nal ▁name ▁victor … (+8 more) | 18 |
| 16k | ▁two ▁antip opes ▁used ▁the ▁reg nal ▁name ▁victor ▁iv … (+7 more) | 17 |
| 32k | ▁two ▁antip opes ▁used ▁the ▁regnal ▁name ▁victor ▁iv : … (+6 more) | 16 |
| 64k | ▁two ▁antipopes ▁used ▁the ▁regnal ▁name ▁victor ▁iv : ▁antipope … (+5 more) | 15 |
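The samples above show the same text shrinking from 18 tokens at 8k to 15 at 64k: larger vocabularies merge more characters into each piece. A quick way to quantify this is characters per token, sketched here using Sample 3 and the counts from the table (no model files needed):

```python
# Sample 3 text and the per-vocab token counts reported in the table above.
text = ("Two antipopes used the regnal name Victor IV: "
        "Antipope Victor IV Antipope Victor IV")

token_counts = {"8k": 18, "16k": 17, "32k": 16, "64k": 15}

# Compression as characters per token: higher means fewer tokens per text.
for vocab, n_tokens in token_counts.items():
    ratio = len(text) / n_tokens
    print(f"{vocab}: {ratio:.2f} chars/token")
```

The ratio rises monotonically with vocabulary size, which is the same trend as the corpus-level compression figures in the Metrics Summary below.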

Load Word Embeddings

from gensim.models import KeyedVectors

# Aligned embeddings (cross-lingual, mapped to English vector space)
wv = KeyedVectors.load("en_embeddings_128d_aligned.kv")

similar = wv.most_similar("word", topn=5)
for word, score in similar:
    print(f"  {word}: {score:.3f}")
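Under the hood, `most_similar` ranks vocabulary entries by cosine similarity to the query vector. A minimal numpy sketch of that computation on toy 4-dimensional vectors (the words and vectors here are illustrative, not taken from the released embeddings):

```python
import numpy as np

# Toy vocabulary with illustrative 4-d vectors (not the real embeddings).
vocab = ["king", "queen", "apple", "orange"]
vectors = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [0.8, 0.9, 0.1, 0.1],
    [0.1, 0.0, 0.9, 0.8],
    [0.0, 0.1, 0.8, 0.9],
])

def most_similar(query, topn=3):
    # Cosine similarity = dot product of L2-normalized vectors.
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = normed[vocab.index(query)]
    scores = normed @ q
    order = np.argsort(-scores)
    return [(vocab[i], float(scores[i])) for i in order if vocab[i] != query][:topn]

print(most_similar("king"))
```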

Load N-gram Model

import pyarrow.parquet as pq

df = pq.read_table("en_3gram_word.parquet").to_pandas()
print(df.head())
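The parquet file holds n-gram counts; the exact column layout should be checked with `df.columns`. As a sketch of what the counts enable, here is the standard conversion of trigram counts into conditional next-word probabilities, on toy data standing in for the real rows:

```python
from collections import defaultdict

# Toy trigram counts standing in for rows of the parquet file.
trigram_counts = {
    ("may", "refer", "to"): 120,
    ("may", "refer", "only"): 4,
    ("the", "regnal", "name"): 9,
}

# P(w3 | w1, w2) = count(w1, w2, w3) / total count of the (w1, w2) context.
context_totals = defaultdict(int)
for (w1, w2, w3), c in trigram_counts.items():
    context_totals[(w1, w2)] += c

def prob(w1, w2, w3):
    return trigram_counts.get((w1, w2, w3), 0) / context_totals[(w1, w2)]

print(prob("may", "refer", "to"))  # 120 / 124
```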

Models Overview

Performance Dashboard

| Category | Assets |
|----------|--------|
| Tokenizers | BPE at 8k, 16k, 32k, 64k vocab sizes |
| N-gram models | 2 / 3 / 4 / 5-gram (word & subword) |
| Markov chains | Context 1–5 (word & subword) |
| Embeddings | 32d, 64d, 128d — mono & aligned |
| Vocabulary | Full frequency list + Zipf analysis |
| Statistics | Corpus & model statistics JSON |
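The Markov chain assets predict the next token from a fixed-size context. A minimal sketch of context-1 sampling over toy transition counts (the tokens and counts here are illustrative, not taken from the released models):

```python
import random

# Toy context-1 (bigram) transition counts; the released chains are far
# larger and cover contexts 1-5.
transitions = {
    "antipope": {"victor": 3, "alexander": 2},
    "victor": {"iv": 5},
    "alexander": {"v": 4},
}

def sample_next(token, rng):
    # Sample the next token proportionally to its transition count.
    options = transitions[token]
    words = list(options)
    weights = [options[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded for reproducibility
seq = ["antipope"]
for _ in range(2):
    seq.append(sample_next(seq[-1], rng))
print(" ".join(seq))
```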

Metrics Summary

| Component | Model | Key Metric | Value |
|-----------|-------|------------|-------|
| Tokenizer | 8k BPE | Compression | 3.84x |
| Tokenizer | 16k BPE | Compression | 4.22x |
| Tokenizer | 32k BPE | Compression | 4.51x |
| Tokenizer | 64k BPE | Compression | 4.70x 🏆 |
| N-gram | 2-gram (subword) | Perplexity | 257 🏆 |
| N-gram | 2-gram (word) | Perplexity | 386,225 |
| N-gram | 3-gram (subword) | Perplexity | 2,180 |
| N-gram | 3-gram (word) | Perplexity | 4,093,782 |
| N-gram | 4-gram (subword) | Perplexity | 12,758 |
| N-gram | 4-gram (word) | Perplexity | 14,465,722 |
| N-gram | 5-gram (subword) | Perplexity | 55,700 |
| N-gram | 5-gram (word) | Perplexity | 12,820,936 |
| Markov | ctx-1 (subword) | Predictability | 0.0% |
| Markov | ctx-1 (word) | Predictability | 6.2% |
| Markov | ctx-2 (subword) | Predictability | 46.4% |
| Markov | ctx-2 (word) | Predictability | 48.3% |
| Markov | ctx-3 (subword) | Predictability | 45.8% |
| Markov | ctx-3 (word) | Predictability | 75.9% |
| Markov | ctx-4 (subword) | Predictability | 36.8% |
| Markov | ctx-4 (word) | Predictability | 89.2% 🏆 |
| Vocabulary | full | Size | 1,867,537 |
| Vocabulary | full | Zipf R² | 0.9862 |
| Embeddings | mono_32d | Isotropy | 0.7693 🏆 |
| Embeddings | mono_64d | Isotropy | 0.7388 |
| Embeddings | mono_128d | Isotropy | 0.6687 |
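The Zipf R² above measures how well log-frequency is a linear function of log-rank across the vocabulary. A sketch of that fit on counts that follow Zipf's law exactly (illustrative data, not the pipeline's own computation):

```python
import numpy as np

# Zipf's law: frequency ~ C / rank. On exactly Zipfian counts the
# log-log fit is a straight line with slope -1 and R^2 = 1.
ranks = np.arange(1, 1001)
freqs = 1_000_000 / ranks

log_r, log_f = np.log(ranks), np.log(freqs)
slope, intercept = np.polyfit(log_r, log_f, 1)

# Coefficient of determination of the linear fit in log-log space.
pred = slope * log_r + intercept
ss_res = np.sum((log_f - pred) ** 2)
ss_tot = np.sum((log_f - log_f.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.2f}  R^2={r_squared:.4f}")
```

Real corpora deviate from the ideal line at the head and tail, which is why the reported R² of 0.9862 sits just below 1.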

📊 Full ablation study, per-model breakdowns, and interpretation guide →


About

Trained on wikipedia-monthly — monthly snapshots of 300+ Wikipedia languages.

A project by Wikilangs · Maintainer: Omar Kamali · Omneity Labs

Citation

@misc{wikilangs2025,
  author    = {Kamali, Omar},
  title     = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year      = {2025},
  doi       = {10.5281/zenodo.18073153},
  publisher = {Zenodo},
  url       = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
}

Links

License: MIT — free for academic and commercial use.


Generated by Wikilangs Pipeline · 2026-03-03 22:59:51
