OTK-BPE-50k: Optimized BPE Tokenizers

This repository hosts the OTK-BPE-50k family of production-grade Byte-Level BPE (BBPE) tokenizers designed for regional, custom, and multilingual language models.

It currently features models optimized for the primary Nigerian languages: Yoruba (yo), Igbo (ig), Hausa (ha), and Nigerian Pidgin (pcm), along with a Unified (Naija) model. All models in this repository use a vocabulary of 50,000 tokens (50k).

This repository will be expanded with 50k tokenizer models for other regional and low-resource languages in the future.

Key Features

Byte-Level BPE (BBPE): Maps raw UTF-8 bytes to printable characters, so any input can be encoded. The result is a 0.00% Out-Of-Vocabulary (OOV) rate and zero [UNK] tokens.
Diacritic Preservation & Normalization: Builds NFC normalization into the pre-tokenization chain so that tonal accents and combining diacritics stay in composed form instead of being split into decomposed code points.
Code-Mixed English Support: Blends ~15% English Wikipedia articles into the training corpora to keep fertility low on code-mixed sentences.
Emoji & Symbol Merging: Injects curated emoji vocabulary subsets during training so that BPE merges each emoji into a single token rather than splitting it into multiple byte pieces.
Optimized Vocab Size (50,000): Restores the compression efficiency of monolingual tokenizers while preserving code-mixed English and emoji benefits.
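
The normalization and byte-level points above can be illustrated with the Python standard library alone. This is a minimal sketch, not the repository's training code: the helper names `nfc` and `utf8_bytes` are hypothetical, and the example strings are chosen only for illustration. It shows that NFC composes a letter and a combining mark where a precomposed code point exists, that two different typings of the same Yoruba letter normalize to the same sequence, and that any string (including emoji) reduces to bytes in the range 0 to 255, which is why a byte-level vocabulary never needs an [UNK] token.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


def utf8_bytes(text: str) -> list:
    """Return the UTF-8 byte values of text; every value is in range(256)."""
    return list(text.encode("utf-8"))


# "e" followed by COMBINING DOT BELOW composes to the single code point U+1EB9.
decomposed = "e\u0323"
print(len(decomposed), len(nfc(decomposed)))  # 2 1

# Two typings of o + dot below + grave: the raw strings differ,
# but they are identical after NFC.
typed_a = "o\u0323\u0300"
typed_b = "\u00f2\u0323"
print(typed_a == typed_b, nfc(typed_a) == nfc(typed_b))  # False True

# An emoji is just four UTF-8 bytes to a byte-level tokenizer.
print(utf8_bytes("😂"))  # [240, 159, 152, 130]
```

Note that the second example still contains a combining grave accent after NFC, because Unicode has no precomposed code point for that letter. NFC makes the representation consistent, and the BPE merges are then responsible for keeping such sequences together.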

Tokenizer Models in this Repository

Filename	Language	Target Language Code	Vocab Size
`otk-bpe-50k-yo.json`	Yoruba	`yo`	50,000
`otk-bpe-50k-ig.json`	Igbo	`ig`	50,000
`otk-bpe-50k-ha.json`	Hausa	`ha`	50,000
`otk-bpe-50k-pcm.json`	Nigerian Pidgin	`pcm`	50,000
`otk-bpe-50k-naija.json`	Unified (Naija Multilingual)	`naija` / `mul`	50,000
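
The filenames in the table follow a single pattern, so a raw JSON file can in principle be fetched and opened with the `tokenizers` library directly. The sketch below is an assumption-laden illustration: `tokenizer_filename` and `load_raw_tokenizer` are hypothetical helpers, and the code assumes the JSON files sit at the root of the `olaverse/otk-bpe-50k` repository, which may not match the actual layout (the Transformers loading path uses per-language subfolders). `hf_hub_download` and `Tokenizer.from_file` are existing calls in `huggingface_hub` and `tokenizers`.

```python
REPO_ID = "olaverse/otk-bpe-50k"
LANGUAGE_CODES = ("yo", "ig", "ha", "pcm", "naija")


def tokenizer_filename(code: str) -> str:
    """Build the JSON filename for a language code listed in the table."""
    if code not in LANGUAGE_CODES:
        raise ValueError(
            f"unknown language code {code!r}; expected one of {LANGUAGE_CODES}"
        )
    return f"otk-bpe-50k-{code}.json"


def load_raw_tokenizer(code: str):
    """Download one tokenizer JSON and load it with the tokenizers library."""
    # Imported lazily so tokenizer_filename works without these packages.
    from huggingface_hub import hf_hub_download
    from tokenizers import Tokenizer

    # Assumption: the file is stored at the repository root under this name.
    path = hf_hub_download(repo_id=REPO_ID, filename=tokenizer_filename(code))
    return Tokenizer.from_file(path)
```

If the layout assumption holds, `load_raw_tokenizer("yo").encode("ṣé dáadáa ni?").tokens` would return the token strings for a Yoruba sentence.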

Performance Benchmarks

We evaluated the models against global multilingual baselines on the respective MasakhaNEWS test splits. Fertility represents the average number of tokens produced per word (lower is better).
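
To make the metric concrete, here is a minimal sketch of a fertility computation. It assumes words are delimited by whitespace, which this README does not state, and it takes any `tokenize` callable so it stays independent of a particular library. The `toy_tokenize` function is a stand-in for illustration only and is not one of the tokenizers in this repository.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())  # assumption: whitespace word boundaries
    if total_words == 0:
        raise ValueError("cannot compute fertility on an empty corpus")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: cut every word into chunks of at most 3 characters."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


# "abc" -> 1 token, "defgh" -> 2 tokens, so 3 tokens over 2 words.
print(fertility(["abc defgh"], toy_tokenize))  # 1.5
```

With a real tokenizer, `tokenize` could be the `tokenize` method of a loaded Transformers tokenizer.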

1. Yoruba News Benchmarks (179,432 words)

Tokenizer	Vocab Size	Total Tokens	Fertility	UNK Rate
Olaverse Yoruba (BBPE)	50,000	223,615	1.246 (Best!)	0.00%
Olaverse Unified (Naija)	50,000	232,628	1.296	0.00%
GPT-4o (`o200k_base`)	200,019	302,744	1.687	0.00%
AfroXLMR	250,002	408,611	2.277	0.02%
GPT-4 (`cl100k_base`)	100,277	455,352	2.538	0.00%

2. Igbo News Benchmarks (141,600 words)

Tokenizer	Vocab Size	Total Tokens	Fertility	UNK Rate
Olaverse Igbo (BBPE)	50,000	196,819	1.390 (Best!)	0.00%
Olaverse Unified (Naija)	50,000	200,576	1.416	0.00%
GPT-4o (`o200k_base`)	200,019	255,897	1.807	0.00%
AfroXLMR	250,002	363,965	2.570	0.00%
GPT-4 (`cl100k_base`)	100,277	370,053	2.613	0.00%

3. Hausa News Benchmarks (304,201 words)

Tokenizer	Vocab Size	Total Tokens	Fertility	UNK Rate
Olaverse Hausa (BBPE)	50,000	369,035	1.213 (Best!)	0.00%
Olaverse Unified (Naija)	50,000	374,380	1.231	0.00%
GPT-4o (`o200k_base`)	200,019	483,309	1.589	0.00%
AfroXLMR	250,002	488,035	1.604	0.01%
GPT-4 (`cl100k_base`)	100,277	625,640	2.057	0.00%

4. Pidgin News Benchmarks (136,233 words)

Tokenizer	Vocab Size	Total Tokens	Fertility	UNK Rate
Olaverse Pidgin (BBPE)	50,000	166,186	1.220 (Best!)	0.00%
Olaverse Unified (Naija)	50,000	170,174	1.249	0.00%
GPT-4o (`o200k_base`)	200,019	177,712	1.304	0.00%
GPT-4 (`cl100k_base`)	100,277	184,129	1.352	0.00%
AfroXLMR	250,002	190,921	1.401	0.00%
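
The fertility column can be reproduced from the token and word counts in the tables above. The short check below copies the counts for each language-specific model and for GPT-4o, recomputes fertility, and derives the relative token reduction. The reduction figure is not reported in the tables, and the function and dictionary names are hypothetical.

```python
# Word and token counts copied from the four benchmark tables above.
COUNTS = {
    "yo": {"words": 179_432, "otk": 223_615, "gpt4o": 302_744},
    "ig": {"words": 141_600, "otk": 196_819, "gpt4o": 255_897},
    "ha": {"words": 304_201, "otk": 369_035, "gpt4o": 483_309},
    "pcm": {"words": 136_233, "otk": 166_186, "gpt4o": 177_712},
}


def fertility_from_counts(total_tokens: int, total_words: int) -> float:
    """Tokens per word, as defined for the benchmark tables."""
    return total_tokens / total_words


def token_reduction(ours: int, baseline: int) -> float:
    """Fraction of tokens saved relative to a baseline tokenizer."""
    return 1 - ours / baseline


for code, row in COUNTS.items():
    fert = fertility_from_counts(row["otk"], row["words"])
    saved = token_reduction(row["otk"], row["gpt4o"])
    print(f"{code}: fertility={fert:.3f}, tokens saved vs GPT-4o={saved:.1%}")
```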

How to Use

Installation

Make sure you have the required libraries installed:

pip install tokenizers transformers huggingface_hub

Method A: Standard Transformers Loading (Recommended for LLM Training/Inference)

You can load the wrapped tokenizers directly from subfolders using Hugging Face's AutoTokenizer:

from transformers import AutoTokenizer

# Load the custom Yoruba tokenizer
tokenizer = AutoTokenizer.from_pretrained("olaverse/otk-bpe-50k", subfolder="yo")

# Encode text
text = "Ẹ kú àbọ̀ ooo, ṣé dáadáa ni? 😂"
inputs = tokenizer(text)

print("Tokens:", tokenizer.tokenize(text))
print("IDs:", inputs["input_ids"])
print("Decoded:", tokenizer.decode(inputs["input_ids"]))

Method B: Lightweight Raw BPE Loading (Using the `olaverse` library)

To avoid importing the heavy transformers package, use the lightweight, offline-first olaverse library:

from olaverse import Tokenizer

# 1. Initialize the tokenizer (downloaded from the Hub on demand, then cached)
tokenizer = Tokenizer("yo")  # Supports: "yo", "ig", "ha", "pcm", "naija"

# 2. Encode text
text = "Ẹ kú àbọ̀ ooo, ṣé dáadáa ni? 😂"
ids = tokenizer.encode(text)

print("IDs:", ids)
print("Decoded:", tokenizer.decode(ids))

Datasets Used for Training

WaxalNLP: Colloquial scripts for Yoruba (yor_tts), Igbo (ibo_tts), Hausa (hau_tts), and Pidgin (pcm_tts).
Wikipedia: Monolingual encyclopedic corpora (yo, ig, ha, pcm).
MasakhaNEWS: Nigerian language news articles.
Wikipedia (English): Clean English articles for code-mixed blending.
Curated Emoji Registry: Common Nigerian social media emojis.
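
As a rough illustration of the ~15% English blend, the sketch below computes how many English documents would be added to a target-language corpus to reach a given share of the final mix. It assumes the share is measured in documents. The README does not say whether the figure refers to documents, tokens, or bytes, so both the unit and the helper name `english_docs_needed` are assumptions.

```python
def english_docs_needed(n_target_docs: int, english_share: float = 0.15) -> int:
    """Number of English documents so they form english_share of the final mix.

    Solves n_en / (n_en + n_target) = english_share for n_en.
    """
    if not 0 <= english_share < 1:
        raise ValueError("english_share must be in [0, 1)")
    return round(n_target_docs * english_share / (1 - english_share))


n_target = 850
n_english = english_docs_needed(n_target)
print(n_english, n_english / (n_english + n_target))
```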
