---
language: ceb
language_name: Cebuano
language_family: austronesian_philippine_central
tags:
- wikilangs
- nlp
- tokenizer
- embeddings
- n-gram
- markov
- wikipedia
- feature-extraction
- sentence-similarity
- tokenization
- n-grams
- markov-chain
- text-mining
- fasttext
- babelvec
- vocabulous
- vocabulary
- monolingual
- family-austronesian_philippine_central
license: mit
library_name: wikilangs
pipeline_tag: text-generation
datasets:
- omarkamali/wikipedia-monthly
dataset_info:
  name: wikipedia-monthly
  description: Monthly snapshots of Wikipedia articles across 300+ languages
metrics:
- name: best_compression_ratio
  type: compression
  value: 4.164
- name: best_isotropy
  type: isotropy
  value: 0.8551
- name: best_alignment_r10
  type: alignment
  value: 0.5920
- name: vocabulary_size
  type: vocab
  value: 208251
generated: 2026-03-04
---

# Cebuano — Wikilangs Models

Open-source tokenizers, n-gram & Markov language models, vocabulary stats, and word embeddings trained on **Cebuano** Wikipedia by [Wikilangs](https://wikilangs.org).

🌐 [Language Page](https://wikilangs.org/languages/ceb/) · 🎮 [Playground](https://wikilangs.org/playground/?lang=ceb) · 📊 [Full Research Report](RESEARCH_REPORT.md)

## Language Samples

Example sentences drawn from the Cebuano Wikipedia corpus:

> Kining maong panid gitagana alang sa lista sa mga tawo nga nahimong mayor sa lalawigan sa Sugbo. Alkalde sa Lalawigan sa Sugbo Alkalde

> Ang sekswalidad puyde mopasabot sa: Sekswalidad sa tawo Sekswalidad sa tanom Sekswalidad (oryentasyon) Sekswalidad sa mananap

> Katawhan ug Kultura Ekonomiya Heyograpiya Politikal Mga lungsod Dakbayan Mga dakbayan Pisikal Kaagi Mga sumpay sa gawas

> Kining maong panid gitagana alang sa lista sa mga tawo nga nahimong gobernador sa lalawigan sa Samar. Mga Gobernador Antonio Bolastig Milagrosa T. Tan Gobernador Gobernador sa Samar

> Kining maong panid gitagana alang sa lista sa mga tawo nga nahimong gobernador sa lalawigan sa Biliran. Mga Gobernador (gikan Wayne Jaro Rogelio J. Espina Gobernador Gobernador sa Biliran

## Quick Start

### Load the Tokenizer

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("ceb_tokenizer_32k.model")

text = "Ang (MDCCL) mao ang usa ka tuig sa kalendaryong Gregoryano. Ang maoy usa ka tuig"
tokens = sp.EncodeAsPieces(text)
ids = sp.EncodeAsIds(text)

print(tokens)  # subword pieces
print(ids)     # integer ids

# Decode back
print(sp.DecodeIds(ids))
```

<details>
<summary><b>Tokenization examples (click to expand)</b></summary>

**Sample 1:** `Ang (MDCCL) mao ang usa ka tuig sa kalendaryong Gregoryano. Ang maoy usa ka tuig…`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁ang ▁( m d c cl ) ▁mao ▁ang ▁usa … (+27 more)` | 37 |
| 16k | `▁ang ▁( m d c cl ) ▁mao ▁ang ▁usa … (+24 more)` | 34 |
| 32k | `▁ang ▁( m d c cl ) ▁mao ▁ang ▁usa … (+22 more)` | 32 |
| 64k | `▁ang ▁( md c cl ) ▁mao ▁ang ▁usa ▁ka … (+21 more)` | 31 |

**Sample 2:** `Vilnius - Ulohan, Lyetuwanya. lungsod ug dakbayan sa Uropa`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁v il n ius ▁- ▁ulo han , ▁ly et … (+9 more)` | 19 |
| 16k | `▁vil n ius ▁- ▁ulohan , ▁ly et uw an … (+7 more)` | 17 |
| 32k | `▁vil n ius ▁- ▁ulohan , ▁ly et uw an … (+7 more)` | 17 |
| 64k | `▁vil n ius ▁- ▁ulohan , ▁lyetuwanya . ▁lungsod ▁ug … (+3 more)` | 13 |

**Sample 3:** `Ang manunuwat usa ka tawo nga naay propesyon sa pagsulat.`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁ang ▁man un u wat ▁usa ▁ka ▁tawo ▁nga ▁na … (+9 more)` | 19 |
| 16k | `▁ang ▁man un u wat ▁usa ▁ka ▁tawo ▁nga ▁na … (+8 more)` | 18 |
| 32k | `▁ang ▁man un u wat ▁usa ▁ka ▁tawo ▁nga ▁naay … (+6 more)` | 16 |
| 64k | `▁ang ▁man un uwat ▁usa ▁ka ▁tawo ▁nga ▁naay ▁propes … (+4 more)` | 14 |

</details>
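
The token counts in these tables can be turned into a rough compression figure. The sketch below assumes that "compression" means characters per token, which this card does not define, so treat the numbers as an illustration and not as a reproduction of the reported metric. The counts are copied from the Sample 3 table.

```python
# Sketch: characters per token for Sample 3, using the token counts listed above.
# Assumption: "compression" is read as characters per token; the card does not
# state the exact definition behind the reported metric.
sample = "Ang manunuwat usa ka tawo nga naay propesyon sa pagsulat."

# Token counts per vocabulary size, copied from the Sample 3 table.
token_counts = {"8k": 19, "16k": 18, "32k": 16, "64k": 14}

ratios = {vocab: len(sample) / count for vocab, count in token_counts.items()}

for vocab, ratio in ratios.items():
    print(f"{vocab}: {ratio:.2f} characters per token")
```

Larger vocabularies cover more whole words with a single piece, so the same sentence needs fewer tokens and the ratio rises.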

### Load Word Embeddings

```python
from gensim.models import KeyedVectors

# Aligned embeddings (cross-lingual, mapped to English vector space)
wv = KeyedVectors.load("ceb_embeddings_128d_aligned.kv")

# Replace "word" with any token present in the embedding vocabulary
similar = wv.most_similar("word", topn=5)
for word, score in similar:
    print(f"  {word}: {score:.3f}")
```
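
gensim's `most_similar` ranks neighbours by cosine similarity. The following is a minimal pure-Python sketch of that ranking; the three-dimensional vectors and the helper names `cosine_similarity` and `most_similar` are made up for illustration and are not taken from the released embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "tawo": [0.9, 0.1, 0.3],
    "katawhan": [0.8, 0.2, 0.4],
    "tuig": [0.1, 0.9, 0.2],
}

def most_similar(query, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = most_similar("tawo", toy_vectors)
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```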

### Load N-gram Model

```python
import pyarrow.parquet as pq

df = pq.read_table("ceb_3gram_word.parquet").to_pandas()
print(df.head())
```
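
For readers unfamiliar with n-gram tables, here is a small sketch of how word trigram counts can be built from raw text. It is not the Wikilangs pipeline: lowercasing, whitespace splitting, and the helper name `word_ngrams` are assumptions made for the example, and the column layout of the parquet file is not described in this card.

```python
from collections import Counter

def word_ngrams(text, n):
    """Yield consecutive n-word tuples from whitespace-split, lowercased text."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])

# Two sentences taken from the language samples in this card.
text = (
    "Kining maong panid gitagana alang sa lista sa mga tawo nga nahimong "
    "gobernador sa lalawigan sa Samar. "
    "Kining maong panid gitagana alang sa lista sa mga tawo nga nahimong "
    "gobernador sa lalawigan sa Biliran."
)

trigram_counts = Counter(word_ngrams(text, 3))
for ngram, count in trigram_counts.most_common(3):
    print(" ".join(ngram), count)
```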

## Models Overview

| Category | Assets |
|----------|--------|
| Tokenizers | BPE at 8k, 16k, 32k, 64k vocab sizes |
| N-gram models | 2 / 3 / 4 / 5-gram (word & subword) |
| Markov chains | Context 1–5 (word & subword) |
| Embeddings | 32d, 64d, 128d — mono & aligned |
| Vocabulary | Full frequency list + Zipf analysis |
| Statistics | Corpus & model statistics JSON |
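
To make the "context" setting of the Markov chains concrete, the sketch below builds a word-level chain with a context of two words and samples from it. It is a generic illustration under stated assumptions, not the released models or their file format; `build_chain` and `generate` are hypothetical helper names, and the toy corpus is one lowercased sentence from the samples above.

```python
import random
from collections import defaultdict

def build_chain(words, context=2):
    """Map each context (tuple of `context` words) to the words seen after it."""
    chain = defaultdict(list)
    for i in range(len(words) - context):
        chain[tuple(words[i:i + context])].append(words[i + context])
    return chain

def generate(chain, seed, max_words=12, rng=None):
    """Extend `seed` by sampling a successor for the last `len(seed)` words."""
    rng = rng or random.Random(0)
    output = list(seed)
    context = len(seed)
    while len(output) < max_words:
        successors = chain.get(tuple(output[-context:]))
        if not successors:  # unseen context: stop generating
            break
        output.append(rng.choice(successors))
    return output

corpus = (
    "kining maong panid gitagana alang sa lista sa mga tawo nga nahimong "
    "gobernador sa lalawigan sa samar"
).split()

chain = build_chain(corpus, context=2)
generated = generate(chain, seed=("kining", "maong"))
print(" ".join(generated))
```

A longer context leaves fewer possible successors per state, which makes generation more deterministic on a corpus this small.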

## Metrics Summary

| Component | Model | Key Metric | Value |
|-----------|-------|------------|-------|
| Tokenizer | 8k BPE | Compression | 3.20x |
| Tokenizer | 16k BPE | Compression | 3.59x |
| Tokenizer | 32k BPE | Compression | 3.89x |
| Tokenizer | 64k BPE | Compression | 4.16x 🏆 |
| N-gram | 2-gram (subword) | Perplexity | 244 🏆 |
| N-gram | 2-gram (word) | Perplexity | 1,490 |
| N-gram | 3-gram (subword) | Perplexity | 1,343 |
| N-gram | 3-gram (word) | Perplexity | 2,538 |
| N-gram | 4-gram (subword) | Perplexity | 3,750 |
| N-gram | 4-gram (word) | Perplexity | 4,059 |
| N-gram | 5-gram (subword) | Perplexity | 6,751 |
| N-gram | 5-gram (word) | Perplexity | 5,049 |
| Markov | ctx-1 (subword) | Predictability | 13.0% |
| Markov | ctx-1 (word) | Predictability | 0.0% |
| Markov | ctx-2 (subword) | Predictability | 32.8% |
| Markov | ctx-2 (word) | Predictability | 66.0% |
| Markov | ctx-3 (subword) | Predictability | 28.5% |
| Markov | ctx-3 (word) | Predictability | 83.0% |
| Markov | ctx-4 (subword) | Predictability | 31.1% |
| Markov | ctx-4 (word) | Predictability | 94.4% 🏆 |
| Vocabulary | full | Size | 208,251 |
| Vocabulary | full | Zipf R² | 0.9938 |
| Embeddings | mono_32d | Isotropy | 0.8551 |
| Embeddings | mono_64d | Isotropy | 0.8254 |
| Embeddings | mono_128d | Isotropy | 0.7631 |
| Embeddings | aligned_32d | Isotropy | 0.8551 🏆 |
| Embeddings | aligned_64d | Isotropy | 0.8254 |
| Embeddings | aligned_128d | Isotropy | 0.7631 |
| Alignment | aligned_32d | R@1 / R@5 / R@10 | 5.8% / 18.8% / 31.4% |
| Alignment | aligned_64d | R@1 / R@5 / R@10 | 11.2% / 32.6% / 46.4% |
| Alignment | aligned_128d | R@1 / R@5 / R@10 | 23.8% / 47.0% / 59.2% 🏆 |

📊 **[Full ablation study, per-model breakdowns, and interpretation guide →](RESEARCH_REPORT.md)**
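
The R@1 / R@5 / R@10 columns read as recall at k: the share of queries whose expected match appears among the top k retrieved candidates. The sketch below shows that computation in general form. The evaluation dictionary and retrieval setup used by Wikilangs are not described in this card, so the word pairs, candidate lists, and the helper name `recall_at_k` are hypothetical.

```python
def recall_at_k(ranked_candidates, gold, k):
    """Fraction of queries whose gold target appears in the top-k candidates."""
    hits = sum(
        1 for query, target in gold.items()
        if target in ranked_candidates.get(query, [])[:k]
    )
    return hits / len(gold)

# Hypothetical retrieval results: for each Cebuano query word, English
# candidates sorted from most to least similar. Invented for illustration.
ranked_candidates = {
    "tawo": ["person", "people", "man"],
    "tuig": ["month", "year", "day"],
    "lungsod": ["river", "lake", "town"],
    "dakbayan": ["village", "road", "bridge"],
}

# Hypothetical gold dictionary (query -> expected translation).
gold = {"tawo": "person", "tuig": "year", "lungsod": "town", "dakbayan": "city"}

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(ranked_candidates, gold, k):.1%}")
```

Recall can only stay flat or grow as k increases, which matches the pattern in each Alignment row above.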

---

## About

Trained on [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly) — monthly snapshots of 300+ Wikipedia languages.

A project by **[Wikilangs](https://wikilangs.org)** · Maintainer: [Omar Kamali](https://omarkamali.com) · [Omneity Labs](https://omneitylabs.com)

### Citation

```bibtex
@misc{wikilangs2025,
  author = {Kamali, Omar},
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year = {2025},
  doi = {10.5281/zenodo.18073153},
  publisher = {Zenodo},
  url = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
}
```

### Links

- 🌐 [wikilangs.org](https://wikilangs.org)
- 🌍 [Language page](https://wikilangs.org/languages/ceb/)
- 🎮 [Playground](https://wikilangs.org/playground/?lang=ceb)
- 🤗 [HuggingFace models](https://huggingface.co/wikilangs)
- 📊 [wikipedia-monthly dataset](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
- 👤 [Omar Kamali](https://huggingface.co/omarkamali)
- 🤝 Sponsor: [Featherless AI](https://featherless.ai)

**License:** MIT — free for academic and commercial use.

---

*Generated by Wikilangs Pipeline · 2026-03-04 08:49:55*