
# BoKenlm-sp - Tibetan KenLM Language Model

A KenLM n-gram language model trained on Tibetan text, tokenized with the openpecha/BoSentencePiece SentencePiece tokenizer.

## Model Details

| Parameter | Value |
|---|---|
| Model Type | Modified Kneser-Ney 5-gram |
| Tokenizer | openpecha/BoSentencePiece (Unigram, 20k vocab) |
| Training Corpus | bo_corpus.txt |
| Pruning | 0 0 1 |
| Tokens | 38,532,313 |
| Vocabulary Size | 19,974 |
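These settings correspond to a standard KenLM training invocation. A minimal sketch, assuming KenLM's `lmplz` binary is on the path and that `bo_corpus.txt` is already SentencePiece-tokenized (one sentence per line):

```shell
# Train a 5-gram modified Kneser-Ney model; --prune 0 0 1 drops
# singleton trigrams and above, matching the table above
lmplz -o 5 --prune 0 0 1 < bo_corpus.txt > BoKenlm-sp.arpa
```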

## N-gram Statistics

| Order | Count | D1 | D2 | D3+ |
|---|---|---|---|---|
| 1 | 19,974 | 0.4286 | 0.4732 | 1.6466 |
| 2 | 6,644,290 | 0.6716 | 1.1474 | 1.5430 |
| 3 | 4,300,626 | 0.8465 | 1.2657 | 1.4802 |
| 4 | 3,485,091 | 0.9175 | 1.3852 | 1.5176 |
| 5 | 2,597,780 | 0.8773 | 1.4487 | 1.5846 |

## Memory Estimates

| Type | MB | Details |
|---|---|---|
| probing | 375 | assuming -p 1.5 |
| probing | 458 | assuming -r models -p 1.5 |
| trie | 187 | without quantization |
| trie | 99 | assuming -q 8 -b 8 quantization |
| trie | 159 | assuming -a 22 array pointer compression |
| trie | 71 | assuming -a 22 -q 8 -b 8 array pointer compression and quantization |
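The smallest footprint in the table comes from combining pointer compression and quantization when converting the ARPA file to KenLM's binary format. A sketch using KenLM's `build_binary` tool with those flags:

```shell
# Build a compact binary trie (~71 MB per the estimate above):
# -a 22 enables array pointer compression, -q 8 -b 8 quantize
# probabilities and backoffs to 8 bits each
build_binary -a 22 -q 8 -b 8 trie BoKenlm-sp.arpa BoKenlm-sp.binary
```

The resulting `.binary` file loads much faster than the ARPA file and can be passed to `kenlm.Model` in place of it.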

## Training Resources

| Metric | Value |
|---|---|
| Peak Virtual Memory | 12,333 MB |
| Peak RSS | 2,976 MB |
| Wall Time | 33.1s |
| User Time | 37.0s |
| System Time | 16.6s |

## Usage

```python
import kenlm

model = kenlm.Model("BoKenlm-sp.arpa")

# Score a SentencePiece-tokenized sentence (total log10 probability)
score = model.score("▁བོད་སྐད་ ▁ཀྱི་ ▁ཚིག་གྲུབ་ ▁འདི་ ▁ཡིན།")
print(score)
```

Input text must be tokenized with the same openpecha/BoSentencePiece tokenizer used for training, with tokens joined by spaces.
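`model.score` returns a total log10 probability, which is often easier to compare across sentences as a perplexity. A minimal sketch of the conversion (the token count and score value below are hypothetical; by default KenLM also adds `<s>`/`</s>`, so count the end-of-sentence token when normalizing):

```python
import math

def perplexity(log10_score: float, n_tokens: int) -> float:
    # Perplexity = 10 ** (-total_log10_prob / N)
    return 10 ** (-log10_score / n_tokens)

# e.g. a 5-token sentence plus </s> scored at -12.5 total log10 prob
print(perplexity(-12.5, 6))
```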

## Files

- `BoKenlm-sp.arpa` — ARPA format language model
- `README.md` — This model card

## License

Apache 2.0
