An abstractive summarization model for the Khmer language, fine-tuned from
google/mt5-small on two Khmer news datasets.
Note: Khmer has no spaces between words. The mT5 SentencePiece tokenizer handles all subword segmentation automatically, so do not apply any word-splitting pre-processing.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import unicodedata, re
tokenizer = AutoTokenizer.from_pretrained("phonsobon/khmer-text-summarization-1024k", use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained("phonsobon/khmer-text-summarization-1024k")
model.eval()
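def clean_khmer(text):
    # Normalize to NFC so equivalent character sequences share one encoding
    text = unicodedata.normalize("NFC", text)
    # Replace HTML tags and URLs with a space
    text = re.sub(r"<[^>]+>|https?://\S+", " ", text)
    # Collapse runs of spaces and tabs (newlines are left untouched)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()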
article = "..."  # your Khmer article
inputs = tokenizer(
    "summarize: " + clean_khmer(article),
    return_tensors="pt",
    max_length=1024,
    truncation=True,
)
output_ids = model.generate(**inputs)  # uses the generation config saved with the model
summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(summary)
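The cleaning step can be checked on its own, without loading the model. The sketch below repeats the `clean_khmer` helper from the snippet above and runs it on made-up sample strings (the inputs are illustrative assumptions, not taken from the training data). It shows that tags and URLs are dropped, spaces and tabs are collapsed, and newlines are preserved.

```python
import re
import unicodedata


def clean_khmer(text):
    # Same logic as the helper in the usage snippet above
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"<[^>]+>|https?://\S+", " ", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


# Hypothetical input mixing HTML tags, a tab, repeated spaces and a URL
sample = "<p>Breaking   news</p>\tsee https://example.com/a?b=1 now"
cleaned = clean_khmer(sample)
print(cleaned)  # Breaking news see now

# NFC normalization composes a base letter with a following combining mark
composed = clean_khmer("cafe\u0301")
print(len(composed))  # 4

# Only spaces and tabs are collapsed, so paragraph breaks survive
multiline = clean_khmer("first  line\nsecond\tline")
print(multiline.split("\n"))  # ['first line', 'second line']
```

Because no spaces are inserted or removed inside runs of Khmer characters, the text reaches the tokenizer unsegmented, as the note above requires.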
| Metric | Score |
|---|---|
| Loss | 0.2025 |
| ROUGE-1 | 40.7835 |
| ROUGE-2 | 39.2445 |
| ROUGE-L | 40.8362 |
| ROUGE-Lsum | 40.7699 |
| Generation length | 255.0000 |

| Metric | Score |
|---|---|
| Loss | 0.2321 |
| ROUGE-1 | 42.1170 |
| ROUGE-2 | 40.8878 |
| ROUGE-L | 42.0827 |
| ROUGE-Lsum | 42.0164 |
| Generation length | 255.0000 |
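For readers unfamiliar with the ROUGE rows, the following is a minimal sketch of how a ROUGE-N F-score on a 0 to 100 scale can be computed from two token sequences. It is not the evaluation script used for this model: the card does not state how text was tokenized for scoring, and the function names `ngram_counts` and `rouge_n` are illustrative.

```python
from collections import Counter


def ngram_counts(tokens, n):
    # Multiset of all contiguous n-grams in the token sequence
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """ROUGE-N F1 (0 to 100) between a reference and a candidate token list."""
    ref = ngram_counts(reference, n)
    cand = ngram_counts(candidate, n)
    # Clipped overlap: each n-gram counts at most as often as it occurs in both
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 100 * 2 * precision * recall / (precision + recall)


# Toy sequences; real scores depend on how the Khmer text is split into tokens
reference = ["a", "b", "c", "d"]
candidate = ["a", "b", "x", "d"]
rouge1 = rouge_n(reference, candidate, n=1)  # 3 of 4 unigrams match
rouge2 = rouge_n(reference, candidate, n=2)  # 1 of 3 bigrams match
print(round(rouge1, 2), round(rouge2, 2))  # 75.0 33.33
```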
| Setting | Value |
|---|---|
| Base model | google/mt5-small |
| Fine-tuning method | LoRA (merged) |
| Task prefix | summarize: |
| Max input length | 1024 tokens |
| Max target length | 256 tokens |
| Epochs | 10 |
| Learning rate | 0.0005 |
| Beam search | 4 beams |
| No-repeat n-gram | 3 |
| Training date | 2026-06-22T08:16:44.850432 |
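The decoding settings in the table can also be passed to `generate` explicitly, which makes them easy to override per call. The sketch below is one possible wrapper, not the author's code: the names `GENERATION_KWARGS` and `summarize` are hypothetical, and using `max_length=256` to mirror the max target length is an assumption about the saved generation config. The `model` and `tokenizer` arguments are expected to be the objects loaded with `from_pretrained` as shown earlier.

```python
# Decoding settings taken from the table above.
# max_length=256 is an assumption that mirrors the max target length.
GENERATION_KWARGS = {
    "num_beams": 4,
    "no_repeat_ngram_size": 3,
    "max_length": 256,
}


def summarize(model, tokenizer, text, prefix="summarize: ",
              max_input_length=1024, **overrides):
    """Summarize one article; keyword overrides replace the defaults above."""
    inputs = tokenizer(
        prefix + text,
        return_tensors="pt",
        max_length=max_input_length,
        truncation=True,
    )
    generation_kwargs = {**GENERATION_KWARGS, **overrides}
    output_ids = model.generate(**inputs, **generation_kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `summarize(model, tokenizer, article, num_beams=1)` would switch to greedy decoding while keeping the other settings.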
| Dataset | Columns used |
|---|---|
| phonsobon/khmer-artical-summaries | content → summaries |
| phonsobon/khmer-text-summarization-v2 | article → summaries |