Khmer Text Summarization — mT5-small

An abstractive summarization model for the Khmer language, fine-tuned from google/mt5-small on two Khmer news datasets.

Note: Khmer script does not put spaces between words. The mT5 SentencePiece tokenizer segments raw text into subwords on its own, so do not apply any word-splitting pre-processing.

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import re
import unicodedata

tokenizer = AutoTokenizer.from_pretrained("phonsobon/khmer-text-summarization-1024k", use_fast=False)
model     = AutoModelForSeq2SeqLM.from_pretrained("phonsobon/khmer-text-summarization-1024k")
model.eval()

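def clean_khmer(text):
    """Normalize Unicode, drop HTML tags and URLs, and collapse spaces and tabs."""
    text = unicodedata.normalize("NFC", text)           # canonical composition
    text = re.sub(r"<[^>]+>|https?://\S+", " ", text)   # strip HTML tags and URLs
    text = re.sub(r"[ \t]+", " ", text)                 # collapse spaces/tabs; newlines are kept
    return text.strip()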

article = "បញ្ចូលអត្ថបទខ្មែររបស់អ្នកនៅទីនេះ ..."   # your Khmer article

inputs = tokenizer(
    "summarize: " + clean_khmer(article),
    return_tensors="pt",
    max_length=1024,
    truncation=True,
)

output_ids = model.generate(**inputs)   # uses the generation config saved with the model
summary    = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(summary)
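
To check what the cleaning step does before sending text to the model, the helper can be run on its own. The sketch below repeats `clean_khmer` from the snippet above so that it is self-contained; the sample string and the `example.com` URL are made up for illustration.

```python
import re
import unicodedata


def clean_khmer(text):
    """Same helper as in the usage snippet above."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"<[^>]+>|https?://\S+", " ", text)   # strip HTML tags and URLs
    text = re.sub(r"[ \t]+", " ", text)                 # collapse spaces/tabs only
    return text.strip()


# Illustrative input: HTML tags, a tab, a URL and trailing whitespace.
raw = "<p>សួស្តី</p>\t  https://example.com/news   ពិភពលោក "
cleaned = clean_khmer(raw)
print(cleaned)

# Newlines are not touched, so paragraph breaks survive the cleaning.
multi_paragraph = clean_khmer("a\n\nb")
```

The helper removes markup and collapses horizontal whitespace but never splits Khmer words, which is consistent with the note about leaving segmentation to the tokenizer.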

Evaluation — Validation set

Metric	Score
Loss	0.2025
ROUGE-1	40.7835
ROUGE-2	39.2445
ROUGE-L	40.8362
ROUGE-Lsum	40.7699
Mean generated length (tokens)	255.0000

Evaluation — Test set

Metric	Score
Loss	0.2321
ROUGE-1	42.1170
ROUGE-2	40.8878
ROUGE-L	42.0827
ROUGE-Lsum	42.0164
Mean generated length (tokens)	255.0000

Training details

Setting	Value
Base model	`google/mt5-small`
Fine-tuning method	LoRA (merged)
Task prefix	`summarize:`
Max input length	1024 tokens
Max target length	256 tokens
Epochs	10
Learning rate	0.0005
Beam search	4 beams
No-repeat n-gram	3
Training date	2026-06-22T08:16:44.850432
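
The decoding settings in the table can also be passed explicitly instead of relying on the saved generation config. The sketch below assumes they correspond to the `num_beams`, `no_repeat_ngram_size` and `max_length` arguments of `generate()`; the card does not state this mapping, and the names `GENERATION_KWARGS` and `generation_kwargs` are hypothetical.

```python
# Assumed mapping from the table above to `generate()` keyword arguments.
GENERATION_KWARGS = {
    "num_beams": 4,             # "Beam search: 4 beams"
    "no_repeat_ngram_size": 3,  # "No-repeat n-gram: 3"
    "max_length": 256,          # "Max target length: 256 tokens" (assumption)
}


def generation_kwargs(**overrides):
    """Return a copy of the defaults with any caller overrides applied."""
    return {**GENERATION_KWARGS, **overrides}


# With `model` and `inputs` from the Usage section (not executed here):
# output_ids = model.generate(**inputs, **generation_kwargs(num_beams=2))
```

Passing the values explicitly makes it easy to try a smaller beam width or a shorter output length without changing the stored configuration.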

Datasets

Dataset	Columns used
`phonsobon/khmer-artical-summaries`	`content` → `summaries`
`phonsobon/khmer-text-summarization-v2`	`article` → `summaries`
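
Because the two datasets name their article column differently, they need to be mapped to one schema before training. The following is a minimal sketch of that mapping, not the author's preprocessing script: the output keys `source` and `target`, the function name `to_training_pair`, and the sample row are hypothetical, and applying the `summarize: ` prefix at this step is an assumption.

```python
# Source and target column names per dataset, copied from the table above.
COLUMN_MAP = {
    "phonsobon/khmer-artical-summaries": ("content", "summaries"),
    "phonsobon/khmer-text-summarization-v2": ("article", "summaries"),
}

TASK_PREFIX = "summarize: "


def to_training_pair(dataset_name, row):
    """Map one dataset row (a dict) to a prefixed source text and its target summary."""
    source_col, target_col = COLUMN_MAP[dataset_name]
    return {
        "source": TASK_PREFIX + row[source_col],
        "target": row[target_col],
    }


# Illustrative row; real rows would come from the datasets listed above.
pair = to_training_pair(
    "phonsobon/khmer-text-summarization-v2",
    {"article": "អត្ថបទ", "summaries": "សង្ខេប"},
)
```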

Limitations

Optimised for Khmer-language news articles.
ROUGE scores are computed at the character level because no Khmer word segmenter was used; treat them as relative indicators, not absolute measures of quality.
The model may struggle with very short or colloquial Khmer text that falls outside the training distribution.
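
To make the character-level caveat concrete, here is a minimal sketch of ROUGE-N computed over characters instead of words. It is not the scoring script behind the numbers above, which the card does not include. Dropping whitespace and reporting F1 on a 0 to 1 scale are assumptions, and `char_ngrams` and `char_rouge_n` are hypothetical names.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))


def char_rouge_n(reference, candidate, n=1):
    """F1 of clipped character n-gram overlap between reference and candidate."""
    ref = char_ngrams(reference, n)
    cand = char_ngrams(candidate, n)
    overlap = sum((ref & cand).values())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Two of the candidate's four characters also occur in the reference.
score = char_rouge_n("abcd", "abxy", n=1)
```

Character n-grams are much shorter than words, so two summaries share more of them by chance than they would share whole words. That is one reason to compare these scores only against other scores computed the same way.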


Model size: 0.3B params (Safetensors, tensor type F32)
