Khmer Text Summarization: mT5-small

An abstractive summarization model for the Khmer language, fine-tuned from google/mt5-small on two Khmer news datasets.

Note: Khmer has no spaces between words. The mT5 SentencePiece tokenizer handles all subword segmentation automatically, so do not apply any word-splitting pre-processing.

Usage

import re
import unicodedata

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("phonsobon/khmer-text-summarization-1024k", use_fast=False)
model     = AutoModelForSeq2SeqLM.from_pretrained("phonsobon/khmer-text-summarization-1024k")
model.eval()

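def clean_khmer(text):
    """Normalize Unicode, strip HTML tags and URLs, and collapse spaces."""
    text = unicodedata.normalize("NFC", text)           # canonical Unicode composition
    text = re.sub(r"<[^>]+>|https?://\S+", " ", text)   # replace HTML tags and URLs with a space
    text = re.sub(r"[ \t]+", " ", text)                 # collapse runs of spaces and tabs
    return text.strip()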

article = "បញ្ចូលអត្ថបទខ្មែររបស់អ្នកនៅទីនេះ ..."   # your Khmer article (placeholder reads "Enter your Khmer article here")

inputs = tokenizer(
    "summarize: " + clean_khmer(article),
    return_tensors="pt",
    max_length=1024,
    truncation=True,
)

output_ids = model.generate(**inputs)   # uses the generation config saved with the model
summary    = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(summary)
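
The cleaning step can be checked on its own, without loading the model. The sketch below repeats clean_khmer from the usage code and runs it on a made-up sample string; the sample text and the example.com URL are illustrative only and do not come from the training data.

```python
import re
import unicodedata


def clean_khmer(text):
    """Same cleaning as in the usage code: NFC, strip tags/URLs, collapse spaces."""
    text = unicodedata.normalize("NFC", text)           # canonical Unicode composition
    text = re.sub(r"<[^>]+>|https?://\S+", " ", text)   # replace HTML tags and URLs with a space
    text = re.sub(r"[ \t]+", " ", text)                 # collapse runs of spaces and tabs
    return text.strip()


# Hypothetical input: Khmer text wrapped in markup, followed by a link.
sample = "<p>សួស្តី\t\tពិភពលោក</p>  https://example.com/news"
cleaned = clean_khmer(sample)
print(cleaned)
```

Newlines are left untouched, since only spaces and tabs are collapsed; paragraph breaks in the article therefore reach the tokenizer as they are.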

Evaluation: Validation set

| Metric     | Score    |
|------------|----------|
| Loss       | 0.2025   |
| ROUGE-1    | 40.7835  |
| ROUGE-2    | 39.2445  |
| ROUGE-L    | 40.8362  |
| ROUGE-Lsum | 40.7699  |
| Gen Len    | 255.0000 |

Evaluation: Test set

| Metric     | Score    |
|------------|----------|
| Loss       | 0.2321   |
| ROUGE-1    | 42.1170  |
| ROUGE-2    | 40.8878  |
| ROUGE-L    | 42.0827  |
| ROUGE-Lsum | 42.0164  |
| Gen Len    | 255.0000 |

Training details

| Setting            | Value                      |
|--------------------|----------------------------|
| Base model         | google/mt5-small           |
| Fine-tuning method | LoRA (merged)              |
| Task prefix        | summarize:                 |
| Max input length   | 1024 tokens                |
| Max target length  | 256 tokens                 |
| Epochs             | 10                         |
| Learning rate      | 0.0005                     |
| Beam search        | 4 beams                    |
| No-repeat n-gram   | 3                          |
| Training date      | 2026-06-22T08:16:44.850432 |
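
The usage code relies on the generation settings stored with the model. If they need to be made explicit or overridden, the decoding values listed above can be passed to generate directly. This is a minimal sketch, assuming the stored generation config matches the listed settings; GENERATION_KWARGS and summarize are illustrative names, and mapping "Max target length" to max_length is an assumption rather than something the card states.

```python
# Decoding settings taken from the training details above.
GENERATION_KWARGS = {
    "num_beams": 4,              # Beam search: 4 beams
    "no_repeat_ngram_size": 3,   # No-repeat n-gram: 3
    "max_length": 256,           # assumption: mirrors "Max target length"
}


def summarize(model, tokenizer, text, **overrides):
    """Generate one summary; keyword overrides replace the defaults above."""
    inputs = tokenizer(
        "summarize: " + text,    # task prefix used during fine-tuning
        return_tensors="pt",
        max_length=1024,         # Max input length
        truncation=True,
    )
    kwargs = {**GENERATION_KWARGS, **overrides}
    output_ids = model.generate(**inputs, **kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the model and tokenizer loaded as in the usage section, summarize(model, tokenizer, clean_khmer(article), num_beams=2) would trade some quality for speed while keeping the other settings.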

Datasets

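| Dataset                               | Columns used (source → target) |
|---------------------------------------|--------------------------------|
| phonsobon/khmer-artical-summaries     | content → summaries            |
| phonsobon/khmer-text-summarization-v2 | article → summaries            |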
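
Because the two datasets name their source column differently, records have to be brought to a common schema before they can be mixed. Below is a minimal sketch of that mapping, assuming each record is a plain dict; SOURCE_COLUMNS, TARGET_COLUMN, and to_pair are hypothetical helpers, not the author's preprocessing code.

```python
# Source-text column per dataset, as listed above; both use "summaries" as the target.
SOURCE_COLUMNS = {
    "phonsobon/khmer-artical-summaries": "content",
    "phonsobon/khmer-text-summarization-v2": "article",
}
TARGET_COLUMN = "summaries"


def to_pair(dataset_name, record):
    """Map one raw record to a uniform {"source", "target"} pair."""
    source_column = SOURCE_COLUMNS[dataset_name]
    return {
        "source": "summarize: " + record[source_column],  # task prefix from the card
        "target": record[TARGET_COLUMN],
    }


# Toy records standing in for real dataset rows.
print(to_pair("phonsobon/khmer-artical-summaries", {"content": "A", "summaries": "B"}))
print(to_pair("phonsobon/khmer-text-summarization-v2", {"article": "C", "summaries": "D"}))
```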

Limitations

  • Optimised for Khmer-language news articles.
  • ROUGE scores are computed at the character level (no Khmer word segmenter was used), so treat them as relative rather than absolute measures of quality.
  • The model may struggle with very short or colloquial Khmer text outside the training distribution.
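
To make the character-level scoring concrete: each character is treated as one token, so overlap is counted between characters rather than words. The sketch below computes a character-level ROUGE-1 and ROUGE-L F1 in plain Python. It only illustrates the idea; the evaluation script, whitespace handling, and ROUGE implementation behind the reported scores are not specified in this card, and all function names here are illustrative.

```python
from collections import Counter


def char_tokens(text):
    """Treat every non-whitespace character as one token."""
    return [ch for ch in text if not ch.isspace()]


def f1(overlap, pred_len, ref_len):
    """F1 from an overlap count; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / pred_len, overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge1_char(prediction, reference):
    """Unigram-overlap F1 over characters."""
    pred, ref = char_tokens(prediction), char_tokens(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())  # clipped counts
    return f1(overlap, len(pred), len(ref))


def rougeL_char(prediction, reference):
    """Longest-common-subsequence F1 over characters."""
    pred, ref = char_tokens(prediction), char_tokens(reference)
    # Classic dynamic-programming LCS table.
    dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, r in enumerate(ref, 1):
            if p == r:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return f1(dp[-1][-1], len(pred), len(ref))


print(rouge1_char("acb", "abc"), rougeL_char("acb", "abc"))
```

Character overlap tends to be easier to achieve than word overlap, which is one reason to compare these scores only against other character-level results.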