🇹🇭 Thai NLP Toolkit

A multi-task NLP framework for the Thai language built from scratch with PyTorch.

The model uses a shared Transformer encoder backbone with three task-specific heads:

| Task | Head | Metric |
|---|---|---|
| Named Entity Recognition | Token classification (7 labels) | Entity-level F1 |
| Sentiment Analysis | Sentence classification (3 labels) | Macro-F1 |
| Question Answering | Extractive span prediction | EM / F1 |
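
For readers unfamiliar with Macro-F1, the sentiment metric listed above, here is a minimal sketch of how it is commonly computed: per-class F1 scores averaged with equal weight. The label names and example predictions are hypothetical, and this is not necessarily the evaluation code used in the repository.

```python
from collections import Counter

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)

# Hypothetical label names and predictions, for illustration only
labels = ["positive", "neutral", "negative"]
y_true = ["positive", "neutral", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive"]
score = macro_f1(y_true, y_pred, labels)
```

Because every class counts equally, a class the model never predicts correctly pulls the score down sharply, no matter how rare that class is.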

Model Architecture

  • Tokenizer: SentencePiece BPE (32K vocab) with Thai-specific preprocessing
  • Encoder: 6-layer Transformer (d_model=256, 8 heads, d_ff=1024)
  • Max sequence length: 512 tokens
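
To get a feel for the model size these hyperparameters imply, the sketch below estimates the parameter count of the embedding table and encoder stack. It assumes a standard Transformer encoder layer (four attention projections and two feed-forward matrices, all with biases, plus two LayerNorms) and a vocabulary of exactly 32,000 entries. Positional embeddings, any final normalization, and the task heads are not counted, and the actual implementation may differ.

```python
def encoder_param_count(vocab_size=32_000, d_model=256, d_ff=1024, n_layers=6):
    """Rough parameter estimate under the assumptions stated above."""
    embedding = vocab_size * d_model
    # Q, K, V and output projections, each d_model x d_model plus a bias
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> d_ff -> d_model, with biases
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per layer, each with a scale and a shift vector
    layer_norms = 2 * (2 * d_model)
    per_layer = attention + feed_forward + layer_norms
    return {
        "embedding": embedding,
        "per_layer": per_layer,
        "total": embedding + n_layers * per_layer,
    }

counts = encoder_param_count()
head_dim = 256 // 8  # d_model split evenly across 8 attention heads
```

Under these assumptions the token embeddings account for about 8.2M parameters and the six encoder layers for about 4.7M, for roughly 12.9M in total.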

Usage

# Clone the repository first
# git clone https://github.com/puttibenz/thai-nlp-toolkit.git

from inference.pipeline import ThaiNLPPipeline

pipeline = ThaiNLPPipeline(model_dir="path/to/downloaded/model", device="auto")

# NER
result = pipeline.predict("สมชายทำงานที่กรุงเทพ", task="ner")

# Sentiment Analysis
result = pipeline.predict("อาหารอร่อยมากครับ", task="sentiment")

# Question Answering
result = pipeline.predict(
    "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย",
    task="qa",
    question="เมืองหลวงของประเทศไทยคืออะไร"
)

Training Data

| Dataset | Task | Source |
|---|---|---|
| ThaiNER v2.2 | NER | pythainlp/thainer-corpus-v2.2 |
| Wisesight Sentiment | Sentiment | pythainlp/wisesight_sentiment |
| iApp Thai Wiki QA | QA | iapp_wiki_qa_squad |

Training Details

  • Framework: PyTorch (custom implementation)
  • Training: Multi-task learning with round-robin sampling
  • Optimizer: AdamW with cosine LR schedule + warmup
  • Mixed Precision: FP16 on CUDA
  • Batch Size: 32 (×4 gradient accumulation = effective 128)
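
The settings above can be sketched in plain Python: a round-robin task order, the effective batch size arithmetic, and a warmup-plus-cosine learning-rate curve. The peak learning rate, step counts, and function name are hypothetical, and the repository's training loop may implement each of these differently (for example, decaying to a non-zero floor).

```python
import math
from itertools import cycle, islice

def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Round-robin sampling: each step draws a batch from the next task in turn
tasks = ["ner", "sentiment", "qa"]
schedule = list(islice(cycle(tasks), 6))

# Gradient accumulation: 4 micro-batches of 32 per optimizer step
micro_batch, accum_steps = 32, 4
effective_batch = micro_batch * accum_steps

# Hypothetical values: 1,000 total steps, 100 warmup steps, peak LR 1e-3
lrs = [lr_at(s, total_steps=1000, warmup_steps=100, peak_lr=1e-3) for s in (0, 99, 550, 1000)]
```

With round-robin ordering, every task contributes the same number of batches per cycle regardless of dataset size, so smaller datasets are revisited more often than larger ones.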

File Structure

thai-nlp-toolkit/
├── checkpoint.pt              # Model weights
├── config.yaml                # Model architecture config
└── tokenizer/
    ├── thai_bpe.model         # SentencePiece BPE model
    └── tokenizer_config.json  # Tokenizer config
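
Before pointing `model_dir` at a downloaded copy, it can help to confirm that the layout matches the tree above. The helper below is a hypothetical convenience, not part of the toolkit; it only checks that the four listed files exist.

```python
from pathlib import Path

# Relative paths taken from the file structure listed above
EXPECTED_FILES = [
    "checkpoint.pt",
    "config.yaml",
    "tokenizer/thai_bpe.model",
    "tokenizer/tokenizer_config.json",
]

def missing_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means all four files are present; otherwise the returned names show what still needs to be downloaded.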

Source Code

GitHub: puttibenz/thai-nlp-toolkit

License

MIT
