wilai-2.0

Version: 1.0.0

Wilai-2.0 is a Thai-specific GPT-like language model with enhanced word-level tokenization. The tokenizer is trained on pre-segmented Thai text using advanced word segmentation techniques, resulting in better linguistic alignment and more efficient tokenization for Thai language processing.

Recent updates (2025-10-31)

  • The fast tokenizer (tokenizer.json) has been rebuilt from the repository's SentencePiece vocabulary and uploaded to the model repo on the Hugging Face Hub, so AutoTokenizer.from_pretrained now downloads the canonical tokenizer matching the model's vocab_size (1000); a quick check is sketched after this list.
  • Auto APIs (AutoTokenizer + AutoModelForCausalLM) now load end-to-end without requiring trust_remote_code=True in our test environment; see "How to Use" below for recommended usage.
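
A minimal check that the rebuilt tokenizer matches the model's vocab_size (a sketch; assumes only that transformers is installed):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('JonusNattapong/wilai-2.0')
print(len(tokenizer))  # expected: 1000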

Model Details

Key               Value
vocab_size        1000
block_size        128
n_layer           6
n_head            8
n_embd            512
parameter_count   20004864
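
The parameter count can be verified locally (a minimal sketch; assumes the model loads via the Auto API as described under "How to Use"):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('JonusNattapong/wilai-2.0')
print(sum(p.numel() for p in model.parameters()))  # expected: 20004864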

Tokenization

This model uses an enhanced SentencePiece tokenizer with advanced word-level segmentation for Thai text:

  • Word-level Training: SentencePiece BPE trained on pre-segmented Thai words for better linguistic alignment
  • Known Segmentations: Dictionary of common Thai phrases with correct word boundaries
  • PyThaiNLP Integration: Automatic word segmentation for unknown text
  • Improved Efficiency: Fewer tokens needed for common Thai expressions

Example tokenization:

Input: 'แม่อย่าคิดมาก' (Mom, don't think too much)
Words: ['แม่', 'อย่า', 'คิด', 'มาก']
Tokens: ['▁แม่', '▁อย่า', '▁คิด', '▁มาก'] (4 tokens - word-level)
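
A minimal sketch of the segment-then-encode pipeline above (assumes pythainlp, sentencepiece, and huggingface_hub are installed; thai_sp.model is the SentencePiece model referenced under "Using the Original Implementation", and exact word boundaries may differ from the training-time dictionary):

from huggingface_hub import hf_hub_download
from pythainlp.tokenize import word_tokenize
import sentencepiece as spm

# Download the SentencePiece model from the repo and load it
sp_path = hf_hub_download(repo_id='JonusNattapong/wilai-2.0', filename='thai_sp.model')
sp = spm.SentencePieceProcessor(model_file=sp_path)

text = 'แม่อย่าคิดมาก'
words = word_tokenize(text)                        # e.g. ['แม่', 'อย่า', 'คิด', 'มาก']
pieces = sp.encode(' '.join(words), out_type=str)  # e.g. ['▁แม่', '▁อย่า', '▁คิด', '▁มาก']
print(words)
print(pieces)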

Intended Uses

  • Thai text generation and completion
  • Research on Thai language models
  • Fine-tuning for Thai-specific NLP tasks

Limitations

  • Trained only on Thai Wikipedia data, so it may not generalize well to other domains
  • Limited context window (block_size = 128); longer inputs must be truncated (see the sketch after this list)
  • May generate biased or inappropriate content
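
A minimal way to stay within the context window when tokenizing (a sketch; truncation and max_length are standard tokenizer options, and 128 is the block_size from "Model Details"):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('JonusNattapong/wilai-2.0')
inputs = tokenizer('ข้อความที่ยาวมาก ...', return_tensors='pt', truncation=True, max_length=128)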

How to Use

Recommended: Using AutoTokenizer + AutoModelForCausalLM (standard Transformers)

The repository now provides a canonical tokenizer.json on the Hub. The simplest and recommended way to load the tokenizer and model is through the standard Transformers Auto APIs; in our smoke tests they load correctly without requiring trust_remote_code=True.

from transformers import AutoTokenizer, AutoModelForCausalLM

# Loads the canonical tokenizer.json from the Hub and the model implementation
tokenizer = AutoTokenizer.from_pretrained('JonusNattapong/wilai-2.0')
model = AutoModelForCausalLM.from_pretrained('JonusNattapong/wilai-2.0')

text = 'สวัสดีครับ นี่คือการทดสอบ'
inputs = tokenizer(text, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Notes:

  • If you still need the custom tokenizer class (it contains additional segmentation logic), you can load it via wilai_transformers.WilaiTokenizer.from_pretrained(...) and use it to pre-process text before calling the standard model.
  • Run the provided smoke test locally to validate your environment:
python test_auto_apis.py

Alternative: Using the Custom WilaiTokenizer

The model itself loads with standard Transformers; use this path if you need the custom tokenizer class and its additional segmentation logic:

import torch
from transformers import AutoModelForCausalLM
from wilai_transformers import WilaiTokenizer

# Load model with standard Transformers
model = AutoModelForCausalLM.from_pretrained('JonusNattapong/wilai-2.0')

# Load tokenizer separately using the custom class
tokenizer = WilaiTokenizer.from_pretrained('JonusNattapong/wilai-2.0')

# Generate text: generate() expects a batched tensor of token ids, not a plain list
input_text = 'สวัสดีครับ'
token_ids = tokenizer.encode(input_text, add_special_tokens=True)
input_ids = torch.tensor([token_ids])
outputs = model.generate(input_ids, max_length=50, do_sample=True, temperature=0.8)
generated_text = tokenizer.decode(outputs[0].tolist())
print(generated_text)

Using the Original Implementation

For the original implementation, clone this repository and run from its root so that the src package is importable:

from huggingface_hub import hf_hub_download
from src.inference import ThaiTextGenerator, GenerationConfig

# Download model artifacts
model_path = hf_hub_download(repo_id='JonusNattapong/wilai-2.0', filename='pytorch_model.bin')
config_path = hf_hub_download(repo_id='JonusNattapong/wilai-2.0', filename='config.json')
tokenizer_path = hf_hub_download(repo_id='JonusNattapong/wilai-2.0', filename='thai_sp.model')

# Load model
generator = ThaiTextGenerator.from_pretrained(model_path, config_path, tokenizer_path)
generated = generator.generate('สวัสดี', config=GenerationConfig(max_tokens=20))
print(generated)

Training Details

  • Architecture: GPT-like decoder-only transformer
  • Training Data: Thai Wikipedia articles
  • Tokenization: SentencePiece BPE
  • Framework: PyTorch

Evaluation

Evaluation results and benchmarks will be added here.

Citation

If you use this model, please cite:

@misc{wilai,
  title={Wilai-2.0: A Thai Language Model},
  author={Nattapong Tapachoom},
  year={2025},
  url={https://huggingface.co/JonusNattapong/wilai-2.0}
}

License

This model is released under the MIT License.
