# Catalan Stories Model
This is a small Llama-style language model trained on a dataset of synthetically generated short stories in Catalan, using a custom 512-token vocabulary.
Try it out here: https://huggingface.co/spaces/sdobson/catalan-stories-6m
## Model Details
- Architecture: Llama (decoder-only transformer)
- Parameters: ~6M parameters
- Hidden size: 256
- Layers: 8
- Attention heads: 8
- KV heads: 4 (Grouped Query Attention)
- Vocabulary size: 512 (custom SentencePiece tokenizer)
- Max sequence length: 256 tokens
- Training data: Catalan stories dataset
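For reference, here is a minimal `LlamaConfig` sketch matching the hyperparameters above. The intermediate (MLP) size is not listed on this card, so the value below is an assumption based on llama2.c's default sizing for a 256-dimensional model:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Configuration mirroring the hyperparameters listed above.
config = LlamaConfig(
    vocab_size=512,
    hidden_size=256,
    intermediate_size=704,    # assumption: not stated on this card
    num_hidden_layers=8,
    num_attention_heads=8,
    num_key_value_heads=4,    # Grouped Query Attention
    max_position_embeddings=256,
)

model = LlamaForCausalLM(config)
# Should land in the ~6M range under the intermediate_size assumption above.
print(sum(p.numel() for p in model.parameters()))
```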
## Custom Tokenizer
This model uses a custom SentencePiece tokenizer trained specifically on our dataset with a vocabulary size of only 512 tokens. This makes the model:
- Very lightweight and fast
- Optimised for simple Catalan stories
- Easy to deploy in resource-constrained environments
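The exact tokenizer training command is not part of this card, but a 512-token SentencePiece model like this one can be trained in a few lines with the `sentencepiece` library. The corpus path, model prefix, and model type below are placeholders for illustration, not the settings actually used:

```python
import sentencepiece as spm

# Illustrative only: file names and options other than vocab_size are assumptions.
spm.SentencePieceTrainer.train(
    input="catalan_stories.txt",   # placeholder: plain-text corpus, one story per line
    model_prefix="tok512",
    vocab_size=512,
    model_type="bpe",              # assumption; sentencepiece defaults to unigram
)

sp = spm.SentencePieceProcessor(model_file="tok512.model")
print(sp.encode("Hi havia una vegada", out_type=str))
```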
## Usage

```python
from transformers import LlamaForCausalLM, LlamaTokenizer
import torch

# Load model and tokenizer
model = LlamaForCausalLM.from_pretrained("samdobson/catalan-stories-6m")
tokenizer = LlamaTokenizer.from_pretrained("samdobson/catalan-stories-6m")

# Generate text
prompt = "Hi havia una vegada"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.8,
        top_k=40,
        do_sample=True,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
## Training Details
- Framework: llama2.c (PyTorch)
- Dataset: Catalan stories
- Tokenizer: Custom SentencePiece model (512 vocab)
- Hardware: GeForce RTX 3060
- Training time: ~30 minutes
## Limitations
- Domain-specific: The model is optimized for simple Catalan stories and may not generalize well to other domains
- Small vocabulary: With only 512 tokens, the model has limited vocabulary coverage
- Short context: Maximum sequence length of 256 tokens
- Size: While efficient, this is a small model (~6M parameters) and has limited capabilities compared to larger models
## Intended Use
This model is intended for:
- Educational purposes
- Learning about language models and tokenization
- Lightweight text generation in resource-constrained environments
- Generating simple children's stories
- Experimentation with custom tokenizers
## Training Data
The model was trained on the Catalan stories dataset, which consists of short stories written in simple Catalan, generated synthetically to be suitable for language learning.
## License
MIT License