# Catalan Stories Model
This is a small Llama-style language model trained on a dataset of synthetically generated short stories in Catalan, using a custom 512-token vocabulary.
Try it out here: https://huggingface.co/spaces/sdobson/catalan-stories-6m
## Model Details
- Architecture: Llama (decoder-only transformer)
- Parameters: ~6M parameters
- Hidden size: 256
- Layers: 8
- Attention heads: 8
- KV heads: 4 (Grouped Query Attention)
- Vocabulary size: 512 (custom SentencePiece tokenizer)
- Max sequence length: 256 tokens
- Training data: Catalan stories dataset
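For reference, here is a minimal `LlamaConfig` sketch matching the hyperparameters above. The intermediate (MLP) size is not listed on this card, so the value below is an assumption based on llama2.c's default sizing for a 256-dimensional model:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Configuration mirroring the hyperparameters listed above.
config = LlamaConfig(
    vocab_size=512,
    hidden_size=256,
    intermediate_size=704,    # assumption: not stated on this card
    num_hidden_layers=8,
    num_attention_heads=8,
    num_key_value_heads=4,    # Grouped Query Attention
    max_position_embeddings=256,
)

model = LlamaForCausalLM(config)
# Should land in the ~6M range under the intermediate_size assumption above.
print(sum(p.numel() for p in model.parameters()))
```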
## Custom Tokenizer
This model uses a custom SentencePiece tokenizer trained specifically on our dataset with a vocabulary size of only 512 tokens. This makes the model:
- Very lightweight and fast
- Optimised for simple Catalan stories
- Easy to deploy in resource-constrained environments
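The exact tokenizer training command is not part of this card, but a 512-token SentencePiece model like this one can be trained in a few lines with the `sentencepiece` library. The corpus path, model prefix, and model type below are placeholders for illustration, not the settings actually used:

```python
import sentencepiece as spm

# Illustrative only: file names and options other than vocab_size are assumptions.
spm.SentencePieceTrainer.train(
    input="catalan_stories.txt",   # placeholder: plain-text corpus, one story per line
    model_prefix="tok512",
    vocab_size=512,
    model_type="bpe",              # assumption; sentencepiece defaults to unigram
)

sp = spm.SentencePieceProcessor(model_file="tok512.model")
print(sp.encode("Hi havia una vegada", out_type=str))
```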
## Usage

```python
from transformers import LlamaForCausalLM, LlamaTokenizer
import torch

# Load model and tokenizer
model = LlamaForCausalLM.from_pretrained("samdobson/catalan-stories-6m")
tokenizer = LlamaTokenizer.from_pretrained("samdobson/catalan-stories-6m")

# Generate text
prompt = "Hi havia una vegada"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.8,
        top_k=40,
        do_sample=True,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
## Training Details
- Framework: llama2.c (PyTorch)
- Dataset: Catalan stories
- Tokenizer: Custom SentencePiece model (512 vocab)
- Hardware: GeForce RTX 3060
- Training time: ~30 minutes
## Limitations
- Domain-specific: The model is optimized for simple Catalan stories and may not generalize well to other domains
- Small vocabulary: With only 512 tokens, the model has limited vocabulary coverage
- Short context: Maximum sequence length of 256 tokens
- Size: While efficient, this is a small model (~6M parameters) and has limited capabilities compared to larger models
## Intended Use
This model is intended for:
- Educational purposes
- Learning about language models and tokenization
- Lightweight text generation in resource-constrained environments
- Generating simple children's stories
- Experimentation with custom tokenizers
## Training Data
The model was trained on the Catalan stories dataset, which consists of short stories written in simple Catalan, generated synthetically to be suitable for language learning.
## License
MIT License