---
language:
- ca
license: mit
tags:
- llama
- tinystories
- text-generation
- custom-tokenizer
pipeline_tag: text-generation
---

# Catalan Stories Model

This is a small Llama-style language model trained on a dataset of synthetically generated short stories in Catalan, using a custom 512-token vocabulary.

Try it out here: https://huggingface.co/spaces/sdobson/catalan-stories-6m

## Model Details

- **Architecture**: Llama (decoder-only transformer)
- **Parameters**: ~6M
- **Hidden size**: 256
- **Layers**: 8
- **Attention heads**: 8
- **KV heads**: 4 (Grouped Query Attention)
- **Vocabulary size**: 512 (custom SentencePiece tokenizer)
- **Max sequence length**: 256 tokens
- **Training data**: [Catalan stories](https://huggingface.co/datasets/sdobson/catalan-stories) dataset

## Custom Tokenizer

This model uses a custom SentencePiece tokenizer trained specifically on our dataset, with a vocabulary of only 512 tokens (a rough training sketch is included at the end of this card). This makes the model:

- Very lightweight and fast
- Optimized for simple Catalan stories
- Easy to deploy in resource-constrained environments

## Usage

```python
from transformers import LlamaForCausalLM, LlamaTokenizer
import torch

# Load model and tokenizer
model = LlamaForCausalLM.from_pretrained("samdobson/catalan-stories-6m")
tokenizer = LlamaTokenizer.from_pretrained("samdobson/catalan-stories-6m")

# Generate text
prompt = "Hi havia una vegada"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.8,
        top_k=40,
        do_sample=True
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## Training Details

- **Framework**: llama2.c (PyTorch)
- **Dataset**: [Catalan stories](https://huggingface.co/datasets/sdobson/catalan-stories)
- **Tokenizer**: Custom SentencePiece model (512-token vocabulary)
- **Hardware**: GeForce RTX 3060
- **Training time**: ~30 minutes

## Limitations

- **Domain-specific**: The model is optimized for simple Catalan stories and may not generalize well to other domains
- **Small vocabulary**: With only 512 tokens, the model has limited vocabulary coverage
- **Short context**: Maximum sequence length of 256 tokens
- **Size**: While efficient, this is a small model (~6M parameters) with limited capabilities compared to larger models

## Intended Use

This model is intended for:

- Educational purposes
- Learning about language models and tokenization
- Lightweight text generation in resource-constrained environments
- Generating simple children's stories
- Experimentation with custom tokenizers

## Training Data

The model was trained on the [Catalan stories dataset](https://huggingface.co/datasets/sdobson/catalan-stories), which consists of short stories written in simple Catalan, generated synthetically to be suitable for language learning.

## License

MIT License
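## Tokenizer Training Sketch

As a rough illustration of how a 512-token tokenizer like the one described above could be produced, the snippet below trains a small SentencePiece model with the `sentencepiece` Python package. The input file name, model prefix, and BPE model type are assumptions for the sketch, not the exact settings used for this model.

```python
import sentencepiece as spm

# Hypothetical example: train a 512-token SentencePiece model on the story corpus.
# "stories.txt" (one story per line), the "catalan512" prefix, and the BPE model
# type are assumptions, not the exact configuration used for this model.
spm.SentencePieceTrainer.train(
    input="stories.txt",        # plain-text dump of the training stories
    model_prefix="catalan512",  # writes catalan512.model and catalan512.vocab
    vocab_size=512,
    model_type="bpe",
)

# Load the trained tokenizer and round-trip a sample prompt
sp = spm.SentencePieceProcessor(model_file="catalan512.model")
ids = sp.encode("Hi havia una vegada", out_type=int)
print(ids)
print(sp.decode(ids))
```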