---
language:
- ca
license: mit
tags:
- llama
- tinystories
- text-generation
- custom-tokenizer
pipeline_tag: text-generation
---

# Catalan Stories Model

This is a small Llama-style language model trained on a dataset of synthetically generated short stories in Catalan, using a custom 512-token vocabulary.

Try it out here: https://huggingface.co/spaces/sdobson/catalan-stories-6m

## Model Details

- **Architecture**: Llama (decoder-only transformer)
- **Parameters**: ~6M
- **Hidden size**: 256
- **Layers**: 8
- **Attention heads**: 8
- **KV heads**: 4 (Grouped Query Attention)
- **Vocabulary size**: 512 (custom SentencePiece tokenizer)
- **Max sequence length**: 256 tokens
- **Training data**: [Catalan stories](https://huggingface.co/datasets/sdobson/catalan-stories) dataset

## Custom Tokenizer

This model uses a custom SentencePiece tokenizer trained specifically on our dataset, with a vocabulary of only 512 tokens (a rough training sketch is included at the end of this card). This makes the model:

- Very lightweight and fast
- Optimized for simple Catalan stories
- Easy to deploy in resource-constrained environments

## Usage

```python
from transformers import LlamaForCausalLM, LlamaTokenizer
import torch

# Load model and tokenizer
model = LlamaForCausalLM.from_pretrained("samdobson/catalan-stories-6m")
tokenizer = LlamaTokenizer.from_pretrained("samdobson/catalan-stories-6m")

# Generate text
prompt = "Hi havia una vegada"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.8,
        top_k=40,
        do_sample=True
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## Training Details

- **Framework**: llama2.c (PyTorch)
- **Dataset**: [Catalan stories](https://huggingface.co/datasets/sdobson/catalan-stories)
- **Tokenizer**: Custom SentencePiece model (512-token vocabulary)
- **Hardware**: GeForce RTX 3060
- **Training time**: ~30 minutes

## Limitations

- **Domain-specific**: The model is optimized for simple Catalan stories and may not generalize well to other domains
- **Small vocabulary**: With only 512 tokens, the model has limited vocabulary coverage
- **Short context**: Maximum sequence length of 256 tokens
- **Size**: While efficient, this is a small model (~6M parameters) with limited capabilities compared to larger models

## Intended Use

This model is intended for:

- Educational purposes
- Learning about language models and tokenization
- Lightweight text generation in resource-constrained environments
- Generating simple children's stories
- Experimentation with custom tokenizers

## Training Data

The model was trained on the [Catalan stories dataset](https://huggingface.co/datasets/sdobson/catalan-stories), which consists of short stories written in simple Catalan, generated synthetically to be suitable for language learning.

## License

MIT License
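## Tokenizer Training Sketch

As a rough illustration of how a 512-token tokenizer like the one described above could be produced, the snippet below trains a small SentencePiece model with the `sentencepiece` Python package. The input file name, model prefix, and BPE model type are assumptions for the sketch, not the exact settings used for this model.

```python
import sentencepiece as spm

# Hypothetical example: train a 512-token SentencePiece model on the story corpus.
# "stories.txt" (one story per line), the "catalan512" prefix, and the BPE model
# type are assumptions, not the exact configuration used for this model.
spm.SentencePieceTrainer.train(
    input="stories.txt",        # plain-text dump of the training stories
    model_prefix="catalan512",  # writes catalan512.model and catalan512.vocab
    vocab_size=512,
    model_type="bpe",
)

# Load the trained tokenizer and round-trip a sample prompt
sp = spm.SentencePieceProcessor(model_file="catalan512.model")
ids = sp.encode("Hi havia una vegada", out_type=int)
print(ids)
print(sp.decode(ids))
```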