Update on 26.05.2025
For those interested in building NLP applications for Estonian: Llammas is no longer the best open-weight model available for it. For a detailed comparison of open LLMs on Estonian, see our recent paper.
For an up-to-date leaderboard of both proprietary and open models, please visit baromeeter.tartunlp.ai (and consider contributing if you speak Estonian).
Llammas
Llama-2-7B instruction-tuned for Estonian in two stages:
- Continued pre-training: 5B tokens of CulturaX, with 75% of the documents in Estonian and 25% in English (see Llammas-base),
- Instruction-tuning: Alpaca-cleaned, Alpaca-est, OASST1 top-1 English conversations, CoT and FLAN-V2 following open-instruct (10,000 examples each), WMT18 English-Estonian translation development data (as documents), and general MTee validation English-Estonian held-out data.
Alpaca-est is an instruction dataset generated for Estonian with gpt-3.5-turbo-0613, following Alpaca. More details in our paper.
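The continued-pretraining checkpoint (Llammas-base) can also be used on its own, for example as a starting point for instruction-tuning on other data. Below is a minimal loading sketch with the standard transformers API; the prompt and generation settings are illustrative, not the ones used in the paper:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the continued-pretraining checkpoint (no instruction-tuning applied yet)
tokenizer = AutoTokenizer.from_pretrained("tartuNLP/Llammas-base")
model = AutoModelForCausalLM.from_pretrained(
    "tartuNLP/Llammas-base", torch_dtype=torch.bfloat16, device_map="auto"
)

# Prompt the base model with plain text, since it is not instruction-tuned
inputs = tokenizer("Eesti keel on", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))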
Additional resources:
- Paper: https://aclanthology.org/2024.findings-naacl.210/
- Code: github.com/TartuNLP/llammas
- Base model: tartuNLP/Llammas-base
- 4-bit quantized model in GGUF: AlbertUnn/LlammasGGUF (see the llama.cpp sketch after this list)
- Alpaca-est dataset: github.com/TartuNLP/alpaca-est
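The GGUF checkpoint can be run on CPU through llama.cpp. A minimal sketch with the llama-cpp-python bindings follows; the model_path filename is a placeholder (use the actual file from the AlbertUnn/LlammasGGUF repository), and the prompt follows the conversational format shown further below:
from llama_cpp import Llama

# model_path is a placeholder filename; download the GGUF file from AlbertUnn/LlammasGGUF
llm = Llama(model_path="llammas-q4.gguf", n_ctx=2048)

# Prompt in the <|user|>/<|assistant|> format used by the model (see "Conversational format")
prompt = "<|user|>\nTere!\n<|assistant|>\n"
out = llm(prompt, max_tokens=128, temperature=0.6, stop=["<|user|>"])
print(out["choices"][0]["text"])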
Using the model
Using the model in a text-generation pipeline:
from transformers import pipeline
import torch

# Load the instruction-tuned model in bfloat16 and place it on the available device(s)
pipe = pipeline("text-generation", model="tartuNLP/Llammas", torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "user", "content": "Tere!"},
    {"role": "assistant", "content": "Tere! Kas saaksin teid kuidagi aidata?"},
    {"role": "user", "content": "Kuidas alustada kirja kirjutamist?"}
]

# Render the conversation with the model's chat template and append the assistant tag
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.6, top_k=50, top_p=0.9)

# The pipeline returns prompt + completion, so strip the prompt to get only the new reply
print(outputs[0]["generated_text"][len(prompt):])
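The same chat template can also be applied without the pipeline wrapper. A minimal sketch using AutoModelForCausalLM directly, with the same generation settings as above:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("tartuNLP/Llammas")
model = AutoModelForCausalLM.from_pretrained("tartuNLP/Llammas", torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Kuidas alustada kirja kirjutamist?"}]

# apply_chat_template renders the <|user|>/<|assistant|> format and appends the assistant tag
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.6, top_k=50, top_p=0.9)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))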
Using the model in a conversational pipeline (works with transformers==4.36.2, issues with output in newer versions):
from transformers import pipeline, Conversation
import torch

# The conversational pipeline requires an older transformers release (e.g. 4.36.2)
pipe = pipeline("conversational", model="tartuNLP/Llammas", torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "user", "content": "Tere!"},
    {"role": "assistant", "content": "Tere! Kas saaksin teid kuidagi aidata?"},
    {"role": "user", "content": "Kuidas alustada kirja kirjutamist?"}
]

conversation = Conversation(messages)
conversation = pipe(conversation)

# The generated reply is appended as the last message of the conversation
print(conversation.messages[-1]["content"])
Conversational format:
<|user|>
Tere!
<|assistant|>
Tere! Kas saaksin teid kuidagi aidata?</s>
<|user|>
Kuidas alustada kirja kirjutamist?
<|assistant|>
Kirja kirjutamiseks alustage tervitusega, nΓ€iteks "Tere!" vΓ΅i "Tere hommikust!". SeejΓ€rel tutvustage ennast ja mainige, kellega kirjutate. Kirjeldage oma mΓ΅tteid vΓ΅i kΓΌsimusi, mida soovite arutada. LΓ΅petage kiri viisakalt, nΓ€iteks "TΓ€nan teid tΓ€helepanu eest!" vΓ΅i "Parimate soovidega!"</s>
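For tooling that does not go through the Hugging Face chat template, the format above can be reproduced manually. Below is a small sketch with a hypothetical helper (to_llammas_prompt); for the exact whitespace and special tokens, the tokenizer's apply_chat_template remains the authoritative reference:
def to_llammas_prompt(messages):
    # Render {"role", "content"} messages into the conversational format shown above
    parts = []
    for m in messages:
        if m["role"] == "user":
            parts.append(f"<|user|>\n{m['content']}\n")
        elif m["role"] == "assistant":
            # Assistant turns end with the end-of-sequence token
            parts.append(f"<|assistant|>\n{m['content']}</s>\n")
    # Finish with the assistant tag so the model writes the next reply
    parts.append("<|assistant|>\n")
    return "".join(parts)

print(to_llammas_prompt([
    {"role": "user", "content": "Tere!"},
    {"role": "assistant", "content": "Tere! Kas saaksin teid kuidagi aidata?"},
    {"role": "user", "content": "Kuidas alustada kirja kirjutamist?"}
]))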
Citation
@inproceedings{kuulmets-etal-2024-teaching,
title = "Teaching Llama a New Language Through Cross-Lingual Knowledge Transfer",
author = "Kuulmets, Hele-Andra and
Purason, Taido and
Luhtaru, Agnes and
Fishel, Mark",
editor = "Duh, Kevin and
Gomez, Helena and
Bethard, Steven",
booktitle = "Findings of the Association for Computational Linguistics: NAACL 2024",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-naacl.210",
doi = "10.18653/v1/2024.findings-naacl.210",
pages = "3309--3325",
abstract = "This paper explores cost-efficient methods to adapt pretrained Large Language Models (LLMs) to new lower-resource languages, with a specific focus on Estonian. Leveraging the Llama 2 model, we investigate the impact of combining cross-lingual instruction-tuning with additional monolingual pretraining. Our results demonstrate that even a relatively small amount of additional monolingual pretraining followed by cross-lingual instruction-tuning significantly enhances results on Estonian. Furthermore, we showcase cross-lingual knowledge transfer from high-quality English instructions to Estonian, resulting in improvements in commonsense reasoning and multi-turn conversation capabilities. Our best model, named Llammas, represents the first open-source instruction-following LLM for Estonian. Additionally, we publish Alpaca-est, the first general task instruction dataset for Estonia. These contributions mark the initial progress in the direction of developing open-source LLMs for Estonian.",
}
Evaluation results
- Winogrande-Et (tartuNLP/winogrande_et): 50.370