A 3.04 billion parameter multilingual language model trained from scratch for Hebrew, Arabic, English, and Farsi: four languages spanning three scripts (Latin, Hebrew, Arabic).
| Variant | File | Size | Description |
|---|---|---|---|
| Base (pretrained) | checkpoints/best_model.pt | 11.7 GB | Best pretrained checkpoint (step 20,000) |
| SFT (instruction-tuned) | checkpoints/sft_model.pt | 5.7 GB | Multilingual SFT on Hebrew, Arabic, English, Farsi data |
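As a rough sanity check on the sizes in the table, the sketch below divides each file size by the stated 3.04 billion parameters. It assumes decimal gigabytes and that each file holds little besides the weights, neither of which the table confirms; the function name is illustrative.

```python
N_PARAMS = 3.04e9  # parameter count stated in the model description


def bytes_per_parameter(size_gb: float, n_params: float = N_PARAMS) -> float:
    """Implied storage per parameter, assuming 1 GB = 1e9 bytes (an assumption)."""
    return size_gb * 1e9 / n_params


# File sizes as listed in the variants table.
for name, size_gb in [("best_model.pt", 11.7), ("sft_model.pt", 5.7)]:
    print(f"{name}: {bytes_per_parameter(size_gb):.2f} bytes/parameter")
```

The ratios come out at about 3.8 and 1.9 bytes per parameter. That would be consistent with 32-bit weights in the base checkpoint and 16-bit weights in the SFT checkpoint, but the stored dtype is not stated here, so treat this as an estimate only.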
Pretrained on ~50B tokens from:
The language distribution is weighted toward Hebrew as the anchor language.
A custom 32K vocabulary trained on a balanced multilingual corpus:
| Language | Fertility (tokens/word) |
|---|---|
| Hebrew | 1.75 BPB (best) |
| Farsi | 3.14 BPB |
| Arabic | 3.73 BPB |
| English | 3.83 BPB |
The tokenizer is specifically designed for script-diverse languages, avoiding the vocabulary dilution that occurs with large multilingual tokenizers.
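Fertility in the tokens-per-word sense can be measured by tokenizing a text sample and dividing the token count by the number of whitespace-separated words. The sketch below is one minimal way to do that and is not necessarily how the figures in the table were produced; `fertility` and `toy_encode` are hypothetical names, and the toy tokenizer is a stand-in so the example runs without the model file.

```python
from typing import Callable, Iterable, Sequence


def fertility(encode: Callable[[str], Sequence], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(encode(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def toy_encode(text: str) -> list[str]:
    """Stand-in tokenizer for illustration only: cuts each word into 2-character pieces."""
    return [word[i:i + 2] for word in text.split() for i in range(0, len(word), 2)]


sample = ["שלום עולם", "hello world"]
print(fertility(toy_encode, sample))
```

To measure the real tokenizer, pass `lambda t: sp.encode(t, out_type=str)` as `encode`, where `sp` is a loaded `SentencePieceProcessor`, and use a per-language text sample. Whitespace splitting is itself an approximation of "word", and it behaves differently across scripts.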
| Language | Accuracy |
|---|---|
| English | 31.8% |
| Hebrew | 27.0% |
| Arabic | 28.4% |
| Farsi | 28.2% |
| Overall | 28.9% |
Note: The random baseline is 25%, so these scores are roughly 2 to 7 percentage points above chance. This is a 3B model trained on a budget, and the results should be read relative to that scale.
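To put the accuracy table in context, the sketch below computes each language's margin over the 25% random baseline and the unweighted mean of the four per-language scores. It assumes the "Overall" row is a plain average across languages, which the table does not state; the function names are illustrative.

```python
RANDOM_BASELINE = 25.0  # percent, as stated in the note

# Per-language accuracy (percent), copied from the table.
accuracy = {"English": 31.8, "Hebrew": 27.0, "Arabic": 28.4, "Farsi": 28.2}


def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean across languages."""
    return sum(scores.values()) / len(scores)


def margin_over_chance(scores: dict[str, float], baseline: float) -> dict[str, float]:
    """Percentage points above the random baseline, per language."""
    return {lang: acc - baseline for lang, acc in scores.items()}


margins = margin_over_chance(accuracy, RANDOM_BASELINE)
for lang, margin in margins.items():
    print(f"{lang}: +{margin:.1f} points over chance")
print(f"unweighted mean: {macro_average(accuracy):.2f}%")
```

The unweighted mean is 28.85%, which is consistent with the reported 28.9% after rounding. English has the largest margin and Hebrew the smallest.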
SemiticGPT/
├── checkpoints/
│   ├── best_model.pt # Pretrained base model
│   └── sft_model.pt # SFT instruction-tuned model
├── tokenizer/
│   ├── multilingual_32k.model # SentencePiece tokenizer
│   └── multilingual_32k.vocab # Vocabulary file
├── eval/
│   ├── belebele_3b_results.json
│   └── belebele_3b.log
├── training_scripts/
│   ├── train_multilingual_3b_fsdp.py
│   ├── train_sft_3b.py
│   └── prepare_sft_data_v2.py
└── README.md
import torch
import sentencepiece as spm
# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer/multilingual_32k.model")
# Load model (custom architecture; see training_scripts/)
# The model uses a custom GPT implementation, not HuggingFace AutoModel
checkpoint = torch.load("checkpoints/best_model.pt", map_location="cpu")
# See train_multilingual_3b_fsdp.py for model class definition
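The layout of the loaded `checkpoint` object is not documented here, so a reasonable first step is to inspect it before building the model class. The sketch below counts parameters in any nested dictionary of tensor-like values. The `FakeTensor` class and the key names (`"model"`, `"wte.weight"`, `"ln.bias"`) are hypothetical stand-ins so the example runs without PyTorch or the 11.7 GB file; the real checkpoint may be organized differently.

```python
from collections.abc import Mapping


def count_parameters(obj) -> int:
    """Sum numel() over every tensor-like leaf in a (possibly nested) mapping."""
    if hasattr(obj, "numel"):
        return int(obj.numel())
    if isinstance(obj, Mapping):
        return sum(count_parameters(value) for value in obj.values())
    return 0  # ignore non-tensor entries such as step counters or config values


class FakeTensor:
    """Minimal stand-in for torch.Tensor, for illustration only."""

    def __init__(self, *shape: int):
        self.shape = shape

    def numel(self) -> int:
        n = 1
        for dim in self.shape:
            n *= dim
        return n


# Hypothetical layout: metadata next to a nested state dict.
fake_checkpoint = {
    "step": 20000,
    "model": {"wte.weight": FakeTensor(32000, 8), "ln.bias": FakeTensor(8)},
}
print(count_parameters(fake_checkpoint))
```

With the real file, `print(checkpoint.keys())` shows the top-level entries, and `count_parameters(checkpoint)` works on real tensors because `torch.Tensor` provides `numel()`. If the checkpoint also stores optimizer state, pass only the weights sub-dictionary to avoid counting that state as parameters.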
@misc{slasky2026semiticgpt,
title={SemiticGPT: Efficient Multilingual Pretraining for Low-Resource Script-Diverse Languages},
author={Slasky, Ronnen},
year={2026},
url={https://huggingface.co/Slasky/SemiticGPT}
}
Apache 2.0