---
license: apache-2.0
language:
- he
- ar
- fa
- en
tags:
- multilingual
- hebrew
- arabic
- persian
- semitic
- sentiment-analysis
- cross-lingual
pipeline_tag: text-generation
---
# SemiticGPT-3B
A 3.14B-parameter multilingual language model trained from scratch for **Hebrew, Arabic, Persian (Farsi), and English**, a script-diverse, low-resource language cluster centered on Semitic languages.
## Model Details
| Property | Value |
|----------|-------|
| Parameters | 3.14B |
| Architecture | GPT (RoPE, SwiGLU, RMSNorm, fused QKV) |
| Vocab Size | 32,000 (custom multilingual SentencePiece BPE) |
| Max Seq Length | 2,048 |
| Pretraining Data | 4.48B tokens (HE 40%, AR 20%, FA 20%, EN 20%) |
| SFT Data | 36,980 samples (sentiment + translation) |
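
As a rough worked example, the pretraining mix can be turned into approximate per-language token counts. This sketch assumes the percentages in the table are exact shares of the 4.48B-token total; the card reports them only to the nearest whole percent, so the results are estimates (about 1.79B Hebrew tokens and 0.90B for each other language).

```python
# Figures copied from the Model Details table; treated as exact (an assumption).
TOTAL_TOKENS = 4_480_000_000  # 4.48B pretraining tokens
MIX_PERCENT = {"he": 40, "ar": 20, "fa": 20, "en": 20}

# Integer arithmetic avoids floating-point rounding in the split.
tokens_by_language = {
    lang: TOTAL_TOKENS * pct // 100 for lang, pct in MIX_PERCENT.items()
}

for lang, n_tokens in tokens_by_language.items():
    print(f"{lang}: {n_tokens / 1e9:.2f}B tokens")
```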
## Key Results
### Sentiment Classification (v4, clean balanced eval)
| Language | Base → SFT (Logprob) | Generative |
|----------|---------------------|------------|
| 🇮🇱 Hebrew | 53.0% → **84.5%** | **82%** |
| 🇸🇦 Arabic | 45.0% → **60.5%** | **64%** |
| 🇮🇷 Farsi | 60.5% → **78.5%** | **74%** |
| 🇺🇸 English | 51.5% → **73.0%** | **64%** |
### Cross-lingual Transfer (Experiment B)
English-only SFT leaves accuracy within 2 points of the base model in every language, English included, which suggests that **multilingual SFT is necessary**:
| Language | Base | EN-SFT | Multi-SFT |
|----------|------|--------|-----------|
| Hebrew | 53.0% | 51.5% | **84.5%** |
| Arabic | 45.0% | 46.5% | **60.5%** |
| Farsi | 60.5% | 58.5% | **78.5%** |
| English | 51.5% | 52.0% | **73.0%** |
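
The gap between the two SFT variants is easier to see as gains over the base model. The snippet below is a small illustration that recomputes those gains from the table values; the dictionary layout and the `gain_over_base` helper are hypothetical names introduced here, not part of the released files.

```python
# Accuracy (%) copied from the cross-lingual transfer table.
RESULTS = {
    "Hebrew":  {"base": 53.0, "en_sft": 51.5, "multi_sft": 84.5},
    "Arabic":  {"base": 45.0, "en_sft": 46.5, "multi_sft": 60.5},
    "Farsi":   {"base": 60.5, "en_sft": 58.5, "multi_sft": 78.5},
    "English": {"base": 51.5, "en_sft": 52.0, "multi_sft": 73.0},
}


def gain_over_base(row, variant):
    """Accuracy change, in percentage points, of an SFT variant vs. the base model."""
    return round(row[variant] - row["base"], 1)


for language, row in RESULTS.items():
    en_gain = gain_over_base(row, "en_sft")
    multi_gain = gain_over_base(row, "multi_sft")
    print(f"{language}: EN-SFT {en_gain:+.1f} pts, Multi-SFT {multi_gain:+.1f} pts")
```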
### Tokenizer Efficiency (Experiment C)
Per byte of text, our tokenizer uses **49-69% fewer tokens** than Llama-2 for Hebrew, Arabic, and Farsi, and about 2% more for English:
| Language | Ours (tok/byte) | Llama-2 (tok/byte) | Improvement |
|----------|----------------|-------------------|-------------|
| Hebrew | 0.195 | 0.569 | **+65.6%** |
| Arabic | 0.288 | 0.565 | **+49.1%** |
| Farsi | 0.175 | 0.561 | **+68.8%** |
| English | 0.270 | 0.264 | -2.2% |
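
For readers who want to reproduce this kind of comparison, here is a minimal sketch of the two quantities in the table: tokens per UTF-8 byte, and the relative token reduction against a baseline. The function names are hypothetical, and the exact evaluation texts behind the table are not specified in this card. Recomputing the last column from the rounded tok/byte values agrees with the reported figures to within about 0.2 percentage points, presumably because the reported numbers were derived from unrounded measurements.

```python
def tokens_per_byte(text, encode):
    """Tokens emitted per UTF-8 byte of `text`; lower means a more compact encoding.

    `encode` is any callable mapping a string to a sequence of tokens.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    return len(encode(text)) / n_bytes


def token_reduction_pct(ours, baseline):
    """Percentage of tokens saved relative to `baseline` (negative means more tokens)."""
    return 100.0 * (1.0 - ours / baseline)


# (ours, Llama-2) tok/byte values copied from the table.
TABLE = {
    "Hebrew": (0.195, 0.569),
    "Arabic": (0.288, 0.565),
    "Farsi": (0.175, 0.561),
    "English": (0.270, 0.264),
}

for language, (ours, llama2) in TABLE.items():
    print(f"{language}: {token_reduction_pct(ours, llama2):+.1f}%")
```

With the released tokenizer loaded as in the Usage section, `encode` could be the SentencePiece processor's `encode` method.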
## Files
- `base_model.pt`: Pretrained base model (no SFT)
- `sft_model_v4.pt`: Fine-tuned model (v4, sentiment + translation)
- `multilingual_32k.model`: SentencePiece tokenizer
- `config.json`: Model configuration
- `exp_ab_results.json`: Experiment A+B results
- `exp_c_tokenizer_ablation.json`: Experiment C results
## Usage
```python
import torch
import sentencepiece as spm

from model_arch import GPT  # architecture definition; see model_arch.py

# Load tokenizer
sp = spm.SentencePieceProcessor(model_file='multilingual_32k.model')

# Load model
model = GPT()
state = torch.load('sft_model_v4.pt', map_location='cpu', weights_only=True)
model.load_state_dict(state['model_state_dict'])
model.eval()

# Generate with greedy decoding (at most 20 new tokens).
# The Hebrew prompt reads: "Classify the sentiment of the following text
# (positive/negative): I love this book!"
prompt = "<|user|> סווג את הרגש של הטקסט הבא (חיובי/שלילי):\nאני אוהב את הספר הזה!\n<|assistant|> "
ids = sp.encode(prompt)
x = torch.tensor([ids])
with torch.no_grad():
    for _ in range(20):
        logits = model(x)
        next_id = logits[0, -1].argmax().item()
        if next_id == 2:  # EOS
            break
        x = torch.cat([x, torch.tensor([[next_id]])], dim=1)
print(sp.decode(x[0, len(ids):].tolist()))
# → חיובי  ("positive")
```
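
The prompt in the example follows a simple template: a `<|user|>` tag, an instruction, the input text on its own line, then an `<|assistant|>` tag followed by a space. A small helper along these lines can build such prompts for other inputs. `build_prompt` is a hypothetical name, not part of the released code, and the English instruction in the usage line is only an illustration; the instruction wording used during SFT is not documented here.

```python
USER_TAG = "<|user|>"
ASSISTANT_TAG = "<|assistant|>"


def build_prompt(instruction: str, text: str) -> str:
    """Format an instruction and input text using the chat-style template from the Usage example."""
    return f"{USER_TAG} {instruction}\n{text}\n{ASSISTANT_TAG} "


prompt = build_prompt(
    "Classify the sentiment of the following text (positive/negative):",
    "I love this book!",
)
print(repr(prompt))
```

The resulting string can be passed to the tokenizer's `encode` method in place of the hard-coded prompt.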
## Citation
Paper forthcoming.
## License
Apache 2.0