---
license: apache-2.0
language:
- he
- ar
- fa
- en
tags:
- multilingual
- hebrew
- arabic
- persian
- semitic
- sentiment-analysis
- cross-lingual
pipeline_tag: text-generation
---

# SemiticGPT-3B

A 3.14B-parameter multilingual language model trained from scratch for **Hebrew, Arabic, Persian (Farsi), and English**, a script-diverse, low-resource language cluster centered on Semitic languages.

## Model Details

| Property | Value |
|----------|-------|
| Parameters | 3.14B |
| Architecture | GPT (RoPE, SwiGLU, RMSNorm, fused QKV) |
| Vocab Size | 32,000 (custom multilingual SentencePiece BPE) |
| Max Seq Length | 2,048 |
| Pretraining Data | 4.48B tokens (HE 40%, AR 20%, FA 20%, EN 20%) |
| SFT Data | 36,980 samples (sentiment + translation) |

## Key Results

### Sentiment Classification (v4, clean balanced eval)

| Language | Base → SFT (Logprob) | Generative |
|----------|---------------------|------------|
| 🇮🇱 Hebrew | 53.0% → **84.5%** | **82%** |
| 🇸🇦 Arabic | 45.0% → **60.5%** | **64%** |
| 🇮🇷 Farsi | 60.5% → **78.5%** | **74%** |
| 🇺🇸 English | 51.5% → **73.0%** | **64%** |
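
The table separates a log-probability score from a generative one, but this card does not spell out the scoring procedure. One common approach, shown here purely as an assumption about what "Logprob" could mean, is to sum the model's log-probabilities for each candidate label's tokens after the prompt and pick the higher-scoring label. `next_token_logprobs` is a hypothetical callable standing in for a forward pass plus log-softmax, and `toy_logprobs` is a made-up stand-in, not the real model.

```python
import math
from typing import Callable, Dict, List, Sequence


def label_logprob(
    next_token_logprobs: Callable[[List[int]], Sequence[float]],
    prompt_ids: List[int],
    label_ids: List[int],
) -> float:
    """Sum of log P(label token | prompt + earlier label tokens)."""
    context = list(prompt_ids)
    total = 0.0
    for tok in label_ids:
        total += next_token_logprobs(context)[tok]
        context.append(tok)
    return total


def classify(
    next_token_logprobs: Callable[[List[int]], Sequence[float]],
    prompt_ids: List[int],
    labels: Dict[str, List[int]],
) -> str:
    """Return the label whose token sequence has the highest total log-probability."""
    scores = {
        name: label_logprob(next_token_logprobs, prompt_ids, ids)
        for name, ids in labels.items()
    }
    return max(scores, key=scores.get)


def toy_logprobs(context: List[int]) -> List[float]:
    """Toy 4-token 'model': favors token 3 after an odd-length context, else token 2."""
    probs = [0.1, 0.1, 0.2, 0.6] if len(context) % 2 else [0.1, 0.1, 0.6, 0.2]
    return [math.log(p) for p in probs]


# Hypothetical token ids for the prompt and for the two labels.
prediction = classify(toy_logprobs, [0, 1, 1], {"positive": [3], "negative": [2]})
print(prediction)
```

A generative score would instead decode text greedily, as in the Usage section, and compare the decoded string against the gold label.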

### Cross-lingual Transfer (Experiment B)

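English-only SFT barely moves accuracy (within 2 points of the base model in every language, English included), while multilingual SFT improves all four languages by 15.5 to 31.5 points. This suggests that **multilingual SFT is necessary** for this model: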

| Language | Base | EN-SFT | Multi-SFT |
|----------|------|--------|-----------|
| Hebrew | 53.0% | 51.5% | **84.5%** |
| Arabic | 45.0% | 46.5% | **60.5%** |
| Farsi | 60.5% | 58.5% | **78.5%** |
| English | 51.5% | 52.0% | **73.0%** |
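
To make the comparison concrete, the following sketch recomputes the change relative to the base model for each SFT variant, using only the numbers in the table above. The variable and function names are illustrative and do not come from the repository.

```python
# Accuracy (%) copied from the table above: (base, EN-only SFT, multilingual SFT).
results = {
    "Hebrew": (53.0, 51.5, 84.5),
    "Arabic": (45.0, 46.5, 60.5),
    "Farsi": (60.5, 58.5, 78.5),
    "English": (51.5, 52.0, 73.0),
}


def gains(base: float, en_sft: float, multi_sft: float) -> tuple:
    """Percentage-point change over the base model for each SFT variant."""
    return round(en_sft - base, 1), round(multi_sft - base, 1)


deltas = {lang: gains(*row) for lang, row in results.items()}
for lang, (en_gain, multi_gain) in deltas.items():
    print(f"{lang}: EN-SFT {en_gain:+.1f} pts, Multi-SFT {multi_gain:+.1f} pts")

# Largest absolute shift produced by English-only SFT across the four languages.
max_abs_en = max(abs(en_gain) for en_gain, _ in deltas.values())
```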

### Tokenizer Efficiency (Experiment C)

Our tokenizer uses **49-69% fewer tokens** per byte than Llama-2's for Hebrew/Arabic/Farsi, at the cost of about 2% more tokens for English:

| Language | Ours (tok/byte) | Llama-2 (tok/byte) | Improvement |
|----------|----------------|-------------------|-------------|
| Hebrew | 0.195 | 0.569 | **+65.6%** |
| Arabic | 0.288 | 0.565 | **+49.1%** |
| Farsi | 0.175 | 0.561 | **+68.8%** |
| English | 0.270 | 0.264 | -2.2% |
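
As a rough illustration of how the columns above relate, the sketch below computes tokens per UTF-8 byte for any `encode` callable and the relative reduction against a baseline. The function names and the whitespace-splitting stand-in tokenizer are hypothetical and not part of this repository. Recomputing from the rounded table values can differ from the reported improvement by about 0.1 point.

```python
from typing import Callable, Sequence


def tokens_per_byte(encode: Callable[[str], Sequence], texts: Sequence[str]) -> float:
    """Total token count divided by total UTF-8 byte count over a corpus."""
    n_tokens = sum(len(encode(t)) for t in texts)
    n_bytes = sum(len(t.encode("utf-8")) for t in texts)
    return n_tokens / n_bytes


def token_reduction_pct(ours: float, baseline: float) -> float:
    """Percent fewer tokens per byte than the baseline (negative means more)."""
    return (1.0 - ours / baseline) * 100.0


# Rounded tok/byte values copied from the table above: (ours, Llama-2).
table = {
    "Hebrew": (0.195, 0.569),
    "Arabic": (0.288, 0.565),
    "Farsi": (0.175, 0.561),
    "English": (0.270, 0.264),
}
reductions = {lang: token_reduction_pct(o, b) for lang, (o, b) in table.items()}
for lang, pct in reductions.items():
    print(f"{lang}: {pct:+.1f}%")

# Stand-in tokenizer (whitespace split), used only to show the call shape.
demo_rate = tokens_per_byte(str.split, ["שלום עולם", "hello world"])
```

With the real tokenizer, `sp.encode` from the Usage section could be passed as `encode`, assuming it returns a list of token ids for a single string.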

## Files

- `base_model.pt`: Pretrained base model (no SFT)
- `sft_model_v4.pt`: Fine-tuned model (v4, sentiment + translation)
- `multilingual_32k.model`: SentencePiece tokenizer
- `config.json`: Model configuration
- `exp_ab_results.json`: Experiment A+B results
- `exp_c_tokenizer_ablation.json`: Experiment C results

## Usage

```python
import torch
import sentencepiece as spm

# Load tokenizer
sp = spm.SentencePieceProcessor('multilingual_32k.model')

# Load model (see model_arch.py for architecture)
from model_arch import GPT
model = GPT()
state = torch.load('sft_model_v4.pt', map_location='cpu', weights_only=True)
model.load_state_dict(state['model_state_dict'])
model.eval()

# Generate with greedy decoding (at most 20 new tokens)
# Prompt (Hebrew): "Classify the sentiment of the following text (positive/negative): I love this book!"
prompt = "<|user|> סווג את הרגש של הטקסט הבא (חיובי/שלילי):\nאני אוהב את הספר הזה!\n<|assistant|> "
ids = sp.encode(prompt)
x = torch.tensor([ids])
with torch.no_grad():
    for _ in range(20):
        logits = model(x)
        next_id = logits[0, -1].argmax().item()
        if next_id == 2:  # EOS
            break
        x = torch.cat([x, torch.tensor([[next_id]])], dim=1)
print(sp.decode(x[0, len(ids):].tolist()))
# → חיובי ("positive")
```

## Citation

Paper forthcoming.

## License

Apache 2.0