# sungpt-swe-410m
A 410M-parameter instruction-tuned chat model trained from scratch on Swedish text, English web text, math, and code,
then fine-tuned in two stages (chat + coding SFT).
Built with the sungpt training framework: a Llama-style architecture
(RoPE + RMSNorm + SwiGLU + GQA) with weights exported directly to LlamaForCausalLM for zero-friction HF compatibility.
## Model details
| Hyperparameter | Value |
|---|---|
| Architecture | LlamaForCausalLM (RoPE + RMSNorm + SwiGLU + GQA) |
| Hidden size | 1024 |
| Layers | 24 |
| Attention heads | 16 |
| KV heads (GQA) | 8 |
| FFN intermediate | 4096 (SwiGLU) |
| Max sequence length | 4096 |
| Vocab size | 32,000 |
| Parameters | ~435M |
| Precision | bfloat16 |
| Tied embeddings | Yes |
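The parameter count can be sanity-checked from the hyperparameters above. A back-of-the-envelope count (a sketch: it assumes no bias terms, tied embeddings, and one Q/K/V/O projection set per layer, so the exact total may differ from the shipped checkpoint):

```python
# Rough parameter count for a Llama-style model with the hyperparameters above.
hidden, layers, heads, kv_heads = 1024, 24, 16, 8
ffn, vocab = 4096, 32000
head_dim = hidden // heads                     # 64

embed = vocab * hidden                         # token embeddings (tied with lm_head)
attn = hidden * hidden * 2                     # Q and O projections
attn += hidden * (kv_heads * head_dim) * 2     # K and V projections (GQA: 8 KV heads)
mlp = 3 * hidden * ffn                         # SwiGLU: gate, up, down
norms = 2 * hidden                             # two RMSNorms per layer

total = embed + layers * (attn + mlp + norms) + hidden  # + final RMSNorm
print(f"{total / 1e6:.1f}M")  # ~410.3M with tied embeddings, matching the model name
```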
## Quick start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "revana/sungpt-swe-410m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
## Chat / instruction

This model uses the Alpaca prompt format:

```text
### Instruction:
What is machine learning?

### Response:
```
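If you prefer to build the prompt by hand instead of going through the chat template, the layout above can be assembled directly. A minimal sketch (the authoritative spacing is whatever the tokenizer's bundled chat template produces, so prefer `apply_chat_template` when in doubt):

```python
def build_alpaca_prompt(instruction: str) -> str:
    """Assemble an Alpaca-style prompt ending with an open Response header."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n"

prompt = build_alpaca_prompt("What is machine learning?")
print(prompt)
```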
```python
messages = [
    {"role": "user", "content": "What is machine learning?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
)

# Decode only the newly generated assistant tokens
reply = tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(reply)
```
## Completion (base-style)

```python
prompts = {
    "code": "def merge_sort(arr):\n    \"\"\"Sort a list using merge sort.\"\"\"\n",
    "math": "To solve the equation 2x + 5 = 13, we first subtract 5 from both sides to get",
    "english": "The transformer architecture was introduced in the paper 'Attention is All You Need' and works by",
    "swedish": "Sverige är känt för sin starka välfärdsmodell och",
}

for domain, prompt in prompts.items():
    print(f"\n--- {domain} ---")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```
CPU / low-VRAM:

```python
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
```
Default generation settings (`generation_config.json`): `temperature=0.8`, `top_p=0.95`, `top_k=50`,
`repetition_penalty=1.1`, `max_new_tokens=512`, so a bare `model.generate(**inputs)` already samples.
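Those defaults correspond to a `generation_config.json` along these lines (a reconstruction from the values listed above, not a verbatim copy of the shipped file, which may carry additional keys such as special token ids):

```json
{
  "do_sample": true,
  "temperature": 0.8,
  "top_p": 0.95,
  "top_k": 50,
  "repetition_penalty": 1.1,
  "max_new_tokens": 512
}
```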
## Training

### Pretraining
| Property | Value |
|---|---|
| Framework | sungpt (custom, Llama-style) |
| Hardware | 1x H200 80 GB |
| Precision | bfloat16, gradient checkpointing, torch.compile |
| Optimizer | AdamW, lr 2e-4, beta=(0.9, 0.95), cosine decay |
| Batch size | 64 sequences x 4096 tokens = ~262K tokens/step |
| Throughput | ~48K tokens/sec at plateau |
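The throughput figures are internally consistent and give a rough wall-clock estimate for the run (a sketch; real runs add warmup, evaluation, and checkpointing overhead):

```python
tokens_per_step = 64 * 4096             # 262,144 tokens/step (~262K, as in the table)
total_tokens = 1.2e9                    # pretraining token budget
steps = total_tokens / tokens_per_step  # ~4,578 optimizer steps
hours = total_tokens / 48_000 / 3600    # ~6.9 h of pure compute at ~48K tokens/sec
print(tokens_per_step, round(steps), round(hours, 1))
```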
Pretraining data mix (~1.2B tokens):
| Dataset | Samples | Notes |
|---|---|---|
| HuggingFaceFW/fineweb | 200,000 | English web |
| codeparrot/github-code | 400,000 | Code |
| HuggingFaceFW/fineweb-edu | 200,000 | Educational web |
| meta-math/MetaMathQA | 395,000 | Math reasoning |
Data was pre-tokenized into memmap shards before training for maximum GPU throughput.
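The memmap-shard idea can be sketched with NumPy (a minimal illustration, not the sungpt pipeline itself; with a 32K vocab, token ids fit in `uint16`, halving shard size versus `int32`):

```python
import os
import tempfile

import numpy as np

# Write a shard: a flat stream of token ids as one contiguous uint16 array on disk.
token_ids = [2, 517, 1042, 3, 2, 99, 3]  # toy example; real shards hold millions of ids
path = os.path.join(tempfile.gettempdir(), "shard_000.bin")
np.array(token_ids, dtype=np.uint16).tofile(path)

# Training reads it back via np.memmap: no parsing, and the OS pages data in lazily.
data = np.memmap(path, dtype=np.uint16, mode="r")
seq_len = 4
x = data[0:seq_len].astype(np.int64)      # input tokens
y = data[1:seq_len + 1].astype(np.int64)  # next-token targets (shifted by one)
print(x.tolist(), y.tolist())
```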
### Fine-tuning (SFT, two-stage pipeline)

Stage 1: Chat SFT (teaches instruction-following format):
| Property | Value |
|---|---|
| Dataset | tatsu-lab/alpaca (~52K examples) |
| Format | Alpaca (### Instruction / ### Response) |
| Epochs | 3 (~4,875 steps) |
| Batch size | 32 |
| LR | 2e-5, cosine decay, 100 warmup steps |
| Precision | bfloat16 |
Stage 2: Coding SFT (teaches code-on-demand generation):
| Property | Value |
|---|---|
| Dataset | theblackcat102/evol-codealpaca-v1 (~111K examples) |
| Format | Alpaca (### Instruction / ### Response) |
| Epochs | 3 (~10,406 steps) |
| Batch size | 32 |
| LR | 1e-5, cosine decay, 100 warmup steps |
| Precision | bfloat16 |
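The step counts in both tables follow directly from dataset size, epochs, and batch size (approximate, since both dataset sizes are rounded):

```python
def sft_steps(examples: int, epochs: int, batch_size: int) -> int:
    """Total optimizer steps for a simple (non-packed) SFT run."""
    return examples * epochs // batch_size

print(sft_steps(52_000, 3, 32))   # ~4,875  (Stage 1, alpaca)
print(sft_steps(111_000, 3, 32))  # ~10,406 (Stage 2, evol-codealpaca)
```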
## Tokenizer
Custom BPE tokenizer (32,000 vocab) trained on Swedish + English + code text.
Special tokens: [BOS] (id 2), [EOS] (id 3), [PAD] (id 1).
```python
tokenizer = AutoTokenizer.from_pretrained("revana/sungpt-swe-410m")
tokens = tokenizer("Hej världen!", return_tensors="pt")
```
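For manual preprocessing, sequences are framed with the special token ids listed above. An illustration using the documented ids (in practice, let the tokenizer insert these for you):

```python
BOS_ID, EOS_ID, PAD_ID = 2, 3, 1  # ids documented above

def frame(token_ids, max_len):
    """Wrap a sequence in BOS/EOS and right-pad to max_len with PAD."""
    ids = [BOS_ID] + list(token_ids) + [EOS_ID]
    return ids + [PAD_ID] * (max_len - len(ids))

print(frame([517, 1042], 8))  # [2, 517, 1042, 3, 1, 1, 1, 1]
```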
## Limitations

- Swedish skew: stronger at Swedish and code than at general English.
- No RLHF / safety alignment: outputs may be biased or inappropriate; use with care in production.
- 410M parameters: capacity is limited; expect repetition on long contexts without `repetition_penalty`.
## License

Apache 2.0; see `LICENSE`.