sungpt-swe-410m

A 410M-parameter instruction-tuned chat model trained from scratch on Swedish text, English web text, math, and code, then fine-tuned in two stages (chat SFT + coding SFT). Built with the sungpt training framework, a Llama-style architecture (RoPE + RMSNorm + SwiGLU + GQA), with weights exported directly to LlamaForCausalLM for zero-friction Hugging Face compatibility.


Model details

| Hyperparameter | Value |
|---|---|
| Architecture | LlamaForCausalLM (RoPE + RMSNorm + SwiGLU + GQA) |
| Hidden size | 1024 |
| Layers | 24 |
| Attention heads | 16 |
| KV heads (GQA) | 8 |
| FFN intermediate | 4096 (SwiGLU) |
| Max sequence length | 4096 |
| Vocab size | 32,000 |
| Parameters | ~435M |
| Precision | bfloat16 |
| Tied embeddings | Yes |
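The GQA figures above mean each of the 8 KV heads is shared by two query heads. A quick arithmetic sketch of the derived attention shapes (variable names are illustrative, not from the repo):

```python
hidden_size = 1024
n_heads = 16          # query heads
n_kv_heads = 8        # KV heads (GQA)

head_dim = hidden_size // n_heads      # 64
group_size = n_heads // n_kv_heads     # 2 query heads per KV head

# Projection widths: Q spans all 16 heads, K/V only the 8 KV heads.
q_proj_out = n_heads * head_dim        # 1024
kv_proj_out = n_kv_heads * head_dim    # 512

print(head_dim, group_size, q_proj_out, kv_proj_out)  # 64 2 1024 512
```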

Quick start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "revana/sungpt-swe-410m"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

Chat / instruction

This model uses the Alpaca prompt format:

### Instruction:
What is machine learning?

### Response:

The tokenizer's chat template produces this format automatically:

messages = [
    {"role": "user", "content": "What is machine learning?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
)
# Decode only the newly generated assistant tokens
reply = tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(reply)

Completion (base-style)

prompts = {
    "code":    "def merge_sort(arr):\n    \"\"\"Sort a list using merge sort.\"\"\"\n",
    "math":    "To solve the equation 2x + 5 = 13, we first subtract 5 from both sides to get",
    "english": "The transformer architecture was introduced in the paper 'Attention is All You Need' and works by",
    "swedish": "Sverige ar kant for sin starka valfardsmodell och",
}

for domain, prompt in prompts.items():
    print(f"\n--- {domain} ---")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))

CPU / low-VRAM:

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

Default generation settings (generation_config.json): temperature=0.8, top_p=0.95, top_k=50, repetition_penalty=1.1, max_new_tokens=512, so a bare model.generate(**inputs) already samples.
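Keyword arguments passed to model.generate per call take precedence over the values in generation_config.json. A plain-dict sketch of that precedence using the defaults listed above (the dicts are illustrative, not a transformers API):

```python
# Defaults shipped in generation_config.json (values from above):
defaults = {
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 50,
    "repetition_penalty": 1.1,
    "max_new_tokens": 512,
}

# Explicit kwargs win, e.g. forcing deterministic (greedy) decoding:
overrides = {"do_sample": False, "max_new_tokens": 64}
effective = {**defaults, **overrides}
print(effective["do_sample"], effective["top_k"])  # False 50
```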


Training

Pretraining

| Property | Value |
|---|---|
| Framework | sungpt (custom, Llama-style) |
| Hardware | 1x H200 80 GB |
| Precision | bfloat16, gradient checkpointing, torch.compile |
| Optimizer | AdamW, lr 2e-4, betas=(0.9, 0.95), cosine decay |
| Batch size | 64 sequences x 4096 tokens = ~262K tokens/step |
| Throughput | ~48K tokens/sec at plateau |
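The tokens-per-step figure follows directly from the batch shape; a quick check (the steps-per-epoch estimate assumes the stated ~1.2B-token corpus):

```python
tokens_per_step = 64 * 4096       # sequences x sequence length
print(tokens_per_step)            # 262144, i.e. ~262K

corpus_tokens = 1.2e9             # ~1.2B pretraining tokens (from the card)
steps_per_epoch = corpus_tokens / tokens_per_step
print(round(steps_per_epoch))     # roughly 4578 optimizer steps per pass
```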

Pretraining data mix (~1.2B tokens):

| Dataset | Samples | Notes |
|---|---|---|
| HuggingFaceFW/fineweb | 200,000 | English web |
| codeparrot/github-code | 400,000 | Code |
| HuggingFaceFW/fineweb-edu | 200,000 | Educational web |
| meta-math/MetaMathQA | 395,000 | Math reasoning |

Data was pre-tokenized into memmap shards before training for maximum GPU throughput.
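A minimal sketch of what such a shard could look like, assuming flat uint16 memmaps (the filename and layout are assumptions, not the repo's actual format; uint16 works because the 32,000-token vocab fits below 65,536):

```python
import numpy as np

token_ids = [2, 911, 15043, 29991, 3]  # illustrative ids: [BOS] ... [EOS]

# Write: pack pre-tokenized ids into one flat binary shard on disk.
shard = np.memmap("shard_000.bin", dtype=np.uint16, mode="w+",
                  shape=(len(token_ids),))
shard[:] = token_ids
shard.flush()

# Read: the training loop maps shards read-only and slices out
# fixed-length sequences without re-tokenizing anything.
readback = np.memmap("shard_000.bin", dtype=np.uint16, mode="r")
print(readback.tolist())  # [2, 911, 15043, 29991, 3]
```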

Fine-tuning (SFT β€” 2-stage pipeline)

Stage 1: Chat SFT (teaches instruction-following format):

| Property | Value |
|---|---|
| Dataset | tatsu-lab/alpaca (~52K examples) |
| Format | Alpaca (### Instruction / ### Response) |
| Epochs | 3 (~4,875 steps) |
| Batch size | 32 |
| LR | 2e-5, cosine decay, 100 warmup steps |
| Precision | bfloat16 |
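The step count follows from the dataset size, epochs, and batch size; a quick check assuming exactly 52,000 examples:

```python
examples = 52_000   # ~52K Alpaca examples (assumed exact for this check)
epochs = 3
batch_size = 32

steps = examples * epochs // batch_size
print(steps)  # 4875
```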

Stage 2: Coding SFT (teaches code-on-demand generation):

| Property | Value |
|---|---|
| Dataset | theblackcat102/evol-codealpaca-v1 (~111K examples) |
| Format | Alpaca (### Instruction / ### Response) |
| Epochs | 3 (~10,406 steps) |
| Batch size | 32 |
| LR | 1e-5, cosine decay, 100 warmup steps |
| Precision | bfloat16 |
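Both SFT stages use the same Alpaca template. A minimal sketch of the formatting; the helper name is hypothetical, and the optional "### Input" section follows the standard tatsu-lab/alpaca schema rather than anything stated in this card:

```python
def build_alpaca_prompt(instruction: str, input_text: str = "") -> str:
    """Format one example the way both SFT stages saw it (hypothetical helper)."""
    if input_text:
        return (f"### Instruction:\n{instruction}\n\n"
                f"### Input:\n{input_text}\n\n"
                f"### Response:\n")
    return f"### Instruction:\n{instruction}\n\n### Response:\n"

prompt = build_alpaca_prompt("What is machine learning?")
print(prompt)
```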

Tokenizer

Custom BPE tokenizer (32,000 vocab) trained on Swedish + English + code text. Special tokens: [BOS] (id 2), [EOS] (id 3), [PAD] (id 1).

tokenizer = AutoTokenizer.from_pretrained("revana/sungpt-swe-410m")
tokens = tokenizer("Hej världen!", return_tensors="pt")
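With [PAD] = id 1, left-padding for batched generation (the usual choice for decoder-only models) can be sketched in plain Python; this is illustrative only, since the tokenizer's own padding handles it in practice:

```python
PAD_ID = 1  # [PAD] token id from the card

def left_pad(batch: list[list[int]], pad_id: int = PAD_ID) -> list[list[int]]:
    """Left-pad variable-length id sequences to a common length (illustrative)."""
    width = max(len(seq) for seq in batch)
    return [[pad_id] * (width - len(seq)) + seq for seq in batch]

batch = left_pad([[2, 10, 11, 3], [2, 10, 3]])
print(batch)  # [[2, 10, 11, 3], [1, 2, 10, 3]]
```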

Limitations

  • Swedish skew: stronger at Swedish and code than at general English.
  • No RLHF / safety alignment: outputs may be biased or inappropriate; use with care in production.
  • 410M parameters: capacity is limited; expect repetition on long contexts without repetition_penalty.

License

Apache 2.0; see LICENSE.
