sungpt-swe-410m

A 410M-parameter instruction-tuned chat model trained from scratch on Swedish text, English web text, math, and code, then fine-tuned in two stages (chat SFT + coding SFT). Built with the sungpt training framework, a Llama-style architecture (RoPE + RMSNorm + SwiGLU + GQA), with weights exported directly to LlamaForCausalLM for zero-friction Hugging Face compatibility.


Model details

| Hyperparameter | Value |
|---|---|
| Architecture | LlamaForCausalLM (RoPE + RMSNorm + SwiGLU + GQA) |
| Hidden size | 1024 |
| Layers | 24 |
| Attention heads | 16 |
| KV heads (GQA) | 8 |
| FFN intermediate | 4096 (SwiGLU) |
| Max sequence length | 4096 |
| Vocab size | 32,000 |
| Parameters | ~435M |
| Precision | bfloat16 |
| Tied embeddings | Yes |
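The GQA figures above mean each of the 8 KV heads is shared by two query heads. A quick arithmetic sketch of the derived attention shapes (variable names are illustrative, not from the repo):

```python
hidden_size = 1024
n_heads = 16          # query heads
n_kv_heads = 8        # KV heads (GQA)

head_dim = hidden_size // n_heads      # 64
group_size = n_heads // n_kv_heads     # 2 query heads per KV head

# Projection widths: Q spans all 16 heads, K/V only the 8 KV heads.
q_proj_out = n_heads * head_dim        # 1024
kv_proj_out = n_kv_heads * head_dim    # 512

print(head_dim, group_size, q_proj_out, kv_proj_out)  # 64 2 1024 512
```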

Quick start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "revana/sungpt-swe-410m"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

Chat / instruction

This model uses the Alpaca prompt format:

### Instruction:
What is machine learning?

### Response:

The tokenizer's chat template produces this format automatically:

messages = [
    {"role": "user", "content": "What is machine learning?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
)
# Decode only the newly generated assistant tokens
reply = tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(reply)

Completion (base-style)

prompts = {
    "code":    "def merge_sort(arr):\n    \"\"\"Sort a list using merge sort.\"\"\"\n",
    "math":    "To solve the equation 2x + 5 = 13, we first subtract 5 from both sides to get",
    "english": "The transformer architecture was introduced in the paper 'Attention is All You Need' and works by",
    "swedish": "Sverige ar kant for sin starka valfardsmodell och",
}

for domain, prompt in prompts.items():
    print(f"\n--- {domain} ---")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))

CPU / low-VRAM:

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

Default generation settings (generation_config.json): temperature=0.8, top_p=0.95, top_k=50, repetition_penalty=1.1, max_new_tokens=512, so a bare model.generate(**inputs) already samples.
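Keyword arguments passed to model.generate per call take precedence over the values in generation_config.json. A plain-dict sketch of that precedence using the defaults listed above (the dicts are illustrative, not a transformers API):

```python
# Defaults shipped in generation_config.json (values from above):
defaults = {
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 50,
    "repetition_penalty": 1.1,
    "max_new_tokens": 512,
}

# Explicit kwargs win, e.g. forcing deterministic (greedy) decoding:
overrides = {"do_sample": False, "max_new_tokens": 64}
effective = {**defaults, **overrides}
print(effective["do_sample"], effective["top_k"])  # False 50
```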


Training

Pretraining

| Property | Value |
|---|---|
| Framework | sungpt (custom, Llama-style) |
| Hardware | 1x H200 80 GB |
| Precision | bfloat16, gradient checkpointing, torch.compile |
| Optimizer | AdamW, lr 2e-4, betas=(0.9, 0.95), cosine decay |
| Batch size | 64 sequences x 4096 tokens = ~262K tokens/step |
| Throughput | ~48K tokens/sec at plateau |
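The tokens-per-step figure follows directly from the batch shape; a quick check (the steps-per-epoch estimate assumes the stated ~1.2B-token corpus):

```python
tokens_per_step = 64 * 4096       # sequences x sequence length
print(tokens_per_step)            # 262144, i.e. ~262K

corpus_tokens = 1.2e9             # ~1.2B pretraining tokens (from the card)
steps_per_epoch = corpus_tokens / tokens_per_step
print(round(steps_per_epoch))     # roughly 4578 optimizer steps per pass
```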

Pretraining data mix (~1.2B tokens):

| Dataset | Samples | Notes |
|---|---|---|
| HuggingFaceFW/fineweb | 200,000 | English web |
| codeparrot/github-code | 400,000 | Code |
| HuggingFaceFW/fineweb-edu | 200,000 | Educational web |
| meta-math/MetaMathQA | 395,000 | Math reasoning |

Data was pre-tokenized into memmap shards before training for maximum GPU throughput.
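A minimal sketch of what such a shard could look like, assuming flat uint16 memmaps (the filename and layout are assumptions, not the repo's actual format; uint16 works because the 32,000-token vocab fits below 65,536):

```python
import numpy as np

token_ids = [2, 911, 15043, 29991, 3]  # illustrative ids: [BOS] ... [EOS]

# Write: pack pre-tokenized ids into one flat binary shard on disk.
shard = np.memmap("shard_000.bin", dtype=np.uint16, mode="w+",
                  shape=(len(token_ids),))
shard[:] = token_ids
shard.flush()

# Read: the training loop maps shards read-only and slices out
# fixed-length sequences without re-tokenizing anything.
readback = np.memmap("shard_000.bin", dtype=np.uint16, mode="r")
print(readback.tolist())  # [2, 911, 15043, 29991, 3]
```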

Fine-tuning (SFT β€” 2-stage pipeline)

Stage 1: Chat SFT (teaches instruction-following format):

| Property | Value |
|---|---|
| Dataset | tatsu-lab/alpaca (~52K examples) |
| Format | Alpaca (### Instruction / ### Response) |
| Epochs | 3 (~4,875 steps) |
| Batch size | 32 |
| LR | 2e-5, cosine decay, 100 warmup steps |
| Precision | bfloat16 |
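The step count follows from the dataset size, epochs, and batch size; a quick check assuming exactly 52,000 examples:

```python
examples = 52_000   # ~52K Alpaca examples (assumed exact for this check)
epochs = 3
batch_size = 32

steps = examples * epochs // batch_size
print(steps)  # 4875
```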

Stage 2: Coding SFT (teaches code-on-demand generation):

| Property | Value |
|---|---|
| Dataset | theblackcat102/evol-codealpaca-v1 (~111K examples) |
| Format | Alpaca (### Instruction / ### Response) |
| Epochs | 3 (~10,406 steps) |
| Batch size | 32 |
| LR | 1e-5, cosine decay, 100 warmup steps |
| Precision | bfloat16 |
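Both SFT stages use the same Alpaca template. A minimal sketch of the formatting; the helper name is hypothetical, and the optional "### Input" section follows the standard tatsu-lab/alpaca schema rather than anything stated in this card:

```python
def build_alpaca_prompt(instruction: str, input_text: str = "") -> str:
    """Format one example the way both SFT stages saw it (hypothetical helper)."""
    if input_text:
        return (f"### Instruction:\n{instruction}\n\n"
                f"### Input:\n{input_text}\n\n"
                f"### Response:\n")
    return f"### Instruction:\n{instruction}\n\n### Response:\n"

prompt = build_alpaca_prompt("What is machine learning?")
print(prompt)
```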

Tokenizer

Custom BPE tokenizer (32,000 vocab) trained on Swedish + English + code text. Special tokens: [BOS] (id 2), [EOS] (id 3), [PAD] (id 1).

tokenizer = AutoTokenizer.from_pretrained("revana/sungpt-swe-410m")
tokens = tokenizer("Hej världen!", return_tensors="pt")
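With [PAD] = id 1, left-padding for batched generation (the usual choice for decoder-only models) can be sketched in plain Python; this is illustrative only, since the tokenizer's own padding handles it in practice:

```python
PAD_ID = 1  # [PAD] token id from the card

def left_pad(batch: list[list[int]], pad_id: int = PAD_ID) -> list[list[int]]:
    """Left-pad variable-length id sequences to a common length (illustrative)."""
    width = max(len(seq) for seq in batch)
    return [[pad_id] * (width - len(seq)) + seq for seq in batch]

batch = left_pad([[2, 10, 11, 3], [2, 10, 3]])
print(batch)  # [[2, 10, 11, 3], [1, 2, 10, 3]]
```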

Limitations

  • Swedish skew: stronger at Swedish and code than at general English.
  • No RLHF / safety alignment: outputs may be biased or inappropriate; use with care in production.
  • 410M parameters: capacity is limited; expect repetition on long contexts without repetition_penalty.

License

Apache 2.0; see LICENSE.
