You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Til-2B (base)

Til-2B — 1.98B-параметрлік көптілді base-модель: қазақ тілін бірінші кезекте қолдайтын, орыс/ағылшын/код/математиканы қамтитын тілдік модель. Til-Corpus корпусында нөлден бастап оқытылған.

Til-2B is a 1.98B-parameter multilingual base language model with first-class Kazakh support, trained from scratch on the Til-Corpus.

This is a base (non-instruct) model — it completes text; it does not follow chat instructions. Instruct and grammar-correction (GEC) fine-tunes are released separately under the TilQazyna organization.

Model details


Architecture	DeepSeek-V3-style dense decoder with MLA (Multi-head Latent Attention)
Parameters	1977.1M (tied input/output embeddings)
Hidden / layers	2048 / 30
Attention	16 heads, MLA: q_lora_rank 512, kv_lora_rank 256, qk_rope 32, qk_nope 96, v_head 96
FFN intermediate	8192 (SwiGLU)
Context length	4096
Position encoding	RoPE, θ = 100 000
Vocab	131 072 — Til-Tokenizer-128k
Precision	bf16

MLA compresses the KV-cache via low-rank latent projections, keeping the model memory-efficient at inference time.

Tokenizer

TilQazyna/Til-Tokenizer-128k — 131 072 BPE vocabulary trained with a focus on Kazakh morphology (≈1 token per Kazakh word on average), efficient for Russian, English, code and math. Special tokens: pad=0, <s>=1, </s>=2, <|im_start|>=6, <|im_end|>=7.

Training data

One full epoch over Til-Corpus — 47.0B tokens:

Domain	Tokens	Share
English	11.9B	25%
Code	9.9B	21%
Kazakh	9.7B	21%
Math	9.0B	19%
Russian	6.6B	14%

Documents are tokenized, concatenated with </s> separators and packed into fixed 4096-token sequences. Batches are fully shuffled across domains.

Training procedure


Global batch	256 sequences × 4096 = 1.05M tokens/step
Optimizer	AdamW, lr 6e-4, weight decay 0.1, grad clip 1.0
LR schedule	WSD (warmup → stable → linear decay over final 30%)
Precision	bf16
Hardware	8×H200, DDP

Evaluation

Bits-per-byte (BPB) on a frozen held-out set, 5 domains. BPB normalizes by UTF-8 bytes, so the number is independent of the tokenizer:

Domain	BPB ↓
Kazakh (kk)	0.4521
Code	0.4250
Russian (ru)	0.4969
Math	0.7524
English (en)	0.9056
Macro	0.6064

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TilQazyna/Til-2B"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="auto")

ids = tok("Абай Құнанбайұлы — қазақ халқының", return_tensors="pt").input_ids.to(model.device)
out = model.generate(ids, max_new_tokens=80, do_sample=True,
                     temperature=0.7, top_p=0.9, repetition_penalty=1.1, pad_token_id=0)
print(tok.decode(out[0], skip_special_tokens=True))

Intended use & limitations

Intended: research on Kazakh/multilingual NLP; foundation for fine-tunes (instruct, GEC, domain adaptation); on-device text completion after quantization.
Base model: completes text, does not answer questions or follow instructions.
Factuality: hallucinates facts and numbers; do not use raw outputs as truth.
Reasoning/code: surface form is fluent; logical/arithmetic correctness is not guaranteed.
No safety alignment has been applied.

License

Apache 2.0. Access is gated (manual approval) for usage tracking.

Downloads last month: -

Safetensors

Model size

2B params

Tensor type

F32

Model tree for TilQazyna/Til-2B

Finetunes

2 models

TilQazyna
/

Til-2B