SVEN-175M
A 175M parameter language model trained from scratch for ~$7.
SVEN-175M is the full-scale model in the SVEN family, built entirely from scratch - custom tokenizer, custom architecture, custom training loop. No fine-tuning. No LoRA. Trained on 1.2 billion tokens of real English text, math, code, and instruction data on a single RTX 3090 GPU.
Model Details
| Property | Value |
|---|---|
| Architecture | Decoder-only transformer (LLaMA-style) |
| Parameters | 175,215,488 (~175M) |
| Context length | 1,024 tokens |
| Vocabulary | 32,000 (BPE, trained on training corpus) |
| Layers | 16 |
| Hidden size | 896 |
| Attention heads | 16 Q heads, 4 KV heads (GQA) |
| FFN hidden size | 2,660 |
| Activation | SwiGLU |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Training steps | 10,000 |
| Training tokens | 1,219,641,241 (~1.2B) |
| Final loss | ~3.0 |
| Precision | bfloat16 |
| GPU | 1x NVIDIA RTX 3090 (24GB) |
| Training time | ~13 hours |
| Training cost | ~$7 |
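The parameter count can be reproduced from the sizes in the table. The sketch below assumes a standard LLaMA-style layout that the card does not spell out: two RMSNorm weight vectors per layer, one final RMSNorm, no biases, and a single embedding matrix shared with the output projection. Under those assumptions the arithmetic lands exactly on the reported figure.

```python
# Parameter count for a LLaMA-style decoder with the sizes listed above.
# Assumed layout (not confirmed by the card): two RMSNorm weights per layer,
# one final RMSNorm, no biases, tied input/output embeddings.
vocab, hidden, layers = 32_000, 896, 16
q_heads, kv_heads, ffn = 16, 4, 2_660
head_dim = hidden // q_heads            # 56

embed = vocab * hidden                  # shared with the output projection
attn = (
    hidden * q_heads * head_dim         # W_q
    + 2 * hidden * kv_heads * head_dim  # W_k and W_v (GQA: only 4 KV heads)
    + q_heads * head_dim * hidden       # W_o
)
mlp = 3 * hidden * ffn                  # SwiGLU: gate, up, and down projections
norms = 2 * hidden                      # pre-attention and pre-MLP RMSNorm
per_layer = attn + mlp + norms

total = embed + layers * per_layer + hidden  # last term: final RMSNorm
print(f"{total:,}")                     # 175,215,488
```

About 28.7M of the total sits in the embedding matrix and the remaining ~146.5M in the 16 transformer layers.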
Training Data
Trained on a curated English-only mix of 1.36M documents from 7 public sources:
| Source | Documents | Content | Mix |
|---|---|---|---|
| FineWeb-Edu | 599,878 | High-quality educational web text | 44% |
| Wikipedia EN | 199,708 | English Wikipedia articles | 15% |
| OpenWebMath | 149,098 | Mathematical reasoning and problems | 11% |
| OpenHermes 2.5 | 149,139 | GPT-4 generated instruction data | 11% |
| SlimOrca | 98,058 | Curated reasoning and Q&A | 7% |
| Python code | 46,376 | Python programming examples | 3% |
| Code instructions | 118,842 | Code instruction-response pairs | 9% |
| Total | 1,361,099 | 1.2B tokens | 100% |
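As a quick consistency check, the Mix column can be recomputed from the document counts. This sketch assumes the percentages are shares of documents rather than shares of tokens; the card does not say which, but the rounded document shares do match the table.

```python
# Document counts copied from the table above.
docs = {
    "FineWeb-Edu": 599_878,
    "Wikipedia EN": 199_708,
    "OpenWebMath": 149_098,
    "OpenHermes 2.5": 149_139,
    "SlimOrca": 98_058,
    "Python code": 46_376,
    "Code instructions": 118_842,
}

total_docs = sum(docs.values())
# Share of each source by document count, rounded to whole percent.
mix = {name: round(100 * n / total_docs) for name, n in docs.items()}

print(total_docs)  # 1361099
for name, pct in mix.items():
    print(f"{name:18s} {pct:3d}%")
```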
All data filtered for English (ASCII ratio + common word detection), quality-filtered for minimum length and content density, and deduplicated before training.
Tokenizer: Custom BPE tokenizer trained on the full 1.36M document corpus using SentencePiece. 32,000 vocab size. Trained specifically for this model - not borrowed from another project.
Architecture Notes
SVEN-175M uses a modern LLaMA-style architecture:
- RoPE - Rotary positional embeddings applied to Q and K in every attention layer. Generally extrapolates to longer sequences better than learned absolute positions.
- RMSNorm - Root Mean Square Layer Normalization. No mean subtraction and no bias, so it is cheaper to compute than standard LayerNorm.
- SwiGLU - Swish-gated linear unit feed-forward network. Commonly reported to outperform a plain GELU feed-forward block at a similar parameter count.
- Grouped Query Attention - 16 query heads share 4 KV heads (4 query heads per KV head). The KV cache is 4x smaller than with full multi-head attention, with minimal quality loss.
- Weight-tied embeddings - Input token embeddings and output projection share weights. This saves about 28.7M parameters (32,000 x 896), typically without hurting quality at this scale.
- No bias in linear layers - Standard for modern LLMs.
- Flash Attention 2 - Used during training for faster attention computation.
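To put a number on the grouped-query saving, here is a minimal sketch of the KV-cache size for one sequence at the full 1,024-token context. It assumes the cache is stored in bfloat16 (2 bytes per value) and that the head dimension is 896 / 16 = 56; the 16-KV-head case is a hypothetical full multi-head baseline, not a model that was trained.

```python
# KV-cache size for one sequence at full context, assuming bf16 storage.
layers, hidden, q_heads, kv_heads, context = 16, 896, 16, 4, 1_024
head_dim = hidden // q_heads   # 56
bytes_per_value = 2            # bfloat16

def kv_cache_bytes(n_kv_heads: int) -> int:
    # Factor 2 covers keys and values, stored per layer and per position.
    return 2 * layers * n_kv_heads * head_dim * context * bytes_per_value

gqa = kv_cache_bytes(kv_heads)  # 4 KV heads, as in SVEN-175M
mha = kv_cache_bytes(q_heads)   # hypothetical baseline with 16 KV heads

print(gqa / 2**20, mha / 2**20, mha / gqa)  # 14.0 56.0 4.0 (MiB, MiB, ratio)
```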
Training Details
Optimizer: AdamW
Learning rate: 3e-4 peak, cosine decay to 3e-5
Warmup steps: 2,000
Weight decay: 0.1
Gradient clip: 1.0
Batch size: 4
Gradient accumulation steps: 32
Effective batch size: 128 sequences
Sequence length: 1,024 tokens
Training steps: 10,000
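The learning-rate settings above can be sketched as a function of the step number. The exact schedule shape is an assumption: this version uses linear warmup from zero over the 2,000 warmup steps, then a cosine decay that reaches the 3e-5 floor at step 10,000. The function name `lr_at` is illustrative and does not come from the training code.

```python
import math

PEAK_LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS, TOTAL_STEPS = 2_000, 10_000

def lr_at(step: int) -> float:
    """Linear warmup, then cosine decay from PEAK_LR to MIN_LR (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to 1.0 past the last step.
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

for s in (0, 1_000, 2_000, 6_000, 10_000):
    print(s, f"{lr_at(s):.2e}")
```

With these assumptions the rate is 1.5e-4 halfway through warmup, peaks at 3e-4 at step 2,000, and passes through 1.65e-4 at the midpoint of the decay.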
Loss curve:
step 0: 10.41 (random init, expected log(32000) = 10.37)
step 1,000: 6.90 (fast early learning)
step 2,000: 4.05 (warmup complete)
step 3,000: 3.62 (solid progress)
step 5,000: 3.37 (checkpoint)
step 10,000: 3.00 (final)
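Two of these numbers are easy to sanity-check. Assuming the loss is mean cross-entropy per token in nats (which the log(32000) remark suggests), a model that guesses uniformly over the vocabulary scores ln(32,000), and the final loss converts to a perplexity via the exponential.

```python
import math

vocab_size = 32_000
uniform_loss = math.log(vocab_size)  # cross-entropy of a uniform guess, in nats
final_ppl = math.exp(3.00)           # perplexity at the reported final loss

print(f"{uniform_loss:.2f} {final_ppl:.1f}")  # 10.37 20.1
```

A final loss of 3.00 therefore corresponds to a perplexity of about 20 on the training distribution.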
Intended Use
SVEN-175M is an English general-purpose language model trained from scratch as a learning and research project.
It is intended for:
- Text generation and completion in English
- General question answering on common topics
- Basic reasoning and instruction following
- Experimentation and research at small model scale
- Educational reference for from-scratch LLM training
It is not intended for:
- Production use cases requiring reliability
- Tasks requiring factual accuracy or up-to-date knowledge
- Safety-critical applications
- Replacing larger, properly aligned models
Limitations
- No instruction tuning - this is a base pretrained model, not a chat model. Instruction data appears only in the pretraining mix, so the model completes text and does not follow instructions reliably.
- No alignment - no RLHF, no DPO, no safety training of any kind.
- Knowledge cutoff - trained on a static dataset with no real-time knowledge.
- Scale - 175M parameters is small by modern standards. It cannot match the reasoning or knowledge depth of 7B+ models.
- Undertrained - 1.2B tokens is about a third of the Chinchilla-optimal ~3.5B tokens (roughly 20 tokens per parameter) for this model size. The model has significant room to improve with more training.
- Not benchmarked - formal ARC, HellaSwag, and PIQA evals have not been run yet.
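The training-budget gap can be estimated with the widely used Chinchilla rule of thumb of roughly 20 training tokens per parameter. That ratio is an approximation rather than an exact optimum, so treat the result as a rough guide.

```python
params = 175_215_488
trained_tokens = 1_219_641_241
tokens_per_param = 20  # rule-of-thumb ratio, an approximation

target_tokens = tokens_per_param * params
fraction = trained_tokens / target_tokens

print(f"{target_tokens / 1e9:.2f}B target, {fraction:.0%} reached")
# 3.50B target, 35% reached
```

Under this estimate the model saw about 7 tokens per parameter.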
What's Different About This Model
Most models on HuggingFace are fine-tunes or quantizations of existing models. SVEN-175M is trained from random initialization on real data with a custom tokenizer.
Random weights
+
Custom 32k BPE tokenizer (trained on this corpus)
+
1.2B tokens of real English data
+
LLaMA-style architecture built from scratch
+
Single RTX 3090, 13 hours, ~$7
=
SVEN-175M
Model Family
| Model | Parameters | Loss | HuggingFace |
|---|---|---|---|
| SVEN-10M | 11.5M | 6.90 | sriksven/sven-10m |
| SVEN-175M | 175M | 3.00 | sriksven/sven-175m |
Files
| File | Description |
|---|---|
| model.pt | Full model checkpoint (weights + optimizer state) |
| tokenizer.model | SentencePiece BPE tokenizer model |
| tokenizer.vocab | Tokenizer vocabulary file |
| config.yaml | Model architecture configuration |
Quick Start
import sentencepiece as spm
import torch
from huggingface_hub import hf_hub_download
# download files
model_path = hf_hub_download("sriksven/sven-175m", "model.pt")
tok_path = hf_hub_download("sriksven/sven-175m", "tokenizer.model")
# load tokenizer
sp = spm.SentencePieceProcessor()
sp.load(tok_path)
# load model (requires sven-175m repo cloned)
# see github.com/sriksven/sven-175m for full inference code
Training Infrastructure
Platform: RunPod (cloud GPU rental)
GPU: NVIDIA RTX 3090 24GB
Instance type: On-demand
Cost: $0.46/hr
Total runtime: ~13 hours
Total cost: ~$7
Data stored: RunPod ephemeral disk (deleted after training)
Weights: HuggingFace Hub (permanent)
Monitoring: Weights & Biases
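The figures above imply a rough throughput and GPU bill, sketched below. It assumes every step processes a full 4 x 32 x 1,024 tokens and that the 13 hours were spent entirely on training steps; neither is stated in the card.

```python
batch, accum, seq_len, steps = 4, 32, 1_024, 10_000
hours, rate_usd = 13, 0.46

tokens_per_step = batch * accum * seq_len      # 131,072
tokens_processed = tokens_per_step * steps     # 1,310,720,000
tokens_per_sec = tokens_processed / (hours * 3600)
gpu_cost = hours * rate_usd                    # training hours only

print(f"{tokens_per_step:,} tokens/step, {tokens_per_sec:,.0f} tokens/s, ${gpu_cost:.2f}")
# 131,072 tokens/step, 28,007 tokens/s, $5.98
```

Under these assumptions the run processes about 1.31B tokens, slightly more than one pass over the 1.22B-token corpus. The 13 training hours alone come to about $6 at the listed rate, so the ~$7 total presumably includes some billed time outside the training loop.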
Citation
@misc{sven-175m,
author = {Sri Krishna Venkatesh},
title = {SVEN-175M: A 175M Parameter LLM Trained from Scratch},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/sriksven/sven-175m}
}
About
SVEN stands for Sri Krishna Venkatesh — hidden in plain sight.
Built from scratch. No shortcuts. ~$7.