SVEN-175M

A 175M-parameter language model trained from scratch for ~$7.

SVEN-175M is the full-scale model in the SVEN family, built entirely from scratch - custom tokenizer, custom architecture, custom training loop. No fine-tuning. No LoRA. Trained on 1.2 billion tokens of real English text, math, code, and instruction data on a single RTX 3090 GPU.

Model Details


Architecture	Decoder-only transformer (LLaMA-style)
Parameters	175,215,488 (~175M)
Context length	1,024 tokens
Vocabulary	32,000 (BPE, trained on training corpus)
Layers	16
Hidden size	896
Attention heads	16 Q heads, 4 KV heads (GQA)
FFN hidden size	2,660
Activation	SwiGLU
Positional encoding	RoPE
Normalization	RMSNorm
Training steps	10,000
Training tokens	1,219,641,241 (~1.2B)
Final loss	~3.0
Precision	bfloat16
GPU	1x NVIDIA RTX 3090 (24GB)
Training time	~13 hours
Training cost	~$7
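
The parameter count in the table can be reproduced from the listed dimensions. The sketch below assumes a pre-norm block with two RMSNorm gain vectors per layer, one final RMSNorm, no biases, and an output head tied to the token embedding; the function name `count_parameters` is illustrative and not taken from the project code.

```python
def count_parameters(vocab=32_000, d_model=896, n_layers=16,
                     n_q_heads=16, n_kv_heads=4, d_ffn=2_660):
    """Count weights for a LLaMA-style decoder with GQA and tied embeddings."""
    head_dim = d_model // n_q_heads              # 896 / 16 = 56
    kv_dim = n_kv_heads * head_dim               # 4 * 56 = 224
    # Attention: Q and output projections are d_model x d_model,
    # K and V projections are d_model x kv_dim (smaller because of GQA).
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim
    # SwiGLU feed-forward: gate, up, and down projections.
    ffn = 3 * d_model * d_ffn
    # Assumed: two RMSNorm gain vectors per layer (pre-attention, pre-FFN).
    norms = 2 * d_model
    per_layer = attn + ffn + norms
    embedding = vocab * d_model                  # shared with the output head
    final_norm = d_model                         # assumed final RMSNorm
    return embedding + n_layers * per_layer + final_norm


print(f"{count_parameters():,}")  # 175,215,488
```

Under these assumptions the result matches the reported 175,215,488 exactly, which suggests the layout above is at least consistent with the real model.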

Training Data

Trained on a curated English-only mix of 1.36M documents from 7 public sources:

Source	Documents	Content	Mix
FineWeb-Edu	599,878	High-quality educational web text	44%
Wikipedia EN	199,708	English Wikipedia articles	15%
OpenWebMath	149,098	Mathematical reasoning and problems	11%
OpenHermes 2.5	149,139	GPT-4 generated instruction data	11%
SlimOrca	98,058	Curated reasoning and Q&A	7%
Python code	46,376	Python programming examples	3%
Code instructions	118,842	Code instruction-response pairs	9%
Total	1,361,099	1.2B tokens	100%

All data was filtered for English (ASCII ratio plus common-word detection), quality-filtered for minimum length and content density, and deduplicated before training.
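
A minimal sketch of what such a filtering pass could look like. The word list, every threshold, and the exact-match SHA-256 deduplication are assumptions made for illustration; the actual values and the deduplication method used for SVEN-175M are not stated here.

```python
import hashlib

# Hypothetical settings; the real pipeline's values are not published here.
COMMON_WORDS = {"the", "and", "of", "to", "in", "is", "that", "it", "for", "was"}
MIN_ASCII_RATIO = 0.95
MIN_COMMON_WORD_RATIO = 0.10
MIN_CHARS = 200


def ascii_ratio(text: str) -> float:
    """Fraction of characters that are plain ASCII."""
    if not text:
        return 0.0
    return sum(ch.isascii() for ch in text) / len(text)


def common_word_ratio(text: str) -> float:
    """Fraction of whitespace-separated words found in COMMON_WORDS."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(w.strip(".,;:!?\"'()") in COMMON_WORDS for w in words)
    return hits / len(words)


def looks_english(text: str) -> bool:
    return (ascii_ratio(text) >= MIN_ASCII_RATIO
            and common_word_ratio(text) >= MIN_COMMON_WORD_RATIO)


def filter_and_dedup(docs):
    """Drop short or non-English documents, then exact duplicates."""
    seen, kept = set(), []
    for doc in docs:
        if len(doc) < MIN_CHARS or not looks_english(doc):
            continue
        # Normalize case and whitespace before hashing (assumed dedup key).
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(doc)
    return kept
```

Exact-match hashing only removes identical documents after normalization; near-duplicate detection such as MinHash would be a separate step.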

Tokenizer: a custom BPE tokenizer with a 32,000-token vocabulary, trained with SentencePiece on the full 1.36M-document corpus. It was trained specifically for this model, not borrowed from another project.
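
For reference, training a BPE tokenizer of this size with SentencePiece could look roughly like the sketch below. The corpus path `corpus.txt` (one document per line) and the helper names are hypothetical, and every trainer option not shown is left at its SentencePiece default, which may differ from what was actually used.

```python
def tokenizer_training_args(corpus_path: str, model_prefix: str = "tokenizer") -> dict:
    """Keyword arguments for SentencePieceTrainer.train (illustrative values)."""
    return {
        "input": corpus_path,          # hypothetical text file, one document per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": 32_000,
        "model_type": "bpe",
    }


def train_tokenizer(corpus_path: str, model_prefix: str = "tokenizer") -> None:
    # Imported here so the arguments can be inspected without the package installed.
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(**tokenizer_training_args(corpus_path, model_prefix))


if __name__ == "__main__":
    print(tokenizer_training_args("corpus.txt"))
```

With the default prefix this produces `tokenizer.model` and `tokenizer.vocab`, the two tokenizer files listed under Files.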

Architecture Notes

SVEN-175M uses a modern LLaMA-style architecture:

RoPE - Rotary positional embeddings applied to Q and K in every attention layer. Encodes relative position in the attention scores and tends to extrapolate to longer contexts better than learned absolute positions.
RMSNorm - Root Mean Square Layer Normalization. No mean subtraction, no bias. Cheaper to compute than standard LayerNorm.
SwiGLU - Swish-gated linear unit feed-forward network with gate, up, and down projections. Typically reaches lower loss than a GELU feed-forward network of similar size.
Grouped Query Attention - 16 query heads share 4 KV heads (4 query heads per KV head). The KV cache is 4x smaller than with full multi-head attention, with minimal quality loss.
Weight-tied embeddings - Input token embeddings and output projection share weights. Saves about 28.7M parameters (32,000 x 896) with little expected cost in quality.
No bias in linear layers - Standard for modern LLMs.
FlashAttention-2 - Used during training for faster, more memory-efficient attention computation.
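
The KV-cache saving from grouped query attention follows directly from the head counts. The sketch below assumes a bfloat16 cache (2 bytes per value), a single sequence at the full 1,024-token context, and a head dimension of 896 / 16 = 56; `kv_cache_bytes` is an illustrative helper, not part of the project code.

```python
def kv_cache_bytes(n_layers=16, n_kv_heads=4, head_dim=56,
                   seq_len=1_024, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


gqa = kv_cache_bytes(n_kv_heads=4)    # grouped query attention, as in SVEN-175M
mha = kv_cache_bytes(n_kv_heads=16)   # hypothetical full multi-head baseline

print(gqa / 2**20, mha / 2**20, mha / gqa)  # 14.0 56.0 4.0
```

At this scale the cache is small either way (14 MiB versus 56 MiB per sequence), but the 4x ratio holds at any batch size or context length.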

Training Details

Optimizer:                    AdamW
Learning rate:                3e-4 peak, cosine decay to 3e-5
Warmup steps:                 2,000
Weight decay:                 0.1
Gradient clip:                1.0
Batch size:                   4
Gradient accumulation steps:  32
Effective batch size:         128 sequences
Sequence length:              1,024 tokens
Training steps:               10,000
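
The schedule and batch settings above can be written out as a short sketch. It assumes linear warmup from zero and a cosine decay that runs from the end of warmup to the final step; the hyperparameters above do not state either detail, and `lr_at` is an illustrative name.

```python
import math

PEAK_LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS, TOTAL_STEPS = 2_000, 10_000


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Assumed linear warmup from 0 to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumed cosine decay over the remaining steps, ending at MIN_LR.
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


MICRO_BATCH, ACCUM_STEPS, SEQ_LEN = 4, 32, 1_024
sequences_per_step = MICRO_BATCH * ACCUM_STEPS    # 128 sequences
tokens_per_step = sequences_per_step * SEQ_LEN    # 131,072 tokens

for step in (0, 1_000, 2_000, 6_000, 10_000):
    print(step, f"{lr_at(step):.2e}")
```

Gradient accumulation is what makes the effective batch of 128 sequences fit on a 24GB card: only 4 sequences are resident per forward and backward pass.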

Loss curve:

step 0:      10.41   (random init; uniform-guess loss is ln(32000) = 10.37)
step 1,000:   6.90   (fast early learning)
step 2,000:   4.05   (warmup complete)
step 3,000:   3.62   (solid progress)
step 5,000:   3.37   (checkpoint)
step 10,000:  3.00   (final)
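
To put these numbers in context, a cross-entropy loss can be converted to perplexity. This sketch assumes the reported loss is mean per-token cross-entropy in nats, which is consistent with the ln(32000) baseline quoted for step 0; `perplexity` is an illustrative helper.

```python
import math

VOCAB_SIZE = 32_000
# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(VOCAB_SIZE)


def perplexity(loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss)


loss_curve = {0: 10.41, 1_000: 6.90, 2_000: 4.05,
              3_000: 3.62, 5_000: 3.37, 10_000: 3.00}

print(round(uniform_loss, 2))        # 10.37
for step, loss in loss_curve.items():
    print(step, loss, round(perplexity(loss), 1))
```

A final loss of 3.00 corresponds to a perplexity of about 20, meaning the model is on average about as uncertain as a uniform choice among 20 tokens.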

Intended Use

SVEN-175M is an English general-purpose language model trained from scratch as a learning and research project.

It is intended for:

Text generation and completion in English
General question answering on common topics
Basic reasoning and instruction following
Experimentation and research at small model scale
Educational reference for from-scratch LLM training

It is not intended for:

Production use cases requiring reliability
Tasks requiring factual accuracy or up-to-date knowledge
Safety-critical applications
Replacing larger, properly aligned models

Limitations

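No instruction tuning - this is a base pretrained model, not a chat model. The pretraining mix includes instruction data, but there was no separate instruction-tuning stage, so the model completes text and does not follow instructions reliably.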
No alignment - no RLHF, no DPO, no safety training of any kind.
Knowledge cutoff - trained on a static dataset with no real-time knowledge.
Scale - 175M parameters is small by modern standards. It cannot match the reasoning or knowledge depth of 7B+ models.
Undertrained - 1.2B tokens (roughly 7 tokens per parameter) is well below the Chinchilla-optimal ~3.5B tokens (about 20 tokens per parameter) for this model size. The model has significant room to improve with more training.
Not benchmarked - formal ARC, HellaSwag, and PIQA evals have not been run yet.
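
The training budget can be compared against the widely used rule of thumb of about 20 training tokens per parameter, which is an approximation of the Chinchilla compute-optimal result rather than an exact law. The variable names below are illustrative.

```python
N_PARAMS = 175_215_488        # from Model Details
TRAIN_TOKENS = 1_219_641_241  # from Model Details
TOKENS_PER_PARAM_RULE = 20    # common approximation, not an exact constant

optimal_tokens = TOKENS_PER_PARAM_RULE * N_PARAMS
tokens_per_param = TRAIN_TOKENS / N_PARAMS
fraction_of_optimal = TRAIN_TOKENS / optimal_tokens

print(f"{optimal_tokens:,}")           # 3,504,309,760
print(round(tokens_per_param, 2))      # 6.96
print(round(fraction_of_optimal, 2))   # 0.35
```

By this estimate the model saw roughly a third of the tokens the rule would suggest for its size.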

What's Different About This Model

Most models on Hugging Face are fine-tunes or quantizations of existing models. SVEN-175M is trained from random initialization on real data with a custom tokenizer.

Random weights
      +
Custom 32k BPE tokenizer (trained on this corpus)
      +
1.2B tokens of real English data
      +
LLaMA-style architecture built from scratch
      +
Single RTX 3090, 13 hours, ~$7
      =
SVEN-175M

Model Family

Model	Parameters	Loss	HuggingFace
SVEN-10M	11.5M	6.90	sriksven/sven-10m
SVEN-175M	175M	3.00	sriksven/sven-175m

Files

File	Description
`model.pt`	Full model checkpoint (weights + optimizer state)
`tokenizer.model`	SentencePiece BPE tokenizer model
`tokenizer.vocab`	Tokenizer vocabulary file
`config.yaml`	Model architecture configuration

Quick Start

import sentencepiece as spm
import torch
from huggingface_hub import hf_hub_download

# download files
model_path = hf_hub_download("sriksven/sven-175m", "model.pt")
tok_path = hf_hub_download("sriksven/sven-175m", "tokenizer.model")

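# load tokenizer and encode a prompt into token ids
sp = spm.SentencePieceProcessor()
sp.load(tok_path)
ids = sp.encode("The capital of France is", out_type=int)

# load checkpoint on CPU (weights + optimizer state, see Files)
# the key layout is not documented here, so inspect it before use
ckpt = torch.load(model_path, map_location="cpu")
print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))

# building the model requires the sven-175m repo cloned
# see github.com/sriksven/sven-175m for full inference code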

Training Infrastructure

Platform:      RunPod (cloud GPU rental)
GPU:           NVIDIA RTX 3090 24GB
Instance type: On-demand
Cost:          $0.46/hr
Total runtime: ~13 hours
Total cost:    ~$7
Data stored:   RunPod ephemeral disk (deleted after training)
Weights:       HuggingFace Hub (permanent)
Monitoring:    Weights & Biases
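
A rough throughput and cost check from the figures above. The runtime is approximate (~13 hours), so the results are estimates. The hourly rate times the stated runtime comes to about $6; the ~$7 total would be consistent with about two more billed hours, for example setup or data preparation, though no breakdown is given.

```python
HOURLY_RATE_USD = 0.46
TRAIN_HOURS = 13                # approximate, as reported
TRAIN_TOKENS = 1_219_641_241

compute_cost = HOURLY_RATE_USD * TRAIN_HOURS             # about 5.98
tokens_per_second = TRAIN_TOKENS / (TRAIN_HOURS * 3600)  # about 26,061
cost_per_billion = compute_cost / (TRAIN_TOKENS / 1e9)   # about 4.90

print(f"training-run cost:  ${compute_cost:.2f}")
print(f"throughput:         {tokens_per_second:,.0f} tokens/s")
print(f"cost per 1B tokens: ${cost_per_billion:.2f}")
```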

Citation

@misc{sven-175m,
  author    = {Sri Krishna Venkatesh},
  title     = {SVEN-175M: A 175M Parameter LLM Trained from Scratch},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/sriksven/sven-175m}
}

About

SVEN stands for Sri Krishna Venkatesh — hidden in plain sight.

Built from scratch. No shortcuts. ~$7.
