SVEN-175M

A 175M parameter language model trained from scratch for ~$7.

SVEN-175M is the full-scale model in the SVEN family, built entirely from scratch - custom tokenizer, custom architecture, custom training loop. No fine-tuning. No LoRA. Trained on 1.2 billion tokens of real English text, math, code, and instruction data on a single RTX 3090 GPU.


Model Details

Architecture:         Decoder-only transformer (LLaMA-style)
Parameters:           175,215,488 (~175M)
Context length:       1,024 tokens
Vocabulary:           32,000 (BPE, trained on training corpus)
Layers:               16
Hidden size:          896
Attention heads:      16 Q heads, 4 KV heads (GQA)
FFN hidden size:      2,660
Activation:           SwiGLU
Positional encoding:  RoPE
Normalization:        RMSNorm
Training steps:       10,000
Training tokens:      1,219,641,241 (~1.2B)
Final loss:           ~3.0
Precision:            bfloat16
GPU:                  1x NVIDIA RTX 3090 (24GB)
Training time:        ~13 hours
Training cost:        ~$7
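
The parameter total can be cross-checked from the dimensions listed here. The sketch below is only a sanity check, and it assumes the usual LLaMA-style layout: one tied embedding matrix, bias-free Q/K/V/O projections, three SwiGLU matrices per layer, two RMSNorm weight vectors per layer, and one final RMSNorm. The function name `count_parameters` is hypothetical and not taken from the SVEN code.

```python
def count_parameters(vocab=32_000, d_model=896, n_layers=16,
                     n_q_heads=16, n_kv_heads=4, d_ff=2_660):
    """Count weights for an assumed LLaMA-style layout (no biases, tied embeddings)."""
    head_dim = d_model // n_q_heads           # 896 / 16 = 56
    kv_dim = n_kv_heads * head_dim            # 4 * 56 = 224

    embedding = vocab * d_model               # shared with the output projection
    attention = 2 * d_model * d_model         # W_q and W_o
    attention += 2 * d_model * kv_dim         # W_k and W_v (smaller because of GQA)
    ffn = 3 * d_model * d_ff                  # SwiGLU gate, up, and down projections
    norms = 2 * d_model                       # two RMSNorm weight vectors per layer
    final_norm = d_model

    return embedding + n_layers * (attention + ffn + norms) + final_norm


print(f"{count_parameters():,}")  # 175,215,488
```

Under these assumptions the count equals the reported 175,215,488 exactly, which is consistent with the tied embeddings and bias-free layers described in this card.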

Training Data

Trained on a curated English-only mix of 1.36M documents from the 7 public sources listed below:

Source              Documents   Content                               Mix (% of documents)
FineWeb-Edu           599,878   High-quality educational web text      44%
Wikipedia EN          199,708   English Wikipedia articles             15%
OpenWebMath           149,098   Mathematical reasoning and problems    11%
OpenHermes 2.5        149,139   GPT-4 generated instruction data       11%
SlimOrca               98,058   Curated reasoning and Q&A               7%
Python codes           46,376   Python programming examples             3%
Code instructions     118,842   Code instruction-response pairs         9%
Total               1,361,099   1.2B tokens                           100%

All data was filtered for English (ASCII ratio + common word detection), quality-filtered for minimum length and content density, and deduplicated before training.
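
As a rough illustration of what an "ASCII ratio + common word" English filter can look like, here is a minimal sketch. It is not the filter used for SVEN-175M: the function name `looks_english`, the word list, and both thresholds are assumptions chosen for the example.

```python
# Hypothetical stop-word list; the real pipeline's list is not documented here.
COMMON_WORDS = frozenset({
    "the", "and", "of", "to", "in", "is", "that", "for",
    "it", "with", "as", "on", "was", "are", "this",
})


def looks_english(text, min_ascii_ratio=0.9, min_common_ratio=0.05):
    """Heuristic English check: mostly ASCII and enough common English words."""
    if not text:
        return False

    # Step 1: share of characters that are plain ASCII.
    ascii_ratio = sum(ch.isascii() for ch in text) / len(text)
    if ascii_ratio < min_ascii_ratio:
        return False

    # Step 2: share of words that appear in the common-word list.
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return False
    common_ratio = sum(w in COMMON_WORDS for w in words) / len(words)
    return common_ratio >= min_common_ratio
```

The ASCII check alone would let through other Latin-script languages, which is why a second signal such as common-word frequency is useful.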

Tokenizer: Custom BPE tokenizer with a 32,000-token vocabulary, trained with SentencePiece on the full 1.36M-document corpus. Trained specifically for this model - not borrowed from another project.


Architecture Notes

SVEN-175M uses a modern LLaMA-style architecture:

  • RoPE - Rotary positional embeddings applied to Q and K in every attention layer. Generally extrapolates to longer sequences better than learned absolute position embeddings.
  • RMSNorm - Root mean square layer normalization. No mean subtraction and no bias, so it is cheaper to compute than standard LayerNorm.
  • SwiGLU - Swish-gated linear unit feed-forward network. Commonly reported to give better quality than a GELU feed-forward layer at a similar parameter count.
  • Grouped Query Attention - 16 query heads share 4 KV heads (4 query heads per KV head). The KV cache is 4x smaller than with full multi-head attention, with minimal quality loss.
  • Weight-tied embeddings - Input token embeddings and the output projection share one 32,000 x 896 matrix, which saves about 28.7M parameters without a noticeable quality cost.
  • No bias in linear layers - Standard for modern LLMs.
  • Flash Attention 2 - Used during training for faster, more memory-efficient attention computation.
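
The 4x KV-cache saving from grouped query attention follows directly from the head counts. The sketch below works through the arithmetic for one sequence at the full 1,024-token context. It assumes the cache is stored in bfloat16 (2 bytes per value) and a head dimension of 896 / 16 = 56; the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers=16, n_kv_heads=4, head_dim=56, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return keys_and_values * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


gqa = kv_cache_bytes(1024)                 # 4 KV heads, as in SVEN-175M
mha = kv_cache_bytes(1024, n_kv_heads=16)  # hypothetical full multi-head variant

print(gqa / 2**20, mha / 2**20, mha / gqa)  # 14.0 56.0 4.0
```

At this scale the absolute saving is small (14 MiB versus 56 MiB per sequence), but the ratio is the same one that matters for larger models and longer contexts.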

Training Details

Optimizer:                    AdamW
Learning rate:                3e-4 peak, cosine decay to 3e-5
Warmup steps:                 2,000
Weight decay:                 0.1
Gradient clip:                1.0
Batch size:                   4
Gradient accumulation steps:  32
Effective batch size:         128 sequences
Sequence length:              1,024 tokens
Training steps:               10,000
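
The learning-rate settings above can be written as a single function of the step number. This is a sketch under two assumptions that the card does not spell out: warmup is linear from zero, and the cosine decay spans the steps between the end of warmup and step 10,000. The function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(step, peak_lr=3e-4, min_lr=3e-5,
                  warmup_steps=2_000, total_steps=10_000):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Assumed linear ramp from 0 to peak_lr over the warmup period.
        return peak_lr * step / warmup_steps

    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


for step in (0, 1_000, 2_000, 6_000, 10_000):
    print(step, f"{learning_rate(step):.2e}")
```

With these settings each optimizer step covers 4 x 32 = 128 sequences of 1,024 tokens, or 131,072 tokens.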

Loss curve:

step 0:      10.41   (random init, expected log(32000) = 10.37)
step 1,000:   6.90   (fast early learning)
step 2,000:   4.05   (warmup complete)
step 3,000:   3.62   (solid progress)
step 5,000:   3.37   (checkpoint)
step 10,000:  3.00   (final)
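
The step-0 comment can be checked numerically: a model that predicts every one of the 32,000 tokens with equal probability has a cross-entropy of ln(32,000). The short sketch below computes that value and converts the final loss to perplexity. It assumes the reported losses are mean cross-entropy in nats per token, which matches the log(32000) = 10.37 note in the loss curve.

```python
import math

vocab_size = 32_000
final_loss = 3.00

# Cross-entropy of a uniform guess over the vocabulary.
uniform_loss = math.log(vocab_size)

# Perplexity is the exponential of the per-token loss.
final_perplexity = math.exp(final_loss)

print(round(uniform_loss, 2), round(final_perplexity, 1))  # 10.37 20.1
```

A perplexity of about 20 means that, on average, the model is roughly as uncertain as if it were choosing uniformly among 20 tokens at each position.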

Intended Use

SVEN-175M is an English general-purpose language model trained from scratch as a learning and research project.

It is intended for:

  • Text generation and completion in English
  • General question answering on common topics
  • Basic reasoning and instruction following
  • Experimentation and research at small model scale
  • Educational reference for from-scratch LLM training

It is not intended for:

  • Production use cases requiring reliability
  • Tasks requiring factual accuracy or up-to-date knowledge
  • Safety-critical applications
  • Replacing larger, properly aligned models

Limitations

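  • No instruction tuning - this is a base pretrained model, not a chat model. The pretraining mix includes instruction data, but there was no separate instruction-tuning stage, so the model completes text and does not follow instructions reliably.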
  • No alignment - no RLHF, no DPO, no safety training of any kind.
  • Knowledge cutoff - trained on a static dataset with no real-time knowledge.
  • Scale - 175M parameters is small by modern standards. It cannot match the reasoning or knowledge depth of 7B+ models.
  • Undertrained - 1.2B tokens is well below the Chinchilla-optimal budget of roughly 20 tokens per parameter, which is about 3.5B tokens for this model size. The model has significant room to improve with more training.
  • Not benchmarked - formal ARC, HellaSwag, and PIQA evals have not been run yet.
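
To put the undertraining point in numbers, the sketch below compares the tokens seen with a compute-optimal budget. It assumes the common rule of thumb of about 20 training tokens per parameter derived from the Chinchilla scaling results; the exact ratio depends on the fit used, so treat the output as an estimate.

```python
params = 175_215_488
tokens_seen = 1_219_641_241
TOKENS_PER_PARAM = 20  # assumed rule of thumb, not an exact constant

target_tokens = params * TOKENS_PER_PARAM
tokens_per_param = tokens_seen / params
fraction_of_target = tokens_seen / target_tokens

print(f"{target_tokens:,}")          # 3,504,309,760
print(round(tokens_per_param, 2))    # 6.96
print(round(fraction_of_target, 2))  # 0.35
```

Under this rule of thumb the model has seen roughly a third of a compute-optimal token budget.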

What's Different About This Model

Most models on HuggingFace are fine-tunes or quantizations of existing models. SVEN-175M is trained from random initialization on real data with a custom tokenizer.

Random weights
      +
Custom 32k BPE tokenizer (trained on this corpus)
      +
1.2B tokens of real English data
      +
LLaMA-style architecture built from scratch
      +
Single RTX 3090, 13 hours, ~$7
      =
SVEN-175M

Model Family

Model       Parameters   Loss   HuggingFace
SVEN-10M    11.5M        6.90   sriksven/sven-10m
SVEN-175M   175M         3.00   sriksven/sven-175m

Files

File              Description
model.pt          Full model checkpoint (weights + optimizer state)
tokenizer.model   SentencePiece BPE tokenizer model
tokenizer.vocab   Tokenizer vocabulary file
config.yaml       Model architecture configuration

Quick Start

import sentencepiece as spm
import torch
from huggingface_hub import hf_hub_download

# Download the checkpoint and tokenizer from the Hub (cached locally after the first call)
model_path = hf_hub_download(repo_id="sriksven/sven-175m", filename="model.pt")
tok_path = hf_hub_download(repo_id="sriksven/sven-175m", filename="tokenizer.model")

# Load the SentencePiece tokenizer
sp = spm.SentencePieceProcessor(model_file=tok_path)

# Load the model (requires the sven-175m repo to be cloned)
# See github.com/sriksven/sven-175m for full inference code

Training Infrastructure

Platform:      RunPod (cloud GPU rental)
GPU:           NVIDIA RTX 3090 24GB
Instance type: On-demand
Cost:          $0.46/hr
Total runtime: ~13 hours
Total cost:    ~$7
Data stored:   RunPod ephemeral disk (deleted after training)
Weights:       HuggingFace Hub (permanent)
Monitoring:    Weights & Biases

Citation

@misc{sven-175m,
  author    = {Sri Krishna Venkatesh},
  title     = {SVEN-175M: A 175M Parameter LLM Trained from Scratch},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/sriksven/sven-175m}
}

About

SVEN stands for Sri Krishna Venkatesh — hidden in plain sight.

Built from scratch. No shortcuts. ~$7.
