GPT
This is my own implementation of GPT-2.
It covers both tokenization and pretraining from scratch.
To make it a bit more end to end, I also implemented SFT with LoRA using transformers, quantization with llama.cpp, and inference with llama.cpp and Hugging Face. I'll try implementing these from scratch in future projects.
I skipped RLHF. That's probably for a later project :)
I used an A100 instance and an 8xA100 GPU cluster, attached to a shared filesystem. This setup cost me roughly $80.
Setup, tokenization, pretraining, and SFT take roughly 4 hours in total.
You can run the result in llama.cpp, or if you want to try it on an iOS/Android device, I recommend PocketPal.
Transformer Architecture
train-model.py
This file contains both the model implementation and the training loop.
1) Model code
Key classes and components
This doc maps the main code in this repo to the conceptual pieces of the model and training pipeline.
train-model.py
GPTConfig
Holds model hyperparameters:
- `block_size` (context length, max T)
- `vocab_size` (V)
- `n_layer` (L)
- `n_head` (H)
- `n_embd` (C)
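For reference, a minimal sketch of such a config; the defaults shown are the standard GPT-2 124M sizes, and the repo's actual values may differ (e.g. a vocab size padded for efficiency):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # max context length (T)
    vocab_size: int = 50257  # GPT-2 BPE vocabulary size (V)
    n_layer: int = 12        # number of Transformer blocks (L)
    n_head: int = 12         # attention heads per block (H)
    n_embd: int = 768        # embedding / channel dimension (C)
```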
GPT
Top-level decoder-only Transformer.
Key submodules:
- `wte`: token embedding matrix $W_E \in \mathbb{R}^{V\times C}$
- `wpe`: position embedding matrix $W_P \in \mathbb{R}^{T_{\max}\times C}$
- `h`: stack of `Block` modules (length `n_layer`)
- `ln_f`: final LayerNorm
- `lm_head`: output projection to vocab (tied to `wte`)
Block
One Transformer block (pre-LN residual):
- `ln_1` → `attn` → residual add
- `ln_2` → `mlp` → residual add
CausalSelfAttention
Multi-head causal self-attention:
- projects $X$ into $Q,K,V$
- uses causal masking ($s>t$ masked) via fused attention
- projects concatenated heads back to $C$
MLP
Feed-forward network:
- $C\to 4C\to C$ with GELU
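Putting these three pieces together, a simplified PyTorch sketch (not the repo's exact code, but the same structure) looks like this:

```python
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # fused Q, K, V projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)      # back to C after concat

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(C, dim=2)
        # split channels into heads: (B, T, C) -> (B, H, T, d)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # fused attention with the causal mask (positions s > t are hidden)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # concatenate heads
        return self.c_proj(y)

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)    # C -> 4C
        self.gelu = nn.GELU(approximate="tanh")
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)  # 4C -> C

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # pre-LN attention + residual
        x = x + self.mlp(self.ln_2(x))   # pre-LN MLP + residual
        return x
```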
DataLoaderLite
Loads tokenized .npy shards and yields (x, y) pairs for next-token prediction.
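A single-process sketch of the idea, assuming flat token shards named like `edufineweb_train_000001.npy` (the real class also splits batches across DDP ranks):

```python
import glob
import numpy as np
import torch

class DataLoaderLite:
    """Streams (x, y) next-token batches out of .npy token shards (single-process version)."""

    def __init__(self, data_root, B, T, split="train"):
        self.B, self.T = B, T
        self.shards = sorted(glob.glob(f"{data_root}/*_{split}_*.npy"))
        self.shard_idx, self.pos = 0, 0
        self.tokens = self._load(self.shards[self.shard_idx])

    def _load(self, path):
        # shards use a small integer dtype (e.g. uint16); cast to int64 for embedding lookup
        return torch.tensor(np.load(path).astype(np.int64))

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.pos : self.pos + B * T + 1]
        x = buf[:-1].view(B, T)   # inputs
        y = buf[1:].view(B, T)    # targets = inputs shifted left by one token
        self.pos += B * T
        if self.pos + B * T + 1 > len(self.tokens):   # advance to the next shard
            self.shard_idx = (self.shard_idx + 1) % len(self.shards)
            self.tokens = self._load(self.shards[self.shard_idx])
            self.pos = 0
        return x, y
```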
fineweb.py
- downloads FineWeb-Edu split
- tokenizes with GPT-2 BPE (tiktoken)
- writes token shards to disk as `.npy` (first shard often used as validation)
evals.py
Standalone HellaSwag evaluator (multiple-choice by loss).
To use it with your trained model, swap HF model loading for your GPT checkpoint.
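"Multiple-choice by loss" means each candidate ending is scored by the model's cross-entropy on the ending tokens, and the lowest-loss ending wins. Roughly (a sketch; the actual `evals.py` also handles dataset download, batching the four endings together, and masking):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ending_loss(model, ctx_ids, end_ids, device="cuda"):
    """Average cross-entropy of the candidate ending tokens given the context."""
    ids = torch.tensor([ctx_ids + end_ids], device=device)   # (1, T)
    logits = model(ids).logits                                # (1, T, V)
    # position t-1 predicts token t, so shift by one
    losses = F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        ids[:, 1:].reshape(-1),
        reduction="none",
    )
    return losses[-len(end_ids):].mean().item()               # score only the ending tokens

# prediction = the ending with the lowest loss, e.g.:
# pred = min(range(len(endings)), key=lambda i: ending_loss(model, ctx, endings[i]))
```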
Transformer formulas
Notation: batch $B$, sequence length $T$, embedding dim $C$, heads $H$, head dim $d=C/H$, vocab $V$, layers $L$.
Token + positional embeddings
Let token embedding matrix $W_E \in \mathbb{R}^{V\times C}$ and position embedding matrix $W_P \in \mathbb{R}^{T_{\max}\times C}$. With token IDs $\mathrm{idx}\in\{0,\dots,V-1\}^{B\times T}$ and positions $p_t=t$:

$$x_0[b,t] = W_E[\mathrm{idx}_{b,t}] + W_P[p_t], \qquad x_0\in\mathbb{R}^{B\times T\times C}$$
LayerNorm (per token, over channels)
For $u\in\mathbb{R}^{C}$:

$$\mathrm{LN}(u) = \gamma \odot \frac{u-\mu}{\sqrt{\sigma^2+\epsilon}} + \beta,\qquad \mu=\frac{1}{C}\sum_{c=1}^{C}u_c,\quad \sigma^2=\frac{1}{C}\sum_{c=1}^{C}(u_c-\mu)^2,$$

with learnable $\gamma,\beta\in\mathbb{R}^{C}$.
Transformer block (Pre-LN residual), repeated for $l=0,\dots,L-1$

$$x_l' = x_l + \mathrm{Attn}\big(\mathrm{LN}_1(x_l)\big),\qquad x_{l+1} = x_l' + \mathrm{MLP}\big(\mathrm{LN}_2(x_l')\big)$$
Multi-head causal self-attention
Given $X=\mathrm{LN}(x)\in\mathbb{R}^{B\times T\times C}$, per head $h\in\{1,\dots,H\}$:

$$Q^{(h)} = XW_Q^{(h)},\qquad K^{(h)} = XW_K^{(h)},\qquad V^{(h)} = XW_V^{(h)},\qquad W_Q^{(h)},W_K^{(h)},W_V^{(h)}\in\mathbb{R}^{C\times d},$$
where $Q^{(h)},K^{(h)},V^{(h)}\in\mathbb{R}^{B\times T\times d}$.
Causal masked scores:

$$S^{(h)}_{t,s} = \frac{\langle Q^{(h)}_t, K^{(h)}_s\rangle}{\sqrt{d}}\ \ \text{for } s\le t,\qquad S^{(h)}_{t,s} = -\infty\ \ \text{for } s>t$$
Attention weights and output:

$$A^{(h)} = \mathrm{softmax}\big(S^{(h)}\big)\ \text{(row-wise over } s\text{)},\qquad O^{(h)} = A^{(h)}V^{(h)}\in\mathbb{R}^{B\times T\times d}$$
Concatenate heads and project:

$$\mathrm{Attn}(X) = \big[\,O^{(1)};\dots;O^{(H)}\,\big]\,W_O,\qquad W_O\in\mathbb{R}^{C\times C}$$
MLP (feed-forward)
Uses expansion ratio $4\times$:

$$\mathrm{MLP}(x) = \mathrm{GELU}\big(xW_1+b_1\big)W_2+b_2,\qquad W_1\in\mathbb{R}^{C\times 4C},\ W_2\in\mathbb{R}^{4C\times C}$$
GELU (tanh approximation)

$$\mathrm{GELU}(x) \approx \tfrac{1}{2}\,x\left(1+\tanh\!\left(\sqrt{\tfrac{2}{\pi}}\big(x+0.044715\,x^{3}\big)\right)\right)$$
Final LayerNorm + LM head (tied weights)
Let $\hat{x}=\mathrm{LN}_f(x_L)$:

$$\mathrm{logits} = \hat{x}\,W_E^{\top}\in\mathbb{R}^{B\times T\times V}$$
Training objective (next-token cross entropy)
With targets $y\in\{0,\dots,V-1\}^{B\times T}$:

$$\mathcal{L} = -\frac{1}{BT}\sum_{b=1}^{B}\sum_{t=1}^{T}\log\big(\mathrm{softmax}(\mathrm{logits}_{b,t})\big)_{y_{b,t}}$$
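In code this is a single `F.cross_entropy` over flattened logits; a minimal sketch consistent with the shapes above:

```python
import torch.nn.functional as F

# weight tying: the LM head reuses the token embedding matrix W_E,
# e.g. something like `self.lm_head.weight = self.transformer.wte.weight` in the model

def lm_loss(logits, targets):
    """Next-token cross entropy averaged over all B*T positions."""
    B, T, V = logits.shape    # logits: (B, T, V), targets: (B, T) integer token IDs
    return F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
```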
Walkthrough: fineweb.py
This script builds tokenized training shards from FineWeb-Edu.
What it does:
- Loads the dataset: `load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train")`
- Tokenizes with `tiktoken.get_encoding("gpt2")`
- Prepends the `<|endoftext|>` token to each document
- Writes `.npy` shards of `shard_size = 1e8` tokens each
- Uses multiprocessing (`os.cpu_count()//2` workers)
- Naming:
  - the first shard is `"val"` (so you always have a validation shard)
  - later shards are `"train"`
  - output files look like `edufineweb_train_000001.npy` (numpy adds the `.npy` extension)
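The per-document tokenization step boils down to something like this (a simplified sketch; the real script runs it in a multiprocessing pool and accumulates the results into fixed-size shards):

```python
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
eot = enc._special_tokens["<|endoftext|>"]  # end-of-text / document-delimiter token id

def tokenize(doc):
    """Tokenize one FineWeb-Edu document (its "text" field), prepending <|endoftext|>."""
    tokens = [eot] + enc.encode_ordinary(doc["text"])
    # GPT-2's vocab (50257) fits in uint16, which halves shard size on disk
    return np.array(tokens, dtype=np.uint16)
```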
What to change most often:
- `remote_name` (`sample-10BT` → bigger subsets if you want)
- `shard_size` (smaller if disk is tight / you want more shards)
- `local_dir` (just make it match `train-model.py`'s `data_root`)
Start to End Execution
Repo layout
Training / pipeline code (all in `src/`):
- `fineweb.py` → tokenization to `src/edu_fineweb10B/`
- `train-model.py` → pretraining, outputs to `src/log/`
- `convert_ckpt_to_hf.py` → `.pt` checkpoint → HF folder
- `evals.py` → eval HF folder on HellaSwag
- `sft.py` → LoRA SFT (outputs `src/hf_sft_lora/`)
- `merge_lora.py` → merge LoRA into base (outputs merged HF folder)
- `user_assistant.jinja` → your llama.cpp chat template
Artifacts produced (not included in the git repo, but available on Hugging Face):
- `src/log/model_19072.pt` (final pretrain ckpt)
- `src/hf_pretrained/` (HF-exported base)
- `src/hf_sft_lora/` (LoRA adapters + checkpoints)
- `src/hf_sft_merged/` (merged HF model)
- `src/gpt2-Q4_K_M_2.gguf` (quantized GGUF)
Two-instance architecture (what runs where)
I used two Lambda GPU instances mounted to the same shared filesystem:
- Instance A (8×A100): Pretraining
- Instance B (1×A100): Tokenization + Eval + SFT + Merge + GGUF conversion + Quantization + Upload
I attach the filesystem in Lambda’s UI. After attaching, verify both machines see the same path (example below).
Verify shared mount (both instances)
df -h
ls -lah /home/ubuntu/GPT
Clone the repo:
export REPO=/home/ubuntu/GPT
git clone https://github.com/ShrithikShahapure/GPT.git
cd $REPO
1) Environment setup (run on BOTH instances)
System deps
sudo apt-get update
sudo apt-get install -y git git-lfs python3-venv python3-pip build-essential cmake pkg-config
git lfs install
Python venv + packages
cd $REPO
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
pip install numpy tqdm requests tiktoken datasets transformers accelerate safetensors peft trl huggingface_hub
pip install torch --index-url https://download.pytorch.org/whl/cu121
(Optional but recommended) Put HF caches on the shared filesystem
This avoids re-downloading datasets/models separately on each instance:
export HF_HOME=$REPO/.hf
export HF_DATASETS_CACHE=$HF_HOME/datasets
export TRANSFORMERS_CACHE=$HF_HOME/transformers
mkdir -p "$HF_HOME"
2) Build llama.cpp (do on the machine you’ll convert/quantize/test on)
cd $REPO
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
cmake --build llama.cpp/build -j
llama.cpp expects models in GGUF and provides conversion scripts. ([GitHub][1])
3) Tokenization (run on Instance B: 1×A100)
3.1 Run FineWeb tokenization
This writes shards under src/edu_fineweb10B/:
cd $REPO
source .venv/bin/activate
python src/fineweb.py
3.2 Ensure hellaswag.py exists (only if missing)
train-model.py imports hellaswag. If src/hellaswag.py doesn’t exist, do:
cd $REPO/src
cp -f evals.py hellaswag.py
4) Pretraining (run on Instance A: 8×A100)
From your repo root:
cd $REPO
source .venv/bin/activate
cd src
torchrun --standalone --nproc_per_node=8 train-model.py
Watch training logs
tail -n 200 -F $REPO/src/log/log.txt
Checkpoints
Your checkpoints land in:
ls -lah $REPO/src/log/model_*.pt
Example checkpoints from a full run:
- `model_05000.pt`
- `model_10000.pt`
- `model_15000.pt`
- `model_19072.pt`
5) Convert checkpoint → Hugging Face model (run on Instance B)
Convert the final checkpoint into HF format:
cd $REPO
source .venv/bin/activate
python3 src/convert_ckpt_to_hf.py \
--ckpt src/log/model_19072.pt \
--out src/hf_pretrained
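Roughly what that conversion does (a hedged sketch: it assumes the checkpoint stores the weights under a `"model"` key and leans on `transformers`' `GPT2LMHeadModel`; the real `convert_ckpt_to_hf.py` handles the exact key renaming and transposes between the custom module names and HF's):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

def convert(ckpt_path, out_dir):
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt["model"]                     # assumed checkpoint layout

    hf_model = GPT2LMHeadModel(GPT2Config(n_layer=12, n_head=12, n_embd=768))
    # the real script renames keys (e.g. strips a torch.compile "_orig_mod." prefix)
    # and reshapes Linear weights into HF's Conv1D layout before loading
    hf_model.load_state_dict(state_dict, strict=False)

    hf_model.save_pretrained(out_dir)                                   # config.json + weights
    GPT2TokenizerFast.from_pretrained("gpt2").save_pretrained(out_dir)  # GPT-2 BPE tokenizer
```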
6) Evaluate HF model (run on Instance B)
cd $REPO
source .venv/bin/activate
python3 src/evals.py -m src/hf_pretrained -d cuda
7) SFT (LoRA) (run on Instance A)
Run src/sft.py:
cd $REPO
source .venv/bin/activate
CUDA_VISIBLE_DEVICES=0 python3 src/sft.py
Expected output:
`src/hf_sft_lora/` (adapters + checkpoints)
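For reference, the general shape of a LoRA SFT run with `peft` + `trl` (a hedged sketch, not the repo's `sft.py`: the dataset and hyperparameters here are purely illustrative; for GPT-2, LoRA is typically applied to the fused `c_attn` projection):

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# illustrative dataset with a plain "text" column; sft.py uses its own data + chat formatting
dataset = load_dataset("stanfordnlp/imdb", split="train[:1%]")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["c_attn"],        # GPT-2's fused QKV projection
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="src/hf_pretrained",        # converted base model from step 5
    train_dataset=dataset,
    peft_config=lora,                 # only the LoRA adapters are trained
    args=SFTConfig(
        output_dir="src/hf_sft_lora",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=2e-4,
    ),
)
# note: GPT-2 ships without a pad token; the real sft.py has to set one (e.g. pad = eos)
trainer.train()
trainer.save_model("src/hf_sft_lora")
```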
8) Merge LoRA → merged HF model (run on Instance B)
Run:
cd $REPO
source .venv/bin/activate
python3 src/merge_lora.py
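Conceptually the merge is just `peft`'s `merge_and_unload()`; a sketch using the paths from the repo layout above:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("src/hf_pretrained")
model = PeftModel.from_pretrained(base, "src/hf_sft_lora")   # attach the LoRA adapters
merged = model.merge_and_unload()                            # fold adapter deltas into the base weights

merged.save_pretrained("src/hf_sft_merged")
AutoTokenizer.from_pretrained("src/hf_pretrained").save_pretrained("src/hf_sft_merged")
```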
Evaluate merged model:
cd $REPO
source .venv/bin/activate
python3 src/evals.py -m src/hf_sft_merged -d cuda
9) Convert merged HF → GGUF + Quantize (run on Instance B)
9.1 HF → fp16 GGUF
cd $REPO
source .venv/bin/activate
python llama.cpp/convert_hf_to_gguf.py \
src/hf_sft_merged \
--outfile src/gpt2-f16.gguf
9.2 Quantize to Q4_K_M
cd $REPO
./llama.cpp/build/bin/llama-quantize \
src/gpt2-f16.gguf \
src/gpt2-Q4_K_M_2.gguf \
Q4_K_M
10) Upload to Hugging Face (run on Instance B)
Login
pip install -U "huggingface_hub[cli]"
hf auth login
Upload the GGUF
Set your repo:
export HF_REPO="your-username/your-gguf-repo"
Upload:
hf upload "$HF_REPO" src/gpt2-Q4_K_M_2.gguf gpt2-Q4_K_M_2.gguf --repo-type model
(Optionally upload your template too)
hf upload "$HF_REPO" src/user_assistant.jinja user_assistant.jinja --repo-type model
11) Inference with llama.cpp (one-paragraph answers terminated by `<END>`)
From repo root (note: your GGUF is src/gpt2-Q4_K_M_2.gguf):
cd $REPO
./llama.cpp/build/bin/llama-cli \
-m src/gpt2-Q4_K_M_2.gguf \
--jinja --chat-template-file src/user_assistant.jinja \
-cnv -st \
-sys "Answer in exactly ONE paragraph (no blank lines). End your answer with: <END>" \
-r "<END>" \
-n 200 --temp 0.2 --top-p 0.95 \
--repeat-penalty 1.15 --repeat-last-n 128
12) Pull the GGUF from Hugging Face (Mac/Linux) and use it
Download via HF CLI
pip install -U "huggingface_hub[cli]"
hf download "$HF_REPO" gpt2-Q4_K_M_2.gguf --local-dir .
Or let llama.cpp pull from HF (caching supported)
Hugging Face documents running GGUFs with llama.cpp by pointing to the HF repo/file, and llama.cpp caches downloads (cache path controlled by LLAMA_CACHE). ([Hugging Face][2])
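If you prefer to script the download, `huggingface_hub` can fetch (and cache) the file directly; the repo id and filename are the ones uploaded in Section 10:

```python
from huggingface_hub import hf_hub_download

# downloads the quantized model into the local HF cache and returns its path
path = hf_hub_download(
    repo_id="your-username/your-gguf-repo",
    filename="gpt2-Q4_K_M_2.gguf",
)
print(path)  # pass this path to `llama-cli -m` or import it into PocketPal
```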
13) PocketPal (iPhone) — load GGUF from Hugging Face and run offline
PocketPal supports downloading GGUF model weights (including from Hugging Face) and chatting offline. ([App Store][3]) PocketPal’s project also notes Hugging Face Hub integration (browse/download models inside the app). ([GitHub][4])
Recommended flow
1. Upload `gpt2-Q4_K_M_2.gguf` to Hugging Face (Section 10).
2. On iPhone:
   - Open PocketPal
   - Go to Models / Download / Hugging Face (wording varies by version)
   - Search your repo (`your-username/your-gguf-repo`)
   - Download `gpt2-Q4_K_M_2.gguf`
3. Start a chat offline.

If PocketPal asks for a chat template, choose the one that matches your "User / Assistant" formatting, or keep prompts simple (the `user_assistant.jinja` template is what was used in llama.cpp).