GPT

This is my own implementation of GPT-2.

It covers both tokenization and pretraining from scratch.

I wanted to make this somewhat end to end, so I also implemented SFT with LoRA using transformers, quantization with llama.cpp, and inference with llama.cpp and Hugging Face. I'll try implementing these from scratch in future projects.

I skipped RLHF; that's probably better left for a future project :)

I used a 1xA100 instance and an 8xA100 instance, attached to a shared filesystem. This setup cost me roughly 80 dollars.

Setup, tokenization, pretraining, and SFT take roughly 4 hours in total.

You can run the model with llama.cpp, or if you want to try it on an iOS/Android device, I recommend PocketPal.


Transformer Architecture

Training model architecture

train-model.py

This file contains both the model implementation and the training loop.

1) Model code

Key classes and components

This doc maps the main code in this repo to the conceptual pieces of the model and training pipeline.

train-model.py

GPTConfig

Holds model hyperparameters:

  • block_size (context length, max T)
  • vocab_size (V)
  • n_layer (L)
  • n_head (H)
  • n_embd (C)
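As a quick sketch, this maps naturally to a small dataclass. The defaults below are the standard GPT-2 small values and are only illustrative; they are not necessarily the exact defaults in train-model.py.

from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024    # context length (max T)
    vocab_size: int = 50257   # V, GPT-2 BPE vocabulary size
    n_layer: int = 12         # L
    n_head: int = 12          # H
    n_embd: int = 768         # C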

GPT

Top-level decoder-only Transformer.

Key submodules:

  • wte: token embedding matrix $W_E \in \mathbb{R}^{V\times C}$
  • wpe: position embedding matrix $W_P \in \mathbb{R}^{T_{\max}\times C}$
  • h: stack of Block modules (length n_layer)
  • ln_f: final LayerNorm
  • lm_head: output projection to vocab (tied to wte)
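A minimal sketch of how these submodules fit together (illustrative, not the exact train-model.py code; it assumes the GPTConfig above and the Block class described next):

import torch
import torch.nn as nn

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transformer = nn.ModuleDict(dict(
            wte  = nn.Embedding(config.vocab_size, config.n_embd),   # token embeddings W_E
            wpe  = nn.Embedding(config.block_size, config.n_embd),   # position embeddings W_P
            h    = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),                      # final LayerNorm
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.transformer.wte.weight            # weight tying: W_U = W_E^T

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.transformer.wte(idx) + self.transformer.wpe(pos)    # x_0 = token + position embeddings
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        return self.lm_head(x)                                       # logits of shape (B, T, V)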

Block

One Transformer block (pre-LN residual):

  • ln_1 → attn → residual add
  • ln_2 → mlp → residual add
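Continuing the same sketch, the pre-LN residual pattern is just two skip connections:

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp  = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))   # ln_1 -> attn -> residual add
        x = x + self.mlp(self.ln_2(x))    # ln_2 -> mlp -> residual add
        return x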

CausalSelfAttention

Multi-head causal self-attention:

  • projects $X$ into $Q,K,V$
  • uses causal masking ($s>t$ masked) via fused attention
  • projects concatenated heads back to $C$
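A sketch of the attention module using PyTorch's F.scaled_dot_product_attention with is_causal=True (presumably what "fused attention" refers to here; the actual code may differ in details such as initialization):

import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)   # Q, K, V in one projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)       # output projection W_O

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, H, T, d) with d = C / H
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # fused causal attention: softmax(Q K^T / sqrt(d) + M) V, with s > t masked
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)             # concatenate heads
        return self.c_proj(y)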

MLP

Feed-forward network:

  • $C\to 4C\to C$ with GELU
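And the matching feed-forward sketch, with the tanh-approximate GELU:

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc   = nn.Linear(config.n_embd, 4 * config.n_embd)   # expand C -> 4C
        self.gelu   = nn.GELU(approximate="tanh")
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)   # project 4C -> C

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))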

DataLoaderLite

Loads tokenized .npy shards and yields (x, y) pairs for next-token prediction.
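Conceptually this is a sliding window over a flat token stream. A simplified single-shard, single-process sketch (the real DataLoaderLite presumably also cycles through shards and offsets each DDP rank):

import numpy as np
import torch

def iter_batches(shard_path, B, T):
    # load one .npy token shard and yield (x, y) pairs for next-token prediction
    tokens = torch.from_numpy(np.load(shard_path).astype(np.int64))
    pos = 0
    while pos + B * T + 1 <= len(tokens):
        buf = tokens[pos : pos + B * T + 1]
        x = buf[:-1].view(B, T)   # inputs
        y = buf[1:].view(B, T)    # targets, shifted by one position
        yield x, y
        pos += B * T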


fineweb.py

  • downloads FineWeb-Edu split
  • tokenizes with GPT-2 BPE (tiktoken)
  • writes token shards to disk as .npy (first shard often used as validation)

evals.py

Standalone HellaSwag evaluator (multiple-choice by loss). To use it with your trained model, swap HF model loading for your GPT checkpoint.
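"Multiple-choice by loss" means: for each candidate ending, compute the model's loss on the ending tokens given the context, and pick the ending with the lowest loss. A rough sketch of that scoring step (function and variable names are illustrative, and evals.py may use a different reduction):

import torch
import torch.nn.functional as F

@torch.no_grad()
def pick_ending(model, ctx_ids, ending_ids_list):
    # ctx_ids: 1-D tensor of context token ids; ending_ids_list: list of 1-D tensors
    losses = []
    for ending_ids in ending_ids_list:
        ids = torch.cat([ctx_ids, ending_ids]).unsqueeze(0)            # (1, T)
        logits = model(ids[:, :-1])                                    # predict tokens 1..T-1
        targets = ids[:, 1:]
        per_tok = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
        losses.append(per_tok[0, -ending_ids.numel():].mean())         # average loss on the ending tokens only
    return int(torch.stack(losses).argmin())                           # index of the most likely ending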


Transformer formulas

Notation: batch $B$, sequence length $T$, embedding dim $C$, heads $H$, head dim $d=C/H$, vocab $V$, layers $L$.

Token + positional embeddings

Let token embedding matrix $W_E \in \mathbb{R}^{V\times C}$ and position embedding matrix $W_P \in \mathbb{R}^{T_{\max}\times C}$. With token IDs $\mathrm{idx}\in\{0,\dots,V-1\}^{B\times T}$ and positions $p_t=t$:

$$x_0[b,t,:] = W_E[\mathrm{idx}_{b,t}] + W_P[p_t] \in \mathbb{R}^{C}.$$

LayerNorm (per token, over channels)

For $u\in\mathbb{R}^{C}$:

$$\begin{aligned} \mu(u) &= \frac{1}{C}\sum_{i=1}^{C} u_i, \\ \sigma^2(u) &= \frac{1}{C}\sum_{i=1}^{C}\left(u_i-\mu(u)\right)^2, \\ \mathrm{LN}(u) &= \gamma \odot \frac{u-\mu(u)}{\sqrt{\sigma^2(u)+\varepsilon}} + \beta, \end{aligned}$$

with learnable $\gamma,\beta\in\mathbb{R}^{C}$.

Transformer block (Pre-LN residual), repeated for $l=0,\dots,L-1$

$$\begin{aligned} x_{l+\frac12} &= x_l + \mathrm{Attn}\!\left(\mathrm{LN}(x_l)\right),\\ x_{l+1} &= x_{l+\frac12} + \mathrm{MLP}\!\left(\mathrm{LN}(x_{l+\frac12})\right). \end{aligned}$$

Multi-head causal self-attention

Given $X=\mathrm{LN}(x)\in\mathbb{R}^{B\times T\times C}$, per head $h\in\{1,\dots,H\}$:

$$\begin{aligned} Q^{(h)} &= X W_Q^{(h)} + b_Q^{(h)},\\ K^{(h)} &= X W_K^{(h)} + b_K^{(h)},\\ V^{(h)} &= X W_V^{(h)} + b_V^{(h)}, \end{aligned}$$

where $Q^{(h)},K^{(h)},V^{(h)}\in\mathbb{R}^{B\times T\times d}$.

Causal masked scores:

$$\begin{aligned} S^{(h)}_{t,s} &= \frac{\langle Q^{(h)}_{t},K^{(h)}_{s}\rangle}{\sqrt{d}} + M_{t,s},\\ M_{t,s} &= \begin{cases} 0, & s \le t,\\ -\infty, & s > t. \end{cases} \end{aligned}$$

Attention weights and output:

$$\begin{aligned} A^{(h)}_{t,s} &= \mathrm{softmax}_s\!\left(S^{(h)}_{t,s}\right),\\ O^{(h)}_{t} &= \sum_{s=0}^{T-1} A^{(h)}_{t,s}\, V^{(h)}_{s}. \end{aligned}$$

Concatenate heads and project:

$$\begin{aligned} O &= \mathrm{Concat}\left(O^{(1)},\dots,O^{(H)}\right)\in\mathbb{R}^{B\times T\times C},\\ \mathrm{Attn}(X) &= O W_O + b_O \in \mathbb{R}^{B\times T\times C}. \end{aligned}$$

MLP (feed-forward)

Uses expansion ratio $4\times$:

$$\begin{aligned} \mathrm{MLP}(X) &= \mathrm{GELU}(XW_1+b_1)\,W_2 + b_2,\\ W_1 &\in \mathbb{R}^{C\times 4C},\quad W_2\in \mathbb{R}^{4C\times C}. \end{aligned}$$

GELU (tanh approximation)

$$\mathrm{GELU}(z)=\frac12 z\left(1+\tanh\left(\sqrt{\frac{2}{\pi}}\left(z+0.044715 z^3\right)\right)\right).$$

Final LayerNorm + LM head (tied weights)

Let $\hat{x}=\mathrm{LN}_f(x_L)$:

$$\begin{aligned} \mathrm{logits} &= \hat{x} W_U + b_U \in \mathbb{R}^{B\times T\times V},\\ W_U &= W_E^\top \quad \text{(weight tying)}. \end{aligned}$$

Training objective (next-token cross entropy)

With targets $y\in\{0,\dots,V-1\}^{B\times T}$:

$$\begin{aligned} p(y_{b,t}\mid x) &= \mathrm{softmax}(\mathrm{logits}_{b,t,:})_{y_{b,t}},\\ \mathcal{L} &= -\frac{1}{BT}\sum_{b=1}^{B}\sum_{t=1}^{T}\log p(y_{b,t}\mid x). \end{aligned}$$
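In PyTorch this is a single cross_entropy call over the flattened B·T positions. A minimal sketch, assuming model, x, and y come from the GPT class and data loader described above:

import torch.nn.functional as F

logits = model(x)                        # (B, T, V)
loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),    # flatten to (B*T, V)
    y.view(-1),                          # flatten targets to (B*T,)
)                                        # mean negative log-likelihood over all positions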


Walkthrough: fineweb.py

This script builds tokenized training shards from FineWeb-Edu.

What it does:

  • Loads dataset:

    • load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train")
  • Tokenizes with tiktoken.get_encoding("gpt2")

  • Prepends <|endoftext|> token to each document

  • Writes .npy shards of shard_size = 1e8 tokens each

  • Uses multiprocessing (os.cpu_count()//2 workers)

  • Naming:

    • first shard is "val" (so you always have a validation shard)
    • later shards are "train"
    • output files look like: edufineweb_train_000001.npy (numpy adds .npy)
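Putting those steps together, a simplified per-document tokenizer looks roughly like this (illustrative sketch; the real script runs it in a multiprocessing pool and packs the results into fixed-size shards, and the uint16 cast is an assumption that keeps shards compact):

import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
eot = enc._special_tokens["<|endoftext|>"]   # end-of-text token id

def tokenize(doc):
    # prepend <|endoftext|> as a document delimiter, then encode the text
    tokens = [eot]
    tokens.extend(enc.encode_ordinary(doc["text"]))
    tokens_np = np.array(tokens)
    assert (0 <= tokens_np).all() and (tokens_np < 2**16).all(), "token ids must fit in uint16"
    return tokens_np.astype(np.uint16)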

What to change most often:

  • remote_name (sample-10BT → bigger subsets if you want)
  • shard_size (smaller if disk is tight / you want more shards)
  • local_dir (just make it match train-model.py’s data_root)

Start to End Execution

Repo layout

  • Training / pipeline code (all in src/)

    • fineweb.py → tokenization to src/edu_fineweb10B/
    • train-model.py → pretraining, outputs to src/log/
    • convert_ckpt_to_hf.py → .pt checkpoint → HF folder
    • evals.py → eval HF folder on HellaSwag
    • sft.py → LoRA SFT (outputs src/hf_sft_lora/)
    • merge_lora.py → merge LoRA into base (outputs merged HF folder)
    • user_assistant.jinja → llama.cpp chat template
  • Artifacts produced (these are not included in the git repo, but you can find them on Hugging Face)

    • src/log/model_19072.pt (final pretrain ckpt)
    • src/hf_pretrained/ (HF-exported base)
    • src/hf_sft_lora/ (LoRA adapters + checkpoints)
    • src/hf_sft_merged/ (merged HF model)
    • src/gpt2-Q4_K_M_2.gguf (quantized GGUF)

Two-instance architecture (what runs where)

I used two Lambda GPU instances mounted to the same shared filesystem:

  • Instance A (8×A100): Pretraining
  • Instance B (1×A100): Tokenization + Eval + SFT + Merge + GGUF conversion + Quantization + Upload

I attach the filesystem in Lambda’s UI. After attaching, verify both machines see the same path (example below).

Verify shared mount (both instances)

df -h
ls -lah /home/ubuntu/GPT

Clone the repo:

export REPO=/home/ubuntu/GPT

git clone https://github.com/ShrithikShahapure/GPT.git "$REPO"
cd "$REPO"

1) Environment setup (run on BOTH instances)

System deps

sudo apt-get update
sudo apt-get install -y git git-lfs python3-venv python3-pip build-essential cmake pkg-config
git lfs install

Python venv + packages

cd $REPO
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip

pip install numpy tqdm requests tiktoken datasets transformers accelerate safetensors peft trl huggingface_hub
pip install torch --index-url https://download.pytorch.org/whl/cu121

(Optional but recommended) Put HF caches on the shared filesystem

This avoids re-downloading datasets/models separately on each instance:

export HF_HOME=$REPO/.hf
export HF_DATASETS_CACHE=$HF_HOME/datasets
export TRANSFORMERS_CACHE=$HF_HOME/transformers
mkdir -p "$HF_HOME"

2) Build llama.cpp (do on the machine you’ll convert/quantize/test on)

cd $REPO
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
cmake --build llama.cpp/build -j

llama.cpp expects models in GGUF and provides conversion scripts. ([GitHub][1])


3) Tokenization (run on Instance B: 1×A100)

3.1 Run FineWeb tokenization

This writes shards under src/edu_fineweb10B/:

cd $REPO
source .venv/bin/activate
python src/fineweb.py

3.2 Ensure hellaswag.py exists (only if missing)

train-model.py imports hellaswag. If src/hellaswag.py doesn’t exist, do:

cd $REPO/src
cp -f evals.py hellaswag.py

4) Pretraining (run on Instance A: 8×A100)

From your repo root:

cd $REPO
source .venv/bin/activate
cd src
torchrun --standalone --nproc_per_node=8 train-model.py

Watch training logs

tail -n 200 -F $REPO/src/log/log.txt

Checkpoints

Your checkpoints land in:

ls -lah $REPO/src/log/model_*.pt

For example, my run produced:

  • model_05000.pt
  • model_10000.pt
  • model_15000.pt
  • model_19072.pt

5) Convert checkpoint → Hugging Face model (run on Instance B)

Convert the final checkpoint into HF format (this step produces src/hf_pretrained):

cd $REPO
source .venv/bin/activate
python3 src/convert_ckpt_to_hf.py \
  --ckpt src/log/model_19072.pt \
  --out  src/hf_pretrained

6) Evaluate HF model (run on Instance B)

cd $REPO
source .venv/bin/activate
python3 src/evals.py -m src/hf_pretrained -d cuda

7) SFT (LoRA) (run on Instance B)

The repo has src/sft.py. Run it:

cd $REPO
source .venv/bin/activate
CUDA_VISIBLE_DEVICES=0 python3 src/sft.py

Expected output:

  • src/hf_sft_lora/ (adapters + checkpoints)

8) Merge LoRA → merged HF model (run on Instance B)

Run:

cd $REPO
source .venv/bin/activate
python3 src/merge_lora.py

Evaluate merged model:

cd $REPO
source .venv/bin/activate
python3 src/evals.py -m src/hf_sft_merged -d cuda

9) Convert merged HF → GGUF + Quantize (run on Instance B)

9.1 HF → fp16 GGUF

cd $REPO
source .venv/bin/activate

python llama.cpp/convert_hf_to_gguf.py \
  src/hf_sft_merged \
  --outfile src/gpt2-f16.gguf

9.2 Quantize to Q4_K_M

cd $REPO
./llama.cpp/build/bin/llama-quantize \
  src/gpt2-f16.gguf \
  src/gpt2-Q4_K_M_2.gguf \
  Q4_K_M

10) Upload the GGUF to Hugging Face (run on Instance B)

Log in to Hugging Face:

pip install -U "huggingface_hub[cli]"
hf auth login

Upload the GGUF

Set your repo:

export HF_REPO="your-username/your-gguf-repo"

Upload:

hf upload "$HF_REPO" src/gpt2-Q4_K_M_2.gguf gpt2-Q4_K_M_2.gguf --repo-type model

(Optionally upload your template too)

hf upload "$HF_REPO" src/user_assistant.jinja user_assistant.jinja --repo-type model

11) Inference with llama.cpp (ONE paragraph + <END> flow)

From the repo root (the GGUF is src/gpt2-Q4_K_M_2.gguf):

cd $REPO

./llama.cpp/build/bin/llama-cli \
  -m src/gpt2-Q4_K_M_2.gguf \
  --jinja --chat-template-file src/user_assistant.jinja \
  -cnv -st \
  -sys "Answer in exactly ONE paragraph (no blank lines). End your answer with: <END>" \
  -r "<END>" \
  -n 200 --temp 0.2 --top-p 0.95 \
  --repeat-penalty 1.15 --repeat-last-n 128

12) Pull the GGUF from Hugging Face (Mac/Linux) and use it

Download via HF CLI

pip install -U "huggingface_hub[cli]"
hf download "$HF_REPO" gpt2-Q4_K_M_2.gguf --local-dir .

Or let llama.cpp pull from HF (caching supported)

Hugging Face documents running GGUFs with llama.cpp by pointing to the HF repo/file, and llama.cpp caches downloads (cache path controlled by LLAMA_CACHE). ([Hugging Face][2])


13) PocketPal (iPhone) — load GGUF from Hugging Face and run offline

PocketPal supports downloading GGUF model weights (including from Hugging Face) and chatting offline. ([App Store][3]) PocketPal’s project also notes Hugging Face Hub integration (browse/download models inside the app). ([GitHub][4])

Recommended flow

  1. Upload gpt2-Q4_K_M_2.gguf to Hugging Face (Section 10).

  2. On iPhone:

    • Open PocketPal
    • Go to Models / Download / Hugging Face (wording varies by version)
    • Search your repo (your-username/your-gguf-repo)
    • Download gpt2-Q4_K_M_2.gguf
  3. Start a chat offline.

If PocketPal asks for a chat template, choose one that matches the "User / Assistant" formatting, or keep prompts simple (user_assistant.jinja is the template I used with llama.cpp).

