OpenTransformer / AGILLM-4


	---
	library_name: pytorch
	tags:
	- transformer
	- language-model
	- long-context
	- agillm
	- experimental
	---

	# AGILLM-4

	AGILLM-4 is the next training target after AGILLM-3. The current code is a
	production-oriented starting point, copied from the proven single-file trainer
	and extended for:

	- >1B parameter floor preset (`agillm4_floor`) and ~1.7B main preset (`agillm4_main`) with AR+SAT+NAT heads
	- a target ratio of 100 training tokens per parameter, above the AGILLM-3 training ratio
	- longer block sizes on 24 GB, B200-class, and B300-class GPUs
	- AR+SAT+NAT training, with sequential backward to reduce peak VRAM
	- SDPA and experimental sublinear local+landmark attention backends
	- exact M-fold expansion attention harvested from n1.py, with local verifier
	- fused QKV projection harvested from n1.py, with legacy checkpoint loading
	- profiling tools for memory, throughput, AR cost, SAT cost, and optimizer cost
	- synthetic long-context curriculum generation for recall and multi-hop tests
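
To make the tokens-per-parameter target concrete, the sketch below turns the ratio into a token budget for each preset. The helper names are hypothetical and do not come from the trainer. The parameter counts are rounded from the preset descriptions (">1B" and "~1.7B"), so the real budgets will differ somewhat, and the batch size in the step estimate is an assumption for illustration only.

```python
# Hypothetical helpers: none of these names come from the AGILLM-4 trainer.
TOKENS_PER_PARAM = 100  # target ratio stated in the feature list

# Rounded from the preset descriptions; the real parameter counts will differ.
PRESET_PARAMS = {
    "agillm4_floor": 1.0e9,  # ">1B", so this is a lower bound
    "agillm4_main": 1.7e9,   # "~1.7B", approximate
}


def token_budget(n_params: float, tokens_per_param: int = TOKENS_PER_PARAM) -> int:
    """Return the training-token target for a model with n_params parameters."""
    return int(round(n_params * tokens_per_param))


def optimizer_steps(total_tokens: int, block_size: int, batch_size: int) -> int:
    """Steps needed to consume total_tokens at block_size * batch_size tokens per step."""
    tokens_per_step = block_size * batch_size
    return -(-total_tokens // tokens_per_step)  # ceiling division


for name, n_params in PRESET_PARAMS.items():
    print(f"{name}: {token_budget(n_params) / 1e9:.0f}B tokens")
```

Under these rounded counts the floor preset targets about 100B tokens and the main preset about 170B. `optimizer_steps` only illustrates how block size and batch size trade against step count; it does not model gradient accumulation or multi-GPU sharding.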

	Start with [AGILLM-4.md](AGILLM-4.md) for the training plan and command
	recipes. The current sublinear backend is intentionally experimental: profile it
	against SDPA before using it for a real run.
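
As a rough illustration of why a local+landmark pattern can be cheaper than full causal attention, the sketch below counts how many key positions each query would touch. It is not the backend's actual implementation. The window size, the landmark stride, and the rule "every stride-th position before the local window is a landmark" are all assumptions chosen for the example.

```python
from typing import List


def local_landmark_keys(query: int, window: int, stride: int) -> List[int]:
    """Key positions a causal query attends to under the assumed pattern:
    the most recent `window` positions plus every `stride`-th earlier position."""
    start = max(0, query - window + 1)
    local = range(start, query + 1)
    landmarks = range(0, start, stride)  # landmarks strictly before the local window
    return list(landmarks) + list(local)


def attended_pairs(seq_len: int, window: int, stride: int) -> int:
    """Total (query, key) pairs touched across a sequence under the assumed pattern."""
    return sum(len(local_landmark_keys(q, window, stride)) for q in range(seq_len))


def causal_pairs(seq_len: int) -> int:
    """Total (query, key) pairs for full causal attention."""
    return seq_len * (seq_len + 1) // 2


print(local_landmark_keys(10, window=4, stride=4))  # [0, 4, 7, 8, 9, 10]
```

Comparing `attended_pairs(n, window, stride)` with `causal_pairs(n)` gives a feel for the saving at a given block size. It says nothing about wall-clock speed or model quality, which is why profiling against SDPA still matters.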

	On RTX 4090-class 24GB cards, `run_agillm4_4090_longblock.sh` now defaults to
	`agillm4_floor` instead of the AGILLM-3-sized `large` preset, starts at block
	`1280`, and backs off in smaller 20% steps if VRAM is too tight.
	To train from the current v47 seed, launch the tmux session with
	`/workspace/agillm-4/launch_agillm4_4090_floor_from_v47.sh`; it writes its log to
	`/workspace/agillm4_floor_train.log`.
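
The block-size back-off described above can be pictured with the following sketch. It is not taken from `run_agillm4_4090_longblock.sh`. Rounding down to a multiple of 64, the 256-token minimum, and the `fits` callback are assumptions made for illustration.

```python
from typing import Callable, List, Optional


def backoff_schedule(start: int = 1280, multiple: int = 64, min_block: int = 256) -> List[int]:
    """Candidate block sizes: shrink by 20% each step, rounded down to `multiple`.
    The rounding rule and `min_block` are assumptions, not the script's behavior."""
    sizes = []
    block = start
    while block >= min_block:
        sizes.append(block)
        nxt = (block * 4 // 5) // multiple * multiple  # 20% smaller, integer math
        if nxt >= block:  # guard against a non-shrinking step
            break
        block = nxt
    return sizes


def first_fitting_block(fits: Callable[[int], bool], schedule: List[int]) -> Optional[int]:
    """Return the first block size for which `fits` reports success, else None.
    `fits` is a hypothetical stand-in for a trial run that checks for out-of-memory."""
    for block in schedule:
        if fits(block):
            return block
    return None


print(backoff_schedule())  # [1280, 1024, 768, 576, 448, 320, 256]
```

In a real run the `fits` check would be a short trial step that reports an out-of-memory failure. Here it is only a callback, so the schedule logic can be tested without a GPU.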

	Checkpoint uploads are intentionally bounded to stay within the public HF
	storage quota: status and log tails upload every 30 minutes, the latest
	multi-GB delta uploads at most daily, and full checkpoints upload at most
	weekly, with only two current remote files retained. Local full saves default
	to daily, and local retention is one full plus one delta, so the 64 GB Vast
	disk does not slowly fill.
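
The upload cadence above amounts to a minimum interval per artifact kind plus a retention cap. The sketch below shows one way to express that. The function names, the dictionary layout, and the `(name, mtime)` file representation are hypothetical, and it does not call any Hugging Face API.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

# Minimum time between uploads, taken from the policy described above.
UPLOAD_INTERVALS = {
    "status": timedelta(minutes=30),  # status and log tails
    "delta": timedelta(days=1),       # latest multi-GB delta
    "full": timedelta(weeks=1),       # full checkpoint
}


def due_uploads(now: datetime, last_uploaded: Dict[str, Optional[datetime]]) -> List[str]:
    """Return the artifact kinds whose minimum interval has elapsed (or never uploaded)."""
    due = []
    for kind, interval in UPLOAD_INTERVALS.items():
        last = last_uploaded.get(kind)
        if last is None or now - last >= interval:
            due.append(kind)
    return due


def prune(files: List[Tuple[str, float]], keep: int) -> Tuple[list, list]:
    """Split (name, mtime) pairs into (kept, removed), keeping the `keep` newest."""
    ordered = sorted(files, key=lambda f: f[1], reverse=True)
    return ordered[:keep], ordered[keep:]


now = datetime(2025, 1, 8, 12, 0)
last = {
    "status": now - timedelta(minutes=45),
    "delta": now - timedelta(hours=2),
    "full": now - timedelta(days=8),
}
print(due_uploads(now, last))  # ['status', 'full']
```

Calling `prune(remote_files, keep=2)` would mirror the two-file remote cap. The local "one full plus one delta" rule could apply the same function per kind with `keep=1`.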

	Current harvest status from n1.py is tracked in [N1_HARVEST.md](N1_HARVEST.md).