---
library_name: pytorch
tags:
- transformer
- language-model
- long-context
- agillm
- experimental
---

# AGILLM-4

AGILLM-4 is the next training target after AGILLM-3. The current code is a
production-oriented starting point, copied from the proven single-file trainer
and extended for:

- floor preset above 1B parameters (`agillm4_floor`) and ~1.7B main preset (`agillm4_main`) with AR+SAT+NAT heads
- target ratio of 100 tokens per parameter, above the AGILLM-3 training ratio
- longer block-size work on 24GB, B200, and B300 class GPUs
- AR+SAT+NAT training, with sequential backward to reduce peak VRAM
- SDPA and experimental sublinear local+landmark attention backends
- exact M-fold expansion attention harvested from `n1.py`, with local verifier
- fused QKV projection harvested from `n1.py`, with legacy checkpoint loading
- profiling tools for memory, throughput, AR cost, SAT cost, and optimizer cost
- synthetic long-context curriculum generation for recall and multi-hop tests

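As a rough illustration of the 100 tokens per parameter target, the sketch below turns a parameter count into a training-token budget. The parameter counts are round figures read off the preset descriptions above (an assumption; the real presets may differ), and `token_budget` is a hypothetical helper, not part of the trainer.

```python
def token_budget(n_params: int, tokens_per_param: int = 100) -> int:
    """Training-token target for a model with n_params parameters."""
    return n_params * tokens_per_param


# Assumed round parameter counts: ">1B" floor and "~1.7B" main.
presets = {"agillm4_floor": 1_000_000_000, "agillm4_main": 1_700_000_000}
budgets = {name: token_budget(n) for name, n in presets.items()}

for name, tokens in budgets.items():
    print(f"{name}: {tokens / 1e9:.0f}B tokens")
# agillm4_floor: 100B tokens
# agillm4_main: 170B tokens
```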

Start with [AGILLM-4.md](AGILLM-4.md) for the training plan and command
recipes. The current sublinear backend is intentionally experimental: profile it
against SDPA before using it for a real run.


On RTX 4090-class 24GB cards, `run_agillm4_4090_longblock.sh` now defaults to
`agillm4_floor` instead of the AGILLM-3-sized `large` preset, starts at block
size `1280`, and backs off in smaller 20% steps if VRAM is too tight.
For the current v47 seed, launch tmux with
`/workspace/agillm-4/launch_agillm4_4090_floor_from_v47.sh`; it writes
`/workspace/agillm4_floor_train.log`.

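The back-off behaviour can be pictured with a small sketch. This is not the logic of `run_agillm4_4090_longblock.sh`: the function names, the rounding to a multiple of 64, and the floor of 256 are assumptions made for illustration, and `fits` stands in for whatever VRAM check the real script performs. Only the starting block of 1280 and the 20% step come from the text above.

```python
def backoff_blocks(start=1280, step_percent=20, floor=256, multiple=64):
    """Yield candidate block sizes, shrinking by step_percent each time."""
    block = start
    while block >= floor:
        yield block
        # Shrink by 20%, then round down to a multiple of `multiple`
        # (assumed), so the effective step can be slightly larger.
        block = block * (100 - step_percent) // 100 // multiple * multiple


def first_fitting_block(fits, **kwargs):
    """Return the largest candidate for which fits(block) is true, else None."""
    for block in backoff_blocks(**kwargs):
        if fits(block):
            return block
    return None


print(list(backoff_blocks()))
# [1280, 1024, 768, 576, 448, 320, 256]
```

For example, `first_fitting_block(lambda b: b <= 800)` returns `768` under these assumptions.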

The checkpoint upload policy is intentionally bounded to fit the public HF
storage quota: status and log tails upload every 30 minutes, the latest multi-GB
delta uploads at most daily, and full checkpoints upload at most weekly with only
two current remote files retained. Local full saves default to daily and local
retention is one full plus one delta, so the 64GB Vast disk does not slowly fill.

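A minimal sketch of how such a bounded policy could be expressed follows. The intervals and retention counts are the ones stated above; the names `MIN_INTERVAL`, `upload_due`, and `prune` are hypothetical and do not come from the trainer, and the timestamps are made up for the example.

```python
from datetime import datetime, timedelta

# Minimum spacing between uploads, taken from the policy text above.
MIN_INTERVAL = {
    "status": timedelta(minutes=30),
    "delta": timedelta(days=1),
    "full": timedelta(weeks=1),
}


def upload_due(kind, last_upload, now):
    """True if an artifact of this kind may be uploaded at `now`."""
    return last_upload is None or now - last_upload >= MIN_INTERVAL[kind]


def prune(files, keep):
    """Split (name, mtime) pairs into (kept, dropped), keeping the newest."""
    ordered = sorted(files, key=lambda f: f[1], reverse=True)
    return ordered[:keep], ordered[keep:]


now = datetime(2025, 1, 8, 12, 0)
print(upload_due("status", now - timedelta(minutes=45), now))  # True
print(upload_due("full", now - timedelta(days=3), now))        # False
```

Under this sketch, remote retention would call `prune(remote_files, keep=2)`, and local retention would call `prune(..., keep=1)` once for full saves and once for deltas.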

Current harvest status from `n1.py` is tracked in [N1_HARVEST.md](N1_HARVEST.md).