---
library_name: pawn
license: apache-2.0
tags:
  - chess
  - transformer
  - world-model
  - causal-lm
  - next-token-prediction
  - representation-learning
  - pytorch
  - rust
model_name: PAWN-Base
pipeline_tag: other
citation: |
  @software{schweich2026pawn,
    author = {Schweich, Thomas},
    title = {{PAWN}: Playstyle-Agnostic World-model Network for Chess},
    year = {2026},
    url = {https://github.com/thomas-schweich/PAWN},
    license = {Apache-2.0}
  }
model_params: 34651136
d_model: 512
n_layers: 8
n_heads: 8
d_ff: 2048
context_length: 512
vocab_size: 1980
datasets:
  - random-chess-games
language:
  - en
metrics:
  - accuracy
model-index:
  - name: PAWN-Base
    results:
      - task:
          type: next-token-prediction
          name: Chess Move Prediction (Random Games)
        metrics:
          - name: Game Completion Rate
            type: accuracy
            value: 0.989746
          - name: Legal Move Rate
            type: accuracy
            value: 0.999962
          - name: Top-1 Accuracy
            type: accuracy
            value: 0.0857
          - name: Top-5 Accuracy
            type: accuracy
            value: 0.3545
          - name: Val Loss
            type: loss
            value: 2.8679
          - name: Total Training Sequences
            type: other
            value: 51200000
---

# PAWN-Base

**PAWN** (Playstyle-Agnostic World-model Network for Chess) is a causal transformer trained on random chess games. It learns legal moves, board state representations, and game dynamics purely from uniformly random legal move sequences -- no strategic play, no hand-crafted features, no external game databases.

This is the **base (default)** variant (~34.7M parameters). PAWN is designed as a frozen backbone for parameter-efficient finetuning into player models with arbitrary playstyles.

**[GitHub Repository](https://github.com/thomas-schweich/PAWN)** -- full source code, training scripts, adapter implementations, and documentation.

## All Variants

| Variant | Parameters | Link |
|---------|------------|------|
| PAWN-Small | ~9M | [thomas-schweich/pawn-small](https://huggingface.co/thomas-schweich/pawn-small) |
| PAWN-Base | ~35M | [thomas-schweich/pawn-base](https://huggingface.co/thomas-schweich/pawn-base) |
| PAWN-Large | ~67M | [thomas-schweich/pawn-large](https://huggingface.co/thomas-schweich/pawn-large) |

A previous generation of PAWN backbones (`pawn-{small,base,large}-legacy`) used a 4,278-token coordinate vocabulary, a 256-token context window, and outcome conditioning. They are still available on HuggingFace; see [docs/LEGACY.md](https://github.com/thomas-schweich/PAWN/blob/main/docs/LEGACY.md) for the full story.

## Headline Metrics

These come from the published `model.safetensors` (step 195,000 out of 200,000 — the best 5,000-step-cadence checkpoint by val loss), measured on a fresh validation set of random games.

| Metric | Value |
|--------|-------|
| Game completion rate | 98.97% |
| Per-move legal rate | 99.9962% |
| Late-game legal rate | 100.0000% |
| Top-1 accuracy | 8.57% |
| Top-5 accuracy | 35.45% |
| Val loss | 2.868 |
| Val perplexity | 17.60 |
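
The perplexity row is the exponential of the validation loss. The check below assumes the loss is a mean per-token cross-entropy in nats, which is the usual PyTorch convention but is not stated explicitly in this card.

```python
import math

val_loss = 2.868  # validation loss from the table above, assumed to be in nats

# Perplexity is exp(mean cross-entropy): the effective number of
# equally likely next tokens the model is choosing between.
val_perplexity = math.exp(val_loss)

print(f"{val_perplexity:.2f}")  # 17.60
```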

**Game completion rate** is the share of validation games in which *every* prediction along one side's plies was a legal move. The measurement is **non-autoregressive**: at each ply the model is shown the ground-truth history and asked for that side's next move, and an illegal prediction at any ply forfeits the game. Because each prediction is conditioned on the true history, an error does not corrupt subsequent positions. Autoregressive game completion has not been measured for these checkpoints and could be higher or lower; see the [game completion section of the architecture doc](https://github.com/thomas-schweich/PAWN/blob/main/docs/ARCHITECTURE.md#game-completion-rate) for the full definition. Game completion rate is a much stricter metric than per-move legal rate, and it is the main signal that separates the model sizes by capacity.

| Compound-legality detail | Value |
|--------------------------|-------|
| Average plies completed per game | 347 |
| Average % of game completed | 99.27% |
| Median forfeit ply (when forfeit) | 101 |
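
The relationship between the per-move and game-level metrics can be sketched in a few lines. The example assumes per-ply legality flags are already available for the evaluated side of each game; the nested-list input format and the function name are illustrative and are not the repository's API.

```python
from statistics import median


def legality_metrics(games):
    """Compute per-move and game-level legality from per-ply flags.

    `games` is a list of games; each game is a list of booleans, one per
    evaluated ply, that is True when the model's top prediction (given the
    true history) was a legal move. This input format is an assumption
    made for illustration.
    """
    total_plies = sum(len(g) for g in games)
    legal_plies = sum(sum(g) for g in games)

    completed = 0
    forfeit_plies = []
    for g in games:
        if all(g):
            completed += 1
        else:
            # 1-based index of the first illegal prediction in this game,
            # counted among the evaluated plies.
            forfeit_plies.append(g.index(False) + 1)

    return {
        "per_move_legal_rate": legal_plies / total_plies,
        "game_completion_rate": completed / len(games),
        "median_forfeit_ply": median(forfeit_plies) if forfeit_plies else None,
    }


# Toy data: three 100-ply games, one of which has a single illegal prediction.
toy_games = [[True] * 100, [True] * 100, [True] * 49 + [False] + [True] * 50]
metrics = legality_metrics(toy_games)
print(metrics)
```

In this toy case one illegal prediction in 300 leaves the per-move rate at about 99.7% but drops game completion to two thirds, which is the sense in which the game-level metric is stricter.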

### Accuracy ceiling

PAWN is trained on uniformly random chess games. At each position with N legal moves, the next move is drawn uniformly, so the Bayes-optimal predictor that does not know the game's outcome can do no better than 1/N at that position. Averaged over the position distribution induced by random games of up to 512 plies, the top-1 ceiling is **E\[1/N_legal\] ≈ 8.43%** (95% CI \[8.41%, 8.45%\], computed over 50,000 fresh random games — see [docs/ACCURACY_CEILING.md](https://github.com/thomas-schweich/PAWN/blob/main/docs/ACCURACY_CEILING.md)).

This model's top-1 accuracy of **8.57%** is about **102% of that estimated ceiling** (8.57 / 8.43 ≈ 1.017), i.e., essentially at the limit of what any predictor can achieve on this task without outcome information.
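
As a sketch of how the ceiling is defined: if the next move is uniform over N legal moves, any predictor matches it with probability at most 1/N at that position, so the ceiling is the mean of 1/N over visited positions. The legal-move counts below are made-up toy values, not measurements; the actual estimate comes from the procedure described in docs/ACCURACY_CEILING.md.

```python
def top1_ceiling(legal_move_counts):
    """Mean of 1/N over positions: the best achievable top-1 accuracy when
    the target move is drawn uniformly from N legal moves at each position."""
    return sum(1.0 / n for n in legal_move_counts) / len(legal_move_counts)


# Hypothetical legal-move counts for four positions (illustrative only).
toy_counts = [20, 20, 30, 5]
toy_ceiling = top1_ceiling(toy_counts)
print(f"toy ceiling: {toy_ceiling:.4f}")

# Ratio of the reported top-1 accuracy to the reported ceiling estimate.
reported_top1 = 0.0857
reported_ceiling = 0.0843
ratio = reported_top1 / reported_ceiling
print(f"ratio: {ratio:.3f}")
```

Note that this is the mean of 1/N, not 1 over the mean of N: positions with few legal moves contribute disproportionately and pull the ceiling up.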


## Probe Results

Linear probes trained on frozen hidden states measure how well the model's internal representations encode board-level features. The model is never explicitly told about pieces, sides, or rules — these representations emerge purely from next-token prediction on random games.

| Probe | Result | Description |
|-------|--------|-------------|
| Piece type | 91.8% | Per-square piece type (13 classes × 64 squares) |
| Side to move | 100.0% | Whose turn it is |
| Is check | 95.0% | Whether the side to move is in check |
| Castling rights | 99.3% | KQkq castling availability |
| En passant square | 99.9% | En passant target square (64 + none) |
| Material count | R² 0.84 (MAE 4.1) | Piece counts per type per color |
| Legal move count | R² 0.68 (MAE 5.2) | Number of legal moves available |
| Halfmove clock | R² 0.47 (MAE 10.4) | Plies since last capture or pawn move |
| Game phase | 95.8% | Opening / middlegame / endgame |
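
A linear probe in this sense is a single linear layer trained on hidden states while the backbone stays frozen. The sketch below uses synthetic stand-in features and labels rather than real PAWN activations, and its dimensions, optimizer, and step count are arbitrary choices, not the settings behind the table above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-ins: in a real probe, `hidden` would be frozen backbone
# activations and `labels` a board feature such as side to move.
n_samples, d_hidden, n_classes = 2000, 32, 3
hidden = torch.randn(n_samples, d_hidden)
teacher = torch.randn(d_hidden, n_classes)
labels = (hidden @ teacher).argmax(dim=1)  # linearly decodable by construction

train_x, test_x = hidden[:1500], hidden[1500:]
train_y, test_y = labels[:1500], labels[1500:]

# The probe itself: one linear layer, nothing else is trained.
probe = nn.Linear(d_hidden, n_classes)
optimizer = torch.optim.Adam(probe.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

for _ in range(300):
    optimizer.zero_grad()
    loss = loss_fn(probe(train_x), train_y)
    loss.backward()
    optimizer.step()

# Evaluate on held-out samples the probe never saw during training.
with torch.no_grad():
    predictions = probe(test_x).argmax(dim=1)
    probe_accuracy = (predictions == test_y).float().mean().item()
print(f"held-out probe accuracy: {probe_accuracy:.3f}")
```

High held-out accuracy indicates that the feature is linearly readable from the representation. For regression targets such as material count, the same setup would typically swap cross-entropy for a squared-error loss and report R².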


## Diagnostic Results

Edge-case diagnostics measure the model's legal move rate in specific tactical situations.

| Category | Positions | Legal Rate |
|----------|-----------|------------|
| In check | 10,000 | 99.1% |
| Double check | 10,000 | 96.8% |
| Pin restricts movement | 10,000 | 98.9% |
| En passant available | 10,000 | 99.6% |
| Castling legal (kingside) | 10,000 | 99.8% |
| Castling legal (queenside) | 10,000 | 99.7% |
| Castling blocked by check | 10,000 | 99.6% |
| Promotion available | 10,000 | 99.7% |
| Checkmate (terminal) | 10,000 | 91.5% |
| Stalemate (terminal) | 10,000 | 96.7% |


## Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | Decoder-only transformer |
| d_model | 512 |
| Layers | 8 |
| Attention heads | 8 |
| Head dimension | 64 |
| d_ff | 2048 |
| Parameters | ~34.7M |
| Vocabulary | 1,980 tokens (1,968 searchless_chess actions + 1 PAD + 11 outcome tokens) |
| Context length | 512 tokens |
| Normalization | Pre-norm RMSNorm |
| FFN | SwiGLU (4x expansion) |
| Positional encoding | Rotary (RoPE, base 10000) |
| Embeddings | Factored (src + dst + promo) |
| Dropout | 0.0 |
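
The derived quantities in this table can be cross-checked with a little arithmetic. The head dimension and FFN expansion follow directly from the listed values. The full parameter count below additionally assumes bias-free linear layers, one RMSNorm gain vector per sub-layer plus a final norm, an untied output head, and a particular split of the factored embedding tables (64 source squares, 64 destination squares, 5 promotion slots, and 12 special tokens). These assumptions are not confirmed by this card; they are chosen because they reproduce the published total, and the repository remains the authority on the actual layout.

```python
d_model, n_layers, n_heads, d_ff, vocab_size = 512, 8, 8, 2048, 1980

head_dim = d_model // n_heads             # 64, as listed
ffn_expansion = d_ff // d_model           # 4x, as listed
assert vocab_size == 1968 + 1 + 11        # actions + PAD + outcome tokens

# Per-layer weights (assumed bias-free).
attention = 4 * d_model * d_model         # Q, K, V, and output projections
swiglu = 3 * d_model * d_ff               # gate, up, and down projections
norms = 2 * d_model                       # one RMSNorm gain per sub-layer
per_layer = attention + swiglu + norms

# Assumed embedding split: src + dst + promo tables plus special tokens.
embedding_rows = 64 + 64 + 5 + 12
embeddings = embedding_rows * d_model
final_norm = d_model
output_head = vocab_size * d_model        # assumed untied from the embeddings

total_params = n_layers * per_layer + embeddings + final_norm + output_head
print(f"{total_params:,}")  # 34,651,136
```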

## Training Details

| Parameter | Value |
|-----------|-------|
| Training data | On-the-fly uniformly random legal games (no external dataset) |
| Objective | Next-token cross-entropy (non-padding positions only) |
| Outcome conditioning | Disabled (prepend_outcome=False) — pure moves, no outcome leakage |
| Total steps | 200,000 |
| Batch size | 256 |
| Total training sequences | 51,200,000 (= total steps × batch size; the published checkpoint is the best 5K-cadence step by val loss, at step 195,000 ≈ 49,920,000 sequences) |
| Max ply per example | 512 |
| Learning rate | 0.0003 (cosine decay with 10,000-step warmup) |
| Optimizer | AdamW (weight decay 0.01) |
| Precision | Mixed (AMP) |
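
The learning-rate row can be sketched as a schedule function. The card specifies the peak rate, the warmup length, and cosine decay; the linear warmup shape and the decay to exactly zero at the final step are assumptions made here for illustration, and the training script may use a different floor.

```python
import math

PEAK_LR = 3e-4
WARMUP_STEPS = 10_000
TOTAL_STEPS = 200_000
BATCH_SIZE = 256


def learning_rate(step):
    """Linear warmup (assumed) followed by cosine decay to zero (assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 5_000, 10_000, 105_000, 195_000, 200_000):
    print(f"step {step:>7,}: lr = {learning_rate(step):.2e}")

# Sequence counts from the table: steps x batch size.
total_sequences = TOTAL_STEPS * BATCH_SIZE        # full run
published_sequences = 195_000 * BATCH_SIZE        # published checkpoint
print(total_sequences, published_sequences)
```

Under these assumptions the published checkpoint at step 195,000 sits near the end of the decay, where the learning rate is already a small fraction of its peak.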

## Usage

### Loading the model

```python
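import torch
from safetensors.torch import load_file
from pawn.config import CLMConfig
from pawn.model import PAWNCLM

# Use the GPU when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

cfg = CLMConfig.base()
model = PAWNCLM(cfg).to(device).eval()
# Load the published weights directly onto the target device.
weights = load_file("model.safetensors", device=device)
model.load_state_dict(weights)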
```

Or load directly from HuggingFace:

```python
from pawn.checkpoint import load_backbone_weights
from pawn.config import CLMConfig
from pawn.model import PAWNCLM

weights, config = load_backbone_weights("thomas-schweich/pawn-base")
cfg = CLMConfig.base()
model = PAWNCLM(cfg).eval()
model.load_state_dict(weights)
```

### Finetuning with an adapter

```bash
uv run python scripts/train.py --run-type adapter --strategy bottleneck \
    --checkpoint thomas-schweich/pawn-base \
    --pgn thomas-schweich/pawn-lichess-full \
    --bottleneck-dim 32 --lr 1e-4 --local-checkpoints
```
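
For orientation, a bottleneck adapter in the style of Houlsby et al. is a small residual MLP inserted into a frozen network. The sketch below is a generic version sized to match `--bottleneck-dim 32` and this model's `d_model` of 512; the class name, activation, and zero initialization are illustrative choices and not necessarily what the PAWN adapter code does.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Generic residual bottleneck adapter (illustrative, not PAWN's code)."""

    def __init__(self, d_model: int = 512, bottleneck_dim: int = 32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, d_model)
        # Zero-init the up-projection so the adapter starts as the identity
        # and the frozen backbone's behavior is unchanged at step 0.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection around the down-project / up-project path.
        return x + self.up(self.act(self.down(x)))


adapter = BottleneckAdapter()
x = torch.randn(2, 16, 512)  # (batch, sequence, d_model)
y = adapter(x)
n_adapter_params = sum(p.numel() for p in adapter.parameters())
print(y.shape, n_adapter_params)
```

With these sizes each adapter has 33,312 trainable parameters, roughly 0.1% of the ~34.7M-parameter backbone per insertion point.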

## Acknowledgments

PAWN builds on ideas and tools from the following projects and publications:

| Component | Reference |
|-----------|-----------|
| Transformer | [Vaswani et al., "Attention Is All You Need", NeurIPS 2017](https://arxiv.org/abs/1706.03762) |
| RMSNorm | [Zhang & Sennrich, "Root Mean Square Layer Normalization", NeurIPS 2019](https://arxiv.org/abs/1910.07467) |
| RoPE | [Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding", 2021](https://arxiv.org/abs/2104.09864) |
| SwiGLU | [Shazeer, "GLU Variants Improve Transformer", 2020](https://arxiv.org/abs/2002.05202) |
| AdamW | [Loshchilov & Hutter, "Decoupled Weight Decay Regularization", ICLR 2019](https://arxiv.org/abs/1711.05101) |
| Cosine schedule | [Loshchilov & Hutter, "SGDR: Stochastic Gradient Descent with Warm Restarts", ICLR 2017](https://arxiv.org/abs/1608.03983) |
| Mixed precision | [Micikevicius et al., "Mixed Precision Training", ICLR 2018](https://arxiv.org/abs/1710.03740) |
| Bottleneck adapters | [Houlsby et al., "Parameter-Efficient Transfer Learning for NLP", ICML 2019](https://arxiv.org/abs/1902.00751) |
| LoRA | [Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models", ICLR 2022](https://arxiv.org/abs/2106.09685) |
| FiLM | [Perez et al., "FiLM: Visual Reasoning with a General Conditioning Layer", AAAI 2018](https://arxiv.org/abs/1709.07871) |
| RoSA | [Nikdan et al., "RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation", 2024](https://arxiv.org/abs/2401.04679) |
| Linear probes | [Alain & Bengio, "Understanding Intermediate Layers Using Linear Classifier Probes", ICLR Workshop 2017](https://arxiv.org/abs/1610.01644) |
| Searchless Chess (action vocab) | [Ruoss et al., "Amortized Planning with Large-Scale Transformers: A Case Study on Chess", 2024](https://arxiv.org/abs/2402.04494) |
| MAIA | [McIlroy-Young et al., "Aligning Superhuman AI with Human Behavior: Chess as a Model System", KDD 2020](https://arxiv.org/abs/2006.01855) |
| AlphaZero | [Silver et al., "A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play", Science 2018](https://arxiv.org/abs/1712.01815) |
| Leela Chess Zero | [github.com/LeelaChessZero/lc0](https://github.com/LeelaChessZero/lc0) |
| shakmaty | [github.com/niklasf/shakmaty](https://github.com/niklasf/shakmaty) |
| PyO3 | [github.com/PyO3/pyo3](https://github.com/PyO3/pyo3) |
| Lichess | [lichess.org](https://lichess.org/) / [database.lichess.org](https://database.lichess.org/) |

## Citation


```bibtex
@software{schweich2026pawn,
  author = {Schweich, Thomas},
  title = {{PAWN}: Playstyle-Agnostic World-model Network for Chess},
  year = {2026},
  url = {https://github.com/thomas-schweich/PAWN},
  license = {Apache-2.0}
}
```


## License

Apache 2.0. See [LICENSE](https://github.com/thomas-schweich/PAWN/blob/main/LICENSE).