---
tags:
- reinforcement-learning
- game-theory
- codenames
- neurips-2025
- graph-neural-networks
- preference-learning
- llm-distillation
license: mit
---

# Codenames: Graph-Based RL with LLM-Guided Preference Distillation

![Status](https://img.shields.io/badge/status-trained-success) ![Framework](https://img.shields.io/badge/framework-PyTorch-orange) ![License](https://img.shields.io/badge/license-MIT-blue)

This repository contains trained **Codenames agents** developed for the **NeurIPS 2025 MindGames Workshop**. The system combines a structured graph-based reinforcement learning policy with **LLM-guided preference learning and distillation**, targeting improved risk calibration and decision robustness.

---

## Overview

The approach integrates:

- **Graph Neural Networks** for structured board and history representation
- **Proximal Policy Optimization (PPO)** for policy learning
- **Role-conditioned decoding** for spymaster and operative behaviors
- **Rollout-grounded preference learning** using large language models
- **Supervised fine-tuning (SFT)** and **Direct Preference Optimization (DPO)** for teacher alignment
- **Knowledge distillation** from the aligned teacher back into a compact policy

The objective is to improve strategic consistency and reduce catastrophic failures such as assassin selections, while keeping inference fast enough for interactive play.

---

## Game Configuration

- **Game**: Codenames
- **Board size**: 25 words
- **Roles**: Spymaster and Operative
- **Evaluation games**: 600 full episodes
- **Opponents**: Scripted baseline agents

---

## Policy Architecture

### Graph-Based State Encoder

- Heterogeneous graph with **30–40 nodes**
- Node types include:
  - Word nodes with semantic and state features
  - Historical clue nodes
  - A global summary node
- Node feature dimension: **35**
- Encoder:
  - 3 Graph Attention layers
  - 6 attention heads
  - Hidden size 192

### Role Conditioning

- Shared policy trunk
- Role-conditioned action decoding:
  - Clue generation and constraint handling for the spymaster
  - Guess selection and stopping decisions for the operative

### Model Size

- Total parameters: **~6.8M**
- Enables fast inference under competitive time constraints

---

## Training Pipeline

Training follows a multi-stage curriculum:

1. **Graph PPO Pretraining**
   - PPO with clip ratio 0.2
   - Discount factor γ = 0.99
   - GAE λ = 0.95
   - Trained against scripted Codenames agents

2. **Preference Generation via Rollouts** (see the sketch after this list)
   - ~800 intermediate states sampled
   - Candidate actions proposed by:
     - Llama 3.1 Instruct
     - Qwen 2.5 Instruct
   - Each proposal evaluated with multiple stochastic rollouts
   - Higher-return actions labeled as preferred

3. **Teacher Alignment**
   - Supervised fine-tuning on the chosen actions
   - Direct Preference Optimization against a frozen reference model

4. **Policy Distillation**
   - The aligned teacher generates (state, role) → action labels
   - The graph policy is trained via cross-entropy imitation

5. **PPO Refinement**
   - PPO resumes using environment rewards
   - Stabilizes the policy after distillation
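The sketch below illustrates step 2 (preference generation via rollouts) under stated assumptions: `propose_actions`, `rollout_return`, and `label_preferences` are hypothetical names used for illustration only and are not part of this repository's code; the released pipeline may organize this step differently.

```python
"""Minimal sketch of rollout-grounded preference labeling.

Hypothetical helpers: `propose_actions` stands in for the LLM proposers
(Llama 3.1 / Qwen 2.5 Instruct) and `rollout_return` for the game simulator.
"""
from statistics import mean
from typing import Callable, List, Tuple


def label_preferences(
    states: List[object],
    propose_actions: Callable[[object], List[str]],  # LLM-generated candidate actions for a state
    rollout_return: Callable[[object, str], float],  # return of one stochastic rollout after an action
    n_rollouts: int = 8,
) -> List[Tuple[object, str, str]]:
    """Return (state, chosen, rejected) triples for teacher alignment."""
    pairs = []
    for state in states:
        candidates = propose_actions(state)
        if len(candidates) < 2:
            continue  # need at least two candidates to form a preference pair

        # Score each candidate by its mean return over several stochastic rollouts
        scores = {
            action: mean(rollout_return(state, action) for _ in range(n_rollouts))
            for action in candidates
        }

        # Highest-return action becomes "chosen", lowest becomes "rejected"
        ranked = sorted(candidates, key=scores.get, reverse=True)
        pairs.append((state, ranked[0], ranked[-1]))
    return pairs
```

The resulting triples then feed stage 3: the chosen actions are used for SFT, and the full (chosen, rejected) pairs are used for DPO against the frozen reference model.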
## Evaluation Results

Evaluation uses **600 full games** against scripted opponents.

| Agent              | Win Rate | Assassin Rate |
|--------------------|----------|---------------|
| Graph PPO          | 44.8%    | 12.6%         |
| PPO + Distillation | 52.9%    | 6.9%          |

- Distillation yields an **8.1 point** absolute win-rate improvement
- Assassin-triggered losses are reduced by **45%** (12.6% → 6.9%)
- Improvements arise primarily from **better risk calibration**, not from increased guessing aggressiveness

---

## Repository Contents

### Policy Checkpoints

- `policy_models/policy_after_ppo.pt`
- `policy_models/policy_after_distill.pt`

### Teacher Models

- `sft_model/` – supervised fine-tuned teacher
- `dpo_model/` – preference-aligned teacher

### Configuration and Logs

- `master_config.json`
- `evaluation_results.json`

---

## Usage

### Load Policy

```python
import torch

from policy import GraphPolicy

# Instantiate with the same architecture hyperparameters used during training
policy = GraphPolicy(...)
policy.load_state_dict(
    torch.load("policy_models/policy_after_distill.pt", map_location="cpu")
)
policy.eval()
```

### Loading Fine-tuned LLM

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the SFT or DPO teacher
tokenizer = AutoTokenizer.from_pretrained("./sft_model")
model = AutoModelForCausalLM.from_pretrained("./sft_model")

# Use for inference; `prompt` should follow the format used during fine-tuning
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
```

## 🎓 Research Context

This work targets the **NeurIPS 2025 MindGames Workshop** with a focus on:

- Language models provide useful strategic priors when grounded by rollouts
- Graph-based representations enable structured reasoning in semantic games
- Distillation transfers high-level reasoning into efficient, deployable agents

### Key Innovations

1. **Heterogeneous Graph Representation**: Structured graph encoding of Codenames board and clue history
2. **Rollout-Grounded Preference Learning**: LLM proposals labeled by simulated returns rather than model self-evaluation
3. **Multi-scale Representation**: Word-level, clue-level, and game-level embeddings
4. **LLM-to-RL Distillation**: Transferring strategic reasoning to efficient policies

## 📄 License

MIT License. See the LICENSE file for details.

## 🙏 Acknowledgments

- Built for the **NeurIPS 2025 MindGames Workshop**
- Uses PyTorch, HuggingFace Transformers, and PEFT
- Training infrastructure: NVIDIA H200 GPU