LukeDarlow committed on
Commit 68b32f4 · 0 Parent(s):

Welcome to the CTM. This is the first commit of the public repo. Enjoy!

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .gitignore +18 -0
  2. README.md +134 -0
  3. data/custom_datasets.py +324 -0
  4. examples/01_mnist.ipynb +0 -0
  5. models/README.md +7 -0
  6. models/constants.py +10 -0
  7. models/ctm.py +552 -0
  8. models/ctm_qamnist.py +205 -0
  9. models/ctm_rl.py +192 -0
  10. models/ctm_sort.py +126 -0
  11. models/ff.py +75 -0
  12. models/lstm.py +244 -0
  13. models/lstm_qamnist.py +184 -0
  14. models/lstm_rl.py +96 -0
  15. models/modules.py +692 -0
  16. models/resnet.py +374 -0
  17. models/utils.py +122 -0
  18. requirements.txt +15 -0
  19. tasks/image_classification/README.md +29 -0
  20. tasks/image_classification/analysis/README.md +12 -0
  21. tasks/image_classification/analysis/run_imagenet_analysis.py +972 -0
  22. tasks/image_classification/imagenet_classes.py +1007 -0
  23. tasks/image_classification/plotting.py +494 -0
  24. tasks/image_classification/scripts/train_cifar10.sh +286 -0
  25. tasks/image_classification/scripts/train_imagenet.sh +38 -0
  26. tasks/image_classification/train.py +685 -0
  27. tasks/image_classification/train_distributed.py +799 -0
  28. tasks/mazes/README.md +10 -0
  29. tasks/mazes/analysis/README.md +10 -0
  30. tasks/mazes/analysis/run.py +407 -0
  31. tasks/mazes/plotting.py +198 -0
  32. tasks/mazes/scripts/train_ctm.sh +35 -0
  33. tasks/mazes/train.py +698 -0
  34. tasks/mazes/train_distributed.py +782 -0
  35. tasks/parity/README.md +16 -0
  36. tasks/parity/analysis/make_blog_gifs.py +263 -0
  37. tasks/parity/analysis/run.py +269 -0
  38. tasks/parity/plotting.py +896 -0
  39. tasks/parity/scripts/train_ctm_100_50.sh +46 -0
  40. tasks/parity/scripts/train_ctm_10_5.sh +46 -0
  41. tasks/parity/scripts/train_ctm_1_1.sh +46 -0
  42. tasks/parity/scripts/train_ctm_25_10.sh +46 -0
  43. tasks/parity/scripts/train_ctm_50_25.sh +46 -0
  44. tasks/parity/scripts/train_ctm_75_25.sh +46 -0
  45. tasks/parity/scripts/train_lstm_1.sh +39 -0
  46. tasks/parity/scripts/train_lstm_10.sh +39 -0
  47. tasks/parity/scripts/train_lstm_100.sh +39 -0
  48. tasks/parity/scripts/train_lstm_10_certain.sh +40 -0
  49. tasks/parity/scripts/train_lstm_25.sh +39 -0
  50. tasks/parity/scripts/train_lstm_25_certain.sh +40 -0
.gitignore ADDED
@@ -0,0 +1,18 @@
1
+ */__pycache__
2
+ logs
3
+ .DS_Store
4
+ *.png
5
+ *.pdf
6
+ *.gif
7
+ *.out
8
+ *.pyc
9
+ *.env
10
+ *.pt
11
+ *.mp4
12
+ .vscode*
13
+ *outputs*
14
+ data/*
15
+ !assets/*.gif
16
+ !data/custom_datasets.py
17
+ examples/*
18
+ !examples/01_mnist.ipynb
README.md ADDED
@@ -0,0 +1,134 @@
1
+ # 🕰️ The Continuous Thought Machine
2
+
3
+ 📚 [PAPER: Technical Report](https://pub.sakana.ai/ctm/paper) | 📝 [Blog](https://sakana.ai/ctm/) | 🕹️ [Interactive Website](https://pub.sakana.ai/ctm)
4
+
5
+ ![Activations](assets/activations.gif)
6
+
7
+ We present the Continuous Thought Machine (CTM), a model designed to unfold and then leverage neural activity as the underlying mechanism for observation and action. The CTM has two core innovations:
8
+
9
+ 1. Neuron-level temporal processing, where each neuron uses unique weight parameters to process a history of incoming signals, enabling fine-grained temporal dynamics.
10
+
11
+ 2. Neural synchronisation, employed as a direct latent representation for modulating data and producing outputs, thus directly encoding information in the timing of neural activity.
12
+
13
+ We demonstrate the CTM's strong performance and versatility across a range of challenging tasks, including ImageNet classification, solving 2D mazes, sorting, parity computation, question-answering, and RL tasks.
14
+
15
+ We provide all necessary code to reproduce our results and invite others to build upon and use CTMs in their own work.
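+
+ To give a feel for the interface, below is a minimal sketch of constructing a CTM and running a forward pass. The hyperparameters are purely illustrative (not the settings used in the paper; see the scripts under `tasks/` for those), and it assumes the repository root is on your Python path.
+
+ ```python
+ import torch
+ from models.ctm import ContinuousThoughtMachine
+
+ # Illustrative configuration only; NOT the paper's settings.
+ model = ContinuousThoughtMachine(
+     iterations=10,                  # number of internal 'thought' ticks (T in the paper)
+     d_model=128,                    # width of the internal latent space
+     d_input=64,                     # width of the attention/input projection
+     heads=4,
+     n_synch_out=32,                 # neurons used for the output synchronisation representation
+     n_synch_action=32,              # neurons used for the action (attention query) synchronisation
+     synapse_depth=1,                # 1 = single-layer synapse model; >1 uses the U-Net variant
+     memory_length=10,               # history length seen by each neuron-level model (M in the paper)
+     deep_nlms=True,
+     memory_hidden_dims=16,
+     do_layernorm_nlm=False,
+     backbone_type='none',           # feed features directly; resnet backbones are also supported
+     positional_embedding_type='none',
+     out_dims=10,                    # e.g. a 10-way classification head
+ )
+
+ x = torch.randn(2, 3, 32, 32)       # dummy batch
+ predictions, certainties, synchronisation = model(x)
+ print(predictions.shape)            # (batch, out_dims, iterations): one prediction per internal tick
+ ```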
16
+
17
+ ## Repo structure
18
+ ```
19
+ ├── tasks
20
+ │   ├── image_classification
21
+ │   │   ├── train.py # Training code for image classification (cifar, imagenet)
22
+ │   │   ├── imagenet_classes.py # Helper for imagenet class names
23
+ │   │   ├── plotting.py # Plotting utils specific to this task
24
+ │   │   └── analysis
25
+ │   │       ├── run_imagenet_analysis.py # ImageNet eval and visualisation code
26
+ │   │       └── outputs/ # Folder for outputs of analysis
27
+ │   ├── mazes
28
+ │   │   ├── train.py # Training code for solving 2D mazes (by way of a route; see paper)
29
+ │   │   ├── plotting.py # Plotting utils specific to this task
30
+ │   │   └── analysis
31
+ │   │       ├── run.py # Maze analysis code
32
+ │   │       └── outputs/ # Folder for outputs of analysis
33
+ │   ├── sort
34
+ │   │   ├── train.py # Training code for sorting
35
+ │   │   └── utils.py # Sort specific utils (e.g., CTC decode)
36
+ │   ├── parity
37
+ │   │   ├── train.py # Training code for parity task
38
+ │   │   ├── utils.py # Parity-specific helper functions
39
+ │   │   ├── plotting.py # Plotting utils specific to this task
40
+ │   │   ├── scripts/
41
+ │   │   │   └── *.sh # Training scripts for different experimental setups
42
+ │   │   └── analysis/
43
+ │   │       └── run.py # Entry point for parity analysis
44
+ │   ├── qamnist
45
+ │   │   ├── train.py # Training code for the QAMNIST task (question-and-answer MNIST)
46
+ │   │   ├── utils.py # QAMNIST-specific helper functions
47
+ │   │   ├── plotting.py # Plotting utils specific to this task
48
+ │   │   ├── scripts/
49
+ │   │   │   └── *.sh # Training scripts for different experimental setups
50
+ │   │   └── analysis/
51
+ │   │       └── run.py # Entry point for QAMNIST analysis
52
+ │   └── rl
53
+ │      ├── train.py # Training code for RL environments
54
+ │      ├── utils.py # RL-specific helper functions
55
+ │      ├── plotting.py # Plotting utils specific to this task
56
+ │      ├── envs.py # Custom RL environment wrappers
57
+ │      ├── scripts/
58
+ │      │   ├── 4rooms/
59
+ │      │   │   └── *.sh # Training scripts for MiniGrid-FourRooms-v0 environment
60
+ │      │   ├── acrobot/
61
+ │      │   │   └── *.sh # Training scripts for Acrobot-v1 environment
62
+ │      │   └── cartpole/
63
+ │      │       └── *.sh # Training scripts for CartPole-v1 environment
64
+ │      └── analysis/
65
+ │          └── run.py # Entry point for RL analysis
66
+ ├── data # This is where data will be saved and downloaded to
67
+ │   └── custom_datasets.py # Custom datasets (e.g., Mazes, Sort)
68
+ ├── models
69
+ │   ├── ctm.py # Main model code, used for: image classification, solving mazes, sort
70
+ │   ├── ctm_*.py # Other model code, standalone adjustments for other tasks
71
+ │   ├── ff.py # feed-forward (simple) baseline code (e.g., for image classification)
72
+ │   ├── lstm.py # LSTM baseline code (e.g., for image classification)
73
+ │   ├── lstm_*.py # Other baseline code, standalone adjustments for other tasks
74
+ │   ├── modules.py # Helper modules, including Neuron-level models and the Synapse UNET
75
+ │   ├── utils.py # Helper functions (e.g., synch decay)
76
+ │   └── resnet.py # Wrapper for ResNet featuriser
77
+ ├── utils
78
+ │   ├── housekeeping.py # Helper functions for keeping things neat
79
+ │   ├── losses.py # Loss functions for various tasks (mostly with reshaping stuff)
80
+ │   └── schedulers.py # Helper wrappers for learning rate schedulers
81
+ └── checkpoints
82
+    └── imagenet, mazes, ... # Checkpoint directories (see google drive link for files)
83
+
84
+ ```
85
+
86
+ ## Setup
87
+ To set up the environment using conda:
88
+
89
+ ```
90
+ conda create --name=ctm python=3.12
91
+ conda activate ctm
92
+ pip install -r requirements.txt
93
+ ```
94
+
95
+ If there are issues with PyTorch versions, the following can be run:
96
+ ```
97
+ pip uninstall torch
98
+ pip install torch --index-url https://download.pytorch.org/whl/cu121
99
+ ```
100
+
101
+ ## Model training
102
+ Each task has its own (set of) training code. See, for instance, [tasks/image_classification/train.py](tasks/image_classification/train.py). We have set it up like this to prioritise ease of use over clinical efficiency. This code is for researchers, and we hope sharing it in this way fosters collaboration and learning.
103
+
104
+ While we have provided reasonable defaults in the argparsers of each training setup, scripts to replicate the setups in the paper can typically be found in the accompanying script folders. If you simply want to dive in, run the following as a module (set up like this to make it easy to run many high-level training scripts from the top directory):
105
+
106
+ ```
107
+ python -m tasks.image_classification.train
108
+ ```
109
+ For debugging in VSCode, this configuration example might be helpful to you:
110
+ ```
111
+ {
112
+ "name": "Debug: train image classifier",
113
+ "type": "debugpy",
114
+ "request": "launch",
115
+ "module": "tasks.image_classification.train",
116
+ "console": "integratedTerminal",
117
+ "justMyCode": false
118
+ }
119
+ ```
120
+
121
+
122
+ ## Running analyses
123
+
124
+ We also provide analysis and plotting code to replicate many of the plots in our paper. See `tasks/.../analysis/*` for more details. We additionally provide some data (e.g., the mazes we generated for training) and checkpoints (see [here](#checkpoints-and-data)).
125
+
126
+
127
+ ## Checkpoints and data
128
+ You can download the data and checkpoints from here: https://drive.google.com/drive/folders/1f4N0ndIDrRvac5fUnWof33KWhvz8iqo_?usp=drive_link
129
+
130
+ Checkpoints go in the `checkpoints` folder. For instance, when properly populated, the checkpoints folder will have the maze checkpoint in `checkpoints/mazes/...`
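+
+ As a quick sanity check after downloading, something like the following can be used to inspect a checkpoint. The file name and the structure of the saved object here are assumptions; see the relevant training script under `tasks/` for exactly what gets saved.
+
+ ```python
+ import torch
+
+ # Hypothetical path; substitute whichever checkpoint file you downloaded.
+ ckpt = torch.load('checkpoints/mazes/checkpoint.pt', map_location='cpu')
+ print(type(ckpt))
+ if isinstance(ckpt, dict):
+     print(list(ckpt.keys()))  # e.g. a model state dict plus any training metadata that was saved
+ ```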
131
+
132
+
133
+
134
+
data/custom_datasets.py ADDED
@@ -0,0 +1,324 @@
1
+ import torch
2
+ from torchvision.datasets import ImageFolder
3
+ from torch.utils.data import Dataset
4
+ import random
5
+ import numpy as np
6
+ from tqdm.auto import tqdm
7
+ from PIL import Image
8
+ from datasets import load_dataset
9
+
10
+ class SortDataset(Dataset):
11
+ def __init__(self, N):
12
+ self.N = N
13
+ def __len__(self):
14
+ return 10000000
15
+ def __getitem__(self, idx):
16
+ data = torch.zeros(self.N).normal_()
17
+ ordering = torch.argsort(data)
18
+ inputs = data
19
+ return (inputs), (ordering)
20
+
21
+ class QAMNISTDataset(Dataset):
22
+ """A QAMNIST dataset that includes plus and minus operations on MNIST digits."""
23
+ def __init__(self, base_dataset, num_images, num_images_delta, num_repeats_per_input, num_operations, num_operations_delta):
24
+ self.base_dataset = base_dataset
25
+
26
+ self.num_images = num_images
27
+ self.num_images_delta = num_images_delta
28
+ self.num_images_range = self._calculate_num_images_range()
29
+
30
+ self.operators = ["+", "-"]
31
+ self.num_operations = num_operations
32
+ self.num_operations_delta = num_operations_delta
33
+ self.num_operations_range = self._calculate_num_operations_range()
34
+
35
+ self.num_repeats_per_input = num_repeats_per_input
36
+
37
+ self.current_num_digits = num_images
38
+ self.current_num_operations = num_operations
39
+
40
+ self.modulo_base = 10
41
+
42
+ self.output_range = [0, 9]
43
+
44
+ def _calculate_num_images_range(self):
45
+ min_val = self.num_images - self.num_images_delta
46
+ max_val = self.num_images + self.num_images_delta
47
+ assert min_val >= 1, f"Minimum number of images must be at least 1, got {min_val}"
48
+ return [min_val, max_val]
49
+
50
+ def _calculate_num_operations_range(self):
51
+ min_val = self.num_operations - self.num_operations_delta
52
+ max_val = self.num_operations + self.num_operations_delta
53
+ assert min_val >= 1, f"Minimum number of operations must be at least 1, got {min_val}"
54
+ return [min_val, max_val]
55
+
56
+ def set_num_digits(self, num_digits):
57
+ self.current_num_digits = num_digits
58
+
59
+ def set_num_operations(self, num_operations):
60
+ self.current_num_operations = num_operations
61
+
62
+ def _get_target_and_question(self, targets):
63
+ question = []
64
+ equations = []
65
+ num_digits = self.current_num_digits
66
+ num_operations = self.current_num_operations
67
+
68
+ # Select the initial digit
69
+ selection_idx = np.random.randint(num_digits)
70
+ first_digit = targets[selection_idx]
71
+ question.extend([selection_idx] * self.num_repeats_per_input)
72
+ # Set current_value to the initial digit (mod is applied in each operation)
73
+ current_value = first_digit % self.modulo_base
74
+
75
+ # For each operation, build an equation line
76
+ for _ in range(num_operations):
77
+ # Choose the operator ('+' or '-')
78
+ operator_idx = np.random.randint(len(self.operators))
79
+ operator = self.operators[operator_idx]
80
+ encoded_operator = -(operator_idx + 1) # -1 for '+', -2 for '-'
81
+ question.extend([encoded_operator] * self.num_repeats_per_input)
82
+
83
+ # Choose the next digit
84
+ selection_idx = np.random.randint(num_digits)
85
+ digit = targets[selection_idx]
86
+ question.extend([selection_idx] * self.num_repeats_per_input)
87
+
88
+ # Compute the new value with immediate modulo reduction
89
+ if operator == '+':
90
+ new_value = (current_value + digit) % self.modulo_base
91
+ else: # operator is '-'
92
+ new_value = (current_value - digit) % self.modulo_base
93
+
94
+ # Build the equation string for this step
95
+ equations.append(f"({current_value} {operator} {digit}) mod {self.modulo_base} = {new_value}")
96
+ # Update current value for the next operation
97
+ current_value = new_value
98
+
99
+ target = current_value
100
+ question_readable = "\n".join(equations)
101
+ return target, question, question_readable
102
+
103
+ def __len__(self):
104
+ return len(self.base_dataset)
105
+
106
+ def __getitem__(self, idx):
107
+ images, targets = [],[]
108
+ for _ in range(self.current_num_digits):
109
+ image, target = self.base_dataset[np.random.randint(self.__len__())]
110
+ images.append(image)
111
+ targets.append(target)
112
+
113
+ observations = torch.repeat_interleave(torch.stack(images, 0), repeats=self.num_repeats_per_input, dim=0)
114
+ target, question, question_readable = self._get_target_and_question(targets)
115
+ return observations, question, question_readable, target
116
+
117
+ class ImageNet(Dataset):
118
+ def __init__(self, which_split, transform):
119
+ """
120
+ Most simple form of the custom dataset structure.
121
+ Args:
122
+ base_dataset (Dataset): The base dataset to sample from.
123
+ N (int): The number of images to construct into an observable sequence.
124
+ R (int): number of repeats
125
+ operators (list): list of operators from which to sample
126
+ action to take on observations (str): can be 'global' to compute operator over full observations, or 'select_K', where K=integer.
127
+ """
128
+ dataset = load_dataset('imagenet-1k', split=which_split, trust_remote_code=True)
129
+
130
+ self.transform = transform
131
+ self.base_dataset = dataset
132
+
133
+ def __len__(self):
134
+ return len(self.base_dataset)
135
+
136
+ def __getitem__(self, idx):
137
+ data_item = self.base_dataset[idx]
138
+ image = self.transform(data_item['image'].convert('RGB'))
139
+ target = data_item['label']
140
+ return image, target
141
+
142
+ class MazeImageFolder(ImageFolder):
143
+ """
144
+ A custom dataset class that extends the ImageFolder class.
145
+
146
+ Args:
147
+ root (string): Root directory path.
148
+ transform (callable, optional): A function/transform that takes in
149
+ a sample and returns a transformed version.
150
+ E.g, ``transforms.RandomCrop`` for images.
151
+ target_transform (callable, optional): A function/transform that takes
152
+ in the target and transforms it.
153
+ loader (callable, optional): A function to load an image given its path.
154
+ is_valid_file (callable, optional): A function that takes path of an Image file
155
+ and check if the file is a valid file (used to check of corrupt files)
156
+
157
+ Attributes:
158
+ classes (list): List of the class names.
159
+ class_to_idx (dict): Dict with items (class_name, class_index).
160
+ imgs (list): List of (image path, class_index) tuples
161
+ """
162
+
163
+ def __init__(self, root, transform=None, target_transform=None,
164
+ loader=Image.open,
165
+ is_valid_file=None,
166
+ which_set='train',
167
+ augment_p=0.5,
168
+ maze_route_length=10,
169
+ trunc=False,
170
+ expand_range=True):
171
+ super(MazeImageFolder, self).__init__(root, transform, target_transform, loader, is_valid_file)
172
+ self.which_set = which_set
173
+ self.augment_p = augment_p
174
+ self.maze_route_length = maze_route_length
175
+ self.all_paths = {}
176
+ self.trunc = trunc
177
+ self.expand_range = expand_range
178
+
179
+ self._preload()
180
+ print('Solving all mazes...')
181
+ for index in range(len(self.preloaded_samples)):
182
+ path = self.get_solution(self.preloaded_samples[index])
183
+ self.all_paths[index] = path
184
+
185
+ def _preload(self):
186
+ preloaded_samples = []
187
+ with tqdm(total=self.__len__(), initial=0, leave=True, position=0, dynamic_ncols=True) as pbar:
188
+
189
+ for index in range(self.__len__()):
190
+ pbar.set_description('Loading mazes')
191
+ path, target = self.samples[index]
192
+ sample = self.loader(path)
193
+ sample = np.array(sample).astype(np.float32)/255
194
+ preloaded_samples.append(sample)
195
+ pbar.update(1)
196
+ if self.trunc and index == 999: break
197
+ self.preloaded_samples = preloaded_samples
198
+
199
+ def __len__(self):
200
+ if hasattr(self, 'preloaded_samples') and self.preloaded_samples is not None:
201
+ return len(self.preloaded_samples)
202
+ else:
203
+ return super().__len__()
204
+
205
+ def get_solution(self, x):
206
+ x = np.copy(x)
207
+ # Find start (red) and end (green) pixel coordinates
208
+ start_coords = np.argwhere((x == [1, 0, 0]).all(axis=2))
209
+ end_coords = np.argwhere((x == [0, 1, 0]).all(axis=2))
210
+
211
+ if len(start_coords) == 0 or len(end_coords) == 0:
212
+ print("Start or end point not found.")
213
+ return None
214
+
215
+ start_y, start_x = start_coords[0]
216
+ end_y, end_x = end_coords[0]
217
+
218
+ current_y, current_x = start_y, start_x
219
+ path = [4] * self.maze_route_length
220
+
221
+ pi = 0
222
+ while (current_y, current_x) != (end_y, end_x):
223
+ next_y, next_x = -1, -1 # Initialize to invalid coordinates
224
+ direction = -1 # Initialize to an invalid direction
225
+
226
+
227
+ # Check Up
228
+ if current_y > 0 and ((x[current_y - 1, current_x] == [0, 0, 1]).all() or (x[current_y - 1, current_x] == [0, 1, 0]).all()):
229
+ next_y, next_x = current_y - 1, current_x
230
+ direction = 0
231
+
232
+ # Check Down
233
+ elif current_y < x.shape[0] - 1 and ((x[current_y + 1, current_x] == [0, 0, 1]).all() or (x[current_y + 1, current_x] == [0, 1, 0]).all()):
234
+ next_y, next_x = current_y + 1, current_x
235
+ direction = 1
236
+
237
+ # Check Left
238
+ elif current_x > 0 and ((x[current_y, current_x - 1] == [0, 0, 1]).all() or (x[current_y, current_x - 1] == [0, 1, 0]).all()):
239
+ next_y, next_x = current_y, current_x - 1
240
+ direction = 2
241
+
242
+ # Check Right
243
+ elif current_x < x.shape[1] - 1 and ((x[current_y, current_x + 1] == [0, 0, 1]).all() or (x[current_y, current_x + 1] == [0, 1, 0]).all()):
244
+ next_y, next_x = current_y, current_x + 1
245
+ direction = 3
246
+
247
+
248
+ path[pi] = direction
249
+ pi += 1
250
+
251
+ x[current_y, current_x] = [255,255,255] # mark the current as white to avoid going in circles
252
+ current_y, current_x = next_y, next_x
253
+ if pi == len(path):
254
+ break
255
+
256
+ return np.array(path)
257
+
258
+ def __getitem__(self, index):
259
+ """
260
+ Args:
261
+ index (int): Index
262
+
263
+ Returns:
264
+ tuple: (sample, target) where target is class_index of the target class.
265
+ """
266
+
267
+ sample = np.copy(self.preloaded_samples[index])
268
+
269
+ path = np.copy(self.all_paths[index])
270
+
271
+ if self.which_set == 'train':
272
+ # Randomly rotate -90 or +90 degrees
273
+ if random.random() < self.augment_p:
274
+ which_rot = random.choice([-1, 1])
275
+ sample = np.rot90(sample, k=which_rot, axes=(0, 1))
276
+ for pi in range(len(path)):
277
+ if path[pi] == 0: path[pi] = 3 if which_rot == -1 else 2
278
+ elif path[pi] == 1: path[pi] = 2 if which_rot == -1 else 3
279
+ elif path[pi] == 2: path[pi] = 0 if which_rot == -1 else 1
280
+ elif path[pi] == 3: path[pi] = 1 if which_rot == -1 else 0
281
+
282
+
283
+ # Random horizontal flip
284
+ if random.random() < self.augment_p:
285
+ sample = np.fliplr(sample)
286
+ for pi in range(len(path)):
287
+ if path[pi] == 2: path[pi] = 3
288
+ elif path[pi] == 3: path[pi] = 2
289
+
290
+
291
+ # Random vertical flip
292
+ if random.random() < self.augment_p:
293
+ sample = np.flipud(sample)
294
+ for pi in range(len(path)):
295
+ if path[pi] == 0: path[pi] = 1
296
+ elif path[pi] == 1: path[pi] = 0
297
+
298
+ sample = torch.from_numpy(np.copy(sample)).permute(2,0,1)
299
+
300
+ blue_mask = (sample[0] == 0) & (sample[1] == 0) & (sample[2] == 1)
301
+
302
+ sample[:, blue_mask] = 1
303
+ target = path
304
+
305
+
306
+ if not self.expand_range:
307
+ return sample, target
308
+ return (sample*2)-1, (target)
309
+
310
+ class ParityDataset(Dataset):
311
+ def __init__(self, sequence_length=64, length=100000):
312
+ self.sequence_length = sequence_length
313
+ self.length = length
314
+
315
+ def __len__(self):
316
+ return self.length
317
+
318
+ def __getitem__(self, idx):
319
+ vector = 2 * torch.randint(0, 2, (self.sequence_length,)) - 1
320
+ vector = vector.float()
321
+ negatives = (vector == -1).to(torch.long)
322
+ cumsum = torch.cumsum(negatives, dim=0)
323
+ target = (cumsum % 2 != 0).to(torch.long)
324
+ return vector, target
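
For reference, a minimal sketch of how a couple of these datasets might be consumed (illustrative sizes only; the scripts under `tasks/` contain the settings actually used, and this assumes it is run from the repository root):

```python
from torch.utils.data import DataLoader
from data.custom_datasets import ParityDataset, SortDataset

# Illustrative sizes only.
parity_loader = DataLoader(ParityDataset(sequence_length=64, length=1000), batch_size=32)
vectors, targets = next(iter(parity_loader))   # vectors in {-1, +1}; targets are cumulative parities
print(vectors.shape, targets.shape)            # torch.Size([32, 64]) torch.Size([32, 64])

sort_loader = DataLoader(SortDataset(N=15), batch_size=32)
values, orderings = next(iter(sort_loader))    # orderings are the argsort of the sampled values
print(values.shape, orderings.shape)           # torch.Size([32, 15]) torch.Size([32, 15])
```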
examples/01_mnist.ipynb ADDED
The diff for this file is too large to render.
 
models/README.md ADDED
@@ -0,0 +1,7 @@
1
+ # Continuous Thought Machines
2
+ ## Models
3
+
4
+ This folder contains all model-related code.
5
+
6
+ Some notes for clarity:
7
+ 1. The resnet structure we used (see resnet.py) has a few minor changes that enable constraining the receptive field of the yielded features. We do this because we want the CTM (or baseline methods) to learn a process whereby they gather information. Neural networks trained with SGD will find the [path of least resistance](https://era.ed.ac.uk/handle/1842/39606), even if that path doesn't result in genuinely intelligent behaviour. Constraining the receptive field helps to mitigate this.
models/constants.py ADDED
@@ -0,0 +1,10 @@
1
+ VALID_NEURON_SELECT_TYPES = ['first-last', 'random', 'random-pairing']
2
+
3
+ VALID_BACKBONE_TYPES = [
4
+ f'resnet{depth}-{i}' for depth in [18, 34, 50, 101, 152] for i in range(1, 5)
5
+ ] + ['shallow-wide', 'parity_backbone']
6
+
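+ # For example, the comprehension above expands to strings such as 'resnet18-1', 'resnet34-2',
+ # 'resnet50-3' and 'resnet152-4', where the suffix selects which backbone stage (and hence
+ # feature width; see get_d_backbone in models/ctm.py) is used, in addition to the non-resnet
+ # options 'shallow-wide' and 'parity_backbone'.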
7
+ VALID_POSITIONAL_EMBEDDING_TYPES = [
8
+ 'learnable-fourier', 'multi-learnable-fourier',
9
+ 'custom-rotational', 'custom-rotational-1d'
10
+ ]
models/ctm.py ADDED
@@ -0,0 +1,552 @@
1
+ import torch.nn as nn
2
+ import torch
3
+ import numpy as np
4
+ import math
5
+
6
+ from models.modules import ParityBackbone, SynapseUNET, Squeeze, SuperLinear, LearnableFourierPositionalEncoding, MultiLearnableFourierPositionalEncoding, CustomRotationalEmbedding, CustomRotationalEmbedding1D, ShallowWide
7
+ from models.resnet import prepare_resnet_backbone
8
+ from models.utils import compute_normalized_entropy
9
+
10
+ from models.constants import (
11
+ VALID_NEURON_SELECT_TYPES,
12
+ VALID_BACKBONE_TYPES,
13
+ VALID_POSITIONAL_EMBEDDING_TYPES
14
+ )
15
+
16
+ class ContinuousThoughtMachine(nn.Module):
17
+ """
18
+ Continuous Thought Machine (CTM).
19
+
20
+ Technical report: TODO:LINK
21
+
22
+ Technical report (web version): TODO:LINK
23
+
24
+ Blog: TODO:LINK
25
+
26
+ Thought takes time and reasoning is a process.
27
+
28
+ The CTM consists of three main ideas:
29
+ 1. The use of internal recurrence, enabling a dimension over which a concept analogous to thought can occur.
30
+ 2. Neuron-level models, which compute post-activations by applying private (i.e., per-neuron) MLP
31
+ models to a history of incoming pre-activations.
32
+ 3. Synchronisation as representation, where the neural activity over time is tracked and used to compute how
33
+ pairs of neurons synchronise with one another over time. This measure of synchronisation is the representation
34
+ with which the CTM takes action and makes predictions.
35
+
36
+
37
+ Args:
38
+ iterations (int): Number of internal 'thought' ticks (T, in paper).
39
+ d_model (int): Core dimensionality of the CTM's latent space (D, in paper).
40
+ NOTE: Note that this is NOT the representation used for action or prediction, but rather that which
41
+ is fully internal to the model and not directly connected to data.
42
+ d_input (int): Dimensionality of projected attention outputs or direct input features.
43
+ heads (int): Number of attention heads.
44
+ n_synch_out (int): Number of neurons used for output synchronisation (D_out, in paper).
45
+ n_synch_action (int): Number of neurons used for action/attention synchronisation (D_action, in paper).
46
+ synapse_depth (int): Depth of the synapse model (U-Net if > 1, else MLP).
47
+ memory_length (int): History length for Neuron-Level Models (M, in paper).
48
+ deep_nlms (bool): Use deeper (2-layer) NLMs if True, else linear.
49
+ NOTE: we almost always use deep NLMs, but a linear NLM is faster.
50
+ memory_hidden_dims (int): Hidden dimension size for deep NLMs.
51
+ do_layernorm_nlm (bool): Apply LayerNorm within NLMs.
52
+ NOTE: we never set this to true in the paper. If you set this to true you will get strange behaviour,
53
+ but you can potentially encourage more periodic behaviour in the dynamics. Untested; be careful.
54
+ backbone_type (str): Type of feature extraction backbone (e.g., 'resnet18-2', 'none').
55
+ positional_embedding_type (str): Type of positional embedding for backbone features.
56
+ out_dims (int): Output dimension size.
57
+ NOTE: projected from synchronisation!
58
+ prediction_reshaper (list): Shape for reshaping predictions before certainty calculation (task-specific).
59
+ NOTE: this is used to compute certainty and is needed when applying softmax for probabilities
60
+ dropout (float): Dropout rate.
61
+ neuron_select_type (str): Neuron selection strategy ('first-last', 'random', 'random-pairing').
62
+ NOTE: some of this is legacy from our experimentation, but all three strategies are valid and useful.
63
+ We delineate exactly which strategies we use per experiment in the paper.
64
+ - first-last: build a 'dense' sync matrix for output from the first D_out neurons and action from the
65
+ last D_action neurons. Flatten this matrix into the synchronisation representation.
66
+ This approach shares relationships for neurons and bottlenecks the gradients through them.
67
+ NOTE: the synchronisation size will be (D_out/action * (D_out/action + 1))/2
68
+ - random: randomly select D_out neurons for the 'i' side pairings, and also D_out for the 'j' side pairings,
69
+ also pairing those across densely, resulting in a bottleneck roughly 2x as wide.
70
+ NOTE: the synchronisation size will be (D_out/action * (D_out/action + 1))/2
71
+ - random-pairing (DEFAULT!): randomly select D_out neurons and pair these with another D_out neurons.
72
+ This results in much less bottlenecking and is the most up-to-date variant.
73
+ NOTE: the synchronisation size will be D_out in this case; better control.
74
+ n_random_pairing_self (int): Number of neurons to select for self-to-self synch when random-pairing is used.
75
+ NOTE: when using random-pairing, i-to-i (self) synchronisation is rare, meaning that 'recovering a
76
+ snapshot representation' (see paper) is difficult. This alleviates that.
77
+ NOTE: works fine when set to 0.
78
+ """
79
+
80
+ def __init__(self,
81
+ iterations,
82
+ d_model,
83
+ d_input,
84
+ heads,
85
+ n_synch_out,
86
+ n_synch_action,
87
+ synapse_depth,
88
+ memory_length,
89
+ deep_nlms,
90
+ memory_hidden_dims,
91
+ do_layernorm_nlm,
92
+ backbone_type,
93
+ positional_embedding_type,
94
+ out_dims,
95
+ prediction_reshaper=[-1],
96
+ dropout=0,
97
+ dropout_nlm=None,
98
+ neuron_select_type='random-pairing',
99
+ n_random_pairing_self=0,
100
+ ):
101
+ super(ContinuousThoughtMachine, self).__init__()
102
+
103
+ # --- Core Parameters ---
104
+ self.iterations = iterations
105
+ self.d_model = d_model
106
+ self.d_input = d_input
107
+ self.memory_length = memory_length
108
+ self.prediction_reshaper = prediction_reshaper
109
+ self.n_synch_out = n_synch_out
110
+ self.n_synch_action = n_synch_action
111
+ self.backbone_type = backbone_type
112
+ self.out_dims = out_dims
113
+ self.positional_embedding_type = positional_embedding_type
114
+ self.neuron_select_type = neuron_select_type
115
+ self.memory_length = memory_length
116
+ dropout_nlm = dropout if dropout_nlm is None else dropout_nlm
117
+
118
+ # --- Assertions ---
119
+ self.verify_args()
120
+
121
+ # --- Input Processing ---
122
+ d_backbone = self.get_d_backbone()
123
+ self.set_initial_rgb()
124
+ self.set_backbone()
125
+ self.positional_embedding = self.get_positional_embedding(d_backbone)
126
+ self.kv_proj = nn.Sequential(nn.LazyLinear(self.d_input), nn.LayerNorm(self.d_input)) if heads else None
127
+ self.q_proj = nn.LazyLinear(self.d_input) if heads else None
128
+ self.attention = nn.MultiheadAttention(self.d_input, heads, dropout, batch_first=True) if heads else None
129
+
130
+ # --- Core CTM Modules ---
131
+ self.synapses = self.get_synapses(synapse_depth, d_model, dropout)
132
+ self.trace_processor = self.get_neuron_level_models(deep_nlms, do_layernorm_nlm, memory_length, memory_hidden_dims, d_model, dropout_nlm)
133
+
134
+ # --- Start States ---
135
+ self.register_parameter('start_activated_state', nn.Parameter(torch.zeros((d_model)).uniform_(-math.sqrt(1/(d_model)), math.sqrt(1/(d_model)))))
136
+ self.register_parameter('start_trace', nn.Parameter(torch.zeros((d_model, memory_length)).uniform_(-math.sqrt(1/(d_model+memory_length)), math.sqrt(1/(d_model+memory_length)))))
137
+
138
+ # --- Synchronisation ---
139
+ self.neuron_select_type_out, self.neuron_select_type_action = self.get_neuron_select_type()
140
+ self.synch_representation_size_action = self.calculate_synch_representation_size(self.n_synch_action)
141
+ self.synch_representation_size_out = self.calculate_synch_representation_size(self.n_synch_out)
142
+
143
+ for synch_type, size in (('action', self.synch_representation_size_action), ('out', self.synch_representation_size_out)):
144
+ print(f"Synch representation size {synch_type}: {size}")
145
+ if self.synch_representation_size_action: # if not zero
146
+ self.set_synchronisation_parameters('action', self.n_synch_action, n_random_pairing_self)
147
+ self.set_synchronisation_parameters('out', self.n_synch_out, n_random_pairing_self)
148
+
149
+ # --- Output Processing ---
150
+ self.output_projector = nn.Sequential(nn.LazyLinear(self.out_dims))
151
+
152
+ # --- Core CTM Methods ---
153
+
154
+ def compute_synchronisation(self, activated_state, decay_alpha, decay_beta, r, synch_type):
155
+ """
156
+ Computes synchronisation to be used as a vector representation.
157
+
158
+ A neuron has what we call a 'trace', which is a history (time series) that changes with internal
159
+ recurrence, i.e., it gets longer with every internal tick. There are pre-activation traces
160
+ that are used in the NLMs and post-activation traces that, in theory, are used in this method.
161
+
162
+ We define synchronisation between neurons i and j as the dot product between their respective
163
+ time series. Since there can be many internal ticks, this process can be quite compute heavy as it
164
+ involves many dot products that repeat computation at each step.
165
+
166
+ Therefore, in practice, we update the synchronisation based on the current post-activations,
167
+ which we call the 'activated state' here. This is possible because the inputs to synchronisation
168
+ are only updated recurrently at each step, meaning that there is a linear recurrence we can
169
+ leverage.
170
+
171
+ See Appendix TODO of the Technical Report (TODO:LINK) for the maths that enables this method.
172
+ """
173
+
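+ # Sketch of the recurrence implemented below: with a learned decay rate r in (0, 1],
+ # alpha_t = r * alpha_{t-1} + z_i(t) * z_j(t) and beta_t = r * beta_{t-1} + 1, so that
+ # synchronisation = alpha_t / sqrt(beta_t) is a decay-weighted dot product of the post-activation
+ # histories of neurons i and j, normalised by the square root of the effective history length.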
174
+ if synch_type == 'action': # Get action parameters
175
+ n_synch = self.n_synch_action
176
+ neuron_indices_left = self.action_neuron_indices_left
177
+ neuron_indices_right = self.action_neuron_indices_right
178
+ elif synch_type == 'out': # Get input parameters
179
+ n_synch = self.n_synch_out
180
+ neuron_indices_left = self.out_neuron_indices_left
181
+ neuron_indices_right = self.out_neuron_indices_right
182
+
183
+ if self.neuron_select_type in ('first-last', 'random'):
184
+ # For first-last and random, we compute the pairwise sync between all selected neurons
185
+ if self.neuron_select_type == 'first-last':
186
+ if synch_type == 'action': # Use last n_synch neurons for action
187
+ selected_left = selected_right = activated_state[:, -n_synch:]
188
+ elif synch_type == 'out': # Use first n_synch neurons for out
189
+ selected_left = selected_right = activated_state[:, :n_synch]
190
+ else: # Use the randomly selected neurons
191
+ selected_left = activated_state[:, neuron_indices_left]
192
+ selected_right = activated_state[:, neuron_indices_right]
193
+
194
+ # Compute outer product of selected neurons
195
+ outer = selected_left.unsqueeze(2) * selected_right.unsqueeze(1)
196
+ # Resulting matrix is symmetric, so we only need the upper triangle
197
+ i, j = torch.triu_indices(n_synch, n_synch)
198
+ pairwise_product = outer[:, i, j]
199
+
200
+ elif self.neuron_select_type == 'random-pairing':
201
+ # For random-pairing, we compute the sync between specific pairs of neurons
202
+ left = activated_state[:, neuron_indices_left]
203
+ right = activated_state[:, neuron_indices_right]
204
+ pairwise_product = left * right
205
+ else:
206
+ raise ValueError("Invalid neuron selection type")
207
+
208
+
209
+
210
+ # Compute synchronisation recurrently
211
+ if decay_alpha is None or decay_beta is None:
212
+ decay_alpha = pairwise_product
213
+ decay_beta = torch.ones_like(pairwise_product)
214
+ else:
215
+ decay_alpha = r * decay_alpha + pairwise_product
216
+ decay_beta = r * decay_beta + 1
217
+
218
+ synchronisation = decay_alpha / (torch.sqrt(decay_beta))
219
+ return synchronisation, decay_alpha, decay_beta
220
+
221
+ def compute_features(self, x):
222
+ """
223
+ Compute the key-value features from the input data using the backbone.
224
+ """
225
+ initial_rgb = self.initial_rgb(x)
226
+ self.kv_features = self.backbone(initial_rgb)
227
+ pos_emb = self.positional_embedding(self.kv_features)
228
+ combined_features = (self.kv_features + pos_emb).flatten(2).transpose(1, 2)
229
+ kv = self.kv_proj(combined_features)
230
+ return kv
231
+
232
+ def compute_certainty(self, current_prediction):
233
+ """
234
+ Compute the certainty of the current prediction.
235
+
236
+ We define certainty as 1 minus the normalised entropy.
237
+
238
+ For legacy reasons we stack that in a 2D vector as this can be used for optimisation later.
239
+ """
240
+ B = current_prediction.size(0)
241
+ reshaped_pred = current_prediction.reshape([B] + self.prediction_reshaper)
242
+ ne = compute_normalized_entropy(reshaped_pred)
243
+ current_certainty = torch.stack((ne, 1-ne), -1)
244
+ return current_certainty
245
+
246
+ # --- Setup Methods ---
247
+
248
+ def set_initial_rgb(self):
249
+ """
250
+ This is largely to accommodate training on greyscale images and is legacy, but it
251
+ doesn't hurt the model in any way that we can tell.
252
+ """
253
+ if 'resnet' in self.backbone_type:
254
+ self.initial_rgb = nn.LazyConv2d(3, 1, 1) # Adapts input channels lazily
255
+ else:
256
+ self.initial_rgb = nn.Identity()
257
+
258
+ def get_d_backbone(self):
259
+ """
260
+ Get the dimensionality of the backbone output, to be used for positional embedding setup.
261
+
262
+ This is a little bit complicated for resnets, but the logic should be easy enough to read below.
263
+ """
264
+ if self.backbone_type == 'shallow-wide':
265
+ return 2048
266
+ elif self.backbone_type == 'parity_backbone':
267
+ return self.d_input
268
+ elif 'resnet' in self.backbone_type:
269
+ if '18' in self.backbone_type or '34' in self.backbone_type:
270
+ if self.backbone_type.split('-')[1]=='1': return 64
271
+ elif self.backbone_type.split('-')[1]=='2': return 128
272
+ elif self.backbone_type.split('-')[1]=='3': return 256
273
+ elif self.backbone_type.split('-')[1]=='4': return 512
274
+ else:
275
+ raise NotImplementedError
276
+ else:
277
+ if self.backbone_type.split('-')[1]=='1': return 256
278
+ elif self.backbone_type.split('-')[1]=='2': return 512
279
+ elif self.backbone_type.split('-')[1]=='3': return 1024
280
+ elif self.backbone_type.split('-')[1]=='4': return 2048
281
+ else:
282
+ raise NotImplementedError
283
+ elif self.backbone_type == 'none':
284
+ return None
285
+ else:
286
+ raise ValueError(f"Invalid backbone_type: {self.backbone_type}")
287
+
288
+ def set_backbone(self):
289
+ """
290
+ Set the backbone module based on the specified type.
291
+ """
292
+ if self.backbone_type == 'shallow-wide':
293
+ self.backbone = ShallowWide()
294
+ elif self.backbone_type == 'parity_backbone':
295
+ d_backbone = self.get_d_backbone()
296
+ self.backbone = ParityBackbone(n_embeddings=2, d_embedding=d_backbone)
297
+ elif 'resnet' in self.backbone_type:
298
+ self.backbone = prepare_resnet_backbone(self.backbone_type)
299
+ elif self.backbone_type == 'none':
300
+ self.backbone = nn.Identity()
301
+ else:
302
+ raise ValueError(f"Invalid backbone_type: {self.backbone_type}")
303
+
304
+ def get_positional_embedding(self, d_backbone):
305
+ """
306
+ Get the positional embedding module.
307
+
308
+ For ImageNet and mazes we used NO positional embedding, and largely don't think
309
+ that it is necessary as the CTM can build up its own internal world model when
310
+ observing.
311
+
312
+ LearnableFourierPositionalEncoding:
313
+ Implements Algorithm 1 from "Learnable Fourier Features for Multi-Dimensional
314
+ Spatial Positional Encoding" (https://arxiv.org/pdf/2106.02795.pdf).
315
+ Provides positional information for 2D feature maps.
316
+
317
+ (MultiLearnableFourierPositionalEncoding uses multiple feature scales)
318
+
319
+ CustomRotationalEmbedding:
320
+ Simple sinusoidal embedding to encourage interpretability
321
+ """
322
+ if self.positional_embedding_type == 'learnable-fourier':
323
+ return LearnableFourierPositionalEncoding(d_backbone, gamma=1 / 2.5)
324
+ elif self.positional_embedding_type == 'multi-learnable-fourier':
325
+ return MultiLearnableFourierPositionalEncoding(d_backbone)
326
+ elif self.positional_embedding_type == 'custom-rotational':
327
+ return CustomRotationalEmbedding(d_backbone)
328
+ elif self.positional_embedding_type == 'custom-rotational-1d':
329
+ return CustomRotationalEmbedding1D(d_backbone)
330
+ elif self.positional_embedding_type == 'none':
331
+ return lambda x: 0 # Default no-op
332
+ else:
333
+ raise ValueError(f"Invalid positional_embedding_type: {self.positional_embedding_type}")
334
+
335
+ def get_neuron_level_models(self, deep_nlms, do_layernorm_nlm, memory_length, memory_hidden_dims, d_model, dropout):
336
+ """
337
+ Neuron level models are one of the core innovations of the CTM. They apply separate MLPs/linears to
338
+ each neuron.
339
+ NOTE: the name 'SuperLinear' is largely legacy, but its purpose is to apply separate linear layers
340
+ per neuron. It is sort of a 'grouped linear' function, where the group size is equal to 1.
341
+ One could make the group size bigger and use fewer parameters, but that is future work.
342
+
343
+ NOTE: We used GLU() nonlinearities because they worked well in practice.
344
+ """
345
+ if deep_nlms:
346
+ return nn.Sequential(
347
+ nn.Sequential(
348
+ SuperLinear(in_dims=memory_length, out_dims=2 * memory_hidden_dims, N=d_model,
349
+ do_norm=do_layernorm_nlm, dropout=dropout),
350
+ nn.GLU(),
351
+ SuperLinear(in_dims=memory_hidden_dims, out_dims=2, N=d_model,
352
+ do_norm=do_layernorm_nlm, dropout=dropout),
353
+ nn.GLU(),
354
+ Squeeze(-1)
355
+ )
356
+ )
357
+ else:
358
+ return nn.Sequential(
359
+ nn.Sequential(
360
+ SuperLinear(in_dims=memory_length, out_dims=2, N=d_model,
361
+ do_norm=do_layernorm_nlm, dropout=dropout),
362
+ nn.GLU(),
363
+ Squeeze(-1)
364
+ )
365
+ )
366
+
367
+ def get_synapses(self, synapse_depth, d_model, dropout):
368
+ """
369
+ The synapse model is the recurrent model in the CTM. Its purpose is to share information
370
+ across neurons. If using a depth of 1, this is just a simple single layer with a nonlinearity and layernorm.
371
+ For deeper synapse models we use a U-NET structure with many skip connections. In practice this performs
372
+ better as it enables multi-level information mixing.
373
+
374
+ The intuition with having a deep UNET model for synapses is that the action of synaptic connections is
375
+ not necessarily a linear one, and that approximating a synapse 'update' step in the brain is non-trivial.
376
+ Hence, we set it up so that the CTM can learn some complex internal rule instead of trying to approximate
377
+ it ourselves.
378
+ """
379
+ if synapse_depth == 1:
380
+ return nn.Sequential(
381
+ nn.Dropout(dropout),
382
+ nn.LazyLinear(d_model * 2),
383
+ nn.GLU(),
384
+ nn.LayerNorm(d_model)
385
+ )
386
+ else:
387
+ return SynapseUNET(d_model, synapse_depth, 16, dropout) # hard-coded minimum width of 16; future work TODO.
388
+
389
+ def set_synchronisation_parameters(self, synch_type: str, n_synch: int, n_random_pairing_self: int = 0):
390
+ """
391
+ 1. Set the buffers for selecting neurons so that these indices are saved into the model state_dict.
392
+ 2. Set the parameters for learnable exponential decay when computing synchronisation between all
393
+ neurons.
394
+ """
395
+ assert synch_type in ('out', 'action'), f"Invalid synch_type: {synch_type}"
396
+ left, right = self.initialize_left_right_neurons(synch_type, self.d_model, n_synch, n_random_pairing_self)
397
+ synch_representation_size = self.synch_representation_size_action if synch_type == 'action' else self.synch_representation_size_out
398
+ self.register_buffer(f'{synch_type}_neuron_indices_left', left)
399
+ self.register_buffer(f'{synch_type}_neuron_indices_right', right)
400
+ self.register_parameter(f'decay_params_{synch_type}', nn.Parameter(torch.zeros(synch_representation_size), requires_grad=True))
401
+
402
+ def initialize_left_right_neurons(self, synch_type, d_model, n_synch, n_random_pairing_self=0):
403
+ """
404
+ Initialize the left and right neuron indices based on the neuron selection type.
405
+ This complexity is owing to legacy experiments, but we maintain that these types of
406
+ neuron selection are interesting to experiment with.
407
+ """
408
+ if self.neuron_select_type=='first-last':
409
+ if synch_type == 'out':
410
+ neuron_indices_left = neuron_indices_right = torch.arange(0, n_synch)
411
+ elif synch_type == 'action':
412
+ neuron_indices_left = neuron_indices_right = torch.arange(d_model-n_synch, d_model)
413
+
414
+ elif self.neuron_select_type=='random':
415
+ neuron_indices_left = torch.from_numpy(np.random.choice(np.arange(d_model), size=n_synch))
416
+ neuron_indices_right = torch.from_numpy(np.random.choice(np.arange(d_model), size=n_synch))
417
+
418
+ elif self.neuron_select_type=='random-pairing':
419
+ assert n_synch > n_random_pairing_self, f"Need at least {n_random_pairing_self} pairs for {self.neuron_select_type}"
420
+ neuron_indices_left = torch.from_numpy(np.random.choice(np.arange(d_model), size=n_synch))
421
+ neuron_indices_right = torch.concatenate((neuron_indices_left[:n_random_pairing_self], torch.from_numpy(np.random.choice(np.arange(d_model), size=n_synch-n_random_pairing_self))))
422
+
423
+ device = self.start_activated_state.device
424
+ return neuron_indices_left.to(device), neuron_indices_right.to(device)
425
+
426
+ def get_neuron_select_type(self):
427
+ """
428
+ Another helper method to accommodate our legacy neuron selection types.
429
+ TODO: additional experimentation and possible removal of 'first-last' and 'random'
430
+ """
431
+ print(f"Using neuron select type: {self.neuron_select_type}")
432
+ if self.neuron_select_type == 'first-last':
433
+ neuron_select_type_out, neuron_select_type_action = 'first', 'last'
434
+ elif self.neuron_select_type in ('random', 'random-pairing'):
435
+ neuron_select_type_out = neuron_select_type_action = self.neuron_select_type
436
+ else:
437
+ raise ValueError(f"Invalid neuron selection type: {self.neuron_select_type}")
438
+ return neuron_select_type_out, neuron_select_type_action
439
+
440
+ # --- Utility Methods ---
441
+
442
+ def verify_args(self):
443
+ """
444
+ Verify the validity of the input arguments to ensure consistent behaviour.
445
+ Specifically, when selecting neurons for synchronisation using 'first-last' or 'random',
446
+ one needs the right number of neurons.
447
+ """
448
+ assert self.neuron_select_type in VALID_NEURON_SELECT_TYPES, \
449
+ f"Invalid neuron selection type: {self.neuron_select_type}"
450
+
451
+ assert self.backbone_type in VALID_BACKBONE_TYPES + ['none'], \
452
+ f"Invalid backbone_type: {self.backbone_type}"
453
+
454
+ assert self.positional_embedding_type in VALID_POSITIONAL_EMBEDDING_TYPES + ['none'], \
455
+ f"Invalid positional_embedding_type: {self.positional_embedding_type}"
456
+
457
+ if self.neuron_select_type == 'first-last':
458
+ assert self.d_model >= (self.n_synch_out + self.n_synch_action), \
459
+ "d_model must be >= n_synch_out + n_synch_action for neuron subsets"
460
+
461
+ if self.backbone_type=='none' and self.positional_embedding_type!='none':
462
+ raise AssertionError("There should be no positional embedding if there is no backbone.")
463
+
464
+ def calculate_synch_representation_size(self, n_synch):
465
+ """
466
+ Calculate the size of the synchronisation representation based on neuron selection type.
467
+ """
468
+ if self.neuron_select_type == 'random-pairing':
469
+ synch_representation_size = n_synch
470
+ elif self.neuron_select_type in ('first-last', 'random'):
471
+ synch_representation_size = (n_synch * (n_synch + 1)) // 2
472
+ else:
473
+ raise ValueError(f"Invalid neuron selection type: {self.neuron_select_type}")
474
+ return synch_representation_size
475
+
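+ # Example: with neuron_select_type='random-pairing' and n_synch=512, the synchronisation
+ # representation has 512 dimensions, whereas 'first-last' or 'random' would give
+ # 512 * 513 / 2 = 131,328 dimensions.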
476
+
477
+
478
+
479
+ def forward(self, x, track=False):
480
+ B = x.size(0)
481
+ device = x.device
482
+
483
+ # --- Tracking Initialization ---
484
+ pre_activations_tracking = []
485
+ post_activations_tracking = []
486
+ synch_out_tracking = []
487
+ synch_action_tracking = []
488
+ attention_tracking = []
489
+
490
+ # --- Featurise Input Data ---
491
+ kv = self.compute_features(x)
492
+
493
+ # --- Initialise Recurrent State ---
494
+ state_trace = self.start_trace.unsqueeze(0).expand(B, -1, -1) # Shape: (B, H, T)
495
+ activated_state = self.start_activated_state.unsqueeze(0).expand(B, -1) # Shape: (B, H)
496
+
497
+ # --- Prepare Storage for Outputs per Iteration ---
498
+ predictions = torch.empty(B, self.out_dims, self.iterations, device=device, dtype=torch.float32)
499
+ certainties = torch.empty(B, 2, self.iterations, device=device, dtype=torch.float32)
500
+
501
+ # --- Initialise Recurrent Synch Values ---
502
+ decay_alpha_action, decay_beta_action = None, None
503
+ r_action, r_out = torch.exp(-torch.clamp(self.decay_params_action, 0, 15)).unsqueeze(0).repeat(B, 1), torch.exp(-torch.clamp(self.decay_params_out, 0, 15)).unsqueeze(0).repeat(B, 1)
504
+ _, decay_alpha_out, decay_beta_out = self.compute_synchronisation(activated_state, None, None, r_out, synch_type='out')
505
+ # Compute learned weighting for synchronisation
506
+
507
+
508
+ # --- Recurrent Loop ---
509
+ for stepi in range(self.iterations):
510
+
511
+ # --- Calculate Synchronisation for Input Data Interaction ---
512
+ synchronisation_action, decay_alpha_action, decay_beta_action = self.compute_synchronisation(activated_state, decay_alpha_action, decay_beta_action, r_action, synch_type='action')
513
+
514
+ # --- Interact with Data via Attention ---
515
+ q = self.q_proj(synchronisation_action).unsqueeze(1)
516
+ attn_out, attn_weights = self.attention(q, kv, kv, average_attn_weights=False, need_weights=True)
517
+ attn_out = attn_out.squeeze(1)
518
+ pre_synapse_input = torch.concatenate((attn_out, activated_state), dim=-1)
519
+
520
+ # --- Apply Synapses ---
521
+ state = self.synapses(pre_synapse_input)
522
+ # The 'state_trace' is the history of incoming pre-activations
523
+ state_trace = torch.cat((state_trace[:, :, 1:], state.unsqueeze(-1)), dim=-1)
524
+
525
+ # --- Apply Neuron-Level Models ---
526
+ activated_state = self.trace_processor(state_trace)
527
+ # One would also keep an 'activated_state_trace' as the history of outgoing post-activations
528
+ # BUT, this is unnecessary because the synchronisation calculation is fully linear and can be
529
+ # done using only the current activated state (see compute_synchronisation method for explanation)
530
+
531
+ # --- Calculate Synchronisation for Output Predictions ---
532
+ synchronisation_out, decay_alpha_out, decay_beta_out = self.compute_synchronisation(activated_state, decay_alpha_out, decay_beta_out, r_out, synch_type='out')
533
+
534
+ # --- Get Predictions and Certainties ---
535
+ current_prediction = self.output_projector(synchronisation_out)
536
+ current_certainty = self.compute_certainty(current_prediction)
537
+
538
+ predictions[..., stepi] = current_prediction
539
+ certainties[..., stepi] = current_certainty
540
+
541
+ # --- Tracking ---
542
+ if track:
543
+ pre_activations_tracking.append(state_trace[:,:,-1].detach().cpu().numpy())
544
+ post_activations_tracking.append(activated_state.detach().cpu().numpy())
545
+ attention_tracking.append(attn_weights.detach().cpu().numpy())
546
+ synch_out_tracking.append(synchronisation_out.detach().cpu().numpy())
547
+ synch_action_tracking.append(synchronisation_action.detach().cpu().numpy())
548
+
549
+ # --- Return Values ---
550
+ if track:
551
+ return predictions, certainties, (np.array(synch_out_tracking), np.array(synch_action_tracking)), np.array(pre_activations_tracking), np.array(post_activations_tracking), np.array(attention_tracking)
552
+ return predictions, certainties, synchronisation_out
models/ctm_qamnist.py ADDED
@@ -0,0 +1,205 @@
1
+ import torch
2
+ import numpy as np
3
+ from models.ctm import ContinuousThoughtMachine
4
+ from models.modules import MNISTBackbone, QAMNISTIndexEmbeddings, QAMNISTOperatorEmbeddings
5
+
6
+ class ContinuousThoughtMachineQAMNIST(ContinuousThoughtMachine):
7
+ def __init__(self,
8
+ iterations,
9
+ d_model,
10
+ d_input,
11
+ heads,
12
+ n_synch_out,
13
+ n_synch_action,
14
+ synapse_depth,
15
+ memory_length,
16
+ deep_nlms,
17
+ memory_hidden_dims,
18
+ do_layernorm_nlm,
19
+ out_dims,
20
+ iterations_per_digit,
21
+ iterations_per_question_part,
22
+ iterations_for_answering,
23
+ prediction_reshaper=[-1],
24
+ dropout=0,
25
+ neuron_select_type='first-last',
26
+ n_random_pairing_self=256
27
+ ):
28
+ super().__init__(
29
+ iterations=iterations,
30
+ d_model=d_model,
31
+ d_input=d_input,
32
+ heads=heads,
33
+ n_synch_out=n_synch_out,
34
+ n_synch_action=n_synch_action,
35
+ synapse_depth=synapse_depth,
36
+ memory_length=memory_length,
37
+ deep_nlms=deep_nlms,
38
+ memory_hidden_dims=memory_hidden_dims,
39
+ do_layernorm_nlm=do_layernorm_nlm,
40
+ out_dims=out_dims,
41
+ prediction_reshaper=prediction_reshaper,
42
+ dropout=dropout,
43
+ neuron_select_type=neuron_select_type,
44
+ n_random_pairing_self=n_random_pairing_self,
45
+ backbone_type='none',
46
+ positional_embedding_type='none',
47
+ )
48
+
49
+ # --- Core Parameters ---
50
+ self.iterations_per_digit = iterations_per_digit
51
+ self.iterations_per_question_part = iterations_per_question_part
52
+ self.iterations_for_answering = iterations_for_answering
53
+
54
+ # --- Setup Methods ---
55
+
56
+ def set_initial_rgb(self):
57
+ """Set the initial RGB values for the backbone."""
58
+ return None
59
+
60
+ def get_d_backbone(self):
61
+ """Get the dimensionality of the backbone output."""
62
+ return self.d_input
63
+
64
+ def set_backbone(self):
65
+ """Set the backbone module based on the specified type."""
66
+ self.backbone_digit = MNISTBackbone(self.d_input)
67
+ self.index_backbone = QAMNISTIndexEmbeddings(50, self.d_input)
68
+ self.operator_backbone = QAMNISTOperatorEmbeddings(2, self.d_input)
69
+ pass
70
+
71
+ # --- Utility Methods ---
72
+
73
+ def determine_step_type(self, total_iterations_for_digits, total_iterations_for_question, stepi: int):
74
+ """Determine whether the current step is for digits, questions, or answers."""
75
+ is_digit_step = stepi < total_iterations_for_digits
76
+ is_question_step = total_iterations_for_digits <= stepi < total_iterations_for_digits + total_iterations_for_question
77
+ is_answer_step = stepi >= total_iterations_for_digits + total_iterations_for_question
78
+ return is_digit_step, is_question_step, is_answer_step
79
+
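+ # Example (illustrative numbers): with 30 digit ticks and 20 question ticks, steps 0-29 are
+ # digit steps, steps 30-49 are question steps (alternating index and operator parts in blocks
+ # of iterations_per_question_part), and all remaining steps are answer steps.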
80
+ def determine_index_operator_step_type(self, total_iterations_for_digits, stepi: int):
81
+ """Determine whether the current step is for index or operator."""
82
+ step_within_questions = stepi - total_iterations_for_digits
83
+ if step_within_questions % (2 * self.iterations_per_question_part) < self.iterations_per_question_part:
84
+ is_index_step = True
85
+ is_operator_step = False
86
+ else:
87
+ is_index_step = False
88
+ is_operator_step = True
89
+ return is_index_step, is_operator_step
90
+
91
+ def get_kv_for_step(self, total_iterations_for_digits, total_iterations_for_question, stepi, x, z, prev_input=None, prev_kv=None):
92
+ """Get the key-value for the current step."""
93
+ is_digit_step, is_question_step, is_answer_step = self.determine_step_type(total_iterations_for_digits, total_iterations_for_question, stepi)
94
+
95
+ if is_digit_step:
96
+ current_input = x[:, stepi]
97
+ if prev_input is not None and torch.equal(current_input, prev_input):
98
+ return prev_kv, prev_input
99
+ kv = self.kv_proj(self.backbone_digit(current_input).flatten(2).permute(0, 2, 1))
100
+
101
+ elif is_question_step:
102
+ offset = stepi - total_iterations_for_digits
103
+ current_input = z[:, offset]
104
+ if prev_input is not None and torch.equal(current_input, prev_input):
105
+ return prev_kv, prev_input
106
+ is_index_step, is_operator_step = self.determine_index_operator_step_type(total_iterations_for_digits, stepi)
107
+ if is_index_step:
108
+ kv = self.index_backbone(current_input)
109
+ elif is_operator_step:
110
+ kv = self.operator_backbone(current_input)
111
+ else:
112
+ raise ValueError("Invalid step type for question processing.")
113
+
114
+ elif is_answer_step:
115
+ current_input = None
116
+ kv = torch.zeros((x.size(0), self.d_input), device=x.device)
117
+
118
+ else:
119
+ raise ValueError("Invalid step type.")
120
+
121
+ return kv, current_input
122
+
123
+
124
+
125
+
126
+ def forward(self, x, z, track=False):
127
+ B = x.size(0)
128
+ device = x.device
129
+
130
+ # --- Tracking Initialization ---
131
+ pre_activations_tracking = []
132
+ post_activations_tracking = []
133
+ attention_tracking = []
134
+ embedding_tracking = []
135
+
136
+ total_iterations_for_digits = x.size(1)
137
+ total_iterations_for_question = z.size(1)
138
+ total_iterations = total_iterations_for_digits + total_iterations_for_question + self.iterations_for_answering
139
+
140
+ # --- Initialise Recurrent State ---
141
+ state_trace = self.start_trace.unsqueeze(0).expand(B, -1, -1) # Shape: (B, H, T)
142
+ activated_state = self.start_activated_state.unsqueeze(0).expand(B, -1) # Shape: (B, H)
143
+
144
+ # --- Storage for outputs per iteration ---
145
+ predictions = torch.empty(B, self.out_dims, total_iterations, device=device, dtype=x.dtype)
146
+ certainties = torch.empty(B, 2, total_iterations, device=device, dtype=x.dtype)
147
+
148
+ # --- Initialise Recurrent Synch Values ---
149
+ decay_alpha_action, decay_beta_action = None, None
150
+ r_action, r_out = torch.exp(-torch.clamp(self.decay_params_action, 0, 15)).unsqueeze(0).repeat(B, 1), torch.exp(-torch.clamp(self.decay_params_out, 0, 15)).unsqueeze(0).repeat(B, 1)
151
+ _, decay_alpha_out, decay_beta_out = self.compute_synchronisation(activated_state, None, None, r_out, synch_type='out')
152
+
153
+ prev_input = None
154
+ prev_kv = None
155
+
156
+ # --- Recurrent Loop ---
157
+ for stepi in range(total_iterations):
158
+ is_digit_step, is_question_step, is_answer_step = self.determine_step_type(total_iterations_for_digits, total_iterations_for_question, stepi)
159
+
160
+ kv, prev_input = self.get_kv_for_step(total_iterations_for_digits, total_iterations_for_question, stepi, x, z, prev_input, prev_kv)
161
+ prev_kv = kv
162
+
163
+ synchronization_action, decay_alpha_action, decay_beta_action = self.compute_synchronisation(activated_state, decay_alpha_action, decay_beta_action, r_action, synch_type='action')
164
+
165
+ # --- Interact with Data via Attention ---
166
+ attn_weights = None
167
+ if is_digit_step:
168
+ q = self.q_proj(synchronization_action).unsqueeze(1)
169
+ attn_out, attn_weights = self.attention(q, kv, kv, average_attn_weights=False, need_weights=True)
170
+ attn_out = attn_out.squeeze(1)
171
+ pre_synapse_input = torch.concatenate((attn_out, activated_state), dim=-1)
172
+ else:
173
+ kv = kv.squeeze(1)
174
+ pre_synapse_input = torch.concatenate((kv, activated_state), dim=-1)
175
+
176
+ # --- Apply Synapses ---
177
+ state = self.synapses(pre_synapse_input)
178
+ state_trace = torch.cat((state_trace[:, :, 1:], state.unsqueeze(-1)), dim=-1)
179
+
180
+ # --- Apply NLMs ---
181
+ activated_state = self.trace_processor(state_trace)
182
+
183
+ # --- Calculate Synchronisation for Output Predictions ---
184
+ synchronization_out, decay_alpha_out, decay_beta_out = self.compute_synchronisation(activated_state, decay_alpha_out, decay_beta_out, r_out, synch_type='out')
185
+
186
+ # --- Get Predictions and Certainties ---
187
+ current_prediction = self.output_projector(synchronization_out)
188
+ current_certainty = self.compute_certainty(current_prediction)
189
+
190
+ predictions[..., stepi] = current_prediction
191
+ certainties[..., stepi] = current_certainty
192
+
193
+ # --- Tracking ---
194
+ if track:
195
+ pre_activations_tracking.append(state_trace[:,:,-1].detach().cpu().numpy())
196
+ post_activations_tracking.append(activated_state.detach().cpu().numpy())
197
+ if attn_weights is not None:
198
+ attention_tracking.append(attn_weights.detach().cpu().numpy())
199
+ if is_question_step:
200
+ embedding_tracking.append(kv.detach().cpu().numpy())
201
+
202
+ # --- Return Values ---
203
+ if track:
204
+ return predictions, certainties, synchronization_out, np.array(pre_activations_tracking), np.array(post_activations_tracking), np.array(attention_tracking), np.array(embedding_tracking)
205
+ return predictions, certainties, synchronization_out
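The digit, question and answer phases above are selected purely from the step index. As a standalone illustration of that scheduling (the totals below are made-up numbers; in the model they come from `x.size(1)`, `z.size(1)` and `self.iterations_for_answering`):

```python
# Sketch of the step scheduling mirrored from determine_step_type above.
total_iterations_for_digits = 30      # in the model: x.size(1)
total_iterations_for_question = 20    # in the model: z.size(1)
iterations_for_answering = 10         # in the model: self.iterations_for_answering

total_iterations = (total_iterations_for_digits
                    + total_iterations_for_question
                    + iterations_for_answering)

for stepi in range(total_iterations):
    is_digit_step = stepi < total_iterations_for_digits
    is_question_step = (total_iterations_for_digits <= stepi
                        < total_iterations_for_digits + total_iterations_for_question)
    is_answer_step = stepi >= total_iterations_for_digits + total_iterations_for_question
    # Digit steps attend over MNIST features, question steps consume index/operator
    # embeddings, and answer steps receive a zero key-value input.
```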
models/ctm_rl.py ADDED
@@ -0,0 +1,192 @@
+ import torch
2
+ import torch.nn as nn
3
+ import numpy as np
4
+ import math
5
+ from models.ctm import ContinuousThoughtMachine
6
+ from models.modules import MiniGridBackbone, ClassicControlBackbone, SynapseUNET
7
+ from models.utils import compute_decay
8
+ from models.constants import VALID_NEURON_SELECT_TYPES
9
+
10
+ class ContinuousThoughtMachineRL(ContinuousThoughtMachine):
11
+ def __init__(self,
12
+ iterations,
13
+ d_model,
14
+ d_input,
15
+ n_synch_out,
16
+ synapse_depth,
17
+ memory_length,
18
+ deep_nlms,
19
+ memory_hidden_dims,
20
+ do_layernorm_nlm,
21
+ backbone_type,
22
+ prediction_reshaper=[-1],
23
+ dropout=0,
24
+ neuron_select_type='first-last',
25
+ ):
26
+ super().__init__(
27
+ iterations=iterations,
28
+ d_model=d_model,
29
+ d_input=d_input,
30
+ heads=0, # Setting heads to 0 makes the parent skip attention (attention/q_proj/kv_proj are set to None)
31
+ n_synch_out=n_synch_out,
32
+ n_synch_action=0,
33
+ synapse_depth=synapse_depth,
34
+ memory_length=memory_length,
35
+ deep_nlms=deep_nlms,
36
+ memory_hidden_dims=memory_hidden_dims,
37
+ do_layernorm_nlm=do_layernorm_nlm,
38
+ out_dims=0,
39
+ prediction_reshaper=prediction_reshaper,
40
+ dropout=dropout,
41
+ neuron_select_type=neuron_select_type,
42
+ backbone_type=backbone_type,
43
+ n_random_pairing_self=0,
44
+ positional_embedding_type='none',
45
+ )
46
+
47
+ # --- Use a minimal CTM w/out input (action) synch ---
48
+ self.neuron_select_type_action = None
49
+ self.synch_representation_size_action = None
50
+
51
+ # --- Start dynamics with a learned activated state trace ---
52
+ self.register_parameter('start_activated_trace', nn.Parameter(torch.zeros((d_model, memory_length)).uniform_(-math.sqrt(1/(d_model+memory_length)), math.sqrt(1/(d_model+memory_length))), requires_grad=True))
53
+ self.start_activated_state = None
54
+
55
+ self.register_buffer('diagonal_mask_out', torch.triu(torch.ones(self.n_synch_out, self.n_synch_out, dtype=torch.bool)))
56
+
57
+ self.attention = None # Should already be None because super(... heads=0... )
58
+ self.q_proj = None # Should already be None because super(... heads=0... )
59
+ self.kv_proj = None # Should already be None because super(... heads=0... )
60
+ self.output_projector = None
61
+
62
+ # --- Core CTM Methods ---
63
+
64
+ def compute_synchronisation(self, activated_state_trace):
65
+ """Compute the synchronisation between neurons."""
66
+ assert self.neuron_select_type == "first-last", "only first-last neuron selection is supported here"
67
+ # For RL tasks we track a sliding window of activations from which we compute synchronisation
68
+ S = activated_state_trace.permute(0, 2, 1)
69
+ diagonal_mask = self.diagonal_mask_out.to(S.device)
70
+ decay = compute_decay(S.size(1), self.decay_params_out, clamp_lims=(0, 4))
71
+ synchronisation = ((decay.unsqueeze(0) *(S[:,:,-self.n_synch_out:].unsqueeze(-1) * S[:,:,-self.n_synch_out:].unsqueeze(-2))[:,:,diagonal_mask]).sum(1))/torch.sqrt(decay.unsqueeze(0).sum(1,))
72
+ return synchronisation
73
+
74
+ # --- Setup Methods ---
75
+
76
+ def set_initial_rgb(self):
77
+ """Set the initial RGB values for the backbone."""
78
+ return None
79
+
80
+ def get_d_backbone(self):
81
+ """Get the dimensionality of the backbone output."""
82
+ return self.d_input
83
+
84
+ def set_backbone(self):
85
+ """Set the backbone module based on the specified type."""
86
+ if self.backbone_type == 'navigation-backbone':
87
+ self.backbone = MiniGridBackbone(self.d_input)
88
+ elif self.backbone_type == 'classic-control-backbone':
89
+ self.backbone = ClassicControlBackbone(self.d_input)
90
+ else:
91
+ raise NotImplementedError('The only backbones supported for RL are navigation (symbolic C x H x W inputs) and classic control (vectors of length D).')
92
+ pass
93
+
94
+ def get_positional_embedding(self, d_backbone):
95
+ """Get the positional embedding module."""
96
+ return None
97
+
98
+
99
+ def get_synapses(self, synapse_depth, d_model, dropout):
100
+ """
101
+ Get the synapse module.
102
+
103
+ We found in our early experimentation that a single Linear, GLU and LayerNorm block performed worse than two blocks.
104
+ For that reason we set the default synapse depth to two blocks.
105
+
106
+ TODO: This is legacy and needs further experimentation to iron out.
107
+ """
108
+ if synapse_depth == 1:
109
+ return nn.Sequential(
110
+ nn.Dropout(dropout),
111
+ nn.LazyLinear(d_model*2),
112
+ nn.GLU(),
113
+ nn.LayerNorm(d_model),
114
+ nn.LazyLinear(d_model*2),
115
+ nn.GLU(),
116
+ nn.LayerNorm(d_model)
117
+ )
118
+ else:
119
+ return SynapseUNET(d_model, synapse_depth, 16, dropout)
120
+
121
+ def set_synchronisation_parameters(self, synch_type: str, n_synch: int, n_random_pairing_self: int = 0):
122
+ """Set the parameters for the synchronisation of neurons."""
123
+ if synch_type == 'action':
124
+ pass
125
+ elif synch_type == 'out':
126
+ left, right = self.initialize_left_right_neurons("out", self.d_model, n_synch, n_random_pairing_self)
127
+ self.register_buffer(f'out_neuron_indices_left', left)
128
+ self.register_buffer(f'out_neuron_indices_right', right)
129
+ self.register_parameter(f'decay_params_out', nn.Parameter(torch.zeros(self.synch_representation_size_out), requires_grad=True))
130
+ pass
131
+ else:
132
+ raise ValueError(f"Invalid synch_type: {synch_type}")
133
+
134
+ # --- Utility Methods ---
135
+
136
+ def verify_args(self):
137
+ """Verify the validity of the input arguments."""
138
+ assert self.neuron_select_type in VALID_NEURON_SELECT_TYPES, \
139
+ f"Invalid neuron selection type: {self.neuron_select_type}"
140
+ assert self.neuron_select_type != 'random-pairing', \
141
+ f"Random pairing is not supported for RL."
142
+ assert self.backbone_type in ('navigation-backbone', 'classic-control-backbone'), \
143
+ f"Invalid backbone_type: {self.backbone_type}"
144
+ assert self.d_model >= (self.n_synch_out), \
145
+ "d_model must be >= n_synch_out for neuron subsets"
146
+ pass
147
+
148
+
149
+
150
+
151
+ def forward(self, x, hidden_states, track=False):
152
+
153
+ # --- Tracking Initialization ---
154
+ pre_activations_tracking = []
155
+ post_activations_tracking = []
156
+
157
+ # --- Featurise Input Data ---
158
+ features = self.backbone(x)
159
+
160
+ # --- Get Recurrent State ---
161
+ state_trace, activated_state_trace = hidden_states
162
+
163
+ # --- Recurrent Loop ---
164
+ for stepi in range(self.iterations):
165
+
166
+ pre_synapse_input = torch.concatenate((features.reshape(x.size(0), -1), activated_state_trace[:,:,-1]), -1)
167
+
168
+ # --- Apply Synapses ---
169
+ state = self.synapses(pre_synapse_input)
170
+ state_trace = torch.cat((state_trace[:, :, 1:], state.unsqueeze(-1)), dim=-1)
171
+
172
+ # --- Apply NLMs ---
173
+ activated_state = self.trace_processor(state_trace)
174
+ activated_state_trace = torch.concatenate((activated_state_trace[:,:,1:], activated_state.unsqueeze(-1)), -1)
175
+
176
+ # --- Tracking ---
177
+ if track:
178
+ pre_activations_tracking.append(state_trace[:,:,-1].detach().cpu().numpy())
179
+ post_activations_tracking.append(activated_state.detach().cpu().numpy())
180
+
181
+ hidden_states = (
182
+ state_trace,
183
+ activated_state_trace,
184
+ )
185
+
186
+ # --- Calculate Output Synchronisation ---
187
+ synchronisation_out = self.compute_synchronisation(activated_state_trace)
188
+
189
+ # --- Return Values ---
190
+ if track:
191
+ return synchronisation_out, hidden_states, np.array(pre_activations_tracking), np.array(post_activations_tracking)
192
+ return synchronisation_out, hidden_states
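Unlike the attention-based variants, this RL model is stateful across calls: the caller owns the `(state_trace, activated_state_trace)` pair returned from `forward`. A hedged sketch of that contract (sizes are illustrative, and zeros stand in for the learned start traces that the training code expands over the batch):

```python
import torch

B, d_model, memory_length = 4, 256, 25                            # illustrative sizes only
state_trace = torch.zeros(B, d_model, memory_length)              # pre-activation history
activated_state_trace = torch.zeros(B, d_model, memory_length)    # post-activation history
hidden_states = (state_trace, activated_state_trace)

# for obs in rollout:                                 # pseudo-loop over environment steps
#     synchronisation_out, hidden_states = model(obs, hidden_states)
#     # the agent's policy/value heads then read from synchronisation_out
```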
models/ctm_sort.py ADDED
@@ -0,0 +1,126 @@
+ import torch
2
+ import numpy as np
3
+ from models.ctm import ContinuousThoughtMachine
4
+
5
+ class ContinuousThoughtMachineSORT(ContinuousThoughtMachine):
6
+ """
7
+ Slight adaptation of the CTM to work with the sort task.
8
+ """
9
+
10
+ def __init__(self,
11
+ iterations,
12
+ d_model,
13
+ d_input,
14
+ heads,
15
+ n_synch_out,
16
+ n_synch_action,
17
+ synapse_depth,
18
+ memory_length,
19
+ deep_nlms,
20
+ memory_hidden_dims,
21
+ do_layernorm_nlm,
22
+ backbone_type,
23
+ positional_embedding_type,
24
+ out_dims,
25
+ prediction_reshaper=[-1],
26
+ dropout=0,
27
+ dropout_nlm=None,
28
+ neuron_select_type='random-pairing',
29
+ n_random_pairing_self=0,
30
+ ):
31
+ super().__init__(
32
+ iterations=iterations,
33
+ d_model=d_model,
34
+ d_input=d_input,
35
+ heads=0,
36
+ n_synch_out=n_synch_out,
37
+ n_synch_action=0,
38
+ synapse_depth=synapse_depth,
39
+ memory_length=memory_length,
40
+ deep_nlms=deep_nlms,
41
+ memory_hidden_dims=memory_hidden_dims,
42
+ do_layernorm_nlm=do_layernorm_nlm,
43
+ backbone_type='none',
44
+ positional_embedding_type='none',
45
+ out_dims=out_dims,
46
+ prediction_reshaper=prediction_reshaper,
47
+ dropout=dropout,
48
+ dropout_nlm=dropout_nlm,
49
+ neuron_select_type=neuron_select_type,
50
+ n_random_pairing_self=n_random_pairing_self,
51
+ )
52
+
53
+ # --- Use a minimal CTM w/out input (action) synch ---
54
+ self.neuron_select_type_action = None
55
+ self.synch_representation_size_action = None
56
+
57
+ self.attention = None # Should already be None because super(... heads=0... )
58
+ self.q_proj = None # Should already be None because super(... heads=0... )
59
+ self.kv_proj = None # Should already be None because super(... heads=0... )
60
+
61
+
62
+
63
+
64
+ def forward(self, x, track=False):
65
+ B = x.size(0)
66
+ device = x.device
67
+
68
+ # --- Tracking Initialization ---
69
+ pre_activations_tracking = []
70
+ post_activations_tracking = []
71
+ synch_out_tracking = []
72
+ attention_tracking = []
73
+
74
+ # --- For SORT: no need to featurise data ---
75
+
76
+
77
+ # --- Initialise Recurrent State ---
78
+ state_trace = self.start_trace.unsqueeze(0).expand(B, -1, -1) # Shape: (B, H, T)
79
+ activated_state = self.start_activated_state.unsqueeze(0).expand(B, -1) # Shape: (B, H)
80
+
81
+ # --- Prepare Storage for Outputs per Iteration ---
82
+ predictions = torch.empty(B, self.out_dims, self.iterations, device=device, dtype=x.dtype)
83
+ certainties = torch.empty(B, 2, self.iterations, device=device, dtype=x.dtype)
84
+
85
+ # --- Initialise Recurrent Synch Values ---
86
+ r_out = torch.exp(-torch.clamp(self.decay_params_out, 0, 15)).unsqueeze(0).repeat(B, 1)
87
+ _, decay_alpha_out, decay_beta_out = self.compute_synchronisation(activated_state, None, None, r_out, synch_type='out')
88
+ # Compute learned weighting for synchronisation
89
+
90
+
91
+ # --- Recurrent Loop ---
92
+ for stepi in range(self.iterations):
93
+
94
+ pre_synapse_input = torch.concatenate((x, activated_state), dim=-1)
95
+
96
+ # --- Apply Synapses ---
97
+ state = self.synapses(pre_synapse_input)
98
+ # The 'state_trace' is the history of incoming pre-activations
99
+ state_trace = torch.cat((state_trace[:, :, 1:], state.unsqueeze(-1)), dim=-1)
100
+
101
+ # --- Apply Neuron-Level Models ---
102
+ activated_state = self.trace_processor(state_trace)
103
+ # One would also keep an 'activated_state_trace' as the history of outgoing post-activations
104
+ # BUT, this is unnecessary because the synchronisation calculation is fully linear and can be
105
+ # done using only the current activated state (see compute_synchronisation method for explanation)
106
+
107
+ # --- Calculate Synchronisation for Output Predictions ---
108
+ synchronisation_out, decay_alpha_out, decay_beta_out = self.compute_synchronisation(activated_state, decay_alpha_out, decay_beta_out, r_out, synch_type='out')
109
+
110
+ # --- Get Predictions and Certainties ---
111
+ current_prediction = self.output_projector(synchronisation_out)
112
+ current_certainty = self.compute_certainty(current_prediction)
113
+
114
+ predictions[..., stepi] = current_prediction
115
+ certainties[..., stepi] = current_certainty
116
+
117
+ # --- Tracking ---
118
+ if track:
119
+ pre_activations_tracking.append(state_trace[:,:,-1].detach().cpu().numpy())
120
+ post_activations_tracking.append(activated_state.detach().cpu().numpy())
121
+ synch_out_tracking.append(synchronisation_out.detach().cpu().numpy())
122
+
123
+ # --- Return Values ---
124
+ if track:
125
+ return predictions, certainties, np.array(synch_out_tracking), np.array(pre_activations_tracking), np.array(post_activations_tracking), np.array(attention_tracking)
126
+ return predictions, certainties, synchronisation_out
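Because the SORT variant uses no backbone or attention, the raw sequence is concatenated with the activated state at every tick. An illustrative call pattern (shapes are assumptions, not taken from the training scripts):

```python
import torch

B, length = 8, 15                      # assumed batch size and number of values to sort
x = torch.randn(B, length)             # raw values, fed straight into the synapse model
# predictions, certainties, synchronisation_out = model(x)
# predictions: (B, out_dims, iterations), one prediction per internal tick
# certainties: (B, 2, iterations), the (entropy, 1 - entropy) pair per internal tick
```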
models/ff.py ADDED
@@ -0,0 +1,75 @@
+ import torch.nn as nn
2
+
3
+ # Local imports (Assuming these contain necessary custom modules)
4
+ from models.modules import *
5
+ from models.resnet import resnet18, resnet34, resnet50, resnet101, resnet152
6
+
7
+
8
+ class FFBaseline(nn.Module):
9
+ """
10
+ Feedforward baseline.
+
+ Wrapper that lets us use the same backbone as the CTM and LSTM baselines, with a
+ final projection to d_model so that parameter counts can be matched against those models.
13
+
14
+
15
+ Args:
16
+ d_model (int): workaround that projects final layer to this space so that parameter-matching is plausible.
17
+ backbone_type (str): Type of feature extraction backbone (e.g., 'resnet18-2', 'none').
18
+ out_dims (int): Dimensionality of the final output projection.
19
+ dropout (float): dropout in last layer
20
+ """
21
+
22
+ def __init__(self,
23
+ d_model,
24
+ backbone_type,
25
+ out_dims,
26
+ dropout=0,
27
+ ):
28
+ super(FFBaseline, self).__init__()
29
+
30
+ # --- Core Parameters ---
31
+ self.d_model = d_model
32
+ self.backbone_type = backbone_type
33
+ self.out_dims = out_dims
34
+
35
+ # --- Input Assertions ---
36
+ assert backbone_type in ['resnet18-1', 'resnet18-2', 'resnet18-3', 'resnet18-4',
37
+ 'resnet34-1', 'resnet34-2', 'resnet34-3', 'resnet34-4',
38
+ 'resnet50-1', 'resnet50-2', 'resnet50-3', 'resnet50-4',
39
+ 'resnet101-1', 'resnet101-2', 'resnet101-3', 'resnet101-4',
40
+ 'resnet152-1', 'resnet152-2', 'resnet152-3', 'resnet152-4',
41
+ 'none', 'shallow-wide', 'parity_backbone'], f"Invalid backbone_type: {backbone_type}"
42
+
43
+ # --- Backbone / Feature Extraction ---
44
+ self.initial_rgb = Identity() # Placeholder, potentially replaced if using ResNet
45
+
46
+
47
+ self.initial_rgb = nn.LazyConv2d(3, 1, 1) # Adapts input channels lazily
48
+ resnet_family = resnet18 # Default
49
+ if '34' in self.backbone_type: resnet_family = resnet34
50
+ if '50' in self.backbone_type: resnet_family = resnet50
51
+ if '101' in self.backbone_type: resnet_family = resnet101
52
+ if '152' in self.backbone_type: resnet_family = resnet152
53
+
54
+ # Determine which ResNet blocks to keep
55
+ block_num_str = self.backbone_type.split('-')[-1]
56
+ hyper_blocks_to_keep = list(range(1, int(block_num_str) + 1)) if block_num_str.isdigit() else [1, 2, 3, 4]
57
+
58
+ self.backbone = resnet_family(
59
+ 3, # initial_rgb handles input channels now
60
+ hyper_blocks_to_keep,
61
+ stride=2,
62
+ pretrained=False,
63
+ progress=True,
64
+ device="cpu", # Initialise on CPU, move later via .to(device)
65
+ do_initial_max_pool=True,
66
+ )
67
+
68
+
69
+ # At this point we will have a 4D tensor of features: [B, C, H, W]
70
+ # The following lets us scale up the resnet with d_model until it matches the CTM
71
+ self.output_projector = nn.Sequential(nn.AdaptiveAvgPool2d((1, 1)), Squeeze(-1), Squeeze(-1), nn.LazyLinear(d_model), nn.ReLU(), nn.Dropout(dropout), nn.Linear(d_model, out_dims))
72
+
73
+
74
+ def forward(self, x):
75
+ return self.output_projector((self.backbone(self.initial_rgb(x))))
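For reference, a hedged usage sketch of this feedforward baseline; the hyperparameter values are illustrative rather than taken from the repository's training scripts:

```python
import torch
# from models.ff import FFBaseline

# model = FFBaseline(d_model=512, backbone_type='resnet18-2', out_dims=10, dropout=0.1)
x = torch.randn(2, 3, 32, 32)          # e.g. a CIFAR-10-sized batch
# logits = model(x)                    # (2, 10): a single feedforward pass, no internal ticks
```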
models/lstm.py ADDED
@@ -0,0 +1,244 @@
+ import torch.nn as nn
2
+ import torch
3
+ import numpy as np
4
+ import math
5
+
6
+ from models.modules import ParityBackbone, LearnableFourierPositionalEncoding, MultiLearnableFourierPositionalEncoding, CustomRotationalEmbedding, CustomRotationalEmbedding1D, ShallowWide
7
+ from models.resnet import prepare_resnet_backbone
8
+ from models.utils import compute_normalized_entropy
9
+
10
+ from models.constants import (
11
+ VALID_BACKBONE_TYPES,
12
+ VALID_POSITIONAL_EMBEDDING_TYPES
13
+ )
14
+
15
+ class LSTMBaseline(nn.Module):
16
+ """
17
+ LSTM Baseline
18
+
19
+ Args:
20
+ iterations (int): Number of internal 'thought' steps (T, in paper).
21
+ d_model (int): Core dimensionality of the latent space.
22
+ d_input (int): Dimensionality of projected attention outputs or direct input features.
23
+ heads (int): Number of attention heads.
24
+ backbone_type (str): Type of feature extraction backbone (e.g., 'resnet18-2', 'none').
25
+ positional_embedding_type (str): Type of positional embedding for backbone features.
26
+ out_dims (int): Dimensionality of the final output projection.
27
+ prediction_reshaper (list): Shape for reshaping predictions before certainty calculation (task-specific).
28
+ dropout (float): Dropout rate.
29
+ """
30
+
31
+ def __init__(self,
32
+ iterations,
33
+ d_model,
34
+ d_input,
35
+ heads,
36
+ backbone_type,
37
+ num_layers,
38
+ positional_embedding_type,
39
+ out_dims,
40
+ prediction_reshaper=[-1],
41
+ dropout=0,
42
+ ):
43
+ super(LSTMBaseline, self).__init__()
44
+
45
+ # --- Core Parameters ---
46
+ self.iterations = iterations
47
+ self.d_model = d_model
48
+ self.d_input = d_input
49
+ self.prediction_reshaper = prediction_reshaper
50
+ self.backbone_type = backbone_type
51
+ self.positional_embedding_type = positional_embedding_type
52
+ self.out_dims = out_dims
53
+
54
+ # --- Assertions ---
55
+ self.verify_args()
56
+
57
+ # --- Input Processing ---
58
+ d_backbone = self.get_d_backbone()
59
+
60
+ self.set_initial_rgb()
61
+ self.set_backbone()
62
+ self.positional_embedding = self.get_positional_embedding(d_backbone)
63
+ self.kv_proj = self.get_kv_proj()
64
+ self.lstm = nn.LSTM(d_input, d_model, num_layers, batch_first=True, dropout=dropout)
65
+ self.q_proj = self.get_q_proj()
66
+ self.attention = self.get_attention(heads, dropout)
67
+ self.output_projector = nn.Sequential(nn.LazyLinear(out_dims))
68
+
69
+ # --- Start States ---
70
+ self.register_parameter('start_hidden_state', nn.Parameter(torch.zeros((num_layers, d_model)).uniform_(-math.sqrt(1/(d_model)), math.sqrt(1/(d_model))), requires_grad=True))
71
+ self.register_parameter('start_cell_state', nn.Parameter(torch.zeros((num_layers, d_model)).uniform_(-math.sqrt(1/(d_model)), math.sqrt(1/(d_model))), requires_grad=True))
72
+
73
+
74
+
75
+ # --- Core LSTM Methods ---
76
+
77
+ def compute_features(self, x):
78
+ """Applies backbone and positional embedding to input."""
79
+ x = self.initial_rgb(x)
80
+ self.kv_features = self.backbone(x)
81
+ pos_emb = self.positional_embedding(self.kv_features)
82
+ combined_features = (self.kv_features + pos_emb).flatten(2).transpose(1, 2)
83
+ kv = self.kv_proj(combined_features)
84
+ return kv
85
+
86
+ def compute_certainty(self, current_prediction):
87
+ """Compute the certainty of the current prediction."""
88
+ B = current_prediction.size(0)
89
+ reshaped_pred = current_prediction.reshape([B] +self.prediction_reshaper)
90
+ ne = compute_normalized_entropy(reshaped_pred)
91
+ current_certainty = torch.stack((ne, 1-ne), -1)
92
+ return current_certainty
93
+
94
+ # --- Setup Methods ---
95
+
96
+ def set_initial_rgb(self):
97
+ """Set the initial RGB processing module based on the backbone type."""
98
+ if 'resnet' in self.backbone_type:
99
+ self.initial_rgb = nn.LazyConv2d(3, 1, 1) # Adapts input channels lazily
100
+ else:
101
+ self.initial_rgb = nn.Identity()
102
+
103
+ def get_d_backbone(self):
104
+ """
105
+ Get the dimensionality of the backbone output, to be used for positional embedding setup.
106
+
107
+ This is a little bit complicated for resnets, but the logic should be easy enough to read below.
108
+ """
109
+ if self.backbone_type == 'shallow-wide':
110
+ return 2048
111
+ elif self.backbone_type == 'parity_backbone':
112
+ return self.d_input
113
+ elif 'resnet' in self.backbone_type:
114
+ if '18' in self.backbone_type or '34' in self.backbone_type:
115
+ if self.backbone_type.split('-')[1]=='1': return 64
116
+ elif self.backbone_type.split('-')[1]=='2': return 128
117
+ elif self.backbone_type.split('-')[1]=='3': return 256
118
+ elif self.backbone_type.split('-')[1]=='4': return 512
119
+ else:
120
+ raise NotImplementedError
121
+ else:
122
+ if self.backbone_type.split('-')[1]=='1': return 256
123
+ elif self.backbone_type.split('-')[1]=='2': return 512
124
+ elif self.backbone_type.split('-')[1]=='3': return 1024
125
+ elif self.backbone_type.split('-')[1]=='4': return 2048
126
+ else:
127
+ raise NotImplementedError
128
+ elif self.backbone_type == 'none':
129
+ return None
130
+ else:
131
+ raise ValueError(f"Invalid backbone_type: {self.backbone_type}")
132
+
133
+ def set_backbone(self):
134
+ """Set the backbone module based on the specified type."""
135
+ if self.backbone_type == 'shallow-wide':
136
+ self.backbone = ShallowWide()
137
+ elif self.backbone_type == 'parity_backbone':
138
+ d_backbone = self.get_d_backbone()
139
+ self.backbone = ParityBackbone(n_embeddings=2, d_embedding=d_backbone)
140
+ elif 'resnet' in self.backbone_type:
141
+ self.backbone = prepare_resnet_backbone(self.backbone_type)
142
+ elif self.backbone_type == 'none':
143
+ self.backbone = nn.Identity()
144
+ else:
145
+ raise ValueError(f"Invalid backbone_type: {self.backbone_type}")
146
+
147
+ def get_positional_embedding(self, d_backbone):
148
+ """Get the positional embedding module."""
149
+ if self.positional_embedding_type == 'learnable-fourier':
150
+ return LearnableFourierPositionalEncoding(d_backbone, gamma=1 / 2.5)
151
+ elif self.positional_embedding_type == 'multi-learnable-fourier':
152
+ return MultiLearnableFourierPositionalEncoding(d_backbone)
153
+ elif self.positional_embedding_type == 'custom-rotational':
154
+ return CustomRotationalEmbedding(d_backbone)
155
+ elif self.positional_embedding_type == 'custom-rotational-1d':
156
+ return CustomRotationalEmbedding1D(d_backbone)
157
+ elif self.positional_embedding_type == 'none':
158
+ return lambda x: 0 # Default no-op
159
+ else:
160
+ raise ValueError(f"Invalid positional_embedding_type: {self.positional_embedding_type}")
161
+
162
+ def get_attention(self, heads, dropout):
163
+ """Get the attention module."""
164
+ return nn.MultiheadAttention(self.d_input, heads, dropout, batch_first=True)
165
+
166
+ def get_kv_proj(self):
167
+ """Get the key-value projection module."""
168
+ return nn.Sequential(nn.LazyLinear(self.d_input), nn.LayerNorm(self.d_input))
169
+
170
+ def get_q_proj(self):
171
+ """Get the query projection module."""
172
+ return nn.LazyLinear(self.d_input)
173
+
174
+
175
+ def verify_args(self):
176
+ """Verify the validity of the input arguments."""
177
+
178
+ assert self.backbone_type in VALID_BACKBONE_TYPES + ['none'], \
179
+ f"Invalid backbone_type: {self.backbone_type}"
180
+
181
+ assert self.positional_embedding_type in VALID_POSITIONAL_EMBEDDING_TYPES + ['none'], \
182
+ f"Invalid positional_embedding_type: {self.positional_embedding_type}"
183
+
184
+ if self.backbone_type=='none' and self.positional_embedding_type!='none':
185
+ raise AssertionError("There should be no positional embedding if there is no backbone.")
186
+
187
+ pass
188
+
189
+
190
+
191
+
192
+ def forward(self, x, track=False):
193
+ """
194
+ Forward pass.
195
+ Executes T=iterations steps.
196
+ """
197
+ B = x.size(0)
198
+ device = x.device
199
+
200
+ # --- Tracking Initialization ---
201
+ activations_tracking = []
202
+ attention_tracking = []
203
+
204
+ # --- Featurise Input Data ---
205
+ kv = self.compute_features(x)
206
+
207
+ # --- Initialise Recurrent State ---
208
+ hn = torch.repeat_interleave(self.start_hidden_state.unsqueeze(1), x.size(0), 1)
209
+ cn = torch.repeat_interleave(self.start_cell_state.unsqueeze(1), x.size(0), 1)
210
+ state_trace = [hn[-1]]
211
+
212
+ # --- Prepare Storage for Outputs per Iteration ---
213
+ predictions = torch.empty(B, self.out_dims, self.iterations, device=device, dtype=x.dtype)
214
+ certainties = torch.empty(B, 2, self.iterations, device=device, dtype=x.dtype)
215
+
216
+ # --- Recurrent Loop ---
217
+ for stepi in range(self.iterations):
218
+
219
+ # --- Interact with Data via Attention ---
220
+ q = self.q_proj(hn[-1].unsqueeze(1))
221
+ attn_out, attn_weights = self.attention(q, kv, kv, average_attn_weights=False, need_weights=True)
222
+ lstm_input = attn_out
223
+
224
+ # --- Apply LSTM ---
225
+ hidden_state, (hn,cn) = self.lstm(lstm_input, (hn, cn))
226
+ hidden_state = hidden_state.squeeze(1)
227
+ state_trace.append(hidden_state)
228
+
229
+ # --- Get Predictions and Certainties ---
230
+ current_prediction = self.output_projector(hidden_state)
231
+ current_certainty = self.compute_certainty(current_prediction)
232
+
233
+ predictions[..., stepi] = current_prediction
234
+ certainties[..., stepi] = current_certainty
235
+
236
+ # --- Tracking ---
237
+ if track:
238
+ activations_tracking.append(hidden_state.squeeze(1).detach().cpu().numpy())
239
+ attention_tracking.append(attn_weights.detach().cpu().numpy())
240
+
241
+ # --- Return Values ---
242
+ if track:
243
+ return predictions, certainties, None, np.zeros_like(activations_tracking), np.array(activations_tracking), np.array(attention_tracking)
244
+ return predictions, certainties, None
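The certainty returned next to each prediction is a normalised-entropy pair. The actual helper lives in `models/utils.py` and is not shown here, so the following standalone sketch is an assumption about its behaviour rather than a copy of it:

```python
import math
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                         # (B, num_classes) prediction at one tick
p = F.softmax(logits, dim=-1)
entropy = -(p * torch.log(p + 1e-12)).sum(-1)       # (B,)
ne = entropy / math.log(logits.size(-1))            # normalise to [0, 1]
certainty = torch.stack((ne, 1 - ne), dim=-1)       # matches the (ne, 1-ne) stacking above
```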
models/lstm_qamnist.py ADDED
@@ -0,0 +1,184 @@
+ import torch.nn as nn
2
+ import torch
3
+ import torch.nn.functional as F # Used for GLU if not in modules
4
+ import numpy as np
5
+ import math
6
+
7
+ # Local imports (Assuming these contain necessary custom modules)
8
+ from models.modules import *
9
+ from models.utils import * # Assuming compute_decay, compute_normalized_entropy are here
10
+
11
+ class LSTMBaseline(nn.Module):
12
+ """
13
+ LSTM Baseline
14
+
15
+ Args:
16
+ iterations (int): Number of internal 'thought' steps (T, in paper).
17
+ d_model (int): Core dimensionality of the CTM's latent space (D, in paper).
18
+ d_input (int): Dimensionality of projected attention outputs or direct input features.
19
+ heads (int): Number of attention heads.
20
+ iterations_per_digit (int): Number of internal ticks spent on each digit observation.
+ iterations_per_question_part (int): Number of internal ticks spent on each question part (index or operator).
+ iterations_for_answering (int): Number of internal ticks spent producing the answer.
29
+ out_dims (int): Dimensionality of the final output projection.
30
+ prediction_reshaper (list): Shape for reshaping predictions before certainty calculation (task-specific).
31
+ dropout (float): Dropout rate.
32
+ """
33
+
34
+ def __init__(self,
35
+ iterations,
36
+ d_model,
37
+ d_input,
38
+ heads,
39
+ out_dims,
40
+ iterations_per_digit,
41
+ iterations_per_question_part,
42
+ iterations_for_answering,
43
+ prediction_reshaper=[-1],
44
+ dropout=0,
45
+ ):
46
+ super(LSTMBaseline, self).__init__()
47
+
48
+ # --- Core Parameters ---
49
+ self.iterations = iterations
50
+ self.d_model = d_model
51
+ self.prediction_reshaper = prediction_reshaper
52
+ self.out_dims = out_dims
53
+ self.d_input = d_input
54
+ self.backbone_type = 'qamnist_backbone'
55
+ self.iterations_per_digit = iterations_per_digit
56
+ self.iterations_per_question_part = iterations_per_question_part
57
+ self.total_iterations_for_answering = iterations_for_answering
58
+
59
+ # --- Backbone / Feature Extraction ---
60
+ self.backbone_digit = MNISTBackbone(d_input)
61
+ self.index_backbone = QAMNISTIndexEmbeddings(50, d_input)
62
+ self.operator_backbone = QAMNISTOperatorEmbeddings(2, d_input)
63
+
64
+ # --- Core LSTM Modules ---
65
+ self.lstm_cell = nn.LSTMCell(d_input, d_model)
66
+ self.register_parameter('start_hidden_state', nn.Parameter(torch.zeros((d_model)).uniform_(-math.sqrt(1/(d_model)), math.sqrt(1/(d_model))), requires_grad=True))
67
+ self.register_parameter('start_cell_state', nn.Parameter(torch.zeros((d_model)).uniform_(-math.sqrt(1/(d_model)), math.sqrt(1/(d_model))), requires_grad=True))
68
+
69
+ # Attention
70
+ self.q_proj = nn.LazyLinear(d_input)
71
+ self.kv_proj = nn.Sequential(nn.LazyLinear(d_input), nn.LayerNorm(d_input))
72
+ self.attention = nn.MultiheadAttention(d_input, heads, dropout, batch_first=True)
73
+
74
+ # Output Projection
75
+ self.output_projector = nn.Sequential(nn.LazyLinear(out_dims))
76
+
77
+ def compute_certainty(self, current_prediction):
78
+ """Compute the certainty of the current prediction."""
79
+ B = current_prediction.size(0)
80
+ reshaped_pred = current_prediction.reshape([B] +self.prediction_reshaper)
81
+ ne = compute_normalized_entropy(reshaped_pred)
82
+ current_certainty = torch.stack((ne, 1-ne), -1)
83
+ return current_certainty
84
+
85
+ def get_kv_for_step(self, stepi, x, z, thought_steps, prev_input=None, prev_kv=None):
86
+ is_digit_step, is_question_step, is_answer_step = thought_steps.determine_step_type(stepi)
87
+
88
+ if is_digit_step:
89
+ current_input = x[:, stepi]
90
+ if prev_input is not None and torch.equal(current_input, prev_input):
91
+ return prev_kv, prev_input
92
+ kv = self.kv_proj(self.backbone_digit(current_input).flatten(2).permute(0, 2, 1))
93
+
94
+ elif is_question_step:
95
+ offset = stepi - thought_steps.total_iterations_for_digits
96
+ current_input = z[:, offset].squeeze(0)
97
+ if prev_input is not None and torch.equal(current_input, prev_input):
98
+ return prev_kv, prev_input
99
+ is_index_step, is_operator_step = thought_steps.determine_answer_step_type(stepi)
100
+ if is_index_step:
101
+ kv = self.kv_proj(self.index_backbone(current_input))
102
+ elif is_operator_step:
103
+ kv = self.kv_proj(self.operator_backbone(current_input))
104
+ else:
105
+ raise ValueError("Invalid step type for question processing.")
106
+
107
+ elif is_answer_step:
108
+ current_input = None
109
+ kv = torch.zeros((x.size(0), self.d_input), device=x.device)
110
+
111
+ else:
112
+ raise ValueError("Invalid step type.")
113
+
114
+ return kv, current_input
115
+
116
+ def forward(self, x, z, track=False):
117
+ """
118
+ Forward pass.
119
+ Executes T=iterations steps.
120
+ """
121
+ B = x.size(0) # Batch size
122
+
123
+ # --- Tracking Initialization ---
124
+ activations_tracking = []
125
+ attention_tracking = [] # Note: reshaping this correctly requires knowing num_heads
126
+ embedding_tracking = []
127
+
128
+ thought_steps = ThoughtSteps(self.iterations_per_digit, self.iterations_per_question_part, self.total_iterations_for_answering, x.size(1), z.size(1))
129
+
130
+ # --- Initialise Recurrent State ---
131
+ hidden_state = torch.repeat_interleave(self.start_hidden_state.unsqueeze(0), x.size(0), 0)
132
+ cell_state = torch.repeat_interleave(self.start_cell_state.unsqueeze(0), x.size(0), 0)
133
+
134
+ state_trace = [hidden_state]
135
+
136
+ device = hidden_state.device
137
+
138
+ # Storage for outputs per iteration
139
+ predictions = torch.empty(B, self.out_dims, thought_steps.total_iterations, device=device, dtype=x.dtype) # Adjust dtype if needed
140
+ certainties = torch.empty(B, 2, thought_steps.total_iterations, device=device, dtype=x.dtype) # Adjust dtype if needed
141
+
142
+ prev_input = None
143
+ prev_kv = None
144
+
145
+ # --- Recurrent Loop (T=iterations steps) ---
146
+ for stepi in range(thought_steps.total_iterations):
147
+
148
+ is_digit_step, is_question_step, is_answer_step = thought_steps.determine_step_type(stepi)
149
+ kv, prev_input = self.get_kv_for_step(stepi, x, z, thought_steps, prev_input, prev_kv)
150
+ prev_kv = kv
151
+
152
+ # --- Interact with Data via Attention ---
153
+ attn_weights = None
154
+ if is_digit_step:
155
+ q = self.q_proj(hidden_state).unsqueeze(1)
156
+ attn_out, attn_weights = self.attention(q, kv, kv, average_attn_weights=False, need_weights=True)
157
+ lstm_input = attn_out.squeeze(1)
158
+ else:
159
+ lstm_input = kv
160
+
161
+
162
+
163
+ hidden_state, cell_state = self.lstm_cell(lstm_input.squeeze(1), (hidden_state, cell_state))
164
+ state_trace.append(hidden_state)
165
+
166
+ # --- Get Predictions and Certainties ---
167
+ current_prediction = self.output_projector(hidden_state)
168
+ current_certainty = self.compute_certainty(current_prediction)
169
+
170
+ predictions[..., stepi] = current_prediction
171
+ certainties[..., stepi] = current_certainty
172
+
173
+ # --- Tracking ---
174
+ if track:
175
+ activations_tracking.append(hidden_state.squeeze(1).detach().cpu().numpy())
176
+ if attn_weights is not None:
177
+ attention_tracking.append(attn_weights.detach().cpu().numpy())
178
+ if is_question_step:
179
+ embedding_tracking.append(kv.detach().cpu().numpy())
180
+
181
+ # --- Return Values ---
182
+ if track:
183
+ return predictions, certainties, None, np.array(activations_tracking), np.array(activations_tracking), np.array(attention_tracking), np.array(embedding_tracking)
184
+ return predictions, certainties, None
models/lstm_rl.py ADDED
@@ -0,0 +1,96 @@
+ import torch.nn as nn
2
+ import torch
3
+ import torch.nn.functional as F # Used for GLU if not in modules
4
+ import numpy as np
5
+ import math
6
+
7
+ # Local imports (Assuming these contain necessary custom modules)
8
+ from models.modules import *
9
+ from models.utils import * # Assuming compute_decay, compute_normalized_entropy are here
10
+
11
+
12
+ class LSTMBaseline(nn.Module):
13
+ """
14
+
15
+ LSTM Baseline
16
+
17
+ Args:
18
+ iterations (int): Number of internal 'thought' steps (T, in paper).
19
+ d_model (int): Core dimensionality of the CTM's latent space (D, in paper).
20
+ d_input (int): Dimensionality of projected attention outputs or direct input features.
21
+ backbone_type (str): Type of feature extraction backbone (e.g., 'resnet18-2', 'none').
22
+ """
23
+
24
+ def __init__(self,
25
+ iterations,
26
+ d_model,
27
+ d_input,
28
+ backbone_type,
29
+ ):
30
+ super(LSTMBaseline, self).__init__()
31
+
32
+ # --- Core Parameters ---
33
+ self.iterations = iterations
34
+ self.d_model = d_model
35
+ self.backbone_type = backbone_type
36
+
37
+ # --- Input Assertions ---
38
+ assert backbone_type in ('navigation-backbone', 'classic-control-backbone'), f"Invalid backbone_type: {backbone_type}"
39
+
40
+ # --- Backbone / Feature Extraction ---
41
+ if self.backbone_type == 'navigation-backbone':
42
+ grid_size = 7
43
+ self.backbone = MiniGridBackbone(d_input=d_input, grid_size=grid_size)
44
+ lstm_cell_input_dim = grid_size * grid_size * d_input
45
+
46
+ elif self.backbone_type == 'classic-control-backbone':
47
+ self.backbone = ClassicControlBackbone(d_input=d_input)
48
+ lstm_cell_input_dim = d_input
49
+
50
+ else:
51
+ raise NotImplementedError('The only backbones supported for RL are navigation (symbolic C x H x W inputs) and classic control (vectors of length D).')
52
+
53
+ # --- Core LSTM Modules ---
54
+ self.lstm_cell = nn.LSTMCell(lstm_cell_input_dim, d_model)
55
+ self.register_parameter('start_hidden_state', nn.Parameter(torch.zeros((d_model)).uniform_(-math.sqrt(1/(d_model)), math.sqrt(1/(d_model))), requires_grad=True))
56
+ self.register_parameter('start_cell_state', nn.Parameter(torch.zeros((d_model)).uniform_(-math.sqrt(1/(d_model)), math.sqrt(1/(d_model))), requires_grad=True))
57
+
58
+ def compute_features(self, x):
59
+ """Applies backbone and positional embedding to input."""
60
+ return self.backbone(x)
61
+
62
+
63
+ def forward(self, x, hidden_states, track=False):
64
+ """
65
+ Forward pass.
66
+ Executes T=iterations steps.
67
+ """
68
+
69
+ # --- Tracking Initialization ---
70
+ activations_tracking = []
71
+
72
+ # --- Featurise Input Data ---
73
+ features = self.compute_features(x)
74
+
75
+ hidden_state = hidden_states[0]
76
+ cell_state = hidden_states[1]
77
+
78
+ # --- Recurrent Loop ---
79
+ for stepi in range(self.iterations):
80
+
81
+ lstm_input = features.reshape(x.size(0), -1)
82
+ hidden_state, cell_state = self.lstm_cell(lstm_input.squeeze(1), (hidden_state, cell_state))
83
+
84
+ # --- Tracking ---
85
+ if track:
86
+ activations_tracking.append(hidden_state.squeeze(1).detach().cpu().numpy())
87
+
88
+ hidden_states = (
89
+ hidden_state,
90
+ cell_state
91
+ )
92
+
93
+ # --- Return Values ---
94
+ if track:
95
+ return hidden_state, hidden_states, np.array(activations_tracking), np.array(activations_tracking)
96
+ return hidden_state, hidden_states
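As with the CTM RL variant, the recurrent state is owned by the caller, here as a plain `(h, c)` pair for the LSTM cell. A hedged sketch (sizes illustrative; the learned start states would normally be expanded over the batch at episode reset):

```python
import torch

B, d_model = 4, 256                    # illustrative sizes only
h0 = torch.zeros(B, d_model)           # stand-in for start_hidden_state expanded over the batch
c0 = torch.zeros(B, d_model)           # stand-in for start_cell_state expanded over the batch
hidden_states = (h0, c0)
# output, hidden_states = lstm_baseline(obs, hidden_states)
```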
models/modules.py ADDED
@@ -0,0 +1,692 @@
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F # Used for GLU
4
+ import math
5
+ import numpy as np
6
+
7
+ # Assuming 'add_coord_dim' is defined in models.utils
8
+ from models.utils import add_coord_dim
9
+
10
+ # --- Basic Utility Modules ---
11
+
12
+ class Identity(nn.Module):
13
+ """
14
+ Identity Module.
15
+
16
+ Returns the input tensor unchanged. Useful as a placeholder or a no-op layer
17
+ in nn.Sequential containers or conditional network parts.
18
+ """
19
+ def __init__(self):
20
+ super().__init__()
21
+
22
+ def forward(self, x):
23
+ return x
24
+
25
+
26
+ class Squeeze(nn.Module):
27
+ """
28
+ Squeeze Module.
29
+
30
+ Removes a specified dimension of size 1 from the input tensor.
31
+ Useful for incorporating tensor dimension squeezing within nn.Sequential.
32
+
33
+ Args:
34
+ dim (int): The dimension to squeeze.
35
+ """
36
+ def __init__(self, dim):
37
+ super().__init__()
38
+ self.dim = dim
39
+
40
+ def forward(self, x):
41
+ return x.squeeze(self.dim)
42
+
43
+ # --- Core CTM Component Modules ---
44
+
45
+ class SynapseUNET(nn.Module):
46
+ """
47
+ UNET-style architecture for the Synapse Model (f_theta1 in the paper).
48
+
49
+ This module implements the connections between neurons in the CTM's latent
50
+ space. It processes the combined input (previous post-activation state z^t
51
+ and attention output o^t) to produce the pre-activations (a^t) for the
52
+ next internal tick (Eq. 1 in the paper).
53
+
54
+ While a simpler Linear or MLP layer can be used, the paper notes
55
+ that this U-Net structure empirically performed better, suggesting benefit
56
+ from more flexible synaptic connections. This implementation
57
+ uses `depth` points in linspace and creates `depth-1` down/up blocks.
58
+
59
+ Args:
60
+ in_dims (int): Number of input dimensions (d_model + d_input).
61
+ out_dims (int): Number of output dimensions (d_model).
62
+ depth (int): Determines structure size; creates `depth-1` down/up blocks.
63
+ minimum_width (int): Smallest channel width at the U-Net bottleneck.
64
+ dropout (float): Dropout rate applied within down/up projections.
65
+ """
66
+ def __init__(self,
67
+ out_dims,
68
+ depth,
69
+ minimum_width=16,
70
+ dropout=0.0):
71
+ super().__init__()
72
+ self.width_out = out_dims
73
+ self.n_deep = depth # Store depth just for reference if needed
74
+
75
+ # Define UNET structure based on depth
76
+ # Creates `depth` width values, leading to `depth-1` blocks
77
+ widths = np.linspace(out_dims, minimum_width, depth)
78
+
79
+ # Initial projection layer
80
+ self.first_projection = nn.Sequential(
81
+ nn.LazyLinear(int(widths[0])), # Project to the first width
82
+ nn.LayerNorm(int(widths[0])),
83
+ nn.SiLU()
84
+ )
85
+
86
+ # Downward path (encoding layers)
87
+ self.down_projections = nn.ModuleList()
88
+ self.up_projections = nn.ModuleList()
89
+ self.skip_lns = nn.ModuleList()
90
+ num_blocks = len(widths) - 1 # Number of down/up blocks created
91
+
92
+ for i in range(num_blocks):
93
+ # Down block: widths[i] -> widths[i+1]
94
+ self.down_projections.append(nn.Sequential(
95
+ nn.Dropout(dropout),
96
+ nn.Linear(int(widths[i]), int(widths[i+1])),
97
+ nn.LayerNorm(int(widths[i+1])),
98
+ nn.SiLU()
99
+ ))
100
+ # Up block: widths[i+1] -> widths[i]
101
+ # Note: Up blocks are added in order matching down blocks conceptually,
102
+ # but applied in reverse order in the forward pass.
103
+ self.up_projections.append(nn.Sequential(
104
+ nn.Dropout(dropout),
105
+ nn.Linear(int(widths[i+1]), int(widths[i])),
106
+ nn.LayerNorm(int(widths[i])),
107
+ nn.SiLU()
108
+ ))
109
+ # Skip connection LayerNorm operates on width[i]
110
+ self.skip_lns.append(nn.LayerNorm(int(widths[i])))
111
+
112
+ def forward(self, x):
113
+ # Initial projection
114
+ out_first = self.first_projection(x)
115
+
116
+ # Downward path, storing outputs for skip connections
117
+ outs_down = [out_first]
118
+ for layer in self.down_projections:
119
+ outs_down.append(layer(outs_down[-1]))
120
+ # outs_down contains [level_0, level_1, ..., level_depth-1=bottleneck] outputs
121
+
122
+ # Upward path, starting from the bottleneck output
123
+ outs_up = outs_down[-1] # Bottleneck activation
124
+ num_blocks = len(self.up_projections) # Should be depth - 1
125
+
126
+ for i in range(num_blocks):
127
+ # Apply up projection in reverse order relative to down blocks
128
+ # up_projection[num_blocks - 1 - i] processes deeper features first
129
+ up_layer_idx = num_blocks - 1 - i
130
+ out_up = self.up_projections[up_layer_idx](outs_up)
131
+
132
+ # Get corresponding skip connection from downward path
133
+ # skip_connection index = num_blocks - 1 - i (same as up_layer_idx)
134
+ # This matches the output width of the up_projection[up_layer_idx]
135
+ skip_idx = up_layer_idx
136
+ skip_connection = outs_down[skip_idx]
137
+
138
+ # Add skip connection and apply LayerNorm corresponding to this level
139
+ # skip_lns index also corresponds to the level = skip_idx
140
+ outs_up = self.skip_lns[skip_idx](out_up + skip_connection)
141
+
142
+ # The final output after all up-projections
143
+ return outs_up
144
+
145
+
146
+ class SuperLinear(nn.Module):
147
+ """
148
+ SuperLinear Layer: Implements Neuron-Level Models (NLMs) for the CTM.
149
+
150
+ This layer is the core component enabling Neuron-Level Models (NLMs),
151
+ referred to as g_theta_d in the paper (Eq. 3). It applies N independent
152
+ linear transformations (or small MLPs when used sequentially) to corresponding
153
+ slices of the input tensor along a specified dimension (typically the neuron
154
+ or feature dimension).
155
+
156
+ How it works for NLMs:
157
+ - The input `x` is expected to be the pre-activation history for each neuron,
158
+ shaped (batch_size, n_neurons=N, history_length=in_dims).
159
+ - This layer holds unique weights (`w1`) and biases (`b1`) for *each* of the `N` neurons.
160
+ `w1` has shape (in_dims, out_dims, N), `b1` has shape (1, N, out_dims).
161
+ - `torch.einsum('BDM,MHD->BDH', x, self.w1)` performs N independent matrix
+ multiplications in parallel (mapping from dim `M` to `H` for each neuron `D`):
163
+ - For each neuron `n` (from 0 to N-1):
164
+ - It takes the neuron's history `x[:, n, :]` (shape B, in_dims).
165
+ - Multiplies it by the neuron's unique weight matrix `self.w1[:, :, n]` (shape in_dims, out_dims).
166
+ - Resulting in `out[:, n, :]` (shape B, out_dims).
167
+ - The unique bias `self.b1[:, n, :]` is added.
168
+ - The result is squeezed on the last dim (if out_dims=1) and scaled by `T`.
169
+
170
+ This allows each neuron `d` to process its temporal history `A_d^t` using
171
+ its private parameters `theta_d` to produce the post-activation `z_d^{t+1}`,
172
+ enabling the fine-grained temporal dynamics central to the CTM.
173
+ It's typically used within the `trace_processor` module of the main CTM class.
174
+
175
+ Args:
176
+ in_dims (int): Input dimension (typically `memory_length`).
177
+ out_dims (int): Output dimension per neuron.
178
+ N (int): Number of independent linear models (typically `d_model`).
179
+ T (float): Initial value for learnable temperature/scaling factor applied to output.
180
+ do_norm (bool): Apply Layer Normalization to the input history before linear transform.
181
+ dropout (float): Dropout rate applied to the input.
182
+ """
183
+ def __init__(self,
184
+ in_dims,
185
+ out_dims,
186
+ N,
187
+ T=1.0,
188
+ do_norm=False,
189
+ dropout=0):
190
+ super().__init__()
191
+ # N is the number of neurons (d_model), in_dims is the history length (memory_length)
192
+ self.dropout = nn.Dropout(dropout) if dropout > 0 else Identity()
193
+ self.in_dims = in_dims # Corresponds to memory_length
194
+ # LayerNorm applied across the history dimension for each neuron independently
195
+ self.layernorm = nn.LayerNorm(in_dims, elementwise_affine=True) if do_norm else Identity()
196
+ self.do_norm = do_norm
197
+
198
+ # Initialize weights and biases
199
+ # w1 shape: (memory_length, out_dims, d_model)
200
+ self.register_parameter('w1', nn.Parameter(
201
+ torch.empty((in_dims, out_dims, N)).uniform_(
202
+ -1/math.sqrt(in_dims + out_dims),
203
+ 1/math.sqrt(in_dims + out_dims)
204
+ ), requires_grad=True)
205
+ )
206
+ # b1 shape: (1, d_model, out_dims)
207
+ self.register_parameter('b1', nn.Parameter(torch.zeros((1, N, out_dims)), requires_grad=True))
208
+ # Learnable temperature/scaler T
209
+ self.register_parameter('T', nn.Parameter(torch.Tensor([T])))
210
+
211
+ def forward(self, x):
212
+ """
213
+ Args:
214
+ x (torch.Tensor): Input tensor, expected shape (B, N, in_dims)
215
+ where B=batch, N=d_model, in_dims=memory_length.
216
+ Returns:
217
+ torch.Tensor: Output tensor, shape (B, N) after squeeze(-1).
218
+ """
219
+ # Input shape: (B, D, M) where D=d_model=N neurons in CTM, M=history/memory length
220
+ out = self.dropout(x)
221
+ # LayerNorm across the memory_length dimension (dim=-1)
222
+ out = self.layernorm(out) # Shape remains (B, N, M)
223
+
224
+ # Apply N independent linear models using einsum
225
+ # einsum('BDM,MHD->BDH', ...)
226
+ # x: (B=batch size, D=N neurons, one NLM per each of these, M=history/memory length)
227
+ # w1: (M, H=hidden dims if using MLP, otherwise output, D=N neurons, parallel)
228
+ # b1: (1, D=N neurons, H)
229
+ # einsum result: (B, D, H)
230
+ # Applying bias requires matching shapes, b1 is broadcasted.
231
+ out = torch.einsum('BDM,MHD->BDH', out, self.w1) + self.b1
232
+
233
+ # Squeeze the output dimension (assumed to be 1 usually) and scale by T
234
+ # This matches the original code's structure exactly.
235
+ out = out.squeeze(-1) / self.T
236
+ return out
237
+
238
+
239
+ # --- Backbone Modules ---
240
+
241
+ class ParityBackbone(nn.Module):
242
+ def __init__(self, n_embeddings, d_embedding):
243
+ super(ParityBackbone, self).__init__()
244
+ self.embedding = nn.Embedding(n_embeddings, d_embedding)
245
+
246
+ def forward(self, x):
247
+ """
248
+ Maps -1 (negative parity) to 0 and 1 (positive) to 1
249
+ """
250
+ x = (x == 1).long()
251
+ return self.embedding(x.long()).transpose(1, 2) # Transpose for compatibility with other backbones
252
+
253
+ class QAMNISTOperatorEmbeddings(nn.Module):
254
+ def __init__(self, num_operator_types, d_projection):
255
+ super(QAMNISTOperatorEmbeddings, self).__init__()
256
+ self.embedding = nn.Embedding(num_operator_types, d_projection)
257
+
258
+ def forward(self, x):
259
+ # -1 for plus and -2 for minus
260
+ return self.embedding(-x - 1)
261
+
262
+ class QAMNISTIndexEmbeddings(torch.nn.Module):
263
+ def __init__(self, max_seq_length, embedding_dim):
264
+ super().__init__()
265
+ self.max_seq_length = max_seq_length
266
+ self.embedding_dim = embedding_dim
267
+
268
+ embedding = torch.zeros(max_seq_length, embedding_dim)
269
+ position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
270
+ div_term = torch.exp(torch.arange(0, embedding_dim, 2).float() * (-math.log(10000.0) / embedding_dim))
271
+
272
+ embedding[:, 0::2] = torch.sin(position * div_term)
273
+ embedding[:, 1::2] = torch.cos(position * div_term)
274
+
275
+ self.register_buffer('embedding', embedding)
276
+
277
+ def forward(self, x):
278
+ return self.embedding[x]
279
+
280
+ class ThoughtSteps:
281
+ """
282
+ Helper class for managing "thought steps" in the ctm_qamnist pipeline.
283
+
284
+ Args:
285
+ iterations_per_digit (int): Number of iterations for each digit.
286
+ iterations_per_question_part (int): Number of iterations for each question part.
287
+ total_iterations_for_answering (int): Total number of iterations for answering.
288
+ total_iterations_for_digits (int): Total number of iterations for digits.
289
+ total_iterations_for_question (int): Total number of iterations for question.
290
+ """
291
+ def __init__(self, iterations_per_digit, iterations_per_question_part, total_iterations_for_answering, total_iterations_for_digits, total_iterations_for_question):
292
+ self.iterations_per_digit = iterations_per_digit
293
+ self.iterations_per_question_part = iterations_per_question_part
294
+ self.total_iterations_for_digits = total_iterations_for_digits
295
+ self.total_iterations_for_question = total_iterations_for_question
296
+ self.total_iterations_for_answering = total_iterations_for_answering
297
+ self.total_iterations = self.total_iterations_for_digits + self.total_iterations_for_question + self.total_iterations_for_answering
298
+
299
+ def determine_step_type(self, stepi: int):
300
+ is_digit_step = stepi < self.total_iterations_for_digits
301
+ is_question_step = self.total_iterations_for_digits <= stepi < self.total_iterations_for_digits + self.total_iterations_for_question
302
+ is_answer_step = stepi >= self.total_iterations_for_digits + self.total_iterations_for_question
303
+ return is_digit_step, is_question_step, is_answer_step
304
+
305
+ def determine_answer_step_type(self, stepi: int):
306
+ step_within_questions = stepi - self.total_iterations_for_digits
307
+ if step_within_questions % (2 * self.iterations_per_question_part) < self.iterations_per_question_part:
308
+ is_index_step = True
309
+ is_operator_step = False
310
+ else:
311
+ is_index_step = False
312
+ is_operator_step = True
313
+ return is_index_step, is_operator_step
314
+
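+ # Worked example for ThoughtSteps (hypothetical values, for illustration only): with
+ # total_iterations_for_digits=6, iterations_per_question_part=2 and total_iterations_for_question=4,
+ # steps 0-5 are digit steps, steps 6-7 are question index steps, steps 8-9 are question
+ # operator steps, and any later step is an answering step.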
315
+ class MNISTBackbone(nn.Module):
316
+ """
317
+ Simple backbone for MNIST feature extraction.
318
+ """
319
+ def __init__(self, d_input):
320
+ super(MNISTBackbone, self).__init__()
321
+ self.layers = nn.Sequential(
322
+ nn.LazyConv2d(d_input, kernel_size=3, stride=1, padding=1),
323
+ nn.BatchNorm2d(d_input),
324
+ nn.ReLU(),
325
+ nn.MaxPool2d(2, 2),
326
+ nn.LazyConv2d(d_input, kernel_size=3, stride=1, padding=1),
327
+ nn.BatchNorm2d(d_input),
328
+ nn.ReLU(),
329
+ nn.MaxPool2d(2, 2),
330
+ )
331
+
332
+ def forward(self, x):
333
+ return self.layers(x)
334
+
335
+
336
+ class MiniGridBackbone(nn.Module):
337
+ def __init__(self, d_input, grid_size=7, num_objects=11, num_colors=6, num_states=3, embedding_dim=8):
338
+ super().__init__()
339
+ self.object_embedding = nn.Embedding(num_objects, embedding_dim)
340
+ self.color_embedding = nn.Embedding(num_colors, embedding_dim)
341
+ self.state_embedding = nn.Embedding(num_states, embedding_dim)
342
+
343
+ self.position_embedding = nn.Embedding(grid_size * grid_size, embedding_dim)
344
+
345
+ self.project_to_d_projection = nn.Sequential(
346
+ nn.Linear(embedding_dim * 4, d_input * 2),
347
+ nn.GLU(),
348
+ nn.LayerNorm(d_input),
349
+ nn.Linear(d_input, d_input * 2),
350
+ nn.GLU(),
351
+ nn.LayerNorm(d_input)
352
+ )
353
+
354
+ def forward(self, x):
355
+ x = x.long()
356
+ B, H, W, C = x.size()
357
+
358
+ object_idx = x[:,:,:, 0]
359
+ color_idx = x[:,:,:, 1]
360
+ state_idx = x[:,:,:, 2]
361
+
362
+ obj_embed = self.object_embedding(object_idx)
363
+ color_embed = self.color_embedding(color_idx)
364
+ state_embed = self.state_embedding(state_idx)
365
+
366
+ pos_idx = torch.arange(H * W, device=x.device).view(1, H, W).expand(B, -1, -1)
367
+ pos_embed = self.position_embedding(pos_idx)
368
+
369
+ out = self.project_to_d_projection(torch.cat([obj_embed, color_embed, state_embed, pos_embed], dim=-1))
370
+ return out
371
+
372
+ class ClassicControlBackbone(nn.Module):
373
+ def __init__(self, d_input):
374
+ super().__init__()
375
+ self.input_projector = nn.Sequential(
376
+ nn.Flatten(),
377
+ nn.LazyLinear(d_input * 2),
378
+ nn.GLU(),
379
+ nn.LayerNorm(d_input),
380
+ nn.LazyLinear(d_input * 2),
381
+ nn.GLU(),
382
+ nn.LayerNorm(d_input)
383
+ )
384
+
385
+ def forward(self, x):
386
+ return self.input_projector(x)
387
+
388
+
389
+ class ShallowWide(nn.Module):
390
+ """
391
+ Simple, wide, shallow convolutional backbone for image feature extraction.
392
+
393
+ Alternative to ResNet, uses grouped convolutions and GLU activations.
394
+ Fixed structure, useful for specific experiments.
395
+ """
396
+ def __init__(self):
397
+ super(ShallowWide, self).__init__()
398
+ # LazyConv2d infers input channels
399
+ self.layers = nn.Sequential(
400
+ nn.LazyConv2d(4096, kernel_size=3, stride=2, padding=1), # Output channels = 4096
401
+ nn.GLU(dim=1), # Halves channels to 2048
402
+ nn.BatchNorm2d(2048),
403
+ # Grouped convolution maintains width but processes groups independently
404
+ nn.Conv2d(2048, 4096, kernel_size=3, stride=1, padding=1, groups=32),
405
+ nn.GLU(dim=1), # Halves channels to 2048
406
+ nn.BatchNorm2d(2048)
407
+ )
408
+ def forward(self, x):
409
+ return self.layers(x)
410
+
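+ # Shape sketch (hypothetical input, for illustration only): a (B, 3, 32, 32) image gives
+ # (B, 2048, 16, 16): the stride-2 LazyConv2d halves the spatial size and GLU(dim=1) halves
+ # 4096 channels to 2048; the grouped convolution keeps both unchanged thereafter.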
411
+
412
+ class PretrainedResNetWrapper(nn.Module):
413
+ """
414
+ Wrapper to use standard pre-trained ResNet models from torchvision.
415
+
416
+ Loads a specified ResNet architecture pre-trained on ImageNet, removes the
417
+ final classification layer (fc), average pooling, and optionally later layers
418
+ (e.g., layer4), allowing it to be used as a feature extractor backbone.
419
+
420
+ Args:
421
+ resnet_type (str): Name of the ResNet model (e.g., 'resnet18', 'resnet50').
422
+ fine_tune (bool): If False, freezes the weights of the pre-trained backbone.
423
+ """
424
+ def __init__(self, resnet_type, fine_tune=True):
425
+ super(PretrainedResNetWrapper, self).__init__()
426
+ self.resnet_type = resnet_type
427
+ self.backbone = torch.hub.load('pytorch/vision:v0.10.0', resnet_type, pretrained=True)
428
+
429
+ if not fine_tune:
430
+ for param in self.backbone.parameters():
431
+ param.requires_grad = False
432
+
433
+ # Remove final layers to use as feature extractor
434
+ self.backbone.avgpool = Identity()
435
+ self.backbone.fc = Identity()
436
+ # Keep layer4 by default, user can modify instance if needed
437
+ # self.backbone.layer4 = Identity()
438
+
439
+ def forward(self, x):
440
+ # Get features from the modified ResNet
441
+ out = self.backbone(x)
442
+
443
+ # Reshape output to (B, C, H, W) - This is heuristic based on original comment.
444
+ # User might need to adjust this based on which layers are kept/removed.
445
+ # Infer C based on ResNet type (example values)
446
+ nc = 256 if ('18' in self.resnet_type or '34' in self.resnet_type) else 512 if '50' in self.resnet_type else 1024 if '101' in self.resnet_type else 2048 # Approx for layer3/4 output channel numbers
447
+ # Infer H, W assuming output is flattened C * H * W
448
+ num_features = out.shape[-1]
449
+ # This calculation assumes nc is correct and feature map is square
450
+ wh_squared = num_features / nc
451
+ if wh_squared <= 0 or not float(wh_squared).is_integer():
452
+ print(f"Warning: Cannot reliably reshape PretrainedResNetWrapper output. nc={nc}, num_features={num_features}")
453
+ # Return potentially flattened features if reshape fails
454
+ return out
455
+ wh = int(np.sqrt(wh_squared))
456
+
457
+ return out.reshape(x.size(0), nc, wh, wh)
458
+
459
+ # --- Positional Encoding Modules ---
460
+
461
+ class LearnableFourierPositionalEncoding(nn.Module):
462
+ """
463
+ Learnable Fourier Feature Positional Encoding.
464
+
465
+ Implements Algorithm 1 from "Learnable Fourier Features for Multi-Dimensional
466
+ Spatial Positional Encoding" (https://arxiv.org/pdf/2106.02795.pdf).
467
+ Provides positional information for 2D feature maps.
468
+
469
+ Args:
470
+ d_model (int): The output dimension of the positional encoding (D).
471
+ G (int): Positional groups (default 1).
472
+ M (int): Dimensionality of input coordinates (default 2 for H, W).
473
+ F_dim (int): Dimension of the Fourier features.
474
+ H_dim (int): Hidden dimension of the MLP.
475
+ gamma (float): Initialization scale for the Fourier projection weights (Wr).
476
+ """
477
+ def __init__(self, d_model,
478
+ G=1, M=2,
479
+ F_dim=256,
480
+ H_dim=128,
481
+ gamma=1/2.5,
482
+ ):
483
+ super().__init__()
484
+ self.G = G
485
+ self.M = M
486
+ self.F_dim = F_dim
487
+ self.H_dim = H_dim
488
+ self.D = d_model
489
+ self.gamma = gamma
490
+
491
+ self.Wr = nn.Linear(self.M, self.F_dim // 2, bias=False)
492
+ self.mlp = nn.Sequential(
493
+ nn.Linear(self.F_dim, self.H_dim, bias=True),
494
+ nn.GLU(), # Halves H_dim
495
+ nn.Linear(self.H_dim // 2, self.D // self.G),
496
+ nn.LayerNorm(self.D // self.G)
497
+ )
498
+
499
+ self.init_weights()
500
+
501
+ def init_weights(self):
502
+ nn.init.normal_(self.Wr.weight.data, mean=0, std=self.gamma ** -2)
503
+
504
+ def forward(self, x):
505
+ """
506
+ Computes positional encodings for the input feature map x.
507
+
508
+ Args:
509
+ x (torch.Tensor): Input feature map, shape (B, C, H, W).
510
+
511
+ Returns:
512
+ torch.Tensor: Positional encoding tensor, shape (B, D, H, W).
513
+ """
514
+ B, C, H, W = x.shape
515
+ # Creates coordinates based on (H, W) and repeats for batch B.
516
+ # Takes x[:,0] assuming channel dim isn't needed for coords.
517
+ x_coord = add_coord_dim(x[:,0]) # Expects (B, H, W) -> (B, H, W, 2)
518
+
519
+ # Compute Fourier features
520
+ projected = self.Wr(x_coord) # (B, H, W, F_dim // 2)
521
+ cosines = torch.cos(projected)
522
+ sines = torch.sin(projected)
523
+ F = (1.0 / math.sqrt(self.F_dim)) * torch.cat([cosines, sines], dim=-1) # (B, H, W, F_dim)
524
+
525
+ # Project features through MLP
526
+ Y = self.mlp(F) # (B, H, W, D // G)
527
+
528
+ # Reshape to (B, D, H, W)
529
+ PEx = Y.permute(0, 3, 1, 2) # Assuming G=1
530
+ return PEx
531
+
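+ # Usage sketch (hypothetical sizes, for illustration only):
+ # pe = LearnableFourierPositionalEncoding(d_model=128)
+ # pe(torch.randn(2, 64, 16, 16)).shape == torch.Size([2, 128, 16, 16])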
532
+
533
+ class MultiLearnableFourierPositionalEncoding(nn.Module):
534
+ """
535
+ Combines multiple LearnableFourierPositionalEncoding modules with different
536
+ initialization scales (gamma) via a learnable weighted sum.
537
+
538
+ Allows the model to learn an optimal combination of positional frequencies.
539
+
540
+ Args:
541
+ d_model (int): Output dimension of the encoding.
542
+ G, M, F_dim, H_dim: Parameters passed to underlying LearnableFourierPositionalEncoding.
543
+ gamma_range (list[float]): Min and max gamma values for the linspace.
544
+ N (int): Number of parallel embedding modules to create.
545
+ """
546
+ def __init__(self, d_model,
547
+ G=1, M=2,
548
+ F_dim=256,
549
+ H_dim=128,
550
+ gamma_range=[1.0, 0.1], # Default range
551
+ N=10,
552
+ ):
553
+ super().__init__()
554
+ self.embedders = nn.ModuleList()
555
+ for gamma in np.linspace(gamma_range[0], gamma_range[1], N):
556
+ self.embedders.append(LearnableFourierPositionalEncoding(d_model, G, M, F_dim, H_dim, gamma))
557
+
558
+ # Learnable combination weights over the N embedders (registered under the name 'combination')
560
+ self.register_parameter('combination', torch.nn.Parameter(torch.ones(N), requires_grad=True))
561
+ self.N = N
562
+
563
+
564
+ def forward(self, x):
565
+ """
566
+ Computes combined positional encoding.
567
+
568
+ Args:
569
+ x (torch.Tensor): Input feature map, shape (B, C, H, W).
570
+
571
+ Returns:
572
+ torch.Tensor: Combined positional encoding tensor, shape (B, D, H, W).
573
+ """
574
+ # Compute embeddings from all modules and stack: (N, B, D, H, W)
575
+ pos_embs = torch.stack([emb(x) for emb in self.embedders], dim=0)
576
+
577
+ # Compute combination weights using softmax
578
+ # Use registered parameter name 'combination'
579
+ # Reshape weights for broadcasting: (N,) -> (N, 1, 1, 1, 1)
580
+ weights = F.softmax(self.combination, dim=-1).view(self.N, 1, 1, 1, 1)
581
+
582
+ # Compute weighted sum over the N dimension
583
+ combined_emb = (pos_embs * weights).sum(0) # (B, D, H, W)
584
+ return combined_emb
585
+
586
+
587
+ class CustomRotationalEmbedding(nn.Module):
588
+ """
589
+ Custom Rotational Positional Embedding.
590
+
591
+ Generates 2D positional embeddings based on rotating a fixed start vector.
592
+ The rotation angle for each grid position is determined primarily by its
593
+ horizontal position (width dimension). The resulting rotated vectors are
594
+ concatenated and projected.
595
+
596
+ Note: The current implementation derives angles only from the width dimension (`x.size(-1)`).
597
+
598
+ Args:
599
+ d_model (int): Dimensionality of the output embeddings.
600
+ """
601
+ def __init__(self, d_model):
602
+ super(CustomRotationalEmbedding, self).__init__()
603
+ # Learnable 2D start vector
604
+ self.register_parameter('start_vector', nn.Parameter(torch.Tensor([0, 1]), requires_grad=True))
605
+ # Projects the 4D concatenated rotated vectors to d_model
606
+ # Input size 4 comes from concatenating two 2D rotated vectors
607
+ self.projection = nn.Sequential(nn.Linear(4, d_model))
608
+
609
+ def forward(self, x):
610
+ """
611
+ Computes rotational positional embeddings based on input width.
612
+
613
+ Args:
614
+ x (torch.Tensor): Input tensor (used for shape and device),
615
+ shape (batch_size, channels, height, width).
616
+ Returns:
617
+ Output tensor containing positional embeddings,
618
+ shape (1, d_model, height, width) - Batch dim is 1 as PE is same for all.
619
+ """
620
+ B, C, H, W = x.shape
621
+ device = x.device
622
+
623
+ # --- Generate rotations based only on Width ---
624
+ # Angles derived from width dimension
625
+ theta_rad = torch.deg2rad(torch.linspace(0, 180, W, device=device)) # Angle per column
626
+ cos_theta = torch.cos(theta_rad)
627
+ sin_theta = torch.sin(theta_rad)
628
+
629
+ # Create rotation matrices: Shape (W, 2, 2)
630
+ # Use unsqueeze(1) to allow stacking along dim 1
631
+ rotation_matrices = torch.stack([
632
+ torch.stack([cos_theta, -sin_theta], dim=-1), # Shape (W, 2)
633
+ torch.stack([sin_theta, cos_theta], dim=-1) # Shape (W, 2)
634
+ ], dim=1) # Stacks along dim 1 -> Shape (W, 2, 2)
635
+
636
+ # Rotate the start vector by column angle: Shape (W, 2)
637
+ rotated_vectors = torch.einsum('wij,j->wi', rotation_matrices, self.start_vector)
638
+
639
+ # --- Create Grid Key ---
640
+ # Original code uses repeats based on rotated_vectors.shape[0] (which is W) for both dimensions.
641
+ # This creates a (W, W, 4) key tensor.
642
+ key = torch.cat((
643
+ torch.repeat_interleave(rotated_vectors.unsqueeze(1), W, dim=1), # (W, 1, 2) -> (W, W, 2)
644
+ torch.repeat_interleave(rotated_vectors.unsqueeze(0), W, dim=0) # (1, W, 2) -> (W, W, 2)
645
+ ), dim=-1) # Shape (W, W, 4)
646
+
647
+ # Project the 4D key vector to d_model: Shape (W, W, d_model)
648
+ pe_grid = self.projection(key)
649
+
650
+ # Reshape to (1, d_model, W, W) and then select/resize to target H, W?
651
+ # Original code permutes to (d_model, W, W) and unsqueezes to (1, d_model, W, W)
652
+ pe = pe_grid.permute(2, 0, 1).unsqueeze(0)
653
+
654
+ # If H != W, this needs adjustment. Assuming H=W or cropping/padding happens later.
655
+ # Let's return the (1, d_model, W, W) tensor as generated by the original logic.
656
+ # If H != W, downstream code must handle the mismatch or this PE needs modification.
657
+ if H != W:
658
+ # Simple interpolation/cropping could be added, but sticking to original logic:
659
+ # Option 1: Interpolate
660
+ # pe = F.interpolate(pe, size=(H, W), mode='bilinear', align_corners=False)
661
+ # Option 2: Crop/Pad (e.g., crop if W > W_target, pad if W < W_target)
662
+ # Sticking to original: return shape (1, d_model, W, W)
663
+ pass
664
+
665
+ return pe
666
+
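+ # Shape sketch (hypothetical input, for illustration only): for x of shape (B, C, H, W) the
+ # module returns a (1, d_model, W, W) tensor; as noted above, angles come from the width
+ # dimension only, so non-square inputs need handling downstream.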
667
+ class CustomRotationalEmbedding1D(nn.Module):
668
+ def __init__(self, d_model):
669
+ super(CustomRotationalEmbedding1D, self).__init__()
670
+ self.projection = nn.Linear(2, d_model)
671
+
672
+ def forward(self, x):
673
+ start_vector = torch.tensor([0., 1.], device=x.device, dtype=torch.float)
674
+ theta_rad = torch.deg2rad(torch.linspace(0, 180, x.size(2), device=x.device))
675
+ cos_theta = torch.cos(theta_rad)
676
+ sin_theta = torch.sin(theta_rad)
677
+ cos_theta = cos_theta.unsqueeze(1) # Shape: (height, 1)
678
+ sin_theta = sin_theta.unsqueeze(1) # Shape: (height, 1)
679
+
680
+ # Create rotation matrices
681
+ rotation_matrices = torch.stack([
682
+ torch.cat([cos_theta, -sin_theta], dim=1),
683
+ torch.cat([sin_theta, cos_theta], dim=1)
684
+ ], dim=1) # Shape: (height, 2, 2)
685
+
686
+ # Rotate the start vector
687
+ rotated_vectors = torch.einsum('bij,j->bi', rotation_matrices, start_vector)
688
+
689
+ pe = self.projection(rotated_vectors)
690
+ pe = torch.repeat_interleave(pe.unsqueeze(0), x.size(0), 0)
691
+ return pe.transpose(1, 2) # Transpose for compatibility with other backbones
692
+
models/resnet.py ADDED
@@ -0,0 +1,374 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import os
4
+ from models.modules import Identity
5
+
6
+ __all__ = [
7
+ "ResNet",
8
+ "resnet18",
9
+ "resnet34",
10
+ "resnet50",
11
+ "resnet101",
12
+ "resnet152",
13
+ ]
14
+
15
+
16
+ def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1):
17
+ """3x3 convolution with padding"""
18
+ return nn.Conv2d(
19
+ in_planes,
20
+ out_planes,
21
+ kernel_size=3,
22
+ stride=stride,
23
+ padding=dilation,
24
+ groups=groups,
25
+ bias=False,
26
+ dilation=dilation,
27
+ )
28
+
29
+
30
+ def conv1x1(in_planes, out_planes, stride=1):
31
+ """1x1 convolution"""
32
+ return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)
33
+
34
+
35
+ class BasicBlock(nn.Module):
36
+ expansion = 1
37
+
38
+ def __init__(
39
+ self,
40
+ inplanes,
41
+ planes,
42
+ stride=1,
43
+ downsample=None,
44
+ groups=1,
45
+ base_width=64,
46
+ dilation=1,
47
+ norm_layer=None,
48
+ ):
49
+ super(BasicBlock, self).__init__()
50
+ if norm_layer is None:
51
+ norm_layer = nn.BatchNorm2d
52
+ if groups != 1 or base_width != 64:
53
+ raise ValueError("BasicBlock only supports groups=1 and base_width=64")
54
+ if dilation > 1:
55
+ raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
56
+ # Both self.conv1 and self.downsample layers downsample the input when stride != 1
57
+ self.conv1 = conv3x3(inplanes, planes, stride)
58
+ self.bn1 = norm_layer(planes)
59
+ self.relu = nn.ReLU(inplace=True)
60
+ self.conv2 = conv3x3(planes, planes)
61
+ self.bn2 = norm_layer(planes)
62
+ self.downsample = downsample
63
+ self.stride = stride
64
+
65
+ def forward(self, x):
66
+ identity = x
67
+
68
+ out = self.conv1(x)
69
+ out = self.bn1(out)
70
+ out = self.relu(out)
71
+
72
+ out = self.conv2(out)
73
+ out = self.bn2(out)
74
+
75
+ if self.downsample is not None:
76
+ identity = self.downsample(x)
77
+
78
+ out += identity
79
+
80
+ out = self.relu(out)
81
+ return out
82
+
83
+
84
+ class Bottleneck(nn.Module):
85
+ expansion = 4
86
+
87
+ def __init__(
88
+ self,
89
+ inplanes,
90
+ planes,
91
+ stride=1,
92
+ downsample=None,
93
+ groups=1,
94
+ base_width=64,
95
+ dilation=1,
96
+ norm_layer=None,
97
+ ):
98
+ super(Bottleneck, self).__init__()
99
+ if norm_layer is None:
100
+ norm_layer = nn.BatchNorm2d
101
+ width = int(planes * (base_width / 64.0)) * groups
102
+ # Both self.conv2 and self.downsample layers downsample the input when stride != 1
103
+ self.conv1 = conv1x1(inplanes, width)
104
+ self.bn1 = norm_layer(width)
105
+ self.conv2 = conv3x3(width, width, stride, groups, dilation)
106
+ self.bn2 = norm_layer(width)
107
+ self.conv3 = conv1x1(width, planes * self.expansion)
108
+ self.bn3 = norm_layer(planes * self.expansion)
109
+ self.relu = nn.ReLU(inplace=True)
110
+ self.downsample = downsample
111
+ self.stride = stride
112
+
113
+ def forward(self, x):
114
+ identity = x
115
+
116
+ out = self.conv1(x)
117
+ out = self.bn1(out)
118
+ out = self.relu(out)
119
+
120
+ out = self.conv2(out)
121
+ out = self.bn2(out)
122
+ out = self.relu(out)
123
+
124
+ out = self.conv3(out)
125
+ out = self.bn3(out)
126
+
127
+ if self.downsample is not None:
128
+ identity = self.downsample(x)
129
+
130
+ out += identity
131
+
132
+
133
+ # activation = None
134
+ # activation = out.detach().cpu().numpy()
135
+ out = self.relu(out)
136
+ # return out, activation
137
+
138
+ return out
139
+
140
+
141
+ class ResNet(nn.Module):
142
+ def __init__(
143
+ self,
144
+ in_channels,
145
+ feature_scales,
146
+ stride,
147
+ block,
148
+ layers,
149
+ num_classes=10,
150
+ zero_init_residual=False,
151
+ groups=1,
152
+ width_per_group=64,
153
+ replace_stride_with_dilation=None,
154
+ norm_layer=None,
155
+ do_initial_max_pool=True,
156
+ ):
157
+ super(ResNet, self).__init__()
158
+ if norm_layer is None:
159
+ norm_layer = nn.BatchNorm2d
160
+ self._norm_layer = norm_layer
161
+
162
+ self.inplanes = 64
163
+ self.dilation = 1
164
+ if replace_stride_with_dilation is None:
165
+ # each element in the tuple indicates if we should replace
166
+ # the 2x2 stride with a dilated convolution instead
167
+ replace_stride_with_dilation = [False, False, False]
168
+ if len(replace_stride_with_dilation) != 3:
169
+ raise ValueError(
170
+ "replace_stride_with_dilation should be None "
171
+ "or a 3-element tuple, got {}".format(replace_stride_with_dilation)
172
+ )
173
+ self.groups = groups
174
+ self.base_width = width_per_group
175
+
176
+ # NOTE: Important!
177
+ # This has changed from a kernel size of 7 (padding=3) to a kernel of 3 (padding=1)
178
+ # The reason for this was to limit the receptive field to constrain models to
179
+ # "Looking around" to gather information.
180
+
181
+ self.conv1 = nn.Conv2d(
182
+ in_channels, self.inplanes, kernel_size=3, stride=1, padding=1, bias=False
183
+ ) if in_channels in [1, 3] else nn.LazyConv2d(
184
+ self.inplanes, kernel_size=3, stride=1, padding=1, bias=False
185
+ )
186
+ # END
187
+
188
+ self.bn1 = norm_layer(self.inplanes)
189
+ self.relu = nn.ReLU(inplace=True)
190
+ self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1) if do_initial_max_pool else Identity()
191
+ self.layer1 = self._make_layer(block, 64, layers[0])
192
+ self.feature_scales = feature_scales
193
+ if 2 in feature_scales:
194
+ self.layer2 = self._make_layer(
195
+ block, 128, layers[1], stride=stride, dilate=replace_stride_with_dilation[0]
196
+ )
197
+ if 3 in feature_scales:
198
+ self.layer3 = self._make_layer(
199
+ block, 256, layers[2], stride=stride, dilate=replace_stride_with_dilation[1]
200
+ )
201
+ if 4 in feature_scales:
202
+ self.layer4 = self._make_layer(
203
+ block, 512, layers[3], stride=stride, dilate=replace_stride_with_dilation[2]
204
+ )
205
+
206
+ # NOTE: Commented this out as it is not used anymore for this work, kept it for reference
207
+ # self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
208
+ # self.fc = nn.Linear(512 * block.expansion, num_classes)
209
+
210
+ # for m in self.modules():
211
+ # if isinstance(m, nn.Conv2d):
212
+ # nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
213
+ # elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
214
+ # nn.init.constant_(m.weight, 1)
215
+ # nn.init.constant_(m.bias, 0)
216
+
217
+ # Zero-initialize the last BN in each residual branch,
218
+ # so that the residual branch starts with zeros, and each residual block behaves like an identity.
219
+ # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
220
+ if zero_init_residual:
221
+ for m in self.modules():
222
+ if isinstance(m, Bottleneck):
223
+ nn.init.constant_(m.bn3.weight, 0)
224
+ elif isinstance(m, BasicBlock):
225
+ nn.init.constant_(m.bn2.weight, 0)
226
+
227
+ def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
228
+ norm_layer = self._norm_layer
229
+ downsample = None
230
+ previous_dilation = self.dilation
231
+ if dilate:
232
+ self.dilation *= stride
233
+ stride = 1
234
+ if stride != 1 or self.inplanes != planes * block.expansion:
235
+ downsample = nn.Sequential(
236
+ conv1x1(self.inplanes, planes * block.expansion, stride),
237
+ norm_layer(planes * block.expansion),
238
+ )
239
+
240
+ layers = []
241
+ layers.append(
242
+ block(
243
+ self.inplanes,
244
+ planes,
245
+ stride,
246
+ downsample,
247
+ self.groups,
248
+ self.base_width,
249
+ previous_dilation,
250
+ norm_layer,
251
+ )
252
+ )
253
+ self.inplanes = planes * block.expansion
254
+ for _ in range(1, blocks):
255
+ layers.append(
256
+ block(
257
+ self.inplanes,
258
+ planes,
259
+ groups=self.groups,
260
+ base_width=self.base_width,
261
+ dilation=self.dilation,
262
+ norm_layer=norm_layer,
263
+ )
264
+ )
265
+
266
+ return nn.Sequential(*layers)
267
+
268
+ def forward(self, x):
269
+ activations = []
270
+ x = self.conv1(x)
271
+ x = self.bn1(x)
272
+ x = self.relu(x)
273
+ x = self.maxpool(x)
274
+ # if return_activations: activations.append(torch.clone(x))
275
+ x = self.layer1(x)
276
+
277
+ if 2 in self.feature_scales:
278
+ x = self.layer2(x)
279
+ if 3 in self.feature_scales:
280
+ x = self.layer3(x)
281
+ if 4 in self.feature_scales:
282
+ x = self.layer4(x)
283
+ return x
284
+
285
+
286
+ def _resnet(in_channels, feature_scales, stride, arch, block, layers, pretrained, progress, device, do_initial_max_pool, **kwargs):
287
+ model = ResNet(in_channels, feature_scales, stride, block, layers, do_initial_max_pool=do_initial_max_pool, **kwargs)
288
+ if pretrained:
289
+ assert in_channels==3
290
+ script_dir = os.path.dirname(__file__)
291
+ state_dict = torch.load(
292
+ script_dir + '/state_dicts/' + arch + ".pt", map_location=device
293
+ )
294
+ model.load_state_dict(state_dict, strict=False)
295
+ return model
296
+
297
+
298
+ def resnet18(in_channels, feature_scales, stride=2, pretrained=False, progress=True, device="cpu", do_initial_max_pool=True, **kwargs):
299
+ """Constructs a ResNet-18 model.
300
+ Args:
301
+ pretrained (bool): If True, returns a model pre-trained on ImageNet
302
+ progress (bool): If True, displays a progress bar of the download to stderr
303
+ """
304
+ return _resnet(in_channels,
305
+ feature_scales, stride, "resnet18", BasicBlock, [2, 2, 2, 2], pretrained, progress, device, do_initial_max_pool, **kwargs
306
+ )
307
+
308
+
309
+ def resnet34(in_channels, feature_scales, stride=2, pretrained=False, progress=True, device="cpu", do_initial_max_pool=True, **kwargs):
310
+ """Constructs a ResNet-34 model.
311
+ Args:
312
+ pretrained (bool): If True, returns a model pre-trained on ImageNet
313
+ progress (bool): If True, displays a progress bar of the download to stderr
314
+ """
315
+ return _resnet(in_channels,
316
+ feature_scales, stride, "resnet34", BasicBlock, [3, 4, 6, 3], pretrained, progress, device, do_initial_max_pool, **kwargs
317
+ )
318
+
319
+
320
+ def resnet50(in_channels, feature_scales, stride=2, pretrained=False, progress=True, device="cpu", do_initial_max_pool=True, **kwargs):
321
+ """Constructs a ResNet-50 model.
322
+ Args:
323
+ pretrained (bool): If True, returns a model pre-trained on ImageNet
324
+ progress (bool): If True, displays a progress bar of the download to stderr
325
+ """
326
+ return _resnet(in_channels,
327
+ feature_scales, stride, "resnet50", Bottleneck, [3, 4, 6, 3], pretrained, progress, device, do_initial_max_pool, **kwargs
328
+ )
329
+
330
+
331
+ def resnet101(in_channels, feature_scales, stride=2, pretrained=False, progress=True, device="cpu", do_initial_max_pool=True, **kwargs):
332
+ """Constructs a ResNet-50 model.
333
+ Args:
334
+ pretrained (bool): If True, returns a model pre-trained on ImageNet
335
+ progress (bool): If True, displays a progress bar of the download to stderr
336
+ """
337
+ return _resnet(in_channels,
338
+ feature_scales, stride, "resnet101", Bottleneck, [3, 4, 23, 3], pretrained, progress, device, do_initial_max_pool, **kwargs
339
+ )
340
+
341
+
342
+ def resnet152(in_channels, feature_scales, stride=2, pretrained=False, progress=True, device="cpu", do_initial_max_pool=True, **kwargs):
343
+ """Constructs a ResNet-50 model.
344
+ Args:
345
+ pretrained (bool): If True, returns a model pre-trained on ImageNet
346
+ progress (bool): If True, displays a progress bar of the download to stderr
347
+ """
348
+ return _resnet(in_channels,
349
+ feature_scales, stride, "resnet152", Bottleneck, [3, 4, 36, 3], pretrained, progress, device, do_initial_max_pool, **kwargs
350
+ )
351
+
352
+ def prepare_resnet_backbone(backbone_type):
353
+
354
+ resnet_family = resnet18 # Default
355
+ if '34' in backbone_type: resnet_family = resnet34
356
+ if '50' in backbone_type: resnet_family = resnet50
357
+ if '101' in backbone_type: resnet_family = resnet101
358
+ if '152' in backbone_type: resnet_family = resnet152
359
+
360
+ # Determine which ResNet blocks to keep
361
+ block_num_str = backbone_type.split('-')[-1]
362
+ hyper_blocks_to_keep = list(range(1, int(block_num_str) + 1)) if block_num_str.isdigit() else [1, 2, 3, 4]
363
+
364
+ backbone = resnet_family(
365
+ 3,
366
+ hyper_blocks_to_keep,
367
+ stride=2,
368
+ pretrained=False,
369
+ progress=True,
370
+ device="cpu",
371
+ do_initial_max_pool=True,
372
+ )
373
+
374
+ return backbone
models/utils.py ADDED
@@ -0,0 +1,122 @@
1
+ import torch
2
+ import torch.nn.functional as F
3
+ import re
4
+ import os
5
+
6
+ def compute_decay(T, params, clamp_lims=(0, 15)):
7
+ """
8
+ This function computes exponential decays for learnable synchronisation
9
+ interactions between pairs of neurons.
10
+ """
11
+ assert len(clamp_lims) == 2, 'Clamp lims should be length 2'
12
+ assert type(clamp_lims) == tuple, 'Clamp lims should be tuple'
13
+
14
+ indices = torch.arange(T-1, -1, -1, device=params.device).reshape(T, 1).expand(T, params.shape[0])
15
+ out = torch.exp(-indices * torch.clamp(params, clamp_lims[0], clamp_lims[1]).unsqueeze(0))
16
+ return out
17
+
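+ # Worked example (hypothetical values, for illustration only): with T=3 and params=torch.zeros(2),
+ # every decay weight is exp(0)=1; for a clamped rate r>0 each column becomes
+ # [exp(-2r), exp(-r), exp(0)], so the most recent step (last row) always has weight 1 and
+ # older steps are exponentially down-weighted.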
18
+ def add_coord_dim(x, scaled=True):
19
+ """
20
+ Adds a final dimension to the tensor representing 2D coordinates.
21
+
22
+ Args:
23
+ x: A PyTorch tensor of shape (B, H, W).
24
+
25
+ Returns:
26
+ A PyTorch tensor of shape (B, H, W, 2) with the last dimension
27
+ representing the 2D coordinates within the HW dimensions.
28
+ """
29
+ B, H, W = x.shape
30
+ # Create coordinate grids
31
+ x_coords = torch.arange(W, device=x.device, dtype=x.dtype).repeat(H, 1) # Shape (H, W)
32
+ y_coords = torch.arange(H, device=x.device, dtype=x.dtype).unsqueeze(-1).repeat(1, W) # Shape (H, W)
33
+ if scaled:
34
+ x_coords /= (W-1)
35
+ y_coords /= (H-1)
36
+ # Stack coordinates and expand dimensions
37
+ coords = torch.stack((x_coords, y_coords), dim=-1) # Shape (H, W, 2)
38
+ coords = coords.unsqueeze(0) # Shape (1, 1, H, W, 2)
39
+ coords = coords.repeat(B, 1, 1, 1) # Shape (B, D, H, W, 2)
40
+ return coords
41
+
42
+ def compute_normalized_entropy(logits, reduction='mean'):
43
+ """
44
+ Calculates the normalized entropy of a PyTorch tensor of logits along the
45
+ final dimension.
46
+
47
+ Args:
48
+ logits: A PyTorch tensor of logits.
49
+
50
+ Returns:
51
+ A PyTorch tensor containing the normalized entropy values.
52
+ """
53
+
54
+ # Apply softmax to get probabilities
55
+ preds = F.softmax(logits, dim=-1)
56
+
57
+ # Calculate the log probabilities
58
+ log_preds = torch.log_softmax(logits, dim=-1)
59
+
60
+ # Calculate the entropy
61
+ entropy = -torch.sum(preds * log_preds, dim=-1)
62
+
63
+ # Calculate the maximum possible entropy
64
+ num_classes = preds.shape[-1]
65
+ max_entropy = torch.log(torch.tensor(num_classes, dtype=torch.float32))
66
+
67
+ # Normalize the entropy
68
+ normalized_entropy = entropy / max_entropy
69
+ if len(logits.shape)>2 and reduction == 'mean':
70
+ normalized_entropy = normalized_entropy.flatten(1).mean(-1)
71
+
72
+ return normalized_entropy
73
+
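+ # Worked example (hypothetical values, for illustration only): for logits over C=4 classes,
+ # a uniform prediction has entropy log(4) and normalized entropy 1.0, while a highly
+ # confident prediction (one logit much larger than the rest) has normalized entropy near 0.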
74
+ def reshape_predictions(predictions, prediction_reshaper):
75
+ B, T = predictions.size(0), predictions.size(-1)
76
+ new_shape = [B] + prediction_reshaper + [T]
77
+ reshaped_predictions = predictions.reshape(new_shape)
78
+ return reshaped_predictions
79
+
80
+ def get_all_log_dirs(root_dir):
81
+ folders = []
82
+ for dirpath, dirnames, filenames in os.walk(root_dir):
83
+ if any(f.endswith(".pt") for f in filenames):
84
+ folders.append(dirpath)
85
+ return folders
86
+
87
+ def get_latest_checkpoint(log_dir):
88
+ files = [f for f in os.listdir(log_dir) if re.match(r'checkpoint_\d+\.pt', f)]
89
+ return os.path.join(log_dir, max(files, key=lambda f: int(re.search(r'\d+', f).group()))) if files else None
90
+
91
+ def get_latest_checkpoint_file(filepath, limit=300000):
92
+ checkpoint_files = get_checkpoint_files(filepath)
93
+ checkpoint_files = [
94
+ f for f in checkpoint_files if int(re.search(r'checkpoint_(\d+)\.pt', f).group(1)) <= limit
95
+ ]
96
+ if not checkpoint_files:
97
+ return None
98
+ return checkpoint_files[-1]
99
+
100
+ def get_checkpoint_files(filepath):
101
+ regex = r'checkpoint_(\d+)\.pt'
102
+ files = [f for f in os.listdir(filepath) if re.match(regex, f)]
103
+ files = sorted(files, key=lambda f: int(re.search(regex, f).group(1)))
104
+ return [os.path.join(filepath, f) for f in files]
105
+
106
+ def load_checkpoint(checkpoint_path, device):
107
+ checkpoint = torch.load(checkpoint_path, map_location=device, weights_only=False)
108
+ return checkpoint
109
+
110
+ def get_model_args_from_checkpoint(checkpoint):
111
+ if "args" in checkpoint:
112
+ return(checkpoint["args"])
113
+ else:
114
+ raise ValueError("Checkpoint does not contain saved args.")
115
+
116
+ def get_accuracy_and_loss_from_checkpoint(checkpoint, device="cpu"):
117
+ training_iteration = checkpoint.get('training_iteration', 0)
118
+ train_losses = checkpoint.get('train_losses', [])
119
+ test_losses = checkpoint.get('test_losses', [])
120
+ train_accuracies = checkpoint.get('train_accuracies_most_certain', [])
121
+ test_accuracies = checkpoint.get('test_accuracies_most_certain', [])
122
+ return training_iteration, train_losses, test_losses, train_accuracies, test_accuracies
requirements.txt ADDED
@@ -0,0 +1,15 @@
+ numpy
+ torch
+ torchvision
+ matplotlib
+ seaborn
+ tqdm
+ opencv-python
+ imageio
+ scikit-learn
+ umap-learn
+ python-dotenv
+ gymnasium
+ minigrid
+ datasets
+ autoclip
tasks/image_classification/README.md ADDED
@@ -0,0 +1,29 @@
+ # Image classification
+ 
+ This folder contains code for training and analysing the ImageNet and CIFAR experiments.
+ 
+ ## Accessing and loading ImageNet
+ 
+ We use the [ILSVRC/imagenet-1k](https://huggingface.co/datasets/ILSVRC/imagenet-1k) dataset in our paper.
+ 
+ To get this to work for you, you will need to do the following:
+ 1. Log in to Hugging Face (create an account if needed) and agree to the terms and conditions of this dataset.
+ 2. Create a new access token.
+ 3. Install huggingface_hub on the target machine with ```pip install huggingface_hub```
+ 4. Run ```huggingface-cli login``` and use your token. This will authenticate you on the backend and allow the code to run.
+ 5. Run an ImageNet experiment. The dataset will be downloaded automatically on first use; a quick sanity-check sketch is shown below.
+ 
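+ The following is a minimal, optional sanity check (not part of the training code, which uses the wrapper in `data/custom_datasets.py`); it assumes the `huggingface_hub` and `datasets` packages from `requirements.txt` and only confirms that your token grants access to the gated dataset:
+ 
+ ```python
+ from huggingface_hub import whoami
+ from datasets import load_dataset
+ 
+ # Should print your username if `huggingface-cli login` succeeded.
+ print(whoami()["name"])
+ 
+ # Stream a single validation example to confirm access without downloading everything.
+ ds = load_dataset("ILSVRC/imagenet-1k", split="validation", streaming=True)
+ print(next(iter(ds))["label"])
+ ```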
+ 
+ ## Training
+ There are two training files: `train.py` and `train_distributed.py`. The training code uses mixed precision. For the settings in the paper, the following command was used for distributed training:
+ 
+ ```
+ torchrun --standalone --nnodes=1 --nproc_per_node=8 -m tasks.image_classification.train_distributed --d_model 4096 --d_input 1024 --synapse_depth 12 --heads 16 --n_synch_out 150 --n_synch_action 150 --neuron_select_type random --iterations 75 --memory_length 25 --deep_memory --memory_hidden_dims 64 --dropout 0.05 --no-do_normalisation --positional_embedding_type none --backbone_type resnet152-4 --batch_size 60 --batch_size_test 64 --lr 5e-4 --training_iterations 500001 --warmup_steps 10000 --use_scheduler --scheduler_type cosine --weight_decay 0.0 --log_dir logs-lambda/imagenet-distributed-4april/d=4096--i=1024--h=16--ns=150-random--iters=75x25--h=64--drop=0.05--pos=none--back=152x4--seed=42 --dataset imagenet --save_every 2000 --track_every 5000 --seed 42 --n_test_batches 50 --use_amp
+ ```
+ 
+ You can run the same setup on a single GPU with:
+ ```
+ python -m tasks.image_classification.train --d_model 4096 --d_input 1024 --synapse_depth 12 --heads 16 --n_synch_out 150 --n_synch_action 150 --neuron_select_type random --iterations 75 --memory_length 25 --deep_memory --memory_hidden_dims 64 --dropout 0.05 --no-do_normalisation --positional_embedding_type none --backbone_type resnet152-4 --batch_size 60 --batch_size_test 64 --lr 5e-4 --training_iterations 500001 --warmup_steps 10000 --use_scheduler --scheduler_type cosine --weight_decay 0.0 --log_dir logs-lambda/imagenet-distributed-4april/d=4096--i=1024--h=16--ns=150-random--iters=75x25--h=64--drop=0.05--pos=none--back=152x4--seed=42 --dataset imagenet --save_every 2000 --track_every 5000 --seed 42 --n_test_batches 50 --use_amp --device 0
+ ```
+ 
tasks/image_classification/analysis/README.md ADDED
@@ -0,0 +1,12 @@
+ # Analysis
+ 
+ This folder contains analysis code for image classification experiments. To build GIFs for ImageNet run (from the base directory):
+ 
+ ```
+ python -m tasks.image_classification.analysis.build_imagenet_viz
+ ```
+ 
+ To build the plots in the paper run:
+ ```
+ python -m tasks.image_classification.analysis.imagenet_evaluate_and_plot
+ ```
tasks/image_classification/analysis/run_imagenet_analysis.py ADDED
@@ -0,0 +1,972 @@
1
+ # --- Core Libraries ---
2
+ import torch
3
+ import numpy as np
4
+ import os
5
+ import argparse
6
+ from tqdm.auto import tqdm
7
+ import torch.nn.functional as F # Used for interpolate
8
+
9
+ # --- Plotting & Visualization ---
10
+ import matplotlib.pyplot as plt
11
+ import matplotlib as mpl
12
+ mpl.use('Agg')
13
+ import seaborn as sns
14
+ sns.set_style('darkgrid')
15
+ from matplotlib import patheffects
16
+ import seaborn as sns
17
+ import imageio
18
+ import cv2
19
+ from scipy.special import softmax
20
+ from tasks.image_classification.plotting import save_frames_to_mp4
21
+
22
+ # --- Data Handling & Model ---
23
+ from torchvision import transforms
24
+ from torchvision import datasets # Only used for CIFAR100 in debug mode
25
+ from scipy import ndimage # Used in find_island_centers
26
+ from data.custom_datasets import ImageNet
27
+ from models.ctm import ContinuousThoughtMachine
28
+ from tasks.image_classification.imagenet_classes import IMAGENET2012_CLASSES
29
+ from tasks.image_classification.plotting import plot_neural_dynamics
30
+
31
+ # --- Global Settings ---
32
+ np.seterr(divide='ignore')
33
+ mpl.use('Agg')
34
+ sns.set_style('darkgrid')
35
+
36
+ # --- Helper Functions ---
37
+
38
+ def find_island_centers(array_2d, threshold):
39
+ """
40
+ Finds the center of mass of each island (connected component > threshold)
41
+ in a 2D array, weighted by the array's values.
42
+ Returns list of (y, x) centers and list of areas.
43
+ """
44
+ binary_image = array_2d > threshold
45
+ labeled_image, num_labels = ndimage.label(binary_image)
46
+ centers = []
47
+ areas = []
48
+ # Calculate center of mass for each labeled island (label 0 is background)
49
+ for i in range(1, num_labels + 1):
50
+ island_mask = (labeled_image == i)
51
+ total_mass = np.sum(array_2d[island_mask])
52
+ if total_mass > 0:
53
+ # Get coordinates for this island
54
+ y_coords, x_coords = np.mgrid[:array_2d.shape[0], :array_2d.shape[1]]
55
+ # Calculate weighted average for center
56
+ x_center = np.average(x_coords[island_mask], weights=array_2d[island_mask])
57
+ y_center = np.average(y_coords[island_mask], weights=array_2d[island_mask])
58
+ centers.append((round(y_center, 4), round(x_center, 4)))
59
+ areas.append(np.sum(island_mask)) # Area is the count of pixels in the island
60
+ return centers, areas
61
+
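+ # Worked example (hypothetical input, for illustration only): a 4x4 array of zeros with a
+ # 2x2 patch of ones in its top-left corner and threshold=0.5 yields centers=[(0.5, 0.5)]
+ # and areas=[4].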
62
+ def parse_args():
63
+ """Parses command-line arguments."""
64
+ # Note: Original had two ArgumentParser instances, using the second one.
65
+ parser = argparse.ArgumentParser(description="Visualize Continuous Thought Machine Attention")
66
+ parser.add_argument('--actions', type=str, nargs='+', default=['videos'], choices=['plots', 'videos', 'demo'], help="Actions to take. Plots=results plots; videos=gifs/mp4s to watch attention; demo: last frame of internal ticks")
67
+ parser.add_argument('--device', type=int, nargs='+', default=[-1], help="GPU device index or -1 for CPU")
68
+
69
+ parser.add_argument('--checkpoint', type=str, default='checkpoints/imagenet/ctm_clean.pt', help="Path to CTM checkpoint")
70
+ parser.add_argument('--output_dir', type=str, default='tasks/image_classification/analysis/outputs/imagenet_viz', help="Directory for visualization outputs")
71
+ parser.add_argument('--debug', action=argparse.BooleanOptionalAction, default=True, help='Debug mode: use CIFAR100 instead of ImageNet for debugging.')
72
+ parser.add_argument('--plot_every', type=int, default=10, help="How often to plot.")
73
+
74
+ parser.add_argument('--inference_iterations', type=int, default=50, help="Iterations to use during inference.")
75
+ parser.add_argument('--data_indices', type=int, nargs='+', default=[], help="Use specific indices in validation data for demos, otherwise random.")
76
+ parser.add_argument('--N_to_viz', type=int, default=5, help="When not supplying data_indices.")
77
+
78
+ return parser.parse_args()
79
+
80
+
81
+ # --- Main Execution Block ---
82
+ if __name__=='__main__':
83
+
84
+ # --- Setup ---
85
+ args = parse_args()
86
+ if args.device[0] != -1 and torch.cuda.is_available():
87
+ device = f'cuda:{args.device[0]}'
88
+ else:
89
+ device = 'cpu'
90
+ print(f"Using device: {device}")
91
+
92
+ # --- Load Checkpoint & Model ---
93
+ print(f"Loading checkpoint: {args.checkpoint}")
94
+ checkpoint = torch.load(args.checkpoint, map_location=device, weights_only=False)  # weights_only=False is needed to load the saved args
95
+ model_args = checkpoint['args']
96
+
97
+ # Handle legacy arguments from checkpoint if necessary
98
+ if not hasattr(model_args, 'backbone_type') and hasattr(model_args, 'resnet_type'):
99
+ model_args.backbone_type = f'{model_args.resnet_type}-{getattr(model_args, "resnet_feature_scales", [4])[-1]}'
100
+ if not hasattr(model_args, 'neuron_select_type'):
101
+ model_args.neuron_select_type = 'first-last'
102
+
103
+
104
+ # Instantiate Model based on checkpoint args
105
+ print("Instantiating CTM model...")
106
+ model = ContinuousThoughtMachine(
107
+ iterations=model_args.iterations,
108
+ d_model=model_args.d_model,
109
+ d_input=model_args.d_input,
110
+ heads=model_args.heads,
111
+ n_synch_out=model_args.n_synch_out,
112
+ n_synch_action=model_args.n_synch_action,
113
+ synapse_depth=model_args.synapse_depth,
114
+ memory_length=model_args.memory_length,
115
+ deep_nlms=model_args.deep_memory,
116
+ memory_hidden_dims=model_args.memory_hidden_dims,
117
+ do_layernorm_nlm=model_args.do_normalisation,
118
+ backbone_type=model_args.backbone_type,
119
+ positional_embedding_type=model_args.positional_embedding_type,
120
+ out_dims=model_args.out_dims,
121
+ prediction_reshaper=[-1], # Kept fixed value from original code
122
+ dropout=0, # No dropout for eval
123
+ neuron_select_type=model_args.neuron_select_type,
124
+ n_random_pairing_self=model_args.n_random_pairing_self,
125
+ ).to(device)
126
+
127
+ # Load weights into model
128
+ load_result = model.load_state_dict(checkpoint['model_state_dict'], strict=False)
129
+ print(f" Loaded state_dict. Missing: {load_result.missing_keys}, Unexpected: {load_result.unexpected_keys}")
130
+ model.eval() # Set model to evaluation mode
131
+
132
+ # --- Prepare Dataset ---
133
+ if args.debug:
134
+ print("Debug mode: Using CIFAR100")
135
+ # CIFAR100 specific normalization constants
136
+ dataset_mean = [0.5070751592371341, 0.48654887331495067, 0.4409178433670344]
137
+ dataset_std = [0.2673342858792403, 0.2564384629170882, 0.27615047132568393]
138
+ img_size = 256 # Resize CIFAR images for consistency
139
+ transform = transforms.Compose([
140
+ transforms.Resize(img_size),
141
+ transforms.ToTensor(),
142
+ transforms.Normalize(mean=dataset_mean, std=dataset_std), # Normalize
143
+ ])
144
+ validation_dataset = datasets.CIFAR100('data/', train=False, transform=transform, download=True)
145
+ validation_dataset_centercrop = datasets.CIFAR100('data/', train=True, transform=transform, download=True)
146
+ else:
147
+ print("Using ImageNet")
148
+ # ImageNet specific normalization constants
149
+ dataset_mean = [0.485, 0.456, 0.406]
150
+ dataset_std = [0.229, 0.224, 0.225]
151
+ img_size = 256 # Resize ImageNet images
152
+ # Note: Original comment mentioned no CenterCrop, this transform reflects that.
153
+ transform = transforms.Compose([
154
+ transforms.Resize(img_size),
155
+ transforms.ToTensor(),
156
+ transforms.Normalize(mean=dataset_mean, std=dataset_std) # Normalize
157
+ ])
158
+ validation_dataset = ImageNet(which_split='validation', transform=transform)
159
+ validation_dataset_centercrop = ImageNet(which_split='train', transform=transforms.Compose([
160
+ transforms.Resize(img_size),
161
+ transforms.RandomCrop(img_size),
162
+ transforms.ToTensor(),
163
+ transforms.Normalize(mean=dataset_mean, std=dataset_std) # Normalize
164
+ ]))
165
+ class_labels = list(IMAGENET2012_CLASSES.values()) # Load actual class names
166
+
167
+ os.makedirs(f'{args.output_dir}', exist_ok=True)
168
+
169
+ interp_mode = 'nearest'
170
+ cmap_calib = sns.color_palette('viridis', as_cmap=True)
171
+ loader = torch.utils.data.DataLoader(validation_dataset, batch_size=1, shuffle=False, num_workers=0, drop_last=False)
172
+ loader_crop = torch.utils.data.DataLoader(validation_dataset_centercrop, batch_size=64, shuffle=True, num_workers=0, drop_last=True)
173
+
174
+ model.eval()
175
+
176
+ figscale = 0.85
177
+ topk = 5
178
+ mean_certainties_correct, mean_certainties_incorrect = [],[]
179
+ tracked_certainties = []
180
+ tracked_targets = []
181
+ tracked_predictions = []
182
+
183
+ if model.iterations != args.inference_iterations:
184
+ print('WARNING: you are setting inference iterations to a value not used during training!')
185
+
186
+ model.iterations = args.inference_iterations
187
+
188
+ if 'plots' in args.actions:
189
+
190
+ with torch.inference_mode(): # Disable gradient calculations
191
+ with tqdm(total=len(loader), initial=0, leave=False, position=0, dynamic_ncols=True) as pbar:
192
+ imgi = 0
193
+ for bi, (inputs, targets) in enumerate(loader):
194
+ inputs = inputs.to(device)
195
+ targets = targets.to(device)
196
+ if bi==0:
197
+ dynamics_inputs, _ = next(iter(loader_crop)) # Use this because of batching
198
+ _, _, _, _, post_activations_viz, _ = model(inputs, track=True)
199
+ plot_neural_dynamics(post_activations_viz, 15*10, args.output_dir, axis_snap=True, N_per_row=15)
200
+ predictions, certainties, synchronisation = model(inputs)
201
+
202
+ tracked_predictions.append(predictions.detach().cpu().numpy())
203
+ tracked_targets.append(targets.detach().cpu().numpy())
204
+ tracked_certainties.append(certainties.detach().cpu().numpy())
205
+
206
+
207
+
208
+
209
+ pbar.set_description(f'Processing base image of size {inputs.shape}')
210
+ pbar.update(1)
211
+ if ((bi % args.plot_every == 0) or bi == len(loader)-1) and bi!=0: #
212
+
213
+ concatenated_certainties = np.concatenate(tracked_certainties, axis=0)
214
+ concatenated_targets = np.concatenate(tracked_targets, axis=0)
215
+ concatenated_predictions = np.concatenate(tracked_predictions, axis=0)
216
+ concatenated_predictions_argsorted = np.argsort(concatenated_predictions, 1)[:,::-1]
217
+
218
+
219
+
220
+ for topk in [1, 5]:
221
+ concatenated_predictions_argsorted_topk = concatenated_predictions_argsorted[:,:topk]
222
+
223
+ accs_instant, accs_avg, accs_certain = [], [], []
224
+ accs_avg_logits, accs_weighted_logits = [],[]
225
+ with tqdm(total=(concatenated_predictions.shape[-1]), initial=0, leave=False, position=1, dynamic_ncols=True) as pbarinner:
226
+ pbarinner.set_description('Acc types')
227
+ for stepi in np.arange(concatenated_predictions.shape[-1]):
228
+ pred_avg = softmax(concatenated_predictions, 1)[:,:,:stepi+1].mean(-1).argsort(1)[:,-topk:]
229
+ pred_instant = concatenated_predictions_argsorted_topk[:,:,stepi]
230
+ pred_certain = concatenated_predictions_argsorted_topk[np.arange(concatenated_predictions.shape[0]),:, concatenated_certainties[:,1,:stepi+1].argmax(1)]
231
+ pred_avg_logits = concatenated_predictions[:,:,:stepi+1].mean(-1).argsort(1)[:,-topk:]
232
+ pred_weighted_logits = (concatenated_predictions[:,:,:stepi+1] * concatenated_certainties[:,1:,:stepi+1]).sum(-1).argsort(1)[:, -topk:]
233
+ pbarinner.update(1)
234
+ accs_instant.append(np.any(pred_instant==concatenated_targets[...,np.newaxis], -1).mean())
235
+ accs_avg.append(np.any(pred_avg==concatenated_targets[...,np.newaxis], -1).mean())
236
+ accs_avg_logits.append(np.any(pred_avg_logits==concatenated_targets[...,np.newaxis], -1).mean())
237
+ accs_weighted_logits.append(np.any(pred_weighted_logits==concatenated_targets[...,np.newaxis], -1).mean())
238
+ accs_certain.append(np.any(pred_certain==concatenated_targets[...,np.newaxis], -1).mean())
239
+ fig = plt.figure(figsize=(10*figscale, 4*figscale))
240
+ ax = fig.add_subplot(111)
241
+ cp = sns.color_palette("bright")
242
+ ax.plot(np.arange(concatenated_predictions.shape[-1])+1, 100*np.array(accs_instant), linestyle='-', color=cp[0], label='Instant')
243
+ # ax.plot(np.arange(concatenated_predictions.shape[-1])+1, 100*np.array(accs_avg), linestyle='--', color=cp[1], label='Based on average probability up to this step')
244
+ ax.plot(np.arange(concatenated_predictions.shape[-1])+1, 100*np.array(accs_certain), linestyle=':', color=cp[2], label='Most certain')
245
+ ax.plot(np.arange(concatenated_predictions.shape[-1])+1, 100*np.array(accs_avg_logits), linestyle='-.', color=cp[3], label='Average logits')
246
+ ax.plot(np.arange(concatenated_predictions.shape[-1])+1, 100*np.array(accs_weighted_logits), linestyle='--', color=cp[4], label='Logits weighted by certainty')
247
+ ax.set_xlim([0, concatenated_predictions.shape[-1]+1])
248
+ ax.set_ylim([75, 92])
249
+ ax.set_xlabel('Internal ticks')
250
+ ax.set_ylabel(f'Top-k={topk} accuracy')
251
+ ax.legend(loc='lower right')
252
+ fig.tight_layout(pad=0.1)
253
+ fig.savefig(f'{args.output_dir}/accuracy_types_{topk}.png', dpi=200)
254
+ fig.savefig(f'{args.output_dir}/accuracy_types_{topk}.pdf', dpi=200)
255
+ plt.close(fig)
256
+ print(f'k={topk}. Accuracy most certain at last internal tick={100*np.array(accs_certain)[-1]:0.4f}') # Using certainty based approach
257
+
258
+
259
+ indices_over_80 = []
260
+ classes_80 = {}
261
+ corrects_80 = {}
262
+
263
+ topk = 5
264
+ concatenated_predictions_argsorted_topk = concatenated_predictions_argsorted[:,:topk]
265
+ for certainty_threshold in [0.5, 0.8, 0.9]:
266
+ # certainty_threshold = 0.6
267
+ percentage_corrects = []
268
+ percentage_incorrects = []
269
+ with tqdm(total=(concatenated_predictions.shape[-1]), initial=0, leave=False, position=1, dynamic_ncols=True) as pbarinner:
270
+ pbarinner.set_description(f'Certainty threshold={certainty_threshold}')
271
+ for stepi in np.arange(concatenated_predictions.shape[-1]):
272
+ certainty_here = concatenated_certainties[:,1,stepi]
273
+ certainty_mask = certainty_here>=certainty_threshold
274
+ predictions_here = concatenated_predictions_argsorted_topk[:,:,stepi]
275
+ is_correct_here = np.any(predictions_here==concatenated_targets[...,np.newaxis], axis=-1)
276
+ percentage_corrects.append(is_correct_here[certainty_mask].sum()/predictions_here.shape[0])
277
+ percentage_incorrects.append((~is_correct_here)[certainty_mask].sum()/predictions_here.shape[0])
278
+
279
+ if certainty_threshold==0.8:
280
+ indices_certain = np.where(certainty_mask)[0]
281
+ for index in indices_certain:
282
+ if index not in indices_over_80:
283
+ indices_over_80.append(index)
284
+ if concatenated_targets[index] not in classes_80:
285
+ classes_80[concatenated_targets[index]] = [stepi]
286
+ corrects_80[concatenated_targets[index]] = [is_correct_here[index]]
287
+ else:
288
+ classes_80[concatenated_targets[index]] = classes_80[concatenated_targets[index]]+[stepi]
289
+ corrects_80[concatenated_targets[index]] = corrects_80[concatenated_targets[index]]+[is_correct_here[index]]
290
+
291
+
292
+ pbarinner.update(1)
293
+ fig = plt.figure(figsize=(6.5*figscale, 4*figscale))
294
+ ax = fig.add_subplot(111)
295
+ ax.bar(np.arange(concatenated_predictions.shape[-1])+1,
296
+ percentage_corrects,
297
+ color='forestgreen',
298
+ hatch='OO',
299
+ width=0.9,
300
+ label='Positive',
301
+ alpha=0.9,
302
+ linewidth=1.0*figscale)
303
+
304
+ ax.bar(np.arange(concatenated_predictions.shape[-1])+1,
305
+ percentage_incorrects,
306
+ bottom=percentage_corrects,
307
+ color='crimson',
308
+ hatch='xx',
309
+ width=0.9,
310
+ label='Negative',
311
+ alpha=0.9,
312
+ linewidth=1.0*figscale)
313
+ ax.set_xlim(-1, concatenated_predictions.shape[-1]+1)
314
+ ax.set_xlabel('Internal tick')
315
+ ax.set_ylabel('% of data')
316
+ ax.legend(loc='lower right')
317
+
318
+
319
+ fig.tight_layout(pad=0.1)
320
+ fig.savefig(f'{args.output_dir}/steps_versus_correct_{certainty_threshold}.png', dpi=200)
321
+ fig.savefig(f'{args.output_dir}/steps_versus_correct_{certainty_threshold}.pdf', dpi=200)
322
+ plt.close(fig)
323
+
324
+
325
+ class_list = list(classes_80.keys())
326
+ mean_steps = [np.mean(classes_80[cls]) for cls in class_list]
327
+ std_steps = [np.std(classes_80[cls]) for cls in class_list]
328
+
329
+
330
+ # Following code plots the class distribution over internal ticks
331
+ indices_to_show = np.arange(1000)
332
+
333
+ colours = cmap_diverse = plt.get_cmap('rainbow')(np.linspace(0, 1, 1000))
334
+ # np.random.shuffle(colours)
335
+ bottom = np.zeros(concatenated_predictions.shape[-1])
336
+
337
+ fig = plt.figure(figsize=(7*figscale, 4*figscale))
338
+ ax = fig.add_subplot(111)
339
+ for iii, idx in enumerate(indices_to_show):
340
+ if idx in classes_80:
341
+ steps = classes_80[idx]
342
+ colour = colours[iii]
343
+ vs, cts = np.unique(steps, return_counts=True)
344
+
345
+ bar = np.zeros(concatenated_predictions.shape[-1])
346
+ bar[vs] = cts
347
+ ax.bar(np.arange(concatenated_predictions.shape[-1])+1, bar, bottom=bottom, color=colour, width=1, edgecolor='none')
348
+ bottom += bar
349
+ ax.set_xlabel('Internal ticks')
350
+ ax.set_ylabel('Counts over 0.8 certainty')
351
+ fig.tight_layout(pad=0.1)
352
+ fig.savefig(f'{args.output_dir}/class_counts.png', dpi=200)
353
+ fig.savefig(f'{args.output_dir}/class_counts.pdf', dpi=200)
354
+ plt.close(fig)
355
+
356
+
357
+
358
+
359
+
360
+ # The following code plots calibration
361
+ probability_space = np.linspace(0, 1, 10)
362
+ fig = plt.figure(figsize=(6*figscale, 4*figscale))
363
+ ax = fig.add_subplot(111)
364
+
365
+
366
+ color_linspace = np.linspace(0, 1, concatenated_predictions.shape[-1])
367
+ with tqdm(total=(concatenated_predictions.shape[-1]), initial=0, leave=False, position=1, dynamic_ncols=True) as pbarinner:
368
+ pbarinner.set_description(f'Calibration')
369
+ for stepi in np.arange(concatenated_predictions.shape[-1]):
370
+ color = cmap_calib(color_linspace[stepi])
371
+ pred = concatenated_predictions[:,:,stepi].argmax(1)
372
+ is_correct = pred == concatenated_targets # BxT
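+ # Confidence is taken as the softmax probability of the class predicted at this tick, averaged over all ticks up to and including stepi.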
373
+ probabilities = softmax(concatenated_predictions[:,:,:stepi+1], axis=1)[np.arange(concatenated_predictions.shape[0]), pred].mean(-1)
374
+ probability_space = np.linspace(0, 1, 10)
375
+ accuracies_per_bin = []
376
+ bin_centers = []
377
+ for pi in range(len(probability_space)-1):
378
+ bin_low = probability_space[pi]
379
+ bin_high = probability_space[pi+1]
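+ # The final bin is closed on the right so that probabilities of exactly 1.0 are counted.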
380
+ mask = ((probabilities >= bin_low) & (probabilities < bin_high)) if pi != len(probability_space)-2 else ((probabilities >= bin_low) & (probabilities <= bin_high))
381
+ accuracies_per_bin.append(is_correct[mask].mean())
382
+ bin_centers.append(probabilities[mask].mean())
383
+
384
+
385
+ if stepi==concatenated_predictions.shape[-1]-1:
386
+ ax.plot(bin_centers, accuracies_per_bin, linestyle='-', marker='.', color='#4050f7', alpha=1, label='After all ticks')
387
+ else: ax.plot(bin_centers, accuracies_per_bin, linestyle='-', marker='.', color=color, alpha=0.65)
388
+ pbarinner.update(1)
389
+ ax.plot(probability_space, np.linspace(0, 1, len(probability_space)), 'k--')
390
+
391
+ ax.legend(loc='upper left')
392
+ ax.set_xlim([-0.01, 1.01])
393
+ ax.set_ylim([-0.01, 1.01])
394
+
395
+ sm = plt.cm.ScalarMappable(cmap=cmap_calib, norm=plt.Normalize(vmin=0, vmax=concatenated_predictions.shape[-1] - 1))
396
+ sm.set_array([]) # Empty array for colormap
397
+ cbar = fig.colorbar(sm, ax=ax, orientation='vertical', pad=0.02)
398
+ cbar.set_label('Internal ticks')
399
+
400
+ ax.set_xlabel('Mean predicted probabilities')
401
+ ax.set_ylabel('Ratio of positives')
402
+ fig.tight_layout(pad=0.1)
403
+ fig.savefig(f'{args.output_dir}/imagenet_calibration.png', dpi=200)
404
+ fig.savefig(f'{args.output_dir}/imagenet_calibration.pdf', dpi=200)
405
+ plt.close(fig)
406
+ if 'videos' in args.actions:
407
+ if not args.data_indices: # If list is empty
408
+ n_samples = len(validation_dataset)
409
+ num_to_sample = min(args.N_to_viz, n_samples)
410
+ replace = n_samples < num_to_sample
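+ # num_to_sample is capped at n_samples above, so this always samples without replacement.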
411
+ data_indices = np.random.choice(np.arange(n_samples), size=num_to_sample, replace=replace)
412
+ print(f"Selected random indices: {data_indices}")
413
+ else:
414
+ data_indices = args.data_indices
415
+ print(f"Using specified indices: {data_indices}")
416
+
417
+
418
+ for di in data_indices:
419
+ print(f'\nBuilding viz for dataset index {di}.')
420
+
421
+ # --- Get Data & Run Inference ---
422
+ # inputs_norm is already normalized by the transform
423
+ inputs, ground_truth_target = validation_dataset.__getitem__(int(di))
424
+
425
+ # Add batch dimension and send to device
426
+ inputs = inputs.to(device).unsqueeze(0)
427
+
428
+ # Run model inference
429
+ predictions, certainties, synchronisation, pre_activations, post_activations, attention_tracking = model(inputs, track=True)
430
+ # predictions: (B, Classes, Steps); attention_tracking is reshaped below to (Steps, Heads, H_feat, W_feat), assuming B=1
431
+ n_steps = predictions.size(-1)
432
+
433
+ # --- Reshape Attention ---
434
+ # Infer feature map size from model internals (assuming B=1)
435
+ h_feat, w_feat = model.kv_features.shape[-2:]
436
+
437
+ n_heads = attention_tracking.shape[2]
438
+ # Reshape to (Steps, Heads, H_feat, W_feat) assuming B=1
439
+ attention_tracking = attention_tracking.reshape(n_steps, n_heads, h_feat, w_feat)
440
+
441
+ # --- Setup for Plotting ---
442
+ step_linspace = np.linspace(0, 1, n_steps) # For step colors
443
+ # Define color maps
444
+ cmap_spectral = sns.color_palette("Spectral", as_cmap=True)
445
+ cmap_attention = sns.color_palette('viridis', as_cmap=True)
446
+
447
+ # Create output directory for this index
448
+ index_output_dir = os.path.join(args.output_dir, str(di))
449
+ os.makedirs(index_output_dir, exist_ok=True)
450
+
451
+ frames = [] # Store frames for GIF
452
+ head_routes = {h: [] for h in range(n_heads)} # Store (y,x) path points per head
453
+ head_routes[-1] = []
454
+ route_colours_step = [] # Store colors for each step's path segments
455
+
456
+ # --- Loop Through Each Step ---
457
+ for step_i in range(n_steps):
458
+
459
+ # --- Prepare Image for Display ---
460
+ # Denormalize the input tensor for visualization
461
+ data_img_tensor = inputs[0].cpu() # Get first item in batch, move to CPU
462
+ mean_tensor = torch.tensor(dataset_mean).view(3, 1, 1)
463
+ std_tensor = torch.tensor(dataset_std).view(3, 1, 1)
464
+ data_img_denorm = data_img_tensor * std_tensor + mean_tensor
465
+ # Permute to (H, W, C) and convert to numpy, clip to [0, 1]
466
+ data_img_np = data_img_denorm.permute(1, 2, 0).detach().numpy()
467
+ data_img_np = np.clip(data_img_np, 0, 1)
468
+ img_h, img_w = data_img_np.shape[:2]
469
+
470
+ # --- Process Attention & Certainty ---
471
+ # Average attention over last few steps (from original code)
472
+ start_step = max(0, step_i - 5)
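+ # i.e. a sliding window over the current tick and (up to) the 5 preceding ticks.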
473
+ attention_now = attention_tracking[start_step : step_i + 1].mean(0) # Avg over steps -> (Heads, H_feat, W_feat)
474
+ # Get certainties up to current step
475
+ certainties_now = certainties[0, 1, :step_i+1].detach().cpu().numpy() # Assuming index 1 holds relevant certainty
476
+
477
+ # --- Calculate Attention Paths (using bilinear interp) ---
478
+ # Interpolate attention to image size using bilinear for center finding
479
+ attention_interp_bilinear = F.interpolate(
480
+ torch.from_numpy(attention_now).unsqueeze(0).float(), # Add batch dim, ensure float
481
+ size=(img_h, img_w),
482
+ mode=interp_mode,
483
+ # align_corners=False
484
+ ).squeeze(0) # Remove batch dim -> (Heads, H, W)
485
+
486
+ # Normalize the mean (across heads) attention map to [0, 1]
488
+ attn_mean = attention_interp_bilinear.mean(0)
489
+ attn_mean_min = attn_mean.min()
490
+ attn_mean_max = attn_mean.max()
491
+ attn_mean = (attn_mean - attn_mean_min) / (attn_mean_max - attn_mean_min)
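+ # find_island_centers thresholds the normalised map (here at 0.7) and returns the centroids and areas of the
+ # resulting connected regions; the centroid of the largest region becomes this tick's attention-path point.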
492
+ centers, areas = find_island_centers(attn_mean.detach().cpu().numpy(), threshold=0.7)
493
+
494
+ if centers: # If islands found
495
+ largest_island_idx = np.argmax(areas)
496
+ current_center = centers[largest_island_idx] # (y, x)
497
+ head_routes[-1].append(current_center)
498
+ elif head_routes[-1]: # If no center now, repeat last known center if history exists
499
+ head_routes[-1].append(head_routes[-1][-1])
500
+
501
+
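+ # Normalize each head's attention map to [0, 1]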
502
+ attn_min = attention_interp_bilinear.view(n_heads, -1).min(dim=-1, keepdim=True)[0].unsqueeze(-1)
503
+ attn_max = attention_interp_bilinear.view(n_heads, -1).max(dim=-1, keepdim=True)[0].unsqueeze(-1)
504
+ attention_interp_bilinear = (attention_interp_bilinear - attn_min) / (attn_max - attn_min + 1e-6)
505
+
506
+ # Store step color
507
+ current_colour = list(cmap_spectral(step_linspace[step_i]))
508
+ route_colours_step.append(current_colour)
509
+
510
+ # Find island center for each head
511
+ for head_i in range(n_heads):
512
+ attn_head_np = attention_interp_bilinear[head_i].detach().cpu().numpy()
513
+ # Keep threshold=0.7 based on original call
514
+ centers, areas = find_island_centers(attn_head_np, threshold=0.7)
515
+
516
+ if centers: # If islands found
517
+ largest_island_idx = np.argmax(areas)
518
+ current_center = centers[largest_island_idx] # (y, x)
519
+ head_routes[head_i].append(current_center)
520
+ elif head_routes[head_i]: # If no center now, repeat last known center if history exists
521
+ head_routes[head_i].append(head_routes[head_i][-1])
522
+
523
+
524
+
525
+ # --- Plotting Setup ---
526
+ mosaic = [['head_mean', 'head_mean', 'head_mean', 'head_mean', 'head_mean_overlay', 'head_mean_overlay', 'head_mean_overlay', 'head_mean_overlay'],
527
+ ['head_mean', 'head_mean', 'head_mean', 'head_mean', 'head_mean_overlay', 'head_mean_overlay', 'head_mean_overlay', 'head_mean_overlay'],
528
+ ['head_mean', 'head_mean', 'head_mean', 'head_mean', 'head_mean_overlay', 'head_mean_overlay', 'head_mean_overlay', 'head_mean_overlay'],
529
+ ['head_mean', 'head_mean', 'head_mean', 'head_mean', 'head_mean_overlay', 'head_mean_overlay', 'head_mean_overlay', 'head_mean_overlay'],
530
+ ['head_0', 'head_0_overlay', 'head_1', 'head_1_overlay', 'head_2', 'head_2_overlay', 'head_3', 'head_3_overlay'],
531
+ ['head_4', 'head_4_overlay', 'head_5', 'head_5_overlay','head_6', 'head_6_overlay', 'head_7', 'head_7_overlay'],
532
+ ['head_8', 'head_8_overlay', 'head_9', 'head_9_overlay','head_10', 'head_10_overlay', 'head_11', 'head_11_overlay'],
533
+ ['head_12', 'head_12_overlay', 'head_13', 'head_13_overlay','head_14', 'head_14_overlay', 'head_15', 'head_15_overlay'],
534
+ ['probabilities', 'probabilities','probabilities', 'probabilities', 'certainty', 'certainty', 'certainty', 'certainty'],
535
+ ]
536
+
537
+ img_aspect = data_img_np.shape[0] / data_img_np.shape[1]
538
+ aspect_ratio = (8 * figscale, 9 * figscale * img_aspect) # W, H
539
+ fig, axes = plt.subplot_mosaic(mosaic, figsize=aspect_ratio)
540
+
541
+ for ax in axes.values():
542
+ ax.axis('off')
543
+
544
+ # --- Plot Certainty ---
545
+ ax_cert = axes['certainty']
546
+ ax_cert.plot(np.arange(len(certainties_now)), certainties_now, 'k-', linewidth=figscale*1)
547
+ # Add background color based on prediction correctness at each step
548
+ for ii in range(len(certainties_now)):
549
+ is_correct = predictions[0, :, ii].argmax(-1).item() == ground_truth_target # .item() for scalar tensor
550
+ facecolor = 'limegreen' if is_correct else 'orchid'
551
+ ax_cert.axvspan(ii, ii + 1, facecolor=facecolor, edgecolor=None, lw=0, alpha=0.3)
552
+ # Mark the last point
553
+ ax_cert.plot(len(certainties_now)-1, certainties_now[-1], 'k.', markersize=figscale*4)
554
+ ax_cert.axis('off')
555
+ ax_cert.set_ylim([0.05, 1.05])
556
+ ax_cert.set_xlim([0, n_steps]) # Use n_steps for consistent x-axis limit
557
+
558
+ # --- Plot Probabilities ---
559
+ ax_prob = axes['probabilities']
560
+ # Get probabilities for the current step
561
+ ps = torch.softmax(predictions[0, :, step_i], -1).detach().cpu()
562
+ k = 15 # Top k predictions
563
+ topk_probs, topk_indices = torch.topk(ps, k, dim=0, largest=True)
564
+ topk_indices = topk_indices.numpy()
565
+ topk_probs = topk_probs.numpy()
566
+
567
+ top_classes = np.array(class_labels)[topk_indices]
568
+ true_class_idx = ground_truth_target # Ground truth index
569
+
570
+ # Determine bar colors (green if correct, blue otherwise - consistent with original)
571
+ colours = ['g' if idx == true_class_idx else 'b' for idx in topk_indices]
572
+
573
+ # Plot horizontal bars (inverted range for top-down display)
574
+ ax_prob.barh(np.arange(k)[::-1], topk_probs, color=colours, alpha=1) # Use barh and inverted range
575
+ ax_prob.set_xlim([0, 1])
576
+ ax_prob.axis('off')
577
+
578
+ # Add text labels for top classes
579
+ for i, name_idx in enumerate(topk_indices):
580
+ name = class_labels[name_idx] # Get name from index
581
+ is_correct = name_idx == true_class_idx
582
+ fg_color = 'darkgreen' if is_correct else 'crimson' # Text colors from original
583
+ text_str = f'{name[:40]}' # Truncate long names
584
+ # Position text on the left side of the horizontal bars
585
+ ax_prob.text(
586
+ 0.01, # Small offset from left edge
587
+ k - 1 - i, # Y-position corresponding to the bar
588
+ text_str,
589
+ #transform=ax_prob.transAxes, # Use data coordinates for Y
590
+ verticalalignment='center',
591
+ horizontalalignment='left',
592
+ fontsize=8,
593
+ color=fg_color,
594
+ alpha=0.9, # Slightly more visible than 0.5
595
+ path_effects=[
596
+ patheffects.Stroke(linewidth=2, foreground='white'), # Adjusted stroke
597
+ patheffects.Normal()
598
+ ])
599
+
600
+
601
+ # --- Plot Attention Heads & Overlays ---
602
+ # Re-interpolate attention at image resolution for plotting (same interp_mode as the path-finding step)
603
+ attention_interp_plot = F.interpolate(
604
+ torch.from_numpy(attention_now).unsqueeze(0).float(),
605
+ size=(img_h, img_w),
606
+ mode=interp_mode,
607
+ ).squeeze(0)
608
+
609
+ attn_mean = attention_interp_plot.mean(0)
610
+ attn_mean_min = attn_mean.min()
611
+ attn_mean_max = attn_mean.max()
612
+ attn_mean = (attn_mean - attn_mean_min) / (attn_mean_max - attn_mean_min)
613
+
614
+
615
+ # Normalize each head's map to [0, 1]
616
+ attn_min_plot = attention_interp_plot.view(n_heads, -1).min(dim=-1, keepdim=True)[0].unsqueeze(-1)
617
+ attn_max_plot = attention_interp_plot.view(n_heads, -1).max(dim=-1, keepdim=True)[0].unsqueeze(-1)
618
+ attention_interp_plot = (attention_interp_plot - attn_min_plot) / (attn_max_plot - attn_min_plot + 1e-6)
619
+ attention_interp_plot_np = attention_interp_plot.detach().cpu().numpy()
620
+
621
+
622
+
623
+
624
+
625
+
626
+ for head_i in list(range(n_heads)) + [-1]:
627
+ axname = f'head_{head_i}' if head_i != -1 else 'head_mean'
628
+ if axname not in axes: continue # Skip if mosaic doesn't have this head
629
+
630
+ ax = axes[axname]
631
+ ax_overlay = axes[f'{axname}_overlay']
632
+
633
+ # Plot attention heatmap
634
+ this_attn = attention_interp_plot_np[head_i] if head_i != -1 else attn_mean
635
+ img_to_plot = cmap_attention(this_attn)
636
+ ax.imshow(img_to_plot)
637
+ ax.axis('off')
638
+
639
+ # Plot overlay: image + paths
640
+ these_route_steps = head_routes[head_i]
641
+ arrow_scale = 1.5 if head_i != -1 else 3
642
+
643
+ if these_route_steps: # Only plot if path exists
644
+ # Separate y and x coordinates
645
+ y_coords, x_coords = zip(*these_route_steps)
646
+ y_coords = np.array(y_coords)
647
+ x_coords = np.array(x_coords)
648
+
649
+ # Flip y-coordinates for correct plotting (imshow origin is top-left)
650
+ # NOTE: Original flip seemed complex, simplifying to standard flip
651
+ y_coords_flipped = img_h - 1 - y_coords
652
+
653
+ # Show original image flipped vertically to match coordinate system
654
+ ax_overlay.imshow(np.flipud(data_img_np), origin='lower')
655
+
656
+ # Draw arrows for path segments
657
+ # Arrow size scaling from original
658
+ for i in range(len(these_route_steps) - 1):
659
+ dx = x_coords[i+1] - x_coords[i]
660
+ dy = y_coords_flipped[i+1] - y_coords_flipped[i] # Use flipped y for delta
661
+
662
+ # Draw white background arrow (thicker)
663
+ ax_overlay.arrow(x_coords[i], y_coords_flipped[i], dx, dy,
664
+ linewidth=1.6 * arrow_scale * 1.3,
665
+ head_width=1.9 * arrow_scale * 1.3,
666
+ head_length=1.4 * arrow_scale * 1.45,
667
+ fc='white', ec='white', length_includes_head=True, alpha=1)
668
+ # Draw colored foreground arrow
669
+ ax_overlay.arrow(x_coords[i], y_coords_flipped[i], dx, dy,
670
+ linewidth=1.6 * arrow_scale,
671
+ head_width=1.9 * arrow_scale,
672
+ head_length=1.4 * arrow_scale,
673
+ fc=route_colours_step[i], ec=route_colours_step[i], # Use step color
674
+ length_includes_head=True)
675
+
676
+ else: # If no path yet, just show the image
677
+ ax_overlay.imshow(np.flipud(data_img_np), origin='lower')
678
+
679
+
680
+ # Set limits and turn off axes for overlay
681
+ ax_overlay.set_xlim([0, img_w - 1])
682
+ ax_overlay.set_ylim([0, img_h - 1])
683
+ ax_overlay.axis('off')
684
+
685
+
686
+ # --- Finalize and Save Frame ---
687
+ fig.tight_layout(pad=0.1) # Adjust spacing
688
+
689
+ # Render the plot to a numpy array
690
+ canvas = fig.canvas
691
+ canvas.draw()
692
+ image_numpy = np.frombuffer(canvas.buffer_rgba(), dtype='uint8')
693
+ image_numpy = image_numpy.reshape(*reversed(canvas.get_width_height()), 4)[:,:,:3] # Get RGB
694
+
695
+ frames.append(image_numpy) # Add to list for GIF
696
+
697
+
698
+
699
+ plt.close(fig) # Close figure to free memory
700
+
701
+ # --- Save GIF ---
702
+ gif_path = os.path.join(index_output_dir, f'{str(di)}_viz.gif')
703
+ print(f"Saving GIF to {gif_path}...")
704
+ imageio.mimsave(gif_path, frames, fps=15, loop=0) # loop=0 means infinite loop
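+ # The frames are flipped from RGB to BGR below; this assumes save_frames_to_mp4 expects BGR-ordered (OpenCV-style) frames.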
705
+ save_frames_to_mp4([fm[:,:,::-1] for fm in frames], os.path.join(index_output_dir, f'{str(di)}_viz.mp4'), fps=15, gop_size=1, preset='veryslow')
706
+ if 'demo' in args.actions:
707
+
708
+
709
+
710
+ # --- Select Data Indices ---
711
+ if not args.data_indices: # If list is empty
712
+ n_samples = len(validation_dataset)
713
+ num_to_sample = min(args.N_to_viz, n_samples)
714
+ replace = n_samples < num_to_sample
715
+ data_indices = np.random.choice(np.arange(n_samples), size=num_to_sample, replace=replace)
716
+ print(f"Selected random indices: {data_indices}")
717
+ else:
718
+ data_indices = args.data_indices
719
+ print(f"Using specified indices: {data_indices}")
720
+
721
+
722
+ for di in data_indices:
723
+
724
+ index_output_dir = os.path.join(args.output_dir, str(di))
725
+ os.makedirs(index_output_dir, exist_ok=True)
726
+
727
+ print(f'\nBuilding viz for dataset index {di}.')
728
+
729
+ inputs, ground_truth_target = validation_dataset.__getitem__(int(di))
730
+
731
+ # Add batch dimension and send to device
732
+ inputs = inputs.to(device).unsqueeze(0)
733
+ predictions, certainties, synchronisations_over_time, pre_activations, post_activations, attention_tracking = model(inputs, track=True)
734
+
735
+ # --- Reshape Attention ---
736
+ # Infer feature map size from model internals (assuming B=1)
737
+ h_feat, w_feat = model.kv_features.shape[-2:]
738
+ n_steps = predictions.size(-1)
739
+ n_heads = attention_tracking.shape[2]
740
+ # Reshape to (Steps, Heads, H_feat, W_feat) assuming B=1
741
+ attention_tracking = attention_tracking.reshape(n_steps, n_heads, h_feat, w_feat)
742
+
743
+ # --- Setup for Plotting ---
744
+ step_linspace = np.linspace(0, 1, n_steps) # For step colors
745
+ # Define color maps
746
+ cmap_steps = sns.color_palette("Spectral", as_cmap=True)
747
+ cmap_attention = sns.color_palette('viridis', as_cmap=True)
748
+
749
+ # Create output directory for this index
750
+
751
+
752
+ frames = [] # Store frames for GIF
753
+ head_routes = [] # Store (y,x) path points per head
754
+ route_colours_step = [] # Store colors for each step's path segments
755
+
756
+ # --- Loop Through Each Step ---
757
+ for step_i in range(n_steps):
758
+
759
+ # Store step color
760
+ current_colour = list(cmap_steps(step_linspace[step_i]))
761
+ route_colours_step.append(current_colour)
762
+
763
+ # --- Prepare Image for Display ---
764
+ # Denormalize the input tensor for visualization
765
+ data_img_tensor = inputs[0].cpu() # Get first item in batch, move to CPU
766
+ mean_tensor = torch.tensor(dataset_mean).view(3, 1, 1)
767
+ std_tensor = torch.tensor(dataset_std).view(3, 1, 1)
768
+ data_img_denorm = data_img_tensor * std_tensor + mean_tensor
769
+ # Permute to (H, W, C) and convert to numpy, clip to [0, 1]
770
+ data_img_np = data_img_denorm.permute(1, 2, 0).detach().numpy()
771
+ data_img_np = np.clip(data_img_np, 0, 1)
772
+ img_h, img_w = data_img_np.shape[:2]
773
+
774
+ # --- Process Attention & Certainty ---
775
+ # Average attention over last few steps (from original code)
776
+ start_step = max(0, step_i - 5)
777
+ attention_now = attention_tracking[start_step : step_i + 1].mean(0) # Avg over steps -> (Heads, H_feat, W_feat)
778
+ # Get certainties up to current step
779
+ certainties_now = certainties[0, 1, :step_i+1].detach().cpu().numpy() # Assuming index 1 holds relevant certainty
780
+
781
+ # --- Calculate Attention Paths (using bilinear interp) ---
782
+ # Interpolate attention to image size using bilinear for center finding
783
+ attention_interp_bilinear = F.interpolate(
784
+ torch.from_numpy(attention_now).unsqueeze(0).float(), # Add batch dim, ensure float
785
+ size=(img_h, img_w),
786
+ mode=interp_mode,
787
+ ).squeeze(0) # Remove batch dim -> (Heads, H, W)
788
+
789
+ attn_mean = attention_interp_bilinear.mean(0)
790
+ attn_mean_min = attn_mean.min()
791
+ attn_mean_max = attn_mean.max()
792
+ attn_mean = (attn_mean - attn_mean_min) / (attn_mean_max - attn_mean_min)
793
+ centers, areas = find_island_centers(attn_mean.detach().cpu().numpy(), threshold=0.7)
794
+
795
+ if centers: # If islands found
796
+ largest_island_idx = np.argmax(areas)
797
+ current_center = centers[largest_island_idx] # (y, x)
798
+ head_routes.append(current_center)
799
+ elif head_routes: # If no center now, repeat last known center if history exists
800
+ head_routes.append(head_routes[-1])
801
+
802
+ # --- Plotting Setup ---
803
+ # NOTE: this mosaic layout assumes 16 attention heads (head_0 ... head_15)
804
+ mosaic = [['head_0', 'head_1', 'head_2', 'head_3', 'head_mean', 'head_mean', 'head_mean', 'head_mean', 'overlay', 'overlay', 'overlay', 'overlay'],
805
+ ['head_4', 'head_5', 'head_6', 'head_7', 'head_mean', 'head_mean', 'head_mean', 'head_mean', 'overlay', 'overlay', 'overlay', 'overlay'],
806
+ ['head_8', 'head_9', 'head_10', 'head_11', 'head_mean', 'head_mean', 'head_mean', 'head_mean', 'overlay', 'overlay', 'overlay', 'overlay'],
807
+ ['head_12', 'head_13', 'head_14', 'head_15', 'head_mean', 'head_mean', 'head_mean', 'head_mean', 'overlay', 'overlay', 'overlay', 'overlay'],
808
+ ['probabilities', 'probabilities', 'probabilities', 'probabilities', 'probabilities', 'probabilities', 'certainty', 'certainty', 'certainty', 'certainty', 'certainty', 'certainty'],
809
+ ['probabilities', 'probabilities', 'probabilities', 'probabilities', 'probabilities', 'probabilities', 'certainty', 'certainty', 'certainty', 'certainty', 'certainty', 'certainty'],
810
+ ]
811
+
812
+ img_aspect = data_img_np.shape[0] / data_img_np.shape[1]
813
+ aspect_ratio = (12 * figscale, 6 * figscale * img_aspect) # W, H
814
+ fig, axes = plt.subplot_mosaic(mosaic, figsize=aspect_ratio)
815
+ for ax in axes.values():
816
+ ax.axis('off')
817
+
818
+ # --- Plot Certainty ---
819
+ ax_cert = axes['certainty']
820
+ ax_cert.plot(np.arange(len(certainties_now)), certainties_now, 'k-', linewidth=figscale*1)
821
+ # Add background color based on prediction correctness at each step
822
+ for ii in range(len(certainties_now)):
823
+ is_correct = predictions[0, :, ii].argmax(-1).item() == ground_truth_target # .item() for scalar tensor
824
+ facecolor = 'limegreen' if is_correct else 'orchid'
825
+ ax_cert.axvspan(ii, ii + 1, facecolor=facecolor, edgecolor=None, lw=0, alpha=0.3)
826
+ # Mark the last point
827
+ ax_cert.plot(len(certainties_now)-1, certainties_now[-1], 'k.', markersize=figscale*4)
828
+ ax_cert.axis('off')
829
+ ax_cert.set_ylim([0.05, 1.05])
830
+ ax_cert.set_xlim([0, n_steps]) # Use n_steps for consistent x-axis limit
831
+
832
+ # --- Plot Probabilities ---
833
+ ax_prob = axes['probabilities']
834
+ # Get probabilities for the current step
835
+ ps = torch.softmax(predictions[0, :, step_i], -1).detach().cpu()
836
+ k = 15 # Top k predictions
837
+ topk_probs, topk_indices = torch.topk(ps, k, dim=0, largest=True)
838
+ topk_indices = topk_indices.numpy()
839
+ topk_probs = topk_probs.numpy()
840
+
841
+ top_classes = np.array(class_labels)[topk_indices]
842
+ true_class_idx = ground_truth_target # Ground truth index
843
+
844
+ # Determine bar colors (green if correct, blue otherwise - consistent with original)
845
+ colours = ['g' if idx == true_class_idx else 'b' for idx in topk_indices]
846
+
847
+ # Plot horizontal bars (inverted range for top-down display)
848
+ ax_prob.barh(np.arange(k)[::-1], topk_probs, color=colours, alpha=1) # Use barh and inverted range
849
+ ax_prob.set_xlim([0, 1])
850
+ ax_prob.axis('off')
851
+
852
+ # Add text labels for top classes
853
+ for i, name_idx in enumerate(topk_indices):
854
+ name = class_labels[name_idx] # Get name from index
855
+ is_correct = name_idx == true_class_idx
856
+ fg_color = 'darkgreen' if is_correct else 'crimson' # Text colors from original
857
+ text_str = f'{name[:40]}' # Truncate long names
858
+ # Position text on the left side of the horizontal bars
859
+ ax_prob.text(
860
+ 0.01, # Small offset from left edge
861
+ k - 1 - i, # Y-position corresponding to the bar
862
+ text_str,
863
+ #transform=ax_prob.transAxes, # Use data coordinates for Y
864
+ verticalalignment='center',
865
+ horizontalalignment='left',
866
+ fontsize=8,
867
+ color=fg_color,
868
+ alpha=0.7, # Slightly more visible than 0.5
869
+ path_effects=[
870
+ patheffects.Stroke(linewidth=2, foreground='white'), # Adjusted stroke
871
+ patheffects.Normal()
872
+ ])
873
+
874
+
875
+ # --- Plot Attention Heads & Overlays ---
876
+ # Re-interpolate attention at image resolution for plotting (same interp_mode as the path-finding step)
877
+ attention_interp_plot = F.interpolate(
878
+ torch.from_numpy(attention_now).unsqueeze(0).float(),
879
+ size=(img_h, img_w),
880
+ mode=interp_mode
881
+ ).squeeze(0)
882
+
883
+
884
+ attn_mean = attention_interp_plot.mean(0)
885
+ attn_mean_min = attn_mean.min()
886
+ attn_mean_max = attn_mean.max()
887
+ attn_mean = (attn_mean - attn_mean_min) / (attn_mean_max - attn_mean_min)
888
+
889
+
890
+ img_to_plot = cmap_attention(attn_mean)
891
+ axes['head_mean'].imshow(img_to_plot)
892
+ axes['head_mean'].axis('off')
893
+
894
+
895
+ these_route_steps = head_routes
896
+ ax_overlay = axes['overlay']
897
+
898
+ if these_route_steps: # Only plot if path exists
899
+ # Separate y and x coordinates
900
+ y_coords, x_coords = zip(*these_route_steps)
901
+ y_coords = np.array(y_coords)
902
+ x_coords = np.array(x_coords)
903
+
904
+ # Flip y-coordinates for correct plotting (imshow origin is top-left)
905
+ # NOTE: Original flip seemed complex, simplifying to standard flip
906
+ y_coords_flipped = img_h - 1 - y_coords
907
+
908
+ # Show original image flipped vertically to match coordinate system
909
+ ax_overlay.imshow(np.flipud(data_img_np), origin='lower')
910
+
911
+ # Draw arrows for path segments
912
+ arrow_scale = 2 # Arrow size scaling from original
913
+ for i in range(len(these_route_steps) - 1):
914
+ dx = x_coords[i+1] - x_coords[i]
915
+ dy = y_coords_flipped[i+1] - y_coords_flipped[i] # Use flipped y for delta
916
+
917
+ # Draw white background arrow (thicker)
918
+ ax_overlay.arrow(x_coords[i], y_coords_flipped[i], dx, dy,
919
+ linewidth=1.6 * arrow_scale * 1.3,
920
+ head_width=1.9 * arrow_scale * 1.3,
921
+ head_length=1.4 * arrow_scale * 1.45,
922
+ fc='white', ec='white', length_includes_head=True, alpha=1)
923
+ # Draw colored foreground arrow
924
+ ax_overlay.arrow(x_coords[i], y_coords_flipped[i], dx, dy,
925
+ linewidth=1.6 * arrow_scale,
926
+ head_width=1.9 * arrow_scale,
927
+ head_length=1.4 * arrow_scale,
928
+ fc=route_colours_step[i], ec=route_colours_step[i], # Use step color
929
+ length_includes_head=True)
930
+ # Set limits and turn off axes for overlay
931
+ ax_overlay.set_xlim([0, img_w - 1])
932
+ ax_overlay.set_ylim([0, img_h - 1])
933
+ ax_overlay.axis('off')
934
+
935
+
936
+ for head_i in range(n_heads):
937
+ if f'head_{head_i}' not in axes: continue # Skip if mosaic doesn't have this head
938
+
939
+ ax = axes[f'head_{head_i}']
940
+
941
+ # Plot attention heatmap
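+ # (demo view: each head panel shows attention averaged over all ticks up to step_i, min-max normalised)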
942
+ attn_up_to_now = attention_tracking[:step_i + 1, head_i].mean(0)
943
+ attn_up_to_now = (attn_up_to_now - attn_up_to_now.min())/(attn_up_to_now.max() - attn_up_to_now.min())
944
+ img_to_plot = cmap_attention(attn_up_to_now)
945
+ ax.imshow(img_to_plot)
946
+ ax.axis('off')
947
+
948
+
949
+
950
+
951
+
952
+
953
+ # --- Finalize and Save Frame ---
954
+ fig.tight_layout(pad=0.1) # Adjust spacing
955
+
956
+ # Render the plot to a numpy array
957
+ canvas = fig.canvas
958
+ canvas.draw()
959
+ image_numpy = np.frombuffer(canvas.buffer_rgba(), dtype='uint8')
960
+ image_numpy = image_numpy.reshape(*reversed(canvas.get_width_height()), 4)[:,:,:3] # Get RGB
961
+
962
+ frames.append(image_numpy) # Add to list for GIF
963
+
964
+ # Save the final tick's frame as a standalone image
965
+ if step_i==model.iterations-1:
966
+ fig.savefig(os.path.join(index_output_dir, f'frame_{step_i}.png'), dpi=200)
967
+
968
+ plt.close(fig) # Close figure to free memory
969
+ outfilename = os.path.join(index_output_dir, f'{di}_demo.mp4')
970
+ save_frames_to_mp4([fm[:,:,::-1] for fm in frames], outfilename, fps=15, gop_size=1, preset='veryslow')
971
+
972
+
tasks/image_classification/imagenet_classes.py ADDED
@@ -0,0 +1,1007 @@
1
+ from collections import OrderedDict
2
+
3
+
4
+ IMAGENET2012_CLASSES = OrderedDict(
5
+ {
6
+ "n01440764": "tench, Tinca tinca",
7
+ "n01443537": "goldfish, Carassius auratus",
8
+ "n01484850": "great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias",
9
+ "n01491361": "tiger shark, Galeocerdo cuvieri",
10
+ "n01494475": "hammerhead, hammerhead shark",
11
+ "n01496331": "electric ray, crampfish, numbfish, torpedo",
12
+ "n01498041": "stingray",
13
+ "n01514668": "cock",
14
+ "n01514859": "hen",
15
+ "n01518878": "ostrich, Struthio camelus",
16
+ "n01530575": "brambling, Fringilla montifringilla",
17
+ "n01531178": "goldfinch, Carduelis carduelis",
18
+ "n01532829": "house finch, linnet, Carpodacus mexicanus",
19
+ "n01534433": "junco, snowbird",
20
+ "n01537544": "indigo bunting, indigo finch, indigo bird, Passerina cyanea",
21
+ "n01558993": "robin, American robin, Turdus migratorius",
22
+ "n01560419": "bulbul",
23
+ "n01580077": "jay",
24
+ "n01582220": "magpie",
25
+ "n01592084": "chickadee",
26
+ "n01601694": "water ouzel, dipper",
27
+ "n01608432": "kite",
28
+ "n01614925": "bald eagle, American eagle, Haliaeetus leucocephalus",
29
+ "n01616318": "vulture",
30
+ "n01622779": "great grey owl, great gray owl, Strix nebulosa",
31
+ "n01629819": "European fire salamander, Salamandra salamandra",
32
+ "n01630670": "common newt, Triturus vulgaris",
33
+ "n01631663": "eft",
34
+ "n01632458": "spotted salamander, Ambystoma maculatum",
35
+ "n01632777": "axolotl, mud puppy, Ambystoma mexicanum",
36
+ "n01641577": "bullfrog, Rana catesbeiana",
37
+ "n01644373": "tree frog, tree-frog",
38
+ "n01644900": "tailed frog, bell toad, ribbed toad, tailed toad, Ascaphus trui",
39
+ "n01664065": "loggerhead, loggerhead turtle, Caretta caretta",
40
+ "n01665541": "leatherback turtle, leatherback, leathery turtle, Dermochelys coriacea",
41
+ "n01667114": "mud turtle",
42
+ "n01667778": "terrapin",
43
+ "n01669191": "box turtle, box tortoise",
44
+ "n01675722": "banded gecko",
45
+ "n01677366": "common iguana, iguana, Iguana iguana",
46
+ "n01682714": "American chameleon, anole, Anolis carolinensis",
47
+ "n01685808": "whiptail, whiptail lizard",
48
+ "n01687978": "agama",
49
+ "n01688243": "frilled lizard, Chlamydosaurus kingi",
50
+ "n01689811": "alligator lizard",
51
+ "n01692333": "Gila monster, Heloderma suspectum",
52
+ "n01693334": "green lizard, Lacerta viridis",
53
+ "n01694178": "African chameleon, Chamaeleo chamaeleon",
54
+ "n01695060": "Komodo dragon, Komodo lizard, dragon lizard, giant lizard, Varanus komodoensis",
55
+ "n01697457": "African crocodile, Nile crocodile, Crocodylus niloticus",
56
+ "n01698640": "American alligator, Alligator mississipiensis",
57
+ "n01704323": "triceratops",
58
+ "n01728572": "thunder snake, worm snake, Carphophis amoenus",
59
+ "n01728920": "ringneck snake, ring-necked snake, ring snake",
60
+ "n01729322": "hognose snake, puff adder, sand viper",
61
+ "n01729977": "green snake, grass snake",
62
+ "n01734418": "king snake, kingsnake",
63
+ "n01735189": "garter snake, grass snake",
64
+ "n01737021": "water snake",
65
+ "n01739381": "vine snake",
66
+ "n01740131": "night snake, Hypsiglena torquata",
67
+ "n01742172": "boa constrictor, Constrictor constrictor",
68
+ "n01744401": "rock python, rock snake, Python sebae",
69
+ "n01748264": "Indian cobra, Naja naja",
70
+ "n01749939": "green mamba",
71
+ "n01751748": "sea snake",
72
+ "n01753488": "horned viper, cerastes, sand viper, horned asp, Cerastes cornutus",
73
+ "n01755581": "diamondback, diamondback rattlesnake, Crotalus adamanteus",
74
+ "n01756291": "sidewinder, horned rattlesnake, Crotalus cerastes",
75
+ "n01768244": "trilobite",
76
+ "n01770081": "harvestman, daddy longlegs, Phalangium opilio",
77
+ "n01770393": "scorpion",
78
+ "n01773157": "black and gold garden spider, Argiope aurantia",
79
+ "n01773549": "barn spider, Araneus cavaticus",
80
+ "n01773797": "garden spider, Aranea diademata",
81
+ "n01774384": "black widow, Latrodectus mactans",
82
+ "n01774750": "tarantula",
83
+ "n01775062": "wolf spider, hunting spider",
84
+ "n01776313": "tick",
85
+ "n01784675": "centipede",
86
+ "n01795545": "black grouse",
87
+ "n01796340": "ptarmigan",
88
+ "n01797886": "ruffed grouse, partridge, Bonasa umbellus",
89
+ "n01798484": "prairie chicken, prairie grouse, prairie fowl",
90
+ "n01806143": "peacock",
91
+ "n01806567": "quail",
92
+ "n01807496": "partridge",
93
+ "n01817953": "African grey, African gray, Psittacus erithacus",
94
+ "n01818515": "macaw",
95
+ "n01819313": "sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita",
96
+ "n01820546": "lorikeet",
97
+ "n01824575": "coucal",
98
+ "n01828970": "bee eater",
99
+ "n01829413": "hornbill",
100
+ "n01833805": "hummingbird",
101
+ "n01843065": "jacamar",
102
+ "n01843383": "toucan",
103
+ "n01847000": "drake",
104
+ "n01855032": "red-breasted merganser, Mergus serrator",
105
+ "n01855672": "goose",
106
+ "n01860187": "black swan, Cygnus atratus",
107
+ "n01871265": "tusker",
108
+ "n01872401": "echidna, spiny anteater, anteater",
109
+ "n01873310": "platypus, duckbill, duckbilled platypus, duck-billed platypus, Ornithorhynchus anatinus",
110
+ "n01877812": "wallaby, brush kangaroo",
111
+ "n01882714": "koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus",
112
+ "n01883070": "wombat",
113
+ "n01910747": "jellyfish",
114
+ "n01914609": "sea anemone, anemone",
115
+ "n01917289": "brain coral",
116
+ "n01924916": "flatworm, platyhelminth",
117
+ "n01930112": "nematode, nematode worm, roundworm",
118
+ "n01943899": "conch",
119
+ "n01944390": "snail",
120
+ "n01945685": "slug",
121
+ "n01950731": "sea slug, nudibranch",
122
+ "n01955084": "chiton, coat-of-mail shell, sea cradle, polyplacophore",
123
+ "n01968897": "chambered nautilus, pearly nautilus, nautilus",
124
+ "n01978287": "Dungeness crab, Cancer magister",
125
+ "n01978455": "rock crab, Cancer irroratus",
126
+ "n01980166": "fiddler crab",
127
+ "n01981276": "king crab, Alaska crab, Alaskan king crab, Alaska king crab, Paralithodes camtschatica",
128
+ "n01983481": "American lobster, Northern lobster, Maine lobster, Homarus americanus",
129
+ "n01984695": "spiny lobster, langouste, rock lobster, crawfish, crayfish, sea crawfish",
130
+ "n01985128": "crayfish, crawfish, crawdad, crawdaddy",
131
+ "n01986214": "hermit crab",
132
+ "n01990800": "isopod",
133
+ "n02002556": "white stork, Ciconia ciconia",
134
+ "n02002724": "black stork, Ciconia nigra",
135
+ "n02006656": "spoonbill",
136
+ "n02007558": "flamingo",
137
+ "n02009229": "little blue heron, Egretta caerulea",
138
+ "n02009912": "American egret, great white heron, Egretta albus",
139
+ "n02011460": "bittern",
140
+ "n02012849": "crane",
141
+ "n02013706": "limpkin, Aramus pictus",
142
+ "n02017213": "European gallinule, Porphyrio porphyrio",
143
+ "n02018207": "American coot, marsh hen, mud hen, water hen, Fulica americana",
144
+ "n02018795": "bustard",
145
+ "n02025239": "ruddy turnstone, Arenaria interpres",
146
+ "n02027492": "red-backed sandpiper, dunlin, Erolia alpina",
147
+ "n02028035": "redshank, Tringa totanus",
148
+ "n02033041": "dowitcher",
149
+ "n02037110": "oystercatcher, oyster catcher",
150
+ "n02051845": "pelican",
151
+ "n02056570": "king penguin, Aptenodytes patagonica",
152
+ "n02058221": "albatross, mollymawk",
153
+ "n02066245": "grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus",
154
+ "n02071294": "killer whale, killer, orca, grampus, sea wolf, Orcinus orca",
155
+ "n02074367": "dugong, Dugong dugon",
156
+ "n02077923": "sea lion",
157
+ "n02085620": "Chihuahua",
158
+ "n02085782": "Japanese spaniel",
159
+ "n02085936": "Maltese dog, Maltese terrier, Maltese",
160
+ "n02086079": "Pekinese, Pekingese, Peke",
161
+ "n02086240": "Shih-Tzu",
162
+ "n02086646": "Blenheim spaniel",
163
+ "n02086910": "papillon",
164
+ "n02087046": "toy terrier",
165
+ "n02087394": "Rhodesian ridgeback",
166
+ "n02088094": "Afghan hound, Afghan",
167
+ "n02088238": "basset, basset hound",
168
+ "n02088364": "beagle",
169
+ "n02088466": "bloodhound, sleuthhound",
170
+ "n02088632": "bluetick",
171
+ "n02089078": "black-and-tan coonhound",
172
+ "n02089867": "Walker hound, Walker foxhound",
173
+ "n02089973": "English foxhound",
174
+ "n02090379": "redbone",
175
+ "n02090622": "borzoi, Russian wolfhound",
176
+ "n02090721": "Irish wolfhound",
177
+ "n02091032": "Italian greyhound",
178
+ "n02091134": "whippet",
179
+ "n02091244": "Ibizan hound, Ibizan Podenco",
180
+ "n02091467": "Norwegian elkhound, elkhound",
181
+ "n02091635": "otterhound, otter hound",
182
+ "n02091831": "Saluki, gazelle hound",
183
+ "n02092002": "Scottish deerhound, deerhound",
184
+ "n02092339": "Weimaraner",
185
+ "n02093256": "Staffordshire bullterrier, Staffordshire bull terrier",
186
+ "n02093428": "American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier",
187
+ "n02093647": "Bedlington terrier",
188
+ "n02093754": "Border terrier",
189
+ "n02093859": "Kerry blue terrier",
190
+ "n02093991": "Irish terrier",
191
+ "n02094114": "Norfolk terrier",
192
+ "n02094258": "Norwich terrier",
193
+ "n02094433": "Yorkshire terrier",
194
+ "n02095314": "wire-haired fox terrier",
195
+ "n02095570": "Lakeland terrier",
196
+ "n02095889": "Sealyham terrier, Sealyham",
197
+ "n02096051": "Airedale, Airedale terrier",
198
+ "n02096177": "cairn, cairn terrier",
199
+ "n02096294": "Australian terrier",
200
+ "n02096437": "Dandie Dinmont, Dandie Dinmont terrier",
201
+ "n02096585": "Boston bull, Boston terrier",
202
+ "n02097047": "miniature schnauzer",
203
+ "n02097130": "giant schnauzer",
204
+ "n02097209": "standard schnauzer",
205
+ "n02097298": "Scotch terrier, Scottish terrier, Scottie",
206
+ "n02097474": "Tibetan terrier, chrysanthemum dog",
207
+ "n02097658": "silky terrier, Sydney silky",
208
+ "n02098105": "soft-coated wheaten terrier",
209
+ "n02098286": "West Highland white terrier",
210
+ "n02098413": "Lhasa, Lhasa apso",
211
+ "n02099267": "flat-coated retriever",
212
+ "n02099429": "curly-coated retriever",
213
+ "n02099601": "golden retriever",
214
+ "n02099712": "Labrador retriever",
215
+ "n02099849": "Chesapeake Bay retriever",
216
+ "n02100236": "German short-haired pointer",
217
+ "n02100583": "vizsla, Hungarian pointer",
218
+ "n02100735": "English setter",
219
+ "n02100877": "Irish setter, red setter",
220
+ "n02101006": "Gordon setter",
221
+ "n02101388": "Brittany spaniel",
222
+ "n02101556": "clumber, clumber spaniel",
223
+ "n02102040": "English springer, English springer spaniel",
224
+ "n02102177": "Welsh springer spaniel",
225
+ "n02102318": "cocker spaniel, English cocker spaniel, cocker",
226
+ "n02102480": "Sussex spaniel",
227
+ "n02102973": "Irish water spaniel",
228
+ "n02104029": "kuvasz",
229
+ "n02104365": "schipperke",
230
+ "n02105056": "groenendael",
231
+ "n02105162": "malinois",
232
+ "n02105251": "briard",
233
+ "n02105412": "kelpie",
234
+ "n02105505": "komondor",
235
+ "n02105641": "Old English sheepdog, bobtail",
236
+ "n02105855": "Shetland sheepdog, Shetland sheep dog, Shetland",
237
+ "n02106030": "collie",
238
+ "n02106166": "Border collie",
239
+ "n02106382": "Bouvier des Flandres, Bouviers des Flandres",
240
+ "n02106550": "Rottweiler",
241
+ "n02106662": "German shepherd, German shepherd dog, German police dog, alsatian",
242
+ "n02107142": "Doberman, Doberman pinscher",
243
+ "n02107312": "miniature pinscher",
244
+ "n02107574": "Greater Swiss Mountain dog",
245
+ "n02107683": "Bernese mountain dog",
246
+ "n02107908": "Appenzeller",
247
+ "n02108000": "EntleBucher",
248
+ "n02108089": "boxer",
249
+ "n02108422": "bull mastiff",
250
+ "n02108551": "Tibetan mastiff",
251
+ "n02108915": "French bulldog",
252
+ "n02109047": "Great Dane",
253
+ "n02109525": "Saint Bernard, St Bernard",
254
+ "n02109961": "Eskimo dog, husky",
255
+ "n02110063": "malamute, malemute, Alaskan malamute",
256
+ "n02110185": "Siberian husky",
257
+ "n02110341": "dalmatian, coach dog, carriage dog",
258
+ "n02110627": "affenpinscher, monkey pinscher, monkey dog",
259
+ "n02110806": "basenji",
260
+ "n02110958": "pug, pug-dog",
261
+ "n02111129": "Leonberg",
262
+ "n02111277": "Newfoundland, Newfoundland dog",
263
+ "n02111500": "Great Pyrenees",
264
+ "n02111889": "Samoyed, Samoyede",
265
+ "n02112018": "Pomeranian",
266
+ "n02112137": "chow, chow chow",
267
+ "n02112350": "keeshond",
268
+ "n02112706": "Brabancon griffon",
269
+ "n02113023": "Pembroke, Pembroke Welsh corgi",
270
+ "n02113186": "Cardigan, Cardigan Welsh corgi",
271
+ "n02113624": "toy poodle",
272
+ "n02113712": "miniature poodle",
273
+ "n02113799": "standard poodle",
274
+ "n02113978": "Mexican hairless",
275
+ "n02114367": "timber wolf, grey wolf, gray wolf, Canis lupus",
276
+ "n02114548": "white wolf, Arctic wolf, Canis lupus tundrarum",
277
+ "n02114712": "red wolf, maned wolf, Canis rufus, Canis niger",
278
+ "n02114855": "coyote, prairie wolf, brush wolf, Canis latrans",
279
+ "n02115641": "dingo, warrigal, warragal, Canis dingo",
280
+ "n02115913": "dhole, Cuon alpinus",
281
+ "n02116738": "African hunting dog, hyena dog, Cape hunting dog, Lycaon pictus",
282
+ "n02117135": "hyena, hyaena",
283
+ "n02119022": "red fox, Vulpes vulpes",
284
+ "n02119789": "kit fox, Vulpes macrotis",
285
+ "n02120079": "Arctic fox, white fox, Alopex lagopus",
286
+ "n02120505": "grey fox, gray fox, Urocyon cinereoargenteus",
287
+ "n02123045": "tabby, tabby cat",
288
+ "n02123159": "tiger cat",
289
+ "n02123394": "Persian cat",
290
+ "n02123597": "Siamese cat, Siamese",
291
+ "n02124075": "Egyptian cat",
292
+ "n02125311": "cougar, puma, catamount, mountain lion, painter, panther, Felis concolor",
293
+ "n02127052": "lynx, catamount",
294
+ "n02128385": "leopard, Panthera pardus",
295
+ "n02128757": "snow leopard, ounce, Panthera uncia",
296
+ "n02128925": "jaguar, panther, Panthera onca, Felis onca",
297
+ "n02129165": "lion, king of beasts, Panthera leo",
298
+ "n02129604": "tiger, Panthera tigris",
299
+ "n02130308": "cheetah, chetah, Acinonyx jubatus",
300
+ "n02132136": "brown bear, bruin, Ursus arctos",
301
+ "n02133161": "American black bear, black bear, Ursus americanus, Euarctos americanus",
302
+ "n02134084": "ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus",
303
+ "n02134418": "sloth bear, Melursus ursinus, Ursus ursinus",
304
+ "n02137549": "mongoose",
305
+ "n02138441": "meerkat, mierkat",
306
+ "n02165105": "tiger beetle",
307
+ "n02165456": "ladybug, ladybeetle, lady beetle, ladybird, ladybird beetle",
308
+ "n02167151": "ground beetle, carabid beetle",
309
+ "n02168699": "long-horned beetle, longicorn, longicorn beetle",
310
+ "n02169497": "leaf beetle, chrysomelid",
311
+ "n02172182": "dung beetle",
312
+ "n02174001": "rhinoceros beetle",
313
+ "n02177972": "weevil",
314
+ "n02190166": "fly",
315
+ "n02206856": "bee",
316
+ "n02219486": "ant, emmet, pismire",
317
+ "n02226429": "grasshopper, hopper",
318
+ "n02229544": "cricket",
319
+ "n02231487": "walking stick, walkingstick, stick insect",
320
+ "n02233338": "cockroach, roach",
321
+ "n02236044": "mantis, mantid",
322
+ "n02256656": "cicada, cicala",
323
+ "n02259212": "leafhopper",
324
+ "n02264363": "lacewing, lacewing fly",
325
+ "n02268443": "dragonfly, darning needle, devil's darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter hawk",
326
+ "n02268853": "damselfly",
327
+ "n02276258": "admiral",
328
+ "n02277742": "ringlet, ringlet butterfly",
329
+ "n02279972": "monarch, monarch butterfly, milkweed butterfly, Danaus plexippus",
330
+ "n02280649": "cabbage butterfly",
331
+ "n02281406": "sulphur butterfly, sulfur butterfly",
332
+ "n02281787": "lycaenid, lycaenid butterfly",
333
+ "n02317335": "starfish, sea star",
334
+ "n02319095": "sea urchin",
335
+ "n02321529": "sea cucumber, holothurian",
336
+ "n02325366": "wood rabbit, cottontail, cottontail rabbit",
337
+ "n02326432": "hare",
338
+ "n02328150": "Angora, Angora rabbit",
339
+ "n02342885": "hamster",
340
+ "n02346627": "porcupine, hedgehog",
341
+ "n02356798": "fox squirrel, eastern fox squirrel, Sciurus niger",
342
+ "n02361337": "marmot",
343
+ "n02363005": "beaver",
344
+ "n02364673": "guinea pig, Cavia cobaya",
345
+ "n02389026": "sorrel",
346
+ "n02391049": "zebra",
347
+ "n02395406": "hog, pig, grunter, squealer, Sus scrofa",
348
+ "n02396427": "wild boar, boar, Sus scrofa",
349
+ "n02397096": "warthog",
350
+ "n02398521": "hippopotamus, hippo, river horse, Hippopotamus amphibius",
351
+ "n02403003": "ox",
352
+ "n02408429": "water buffalo, water ox, Asiatic buffalo, Bubalus bubalis",
353
+ "n02410509": "bison",
354
+ "n02412080": "ram, tup",
355
+ "n02415577": "bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis",
356
+ "n02417914": "ibex, Capra ibex",
357
+ "n02422106": "hartebeest",
358
+ "n02422699": "impala, Aepyceros melampus",
359
+ "n02423022": "gazelle",
360
+ "n02437312": "Arabian camel, dromedary, Camelus dromedarius",
361
+ "n02437616": "llama",
362
+ "n02441942": "weasel",
363
+ "n02442845": "mink",
364
+ "n02443114": "polecat, fitch, foulmart, foumart, Mustela putorius",
365
+ "n02443484": "black-footed ferret, ferret, Mustela nigripes",
366
+ "n02444819": "otter",
367
+ "n02445715": "skunk, polecat, wood pussy",
368
+ "n02447366": "badger",
369
+ "n02454379": "armadillo",
370
+ "n02457408": "three-toed sloth, ai, Bradypus tridactylus",
371
+ "n02480495": "orangutan, orang, orangutang, Pongo pygmaeus",
372
+ "n02480855": "gorilla, Gorilla gorilla",
373
+ "n02481823": "chimpanzee, chimp, Pan troglodytes",
374
+ "n02483362": "gibbon, Hylobates lar",
375
+ "n02483708": "siamang, Hylobates syndactylus, Symphalangus syndactylus",
376
+ "n02484975": "guenon, guenon monkey",
377
+ "n02486261": "patas, hussar monkey, Erythrocebus patas",
378
+ "n02486410": "baboon",
379
+ "n02487347": "macaque",
380
+ "n02488291": "langur",
381
+ "n02488702": "colobus, colobus monkey",
382
+ "n02489166": "proboscis monkey, Nasalis larvatus",
383
+ "n02490219": "marmoset",
384
+ "n02492035": "capuchin, ringtail, Cebus capucinus",
385
+ "n02492660": "howler monkey, howler",
386
+ "n02493509": "titi, titi monkey",
387
+ "n02493793": "spider monkey, Ateles geoffroyi",
388
+ "n02494079": "squirrel monkey, Saimiri sciureus",
389
+ "n02497673": "Madagascar cat, ring-tailed lemur, Lemur catta",
390
+ "n02500267": "indri, indris, Indri indri, Indri brevicaudatus",
391
+ "n02504013": "Indian elephant, Elephas maximus",
392
+ "n02504458": "African elephant, Loxodonta africana",
393
+ "n02509815": "lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens",
394
+ "n02510455": "giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca",
395
+ "n02514041": "barracouta, snoek",
396
+ "n02526121": "eel",
397
+ "n02536864": "coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch",
398
+ "n02606052": "rock beauty, Holocanthus tricolor",
399
+ "n02607072": "anemone fish",
400
+ "n02640242": "sturgeon",
401
+ "n02641379": "gar, garfish, garpike, billfish, Lepisosteus osseus",
402
+ "n02643566": "lionfish",
403
+ "n02655020": "puffer, pufferfish, blowfish, globefish",
404
+ "n02666196": "abacus",
405
+ "n02667093": "abaya",
406
+ "n02669723": "academic gown, academic robe, judge's robe",
407
+ "n02672831": "accordion, piano accordion, squeeze box",
408
+ "n02676566": "acoustic guitar",
409
+ "n02687172": "aircraft carrier, carrier, flattop, attack aircraft carrier",
410
+ "n02690373": "airliner",
411
+ "n02692877": "airship, dirigible",
412
+ "n02699494": "altar",
413
+ "n02701002": "ambulance",
414
+ "n02704792": "amphibian, amphibious vehicle",
415
+ "n02708093": "analog clock",
416
+ "n02727426": "apiary, bee house",
417
+ "n02730930": "apron",
418
+ "n02747177": "ashcan, trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin",
419
+ "n02749479": "assault rifle, assault gun",
420
+ "n02769748": "backpack, back pack, knapsack, packsack, rucksack, haversack",
421
+ "n02776631": "bakery, bakeshop, bakehouse",
422
+ "n02777292": "balance beam, beam",
423
+ "n02782093": "balloon",
424
+ "n02783161": "ballpoint, ballpoint pen, ballpen, Biro",
425
+ "n02786058": "Band Aid",
426
+ "n02787622": "banjo",
427
+ "n02788148": "bannister, banister, balustrade, balusters, handrail",
428
+ "n02790996": "barbell",
429
+ "n02791124": "barber chair",
430
+ "n02791270": "barbershop",
431
+ "n02793495": "barn",
432
+ "n02794156": "barometer",
433
+ "n02795169": "barrel, cask",
434
+ "n02797295": "barrow, garden cart, lawn cart, wheelbarrow",
435
+ "n02799071": "baseball",
436
+ "n02802426": "basketball",
437
+ "n02804414": "bassinet",
438
+ "n02804610": "bassoon",
439
+ "n02807133": "bathing cap, swimming cap",
440
+ "n02808304": "bath towel",
441
+ "n02808440": "bathtub, bathing tub, bath, tub",
442
+ "n02814533": "beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon",
443
+ "n02814860": "beacon, lighthouse, beacon light, pharos",
444
+ "n02815834": "beaker",
445
+ "n02817516": "bearskin, busby, shako",
446
+ "n02823428": "beer bottle",
447
+ "n02823750": "beer glass",
448
+ "n02825657": "bell cote, bell cot",
449
+ "n02834397": "bib",
450
+ "n02835271": "bicycle-built-for-two, tandem bicycle, tandem",
451
+ "n02837789": "bikini, two-piece",
452
+ "n02840245": "binder, ring-binder",
453
+ "n02841315": "binoculars, field glasses, opera glasses",
454
+ "n02843684": "birdhouse",
455
+ "n02859443": "boathouse",
456
+ "n02860847": "bobsled, bobsleigh, bob",
457
+ "n02865351": "bolo tie, bolo, bola tie, bola",
458
+ "n02869837": "bonnet, poke bonnet",
459
+ "n02870880": "bookcase",
460
+ "n02871525": "bookshop, bookstore, bookstall",
461
+ "n02877765": "bottlecap",
462
+ "n02879718": "bow",
463
+ "n02883205": "bow tie, bow-tie, bowtie",
464
+ "n02892201": "brass, memorial tablet, plaque",
465
+ "n02892767": "brassiere, bra, bandeau",
466
+ "n02894605": "breakwater, groin, groyne, mole, bulwark, seawall, jetty",
467
+ "n02895154": "breastplate, aegis, egis",
468
+ "n02906734": "broom",
469
+ "n02909870": "bucket, pail",
470
+ "n02910353": "buckle",
471
+ "n02916936": "bulletproof vest",
472
+ "n02917067": "bullet train, bullet",
473
+ "n02927161": "butcher shop, meat market",
474
+ "n02930766": "cab, hack, taxi, taxicab",
475
+ "n02939185": "caldron, cauldron",
476
+ "n02948072": "candle, taper, wax light",
477
+ "n02950826": "cannon",
478
+ "n02951358": "canoe",
479
+ "n02951585": "can opener, tin opener",
480
+ "n02963159": "cardigan",
481
+ "n02965783": "car mirror",
482
+ "n02966193": "carousel, carrousel, merry-go-round, roundabout, whirligig",
483
+ "n02966687": "carpenter's kit, tool kit",
484
+ "n02971356": "carton",
485
+ "n02974003": "car wheel",
486
+ "n02977058": "cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM",
487
+ "n02978881": "cassette",
488
+ "n02979186": "cassette player",
489
+ "n02980441": "castle",
490
+ "n02981792": "catamaran",
491
+ "n02988304": "CD player",
492
+ "n02992211": "cello, violoncello",
493
+ "n02992529": "cellular telephone, cellular phone, cellphone, cell, mobile phone",
494
+ "n02999410": "chain",
495
+ "n03000134": "chainlink fence",
496
+ "n03000247": "chain mail, ring mail, mail, chain armor, chain armour, ring armor, ring armour",
497
+ "n03000684": "chain saw, chainsaw",
498
+ "n03014705": "chest",
499
+ "n03016953": "chiffonier, commode",
500
+ "n03017168": "chime, bell, gong",
501
+ "n03018349": "china cabinet, china closet",
502
+ "n03026506": "Christmas stocking",
503
+ "n03028079": "church, church building",
504
+ "n03032252": "cinema, movie theater, movie theatre, movie house, picture palace",
505
+ "n03041632": "cleaver, meat cleaver, chopper",
506
+ "n03042490": "cliff dwelling",
507
+ "n03045698": "cloak",
508
+ "n03047690": "clog, geta, patten, sabot",
509
+ "n03062245": "cocktail shaker",
510
+ "n03063599": "coffee mug",
511
+ "n03063689": "coffeepot",
512
+ "n03065424": "coil, spiral, volute, whorl, helix",
513
+ "n03075370": "combination lock",
514
+ "n03085013": "computer keyboard, keypad",
515
+ "n03089624": "confectionery, confectionary, candy store",
516
+ "n03095699": "container ship, containership, container vessel",
517
+ "n03100240": "convertible",
518
+ "n03109150": "corkscrew, bottle screw",
519
+ "n03110669": "cornet, horn, trumpet, trump",
520
+ "n03124043": "cowboy boot",
521
+ "n03124170": "cowboy hat, ten-gallon hat",
522
+ "n03125729": "cradle",
523
+ "n03126707": "crane2",
524
+ "n03127747": "crash helmet",
525
+ "n03127925": "crate",
526
+ "n03131574": "crib, cot",
527
+ "n03133878": "Crock Pot",
528
+ "n03134739": "croquet ball",
529
+ "n03141823": "crutch",
530
+ "n03146219": "cuirass",
531
+ "n03160309": "dam, dike, dyke",
532
+ "n03179701": "desk",
533
+ "n03180011": "desktop computer",
534
+ "n03187595": "dial telephone, dial phone",
535
+ "n03188531": "diaper, nappy, napkin",
536
+ "n03196217": "digital clock",
537
+ "n03197337": "digital watch",
538
+ "n03201208": "dining table, board",
539
+ "n03207743": "dishrag, dishcloth",
540
+ "n03207941": "dishwasher, dish washer, dishwashing machine",
541
+ "n03208938": "disk brake, disc brake",
542
+ "n03216828": "dock, dockage, docking facility",
543
+ "n03218198": "dogsled, dog sled, dog sleigh",
544
+ "n03220513": "dome",
545
+ "n03223299": "doormat, welcome mat",
546
+ "n03240683": "drilling platform, offshore rig",
547
+ "n03249569": "drum, membranophone, tympan",
548
+ "n03250847": "drumstick",
549
+ "n03255030": "dumbbell",
550
+ "n03259280": "Dutch oven",
551
+ "n03271574": "electric fan, blower",
552
+ "n03272010": "electric guitar",
553
+ "n03272562": "electric locomotive",
554
+ "n03290653": "entertainment center",
555
+ "n03291819": "envelope",
556
+ "n03297495": "espresso maker",
557
+ "n03314780": "face powder",
558
+ "n03325584": "feather boa, boa",
559
+ "n03337140": "file, file cabinet, filing cabinet",
560
+ "n03344393": "fireboat",
561
+ "n03345487": "fire engine, fire truck",
562
+ "n03347037": "fire screen, fireguard",
563
+ "n03355925": "flagpole, flagstaff",
564
+ "n03372029": "flute, transverse flute",
565
+ "n03376595": "folding chair",
566
+ "n03379051": "football helmet",
567
+ "n03384352": "forklift",
568
+ "n03388043": "fountain",
569
+ "n03388183": "fountain pen",
570
+ "n03388549": "four-poster",
571
+ "n03393912": "freight car",
572
+ "n03394916": "French horn, horn",
573
+ "n03400231": "frying pan, frypan, skillet",
574
+ "n03404251": "fur coat",
575
+ "n03417042": "garbage truck, dustcart",
576
+ "n03424325": "gasmask, respirator, gas helmet",
577
+ "n03425413": "gas pump, gasoline pump, petrol pump, island dispenser",
578
+ "n03443371": "goblet",
579
+ "n03444034": "go-kart",
580
+ "n03445777": "golf ball",
581
+ "n03445924": "golfcart, golf cart",
582
+ "n03447447": "gondola",
583
+ "n03447721": "gong, tam-tam",
584
+ "n03450230": "gown",
585
+ "n03452741": "grand piano, grand",
586
+ "n03457902": "greenhouse, nursery, glasshouse",
587
+ "n03459775": "grille, radiator grille",
588
+ "n03461385": "grocery store, grocery, food market, market",
589
+ "n03467068": "guillotine",
590
+ "n03476684": "hair slide",
591
+ "n03476991": "hair spray",
592
+ "n03478589": "half track",
593
+ "n03481172": "hammer",
594
+ "n03482405": "hamper",
595
+ "n03483316": "hand blower, blow dryer, blow drier, hair dryer, hair drier",
596
+ "n03485407": "hand-held computer, hand-held microcomputer",
597
+ "n03485794": "handkerchief, hankie, hanky, hankey",
598
+ "n03492542": "hard disc, hard disk, fixed disk",
599
+ "n03494278": "harmonica, mouth organ, harp, mouth harp",
600
+ "n03495258": "harp",
601
+ "n03496892": "harvester, reaper",
602
+ "n03498962": "hatchet",
603
+ "n03527444": "holster",
604
+ "n03529860": "home theater, home theatre",
605
+ "n03530642": "honeycomb",
606
+ "n03532672": "hook, claw",
607
+ "n03534580": "hoopskirt, crinoline",
608
+ "n03535780": "horizontal bar, high bar",
609
+ "n03538406": "horse cart, horse-cart",
610
+ "n03544143": "hourglass",
611
+ "n03584254": "iPod",
612
+ "n03584829": "iron, smoothing iron",
613
+ "n03590841": "jack-o'-lantern",
614
+ "n03594734": "jean, blue jean, denim",
615
+ "n03594945": "jeep, landrover",
616
+ "n03595614": "jersey, T-shirt, tee shirt",
617
+ "n03598930": "jigsaw puzzle",
618
+ "n03599486": "jinrikisha, ricksha, rickshaw",
619
+ "n03602883": "joystick",
620
+ "n03617480": "kimono",
621
+ "n03623198": "knee pad",
622
+ "n03627232": "knot",
623
+ "n03630383": "lab coat, laboratory coat",
624
+ "n03633091": "ladle",
625
+ "n03637318": "lampshade, lamp shade",
626
+ "n03642806": "laptop, laptop computer",
627
+ "n03649909": "lawn mower, mower",
628
+ "n03657121": "lens cap, lens cover",
629
+ "n03658185": "letter opener, paper knife, paperknife",
630
+ "n03661043": "library",
631
+ "n03662601": "lifeboat",
632
+ "n03666591": "lighter, light, igniter, ignitor",
633
+ "n03670208": "limousine, limo",
634
+ "n03673027": "liner, ocean liner",
635
+ "n03676483": "lipstick, lip rouge",
636
+ "n03680355": "Loafer",
637
+ "n03690938": "lotion",
638
+ "n03691459": "loudspeaker, speaker, speaker unit, loudspeaker system, speaker system",
639
+ "n03692522": "loupe, jeweler's loupe",
640
+ "n03697007": "lumbermill, sawmill",
641
+ "n03706229": "magnetic compass",
642
+ "n03709823": "mailbag, postbag",
643
+ "n03710193": "mailbox, letter box",
644
+ "n03710637": "maillot",
645
+ "n03710721": "maillot, tank suit",
646
+ "n03717622": "manhole cover",
647
+ "n03720891": "maraca",
648
+ "n03721384": "marimba, xylophone",
649
+ "n03724870": "mask",
650
+ "n03729826": "matchstick",
651
+ "n03733131": "maypole",
652
+ "n03733281": "maze, labyrinth",
653
+ "n03733805": "measuring cup",
654
+ "n03742115": "medicine chest, medicine cabinet",
655
+ "n03743016": "megalith, megalithic structure",
656
+ "n03759954": "microphone, mike",
657
+ "n03761084": "microwave, microwave oven",
658
+ "n03763968": "military uniform",
659
+ "n03764736": "milk can",
660
+ "n03769881": "minibus",
661
+ "n03770439": "miniskirt, mini",
662
+ "n03770679": "minivan",
663
+ "n03773504": "missile",
664
+ "n03775071": "mitten",
665
+ "n03775546": "mixing bowl",
666
+ "n03776460": "mobile home, manufactured home",
667
+ "n03777568": "Model T",
668
+ "n03777754": "modem",
669
+ "n03781244": "monastery",
670
+ "n03782006": "monitor",
671
+ "n03785016": "moped",
672
+ "n03786901": "mortar",
673
+ "n03787032": "mortarboard",
674
+ "n03788195": "mosque",
675
+ "n03788365": "mosquito net",
676
+ "n03791053": "motor scooter, scooter",
677
+ "n03792782": "mountain bike, all-terrain bike, off-roader",
678
+ "n03792972": "mountain tent",
679
+ "n03793489": "mouse, computer mouse",
680
+ "n03794056": "mousetrap",
681
+ "n03796401": "moving van",
682
+ "n03803284": "muzzle",
683
+ "n03804744": "nail",
684
+ "n03814639": "neck brace",
685
+ "n03814906": "necklace",
686
+ "n03825788": "nipple",
687
+ "n03832673": "notebook, notebook computer",
688
+ "n03837869": "obelisk",
689
+ "n03838899": "oboe, hautboy, hautbois",
690
+ "n03840681": "ocarina, sweet potato",
691
+ "n03841143": "odometer, hodometer, mileometer, milometer",
692
+ "n03843555": "oil filter",
693
+ "n03854065": "organ, pipe organ",
694
+ "n03857828": "oscilloscope, scope, cathode-ray oscilloscope, CRO",
695
+ "n03866082": "overskirt",
696
+ "n03868242": "oxcart",
697
+ "n03868863": "oxygen mask",
698
+ "n03871628": "packet",
699
+ "n03873416": "paddle, boat paddle",
700
+ "n03874293": "paddlewheel, paddle wheel",
701
+ "n03874599": "padlock",
702
+ "n03876231": "paintbrush",
703
+ "n03877472": "pajama, pyjama, pj's, jammies",
704
+ "n03877845": "palace",
705
+ "n03884397": "panpipe, pandean pipe, syrinx",
706
+ "n03887697": "paper towel",
707
+ "n03888257": "parachute, chute",
708
+ "n03888605": "parallel bars, bars",
709
+ "n03891251": "park bench",
710
+ "n03891332": "parking meter",
711
+ "n03895866": "passenger car, coach, carriage",
712
+ "n03899768": "patio, terrace",
713
+ "n03902125": "pay-phone, pay-station",
714
+ "n03903868": "pedestal, plinth, footstall",
715
+ "n03908618": "pencil box, pencil case",
716
+ "n03908714": "pencil sharpener",
717
+ "n03916031": "perfume, essence",
718
+ "n03920288": "Petri dish",
719
+ "n03924679": "photocopier",
720
+ "n03929660": "pick, plectrum, plectron",
721
+ "n03929855": "pickelhaube",
722
+ "n03930313": "picket fence, paling",
723
+ "n03930630": "pickup, pickup truck",
724
+ "n03933933": "pier",
725
+ "n03935335": "piggy bank, penny bank",
726
+ "n03937543": "pill bottle",
727
+ "n03938244": "pillow",
728
+ "n03942813": "ping-pong ball",
729
+ "n03944341": "pinwheel",
730
+ "n03947888": "pirate, pirate ship",
731
+ "n03950228": "pitcher, ewer",
732
+ "n03954731": "plane, carpenter's plane, woodworking plane",
733
+ "n03956157": "planetarium",
734
+ "n03958227": "plastic bag",
735
+ "n03961711": "plate rack",
736
+ "n03967562": "plow, plough",
737
+ "n03970156": "plunger, plumber's helper",
738
+ "n03976467": "Polaroid camera, Polaroid Land camera",
739
+ "n03976657": "pole",
740
+ "n03977966": "police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria",
741
+ "n03980874": "poncho",
742
+ "n03982430": "pool table, billiard table, snooker table",
743
+ "n03983396": "pop bottle, soda bottle",
744
+ "n03991062": "pot, flowerpot",
745
+ "n03992509": "potter's wheel",
746
+ "n03995372": "power drill",
747
+ "n03998194": "prayer rug, prayer mat",
748
+ "n04004767": "printer",
749
+ "n04005630": "prison, prison house",
750
+ "n04008634": "projectile, missile",
751
+ "n04009552": "projector",
752
+ "n04019541": "puck, hockey puck",
753
+ "n04023962": "punching bag, punch bag, punching ball, punchball",
754
+ "n04026417": "purse",
755
+ "n04033901": "quill, quill pen",
756
+ "n04033995": "quilt, comforter, comfort, puff",
757
+ "n04037443": "racer, race car, racing car",
758
+ "n04039381": "racket, racquet",
759
+ "n04040759": "radiator",
760
+ "n04041544": "radio, wireless",
761
+ "n04044716": "radio telescope, radio reflector",
762
+ "n04049303": "rain barrel",
763
+ "n04065272": "recreational vehicle, RV, R.V.",
764
+ "n04067472": "reel",
765
+ "n04069434": "reflex camera",
766
+ "n04070727": "refrigerator, icebox",
767
+ "n04074963": "remote control, remote",
768
+ "n04081281": "restaurant, eating house, eating place, eatery",
769
+ "n04086273": "revolver, six-gun, six-shooter",
770
+ "n04090263": "rifle",
771
+ "n04099969": "rocking chair, rocker",
772
+ "n04111531": "rotisserie",
773
+ "n04116512": "rubber eraser, rubber, pencil eraser",
774
+ "n04118538": "rugby ball",
775
+ "n04118776": "rule, ruler",
776
+ "n04120489": "running shoe",
777
+ "n04125021": "safe",
778
+ "n04127249": "safety pin",
779
+ "n04131690": "saltshaker, salt shaker",
780
+ "n04133789": "sandal",
781
+ "n04136333": "sarong",
782
+ "n04141076": "sax, saxophone",
783
+ "n04141327": "scabbard",
784
+ "n04141975": "scale, weighing machine",
785
+ "n04146614": "school bus",
786
+ "n04147183": "schooner",
787
+ "n04149813": "scoreboard",
788
+ "n04152593": "screen, CRT screen",
789
+ "n04153751": "screw",
790
+ "n04154565": "screwdriver",
791
+ "n04162706": "seat belt, seatbelt",
792
+ "n04179913": "sewing machine",
793
+ "n04192698": "shield, buckler",
794
+ "n04200800": "shoe shop, shoe-shop, shoe store",
795
+ "n04201297": "shoji",
796
+ "n04204238": "shopping basket",
797
+ "n04204347": "shopping cart",
798
+ "n04208210": "shovel",
799
+ "n04209133": "shower cap",
800
+ "n04209239": "shower curtain",
801
+ "n04228054": "ski",
802
+ "n04229816": "ski mask",
803
+ "n04235860": "sleeping bag",
804
+ "n04238763": "slide rule, slipstick",
805
+ "n04239074": "sliding door",
806
+ "n04243546": "slot, one-armed bandit",
807
+ "n04251144": "snorkel",
808
+ "n04252077": "snowmobile",
809
+ "n04252225": "snowplow, snowplough",
810
+ "n04254120": "soap dispenser",
811
+ "n04254680": "soccer ball",
812
+ "n04254777": "sock",
813
+ "n04258138": "solar dish, solar collector, solar furnace",
814
+ "n04259630": "sombrero",
815
+ "n04263257": "soup bowl",
816
+ "n04264628": "space bar",
817
+ "n04265275": "space heater",
818
+ "n04266014": "space shuttle",
819
+ "n04270147": "spatula",
820
+ "n04273569": "speedboat",
821
+ "n04275548": "spider web, spider's web",
822
+ "n04277352": "spindle",
823
+ "n04285008": "sports car, sport car",
824
+ "n04286575": "spotlight, spot",
825
+ "n04296562": "stage",
826
+ "n04310018": "steam locomotive",
827
+ "n04311004": "steel arch bridge",
828
+ "n04311174": "steel drum",
829
+ "n04317175": "stethoscope",
830
+ "n04325704": "stole",
831
+ "n04326547": "stone wall",
832
+ "n04328186": "stopwatch, stop watch",
833
+ "n04330267": "stove",
834
+ "n04332243": "strainer",
835
+ "n04335435": "streetcar, tram, tramcar, trolley, trolley car",
836
+ "n04336792": "stretcher",
837
+ "n04344873": "studio couch, day bed",
838
+ "n04346328": "stupa, tope",
839
+ "n04347754": "submarine, pigboat, sub, U-boat",
840
+ "n04350905": "suit, suit of clothes",
841
+ "n04355338": "sundial",
842
+ "n04355933": "sunglass",
843
+ "n04356056": "sunglasses, dark glasses, shades",
844
+ "n04357314": "sunscreen, sunblock, sun blocker",
845
+ "n04366367": "suspension bridge",
846
+ "n04367480": "swab, swob, mop",
847
+ "n04370456": "sweatshirt",
848
+ "n04371430": "swimming trunks, bathing trunks",
849
+ "n04371774": "swing",
850
+ "n04372370": "switch, electric switch, electrical switch",
851
+ "n04376876": "syringe",
852
+ "n04380533": "table lamp",
853
+ "n04389033": "tank, army tank, armored combat vehicle, armoured combat vehicle",
854
+ "n04392985": "tape player",
855
+ "n04398044": "teapot",
856
+ "n04399382": "teddy, teddy bear",
857
+ "n04404412": "television, television system",
858
+ "n04409515": "tennis ball",
859
+ "n04417672": "thatch, thatched roof",
860
+ "n04418357": "theater curtain, theatre curtain",
861
+ "n04423845": "thimble",
862
+ "n04428191": "thresher, thrasher, threshing machine",
863
+ "n04429376": "throne",
864
+ "n04435653": "tile roof",
865
+ "n04442312": "toaster",
866
+ "n04443257": "tobacco shop, tobacconist shop, tobacconist",
867
+ "n04447861": "toilet seat",
868
+ "n04456115": "torch",
869
+ "n04458633": "totem pole",
870
+ "n04461696": "tow truck, tow car, wrecker",
871
+ "n04462240": "toyshop",
872
+ "n04465501": "tractor",
873
+ "n04467665": "trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi",
874
+ "n04476259": "tray",
875
+ "n04479046": "trench coat",
876
+ "n04482393": "tricycle, trike, velocipede",
877
+ "n04483307": "trimaran",
878
+ "n04485082": "tripod",
879
+ "n04486054": "triumphal arch",
880
+ "n04487081": "trolleybus, trolley coach, trackless trolley",
881
+ "n04487394": "trombone",
882
+ "n04493381": "tub, vat",
883
+ "n04501370": "turnstile",
884
+ "n04505470": "typewriter keyboard",
885
+ "n04507155": "umbrella",
886
+ "n04509417": "unicycle, monocycle",
887
+ "n04515003": "upright, upright piano",
888
+ "n04517823": "vacuum, vacuum cleaner",
889
+ "n04522168": "vase",
890
+ "n04523525": "vault",
891
+ "n04525038": "velvet",
892
+ "n04525305": "vending machine",
893
+ "n04532106": "vestment",
894
+ "n04532670": "viaduct",
895
+ "n04536866": "violin, fiddle",
896
+ "n04540053": "volleyball",
897
+ "n04542943": "waffle iron",
898
+ "n04548280": "wall clock",
899
+ "n04548362": "wallet, billfold, notecase, pocketbook",
900
+ "n04550184": "wardrobe, closet, press",
901
+ "n04552348": "warplane, military plane",
902
+ "n04553703": "washbasin, handbasin, washbowl, lavabo, wash-hand basin",
903
+ "n04554684": "washer, automatic washer, washing machine",
904
+ "n04557648": "water bottle",
905
+ "n04560804": "water jug",
906
+ "n04562935": "water tower",
907
+ "n04579145": "whiskey jug",
908
+ "n04579432": "whistle",
909
+ "n04584207": "wig",
910
+ "n04589890": "window screen",
911
+ "n04590129": "window shade",
912
+ "n04591157": "Windsor tie",
913
+ "n04591713": "wine bottle",
914
+ "n04592741": "wing",
915
+ "n04596742": "wok",
916
+ "n04597913": "wooden spoon",
917
+ "n04599235": "wool, woolen, woollen",
918
+ "n04604644": "worm fence, snake fence, snake-rail fence, Virginia fence",
919
+ "n04606251": "wreck",
920
+ "n04612504": "yawl",
921
+ "n04613696": "yurt",
922
+ "n06359193": "web site, website, internet site, site",
923
+ "n06596364": "comic book",
924
+ "n06785654": "crossword puzzle, crossword",
925
+ "n06794110": "street sign",
926
+ "n06874185": "traffic light, traffic signal, stoplight",
927
+ "n07248320": "book jacket, dust cover, dust jacket, dust wrapper",
928
+ "n07565083": "menu",
929
+ "n07579787": "plate",
930
+ "n07583066": "guacamole",
931
+ "n07584110": "consomme",
932
+ "n07590611": "hot pot, hotpot",
933
+ "n07613480": "trifle",
934
+ "n07614500": "ice cream, icecream",
935
+ "n07615774": "ice lolly, lolly, lollipop, popsicle",
936
+ "n07684084": "French loaf",
937
+ "n07693725": "bagel, beigel",
938
+ "n07695742": "pretzel",
939
+ "n07697313": "cheeseburger",
940
+ "n07697537": "hotdog, hot dog, red hot",
941
+ "n07711569": "mashed potato",
942
+ "n07714571": "head cabbage",
943
+ "n07714990": "broccoli",
944
+ "n07715103": "cauliflower",
945
+ "n07716358": "zucchini, courgette",
946
+ "n07716906": "spaghetti squash",
947
+ "n07717410": "acorn squash",
948
+ "n07717556": "butternut squash",
949
+ "n07718472": "cucumber, cuke",
950
+ "n07718747": "artichoke, globe artichoke",
951
+ "n07720875": "bell pepper",
952
+ "n07730033": "cardoon",
953
+ "n07734744": "mushroom",
954
+ "n07742313": "Granny Smith",
955
+ "n07745940": "strawberry",
956
+ "n07747607": "orange",
957
+ "n07749582": "lemon",
958
+ "n07753113": "fig",
959
+ "n07753275": "pineapple, ananas",
960
+ "n07753592": "banana",
961
+ "n07754684": "jackfruit, jak, jack",
962
+ "n07760859": "custard apple",
963
+ "n07768694": "pomegranate",
964
+ "n07802026": "hay",
965
+ "n07831146": "carbonara",
966
+ "n07836838": "chocolate sauce, chocolate syrup",
967
+ "n07860988": "dough",
968
+ "n07871810": "meat loaf, meatloaf",
969
+ "n07873807": "pizza, pizza pie",
970
+ "n07875152": "potpie",
971
+ "n07880968": "burrito",
972
+ "n07892512": "red wine",
973
+ "n07920052": "espresso",
974
+ "n07930864": "cup",
975
+ "n07932039": "eggnog",
976
+ "n09193705": "alp",
977
+ "n09229709": "bubble",
978
+ "n09246464": "cliff, drop, drop-off",
979
+ "n09256479": "coral reef",
980
+ "n09288635": "geyser",
981
+ "n09332890": "lakeside, lakeshore",
982
+ "n09399592": "promontory, headland, head, foreland",
983
+ "n09421951": "sandbar, sand bar",
984
+ "n09428293": "seashore, coast, seacoast, sea-coast",
985
+ "n09468604": "valley, vale",
986
+ "n09472597": "volcano",
987
+ "n09835506": "ballplayer, baseball player",
988
+ "n10148035": "groom, bridegroom",
989
+ "n10565667": "scuba diver",
990
+ "n11879895": "rapeseed",
991
+ "n11939491": "daisy",
992
+ "n12057211": "yellow lady's slipper, yellow lady-slipper, Cypripedium calceolus, Cypripedium parviflorum",
993
+ "n12144580": "corn",
994
+ "n12267677": "acorn",
995
+ "n12620546": "hip, rose hip, rosehip",
996
+ "n12768682": "buckeye, horse chestnut, conker",
997
+ "n12985857": "coral fungus",
998
+ "n12998815": "agaric",
999
+ "n13037406": "gyromitra",
1000
+ "n13040303": "stinkhorn, carrion fungus",
1001
+ "n13044778": "earthstar",
1002
+ "n13052670": "hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa",
1003
+ "n13054560": "bolete",
1004
+ "n13133613": "ear, spike, capitulum",
1005
+ "n15075141": "toilet tissue, toilet paper, bathroom tissue",
1006
+ }
1007
+ )
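This synset-to-name mapping is what the training and analysis code consumes (e.g. `tasks/image_classification/train.py` imports `IMAGENET2012_CLASSES`). Below is a minimal, hypothetical lookup sketch, not part of the repository, assuming the mapping preserves declaration order so that class index 0 corresponds to the first synset listed:

```python
# Hypothetical usage sketch (not in the repo): map a predicted class index
# back to a readable label, assuming IMAGENET2012_CLASSES preserves the
# declaration order above (indices 0..999).
from tasks.image_classification.imagenet_classes import IMAGENET2012_CLASSES

synsets = list(IMAGENET2012_CLASSES.keys())    # e.g. 'n15075141'
labels = list(IMAGENET2012_CLASSES.values())   # e.g. 'toilet tissue, toilet paper, ...'

def index_to_label(class_index: int) -> str:
    """Return the first human-readable name for a class index."""
    return labels[class_index].split(',')[0]

print(index_to_label(999))  # 'toilet tissue' under the ordering assumption
```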
tasks/image_classification/plotting.py ADDED
@@ -0,0 +1,494 @@
1
+
2
+ import numpy as np
3
+ import cv2
4
+ import torch
5
+ import os
6
+ import imageio
7
+ import matplotlib.pyplot as plt
8
+ import matplotlib as mpl
9
+ from matplotlib import patheffects
10
+ mpl.use('Agg')
11
+ import seaborn as sns
12
+ import numpy as np
13
+ from tqdm.auto import tqdm
14
+ sns.set_style('darkgrid')
15
+
16
+ from tqdm.auto import tqdm
17
+ from scipy import ndimage
18
+ import umap
19
+ from scipy.special import softmax
20
+
21
+ import subprocess as sp
22
+ import cv2 # Still potentially useful for color conversion checks if needed
23
+ import os
24
+
25
+ def save_frames_to_mp4(frames, output_filename, fps=15.0, gop_size=None, crf=23, preset='medium', pix_fmt='yuv420p'):
26
+ """
27
+ Saves a list of NumPy array frames to an MP4 video file using FFmpeg via subprocess.
28
+
29
+ Includes fix for odd frame dimensions by padding to the nearest even number using -vf pad.
30
+
31
+ Requires FFmpeg to be installed and available in the system PATH.
32
+
33
+ Args:
34
+ frames (list): A list of NumPy arrays representing the video frames.
35
+ Expected format: uint8, (height, width, 3) for BGR color
36
+ or (height, width) for grayscale. Should be consistent.
37
+ output_filename (str): The path and name for the output MP4 file.
38
+ fps (float, optional): Frames per second for the output video. Defaults to 15.0.
39
+ gop_size (int, optional): Group of Pictures (GOP) size. This determines the
40
+ maximum interval between keyframes. Lower values
41
+ mean more frequent keyframes (better seeking, larger file).
42
+ Defaults to int(fps) (approx 1 keyframe per second).
43
+ crf (int, optional): Constant Rate Factor for H.264 encoding. Lower values mean
44
+ better quality and larger files. Typical range: 18-28.
45
+ Defaults to 23.
46
+ preset (str, optional): FFmpeg encoding speed preset. Affects encoding time
47
+ and compression efficiency. Options include 'ultrafast',
48
+ 'superfast', 'veryfast', 'faster', 'fast', 'medium',
49
+ 'slow', 'slower', 'veryslow'. Defaults to 'medium'.
50
+ """
51
+ if not frames:
52
+ print("Error: The 'frames' list is empty. No video to save.")
53
+ return
54
+
55
+ # --- Determine Parameters from First Frame ---
56
+ try:
57
+ first_frame = frames[0]
58
+ print(f"Info: First frame shape: {first_frame.shape}")
59
+ if not isinstance(first_frame, np.ndarray):
60
+ print(f"Error: Frame 0 is not a NumPy array (type: {type(first_frame)}).")
61
+ return
62
+
63
+ frame_height, frame_width = first_frame.shape[:2]
64
+ frame_size_str = f"{frame_width}x{frame_height}"
65
+
66
+ # Determine input pixel format based on first frame's shape
67
+ if len(first_frame.shape) == 3 and first_frame.shape[2] == 3:
68
+ input_pixel_format = 'bgr24' # Assume OpenCV's default BGR uint8
69
+ expected_dims = 3
70
+ print(f"Info: Detected color frames (shape: {first_frame.shape}). Expecting BGR input.")
71
+ elif len(first_frame.shape) == 2:
72
+ input_pixel_format = 'gray'
73
+ expected_dims = 2
74
+ print(f"Info: Detected grayscale frames (shape: {first_frame.shape}).")
75
+ else:
76
+ print(f"Error: Unsupported frame shape {first_frame.shape}. Must be (h, w) or (h, w, 3).")
77
+ return
78
+
79
+ if first_frame.dtype != np.uint8:
80
+ print(f"Warning: First frame dtype is {first_frame.dtype}. Will attempt conversion to uint8.")
81
+
82
+ except IndexError:
83
+ print("Error: Could not access the first frame to determine dimensions.")
84
+ return
85
+ except Exception as e:
86
+ print(f"Error processing first frame: {e}")
87
+ return
88
+
89
+ # --- Set GOP size default if not provided ---
90
+ if gop_size is None:
91
+ gop_size = int(fps)
92
+ print(f"Info: GOP size not specified, defaulting to {gop_size} (approx 1 keyframe/sec).")
93
+
94
+ # --- Construct FFmpeg Command ---
95
+ # ADDED -vf pad filter to ensure even dimensions for libx264/yuv420p
96
+ # It calculates the nearest even dimensions >= original dimensions
97
+ # Example: 1600x1351 -> 1600x1352
98
+ pad_filter = "pad=ceil(iw/2)*2:ceil(ih/2)*2"
99
+
100
+ command = [
101
+ 'ffmpeg',
102
+ '-y',
103
+ '-f', 'rawvideo',
104
+ '-vcodec', 'rawvideo',
105
+ '-pix_fmt', input_pixel_format,
106
+ '-s', frame_size_str,
107
+ '-r', str(float(fps)),
108
+ '-i', '-',
109
+ '-vf', pad_filter, # <--- ADDED VIDEO FILTER HERE
110
+ '-c:v', 'libx264',
111
+ '-pix_fmt', pix_fmt,
112
+ '-preset', preset,
113
+ '-crf', str(crf),
114
+ '-g', str(gop_size),
115
+ '-movflags', '+faststart',
116
+ output_filename
117
+ ]
118
+
119
+ print(f"\n--- Starting FFmpeg ---")
120
+ print(f"Output File: {output_filename}")
121
+ print(f"Parameters: FPS={fps}, Size={frame_size_str}, GOP={gop_size}, CRF={crf}, Preset={preset}")
122
+ print(f"Applying Filter: -vf {pad_filter} (Ensures even dimensions)")
123
+ # print(f"FFmpeg Command: {' '.join(command)}") # Uncomment for debugging
124
+
125
+ # --- Execute FFmpeg via Subprocess ---
126
+ try:
127
+ process = sp.Popen(command, stdin=sp.PIPE, stdout=sp.PIPE, stderr=sp.PIPE)
128
+
129
+ print(f"\nWriting {len(frames)} frames to FFmpeg...")
130
+ progress_interval = max(1, len(frames) // 10) # Print progress roughly 10 times
131
+
132
+ for i, frame in enumerate(frames):
133
+ # Basic validation and conversion for each frame
134
+ if not isinstance(frame, np.ndarray):
135
+ print(f"Warning: Frame {i} is not a numpy array (type: {type(frame)}). Skipping.")
136
+ continue
137
+ if frame.shape[0] != frame_height or frame.shape[1] != frame_width:
138
+ print(f"Warning: Frame {i} has different dimensions {frame.shape[:2]}! Expected ({frame_height},{frame_width}). Skipping.")
139
+ continue
140
+
141
+ current_dims = len(frame.shape)
142
+ if current_dims != expected_dims:
143
+ print(f"Warning: Frame {i} has inconsistent dimensions ({current_dims}D vs expected {expected_dims}D). Skipping.")
144
+ continue
145
+ if expected_dims == 3 and frame.shape[2] != 3:
146
+ print(f"Warning: Frame {i} is color but doesn't have 3 channels ({frame.shape}). Skipping.")
147
+ continue
148
+
149
+ if frame.dtype != np.uint8:
150
+ try:
151
+ frame = np.clip(frame, 0, 255).astype(np.uint8)
152
+ except Exception as clip_err:
153
+ print(f"Error clipping/converting frame {i} dtype: {clip_err}. Skipping.")
154
+ continue
155
+
156
+ # Write frame bytes to FFmpeg's stdin
157
+ try:
158
+ process.stdin.write(frame.tobytes())
159
+ except (OSError, BrokenPipeError) as pipe_err:
160
+ print(f"\nError writing frame {i} to FFmpeg stdin: {pipe_err}")
161
+ print("FFmpeg process likely terminated prematurely. Check FFmpeg errors below.")
162
+ try:
163
+ # Immediately try to read stderr if pipe breaks
164
+ stderr_output_on_error = process.stderr.read()
165
+ if stderr_output_on_error:
166
+ print("\n--- FFmpeg stderr output on error ---")
167
+ print(stderr_output_on_error.decode(errors='ignore'))
168
+ print("--- End FFmpeg stderr ---")
169
+ except Exception as read_err:
170
+ print(f"(Could not read stderr after pipe error: {read_err})")
171
+ return
172
+ except Exception as write_err:
173
+ print(f"Unexpected error writing frame {i}: {write_err}. Skipping.")
174
+ continue
175
+
176
+ if (i + 1) % progress_interval == 0 or (i + 1) == len(frames):
177
+ print(f" Processed frame {i + 1}/{len(frames)}")
178
+
179
+ print("\nFinished writing frames. Closing FFmpeg stdin and waiting for completion...")
180
+ process.stdin.close()
181
+ stdout, stderr = process.communicate()
182
+ return_code = process.wait()
183
+
184
+ print("\n--- FFmpeg Final Status ---")
185
+ if return_code == 0:
186
+ print(f"FFmpeg process completed successfully.")
187
+ print(f"Video saved as: {output_filename}")
188
+ else:
189
+ print(f"FFmpeg process failed with return code {return_code}.")
190
+ print("--- FFmpeg Standard Error Output: ---")
191
+ print(stderr.decode(errors='replace')) # Print stderr captured by communicate()
192
+ print("--- End FFmpeg Output ---")
193
+ print("Review the FFmpeg error message above for details (e.g., dimension errors, parameter issues).")
194
+
195
+ except FileNotFoundError:
196
+ print("\n--- FATAL ERROR ---")
197
+ print("Error: 'ffmpeg' command not found.")
198
+ print("Please ensure FFmpeg is installed and its directory is included in your system's PATH environment variable.")
199
+ print("Download from: https://ffmpeg.org/")
200
+ print("-------------------")
201
+ except Exception as e:
202
+ print(f"\nAn unexpected error occurred during FFmpeg execution: {e}")
203
+
204
+ def find_island_centers(array_2d, threshold):
205
+ """
206
+ Finds the center of mass of each island (connected component) in a 2D array.
207
+
208
+ Args:
209
+ array_2d: A 2D numpy array of values.
210
+ threshold: The threshold to binarize the array.
211
+
212
+ Returns:
213
+ A list of tuples (y, x) representing the center of mass of each island.
214
+ """
215
+ binary_image = array_2d > threshold
216
+ labeled_image, num_labels = ndimage.label(binary_image)
217
+ centers = []
218
+ areas = [] # Store the area of each island
219
+ for i in range(1, num_labels + 1):
220
+ island = (labeled_image == i)
221
+ total_mass = np.sum(array_2d[island])
222
+ if total_mass > 0:
223
+ y_coords, x_coords = np.mgrid[:array_2d.shape[0], :array_2d.shape[1]]
224
+ x_center = np.average(x_coords[island], weights=array_2d[island])
225
+ y_center = np.average(y_coords[island], weights=array_2d[island])
226
+ centers.append((round(y_center, 4), round(x_center, 4)))
227
+ areas.append(np.sum(island)) # Calculate area of the island
228
+ return centers, areas
229
+
230
+ def plot_neural_dynamics(post_activations_history, N_to_plot, save_location, axis_snap=False, N_per_row=5, which_neurons_mid=None, mid_colours=None, use_most_active_neurons=False):
231
+ assert N_to_plot%N_per_row==0, f'For nice visualisation, N_to_plot={N_to_plot} must be a multiple of N_per_row={N_per_row}'
232
+ assert post_activations_history.shape[-1] >= N_to_plot
233
+ figscale = 2
234
+ aspect_ratio = 3
235
+ mosaic = np.array([[f'{i}'] for i in range(N_to_plot)]).flatten().reshape(-1, N_per_row)
236
+ fig_synch, axes_synch = plt.subplot_mosaic(mosaic=mosaic, figsize=(figscale*mosaic.shape[1]*aspect_ratio*0.2, figscale*mosaic.shape[0]*0.2))
237
+ fig_mid, axes_mid = plt.subplot_mosaic(mosaic=mosaic, figsize=(figscale*mosaic.shape[1]*aspect_ratio*0.2, figscale*mosaic.shape[0]*0.2), dpi=200)
238
+
239
+ palette = sns.color_palette("husl", 8)
240
+
241
+ which_neurons_synch = np.arange(N_to_plot)
242
+ # which_neurons_mid = np.arange(N_to_plot, N_to_plot*2) if post_activations_history.shape[-1] >= 2*N_to_plot else np.random.choice(np.arange(post_activations_history.shape[-1]), size=N_to_plot, replace=True)
243
+ random_indices = np.random.choice(np.arange(post_activations_history.shape[-1]), size=N_to_plot, replace=post_activations_history.shape[-1] < N_to_plot)
244
+ if use_most_active_neurons:
245
+ metric = np.abs(np.fft.rfft(post_activations_history, axis=0))[3:].mean(0).std(0)
246
+ random_indices = np.argsort(metric)[-N_to_plot:]
247
+ np.random.shuffle(random_indices)
248
+ which_neurons_mid = which_neurons_mid if which_neurons_mid is not None else random_indices
249
+
250
+ if mid_colours is None:
251
+ mid_colours = [palette[np.random.randint(0, 8)] for ndx in range(N_to_plot)]
252
+ with tqdm(total=N_to_plot, initial=0, leave=False, position=1, dynamic_ncols=True) as pbar_inner:
253
+ pbar_inner.set_description('Plotting neural dynamics')
254
+ for ndx in range(N_to_plot):
255
+
256
+ ax_s = axes_synch[f'{ndx}']
257
+ ax_m = axes_mid[f'{ndx}']
258
+
259
+ traces_s = post_activations_history[:,:,which_neurons_synch[ndx]].T
260
+ traces_m = post_activations_history[:,:,which_neurons_mid[ndx]].T
261
+ c_s = palette[np.random.randint(0, 8)]
262
+ c_m = mid_colours[ndx]
263
+
264
+ for traces_s_here, traces_m_here in zip(traces_s, traces_m):
265
+ ax_s.plot(np.arange(len(traces_s_here)), traces_s_here, linestyle='-', color=c_s, alpha=0.05, linewidth=0.6)
266
+ ax_m.plot(np.arange(len(traces_m_here)), traces_m_here, linestyle='-', color=c_m, alpha=0.05, linewidth=0.6)
267
+
268
+
269
+ ax_s.plot(np.arange(len(traces_s[0])), traces_s[0], linestyle='-', color='white', alpha=1, linewidth=2.5)
270
+ ax_s.plot(np.arange(len(traces_s[0])), traces_s[0], linestyle='-', color=c_s, alpha=1, linewidth=1.3)
271
+ ax_s.plot(np.arange(len(traces_s[0])), traces_s[0], linestyle='-', color='black', alpha=1, linewidth=0.3)
272
+ ax_m.plot(np.arange(len(traces_m[0])), traces_m[0], linestyle='-', color='white', alpha=1, linewidth=2.5)
273
+ ax_m.plot(np.arange(len(traces_m[0])), traces_m[0], linestyle='-', color=c_m, alpha=1, linewidth=1.3)
274
+ ax_m.plot(np.arange(len(traces_m[0])), traces_m[0], linestyle='-', color='black', alpha=1, linewidth=0.3)
275
+ if axis_snap and np.all(np.isfinite(traces_s[0])):
276
+ ax_s.set_ylim([np.min(traces_s[0])-np.ptp(traces_s[0])*0.05, np.max(traces_s[0])+np.ptp(traces_s[0])*0.05])
277
+ ax_m.set_ylim([np.min(traces_m[0])-np.ptp(traces_m[0])*0.05, np.max(traces_m[0])+np.ptp(traces_m[0])*0.05])
278
+
279
+
280
+ ax_s.grid(False)
281
+ ax_m.grid(False)
282
+ ax_s.set_xlim([0, len(traces_s[0])-1])
283
+ ax_m.set_xlim([0, len(traces_m[0])-1])
284
+
285
+ ax_s.set_xticklabels([])
286
+ ax_s.set_yticklabels([])
287
+
288
+ ax_m.set_xticklabels([])
289
+ ax_m.set_yticklabels([])
290
+ pbar_inner.update(1)
291
+ fig_synch.tight_layout(pad=0.05)
292
+ fig_mid.tight_layout(pad=0.05)
293
+ if save_location is not None:
294
+ fig_synch.savefig(f'{save_location}/neural_dynamics_synch.pdf', dpi=200)
295
+ fig_synch.savefig(f'{save_location}/neural_dynamics_synch.png', dpi=200)
296
+ fig_mid.savefig(f'{save_location}/neural_dynamics_other.pdf', dpi=200)
297
+ fig_mid.savefig(f'{save_location}/neural_dynamics_other.png', dpi=200)
298
+ plt.close(fig_synch)
299
+ plt.close(fig_mid)
300
+ return fig_synch, fig_mid, which_neurons_mid, mid_colours
301
+
302
+
303
+
304
+ def make_classification_gif(image, target, predictions, certainties, post_activations, attention_tracking, class_labels, save_location):
305
+ cmap_viridis = sns.color_palette('viridis', as_cmap=True)
306
+ cmap_spectral = sns.color_palette("Spectral", as_cmap=True)
307
+ figscale = 2
308
+ with tqdm(total=post_activations.shape[0]+1, initial=0, leave=False, position=1, dynamic_ncols=True) as pbar_inner:
309
+ pbar_inner.set_description('Computing UMAP')
310
+
311
+
312
+ low = np.percentile(post_activations, 1, axis=0, keepdims=True)
313
+ high = np.percentile(post_activations, 99, axis=0, keepdims=True)
314
+ post_activations_normed = np.clip((post_activations - low)/(high - low), 0, 1)
315
+ metric = 'cosine'
316
+ reducer = umap.UMAP(n_components=2,
317
+ n_neighbors=100,
318
+ min_dist=3,
319
+ spread=3.0,
320
+ metric=metric,
321
+ random_state=None,
322
+ # low_memory=True,
323
+ ) if post_activations.shape[-1] > 2048 else umap.UMAP(n_components=2,
324
+ n_neighbors=20,
325
+ min_dist=1,
326
+ spread=1.0,
327
+ metric=metric,
328
+ random_state=None,
329
+ # low_memory=True,
330
+ )
331
+ positions = reducer.fit_transform(post_activations_normed.T)
332
+
333
+ x_umap = positions[:, 0]
334
+ y_umap = positions[:, 1]
335
+
336
+ pbar_inner.update(1)
337
+ pbar_inner.set_description('Iterating through to build frames')
338
+
339
+
340
+
341
+ frames = []
342
+ route_steps = {}
343
+ route_colours = []
344
+
345
+ n_steps = len(post_activations)
346
+ n_heads = attention_tracking.shape[1]
347
+ step_linspace = np.linspace(0, 1, n_steps)
348
+
349
+ for stepi in np.arange(0, n_steps, 1):
350
+ pbar_inner.set_description('Making frames for gif')
351
+
352
+
353
+ attention_now = attention_tracking[max(0, stepi-5):stepi+1].mean(0) # Average the last few steps so the attention maps look smoother
354
+ # attention_now[:,0,0] = 0 # Corners can be weird looking
355
+ # attention_now[:,0,-1] = 0
356
+ # attention_now[:,-1,0] = 0
357
+ # attention_now[:,-1,-1] = 0
358
+ # attention_now = (attention_tracking[:stepi+1, 0] * decay).sum(0)/(decay.sum(0))
359
+ certainties_now = certainties[1, :stepi+1]
360
+ attention_interp = torch.nn.functional.interpolate(torch.from_numpy(attention_now).unsqueeze(0), image.shape[:2], mode='bilinear')[0]
361
+ attention_interp = (attention_interp.flatten(1) - attention_interp.flatten(1).min(-1, keepdim=True)[0])/(attention_interp.flatten(1).max(-1, keepdim=True)[0] - attention_interp.flatten(1).min(-1, keepdim=True)[0])
362
+ attention_interp = attention_interp.reshape(n_heads, image.shape[0], image.shape[1])
363
+
364
+
365
+ colour = list(cmap_spectral(step_linspace[stepi]))
366
+ route_colours.append(colour)
367
+ for headi in range(min(8, n_heads)):
368
+ com_attn = np.copy(attention_interp[headi])
369
+ com_attn[com_attn < np.percentile(com_attn, 97)] = 0.0
370
+ if headi not in route_steps:
371
+ A = attention_interp[headi].detach().cpu().numpy()
372
+ centres, areas = find_island_centers(A, threshold=0.7)
373
+ route_steps[headi] = [centres[np.argmax(areas)]]
374
+ else:
375
+ A = attention_interp[headi].detach().cpu().numpy()
376
+ centres, areas = find_island_centers(A, threshold=0.7)
377
+ route_steps[headi] = route_steps[headi] + [centres[np.argmax(areas)]]
378
+
379
+ mosaic = [['head_0', 'head_0_overlay', 'head_1', 'head_1_overlay'],
380
+ ['head_2', 'head_2_overlay', 'head_3', 'head_3_overlay'],
381
+ ['head_4', 'head_4_overlay', 'head_5', 'head_5_overlay'],
382
+ ['head_6', 'head_6_overlay', 'head_7', 'head_7_overlay'],
383
+ ['probabilities', 'probabilities','certainty', 'certainty'],
384
+ ['umap', 'umap', 'umap', 'umap'],
385
+ ['umap', 'umap', 'umap', 'umap'],
386
+ ['umap', 'umap', 'umap', 'umap'],
387
+
388
+ ]
389
+
390
+
391
+ img_aspect = image.shape[0]/image.shape[1]
392
+ # print(img_aspect)
393
+ aspect_ratio = (4*figscale, 8*figscale*img_aspect)
394
+ fig, axes = plt.subplot_mosaic(mosaic, figsize=aspect_ratio)
395
+ for ax in axes.values():
396
+ ax.axis('off')
397
+
398
+
399
+ axes['certainty'].plot(np.arange(len(certainties_now)), certainties_now, 'k-', linewidth=figscale*1, label='1-(normalised entropy)')
400
+ for ii, (x, y) in enumerate(zip(np.arange(len(certainties_now)), certainties_now)):
401
+ is_correct = predictions[:, ii].argmax(-1)==target
402
+ if is_correct: axes['certainty'].axvspan(ii, ii + 1, facecolor='limegreen', edgecolor=None, lw=0, alpha=0.3)
403
+ else:
404
+ axes['certainty'].axvspan(ii, ii + 1, facecolor='orchid', edgecolor=None, lw=0, alpha=0.3)
405
+ axes['certainty'].plot(len(certainties_now)-1, certainties_now[-1], 'k.', markersize=figscale*4)
406
+ axes['certainty'].axis('off')
407
+ axes['certainty'].set_ylim([-0.05, 1.05])
408
+ axes['certainty'].set_xlim([0, certainties.shape[-1]+1])
409
+
410
+ ps = torch.softmax(torch.from_numpy(predictions[:, stepi]), -1)
411
+ k = 15 if len(class_labels) > 15 else len(class_labels)
412
+ topk = torch.topk (ps, k, dim = 0, largest=True).indices.detach().cpu().numpy()
413
+ top_classes = np.array(class_labels)[topk]
414
+ true_class = target
415
+ colours = [('b' if ci != true_class else 'g') for ci in topk]
416
+ bar_heights = ps[topk].detach().cpu().numpy()
417
+
418
+
419
+ axes['probabilities'].bar(np.arange(len(bar_heights))[::-1], bar_heights, color=np.array(colours), alpha=1)
420
+ axes['probabilities'].set_ylim([0, 1])
421
+ axes['probabilities'].axis('off')
422
+
423
+
424
+ for i, (name) in enumerate(top_classes):
425
+ prob = ps[topk[i]]  # probability of this top-k class (currently unused)
426
+ is_correct = name==class_labels[true_class]
427
+ fg_color = 'darkgreen' if is_correct else 'crimson'
428
+ text_str = f'{name[:40]}'
429
+ axes['probabilities'].text(
430
+ 0.05,
431
+ 0.95 - i * 0.055, # Adjust vertical position for each line
432
+ text_str,
433
+ transform=axes['probabilities'].transAxes,
434
+ verticalalignment='top',
435
+ fontsize=8, # Increased font size
436
+ color=fg_color,
437
+ alpha=0.5,
438
+ path_effects=[
439
+ patheffects.Stroke(linewidth=3, foreground='aliceblue'),
440
+ patheffects.Normal()
441
+ ])
442
+
443
+
444
+
445
+ attention_now = attention_tracking[max(0, stepi-5):stepi+1].mean(0) # Average the last few steps so the attention maps look smoother
446
+ # attention_now = (attention_tracking[:stepi+1, 0] * decay).sum(0)/(decay.sum(0))
447
+ certainties_now = certainties[1, :stepi+1]
448
+ attention_interp = torch.nn.functional.interpolate(torch.from_numpy(attention_now).unsqueeze(0), image.shape[:2], mode='nearest')[0]
449
+ attention_interp = (attention_interp.flatten(1) - attention_interp.flatten(1).min(-1, keepdim=True)[0])/(attention_interp.flatten(1).max(-1, keepdim=True)[0] - attention_interp.flatten(1).min(-1, keepdim=True)[0])
450
+ attention_interp = attention_interp.reshape(n_heads, image.shape[0], image.shape[1])
451
+
452
+ for hi in range(min(8, n_heads)):
453
+ ax = axes[f'head_{hi}']
454
+ img_to_plot = cmap_viridis(attention_interp[hi].detach().cpu().numpy())
455
+ ax.imshow(img_to_plot)
456
+
457
+ ax_overlay = axes[f'head_{hi}_overlay']
458
+
459
+ these_route_steps = route_steps[hi]
460
+ y_coords, x_coords = zip(*these_route_steps)
461
+ y_coords = image.shape[-2] - np.array(list(y_coords))-1
462
+
463
+ ax_overlay.imshow(np.flip(image, axis=0), origin='lower')
464
+ # ax.imshow(np.flip(solution_maze, axis=0), origin='lower')
465
+ arrow_scale = 1.5 if image.shape[0] > 32 else 0.8
466
+ for i in range(len(these_route_steps)-1):
467
+ dx = x_coords[i+1] - x_coords[i]
468
+ dy = y_coords[i+1] - y_coords[i]
469
+
470
+ ax_overlay.arrow(x_coords[i], y_coords[i], dx, dy, linewidth=1.6*arrow_scale*1.3, head_width=1.9*arrow_scale*1.3, head_length=1.4*arrow_scale*1.45, fc='white', ec='white', length_includes_head = True, alpha=1)
471
+ ax_overlay.arrow(x_coords[i], y_coords[i], dx, dy, linewidth=1.6*arrow_scale, head_width=1.9*arrow_scale, head_length=1.4*arrow_scale, fc=route_colours[i], ec=route_colours[i], length_includes_head = True)
472
+
473
+ ax_overlay.set_xlim([0,image.shape[1]-1])
474
+ ax_overlay.set_ylim([0,image.shape[0]-1])
475
+ ax_overlay.axis('off')
476
+
477
+
478
+ z = post_activations_normed[stepi]
479
+
480
+ axes['umap'].scatter(x_umap, y_umap, s=30, c=cmap_spectral(z))
481
+
482
+ fig.tight_layout(pad=0.1)
483
+
484
+
485
+
486
+ canvas = fig.canvas
487
+ canvas.draw()
488
+ image_numpy = np.frombuffer(canvas.buffer_rgba(), dtype='uint8')
489
+ image_numpy = (image_numpy.reshape(*reversed(canvas.get_width_height()), 4)[:,:,:3])
490
+ frames.append(image_numpy)
491
+ plt.close(fig)
492
+ pbar_inner.update(1)
493
+ pbar_inner.set_description('Saving gif')
494
+ imageio.mimsave(save_location, frames, fps=15, loop=100)
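`save_frames_to_mp4` pipes raw frames into an FFmpeg subprocess and pads odd dimensions to even values, as its docstring describes. A hypothetical smoke test, not part of the repository and assuming `ffmpeg` is on the PATH:

```python
# Hypothetical smoke test (not in the repo) for save_frames_to_mp4.
# Assumes ffmpeg is installed and on PATH, as the function requires.
import numpy as np
from tasks.image_classification.plotting import save_frames_to_mp4

# 30 random BGR uint8 frames with deliberately odd dimensions (101x75)
# to exercise the even-dimension padding filter.
frames = [np.random.randint(0, 256, size=(101, 75, 3), dtype=np.uint8) for _ in range(30)]
save_frames_to_mp4(frames, 'smoke_test.mp4', fps=15.0, crf=23, preset='veryfast')
```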
tasks/image_classification/scripts/train_cifar10.sh ADDED
@@ -0,0 +1,286 @@
1
+ python -m tasks.image_classification.train \
2
+ --log_dir logs/cifar10-versus-humans/ctm/d=256--i=64--heads=16--sd=5--synch=256-512-0-h=64-random-pairing--iters=50x15--backbone=18-1--seed=1 \
3
+ --model ctm \
4
+ --dataset cifar10 \
5
+ --d_model 256 \
6
+ --d_input 64 \
7
+ --synapse_depth 5 \
8
+ --heads 16 \
9
+ --n_synch_out 256 \
10
+ --n_synch_action 512 \
11
+ --n_random_pairing_self 0 \
12
+ --neuron_select_type random-pairing \
13
+ --iterations 50 \
14
+ --memory_length 15 \
15
+ --deep_memory \
16
+ --memory_hidden_dims 64 \
17
+ --dropout 0.0 \
18
+ --dropout_nlm 0 \
19
+ --no-do_normalisation \
20
+ --positional_embedding_type none \
21
+ --backbone_type resnet18-1 \
22
+ --training_iterations 600001 \
23
+ --warmup_steps 1000 \
24
+ --use_scheduler \
25
+ --scheduler_type cosine \
26
+ --weight_decay 0.0001 \
27
+ --save_every 1000 \
28
+ --track_every 2000 \
29
+ --n_test_batches 50 \
30
+ --num_workers_train 8 \
31
+ --batch_size 512 \
32
+ --batch_size_test 512 \
33
+ --lr 1e-4 \
34
+ --device 0 \
35
+ --seed 1
36
+
37
+
38
+ python -m tasks.image_classification.train \
39
+ --log_dir logs/cifar10-versus-humans/ctm/d=256--i=64--heads=16--sd=5--synch=256-512-0-h=64-random-pairing--iters=50x15--backbone=18-1--seed=2 \
40
+ --model ctm \
41
+ --dataset cifar10 \
42
+ --d_model 256 \
43
+ --d_input 64 \
44
+ --synapse_depth 5 \
45
+ --heads 16 \
46
+ --n_synch_out 256 \
47
+ --n_synch_action 512 \
48
+ --n_random_pairing_self 0 \
49
+ --neuron_select_type random-pairing \
50
+ --iterations 50 \
51
+ --memory_length 15 \
52
+ --deep_memory \
53
+ --memory_hidden_dims 64 \
54
+ --dropout 0.0 \
55
+ --dropout_nlm 0 \
56
+ --no-do_normalisation \
57
+ --positional_embedding_type none \
58
+ --backbone_type resnet18-1 \
59
+ --training_iterations 600001 \
60
+ --warmup_steps 1000 \
61
+ --use_scheduler \
62
+ --scheduler_type cosine \
63
+ --weight_decay 0.0001 \
64
+ --save_every 1000 \
65
+ --track_every 2000 \
66
+ --n_test_batches 50 \
67
+ --num_workers_train 8 \
68
+ --batch_size 512 \
69
+ --batch_size_test 512 \
70
+ --lr 1e-4 \
71
+ --device 0 \
72
+ --seed 2
73
+
74
+ python -m tasks.image_classification.train \
75
+ --log_dir logs/cifar10-versus-humans/ctm/d=256--i=64--heads=16--sd=5--synch=256-512-0-h=64-random-pairing--iters=50x15--backbone=18-1--seed=42 \
76
+ --model ctm \
77
+ --dataset cifar10 \
78
+ --d_model 256 \
79
+ --d_input 64 \
80
+ --synapse_depth 5 \
81
+ --heads 16 \
82
+ --n_synch_out 256 \
83
+ --n_synch_action 512 \
84
+ --n_random_pairing_self 0 \
85
+ --neuron_select_type random-pairing \
86
+ --iterations 50 \
87
+ --memory_length 15 \
88
+ --deep_memory \
89
+ --memory_hidden_dims 64 \
90
+ --dropout 0.0 \
91
+ --dropout_nlm 0 \
92
+ --no-do_normalisation \
93
+ --positional_embedding_type none \
94
+ --backbone_type resnet18-1 \
95
+ --training_iterations 600001 \
96
+ --warmup_steps 1000 \
97
+ --use_scheduler \
98
+ --scheduler_type cosine \
99
+ --weight_decay 0.0001 \
100
+ --save_every 1000 \
101
+ --track_every 2000 \
102
+ --n_test_batches 50 \
103
+ --num_workers_train 8 \
104
+ --batch_size 512 \
105
+ --batch_size_test 512 \
106
+ --lr 1e-4 \
107
+ --device 0 \
108
+ --seed 42
109
+
110
+
111
+
112
+
113
+
114
+
115
+ python -m tasks.image_classification.train \
116
+ --log_dir logs/cifar10-versus-humans/lstm/nlayers=2--d=256--i=64--heads=16--synch=256-512-0-h=64-random-pairing--iters=50x15--backbone=18-1--seed=1 \
117
+ --dataset cifar10 \
118
+ --model lstm \
119
+ --num_layers 2 \
120
+ --d_model 256 \
121
+ --d_input 64 \
122
+ --heads 16 \
123
+ --iterations 50 \
124
+ --dropout 0.0 \
125
+ --positional_embedding_type none \
126
+ --backbone_type resnet18-1 \
127
+ --training_iterations 600001 \
128
+ --warmup_steps 2000 \
129
+ --use_scheduler \
130
+ --scheduler_type cosine \
131
+ --weight_decay 0.0001 \
132
+ --save_every 1000 \
133
+ --track_every 2000 \
134
+ --n_test_batches 50 \
135
+ --reload \
136
+ --num_workers_train 8 \
137
+ --batch_size 512 \
138
+ --batch_size_test 512 \
139
+ --lr 1e-4 \
140
+ --device 0 \
141
+ --seed 1 \
142
+ --no-reload
143
+
144
+
145
+ python -m tasks.image_classification.train \
146
+ --log_dir logs/cifar10-versus-humans/lstm/nlayers=2--d=256--i=64--heads=16--synch=256-512-0-h=64-random-pairing--iters=50x15--backbone=18-1--seed=2 \
147
+ --dataset cifar10 \
148
+ --model lstm \
149
+ --num_layers 2 \
150
+ --d_model 256 \
151
+ --d_input 64 \
152
+ --heads 16 \
153
+ --iterations 50 \
154
+ --dropout 0.0 \
155
+ --positional_embedding_type none \
156
+ --backbone_type resnet18-1 \
157
+ --training_iterations 600001 \
158
+ --warmup_steps 2000 \
159
+ --use_scheduler \
160
+ --scheduler_type cosine \
161
+ --weight_decay 0.0001 \
162
+ --save_every 1000 \
163
+ --track_every 2000 \
164
+ --n_test_batches 50 \
165
+ --reload \
166
+ --num_workers_train 8 \
167
+ --batch_size 512 \
168
+ --batch_size_test 512 \
169
+ --lr 1e-4 \
170
+ --device 0 \
171
+ --seed 2 \
172
+ --no-reload
173
+
174
+
175
+ python -m tasks.image_classification.train \
176
+ --log_dir logs/cifar10-versus-humans/lstm/nlayers=2--d=256--i=64--heads=16--synch=256-512-0-h=64-random-pairing--iters=50x15--backbone=18-1--seed=42 \
177
+ --dataset cifar10 \
178
+ --model lstm \
179
+ --num_layers 2 \
180
+ --d_model 256 \
181
+ --d_input 64 \
182
+ --heads 16 \
183
+ --iterations 50 \
184
+ --dropout 0.0 \
185
+ --positional_embedding_type none \
186
+ --backbone_type resnet18-1 \
187
+ --training_iterations 600001 \
188
+ --warmup_steps 2000 \
189
+ --use_scheduler \
190
+ --scheduler_type cosine \
191
+ --weight_decay 0.0001 \
192
+ --save_every 1000 \
193
+ --track_every 2000 \
194
+ --n_test_batches 50 \
195
+ --reload \
196
+ --num_workers_train 8 \
197
+ --batch_size 512 \
198
+ --batch_size_test 512 \
199
+ --lr 1e-4 \
200
+ --device 0 \
201
+ --seed 42 \
202
+ --no-reload
203
+
204
+
205
+
206
+
207
+
208
+ python -m tasks.image_classification.train \
209
+ --log_dir logs/cifar10-versus-humans/ff/d=256--backbone=18-1--seed=1 \
210
+ --dataset cifar10 \
211
+ --model ff \
212
+ --d_model 256 \
213
+ --memory_hidden_dims 64 \
214
+ --dropout 0.0 \
215
+ --dropout_nlm 0 \
216
+ --backbone_type resnet18-1 \
217
+ --training_iterations 600001 \
218
+ --warmup_steps 1000 \
219
+ --use_scheduler \
220
+ --scheduler_type cosine \
221
+ --weight_decay 0.0001 \
222
+ --save_every 1000 \
223
+ --track_every 2000 \
224
+ --n_test_batches 50 \
225
+ --num_workers_train 8 \
226
+ --batch_size 512 \
227
+ --batch_size_test 512 \
228
+ --lr 1e-4 \
229
+ --device 0 \
230
+ --seed 1
231
+
232
+
233
+ python -m tasks.image_classification.train \
234
+ --log_dir logs/cifar10-versus-humans/ff/d=256--backbone=18-1--seed=2 \
235
+ --dataset cifar10 \
236
+ --model ff \
237
+ --d_model 256 \
238
+ --memory_hidden_dims 64 \
239
+ --dropout 0.0 \
240
+ --dropout_nlm 0 \
241
+ --backbone_type resnet18-1 \
242
+ --training_iterations 600001 \
243
+ --warmup_steps 1000 \
244
+ --use_scheduler \
245
+ --scheduler_type cosine \
246
+ --weight_decay 0.0001 \
247
+ --save_every 1000 \
248
+ --track_every 2000 \
249
+ --n_test_batches 50 \
250
+ --num_workers_train 8 \
251
+ --batch_size 512 \
252
+ --batch_size_test 512 \
253
+ --lr 1e-4 \
254
+ --device 0 \
255
+ --seed 2
256
+
257
+ python -m tasks.image_classification.train \
258
+ --log_dir logs/cifar10-versus-humans/ff/d=256--backbone=18-1--seed=42 \
259
+ --dataset cifar10 \
260
+ --model ff \
261
+ --d_model 256 \
262
+ --memory_hidden_dims 64 \
263
+ --dropout 0.0 \
264
+ --dropout_nlm 0 \
265
+ --backbone_type resnet18-1 \
266
+ --training_iterations 600001 \
267
+ --warmup_steps 1000 \
268
+ --use_scheduler \
269
+ --scheduler_type cosine \
270
+ --weight_decay 0.0001 \
271
+ --save_every 1000 \
272
+ --track_every 2000 \
273
+ --n_test_batches 50 \
274
+ --num_workers_train 8 \
275
+ --batch_size 512 \
276
+ --batch_size_test 512 \
277
+ --lr 1e-4 \
278
+ --device 0 \
279
+ --seed 42
280
+
281
+
282
+
283
+
284
+
285
+
286
+
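The script above repeats one CTM configuration per seed. A hypothetical Python sweep driver, not part of the repository, that launches the same three seeds sequentially (flag values mirror the first command above; the log-directory naming here is illustrative only):

```python
# Hypothetical sweep driver (not in the repo): run the CTM CIFAR-10 config
# for seeds 1, 2 and 42, mirroring the flags of the first command above.
import subprocess

BASE_ARGS = [
    "python", "-m", "tasks.image_classification.train",
    "--model", "ctm", "--dataset", "cifar10",
    "--d_model", "256", "--d_input", "64", "--synapse_depth", "5", "--heads", "16",
    "--n_synch_out", "256", "--n_synch_action", "512", "--n_random_pairing_self", "0",
    "--neuron_select_type", "random-pairing", "--iterations", "50", "--memory_length", "15",
    "--deep_memory", "--memory_hidden_dims", "64", "--dropout", "0.0", "--dropout_nlm", "0",
    "--no-do_normalisation", "--positional_embedding_type", "none",
    "--backbone_type", "resnet18-1", "--training_iterations", "600001",
    "--warmup_steps", "1000", "--use_scheduler", "--scheduler_type", "cosine",
    "--weight_decay", "0.0001", "--save_every", "1000", "--track_every", "2000",
    "--n_test_batches", "50", "--num_workers_train", "8",
    "--batch_size", "512", "--batch_size_test", "512", "--lr", "1e-4", "--device", "0",
]

for seed in (1, 2, 42):
    log_dir = f"logs/cifar10-versus-humans/ctm/seed={seed}"  # illustrative naming
    subprocess.run(BASE_ARGS + ["--log_dir", log_dir, "--seed", str(seed)], check=True)
```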
tasks/image_classification/scripts/train_imagenet.sh ADDED
@@ -0,0 +1,38 @@
1
+ torchrun --standalone --nnodes=1 --nproc_per_node=8 -m tasks.image_classification.train_distributed \
2
+ --log_dir logs/imagenet/d=4096--i=1024--heads=16--sd=8--nlm=64--synch=8192-2048-32-h=64-random-pairing--iters=50x25--backbone=152x4 \
3
+ --model ctm \
4
+ --dataset imagenet \
5
+ --d_model 4096 \
6
+ --d_input 1024 \
7
+ --synapse_depth 8 \
8
+ --heads 16 \
9
+ --n_synch_out 8196 \
10
+ --n_synch_action 2048 \
11
+ --n_random_pairing_self 32 \
12
+ --neuron_select_type random-pairing \
13
+ --iterations 50 \
14
+ --memory_length 25 \
15
+ --deep_memory \
16
+ --memory_hidden_dims 64 \
17
+ --dropout 0.2 \
18
+ --dropout_nlm 0 \
19
+ --no-do_normalisation \
20
+ --positional_embedding_type none \
21
+ --backbone_type resnet152-4 \
22
+ --batch_size 64 \
23
+ --batch_size_test 64 \
24
+ --n_test_batches 200 \
25
+ --lr 5e-4 \
26
+ --gradient_clipping 20 \
27
+ --training_iterations 500001 \
28
+ --save_every 1000 \
29
+ --track_every 5000 \
30
+ --warmup_steps 10000 \
31
+ --use_scheduler \
32
+ --scheduler_type cosine \
33
+ --weight_decay 0.0 \
34
+ --seed 1 \
35
+ --use_amp \
36
+ --reload \
37
+ --num_workers_train 8 \
38
+ --use_custom_sampler
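The distributed ImageNet run uses 8 processes (`--nproc_per_node=8`) with a per-process batch size of 64, so under the usual DDP data sharding (an assumption about how `train_distributed.py` splits batches across ranks) the effective global batch is 64 × 8 = 512. A tiny, hypothetical sketch of that bookkeeping when changing GPU counts:

```python
# Hypothetical helper (not in the repo): keep the effective global batch
# size fixed when changing the process count passed to torchrun.
def per_process_batch(global_batch: int, nproc: int) -> int:
    assert global_batch % nproc == 0, "choose a divisible global batch size"
    return global_batch // nproc

print(per_process_batch(512, 8))  # 64, matching the script above
print(per_process_batch(512, 4))  # 128 if running on 4 GPUs instead
```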
tasks/image_classification/train.py ADDED
@@ -0,0 +1,685 @@
1
+ import argparse
2
+ import os
3
+ import random
4
+
5
+ import matplotlib.pyplot as plt
6
+ import numpy as np
7
+ import seaborn as sns
8
+ sns.set_style('darkgrid')
9
+ import torch
10
+ if torch.cuda.is_available():
11
+ # For faster
12
+ torch.set_float32_matmul_precision('high')
13
+ import torch.nn as nn
14
+ from tqdm.auto import tqdm
15
+
16
+ from data.custom_datasets import ImageNet
17
+ from torchvision import datasets
18
+ from torchvision import transforms
19
+ from tasks.image_classification.imagenet_classes import IMAGENET2012_CLASSES
20
+ from models.ctm import ContinuousThoughtMachine
21
+ from models.lstm import LSTMBaseline
22
+ from models.ff import FFBaseline
23
+ from tasks.image_classification.plotting import plot_neural_dynamics, make_classification_gif
24
+ from utils.housekeeping import set_seed, zip_python_code
25
+ from utils.losses import image_classification_loss # Used by CTM, LSTM
26
+ from utils.schedulers import WarmupCosineAnnealingLR, WarmupMultiStepLR, warmup
27
+
28
+ from autoclip.torch import QuantileClip
29
+
30
+ import gc
31
+ import torchvision
32
+ torchvision.disable_beta_transforms_warning()
33
+
34
+
35
+ import warnings
36
+ warnings.filterwarnings("ignore", message="using precomputed metric; inverse_transform will be unavailable")
37
+ warnings.filterwarnings('ignore', message='divide by zero encountered in power', category=RuntimeWarning)
38
+ warnings.filterwarnings(
39
+ "ignore",
40
+ "Corrupt EXIF data",
41
+ UserWarning,
42
+ r"^PIL\.TiffImagePlugin$" # Using a regular expression to match the module.
43
+ )
44
+ warnings.filterwarnings(
45
+ "ignore",
46
+ "UserWarning: Metadata Warning",
47
+ UserWarning,
48
+ r"^PIL\.TiffImagePlugin$" # Using a regular expression to match the module.
49
+ )
50
+ warnings.filterwarnings(
51
+ "ignore",
52
+ "UserWarning: Truncated File Read",
53
+ UserWarning,
54
+ r"^PIL\.TiffImagePlugin$" # Using a regular expression to match the module.
55
+ )
56
+
57
+
58
+ def parse_args():
59
+ parser = argparse.ArgumentParser()
60
+
61
+ # Model Selection
62
+ parser.add_argument('--model', type=str, default='ctm', choices=['ctm', 'lstm', 'ff'], help='Model type to train.')
63
+
64
+ # Model Architecture
65
+ # Common
66
+ parser.add_argument('--d_model', type=int, default=512, help='Dimension of the model.')
67
+ parser.add_argument('--dropout', type=float, default=0.0, help='Dropout rate.')
68
+ parser.add_argument('--backbone_type', type=str, default='resnet18-4', help='Type of backbone featureiser.')
69
+ # CTM / LSTM specific
70
+ parser.add_argument('--d_input', type=int, default=128, help='Dimension of the input (CTM, LSTM).')
71
+ parser.add_argument('--heads', type=int, default=4, help='Number of attention heads (CTM, LSTM).')
72
+ parser.add_argument('--iterations', type=int, default=75, help='Number of internal ticks (CTM, LSTM).')
73
+ parser.add_argument('--positional_embedding_type', type=str, default='none', help='Type of positional embedding (CTM, LSTM).',
74
+ choices=['none',
75
+ 'learnable-fourier',
76
+ 'multi-learnable-fourier',
77
+ 'custom-rotational'])
78
+ # CTM specific
79
+ parser.add_argument('--synapse_depth', type=int, default=4, help='Depth of U-NET model for synapse. 1=linear, no unet (CTM only).')
80
+ parser.add_argument('--n_synch_out', type=int, default=512, help='Number of neurons to use for output synch (CTM only).')
81
+ parser.add_argument('--n_synch_action', type=int, default=512, help='Number of neurons to use for observation/action synch (CTM only).')
82
+ parser.add_argument('--neuron_select_type', type=str, default='random-pairing', help='Protocol for selecting neuron subset (CTM only).')
83
+ parser.add_argument('--n_random_pairing_self', type=int, default=0, help='Number of neurons paired self-to-self for synch (CTM only).')
84
+ parser.add_argument('--memory_length', type=int, default=25, help='Length of the pre-activation history for NLMS (CTM only).')
85
+ parser.add_argument('--deep_memory', action=argparse.BooleanOptionalAction, default=True, help='Use deep memory (CTM only).')
86
+ parser.add_argument('--memory_hidden_dims', type=int, default=4, help='Hidden dimensions of the memory if using deep memory (CTM only).')
87
+ parser.add_argument('--dropout_nlm', type=float, default=None, help='Dropout rate for NLMs specifically. Unset to match dropout on the rest of the model (CTM only).')
88
+ parser.add_argument('--do_normalisation', action=argparse.BooleanOptionalAction, default=False, help='Apply normalization in NLMs (CTM only).')
89
+ # LSTM specific
90
+ parser.add_argument('--num_layers', type=int, default=2, help='Number of LSTM stacked layers (LSTM only).')
91
+
92
+ # Training
93
+ parser.add_argument('--batch_size', type=int, default=32, help='Batch size for training.')
94
+ parser.add_argument('--batch_size_test', type=int, default=32, help='Batch size for testing.')
95
+ parser.add_argument('--lr', type=float, default=1e-3, help='Learning rate for the model.')
96
+ parser.add_argument('--training_iterations', type=int, default=100001, help='Number of training iterations.')
97
+ parser.add_argument('--warmup_steps', type=int, default=5000, help='Number of warmup steps.')
98
+ parser.add_argument('--use_scheduler', action=argparse.BooleanOptionalAction, default=True, help='Use a learning rate scheduler.')
99
+ parser.add_argument('--scheduler_type', type=str, default='cosine', choices=['multistep', 'cosine'], help='Type of learning rate scheduler.')
100
+ parser.add_argument('--milestones', type=int, default=[8000, 15000, 20000], nargs='+', help='Learning rate scheduler milestones.')
101
+ parser.add_argument('--gamma', type=float, default=0.1, help='Learning rate scheduler gamma for multistep.')
102
+ parser.add_argument('--weight_decay', type=float, default=0.0, help='Weight decay factor.')
103
+ parser.add_argument('--weight_decay_exclusion_list', type=str, nargs='+', default=[], help='List to exclude from weight decay. Typically good: bn, ln, bias, start')
104
+ parser.add_argument('--gradient_clipping', type=float, default=-1, help='Gradient norm clipping value (-1 to disable).')
105
+ parser.add_argument('--do_compile', action=argparse.BooleanOptionalAction, default=False, help='Try to compile model components (backbone, synapses if CTM).')
106
+ parser.add_argument('--num_workers_train', type=int, default=1, help='Num workers training.')
107
+
108
+ # Housekeeping
109
+ parser.add_argument('--log_dir', type=str, default='logs/scratch', help='Directory for logging.')
110
+ parser.add_argument('--dataset', type=str, default='cifar10', help='Dataset to use.')
111
+ parser.add_argument('--data_root', type=str, default='data/', help='Where to save dataset.')
112
+ parser.add_argument('--save_every', type=int, default=1000, help='Save checkpoints every this many iterations.')
113
+ parser.add_argument('--seed', type=int, default=412, help='Random seed.')
114
+ parser.add_argument('--reload', action=argparse.BooleanOptionalAction, default=False, help='Reload from disk?')
115
+ parser.add_argument('--reload_model_only', action=argparse.BooleanOptionalAction, default=False, help='Reload only the model from disk?')
116
+ parser.add_argument('--strict_reload', action=argparse.BooleanOptionalAction, default=True, help='Should use strict reload for model weights.') # Added back
117
+ parser.add_argument('--track_every', type=int, default=1000, help='Track metrics every this many iterations.')
118
+ parser.add_argument('--n_test_batches', type=int, default=20, help='How many minibatches to approx metrics. Set to -1 for full eval')
119
+ parser.add_argument('--device', type=int, nargs='+', default=[-1], help='List of GPU(s) to use. Set to -1 to use CPU.')
120
+ parser.add_argument('--use_amp', action=argparse.BooleanOptionalAction, default=False, help='AMP autocast.')
121
+
122
+
123
+ args = parser.parse_args()
124
+ return args
125
+
126
+
127
+ def get_dataset(dataset, root):
128
+ if dataset=='imagenet':
129
+ dataset_mean = [0.485, 0.456, 0.406]
130
+ dataset_std = [0.229, 0.224, 0.225]
131
+
132
+ normalize = transforms.Normalize(mean=dataset_mean, std=dataset_std)
133
+ train_transform = transforms.Compose([
134
+ transforms.RandomResizedCrop(224),
135
+ transforms.RandomHorizontalFlip(),
136
+ transforms.ToTensor(),
137
+ normalize])
138
+ test_transform = transforms.Compose([
139
+ transforms.Resize(256),
140
+ transforms.CenterCrop(224),
141
+ transforms.ToTensor(),
142
+ normalize])
143
+
144
+ class_labels = list(IMAGENET2012_CLASSES.values())
145
+
146
+ train_data = ImageNet(which_split='train', transform=train_transform)
147
+ test_data = ImageNet(which_split='validation', transform=test_transform)
148
+ elif dataset=='cifar10':
149
+ dataset_mean = [0.49139968, 0.48215827, 0.44653124]
150
+ dataset_std = [0.24703233, 0.24348505, 0.26158768]
151
+ normalize = transforms.Normalize(mean=dataset_mean, std=dataset_std)
152
+ train_transform = transforms.Compose(
153
+ [transforms.AutoAugment(transforms.AutoAugmentPolicy.CIFAR10),
154
+ transforms.ToTensor(),
155
+ normalize,
156
+ ])
157
+
158
+ test_transform = transforms.Compose(
159
+ [transforms.ToTensor(),
160
+ normalize,
161
+ ])
162
+ train_data = datasets.CIFAR10(root, train=True, transform=train_transform, download=True)
163
+ test_data = datasets.CIFAR10(root, train=False, transform=test_transform, download=True)
164
+ class_labels = ['air', 'auto', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
165
+ elif dataset=='cifar100':
166
+ dataset_mean = [0.5070751592371341, 0.48654887331495067, 0.4409178433670344]
167
+ dataset_std = [0.2673342858792403, 0.2564384629170882, 0.27615047132568393]
168
+ normalize = transforms.Normalize(mean=dataset_mean, std=dataset_std)
169
+
170
+ train_transform = transforms.Compose(
171
+ [transforms.AutoAugment(transforms.AutoAugmentPolicy.CIFAR10),
172
+ transforms.ToTensor(),
173
+ normalize,
174
+ ])
175
+ test_transform = transforms.Compose(
176
+ [transforms.ToTensor(),
177
+ normalize,
178
+ ])
179
+ train_data = datasets.CIFAR100(root, train=True, transform=train_transform, download=True)
180
+ test_data = datasets.CIFAR100(root, train=False, transform=test_transform, download=True)
181
+ idx_order = np.argsort(np.array(list(train_data.class_to_idx.values())))
182
+ class_labels = list(np.array(list(train_data.class_to_idx.keys()))[idx_order])
183
+ else:
184
+ raise NotImplementedError
185
+
186
+ return train_data, test_data, class_labels, dataset_mean, dataset_std
187
+
188
+
189
+
190
+ if __name__=='__main__':
191
+
192
+ # Housekeeping
193
+ args = parse_args()
194
+
195
+ set_seed(args.seed, False)
196
+ if not os.path.exists(args.log_dir): os.makedirs(args.log_dir)
197
+
198
+ assert args.dataset in ['cifar10', 'cifar100', 'imagenet']
199
+
200
+ # Data
201
+ train_data, test_data, class_labels, dataset_mean, dataset_std = get_dataset(args.dataset, args.data_root)
202
+
203
+ num_workers_test = 1 # Defaulting to 1, change if needed
204
+ trainloader = torch.utils.data.DataLoader(train_data, batch_size=args.batch_size, shuffle=True, num_workers=args.num_workers_train)
205
+ testloader = torch.utils.data.DataLoader(test_data, batch_size=args.batch_size_test, shuffle=True, num_workers=num_workers_test, drop_last=False)
206
+
207
+ prediction_reshaper = [-1] # Problem specific
208
+ args.out_dims = len(class_labels)
209
+
210
+ # For total reproducibility
211
+ zip_python_code(f'{args.log_dir}/repo_state.zip')
212
+ with open(f'{args.log_dir}/args.txt', 'w') as f:
213
+ print(args, file=f)
214
+
215
+ # Configure device string
216
+ device = f'cuda:{args.device[0]}' if args.device[0] != -1 else 'cpu'
217
+ print(f'Running model {args.model} on {device}')
218
+
219
+ # Build model conditionally
220
+ model = None
221
+ if args.model == 'ctm':
222
+ model = ContinuousThoughtMachine(
223
+ iterations=args.iterations,
224
+ d_model=args.d_model,
225
+ d_input=args.d_input,
226
+ heads=args.heads,
227
+ n_synch_out=args.n_synch_out,
228
+ n_synch_action=args.n_synch_action,
229
+ synapse_depth=args.synapse_depth,
230
+ memory_length=args.memory_length,
231
+ deep_nlms=args.deep_memory,
232
+ memory_hidden_dims=args.memory_hidden_dims,
233
+ do_layernorm_nlm=args.do_normalisation,
234
+ backbone_type=args.backbone_type,
235
+ positional_embedding_type=args.positional_embedding_type,
236
+ out_dims=args.out_dims,
237
+ prediction_reshaper=prediction_reshaper,
238
+ dropout=args.dropout,
239
+ dropout_nlm=args.dropout_nlm,
240
+ neuron_select_type=args.neuron_select_type,
241
+ n_random_pairing_self=args.n_random_pairing_self,
242
+ ).to(device)
243
+ elif args.model == 'lstm':
244
+ model = LSTMBaseline(
245
+ num_layers=args.num_layers,
246
+ iterations=args.iterations,
247
+ d_model=args.d_model,
248
+ d_input=args.d_input,
249
+ heads=args.heads,
250
+ backbone_type=args.backbone_type,
251
+ positional_embedding_type=args.positional_embedding_type,
252
+ out_dims=args.out_dims,
253
+ prediction_reshaper=prediction_reshaper,
254
+ dropout=args.dropout,
255
+ ).to(device)
256
+ elif args.model == 'ff':
257
+ model = FFBaseline(
258
+ d_model=args.d_model,
259
+ backbone_type=args.backbone_type,
260
+ out_dims=args.out_dims,
261
+ dropout=args.dropout,
262
+ ).to(device)
263
+ else:
264
+ raise ValueError(f"Unknown model type: {args.model}")
265
+
266
+
267
+ # For lazy modules so that we can get param count
268
+ pseudo_inputs = train_data.__getitem__(0)[0].unsqueeze(0).to(device)
269
+ model(pseudo_inputs)
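+ # The single dummy forward pass above materialises any lazily-initialised (nn.Lazy*) modules, so the parameter count printed below is accurate.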
270
+
271
+ model.train()
272
+
273
+
274
+ print(f'Total params: {sum(p.numel() for p in model.parameters())}')
275
+ decay_params = []
276
+ no_decay_params = []
277
+ no_decay_names = []
278
+ for name, param in model.named_parameters():
279
+ if not param.requires_grad:
280
+ continue # Skip parameters that don't require gradients
281
+ if any(exclusion_str in name for exclusion_str in args.weight_decay_exclusion_list):
282
+ no_decay_params.append(param)
283
+ no_decay_names.append(name)
284
+ else:
285
+ decay_params.append(param)
286
+ if len(no_decay_names):
287
+ print(f'WARNING, excluding: {no_decay_names}')
288
+
289
+ # Optimizer and scheduler (Common setup)
290
+ if len(no_decay_names) and args.weight_decay!=0:
291
+ optimizer = torch.optim.AdamW([{'params': decay_params, 'weight_decay':args.weight_decay},
292
+ {'params': no_decay_params, 'weight_decay':0}],
293
+ lr=args.lr,
294
+ eps=1e-8 if not args.use_amp else 1e-6)
295
+ else:
296
+ optimizer = torch.optim.AdamW(model.parameters(),
297
+ lr=args.lr,
298
+ eps=1e-8 if not args.use_amp else 1e-6,
299
+ weight_decay=args.weight_decay)
300
+
301
+
302
+ warmup_schedule = warmup(args.warmup_steps)
303
+ scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_schedule.step)
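+ # Fallback scheduler: warmup-only LambdaLR; it is replaced below when --use_scheduler is enabled.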
304
+ if args.use_scheduler:
305
+ if args.scheduler_type == 'multistep':
306
+ scheduler = WarmupMultiStepLR(optimizer, warmup_steps=args.warmup_steps, milestones=args.milestones, gamma=args.gamma)
307
+ elif args.scheduler_type == 'cosine':
308
+ scheduler = WarmupCosineAnnealingLR(optimizer, args.warmup_steps, args.training_iterations, warmup_start_lr=1e-20, eta_min=1e-7)
309
+ else:
310
+ raise NotImplementedError
311
+
312
+
313
+ # Metrics tracking
314
+ start_iter = 0
315
+ train_losses = []
316
+ test_losses = []
317
+ train_accuracies = []
318
+ test_accuracies = []
319
+ iters = []
320
+ # Conditional metrics for CTM/LSTM
321
+ train_accuracies_most_certain = [] if args.model in ['ctm', 'lstm'] else None
322
+ test_accuracies_most_certain = [] if args.model in ['ctm', 'lstm'] else None
323
+
324
+ scaler = torch.amp.GradScaler("cuda" if "cuda" in device else "cpu", enabled=args.use_amp)
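+ # With --use_amp disabled the GradScaler is a no-op, so the scale/step/update calls below behave like a plain optimizer step.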
325
+
326
+ # Reloading logic
327
+ if args.reload:
328
+ checkpoint_path = f'{args.log_dir}/checkpoint.pt'
329
+ if os.path.isfile(checkpoint_path):
330
+ print(f'Reloading from: {checkpoint_path}')
331
+ checkpoint = torch.load(checkpoint_path, map_location=device, weights_only=False)
332
+ if not args.strict_reload: print('WARNING: not using strict reload for model weights!')
333
+ load_result = model.load_state_dict(checkpoint['model_state_dict'], strict=args.strict_reload)
334
+ print(f" Loaded state_dict. Missing: {load_result.missing_keys}, Unexpected: {load_result.unexpected_keys}")
335
+
336
+ if not args.reload_model_only:
337
+ print('Reloading optimizer etc.')
338
+ optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
339
+ scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
340
+ scaler.load_state_dict(checkpoint['scaler_state_dict'])
341
+ start_iter = checkpoint['iteration']
342
+ # Load common metrics
343
+ train_losses = checkpoint['train_losses']
344
+ test_losses = checkpoint['test_losses']
345
+ train_accuracies = checkpoint['train_accuracies']
346
+ test_accuracies = checkpoint['test_accuracies']
347
+ iters = checkpoint['iters']
348
+
349
+ # Load conditional metrics if they exist in checkpoint and are expected for current model
350
+ if args.model in ['ctm', 'lstm']:
351
+ train_accuracies_most_certain = checkpoint['train_accuracies_most_certain']
352
+ test_accuracies_most_certain = checkpoint['test_accuracies_most_certain']
353
+
354
+ else:
355
+ print('Only reloading model!')
356
+
357
+ if 'torch_rng_state' in checkpoint:
358
+ # Reset seeds
359
+ torch.set_rng_state(checkpoint['torch_rng_state'].cpu().byte())
360
+ np.random.set_state(checkpoint['numpy_rng_state'])
361
+ random.setstate(checkpoint['random_rng_state'])
362
+
363
+ del checkpoint
364
+ gc.collect()
365
+ if torch.cuda.is_available():
366
+ torch.cuda.empty_cache()
367
+
368
+ # Conditional Compilation
369
+ if args.do_compile:
370
+ print('Compiling...')
371
+ if hasattr(model, 'backbone'):
372
+ model.backbone = torch.compile(model.backbone, mode='reduce-overhead', fullgraph=True)
373
+
374
+ # Compile synapses only for CTM
375
+ if args.model == 'ctm':
376
+ model.synapses = torch.compile(model.synapses, mode='reduce-overhead', fullgraph=True)
377
+
378
+ # Training
379
+ iterator = iter(trainloader)
380
+
381
+
382
+ with tqdm(total=args.training_iterations, initial=start_iter, leave=False, position=0, dynamic_ncols=True) as pbar:
383
+ for bi in range(start_iter, args.training_iterations):
384
+ current_lr = optimizer.param_groups[-1]['lr']
385
+
386
+ try:
387
+ inputs, targets = next(iterator)
388
+ except StopIteration:
389
+ iterator = iter(trainloader)
390
+ inputs, targets = next(iterator)
391
+
392
+ inputs = inputs.to(device)
393
+ targets = targets.to(device)
394
+
395
+ loss = None
396
+ accuracy = None
397
+ # Model-specific forward and loss calculation
398
+ with torch.autocast(device_type="cuda" if "cuda" in device else "cpu", dtype=torch.float16, enabled=args.use_amp):
399
+ if args.do_compile: # CUDAGraph marking for clean compile
400
+ torch.compiler.cudagraph_mark_step_begin()
401
+
402
+ if args.model == 'ctm':
403
+ predictions, certainties, synchronisation = model(inputs)
404
+ loss, where_most_certain = image_classification_loss(predictions, certainties, targets, use_most_certain=True)
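+ # where_most_certain gives, per sample, the internal tick judged most certain; the line below gathers each sample's prediction at that tick to compute accuracy.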
405
+ accuracy = (predictions.argmax(1)[torch.arange(predictions.size(0), device=predictions.device),where_most_certain] == targets).float().mean().item()
406
+ pbar_desc = f'CTM Loss={loss.item():0.3f}. Acc={accuracy:0.3f}. LR={current_lr:0.6f}. Where_certain={where_most_certain.float().mean().item():0.2f}+-{where_most_certain.float().std().item():0.2f} ({where_most_certain.min().item():d}<->{where_most_certain.max().item():d})'
407
+
408
+ elif args.model == 'lstm':
409
+ predictions, certainties, synchronisation = model(inputs)
410
+ loss, where_most_certain = image_classification_loss(predictions, certainties, targets, use_most_certain=True)
411
+ # Note: use_most_certain=True is also passed for the LSTM; if LSTM training proves unstable it can be set to False, in which case where_most_certain is simply -1 (the final tick).
412
+ accuracy = (predictions.argmax(1)[torch.arange(predictions.size(0), device=predictions.device),where_most_certain] == targets).float().mean().item()
413
+ pbar_desc = f'LSTM Loss={loss.item():0.3f}. Acc={accuracy:0.3f}. LR={current_lr:0.6f}. Where_certain={where_most_certain.float().mean().item():0.2f}+-{where_most_certain.float().std().item():0.2f} ({where_most_certain.min().item():d}<->{where_most_certain.max().item():d})'
414
+
415
+ elif args.model == 'ff':
416
+ predictions = model(inputs)
417
+ loss = nn.CrossEntropyLoss()(predictions, targets)
418
+ accuracy = (predictions.argmax(1) == targets).float().mean().item()
419
+ pbar_desc = f'FF Loss={loss.item():0.3f}. Acc={accuracy:0.3f}. LR={current_lr:0.6f}'
420
+
421
+ scaler.scale(loss).backward()
422
+
423
+ if args.gradient_clipping!=-1:
424
+ scaler.unscale_(optimizer)
425
+ torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=args.gradient_clipping)
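+ # Gradients are unscaled above so the clipping threshold applies to true (unscaled) gradient norms under AMP.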
426
+
427
+ scaler.step(optimizer)
428
+ scaler.update()
429
+ optimizer.zero_grad(set_to_none=True)
430
+ scheduler.step()
431
+
432
+ pbar.set_description(f'Dataset={args.dataset}. Model={args.model}. {pbar_desc}')
433
+
434
+
435
+ # Metrics tracking and plotting (conditional logic needed)
436
+ if (bi % args.track_every == 0 or bi == args.warmup_steps) and (bi != 0 or args.reload_model_only):
437
+
438
+ iters.append(bi)
439
+ current_train_losses = []
440
+ current_test_losses = []
441
+ current_train_accuracies = [] # Holds list of accuracies per tick for CTM/LSTM, single value for FF
442
+ current_test_accuracies = [] # Holds list of accuracies per tick for CTM/LSTM, single value for FF
443
+ current_train_accuracies_most_certain = [] # Only for CTM/LSTM
444
+ current_test_accuracies_most_certain = [] # Only for CTM/LSTM
445
+
446
+
447
+ # Reset BN stats using train mode
448
+ pbar.set_description('Resetting BN')
449
+ model.train()
450
+ for module in model.modules():
451
+ if isinstance(module, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.BatchNorm3d)):
452
+ module.reset_running_stats()
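+ # With stats reset and the model kept in train mode, the forward passes below re-estimate BatchNorm running statistics from the evaluation batches.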
453
+
454
+ pbar.set_description('Tracking: Computing TRAIN metrics')
455
+ with torch.no_grad(): # Should use inference_mode? CTM/LSTM scripts used no_grad
456
+ loader = torch.utils.data.DataLoader(train_data, batch_size=args.batch_size_test, shuffle=True, num_workers=num_workers_test)
457
+ all_targets_list = []
458
+ all_predictions_list = [] # List to store raw predictions (B, C, T) or (B, C)
459
+ all_predictions_most_certain_list = [] # Only for CTM/LSTM
460
+ all_losses = []
461
+
462
+ with tqdm(total=len(loader), initial=0, leave=False, position=1, dynamic_ncols=True) as pbar_inner:
463
+ for inferi, (inputs, targets) in enumerate(loader):
464
+ inputs = inputs.to(device)
465
+ targets = targets.to(device)
466
+ all_targets_list.append(targets.detach().cpu().numpy())
467
+
468
+ # Model-specific forward and loss for evaluation
469
+ if args.model == 'ctm':
470
+ these_predictions, certainties, _ = model(inputs)
471
+ loss, where_most_certain = image_classification_loss(these_predictions, certainties, targets, use_most_certain=True)
472
+ all_predictions_list.append(these_predictions.argmax(1).detach().cpu().numpy()) # Shape (B, T)
473
+ all_predictions_most_certain_list.append(these_predictions.argmax(1)[torch.arange(these_predictions.size(0), device=these_predictions.device), where_most_certain].detach().cpu().numpy()) # Shape (B,)
474
+
475
+ elif args.model == 'lstm':
476
+ these_predictions, certainties, _ = model(inputs)
477
+ loss, where_most_certain = image_classification_loss(these_predictions, certainties, targets, use_most_certain=True)
478
+ all_predictions_list.append(these_predictions.argmax(1).detach().cpu().numpy()) # Shape (B, T)
479
+ all_predictions_most_certain_list.append(these_predictions.argmax(1)[torch.arange(these_predictions.size(0), device=these_predictions.device), where_most_certain].detach().cpu().numpy()) # Shape (B,)
480
+
481
+ elif args.model == 'ff':
482
+ these_predictions = model(inputs)
483
+ loss = nn.CrossEntropyLoss()(these_predictions, targets)
484
+ all_predictions_list.append(these_predictions.argmax(1).detach().cpu().numpy()) # Shape (B,)
485
+
486
+ all_losses.append(loss.item())
487
+
488
+ if args.n_test_batches != -1 and inferi >= args.n_test_batches -1 : break # Check condition >= N-1
489
+ pbar_inner.set_description(f'Computing metrics for train (Batch {inferi+1})')
490
+ pbar_inner.update(1)
491
+
492
+ all_targets = np.concatenate(all_targets_list)
493
+ all_predictions = np.concatenate(all_predictions_list) # Shape (N, T) or (N,)
494
+ train_losses.append(np.mean(all_losses))
495
+
496
+ if args.model in ['ctm', 'lstm']:
497
+ # Accuracies per tick for CTM/LSTM
498
+ current_train_accuracies = np.mean(all_predictions == all_targets[...,np.newaxis], axis=0) # Mean over batch dim -> Shape (T,)
499
+ train_accuracies.append(current_train_accuracies)
500
+ # Most certain accuracy
501
+ all_predictions_most_certain = np.concatenate(all_predictions_most_certain_list)
502
+ current_train_accuracies_most_certain = (all_targets == all_predictions_most_certain).mean()
503
+ train_accuracies_most_certain.append(current_train_accuracies_most_certain)
504
+ else: # FF
505
+ current_train_accuracies = (all_targets == all_predictions).mean() # Shape scalar
506
+ train_accuracies.append(current_train_accuracies)
507
+
508
+ del these_predictions
509
+
510
+
511
+ # Switch to eval mode for test metrics (fixed BN stats)
512
+ model.eval()
513
+ pbar.set_description('Tracking: Computing TEST metrics')
514
+ with torch.inference_mode(): # Use inference_mode for test eval
515
+ loader = torch.utils.data.DataLoader(test_data, batch_size=args.batch_size_test, shuffle=True, num_workers=num_workers_test)
516
+ all_targets_list = []
517
+ all_predictions_list = []
518
+ all_predictions_most_certain_list = [] # Only for CTM/LSTM
519
+ all_losses = []
520
+
521
+ with tqdm(total=len(loader), initial=0, leave=False, position=1, dynamic_ncols=True) as pbar_inner:
522
+ for inferi, (inputs, targets) in enumerate(loader):
523
+ inputs = inputs.to(device)
524
+ targets = targets.to(device)
525
+ all_targets_list.append(targets.detach().cpu().numpy())
526
+
527
+ # Model-specific forward and loss for evaluation
528
+ if args.model == 'ctm':
529
+ these_predictions, certainties, _ = model(inputs)
530
+ loss, where_most_certain = image_classification_loss(these_predictions, certainties, targets, use_most_certain=True)
531
+ all_predictions_list.append(these_predictions.argmax(1).detach().cpu().numpy())
532
+ all_predictions_most_certain_list.append(these_predictions.argmax(1)[torch.arange(these_predictions.size(0), device=these_predictions.device), where_most_certain].detach().cpu().numpy())
533
+
534
+ elif args.model == 'lstm':
535
+ these_predictions, certainties, _ = model(inputs)
536
+ loss, where_most_certain = image_classification_loss(these_predictions, certainties, targets, use_most_certain=True)
537
+ all_predictions_list.append(these_predictions.argmax(1).detach().cpu().numpy())
538
+ all_predictions_most_certain_list.append(these_predictions.argmax(1)[torch.arange(these_predictions.size(0), device=these_predictions.device), where_most_certain].detach().cpu().numpy())
539
+
540
+ elif args.model == 'ff':
541
+ these_predictions = model(inputs)
542
+ loss = nn.CrossEntropyLoss()(these_predictions, targets)
543
+ all_predictions_list.append(these_predictions.argmax(1).detach().cpu().numpy())
544
+
545
+ all_losses.append(loss.item())
546
+
547
+ if args.n_test_batches != -1 and inferi >= args.n_test_batches -1: break
548
+ pbar_inner.set_description(f'Computing metrics for test (Batch {inferi+1})')
549
+ pbar_inner.update(1)
550
+
551
+ all_targets = np.concatenate(all_targets_list)
552
+ all_predictions = np.concatenate(all_predictions_list)
553
+ test_losses.append(np.mean(all_losses))
554
+
555
+ if args.model in ['ctm', 'lstm']:
556
+ current_test_accuracies = np.mean(all_predictions == all_targets[...,np.newaxis], axis=0)
557
+ test_accuracies.append(current_test_accuracies)
558
+ all_predictions_most_certain = np.concatenate(all_predictions_most_certain_list)
559
+ current_test_accuracies_most_certain = (all_targets == all_predictions_most_certain).mean()
560
+ test_accuracies_most_certain.append(current_test_accuracies_most_certain)
561
+ else: # FF
562
+ current_test_accuracies = (all_targets == all_predictions).mean()
563
+ test_accuracies.append(current_test_accuracies)
564
+
565
+ # Plotting (conditional)
566
+ figacc = plt.figure(figsize=(10, 10))
567
+ axacc_train = figacc.add_subplot(211)
568
+ axacc_test = figacc.add_subplot(212)
569
+ cm = sns.color_palette("viridis", as_cmap=True)
570
+
571
+ if args.model in ['ctm', 'lstm']:
572
+ # Plot per-tick accuracy for CTM/LSTM
573
+ train_acc_arr = np.array(train_accuracies) # Shape (N_iters, T)
574
+ test_acc_arr = np.array(test_accuracies) # Shape (N_iters, T)
575
+ num_ticks = train_acc_arr.shape[1]
576
+ for ti in range(num_ticks):
577
+ axacc_train.plot(iters, train_acc_arr[:, ti], color=cm(ti / num_ticks), alpha=0.3)
578
+ axacc_test.plot(iters, test_acc_arr[:, ti], color=cm(ti / num_ticks), alpha=0.3)
579
+ # Plot most certain accuracy
580
+ axacc_train.plot(iters, train_accuracies_most_certain, 'k--', alpha=0.7, label='Most certain')
581
+ axacc_test.plot(iters, test_accuracies_most_certain, 'k--', alpha=0.7, label='Most certain')
582
+ else: # FF
583
+ axacc_train.plot(iters, train_accuracies, 'k-', alpha=0.7, label='Accuracy') # Simple line
584
+ axacc_test.plot(iters, test_accuracies, 'k-', alpha=0.7, label='Accuracy')
585
+
586
+ axacc_train.set_title('Train Accuracy')
587
+ axacc_test.set_title('Test Accuracy')
588
+ axacc_train.legend(loc='lower right')
589
+ axacc_test.legend(loc='lower right')
590
+ axacc_train.set_xlim([0, args.training_iterations])
591
+ axacc_test.set_xlim([0, args.training_iterations])
592
+ if args.dataset=='cifar10':
593
+ axacc_train.set_ylim([0.75, 1])
594
+ axacc_test.set_ylim([0.75, 1])
595
+
596
+
597
+
598
+ figacc.tight_layout()
599
+ figacc.savefig(f'{args.log_dir}/accuracies.png', dpi=150)
600
+ plt.close(figacc)
601
+
602
+ figloss = plt.figure(figsize=(10, 5))
603
+ axloss = figloss.add_subplot(111)
604
+ axloss.plot(iters, train_losses, 'b-', linewidth=1, alpha=0.8, label=f'Train: {train_losses[-1]:.4f}')
605
+ axloss.plot(iters, test_losses, 'r-', linewidth=1, alpha=0.8, label=f'Test: {test_losses[-1]:.4f}')
606
+ axloss.legend(loc='upper right')
607
+ axloss.set_xlim([0, args.training_iterations])
608
+ axloss.set_ylim(bottom=0)
609
+
610
+ figloss.tight_layout()
611
+ figloss.savefig(f'{args.log_dir}/losses.png', dpi=150)
612
+ plt.close(figloss)
613
+
614
+ # Conditional Visualization (Only for CTM/LSTM)
615
+ if args.model in ['ctm', 'lstm']:
616
+ try: # For safety
617
+ inputs_viz, targets_viz = next(iter(testloader)) # Get a fresh batch
618
+ inputs_viz = inputs_viz.to(device)
619
+ targets_viz = targets_viz.to(device)
620
+
621
+ pbar.set_description('Tracking: Processing test data for viz')
622
+ predictions_viz, certainties_viz, _, pre_activations_viz, post_activations_viz, attention_tracking_viz = model(inputs_viz, track=True)
623
+
624
+ att_shape = (model.kv_features.shape[2], model.kv_features.shape[3])
625
+ attention_tracking_viz = attention_tracking_viz.reshape(
626
+ attention_tracking_viz.shape[0],
627
+ attention_tracking_viz.shape[1], -1, att_shape[0], att_shape[1])
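+ # The tracked attention weights come back flattened over the spatial dimension; this reshape restores the backbone's (H, W) feature grid per head for visualisation.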
628
+
629
+ pbar.set_description('Tracking: Neural dynamics plot')
630
+ plot_neural_dynamics(post_activations_viz, 100, args.log_dir, axis_snap=True)
631
+
632
+ imgi = 0 # Visualize the first image in the batch
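+ # Undo the dataset normalisation and move channels last so the image can be rendered in the gif.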
633
+ img_to_gif = np.moveaxis(np.clip(inputs_viz[imgi].detach().cpu().numpy()*np.array(dataset_std).reshape(len(dataset_std), 1, 1) + np.array(dataset_mean).reshape(len(dataset_mean), 1, 1), 0, 1), 0, -1)
634
+
635
+ pbar.set_description('Tracking: Producing attention gif')
636
+ make_classification_gif(img_to_gif,
637
+ targets_viz[imgi].item(),
638
+ predictions_viz[imgi].detach().cpu().numpy(),
639
+ certainties_viz[imgi].detach().cpu().numpy(),
640
+ post_activations_viz[:,imgi],
641
+ attention_tracking_viz[:,imgi],
642
+ class_labels,
643
+ f'{args.log_dir}/{imgi}_attention.gif',
644
+ )
645
+ del predictions_viz, certainties_viz, pre_activations_viz, post_activations_viz, attention_tracking_viz
646
+ except Exception as e:
647
+ print(f"Visualization failed for model {args.model}: {e}")
648
+
649
+
650
+
651
+ gc.collect()
652
+ if torch.cuda.is_available():
653
+ torch.cuda.empty_cache()
654
+ model.train() # Switch back to train mode
655
+
656
+
657
+ # Save model checkpoint (conditional metrics)
658
+ if (bi % args.save_every == 0 or bi == args.training_iterations - 1) and bi != start_iter:
659
+ pbar.set_description('Saving model checkpoint...')
660
+ checkpoint_data = {
661
+ 'model_state_dict': model.state_dict(),
662
+ 'optimizer_state_dict': optimizer.state_dict(),
663
+ 'scheduler_state_dict': scheduler.state_dict(),
664
+ 'scaler_state_dict': scaler.state_dict(),
665
+ 'iteration': bi,
666
+ # Always save these
667
+ 'train_losses': train_losses,
668
+ 'test_losses': test_losses,
669
+ 'train_accuracies': train_accuracies, # This is list of scalars for FF, list of arrays for CTM/LSTM
670
+ 'test_accuracies': test_accuracies, # This is list of scalars for FF, list of arrays for CTM/LSTM
671
+ 'iters': iters,
672
+ 'args': args, # Save args used for this run
673
+ # RNG states
674
+ 'torch_rng_state': torch.get_rng_state(),
675
+ 'numpy_rng_state': np.random.get_state(),
676
+ 'random_rng_state': random.getstate(),
677
+ }
678
+ # Conditionally add metrics specific to CTM/LSTM
679
+ if args.model in ['ctm', 'lstm']:
680
+ checkpoint_data['train_accuracies_most_certain'] = train_accuracies_most_certain
681
+ checkpoint_data['test_accuracies_most_certain'] = test_accuracies_most_certain
682
+
683
+ torch.save(checkpoint_data, f'{args.log_dir}/checkpoint.pt')
684
+
685
+ pbar.update(1)
tasks/image_classification/train_distributed.py ADDED
@@ -0,0 +1,799 @@
 
1
+ import argparse
2
+ import os
3
+ import random
4
+ import time
5
+
6
+ import matplotlib.pyplot as plt
7
+ import numpy as np
8
+ import seaborn as sns
9
+ sns.set_style('darkgrid')
10
+ import torch
11
+ if torch.cuda.is_available():
12
+ # For faster float32 matmuls (TF32) on supported GPUs
13
+ torch.set_float32_matmul_precision('high')
14
+ import torch.nn as nn
15
+ import torch.distributed as dist
16
+ from torch.nn.parallel import DistributedDataParallel as DDP
17
+ from torch.utils.data.distributed import DistributedSampler
18
+ from utils.samplers import FastRandomDistributedSampler
19
+ from tqdm.auto import tqdm
20
+
21
+ from tasks.image_classification.train import get_dataset # Use shared get_dataset
22
+
23
+ # Model Imports
24
+ from models.ctm import ContinuousThoughtMachine
25
+ from models.lstm import LSTMBaseline
26
+ from models.ff import FFBaseline
27
+
28
+ # Plotting/Utils Imports
29
+ from tasks.image_classification.plotting import plot_neural_dynamics, make_classification_gif
30
+ from utils.housekeeping import set_seed, zip_python_code
31
+ from utils.losses import image_classification_loss # For CTM, LSTM
32
+ from utils.schedulers import WarmupCosineAnnealingLR, WarmupMultiStepLR, warmup
33
+
34
+ import torchvision
35
+ torchvision.disable_beta_transforms_warning()
36
+
37
+ import warnings
38
+ warnings.filterwarnings("ignore", message="using precomputed metric; inverse_transform will be unavailable")
39
+ warnings.filterwarnings('ignore', message='divide by zero encountered in power', category=RuntimeWarning)
40
+ warnings.filterwarnings("ignore", message="UserWarning: Metadata Warning, tag 274 had too many entries: 4, expected 1")
41
+ warnings.filterwarnings(
42
+ "ignore",
43
+ "Corrupt EXIF data",
44
+ UserWarning,
45
+ r"^PIL\.TiffImagePlugin$" # Using a regular expression to match the module.
46
+ )
47
+ warnings.filterwarnings(
48
+ "ignore",
49
+ "UserWarning: Metadata Warning",
50
+ UserWarning,
51
+ r"^PIL\.TiffImagePlugin$" # Using a regular expression to match the module.
52
+ )
53
+ warnings.filterwarnings(
54
+ "ignore",
55
+ "UserWarning: Truncated File Read",
56
+ UserWarning,
57
+ r"^PIL\.TiffImagePlugin$" # Using a regular expression to match the module.
58
+ )
59
+
60
+
61
+ def parse_args():
62
+ parser = argparse.ArgumentParser()
63
+
64
+ # Model Selection
65
+ parser.add_argument('--model', type=str, required=True, choices=['ctm', 'lstm', 'ff'], help='Model type to train.')
66
+
67
+ # Model Architecture
68
+ # Common
69
+ parser.add_argument('--d_model', type=int, default=512, help='Dimension of the model.')
70
+ parser.add_argument('--dropout', type=float, default=0.0, help='Dropout rate.')
71
+ parser.add_argument('--backbone_type', type=str, default='resnet18-4', help='Type of backbone featuriser.')
72
+ # CTM / LSTM specific
73
+ parser.add_argument('--d_input', type=int, default=128, help='Dimension of the input (CTM, LSTM).')
74
+ parser.add_argument('--heads', type=int, default=4, help='Number of attention heads (CTM, LSTM).')
75
+ parser.add_argument('--iterations', type=int, default=50, help='Number of internal ticks (CTM, LSTM).')
76
+ parser.add_argument('--positional_embedding_type', type=str, default='none', help='Type of positional embedding (CTM, LSTM).',
77
+ choices=['none',
78
+ 'learnable-fourier',
79
+ 'multi-learnable-fourier',
80
+ 'custom-rotational'])
81
+ # CTM specific
82
+ parser.add_argument('--synapse_depth', type=int, default=4, help='Depth of U-NET model for synapse. 1=linear, no unet (CTM only).')
83
+ parser.add_argument('--n_synch_out', type=int, default=32, help='Number of neurons to use for output synch (CTM only).')
84
+ parser.add_argument('--n_synch_action', type=int, default=32, help='Number of neurons to use for observation/action synch (CTM only).')
85
+ parser.add_argument('--neuron_select_type', type=str, default='first-last', help='Protocol for selecting neuron subset (CTM only).')
86
+ parser.add_argument('--n_random_pairing_self', type=int, default=256, help='Number of neurons paired self-to-self for synch (CTM only).')
87
+ parser.add_argument('--memory_length', type=int, default=25, help='Length of the pre-activation history for NLMS (CTM only).')
88
+ parser.add_argument('--deep_memory', action=argparse.BooleanOptionalAction, default=True, help='Use deep memory (CTM only).')
89
+ parser.add_argument('--memory_hidden_dims', type=int, default=4, help='Hidden dimensions of the memory if using deep memory (CTM only).')
90
+ parser.add_argument('--dropout_nlm', type=float, default=None, help='Dropout rate for NLMs specifically. Unset to match dropout on the rest of the model (CTM only).')
91
+ parser.add_argument('--do_normalisation', action=argparse.BooleanOptionalAction, default=False, help='Apply normalization in NLMs (CTM only).')
92
+ # LSTM specific
93
+ parser.add_argument('--num_layers', type=int, default=2, help='Number of LSTM stacked layers (LSTM only).')
94
+
95
+ # Training
96
+ parser.add_argument('--batch_size', type=int, default=32, help='Batch size for training (per GPU).')
97
+ parser.add_argument('--batch_size_test', type=int, default=32, help='Batch size for testing (per GPU).')
98
+ parser.add_argument('--lr', type=float, default=1e-3, help='Learning rate for the model.')
99
+ parser.add_argument('--training_iterations', type=int, default=100001, help='Number of training iterations.')
100
+ parser.add_argument('--warmup_steps', type=int, default=5000, help='Number of warmup steps.')
101
+ parser.add_argument('--use_scheduler', action=argparse.BooleanOptionalAction, default=True, help='Use a learning rate scheduler.')
102
+ parser.add_argument('--scheduler_type', type=str, default='cosine', choices=['multistep', 'cosine'], help='Type of learning rate scheduler.')
103
+ parser.add_argument('--milestones', type=int, default=[8000, 15000, 20000], nargs='+', help='Learning rate scheduler milestones.')
104
+ parser.add_argument('--gamma', type=float, default=0.1, help='Learning rate scheduler gamma for multistep.')
105
+ parser.add_argument('--weight_decay', type=float, default=0.0, help='Weight decay factor.')
106
+ parser.add_argument('--weight_decay_exclusion_list', type=str, nargs='+', default=[], help='List to exclude from weight decay. Typically good: bn, ln, bias, start')
107
+ parser.add_argument('--gradient_clipping', type=float, default=-1, help='Gradient norm clipping value (-1 to disable).')
108
+ parser.add_argument('--num_workers_train', type=int, default=1, help='Num workers training.')
109
+ parser.add_argument('--use_custom_sampler', action=argparse.BooleanOptionalAction, default=False, help='Use custom fast sampler to avoid reshuffling.')
110
+ parser.add_argument('--do_compile', action=argparse.BooleanOptionalAction, default=False, help='Try to compile model components.')
111
+
112
+ # Housekeeping
113
+ parser.add_argument('--log_dir', type=str, default='logs/scratch', help='Directory for logging.')
114
+ parser.add_argument('--dataset', type=str, default='cifar10', help='Dataset to use.')
115
+ parser.add_argument('--data_root', type=str, default='data/', help='Where to save dataset.')
116
+ parser.add_argument('--save_every', type=int, default=1000, help='Save checkpoints every this many iterations.')
117
+ parser.add_argument('--seed', type=int, default=412, help='Random seed.')
118
+ parser.add_argument('--reload', action=argparse.BooleanOptionalAction, default=False, help='Reload from disk?')
119
+ parser.add_argument('--reload_model_only', action=argparse.BooleanOptionalAction, default=False, help='Reload only the model from disk?')
120
+ parser.add_argument('--strict_reload', action=argparse.BooleanOptionalAction, default=True, help='Should use strict reload for model weights.')
121
+ parser.add_argument('--ignore_metrics_when_reloading', action=argparse.BooleanOptionalAction, default=False, help='Ignore metrics when reloading?')
122
+
123
+ # Tracking
124
+ parser.add_argument('--track_every', type=int, default=1000, help='Track metrics every this many iterations.')
125
+ parser.add_argument('--n_test_batches', type=int, default=20, help='How many minibatches to approx metrics. Set to -1 for full eval')
126
+ parser.add_argument('--plot_indices', type=int, default=[0], nargs='+', help='Which indices in test data to plot?') # Defaulted to 0
127
+
128
+ # Precision
129
+ parser.add_argument('--use_amp', action=argparse.BooleanOptionalAction, default=False, help='AMP autocast.')
130
+ args = parser.parse_args()
131
+ return args
132
+
133
+ # --- DDP Setup Functions ---
134
+ def setup_ddp():
135
+ if 'RANK' not in os.environ:
136
+ # Basic setup for non-distributed run
137
+ os.environ['RANK'] = '0'
138
+ os.environ['WORLD_SIZE'] = '1'
139
+ os.environ['MASTER_ADDR'] = 'localhost'
140
+ os.environ['MASTER_PORT'] = '12355' # Ensure this port is free
141
+ os.environ['LOCAL_RANK'] = '0'
142
+ print("Running in non-distributed mode (simulated DDP setup).")
143
+ # Need to manually init if only 1 process desired for non-GPU testing
144
+ if not torch.cuda.is_available() or int(os.environ['WORLD_SIZE']) == 1:
145
+ dist.init_process_group(backend='gloo') # Gloo backend for CPU
146
+ print("Initialized process group with Gloo backend for single/CPU process.")
147
+ rank = int(os.environ['RANK'])
148
+ world_size = int(os.environ['WORLD_SIZE'])
149
+ local_rank = int(os.environ['LOCAL_RANK'])
150
+ return rank, world_size, local_rank
151
+
152
+
153
+ # Standard DDP setup
154
+ dist.init_process_group(backend='nccl') # 'nccl' for NVIDIA GPUs
155
+ rank = int(os.environ['RANK'])
156
+ world_size = int(os.environ['WORLD_SIZE'])
157
+ local_rank = int(os.environ['LOCAL_RANK'])
158
+ if torch.cuda.is_available():
159
+ torch.cuda.set_device(local_rank)
160
+ print(f"Rank {rank} setup on GPU {local_rank}")
161
+ else:
162
+ print(f"Rank {rank} setup on CPU (GPU not available or requested)")
163
+ return rank, world_size, local_rank
164
+
165
+ def cleanup_ddp():
166
+ if dist.is_initialized():
167
+ dist.destroy_process_group()
168
+ print("DDP cleanup complete.")
169
+
170
+ def is_main_process(rank):
171
+ return rank == 0
172
+ # --- End DDP Setup ---
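+ # Typical launch (assumed invocation; adjust GPU count and flags to your setup):
+ #   torchrun --nproc_per_node=<NUM_GPUS> tasks/image_classification/train_distributed.py --model ctm --dataset imagenet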
173
+
174
+
175
+ if __name__=='__main__':
176
+
177
+ args = parse_args()
178
+
179
+ rank, world_size, local_rank = setup_ddp()
180
+
181
+ set_seed(args.seed + rank, False) # Add rank for different seeds per process
182
+
183
+ # Rank 0 handles directory creation and initial logging
184
+ if is_main_process(rank):
185
+ if not os.path.exists(args.log_dir): os.makedirs(args.log_dir)
186
+ zip_python_code(f'{args.log_dir}/repo_state.zip')
187
+ with open(f'{args.log_dir}/args.txt', 'w') as f:
188
+ print(args, file=f)
189
+ if world_size > 1: dist.barrier() # Sync after rank 0 setup
190
+
191
+
192
+ assert args.dataset in ['cifar10', 'cifar100', 'imagenet']
193
+
194
+ # Data Loading
195
+ train_data, test_data, class_labels, dataset_mean, dataset_std = get_dataset(args.dataset, args.data_root)
196
+
197
+ # Setup Samplers
198
+ # This custom sampler helps when using large batch sizes on CIFAR; the standard DistributedSampler would otherwise trigger a full reshuffle at the end of every (short) epoch.
199
+ train_sampler = (FastRandomDistributedSampler(train_data, num_replicas=world_size, rank=rank, seed=args.seed, epoch_steps=int(10e10))
200
+ if args.use_custom_sampler else
201
+ DistributedSampler(train_data, num_replicas=world_size, rank=rank, shuffle=True, seed=args.seed))
202
+ test_sampler = DistributedSampler(test_data, num_replicas=world_size, rank=rank, shuffle=False, seed=args.seed) # No shuffle needed for test; consistent
203
+
204
+ # Setup DataLoaders
205
+ trainloader = torch.utils.data.DataLoader(train_data, batch_size=args.batch_size, sampler=train_sampler,
206
+ num_workers=args.num_workers_train, pin_memory=True, drop_last=True) # drop_last=True often used in DDP
207
+ testloader = torch.utils.data.DataLoader(test_data, batch_size=args.batch_size_test, sampler=test_sampler,
208
+ num_workers=1, pin_memory=True, drop_last=False)
209
+
210
+
211
+ prediction_reshaper = [-1] # Task specific
212
+ args.out_dims = len(class_labels)
213
+
214
+ # Setup Device
215
+ if torch.cuda.is_available():
216
+ device = torch.device(f'cuda:{local_rank}')
217
+ else:
218
+ device = torch.device('cpu')
219
+ if world_size > 1:
220
+ warnings.warn("Running DDP on CPU is not recommended.")
221
+ if is_main_process(rank):
222
+ print(f'Main process (Rank {rank}): Using device {device}. World size: {world_size}. Model: {args.model}')
223
+
224
+ # --- Model Definition (Conditional) ---
225
+ model_base = None # Base model before DDP wrapping
226
+ if args.model == 'ctm':
227
+ model_base = ContinuousThoughtMachine(
228
+ iterations=args.iterations,
229
+ d_model=args.d_model,
230
+ d_input=args.d_input,
231
+ heads=args.heads,
232
+ n_synch_out=args.n_synch_out,
233
+ n_synch_action=args.n_synch_action,
234
+ synapse_depth=args.synapse_depth,
235
+ memory_length=args.memory_length,
236
+ deep_nlms=args.deep_memory,
237
+ memory_hidden_dims=args.memory_hidden_dims,
238
+ do_layernorm_nlm=args.do_normalisation,
239
+ backbone_type=args.backbone_type,
240
+ positional_embedding_type=args.positional_embedding_type,
241
+ out_dims=args.out_dims,
242
+ prediction_reshaper=prediction_reshaper,
243
+ dropout=args.dropout,
244
+ dropout_nlm=args.dropout_nlm,
245
+ neuron_select_type=args.neuron_select_type,
246
+ n_random_pairing_self=args.n_random_pairing_self,
247
+ ).to(device)
248
+ elif args.model == 'lstm':
249
+ model_base = LSTMBaseline(
250
+ num_layers=args.num_layers,
251
+ iterations=args.iterations,
252
+ d_model=args.d_model,
253
+ d_input=args.d_input,
254
+ heads=args.heads,
255
+ backbone_type=args.backbone_type,
256
+ positional_embedding_type=args.positional_embedding_type,
257
+ out_dims=args.out_dims,
258
+ prediction_reshaper=prediction_reshaper,
259
+ dropout=args.dropout,
260
261
+ ).to(device)
262
+ elif args.model == 'ff':
263
+ model_base = FFBaseline(
264
+ d_model=args.d_model,
265
+ backbone_type=args.backbone_type,
266
+ out_dims=args.out_dims,
267
+ dropout=args.dropout,
268
+ ).to(device)
269
+ else:
270
+ raise ValueError(f"Unknown model type: {args.model}")
271
+
272
+ # Initialize lazy modules if any
273
+ try:
274
+ pseudo_inputs = train_data.__getitem__(0)[0].unsqueeze(0).to(device)
275
+ model_base(pseudo_inputs)
276
+ except Exception as e:
277
+ print(f"Warning: Pseudo forward pass failed: {e}")
278
+
279
+ # Wrap model with DDP
280
+ if device.type == 'cuda' and world_size > 1:
281
+ model = DDP(model_base, device_ids=[local_rank], output_device=local_rank)
282
+ elif device.type == 'cpu' and world_size > 1:
283
+ model = DDP(model_base) # No device_ids for CPU
284
+ else: # Single process run
285
+ model = model_base # No DDP wrapping needed
286
+
287
+ if is_main_process(rank):
288
+ # Access underlying model for param count
289
+ param_count = sum(p.numel() for p in model.module.parameters() if p.requires_grad) if world_size > 1 else sum(p.numel() for p in model.parameters() if p.requires_grad)
290
+ print(f'Total trainable params: {param_count}')
291
+ # --- End Model Definition ---
292
+
293
+
294
+ # Optimizer and scheduler
295
+ # Use model.parameters() directly, DDP handles it
296
+ decay_params = []
297
+ no_decay_params = []
298
+ no_decay_names = []
299
+ for name, param in model.named_parameters():
300
+ if not param.requires_grad:
301
+ continue # Skip parameters that don't require gradients
302
+ if any(exclusion_str in name for exclusion_str in args.weight_decay_exclusion_list):
303
+ no_decay_params.append(param)
304
+ no_decay_names.append(name)
305
+ else:
306
+ decay_params.append(param)
307
+ if len(no_decay_names) and is_main_process(rank):
308
+ print(f'WARNING, excluding: {no_decay_names}')
309
+
310
+ # Optimizer and scheduler (Common setup)
311
+ if len(no_decay_names) and args.weight_decay!=0:
312
+ optimizer = torch.optim.AdamW([{'params': decay_params, 'weight_decay':args.weight_decay},
313
+ {'params': no_decay_params, 'weight_decay':0}],
314
+ lr=args.lr,
315
+ eps=1e-8 if not args.use_amp else 1e-6)
316
+ else:
317
+ optimizer = torch.optim.AdamW(model.parameters(),
318
+ lr=args.lr,
319
+ eps=1e-8 if not args.use_amp else 1e-6,
320
+ weight_decay=args.weight_decay)
321
+
322
+ warmup_schedule = warmup(args.warmup_steps)
323
+ scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_schedule.step)
324
+ if args.use_scheduler:
325
+ if args.scheduler_type == 'multistep':
326
+ scheduler = WarmupMultiStepLR(optimizer, warmup_steps=args.warmup_steps, milestones=args.milestones, gamma=args.gamma)
327
+ elif args.scheduler_type == 'cosine':
328
+ scheduler = WarmupCosineAnnealingLR(optimizer, args.warmup_steps, args.training_iterations, warmup_start_lr=1e-20, eta_min=1e-7)
329
+ else:
330
+ raise NotImplementedError
331
+
332
+
333
+ # Metrics tracking (on Rank 0)
334
+ start_iter = 0
335
+ train_losses = []
336
+ test_losses = []
337
+ train_accuracies = [] # Placeholder for potential detailed accuracy
338
+ test_accuracies = [] # Placeholder for potential detailed accuracy
339
+ # Conditional metrics
340
+ train_accuracies_most_certain = [] if args.model in ['ctm', 'lstm'] else None # Scalar accuracy list
341
+ test_accuracies_most_certain = [] if args.model in ['ctm', 'lstm'] else None # Scalar accuracy list
342
+ train_accuracies_standard = [] if args.model == 'ff' else None # Standard accuracy list for FF
343
+ test_accuracies_standard = [] if args.model == 'ff' else None # Standard accuracy list for FF
344
+ iters = []
345
+
346
+ scaler = torch.amp.GradScaler("cuda" if device.type == 'cuda' else "cpu", enabled=args.use_amp)
347
+ # Reloading Logic
348
+ if args.reload:
349
+ map_location = device # Load directly onto the process's device
350
+ chkpt_path = f'{args.log_dir}/checkpoint.pt'
351
+ if os.path.isfile(chkpt_path):
352
+ print(f'Rank {rank}: Reloading from: {chkpt_path}')
353
+ checkpoint = torch.load(chkpt_path, map_location=map_location, weights_only=False)
354
+
355
+ # Determine underlying model based on whether DDP wrapping occurred
356
+ model_to_load = model.module if isinstance(model, DDP) else model
357
+
358
+ # Handle potential 'module.' prefix in saved state_dict
359
+ state_dict = checkpoint['model_state_dict']
360
+ has_module_prefix = all(k.startswith('module.') for k in state_dict)
361
+ is_wrapped = isinstance(model, DDP)
362
+
363
+ if has_module_prefix and not is_wrapped:
364
+ # Saved with DDP, loading into non-DDP model -> remove prefix
365
+ state_dict = {k.partition('module.')[2]: v for k,v in state_dict.items()}
366
+ elif not has_module_prefix and is_wrapped:
367
+ load_result = model_to_load.load_state_dict(state_dict, strict=args.strict_reload)
368
+ print(f" Loaded state_dict. Missing: {load_result.missing_keys}, Unexpected: {load_result.unexpected_keys}")
369
+ state_dict = None # Prevent loading again
370
+
371
+ if state_dict is not None:
372
+ load_result = model_to_load.load_state_dict(state_dict, strict=args.strict_reload)
373
+ print(f" Loaded state_dict. Missing: {load_result.missing_keys}, Unexpected: {load_result.unexpected_keys}")
374
+
375
+
376
+ if not args.reload_model_only:
377
+ print(f'Rank {rank}: Reloading optimizer, scheduler, scaler, iteration.')
378
+ optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
379
+ scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
380
+ scaler_state_dict = checkpoint['scaler_state_dict']
381
+ if scaler.is_enabled():
382
+ print("Loading non-empty GradScaler state dict.")
383
+ try:
384
+ scaler.load_state_dict(scaler_state_dict)
385
+ except Exception as e:
386
+ print(f"Error loading GradScaler state dict: {e}")
387
+ print("Continuing with a fresh GradScaler state.")
388
+
389
+ start_iter = checkpoint['iteration']
390
+ # Only rank 0 loads metric history
391
+ if is_main_process(rank) and not args.ignore_metrics_when_reloading:
392
+ print(f'Rank {rank}: Reloading metrics history.')
393
+ iters = checkpoint['iters']
394
+ train_losses = checkpoint['train_losses']
395
+ test_losses = checkpoint['test_losses']
396
+ train_accuracies = checkpoint['train_accuracies']
397
+ test_accuracies = checkpoint['test_accuracies']
398
+ if args.model in ['ctm', 'lstm']:
399
+ train_accuracies_most_certain = checkpoint['train_accuracies_most_certain']
400
+ test_accuracies_most_certain = checkpoint['test_accuracies_most_certain']
401
+ elif args.model == 'ff':
402
+ train_accuracies_standard = checkpoint['train_accuracies_standard']
403
+ test_accuracies_standard = checkpoint['test_accuracies_standard']
404
+ elif is_main_process(rank) and args.ignore_metrics_when_reloading:
405
+ print(f'Rank {rank}: Ignoring metrics history upon reload.')
406
+
407
+ else:
408
+ print(f'Rank {rank}: Only reloading model weights!')
409
+
410
+ # Load RNG states
411
+ if is_main_process(rank) and 'torch_rng_state' in checkpoint and not args.reload_model_only:
412
+ print(f'Rank {rank}: Loading RNG states (may need DDP adaptation for full reproducibility).')
413
+ torch.set_rng_state(checkpoint['torch_rng_state'].cpu()) # Load CPU state
414
+ # Add CUDA state loading if needed, ensuring correct device handling
415
+ np.random.set_state(checkpoint['numpy_rng_state'])
416
+ random.setstate(checkpoint['random_rng_state'])
417
+
418
+ del checkpoint
419
+ if torch.cuda.is_available(): torch.cuda.empty_cache()
420
+ print(f"Rank {rank}: Reload finished, starting from iteration {start_iter}")
421
+ else:
422
+ print(f"Rank {rank}: Checkpoint not found at {chkpt_path}, starting from scratch.")
423
+ if world_size > 1: dist.barrier() # Sync after loading
424
+
425
+
426
+ # Conditional Compilation
427
+ if args.do_compile:
428
+ if is_main_process(rank): print('Compiling model components...')
429
+ # Compile on the underlying model if wrapped
430
+ model_to_compile = model.module if isinstance(model, DDP) else model
431
+ if hasattr(model_to_compile, 'backbone'):
432
+ model_to_compile.backbone = torch.compile(model_to_compile.backbone, mode='reduce-overhead', fullgraph=True)
433
+ if args.model == 'ctm':
434
+ if hasattr(model_to_compile, 'synapses'):
435
+ model_to_compile.synapses = torch.compile(model_to_compile.synapses, mode='reduce-overhead', fullgraph=True)
436
+ if world_size > 1: dist.barrier() # Sync after compilation
437
+ if is_main_process(rank): print('Compilation finished.')
438
+
439
+
440
+ # --- Training Loop ---
441
+ model.train() # Ensure model is in train mode
442
+ pbar = tqdm(total=args.training_iterations, initial=start_iter, leave=False, position=0, dynamic_ncols=True, disable=not is_main_process(rank))
443
+
444
+ iterator = iter(trainloader)
445
+
446
+ for bi in range(start_iter, args.training_iterations):
447
+
448
+ # Set sampler epoch (important for shuffling in DistributedSampler)
449
+ if not args.use_custom_sampler and hasattr(train_sampler, 'set_epoch'):
450
+ train_sampler.set_epoch(bi)
451
+
452
+ current_lr = optimizer.param_groups[-1]['lr']
453
+
454
+ time_start_data = time.time()
455
+ try:
456
+ inputs, targets = next(iterator)
457
+ except StopIteration:
458
+ # Reset iterator - set_epoch handles shuffling if needed
459
+ iterator = iter(trainloader)
460
+ inputs, targets = next(iterator)
461
+
462
+
463
+ inputs = inputs.to(device, non_blocking=True)
464
+ targets = targets.to(device, non_blocking=True)
465
+ time_end_data = time.time()
466
+
467
+ loss = None
468
+ # Model-specific forward and loss calculation
469
+ time_start_forward = time.time()
470
+ with torch.autocast(device_type="cuda" if device.type == 'cuda' else "cpu", dtype=torch.float16, enabled=args.use_amp):
471
+ if args.do_compile:
472
+ torch.compiler.cudagraph_mark_step_begin()
473
+
474
+ if args.model == 'ctm':
475
+ predictions, certainties, synchronisation = model(inputs)
476
+ loss, where_most_certain = image_classification_loss(predictions, certainties, targets, use_most_certain=True)
477
+ elif args.model == 'lstm':
478
+ predictions, certainties, synchronisation = model(inputs)
479
+ loss, where_most_certain = image_classification_loss(predictions, certainties, targets, use_most_certain=True)
480
+ elif args.model == 'ff':
481
+ predictions = model(inputs) # FF returns only predictions
482
+ loss = nn.CrossEntropyLoss()(predictions, targets)
483
+ where_most_certain = None # Not applicable for FF standard loss
484
+ time_end_forward = time.time()
485
+ time_start_backward = time.time()
486
+
487
+ scaler.scale(loss).backward() # DDP handles gradient synchronization
488
+ time_end_backward = time.time()
489
+
490
+ if args.gradient_clipping!=-1:
491
+ scaler.unscale_(optimizer)
492
+ # Clip gradients across all parameters controlled by the optimizer
493
+ torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=args.gradient_clipping)
494
+
495
+ scaler.step(optimizer)
496
+ scaler.update()
497
+ optimizer.zero_grad(set_to_none=True)
498
+ scheduler.step()
499
+
500
+ # --- Aggregation and Logging (Rank 0) ---
501
+ # Aggregate loss for logging
502
+ loss_log = loss.detach() # Use detached loss for aggregation
503
+ if world_size > 1: dist.all_reduce(loss_log, op=dist.ReduceOp.AVG)
504
+
505
+ if is_main_process(rank):
506
+ # Calculate accuracy locally on rank 0 for description (approximate)
507
+ # Note: This uses rank 0's batch, not aggregated accuracy
508
+ accuracy_local = 0.0
509
+ if args.model in ['ctm', 'lstm']:
510
+ accuracy_local = (predictions.argmax(1)[torch.arange(predictions.size(0), device=device), where_most_certain] == targets).float().mean().item()
511
+ where_certain_tensor = where_most_certain.float() # Use rank 0's tensor for stats
512
+ pbar_desc = f'Timing; d={(time_end_data-time_start_data):0.3f}, f={(time_end_forward-time_start_forward):0.3f}, b={(time_end_backward-time_start_backward):0.3f}. Loss(avg)={loss_log.item():.3f} Acc(loc)={accuracy_local:.3f} LR={current_lr:.6f} WhereCert(loc)={where_certain_tensor.mean().item():.2f}'
513
+ elif args.model == 'ff':
514
+ accuracy_local = (predictions.argmax(1) == targets).float().mean().item()
515
+ pbar_desc = f'Timing; d={(time_end_data-time_start_data):0.3f}, f={(time_end_forward-time_start_forward):0.3f}, b={(time_end_backward-time_start_backward):0.3f}. Loss(avg)={loss_log.item():.3f} Acc(loc)={accuracy_local:.3f} LR={current_lr:.6f}'
516
+
517
+ pbar.set_description(f'{args.model.upper()} {pbar_desc}')
518
+ # --- End Aggregation and Logging ---
519
+
520
+
521
+ # --- Evaluation and Plotting (Rank 0 + Aggregation) ---
522
+ if bi % args.track_every == 0 and (bi != 0 or args.reload_model_only):
523
+
524
+ model.eval()
525
+ with torch.inference_mode():
526
+
527
+
528
+ # --- Distributed Evaluation ---
529
+ iters.append(bi)
530
+
531
+ # TRAIN METRICS
532
+ total_train_loss = torch.tensor(0.0, device=device)
533
+ total_train_correct_certain = torch.tensor(0.0, device=device) # CTM/LSTM
534
+ total_train_correct_standard = torch.tensor(0.0, device=device) # FF
535
+ total_train_samples = torch.tensor(0.0, device=device)
536
+
537
+ # Use a sampler for evaluation to ensure non-overlapping data if needed
538
+ train_eval_sampler = DistributedSampler(train_data, num_replicas=world_size, rank=rank, shuffle=False)
539
+ train_eval_loader = torch.utils.data.DataLoader(train_data, batch_size=args.batch_size_test, sampler=train_eval_sampler, num_workers=1, pin_memory=True)
540
+
541
+ pbar_inner_desc = 'Eval Train (Rank 0)' if is_main_process(rank) else None
542
+ with tqdm(total=len(train_eval_loader), desc=pbar_inner_desc, leave=False, position=1, dynamic_ncols=True, disable=not is_main_process(rank)) as pbar_inner:
543
+ for inferi, (inputs, targets) in enumerate(train_eval_loader):
544
+ inputs = inputs.to(device, non_blocking=True)
545
+ targets = targets.to(device, non_blocking=True)
546
+
547
+ loss_eval = None
548
+ if args.model == 'ctm':
549
+ predictions, certainties, _ = model(inputs)
550
+ loss_eval, where_most_certain = image_classification_loss(predictions, certainties, targets, use_most_certain=True)
551
+ preds_eval = predictions.argmax(1)[torch.arange(predictions.size(0), device=device), where_most_certain]
552
+ total_train_correct_certain += (preds_eval == targets).sum()
553
+ elif args.model == 'lstm':
554
+ predictions, certainties, _ = model(inputs)
555
+ loss_eval, where_most_certain = image_classification_loss(predictions, certainties, targets, use_most_certain=True)
556
+ preds_eval = predictions.argmax(1)[torch.arange(predictions.size(0), device=device), where_most_certain]
557
+ total_train_correct_certain += (preds_eval == targets).sum()
558
+ elif args.model == 'ff':
559
+ predictions = model(inputs)
560
+ loss_eval = nn.CrossEntropyLoss()(predictions, targets)
561
+ preds_eval = predictions.argmax(1)
562
+ total_train_correct_standard += (preds_eval == targets).sum()
563
+
564
+ total_train_loss += loss_eval * inputs.size(0)
565
+ total_train_samples += inputs.size(0)
566
+
567
+ if args.n_test_batches != -1 and inferi >= args.n_test_batches -1: break
568
+ pbar_inner.update(1)
569
+
570
+ # Aggregate Train Metrics
571
+ if world_size > 1:
572
+ dist.all_reduce(total_train_loss, op=dist.ReduceOp.SUM)
573
+ dist.all_reduce(total_train_correct_certain, op=dist.ReduceOp.SUM)
574
+ dist.all_reduce(total_train_correct_standard, op=dist.ReduceOp.SUM)
575
+ dist.all_reduce(total_train_samples, op=dist.ReduceOp.SUM)
576
+
577
+ # Calculate final Train metrics on Rank 0
578
+ if is_main_process(rank) and total_train_samples > 0:
579
+ avg_train_loss = total_train_loss.item() / total_train_samples.item()
580
+ train_losses.append(avg_train_loss)
581
+ if args.model in ['ctm', 'lstm']:
582
+ avg_train_acc_certain = total_train_correct_certain.item() / total_train_samples.item()
583
+ train_accuracies_most_certain.append(avg_train_acc_certain)
584
+ elif args.model == 'ff':
585
+ avg_train_acc_standard = total_train_correct_standard.item() / total_train_samples.item()
586
+ train_accuracies_standard.append(avg_train_acc_standard)
587
+ print(f"Iter {bi} Train Metrics (Agg): Loss={avg_train_loss:.4f}")
588
+
589
+ # TEST METRICS
590
+ total_test_loss = torch.tensor(0.0, device=device)
591
+ total_test_correct_certain = torch.tensor(0.0, device=device) # CTM/LSTM
592
+ total_test_correct_standard = torch.tensor(0.0, device=device) # FF
593
+ total_test_samples = torch.tensor(0.0, device=device)
594
+
595
+ pbar_inner_desc = 'Eval Test (Rank 0)' if is_main_process(rank) else None
596
+ with tqdm(total=len(testloader), desc=pbar_inner_desc, leave=False, position=1, dynamic_ncols=True, disable=not is_main_process(rank)) as pbar_inner:
597
+ for inferi, (inputs, targets) in enumerate(testloader): # Testloader already uses sampler
598
+ inputs = inputs.to(device, non_blocking=True)
599
+ targets = targets.to(device, non_blocking=True)
600
+
601
+ loss_eval = None
602
+ if args.model == 'ctm':
603
+ predictions, certainties, _ = model(inputs)
604
+ loss_eval, where_most_certain = image_classification_loss(predictions, certainties, targets, use_most_certain=True)
605
+ preds_eval = predictions.argmax(1)[torch.arange(predictions.size(0), device=device), where_most_certain]
606
+ total_test_correct_certain += (preds_eval == targets).sum()
607
+ elif args.model == 'lstm':
608
+ predictions, certainties, _ = model(inputs)
609
+ loss_eval, where_most_certain = image_classification_loss(predictions, certainties, targets, use_most_certain=True)
610
+ preds_eval = predictions.argmax(1)[torch.arange(predictions.size(0), device=device), where_most_certain]
611
+ total_test_correct_certain += (preds_eval == targets).sum()
612
+ elif args.model == 'ff':
613
+ predictions = model(inputs)
614
+ loss_eval = nn.CrossEntropyLoss()(predictions, targets)
615
+ preds_eval = predictions.argmax(1)
616
+ total_test_correct_standard += (preds_eval == targets).sum()
617
+
618
+ total_test_loss += loss_eval * inputs.size(0)
619
+ total_test_samples += inputs.size(0)
620
+
621
+ if args.n_test_batches != -1 and inferi >= args.n_test_batches -1: break
622
+ pbar_inner.update(1)
623
+
624
+ # Aggregate Test Metrics
625
+ if world_size > 1:
626
+ dist.all_reduce(total_test_loss, op=dist.ReduceOp.SUM)
627
+ dist.all_reduce(total_test_correct_certain, op=dist.ReduceOp.SUM)
628
+ dist.all_reduce(total_test_correct_standard, op=dist.ReduceOp.SUM)
629
+ dist.all_reduce(total_test_samples, op=dist.ReduceOp.SUM)
630
+
631
+ # Calculate and Plot final Test metrics on Rank 0
632
+ if is_main_process(rank) and total_test_samples > 0:
633
+ avg_test_loss = total_test_loss.item() / total_test_samples.item()
634
+ test_losses.append(avg_test_loss)
635
+ acc_label = ''
636
+ acc_val = 0.0
637
+ if args.model in ['ctm', 'lstm']:
638
+ avg_test_acc_certain = total_test_correct_certain.item() / total_test_samples.item()
639
+ test_accuracies_most_certain.append(avg_test_acc_certain)
640
+ acc_label = f'Most certain ({avg_test_acc_certain:.3f})'
641
+ acc_val = avg_test_acc_certain
642
+ elif args.model == 'ff':
643
+ avg_test_acc_standard = total_test_correct_standard.item() / total_test_samples.item()
644
+ test_accuracies_standard.append(avg_test_acc_standard)
645
+ acc_label = f'Standard Acc ({avg_test_acc_standard:.3f})'
646
+ acc_val = avg_test_acc_standard
647
+ print(f"Iter {bi} Test Metrics (Agg): Loss={avg_test_loss:.4f}, Acc={acc_val:.4f}\n")
648
+
649
+
650
+ # --- Plotting ---
651
+ figacc = plt.figure(figsize=(10, 10))
652
+ axacc_train = figacc.add_subplot(211)
653
+ axacc_test = figacc.add_subplot(212)
654
+
655
+ if args.model in ['ctm', 'lstm']:
656
+ axacc_train.plot(iters, train_accuracies_most_certain, 'k-', alpha=0.9, label=f'Most certain ({train_accuracies_most_certain[-1]:.3f})')
657
+ axacc_test.plot(iters, test_accuracies_most_certain, 'k-', alpha=0.9, label=acc_label)
658
+ elif args.model == 'ff':
659
+ axacc_train.plot(iters, train_accuracies_standard, 'k-', alpha=0.9, label=f'Standard Acc ({train_accuracies_standard[-1]:.3f})')
660
+ axacc_test.plot(iters, test_accuracies_standard, 'k-', alpha=0.9, label=acc_label)
661
+
662
+ axacc_train.set_title('Train Accuracy (Aggregated)')
663
+ axacc_test.set_title('Test Accuracy (Aggregated)')
664
+ axacc_train.legend(loc='lower right')
665
+ axacc_test.legend(loc='lower right')
666
+ axacc_train.set_xlim([0, args.training_iterations])
667
+ axacc_test.set_xlim([0, args.training_iterations])
668
+
669
+ # Keep dataset specific ylim adjustments if needed
670
+ if args.dataset == 'imagenet':
671
+ # For easy comparison when training
672
+ train_ylim_set = False
673
+ if args.model in ['ctm', 'lstm'] and len(train_accuracies_most_certain)>0 and np.any(np.array(train_accuracies_most_certain)>0.4): train_ylim_set=True; axacc_train.set_ylim([0.4, 1])
674
+ if args.model == 'ff' and len(train_accuracies_standard)>0 and np.any(np.array(train_accuracies_standard)>0.4): train_ylim_set=True; axacc_train.set_ylim([0.4, 1])
675
+
676
+ test_ylim_set = False
677
+ if args.model in ['ctm', 'lstm'] and len(test_accuracies_most_certain)>0 and np.any(np.array(test_accuracies_most_certain)>0.3): test_ylim_set=True; axacc_test.set_ylim([0.3, 0.8])
678
+ if args.model == 'ff' and len(test_accuracies_standard)>0 and np.any(np.array(test_accuracies_standard)>0.3): test_ylim_set=True; axacc_test.set_ylim([0.3, 0.8])
679
+
680
+
681
+ figacc.tight_layout()
682
+ figacc.savefig(f'{args.log_dir}/accuracies.png', dpi=150)
683
+ plt.close(figacc)
684
+
685
+ # Loss Plot
686
+ figloss = plt.figure(figsize=(10, 5))
687
+ axloss = figloss.add_subplot(111)
688
+ axloss.plot(iters, train_losses, 'b-', linewidth=1, alpha=0.8, label=f'Train (Aggregated): {train_losses[-1]:.4f}')
689
+ axloss.plot(iters, test_losses, 'r-', linewidth=1, alpha=0.8, label=f'Test (Aggregated): {test_losses[-1]:.4f}')
690
+ axloss.legend(loc='upper right')
691
+ axloss.set_xlabel("Iteration")
692
+ axloss.set_ylabel("Loss")
693
+ axloss.set_xlim([0, args.training_iterations])
694
+ axloss.set_ylim(bottom=0)
695
+ figloss.tight_layout()
696
+ figloss.savefig(f'{args.log_dir}/losses.png', dpi=150)
697
+ plt.close(figloss)
698
+ # --- End Plotting ---
699
+
700
+ # Visualization on Rank 0
701
+ if is_main_process(rank) and args.model in ['ctm', 'lstm']:
702
+ try:
703
+ model_module = model.module if isinstance(model, DDP) else model # Get underlying model
704
+ # Simplified viz: use first batch from testloader
705
+ inputs_viz, targets_viz = next(iter(testloader))
706
+ inputs_viz = inputs_viz.to(device)
707
+ targets_viz = targets_viz.to(device)
708
+
709
+ pbar.set_description('Tracking (Rank 0): Viz Fwd Pass')
710
+ predictions_viz, certainties_viz, _, pre_activations_viz, post_activations_viz, attention_tracking_viz = model_module(inputs_viz, track=True)
711
+
712
+ att_shape = (model_module.kv_features.shape[2], model_module.kv_features.shape[3])
713
+ attention_tracking_viz = attention_tracking_viz.reshape(
714
+ attention_tracking_viz.shape[0],
715
+ attention_tracking_viz.shape[1], -1, att_shape[0], att_shape[1])
716
+
717
+
718
+ pbar.set_description('Tracking (Rank 0): Dynamics Plot')
719
+ plot_neural_dynamics(post_activations_viz, 100, args.log_dir, axis_snap=True)
720
+
721
+ # Plot specific indices from test_data directly
722
+ pbar.set_description('Tracking (Rank 0): GIF Generation')
723
+ for plot_idx in args.plot_indices:
724
+ try:
725
+ if plot_idx < len(test_data):
726
+ inputs_plot, target_plot = test_data.__getitem__(plot_idx)
727
+ inputs_plot = inputs_plot.unsqueeze(0).to(device)
728
+
729
+ preds_plot, certs_plot, _, _, posts_plot, atts_plot = model_module(inputs_plot, track=True)
730
+ atts_plot = atts_plot.reshape(atts_plot.shape[0], atts_plot.shape[1], -1, att_shape[0], att_shape[1])
731
+
732
+
733
+ img_gif = np.moveaxis(np.clip(inputs_plot[0].detach().cpu().numpy()*np.array(dataset_std).reshape(len(dataset_std), 1, 1) + np.array(dataset_mean).reshape(len(dataset_mean), 1, 1), 0, 1), 0, -1)
734
+
735
+ make_classification_gif(img_gif, target_plot, preds_plot[0].detach().cpu().numpy(), certs_plot[0].detach().cpu().numpy(),
736
+ posts_plot[:,0], atts_plot[:,0] if atts_plot is not None else None, class_labels,
737
+ f'{args.log_dir}/idx{plot_idx}_attention.gif')
738
+ else:
739
+ print(f"Warning: Plot index {plot_idx} out of range for test dataset size {len(test_data)}.")
740
+ except Exception as e_gif:
741
+ print(f"Rank 0 GIF generation failed for index {plot_idx}: {e_gif}")
742
+
743
+ except Exception as e_viz:
744
+ print(f"Rank 0 visualization failed: {e_viz}")
745
+
746
+
747
+
748
+ if world_size > 1: dist.barrier() # Sync after evaluation block
749
+ model.train() # Set back to train mode
750
+ # --- End Evaluation Block ---
751
+
752
+
753
+ # --- Checkpointing (Rank 0) ---
754
+ if (bi % args.save_every == 0 or bi == args.training_iterations - 1) and bi != start_iter and is_main_process(rank):
755
+ pbar.set_description('Rank 0: Saving checkpoint...')
756
+ save_path = f'{args.log_dir}/checkpoint.pt'
757
+ # Access underlying model state dict if DDP is used
758
+ model_state_to_save = model.module.state_dict() if isinstance(model, DDP) else model.state_dict()
759
+
760
+ save_dict = {
761
+ 'model_state_dict': model_state_to_save,
762
+ 'optimizer_state_dict': optimizer.state_dict(),
763
+ 'scheduler_state_dict': scheduler.state_dict(),
764
+ 'scaler_state_dict':scaler.state_dict(),
765
+ 'iteration': bi,
766
+ 'train_losses': train_losses,
767
+ 'test_losses': test_losses,
768
+ 'iters': iters,
769
+ 'args': args,
770
+ 'torch_rng_state': torch.get_rng_state(), # CPU state
771
+ 'numpy_rng_state': np.random.get_state(),
772
+ 'random_rng_state': random.getstate(),
773
+ # Include conditional metrics
774
+ 'train_accuracies': train_accuracies, # Placeholder
775
+ 'test_accuracies': test_accuracies, # Placeholder
776
+ }
777
+ if args.model in ['ctm', 'lstm']:
778
+ save_dict['train_accuracies_most_certain'] = train_accuracies_most_certain
779
+ save_dict['test_accuracies_most_certain'] = test_accuracies_most_certain
780
+ elif args.model == 'ff':
781
+ save_dict['train_accuracies_standard'] = train_accuracies_standard
782
+ save_dict['test_accuracies_standard'] = test_accuracies_standard
783
+
784
+ torch.save(save_dict , save_path)
785
+ pbar.set_description(f"Rank 0: Checkpoint saved to {save_path}")
786
+ # --- End Checkpointing ---
787
+
788
+
789
+ if world_size > 1: dist.barrier() # Sync before next iteration
790
+
791
+ # Update pbar on Rank 0
792
+ if is_main_process(rank):
793
+ pbar.update(1)
794
+ # --- End Training Loop ---
795
+
796
+ if is_main_process(rank):
797
+ pbar.close()
798
+
799
+ cleanup_ddp() # Cleanup DDP resources
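As an aside, the evaluation block above follows a standard distributed pattern: each rank accumulates local sums, the sums are all-reduced, and the final metric is computed from the totals. A minimal sketch of that pattern, assuming `torch.distributed` is already initialised; the function and tensor names are illustrative and not part of this repository:
```
import torch
import torch.distributed as dist

def aggregate_mean(total: torch.Tensor, count: torch.Tensor) -> float:
    """Sum per-rank partial sums across ranks, then divide once (illustrative names)."""
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(total, op=dist.ReduceOp.SUM)
        dist.all_reduce(count, op=dist.ReduceOp.SUM)
    return (total / count).item()

# e.g. accuracy = aggregate_mean(correct_predictions_sum, num_samples)
```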
tasks/mazes/README.md ADDED
@@ -0,0 +1,10 @@
1
+ # Mazes
2
+
3
+ This folder contains code for training and analysing the 2D maze-solving experiments.
4
+
5
+
6
+ ## Training
7
+ To run the maze training that we used for the paper, run the following command from the parent directory:
8
+ ```
9
+ python -m tasks.mazes.train --d_model 2048 --d_input 512 --synapse_depth 4 --heads 8 --n_synch_out 64 --n_synch_action 32 --neuron_select_type first-last --iterations 75 --memory_length 25 --deep_memory --memory_hidden_dims 32 --dropout 0.1 --no-do_normalisation --positional_embedding_type none --backbone_type resnet34-2 --batch_size 64 --batch_size_test 64 --lr 1e-4 --training_iterations 1000001 --warmup_steps 10000 --use_scheduler --scheduler_type cosine --weight_decay 0.0 --log_dir logs/mazes/d=2048--i=512--h=8--ns=64-32--iters=75x25--h=32--drop=0.1--pos=none--back=34-2--seed=42 --dataset mazes-medium --save_every 2000 --track_every 5000 --seed 42 --n_test_batches 50
10
+ ```
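For reference, the output head size implied by this command follows from the route length and the five per-step actions used by the maze task (see `tasks/mazes/train.py`). A purely illustrative sketch of that arithmetic:
```
# Each route step is one of 5 actions (up, down, left, right, wait), so a route
# of length L needs L * 5 logits, later reshaped to (L, 5) for the loss.
maze_route_length = 100                       # default, also set in train_ctm.sh
prediction_reshaper = [maze_route_length, 5]  # reshape target: (route length, actions)
out_dims = maze_route_length * 5              # 500 logits per maze
```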
tasks/mazes/analysis/README.md ADDED
@@ -0,0 +1,10 @@
1
+ # Analysis
2
+
3
+ This folder contains analysis code for the 2D maze experiments.
4
+
5
+ To run the maze analysis, run the following command from the parent directory:
6
+ ```
7
+ python -m tasks.mazes.analysis.run --actions viz --checkpoint checkpoints/mazes/ctm_mazeslarge_D=2048_T=75_M=25.pt
8
+ ```
9
+
10
+ You will need to download the checkpoint from here: https://drive.google.com/file/d/1vGiMaQCxzKVT68SipxDCW0W5n5jjEQnC/view?usp=drive_link . Extract it to the appropriate directory: `checkpoints/mazes/...`. Otherwise, use your own checkpoint after training.
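For orientation, the analysis script below decodes the CTM's maze predictions by picking the most-certain internal tick and taking a per-step argmax over the five actions. A minimal sketch of that decoding, with assumed tensor shapes and an illustrative function name (not part of the repo):
```
# predictions: (batch, route_len * 5, internal_ticks); certainties: (batch, 2, internal_ticks)
import torch

def decode_route(predictions: torch.Tensor, certainties: torch.Tensor, idx: int = 0) -> torch.Tensor:
    most_certain_tick = certainties[idx, 1].argmax().item()
    step_logits = predictions[idx, :, most_certain_tick].reshape(-1, 5)
    return step_logits.argmax(-1)  # per-step actions: 0=up, 1=down, 2=left, 3=right, 4=stay
```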
tasks/mazes/analysis/run.py ADDED
@@ -0,0 +1,407 @@
1
+ import torch
2
+ import numpy as np
3
+ np.seterr(divide='ignore', invalid='warn') # Keep specific numpy error settings
4
+ import matplotlib as mpl
5
+ mpl.use('Agg') # Use Agg backend for matplotlib (important to set before importing pyplot)
6
+ import matplotlib.pyplot as plt
7
+ import seaborn as sns
8
+ sns.set_style('darkgrid') # Keep seaborn style
9
+ import os
10
+ import argparse
11
+ import cv2
12
+ import imageio # Used for saving GIFs in viz
13
+
14
+ # Local imports
15
+ from data.custom_datasets import MazeImageFolder
16
+ from models.ctm import ContinuousThoughtMachine
17
+ from tasks.mazes.plotting import draw_path
18
+ from tasks.image_classification.plotting import save_frames_to_mp4
19
+
20
+ def has_solved_checker(x_maze, route, valid_only=True, fault_tolerance=1, exclusions=[]):
21
+ """Checks if a route solves a maze."""
22
+ maze = np.copy(x_maze)
23
+ H, W, _ = maze.shape
24
+ start_coords = np.argwhere((maze == [1, 0, 0]).all(axis=2))
25
+ end_coords = np.argwhere((maze == [0, 1, 0]).all(axis=2))
26
+
27
+ if len(start_coords) == 0:
28
+ return False, (-1, -1), 0 # Cannot start
29
+
30
+ current_pos = tuple(start_coords[0])
31
+ target_pos = tuple(end_coords[0]) if len(end_coords) > 0 else None
32
+
33
+ mistakes_made = 0
34
+ final_pos = current_pos
35
+ path_taken_len = 0
36
+
37
+ for step in route:
38
+ if mistakes_made > fault_tolerance:
39
+ break
40
+
41
+ next_pos_candidate = list(current_pos) # Use a list for mutable coordinate calculation
42
+ if step == 0: next_pos_candidate[0] -= 1
43
+ elif step == 1: next_pos_candidate[0] += 1
44
+ elif step == 2: next_pos_candidate[1] -= 1
45
+ elif step == 3: next_pos_candidate[1] += 1
46
+ elif step == 4: pass # Stay in place
47
+ else: continue # Invalid step action
48
+ next_pos = tuple(next_pos_candidate)
49
+
50
+
51
+ is_invalid_step = False
52
+ # Check bounds first, then maze content if in bounds
53
+ if not (0 <= next_pos[0] < H and 0 <= next_pos[1] < W):
54
+ is_invalid_step = True
55
+ elif np.all(maze[next_pos] == [0, 0, 0]): # Wall
56
+ is_invalid_step = True
57
+
58
+ if is_invalid_step:
59
+ mistakes_made += 1
60
+ if valid_only:
61
+ continue
62
+
63
+ current_pos = next_pos
64
+ path_taken_len += 1
65
+
66
+ if target_pos and current_pos == target_pos:
67
+ if mistakes_made <= fault_tolerance:
68
+ return True, current_pos, path_taken_len
69
+
70
+ if mistakes_made <= fault_tolerance:
71
+ # Assuming exclusions is a list of tuples (as populated in the 'gen' action)
72
+ if current_pos not in exclusions:
73
+ final_pos = current_pos
74
+
75
+ if target_pos and final_pos == target_pos and mistakes_made <= fault_tolerance: # Added mistakes_made check here
76
+ return True, final_pos, path_taken_len
77
+ return False, final_pos, path_taken_len
78
+
79
+
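# Editorial usage sketch, not part of the original run.py: exercising
# has_solved_checker on a toy 1x3 corridor. Red [1, 0, 0] marks the start,
# green [0, 1, 0] the goal, white [1, 1, 1] open path; route steps are
# 0=up, 1=down, 2=left, 3=right, 4=stay.
toy_maze = np.array([[[1, 0, 0], [1, 1, 1], [0, 1, 0]]], dtype=float)  # (H=1, W=3, 3)
solved, final_pos, steps = has_solved_checker(toy_maze, route=[3, 3])
assert solved and tuple(final_pos) == (0, 2) and steps == 2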
80
+ def parse_args():
81
+ """Parses command-line arguments for maze analysis."""
82
+ parser = argparse.ArgumentParser(description="Analyze the Continuous Thought Machine on Maze Tasks")
83
+ parser.add_argument('--actions', type=str, nargs='+', default=['gen'], help="Actions: 'viz', 'gen'")
84
+ parser.add_argument('--device', type=int, nargs='+', default=[-1], help="GPU device index or -1 for CPU")
85
+ parser.add_argument('--checkpoint', type=str, default='checkpoints/mazes/ctm_mazeslarge_D=2048_T=75_M=25.pt', help="Path to CTM checkpoint")
86
+ parser.add_argument('--output_dir', type=str, default='tasks/mazes/analysis/outputs', help="Directory for analysis outputs")
87
+ parser.add_argument('--dataset_for_viz', type=str, default='large', help="Dataset for 'viz' action")
88
+ parser.add_argument('--dataset_for_gen', type=str, default='extralarge', help="Dataset for 'gen' action")
89
+ parser.add_argument('--batch_size_test', type=int, default=32, help="Batch size for loading test data for 'viz'")
90
+ parser.add_argument('--max_reapplications', type=int, default=20, help="When testing generalisation to extra large mazes")
91
+ parser.add_argument('--legacy_scaling', action=argparse.BooleanOptionalAction, default=True, help='Legacy checkpoints scale between 0 and 1, new ones can scale -1 to 1.')
92
+ return parser.parse_args()
93
+
94
+ def _load_ctm_model(checkpoint_path, device):
95
+ """Loads the ContinuousThoughtMachine model from a checkpoint."""
96
+ print(f"Loading checkpoint: {checkpoint_path}")
97
+ checkpoint = torch.load(checkpoint_path, map_location=device, weights_only=False)
98
+ model_args = checkpoint['args']
99
+
100
+ # Handle legacy arguments for model_args
101
+ if not hasattr(model_args, 'backbone_type') and hasattr(model_args, 'resnet_type'):
102
+ model_args.backbone_type = f'{model_args.resnet_type}-{getattr(model_args, "resnet_feature_scales", [4])[-1]}'
103
+
104
+ # Ensure prediction_reshaper is derived correctly
105
+ # Assuming out_dims exists and is used for this
106
+ prediction_reshaper = [model_args.out_dims // 5, 5] if hasattr(model_args, 'out_dims') else None
107
+
108
+
109
+ if not hasattr(model_args, 'neuron_select_type'):
110
+ model_args.neuron_select_type = 'first-last'
111
+ if not hasattr(model_args, 'n_random_pairing_self'):
112
+ model_args.n_random_pairing_self = 0
113
+
114
+ print("Instantiating CTM model...")
115
+ model = ContinuousThoughtMachine(
116
+ iterations=model_args.iterations,
117
+ d_model=model_args.d_model,
118
+ d_input=model_args.d_input,
119
+ heads=model_args.heads,
120
+ n_synch_out=model_args.n_synch_out,
121
+ n_synch_action=model_args.n_synch_action,
122
+ synapse_depth=model_args.synapse_depth,
123
+ memory_length=model_args.memory_length,
124
+ deep_nlms=model_args.deep_memory, # Mapping from model_args.deep_memory
125
+ memory_hidden_dims=model_args.memory_hidden_dims,
126
+ do_layernorm_nlm=model_args.do_normalisation, # Mapping from model_args.do_normalisation
127
+ backbone_type=model_args.backbone_type,
128
+ positional_embedding_type=model_args.positional_embedding_type,
129
+ out_dims=model_args.out_dims,
130
+ prediction_reshaper=prediction_reshaper,
131
+ dropout=0, # Explicitly setting dropout to 0 as in original
132
+ neuron_select_type=model_args.neuron_select_type,
133
+ n_random_pairing_self=model_args.n_random_pairing_self,
134
+ ).to(device)
135
+
136
+ load_result = model.load_state_dict(checkpoint['state_dict'], strict=False)
137
+ print(f"Loaded state_dict. Missing keys: {load_result.missing_keys}, Unexpected keys: {load_result.unexpected_keys}")
138
+ model.eval()
139
+ return model
140
+
141
+ # --- Main Execution Block ---
142
+ if __name__=='__main__':
143
+ args = parse_args()
144
+
145
+ if args.device[0] != -1 and torch.cuda.is_available():
146
+ device = f'cuda:{args.device[0]}'
147
+ else:
148
+ device = 'cpu'
149
+ print(f"Using device: {device}")
150
+
151
+ palette = sns.color_palette("husl", 8)
152
+ cmap = plt.get_cmap('gist_rainbow')
153
+
154
+ # --- Generalisation Action ('gen') ---
155
+ if 'gen' in args.actions:
156
+ model = _load_ctm_model(args.checkpoint, device)
157
+
158
+ print(f"\n--- Running Generalisation Analysis ('gen'): {args.dataset_for_gen} ---")
159
+ target_dataset_name = f'{args.dataset_for_gen}'
160
+ data_root = f'data/mazes/{target_dataset_name}/test'
161
+ max_target_route_len = 50 # Specific to 'gen' action
162
+
163
+ test_data = MazeImageFolder(
164
+ root=data_root, which_set='test',
165
+ maze_route_length=max_target_route_len,
166
+ expand_range=not args.legacy_scaling, # Legacy checkpoints need a [0, 1] range, but it might be better to default to [-1, 1] in the future
167
+ trunc=True
168
+ )
169
+ # Load a single large batch for 'gen'
170
+ testloader = torch.utils.data.DataLoader(
171
+ test_data, batch_size=min(len(test_data), 2000),
172
+ shuffle=False, num_workers=1
173
+ )
174
+ inputs, targets = next(iter(testloader))
175
+
176
+ actual_lengths = (targets != 4).sum(dim=-1)
177
+ sorted_indices = torch.argsort(actual_lengths, descending=True)
178
+ inputs, targets, actual_lengths = inputs[sorted_indices], targets[sorted_indices], actual_lengths[sorted_indices]
179
+
180
+ test_how_many = min(1000, len(inputs))
181
+ print(f"Processing {test_how_many} mazes sorted by length...")
182
+
183
+ results = {}
184
+ fault_tolerance = 2 # Specific to 'gen' analysis
185
+ output_gen_dir = os.path.join(args.output_dir, 'gen', args.dataset_for_gen)
186
+ os.makedirs(output_gen_dir, exist_ok=True)
187
+
188
+ for n_tested in range(test_how_many):
189
+ maze_actual_length = actual_lengths[n_tested].item()
190
+ maze_idx_display = n_tested + 1
191
+ print(f"Testing maze {maze_idx_display}/{test_how_many} (Len: {maze_actual_length})...")
192
+
193
+ initial_input_maze = inputs[n_tested:n_tested+1].clone().to(device)
194
+ maze_output_dir = os.path.join(output_gen_dir, f"maze_{maze_idx_display}")
195
+
196
+ re_applications = 0
197
+ has_solved = False
198
+ current_input_maze = initial_input_maze
199
+ exclusions = []
200
+ long_frames = []
201
+ ongoing_solution_img = None
202
+
203
+ while not has_solved and re_applications < args.max_reapplications:
204
+ re_applications += 1
205
+ with torch.no_grad():
206
+ predictions, certainties, _, _, _, attention_tracking = model(current_input_maze, track=True)
207
+
208
+ h_feat, w_feat = model.kv_features.shape[-2:]
209
+ attention_tracking = attention_tracking.reshape(attention_tracking.shape[0], -1, h_feat, w_feat)
210
+
211
+ n_steps_viz = predictions.shape[-1] # Use a different name to avoid conflict if n_steps is used elsewhere
212
+ step_linspace = np.linspace(0, 1, n_steps_viz)
213
+ current_maze_np = current_input_maze[0].permute(1,2,0).detach().cpu().numpy()
214
+
215
+ for stepi in range(n_steps_viz):
216
+ pred_route = predictions[0, :, stepi].reshape(-1, 5).argmax(-1).detach().cpu().numpy()
217
+ frame = draw_path(current_maze_np, pred_route)
218
+ if attention_tracking is not None and stepi < attention_tracking.shape[0]:
219
+ try:
220
+ attn = attention_tracking[stepi].mean(0)
221
+ attn_resized = cv2.resize(attn, (current_maze_np.shape[1], current_maze_np.shape[0]), interpolation=cv2.INTER_LINEAR)
222
+ if attn_resized.max() > attn_resized.min():
223
+ attn_norm = (attn_resized - attn_resized.min()) / (attn_resized.max() - attn_resized.min())
224
+ attn_norm[attn_norm < np.percentile(attn_norm, 80)] = 0.0
225
+ frame = np.clip((np.copy(frame)*(1-attn_norm[:,:,np.newaxis])*1 + (attn_norm[:,:,np.newaxis]*0.8 * np.reshape(np.array(cmap(step_linspace[stepi]))[:3], (1, 1, 3)))), 0, 1)
226
+ except Exception: # Keep broad except for visualization robustness
227
+ pass
228
+ frame_resized = cv2.resize(frame, (int(current_maze_np.shape[1]*4), int(current_maze_np.shape[0]*4)), interpolation=cv2.INTER_NEAREST) # cv2.resize expects (width, height)
229
+ long_frames.append((np.clip(frame_resized, 0, 1) * 255).astype(np.uint8))
230
+
231
+ where_most_certain = certainties[0, 1].argmax().item()
232
+ chosen_pred_route = predictions[0, :, where_most_certain].reshape(-1, 5).argmax(-1).detach().cpu().numpy()
233
+ current_start_loc_list = np.argwhere((current_maze_np == [1, 0, 0]).all(axis=2)).tolist()
234
+
235
+ # Ensure current_start_loc_list is not empty before trying to access its elements
236
+ if not current_start_loc_list:
237
+ print(f"Warning: Could not find start location in maze {maze_idx_display} during reapplication {re_applications}. Stopping reapplication.")
238
+ break # Cannot proceed without a start location
239
+
240
+ solved_now, final_pos, _ = has_solved_checker(current_maze_np, chosen_pred_route, True, fault_tolerance, exclusions)
241
+
242
+ path_img = draw_path(current_maze_np, chosen_pred_route, cmap=cmap, valid_only=True)
243
+ if ongoing_solution_img is None:
244
+ ongoing_solution_img = path_img
245
+ else:
246
+ mask = (np.any(ongoing_solution_img!=path_img, -1))&(~np.all(path_img==[1,1,1], -1))&(~np.all(ongoing_solution_img==[1,0,0], -1))
247
+ ongoing_solution_img[mask] = path_img[mask]
248
+
249
+ if solved_now:
250
+ has_solved = True
251
+ break
252
+
253
+ if tuple(current_start_loc_list[0]) == final_pos:
254
+ exclusions.append(tuple(current_start_loc_list[0]))
255
+
256
+ next_input = current_input_maze.clone()
257
+ old_start_idx = tuple(current_start_loc_list[0])
258
+ next_input[0, :, old_start_idx[0], old_start_idx[1]] = 1.0 # Reset old start to path
259
+
260
+ if 0 <= final_pos[0] < next_input.shape[2] and 0 <= final_pos[1] < next_input.shape[3]:
261
+ next_input[0, :, final_pos[0], final_pos[1]] = torch.tensor([1,0,0], device=device, dtype=next_input.dtype) # New start
262
+ else:
263
+ print(f"Warning: final_pos {final_pos} out of bounds for maze {maze_idx_display}. Stopping reapplication.")
264
+ break
265
+ current_input_maze = next_input
266
+
267
+ if has_solved:
268
+ print(f'Solved maze of length {maze_actual_length}! Saving...')
269
+ os.makedirs(maze_output_dir, exist_ok=True)
270
+ if ongoing_solution_img is not None:
271
+ cv2.imwrite(os.path.join(maze_output_dir, 'ongoing_solution.png'), (ongoing_solution_img * 255).astype(np.uint8)[:,:,::-1])
272
+ if long_frames:
273
+ save_frames_to_mp4([fm[:,:,::-1] for fm in long_frames], os.path.join(maze_output_dir, f'combined_process.mp4'), fps=45, gop_size=10, preset='veryslow', crf=20)
274
+ else:
275
+ print(f'Failed maze of length {maze_actual_length} after {re_applications} reapplications. Not saving visuals for this maze.')
276
+
277
+ if maze_actual_length not in results: results[maze_actual_length] = []
278
+ results[maze_actual_length].append((has_solved, re_applications))
279
+
280
+ fig_success, ax_success = plt.subplots()
281
+ fig_reapp, ax_reapp = plt.subplots()
282
+ sorted_lengths = sorted(results.keys())
283
+ if sorted_lengths:
284
+ success_rates = [np.mean([r[0] for r in results[l]]) * 100 for l in sorted_lengths]
285
+ reapps_mean = [np.mean([r[1] for r in results[l] if r[0]]) if any(r[0] for r in results[l]) else np.nan for l in sorted_lengths]
286
+ ax_success.plot(sorted_lengths, success_rates, linestyle='-', color=palette[0])
287
+ ax_reapp.plot(sorted_lengths, reapps_mean, linestyle='-', color=palette[5])
288
+ ax_success.set_xlabel('Route Length'); ax_success.set_ylabel('Success (%)')
289
+ ax_reapp.set_xlabel('Route Length'); ax_reapp.set_ylabel('Re-applications (Avg on Success)')
290
+ fig_success.tight_layout(pad=0.1); fig_reapp.tight_layout(pad=0.1)
291
+ fig_success.savefig(os.path.join(output_gen_dir, f'{args.dataset_for_gen}-success_rate.png'), dpi=200)
292
+ fig_success.savefig(os.path.join(output_gen_dir, f'{args.dataset_for_gen}-success_rate.pdf'), dpi=200)
293
+ fig_reapp.savefig(os.path.join(output_gen_dir, f'{args.dataset_for_gen}-re-applications.png'), dpi=200)
294
+ fig_reapp.savefig(os.path.join(output_gen_dir, f'{args.dataset_for_gen}-re-applications.pdf'), dpi=200)
295
+ plt.close(fig_success); plt.close(fig_reapp)
296
+ np.savez(os.path.join(output_gen_dir, f'{args.dataset_for_gen}_results.npz'), results=results)
297
+
298
+ print("\n--- Generalisation Analysis ('gen') Complete ---")
299
+
300
+ # --- Visualization Action ('viz') ---
301
+ if 'viz' in args.actions:
302
+ model = _load_ctm_model(args.checkpoint, device)
303
+
304
+ print(f"\n--- Running Visualization ('viz'): {args.dataset_for_viz} ---")
305
+ output_viz_dir = os.path.join(args.output_dir, 'viz')
306
+ os.makedirs(output_viz_dir, exist_ok=True)
307
+
308
+ target_dataset_name = f'{args.dataset_for_viz}'
309
+ data_root = f'data/mazes/{target_dataset_name}/test'
310
+ test_data = MazeImageFolder(
311
+ root=data_root, which_set='test',
312
+ maze_route_length=100, # Max route length for viz data
313
+ expand_range=not args.legacy_scaling, # Legacy checkpoints need a [0, 1] range, but it might be better to default to [-1, 1] in the future
314
+ trunc=True
315
+ )
316
+ testloader = torch.utils.data.DataLoader(
317
+ test_data, batch_size=args.batch_size_test,
318
+ shuffle=False, num_workers=1
319
+ )
320
+
321
+ all_inputs, all_targets, all_lengths = [], [], []
322
+ for b_in, b_tgt in testloader:
323
+ all_inputs.append(b_in)
324
+ all_targets.append(b_tgt)
325
+ all_lengths.append((b_tgt != 4).sum(dim=-1))
326
+
327
+ if not all_inputs:
328
+ print("Error: No data in visualization loader. Exiting 'viz' action.")
329
+ exit()
330
+
331
+ all_inputs, all_targets, all_lengths = torch.cat(all_inputs), torch.cat(all_targets), torch.cat(all_lengths)
332
+
333
+ num_viz_mazes = 10
334
+ num_viz_mazes = min(num_viz_mazes, len(all_lengths))
335
+
336
+ if num_viz_mazes == 0:
337
+ print("Error: No mazes found to visualize. Exiting 'viz' action.")
338
+ exit()
339
+
340
+ top_indices = torch.argsort(all_lengths, descending=True)[:num_viz_mazes]
341
+ inputs_viz, targets_viz = all_inputs[top_indices].to(device), all_targets[top_indices]
342
+
343
+ print(f"Visualizing {len(inputs_viz)} longest mazes...")
344
+
345
+ with torch.no_grad():
346
+ predictions, _, _, _, _, attention_tracking = model(inputs_viz, track=True)
347
+
348
+ # Reshape attention: (Steps, Batch, Heads, H_feat, W_feat) assuming model.kv_features has H_feat, W_feat
349
+ # The original reshape was slightly different, this tries to match the likely intended dimensions for per-step, per-batch item attention
350
+ if attention_tracking is not None and hasattr(model, 'kv_features') and model.kv_features is not None:
351
+ attention_tracking = attention_tracking.reshape(
352
+ attention_tracking.shape[0], # Iterations/Steps
353
+ inputs_viz.size(0), # Batch size (num_viz_mazes)
354
+ -1, # Heads (inferred)
355
+ model.kv_features.shape[-2], # H_feat
356
+ model.kv_features.shape[-1] # W_feat
357
+ )
358
+ else:
359
+ attention_tracking = None # Ensure it's None if it can't be reshaped
360
+ print("Warning: Could not reshape attention_tracking. Visualizations may not include attention overlays.")
361
+
362
+
363
+ for maze_i in range(inputs_viz.size(0)):
364
+ maze_idx_display = maze_i + 1
365
+ maze_output_dir = os.path.join(output_viz_dir, f"maze_{maze_idx_display}")
366
+ os.makedirs(maze_output_dir, exist_ok=True)
367
+
368
+ current_input_np_original = inputs_viz[maze_i].permute(1,2,0).detach().cpu().numpy()
369
+ # Apply scaling for visualization based on legacy_scaling: Legacy checkpoints need a [0, 1] range, but it might be better to default to [-1, 1] in the future
370
+ current_input_np_display = (current_input_np_original + 1) / 2 if not args.legacy_scaling else current_input_np_original
371
+
372
+ current_target_route = targets_viz[maze_i].detach().cpu().numpy()
373
+ print(f"Generating viz for maze {maze_idx_display}...")
374
+
375
+ try:
376
+ solution_maze_img = draw_path(current_input_np_display, current_target_route, gt=True)
377
+ cv2.imwrite(os.path.join(maze_output_dir, 'solution_ground_truth.png'), (solution_maze_img * 255).astype(np.uint8)[:,:,::-1])
378
+ except Exception: # Keep broad except for visualization robustness
379
+ print(f"Could not save ground truth solution for maze {maze_idx_display}")
380
+ pass
381
+
382
+ frames = []
383
+ n_steps_viz = predictions.shape[-1] # Use a different name
384
+ step_linspace = np.linspace(0, 1, n_steps_viz)
385
+
386
+ for stepi in range(n_steps_viz):
387
+ pred_route = predictions[maze_i, :, stepi].reshape(-1, 5).argmax(-1).detach().cpu().numpy()
388
+ frame = draw_path(current_input_np_display, pred_route)
389
+
390
+ if attention_tracking is not None and stepi < attention_tracking.shape[0] and maze_i < attention_tracking.shape[1]:
391
+
392
+ # Attention for current step (stepi) and current maze in batch (maze_i), average over heads
393
+ attn = attention_tracking[stepi, maze_i].mean(0)
394
+ attn_resized = cv2.resize(attn, (current_input_np_display.shape[1], current_input_np_display.shape[0]), interpolation=cv2.INTER_LINEAR)
395
+ if attn_resized.max() > attn_resized.min():
396
+ attn_norm = (attn_resized - attn_resized.min()) / (attn_resized.max() - attn_resized.min())
397
+ attn_norm[attn_norm < np.percentile(attn_norm, 80)] = 0.0
398
+ frame = np.clip((np.copy(frame)*(1-attn_norm[:,:,np.newaxis])*0.9 + (attn_norm[:,:,np.newaxis]*1.2 * np.reshape(np.array(cmap(step_linspace[stepi]))[:3], (1, 1, 3)))), 0, 1)
399
+
400
+
401
+ frame_resized = cv2.resize(frame, (256, 256), interpolation=cv2.INTER_NEAREST)
402
+ frames.append((np.clip(frame_resized, 0, 1) * 255).astype(np.uint8))
403
+
404
+ if frames:
405
+ imageio.mimsave(os.path.join(maze_output_dir, 'attention_overlay.gif'), frames, fps=15, loop=0)
406
+
407
+ print("\n--- Visualization Action ('viz') Complete ---")
tasks/mazes/plotting.py ADDED
@@ -0,0 +1,198 @@
1
+
2
+ import numpy as np
3
+ import cv2
4
+ import torch
5
+ import os
6
+ import matplotlib.pyplot as plt
7
+ import imageio
8
+
9
+ from tqdm.auto import tqdm
10
+
11
+ def find_center_of_mass(array_2d):
12
+ """
13
+ Computes the center of mass of a 2D array using np.average and meshgrid.
14
+ Values are treated as weights; if the total mass is zero, (nan, nan) is returned.
15
+
16
+ Args:
17
+ array_2d: A 2D numpy array of values between 0 and 1.
18
+
19
+ Returns:
20
+ A tuple (y, x) representing the coordinates of the center of mass.
21
+ """
22
+ total_mass = np.sum(array_2d)
23
+ if total_mass == 0:
24
+ return (np.nan, np.nan)
25
+
26
+ y_coords, x_coords = np.mgrid[:array_2d.shape[0], :array_2d.shape[1]]
27
+ x_center = np.average(x_coords, weights=array_2d)
28
+ y_center = np.average(y_coords, weights=array_2d)
29
+ return (round(y_center, 4), round(x_center, 4))
30
+
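# Editorial usage sketch, not part of the original plotting.py: a worked example of
# find_center_of_mass. All of the mass sits at row 0, column 1, so the centre of
# mass is returned as (y, x) = (0.0, 1.0).
example_mass = np.array([[0.0, 1.0],
                         [0.0, 0.0]])
assert find_center_of_mass(example_mass) == (0.0, 1.0)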
31
+ def draw_path(x, route, valid_only=False, gt=False, cmap=None):
32
+ """
33
+ Draws a path on a maze image based on a given route.
34
+
35
+ Args:
36
+ x: A numpy array (H, W, 3) representing the maze image, with the start in red and the goal in green.
37
+ route: A list of integers representing the route, where 0 is up, 1 is down, 2 is left, 3 is right, and 4 is stay in place.
38
+ valid_only: A boolean indicating whether to only draw valid steps (i.e., steps that don't go into walls).
39
+
40
+ Returns:
41
+ A numpy array representing the maze image with the path drawn using the given colormap.
42
+ """
43
+ x = np.copy(x)
44
+ start = np.argwhere((x == [1, 0, 0]).all(axis=2))
45
+ end = np.argwhere((x == [0, 1, 0]).all(axis=2))
46
+ if cmap is None:
47
+ cmap = plt.get_cmap('winter') if not valid_only else plt.get_cmap('summer')
48
+
49
+ # Initialize the current position
50
+ current_pos = start[0]
51
+
52
+ # Draw the path
53
+ colors = cmap(np.linspace(0, 1, len(route)))
54
+ si = 0
55
+ for step in route:
56
+ new_pos = current_pos
57
+ if step == 0: # Up
58
+ new_pos = (current_pos[0] - 1, current_pos[1])
59
+ elif step == 1: # Down
60
+ new_pos = (current_pos[0] + 1, current_pos[1])
61
+ elif step == 2: # Left
62
+ new_pos = (current_pos[0], current_pos[1] - 1)
63
+ elif step == 3: # Right
64
+ new_pos = (current_pos[0], current_pos[1] + 1)
65
+ elif step == 4: # Do nothing
66
+ pass
67
+ else:
68
+ raise ValueError("Invalid step: {}".format(step))
69
+
70
+ # Check if the new position is valid
71
+ if valid_only:
72
+ try:
73
+ if np.all(x[new_pos] == [0,0,0]): # Check if it's a wall
74
+ continue # Skip this step if it's invalid
75
+ except IndexError:
76
+ continue # Skip this step if it's out of bounds
77
+
78
+ # Draw the step
79
+ if new_pos[0] >= 0 and new_pos[0] < x.shape[0] and new_pos[1] >= 0 and new_pos[1] < x.shape[1]:
80
+ if not ((x[new_pos] == [1,0,0]).all() or (x[new_pos] == [0,1,0]).all()):
81
+ colour = colors[si][:3]
82
+ si += 1
83
+ x[new_pos] = x[new_pos]*0.5 + colour*0.5
84
+
85
+ # Update the current position
86
+ current_pos = new_pos
87
+ # cv2.imwrite('maze2.png', x[:,:,::-1]*255)
88
+
89
+ return x
90
+
91
+ def make_maze_gif(inputs, predictions, targets, attention_tracking, save_location):
92
+ """
93
+ Expect inputs, predictions, targets as numpy arrays
94
+ """
95
+ route_steps = []
96
+ route_colours = []
97
+ solution_maze = draw_path(np.moveaxis(inputs, 0, -1), targets)
98
+
99
+ # cv2.imwrite(f'{save_location}/ground_truth.png', solution_maze[:,:,::-1]*255)
100
+ mosaic = [['overlay', 'overlay', 'overlay', 'overlay', 'route', 'route', 'route', 'route'],
101
+ ['overlay', 'overlay', 'overlay', 'overlay', 'route', 'route', 'route', 'route'],
102
+ ['overlay', 'overlay', 'overlay', 'overlay', 'route', 'route', 'route', 'route'],
103
+ ['overlay', 'overlay', 'overlay', 'overlay', 'route', 'route', 'route', 'route'],
104
+ ['head_0', 'head_1', 'head_2', 'head_3', 'head_4', 'head_5', 'head_6', 'head_7'],
105
+ ['head_8', 'head_9', 'head_10', 'head_11', 'head_12', 'head_13', 'head_14', 'head_15'],
106
+ ]
107
+ img_aspect = 1
108
+ figscale = 1
109
+ aspect_ratio = (8 * figscale, 6 * figscale * img_aspect) # W, H
110
+
111
+ route_steps = [np.unravel_index(np.argmax((inputs == np.reshape(np.array([1, 0, 0]), (3, 1, 1))).all(0)), inputs.shape[1:])] # Starting point
112
+ frames = []
113
+ cmap = plt.get_cmap('gist_rainbow')
114
+ cmap_viridis = plt.get_cmap('viridis')
115
+ step_linspace = np.linspace(0, 1, predictions.shape[-1]) # For sampling colours
116
+ with tqdm(total=predictions.shape[-1], initial=0, leave=True, position=1, dynamic_ncols=True) as pbar:
117
+ pbar.set_description('Processing frames for maze plotting')
118
+ for stepi in np.arange(0, predictions.shape[-1], 1):
119
+ fig, axes = plt.subplot_mosaic(mosaic, figsize=aspect_ratio)
120
+ for ax in axes.values():
121
+ ax.axis('off')
122
+ guess_maze = draw_path(np.moveaxis(inputs, 0, -1), predictions.argmax(1)[:,stepi], cmap=cmap)
123
+ attention_now = attention_tracking[stepi]
124
+ for hi in range(min((attention_tracking.shape[1], 16))):
125
+ ax = axes[f'head_{hi}']
126
+ attn = attention_tracking[stepi, hi]
127
+ attn = (attn - attn.min())/(np.ptp(attn))
128
+ ax.imshow(attn, cmap=cmap_viridis)
129
+ # Upsample attention just for visualisation
130
+ aggregated_attention = torch.nn.functional.interpolate(torch.from_numpy(attention_now).unsqueeze(0), inputs.shape[-1], mode='bilinear')[0].mean(0).numpy()
131
+
132
+ # Get approximate center of mass
133
+ com_attn = np.copy(aggregated_attention)
134
+ com_attn[com_attn < np.percentile(com_attn, 96)] = 0.0
135
+ aggregated_attention[aggregated_attention < np.percentile(aggregated_attention, 80)] = 0.0
136
+ route_steps.append(find_center_of_mass(com_attn))
137
+
138
+
139
+ colour = list(cmap(step_linspace[stepi]))
140
+ route_colours.append(colour)
141
+
142
+ mapped_attention = torch.nn.functional.interpolate(torch.from_numpy(attention_now).unsqueeze(0), inputs.shape[-1], mode='bilinear')[0].mean(0).numpy()
143
+ mapped_attention = (mapped_attention - mapped_attention.min())/np.ptp(mapped_attention)
144
+ # np.clip(guess_maze * (1-mapped_attention[...,np.newaxis]*0.5) + (cmap_viridis(mapped_attention)[:,:,:3] * mapped_attention[...,np.newaxis])*1.3, 0, 1)
145
+ overlay_img = np.clip(guess_maze * (1-mapped_attention[...,np.newaxis]*0.6) + (cmap_viridis(mapped_attention)[:,:,:3] * mapped_attention[...,np.newaxis])*1.1, 0, 1)#np.clip((np.copy(guess_maze)*(1-aggregated_attention[:,:,np.newaxis])*0.7 + (aggregated_attention[:,:,np.newaxis]*3 * np.reshape(np.array(colour)[:3], (1, 1, 3)))), 0, 1)
146
+ axes['overlay'].imshow(overlay_img)
147
+
148
+ y_coords, x_coords = zip(*route_steps)
149
+ y_coords = inputs.shape[-1] - np.array(list(y_coords))-1
150
+
151
+
152
+ axes['route'].imshow(np.flip(np.moveaxis(inputs, 0, -1), axis=0), origin='lower')
153
+ # ax.imshow(np.flip(solution_maze, axis=0), origin='lower')
154
+ arrow_scale = 2
155
+ for i in range(len(route_steps)-1):
156
+ dx = x_coords[i+1] - x_coords[i]
157
+ dy = y_coords[i+1] - y_coords[i]
158
+ axes['route'].arrow(x_coords[i], y_coords[i], dx, dy, linewidth=2*arrow_scale, head_width=0.2*arrow_scale, head_length=0.3*arrow_scale, fc=route_colours[i], ec=route_colours[i], length_includes_head = True)
159
+
160
+ fig.tight_layout(pad=0.1) # Adjust spacing
161
+
162
+ # Render the plot to a numpy array
163
+ canvas = fig.canvas
164
+ canvas.draw()
165
+ image_numpy = np.frombuffer(canvas.buffer_rgba(), dtype='uint8')
166
+ image_numpy = image_numpy.reshape(*reversed(canvas.get_width_height()), 4)[:,:,:3] # Get RGB
167
+
168
+ frames.append(image_numpy) # Add to list for GIF
169
+
170
+ # fig.savefig(f'{save_location}/frame.png', dpi=200)
171
+
172
+ plt.close(fig)
173
+
174
+ # # frame = np.clip((np.copy(guess_maze)*0.5 + (aggregated_attention[:,:,np.newaxis] * np.reshape(np.array(colour)[:3], (1, 1, 3)))), 0, 1)
175
+ # frame = torch.nn.functional.interpolate(torch.from_numpy(frame).permute(2,0,1).unsqueeze(0), 256)[0].permute(1,2,0).detach().cpu().numpy()
176
+ # frames.append((frame*255).astype(np.uint8))
177
+ pbar.update(1)
178
+
179
+
180
+ y_coords, x_coords = zip(*route_steps)
181
+ y_coords = inputs.shape[-1] - np.array(list(y_coords))-1
182
+
183
+ fig = plt.figure(figsize=(5,5))
184
+ ax = fig.add_subplot(111)
185
+
186
+ ax.imshow(np.flip(np.moveaxis(inputs, 0, -1), axis=0), origin='lower')
187
+ # ax.imshow(np.flip(solution_maze, axis=0), origin='lower')
188
+ arrow_scale = 2
189
+ for i in range(len(route_steps)-1):
190
+ dx = x_coords[i+1] - x_coords[i]
191
+ dy = y_coords[i+1] - y_coords[i]
192
+ plt.arrow(x_coords[i], y_coords[i], dx, dy, linewidth=2*arrow_scale, head_width=0.2*arrow_scale, head_length=0.3*arrow_scale, fc=route_colours[i], ec=route_colours[i], length_includes_head = True)
193
+
194
+ ax.axis('off')
195
+ fig.tight_layout(pad=0)
196
+ fig.savefig(f'{save_location}/route_approximation.png', dpi=200)
197
+ imageio.mimsave(f'{save_location}/prediction.gif', frames, fps=15, loop=100)
198
+ plt.close(fig)
tasks/mazes/scripts/train_ctm.sh ADDED
@@ -0,0 +1,35 @@
1
+ python -m tasks.mazes.train \
2
+ --model ctm \
3
+ --log_dir logs/mazes/ctm/d=2048--i=512--heads=16--sd=8--nlm=32--synch=64-32-h=32-first-last--iters=75x25--backbone=34-2 \
4
+ --neuron_select_type first-last \
5
+ --dataset mazes-large \
6
+ --synapse_depth 8 \
7
+ --heads 16 \
8
+ --iterations 75 \
9
+ --memory_length 25 \
10
+ --d_model 2048 \
11
+ --d_input 512 \
12
+ --backbone_type resnet34-2 \
13
+ --n_synch_out 64 \
14
+ --n_synch_action 32 \
15
+ --memory_hidden_dims 32 \
16
+ --deep_memory \
17
+ --weight_decay 0.000 \
18
+ --batch_size 64 \
19
+ --batch_size_test 128 \
20
+ --n_test_batches 20 \
21
+ --gradient_clipping -1 \
22
+ --use_scheduler \
23
+ --scheduler_type cosine \
24
+ --warmup_steps 10000 \
25
+ --training_iterations 1000001 \
26
+ --no-do_normalisation \
27
+ --track_every 1000 \
28
+ --lr 1e-4 \
29
+ --no-reload \
30
+ --dropout 0.1 \
31
+ --positional_embedding_type none \
32
+ --maze_route_length 100 \
33
+ --cirriculum_lookahead 5 \
34
+ --device 0 \
35
+ --no-expand_range
tasks/mazes/train.py ADDED
@@ -0,0 +1,698 @@
1
+ import argparse
2
+ import os
3
+ import random
4
+
5
+ import matplotlib.pyplot as plt
6
+ import numpy as np
7
+ import seaborn as sns
8
+ sns.set_style('darkgrid')
9
+ import torch
10
+ if torch.cuda.is_available():
11
+ # Allow TF32 matmuls for faster training on supported GPUs
12
+ torch.set_float32_matmul_precision('high')
13
+ from tqdm.auto import tqdm
14
+
15
+ from data.custom_datasets import MazeImageFolder
16
+ from models.ctm import ContinuousThoughtMachine
17
+ from models.lstm import LSTMBaseline
18
+ from models.ff import FFBaseline
19
+ from tasks.mazes.plotting import make_maze_gif
20
+ from tasks.image_classification.plotting import plot_neural_dynamics
21
+ from utils.housekeeping import set_seed, zip_python_code
22
+ from utils.losses import maze_loss
23
+ from utils.schedulers import WarmupCosineAnnealingLR, WarmupMultiStepLR, warmup
24
+
25
+ import torchvision
26
+ torchvision.disable_beta_transforms_warning()
27
+
28
+ import warnings
29
+ warnings.filterwarnings("ignore", message="using precomputed metric; inverse_transform will be unavailable")
30
+ warnings.filterwarnings('ignore', message='divide by zero encountered in power', category=RuntimeWarning)
31
+ warnings.filterwarnings(
32
+ "ignore",
33
+ "Corrupt EXIF data",
34
+ UserWarning,
35
+ r"^PIL\.TiffImagePlugin$" # Using a regular expression to match the module.
36
+ )
37
+ warnings.filterwarnings(
38
+ "ignore",
39
+ "UserWarning: Metadata Warning",
40
+ UserWarning,
41
+ r"^PIL\.TiffImagePlugin$" # Using a regular expression to match the module.
42
+ )
43
+ warnings.filterwarnings(
44
+ "ignore",
45
+ "UserWarning: Truncated File Read",
46
+ UserWarning,
47
+ r"^PIL\.TiffImagePlugin$" # Using a regular expression to match the module.
48
+ )
49
+
50
+
51
+ def parse_args():
52
+ parser = argparse.ArgumentParser()
53
+
54
+ # Model Selection
55
+ parser.add_argument('--model', type=str, required=True, choices=['ctm', 'lstm', 'ff'], help='Model type to train.')
56
+
57
+ # Model Architecture
58
+ # Common across all or most
59
+ parser.add_argument('--d_model', type=int, default=512, help='Dimension of the model.')
60
+ parser.add_argument('--dropout', type=float, default=0.0, help='Dropout rate.')
61
+ parser.add_argument('--backbone_type', type=str, default='resnet34-2', help='Type of backbone featuriser.') # Default changed from original script
62
+ # CTM / LSTM specific
63
+ parser.add_argument('--d_input', type=int, default=128, help='Dimension of the input (CTM, LSTM).')
64
+ parser.add_argument('--heads', type=int, default=8, help='Number of attention heads (CTM, LSTM).') # Default changed
65
+ parser.add_argument('--iterations', type=int, default=75, help='Number of internal ticks (CTM, LSTM).')
66
+ parser.add_argument('--positional_embedding_type', type=str, default='none',
67
+ help='Type of positional embedding (CTM, LSTM).', choices=['none',
68
+ 'learnable-fourier',
69
+ 'multi-learnable-fourier',
70
+ 'custom-rotational'])
71
+
72
+ # CTM specific
73
+ parser.add_argument('--synapse_depth', type=int, default=8, help='Depth of U-NET model for synapse. 1=linear, no unet (CTM only).') # Default changed
74
+ parser.add_argument('--n_synch_out', type=int, default=32, help='Number of neurons to use for output synch (CTM only).') # Default changed
75
+ parser.add_argument('--n_synch_action', type=int, default=32, help='Number of neurons to use for observation/action synch (CTM only).') # Default changed
76
+ parser.add_argument('--neuron_select_type', type=str, default='random-pairing', help='Protocol for selecting neuron subset (CTM only).')
77
+ parser.add_argument('--n_random_pairing_self', type=int, default=0, help='Number of neurons paired self-to-self for synch (CTM only).')
78
+ parser.add_argument('--memory_length', type=int, default=25, help='Length of the pre-activation history for NLMS (CTM only).')
79
+ parser.add_argument('--deep_memory', action=argparse.BooleanOptionalAction, default=True,
80
+ help='Use deep memory (CTM only).')
81
+ parser.add_argument('--memory_hidden_dims', type=int, default=32, help='Hidden dimensions of the memory if using deep memory (CTM only).') # Default changed
82
+ parser.add_argument('--dropout_nlm', type=float, default=None, help='Dropout rate for NLMs specifically. Unset to match dropout on the rest of the model (CTM only).')
83
+ parser.add_argument('--do_normalisation', action=argparse.BooleanOptionalAction, default=False, help='Apply normalization in NLMs (CTM only).')
84
+ # LSTM specific
85
+ parser.add_argument('--num_layers', type=int, default=2, help='Number of LSTM stacked layers (LSTM only).') # Added LSTM arg
86
+
87
+ # Task Specific Args (Common to all models for this task)
88
+ parser.add_argument('--maze_route_length', type=int, default=100, help='Length to truncate targets.')
89
+ parser.add_argument('--cirriculum_lookahead', type=int, default=5, help='How far to look ahead for the curriculum.')
90
+
91
+
92
+ # Training
93
+ parser.add_argument('--expand_range', action=argparse.BooleanOptionalAction, default=True, help='Mazes between 0 and 1 = False. Between -1 and 1 = True. Legacy checkpoints use 0 and 1.')
94
+ parser.add_argument('--batch_size', type=int, default=16, help='Batch size for training.') # Default changed
95
+ parser.add_argument('--batch_size_test', type=int, default=64, help='Batch size for testing.') # Default changed
96
+ parser.add_argument('--lr', type=float, default=1e-4, help='Learning rate for the model.') # Default changed
97
+ parser.add_argument('--training_iterations', type=int, default=100001, help='Number of training iterations.')
98
+ parser.add_argument('--warmup_steps', type=int, default=5000, help='Number of warmup steps.')
99
+ parser.add_argument('--use_scheduler', action=argparse.BooleanOptionalAction, default=True, help='Use a learning rate scheduler.')
100
+ parser.add_argument('--scheduler_type', type=str, default='cosine', choices=['multistep', 'cosine'], help='Type of learning rate scheduler.')
101
+ parser.add_argument('--milestones', type=int, default=[8000, 15000, 20000], nargs='+', help='Learning rate scheduler milestones.')
102
+ parser.add_argument('--gamma', type=float, default=0.1, help='Learning rate scheduler gamma for multistep.')
103
+ parser.add_argument('--weight_decay', type=float, default=0.0, help='Weight decay factor.')
104
+ parser.add_argument('--weight_decay_exclusion_list', type=str, nargs='+', default=[], help='List to exclude from weight decay. Typically good: bn, ln, bias, start')
105
+ parser.add_argument('--num_workers_train', type=int, default=0, help='Num workers training.') # Renamed from num_workers, kept default
106
+ parser.add_argument('--gradient_clipping', type=float, default=-1, help='Gradient quantile clipping value (-1 to disable).')
107
+ parser.add_argument('--do_compile', action=argparse.BooleanOptionalAction, default=False, help='Try to compile model components.')
108
+
109
+ # Logging and Saving
110
+ parser.add_argument('--log_dir', type=str, default='logs/scratch', help='Directory for logging.')
111
+ parser.add_argument('--dataset', type=str, default='mazes-medium', help='Dataset to use.', choices=['mazes-medium', 'mazes-large'])
112
+ parser.add_argument('--data_root', type=str, default='data/mazes', help='Data root.')
113
+
114
+ parser.add_argument('--save_every', type=int, default=1000, help='Save checkpoints every this many iterations.')
115
+ parser.add_argument('--seed', type=int, default=412, help='Random seed.')
116
+ parser.add_argument('--reload', action=argparse.BooleanOptionalAction, default=False, help='Reload from disk?')
117
+ parser.add_argument('--reload_model_only', action=argparse.BooleanOptionalAction, default=False, help='Reload only the model from disk?')
118
+ parser.add_argument('--strict_reload', action=argparse.BooleanOptionalAction, default=True, help='Should use strict reload for model weights.') # Added back
119
+ parser.add_argument('--ignore_metrics_when_reloading', action=argparse.BooleanOptionalAction, default=False, help='Ignore metrics when reloading (for debugging)?') # Added back
120
+
121
+ # Tracking
122
+ parser.add_argument('--track_every', type=int, default=1000, help='Track metrics every this many iterations.')
123
+ parser.add_argument('--n_test_batches', type=int, default=20, help='How many minibatches to approx metrics. Set to -1 for full eval') # Default changed
124
+
125
+ # Device
126
+ parser.add_argument('--device', type=int, nargs='+', default=[-1], help='List of GPU(s) to use. Set to -1 to use CPU.')
127
+ parser.add_argument('--use_amp', action=argparse.BooleanOptionalAction, default=False, help='AMP autocast.')
128
+
129
+
130
+ args = parser.parse_args()
131
+ return args
132
+
133
+
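+ # Example invocation (hypothetical; see tasks/mazes/scripts/train_ctm.sh for a maintained configuration):
+ #   python -m tasks.mazes.train --model ctm --dataset mazes-medium --log_dir logs/mazes/ctm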
134
+ if __name__=='__main__':
135
+
136
+ # Housekeeping
137
+ args = parse_args()
138
+
139
+ set_seed(args.seed, False)
140
+ if not os.path.exists(args.log_dir): os.makedirs(args.log_dir)
141
+
142
+ assert args.dataset in ['mazes-medium', 'mazes-large']
143
+
144
+
145
+
146
+ prediction_reshaper = [args.maze_route_length, 5] # Problem specific
147
+ args.out_dims = args.maze_route_length * 5 # Output dimension before reshaping
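+ # Each route step is predicted over 5 classes; the visualisation code below treats class 4 as padding/end, so this is presumably 4 movement directions plus a terminator.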
148
+
149
+ # For total reproducibility
150
+ zip_python_code(f'{args.log_dir}/repo_state.zip')
151
+ with open(f'{args.log_dir}/args.txt', 'w') as f:
152
+ print(args, file=f)
153
+
154
+ # Configure device string
155
+ device = f'cuda:{args.device[0]}' if args.device[0] != -1 else 'cpu'
156
+ print(f'Running model {args.model} on {device} for dataset {args.dataset}')
157
+
158
+ # Build model conditionally
159
+ model = None
160
+ if args.model == 'ctm':
161
+ model = ContinuousThoughtMachine(
162
+ iterations=args.iterations,
163
+ d_model=args.d_model,
164
+ d_input=args.d_input,
165
+ heads=args.heads,
166
+ n_synch_out=args.n_synch_out,
167
+ n_synch_action=args.n_synch_action,
168
+ synapse_depth=args.synapse_depth,
169
+ memory_length=args.memory_length,
170
+ deep_nlms=args.deep_memory,
171
+ memory_hidden_dims=args.memory_hidden_dims,
172
+ do_layernorm_nlm=args.do_normalisation,
173
+ backbone_type=args.backbone_type,
174
+ positional_embedding_type=args.positional_embedding_type,
175
+ out_dims=args.out_dims,
176
+ prediction_reshaper=prediction_reshaper,
177
+ dropout=args.dropout,
178
+ dropout_nlm=args.dropout_nlm,
179
+ neuron_select_type=args.neuron_select_type,
180
+ n_random_pairing_self=args.n_random_pairing_self,
181
+ ).to(device)
182
+ elif args.model == 'lstm':
183
+ model = LSTMBaseline(
184
+ num_layers=args.num_layers,
185
+ iterations=args.iterations,
186
+ d_model=args.d_model,
187
+ d_input=args.d_input,
188
+ heads=args.heads,
189
+ backbone_type=args.backbone_type,
190
+ positional_embedding_type=args.positional_embedding_type,
191
+ out_dims=args.out_dims,
192
+ prediction_reshaper=prediction_reshaper,
193
+ dropout=args.dropout,
194
+ ).to(device)
195
+ elif args.model == 'ff':
196
+ model = FFBaseline(
197
+ d_model=args.d_model,
198
+ backbone_type=args.backbone_type,
199
+ out_dims=args.out_dims,
200
+ dropout=args.dropout,
201
+ ).to(device)
202
+ else:
203
+ raise ValueError(f"Unknown model type: {args.model}")
204
+
205
+ try:
206
+ # Determine pseudo input shape based on dataset
207
+ h_w = 39 if args.dataset in ['mazes-small', 'mazes-medium'] else 99 # Example dimensions
208
+ pseudo_inputs = torch.zeros((1, 3, h_w, h_w), device=device).float()
209
+ model(pseudo_inputs)
210
+ except Exception as e:
211
+ print(f"Warning: Pseudo forward pass failed: {e}")
212
+
213
+ print(f'Total params: {sum(p.numel() for p in model.parameters())}')
214
+
215
+ # Data
216
+ dataset_mean = [0,0,0] # For plotting later
217
+ dataset_std = [1,1,1]
218
+
219
+ which_maze = args.dataset.split('-')[-1]
220
+ data_root = f'{args.data_root}/{which_maze}'
221
+
222
+ train_data = MazeImageFolder(root=f'{data_root}/train/', which_set='train', maze_route_length=args.maze_route_length, expand_range=args.expand_range)
223
+ test_data = MazeImageFolder(root=f'{data_root}/test/', which_set='test', maze_route_length=args.maze_route_length, expand_range=args.expand_range)
224
+
225
+ num_workers_test = 1 # Defaulting to 1, can be changed
226
+ trainloader = torch.utils.data.DataLoader(train_data, batch_size=args.batch_size, shuffle=True, num_workers=args.num_workers_train, drop_last=True)
227
+ testloader = torch.utils.data.DataLoader(test_data, batch_size=args.batch_size_test, shuffle=True, num_workers=num_workers_test, drop_last=False)
228
+
229
+ # (Lazy modules were already initialised by the dummy forward pass above, so the parameter count printed earlier is accurate.)
230
+
231
+
232
+ model.train()
233
+
234
+ # Optimizer and scheduler
235
+ decay_params = []
236
+ no_decay_params = []
237
+ no_decay_names = []
238
+ for name, param in model.named_parameters():
239
+ if not param.requires_grad:
240
+ continue # Skip parameters that don't require gradients
241
+ if any(exclusion_str in name for exclusion_str in args.weight_decay_exclusion_list):
242
+ no_decay_params.append(param)
243
+ no_decay_names.append(name)
244
+ else:
245
+ decay_params.append(param)
246
+ if len(no_decay_names):
247
+ print(f'WARNING, excluding: {no_decay_names}')
248
+
249
+ # Optimizer and scheduler (Common setup)
250
+ if len(no_decay_names) and args.weight_decay!=0:
251
+ optimizer = torch.optim.AdamW([{'params': decay_params, 'weight_decay':args.weight_decay},
252
+ {'params': no_decay_params, 'weight_decay':0}],
253
+ lr=args.lr,
254
+ eps=1e-8 if not args.use_amp else 1e-6)
255
+ else:
256
+ optimizer = torch.optim.AdamW(model.parameters(),
257
+ lr=args.lr,
258
+ eps=1e-8 if not args.use_amp else 1e-6,
259
+ weight_decay=args.weight_decay)
260
+
261
+ warmup_schedule = warmup(args.warmup_steps)
262
+ scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_schedule.step)
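+ # The warmup-only LambdaLR above is a fallback; it is replaced by the chosen scheduler below unless --no-use_scheduler is passed.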
263
+ if args.use_scheduler:
264
+ if args.scheduler_type == 'multistep':
265
+ scheduler = WarmupMultiStepLR(optimizer, warmup_steps=args.warmup_steps, milestones=args.milestones, gamma=args.gamma)
266
+ elif args.scheduler_type == 'cosine':
267
+ scheduler = WarmupCosineAnnealingLR(optimizer, args.warmup_steps, args.training_iterations, warmup_start_lr=1e-20, eta_min=1e-7)
268
+ else:
269
+ raise NotImplementedError
270
+
271
+
272
+ # Metrics tracking
273
+ start_iter = 0
274
+ train_losses = []
275
+ test_losses = []
276
+ train_accuracies = [] # Per tick/step accuracy list
277
+ test_accuracies = []
278
+ train_accuracies_most_certain = [] # Accuracy, fine-grained
279
+ test_accuracies_most_certain = []
280
+ train_accuracies_most_certain_permaze = [] # Full maze accuracy
281
+ test_accuracies_most_certain_permaze = []
282
+ iters = []
283
+
284
+ scaler = torch.amp.GradScaler("cuda" if "cuda" in device else "cpu", enabled=args.use_amp)
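+ # GradScaler is a pass-through when enabled=False, so scale/step/update can be called unconditionally below.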
285
+ if args.reload:
286
+ checkpoint_path = f'{args.log_dir}/checkpoint.pt'
287
+ if os.path.isfile(checkpoint_path):
288
+ print(f'Reloading from: {checkpoint_path}')
289
+ checkpoint = torch.load(checkpoint_path, map_location=device, weights_only=False)
290
+ if not args.strict_reload: print('WARNING: not using strict reload for model weights!')
291
+ load_result = model.load_state_dict(checkpoint['model_state_dict'], strict=args.strict_reload)
292
+ print(f" Loaded state_dict. Missing: {load_result.missing_keys}, Unexpected: {load_result.unexpected_keys}")
293
+
294
+ if not args.reload_model_only:
295
+ print('Reloading optimizer etc.')
296
+ optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
297
+ scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
298
+ scaler.load_state_dict(checkpoint['scaler_state_dict']) # Load scaler state
299
+ start_iter = checkpoint['iteration']
300
+
301
+ if not args.ignore_metrics_when_reloading:
302
+ train_losses = checkpoint['train_losses']
303
+ test_losses = checkpoint['test_losses']
304
+ train_accuracies = checkpoint['train_accuracies']
305
+ test_accuracies = checkpoint['test_accuracies']
306
+ iters = checkpoint['iters']
307
+ train_accuracies_most_certain = checkpoint['train_accuracies_most_certain']
308
+ test_accuracies_most_certain = checkpoint['test_accuracies_most_certain']
309
+ train_accuracies_most_certain_permaze = checkpoint['train_accuracies_most_certain_permaze']
310
+ test_accuracies_most_certain_permaze = checkpoint['test_accuracies_most_certain_permaze']
311
+ else:
312
+ print("Ignoring metrics history upon reload.")
313
+
314
+ else:
315
+ print('Only reloading model!')
316
+
317
+ if 'torch_rng_state' in checkpoint:
318
+ # Reset seeds
319
+ torch.set_rng_state(checkpoint['torch_rng_state'].cpu().byte())
320
+ np.random.set_state(checkpoint['numpy_rng_state'])
321
+ random.setstate(checkpoint['random_rng_state'])
322
+
323
+ del checkpoint
324
+ import gc
325
+ gc.collect()
326
+ if torch.cuda.is_available():
327
+ torch.cuda.empty_cache()
328
+
329
+ if args.do_compile:
330
+ print('Compiling...')
331
+ if hasattr(model, 'backbone'):
332
+ model.backbone = torch.compile(model.backbone, mode='reduce-overhead', fullgraph=True)
333
+ # Compile synapses only for CTM
334
+ if args.model == 'ctm':
335
+ model.synapses = torch.compile(model.synapses, mode='reduce-overhead', fullgraph=True)
336
+
337
+ # Training
338
+ iterator = iter(trainloader)
339
+ with tqdm(total=args.training_iterations, initial=start_iter, leave=False, position=0, dynamic_ncols=True) as pbar:
340
+ for bi in range(start_iter, args.training_iterations):
341
+ current_lr = optimizer.param_groups[-1]['lr']
342
+
343
+ try:
344
+ inputs, targets = next(iterator)
345
+ except StopIteration:
346
+ iterator = iter(trainloader)
347
+ inputs, targets = next(iterator)
348
+
349
+ inputs = inputs.to(device)
350
+ targets = targets.to(device) # Shape (B, SeqLength)
351
+
352
+ # All for nice metric printing:
353
+ loss = None
354
+ accuracy_finegrained = None # Per-step accuracy at chosen tick
355
+ where_most_certain_val = -1.0 # Default value
356
+ where_most_certain_std = 0.0
357
+ where_most_certain_min = -1
358
+ where_most_certain_max = -1
359
+ upto_where_mean = -1.0
360
+ upto_where_std = 0.0
361
+ upto_where_min = -1
362
+ upto_where_max = -1
363
+
364
+
365
+ # Model-specific forward, reshape, and loss calculation
366
+ with torch.autocast(device_type="cuda" if "cuda" in device else "cpu", dtype=torch.float16, enabled=args.use_amp):
367
+ if args.do_compile: # CUDAGraph marking applied if compiling any model
368
+ torch.compiler.cudagraph_mark_step_begin()
369
+
370
+ if args.model == 'ctm':
371
+ # CTM output: (B, SeqLength*5, Ticks), Certainties: (B, Ticks)
372
+ predictions_raw, certainties, synchronisation = model(inputs)
373
+ # Reshape predictions: (B, SeqLength, 5, Ticks)
374
+ predictions = predictions_raw.reshape(predictions_raw.size(0), -1, 5, predictions_raw.size(-1))
375
+ loss, where_most_certain, upto_where = maze_loss(predictions, certainties, targets, cirriculum_lookahead=args.cirriculum_lookahead, use_most_certain=True)
376
+ # Accuracy uses predictions[B, S, C, T] indexed at where_most_certain[B] -> gives (B, S, C) -> argmax(2) -> (B,S)
377
+ accuracy_finegrained = (predictions.argmax(2)[torch.arange(predictions.size(0), device=predictions.device), :, where_most_certain] == targets).float().mean().item()
378
+
379
+ elif args.model == 'lstm':
380
+ # LSTM output: (B, SeqLength*5, Ticks), Certainties: (B, Ticks)
381
+ predictions_raw, certainties, synchronisation = model(inputs)
382
+ # Reshape predictions: (B, SeqLength, 5, Ticks)
383
+ predictions = predictions_raw.reshape(predictions_raw.size(0), -1, 5, predictions_raw.size(-1))
384
+ loss, where_most_certain, upto_where = maze_loss(predictions, certainties, targets, cirriculum_lookahead=args.cirriculum_lookahead, use_most_certain=False)
385
+ # where_most_certain should be -1 (last tick) here. Accuracy calc follows same logic.
386
+ accuracy_finegrained = (predictions.argmax(2)[torch.arange(predictions.size(0), device=predictions.device), :, where_most_certain] == targets).float().mean().item()
387
+
388
+ elif args.model == 'ff':
389
+ # Assume FF output: (B, SeqLength*5)
390
+ predictions_raw = model(inputs)
391
+ # Reshape predictions: (B, SeqLength, 5)
392
+ predictions = predictions_raw.reshape(predictions_raw.size(0), -1, 5)
393
+ # FF has no certainties, pass None. maze_loss must handle this.
394
+ # Unsqueeze predictions for compatibility with the maze loss calculation
395
+ loss, where_most_certain, upto_where = maze_loss(predictions.unsqueeze(-1), None, targets, cirriculum_lookahead=args.cirriculum_lookahead, use_most_certain=False)
396
+ # where_most_certain should be -1 here. Accuracy uses 3D prediction tensor.
397
+ accuracy_finegrained = (predictions.argmax(2) == targets).float().mean().item()
398
+
399
+
400
+ # Extract stats from loss outputs if they are tensors
401
+ if torch.is_tensor(where_most_certain):
402
+ where_most_certain_val = where_most_certain.float().mean().item()
403
+ where_most_certain_std = where_most_certain.float().std().item()
404
+ where_most_certain_min = where_most_certain.min().item()
405
+ where_most_certain_max = where_most_certain.max().item()
406
+ elif isinstance(where_most_certain, int): # Handle case where it might return -1 directly
407
+ where_most_certain_val = float(where_most_certain)
408
+ where_most_certain_min = where_most_certain
409
+ where_most_certain_max = where_most_certain
410
+
411
+ if isinstance(upto_where, (np.ndarray, list)) and len(upto_where) > 0: # Check if it's a list/array
412
+ upto_where_mean = np.mean(upto_where)
413
+ upto_where_std = np.std(upto_where)
414
+ upto_where_min = np.min(upto_where)
415
+ upto_where_max = np.max(upto_where)
416
+
417
+
418
+ scaler.scale(loss).backward()
419
+
420
+ if args.gradient_clipping!=-1:
421
+ scaler.unscale_(optimizer)
422
+ torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=args.gradient_clipping)
423
+
424
+ scaler.step(optimizer)
425
+ scaler.update()
426
+ optimizer.zero_grad(set_to_none=True)
427
+ scheduler.step()
428
+
429
+ # Conditional Tqdm Description
430
+ pbar_desc = f'Loss={loss.item():0.3f}. Acc(step)={accuracy_finegrained:0.3f}. LR={current_lr:0.6f}.'
431
+ if args.model in ['ctm', 'lstm'] or torch.is_tensor(where_most_certain): # Show stats if available
432
+ pbar_desc += f' Where_certain={where_most_certain_val:0.2f}+-{where_most_certain_std:0.2f} ({where_most_certain_min:d}<->{where_most_certain_max:d}).'
433
+ if isinstance(upto_where, (np.ndarray, list)) and len(upto_where) > 0:
434
+ pbar_desc += f' Path pred stats: {upto_where_mean:0.2f}+-{upto_where_std:0.2f} ({upto_where_min:d} --> {upto_where_max:d})'
435
+
436
+ pbar.set_description(f'Dataset={args.dataset}. Model={args.model}. {pbar_desc}')
437
+
438
+
439
+ # Metrics tracking and plotting
440
+ if bi%args.track_every==0 and (bi != 0 or args.reload_model_only):
441
+ model.eval() # Use eval mode for consistency during tracking
442
+ with torch.inference_mode(): # Use inference mode for tracking
443
+
444
+
445
+
446
+
447
+ # --- Quantitative Metrics ---
448
+ iters.append(bi)
449
+ # Re-initialize metric lists for this evaluation step
450
+ current_train_losses_eval = []
451
+ current_test_losses_eval = []
452
+ current_train_accuracies_eval = []
453
+ current_test_accuracies_eval = []
454
+ current_train_accuracies_most_certain_eval = []
455
+ current_test_accuracies_most_certain_eval = []
456
+ current_train_accuracies_most_certain_permaze_eval = []
457
+ current_test_accuracies_most_certain_permaze_eval = []
458
+
459
+ # TRAIN METRICS
460
+ pbar.set_description('Tracking: Computing TRAIN metrics')
461
+ loader = torch.utils.data.DataLoader(train_data, batch_size=args.batch_size_test, shuffle=True, num_workers=num_workers_test) # Use consistent num_workers
462
+ all_targets_list = []
463
+ all_predictions_list = [] # Per step/tick predictions argmax (N, S, T) or (N, S)
464
+ all_predictions_most_certain_list = [] # Predictions at chosen step/tick argmax (N, S)
465
+ all_losses = []
466
+
467
+ with tqdm(total=len(loader), initial=0, leave=False, position=1, dynamic_ncols=True) as pbar_inner:
468
+ for inferi, (inputs, targets) in enumerate(loader):
469
+ inputs = inputs.to(device)
470
+ targets = targets.to(device)
471
+ all_targets_list.append(targets.detach().cpu().numpy()) # N x S
472
+
473
+ # Model-specific forward, reshape, loss for evaluation
474
+ if args.model == 'ctm':
475
+ predictions_raw, certainties, _ = model(inputs)
476
+ predictions = predictions_raw.reshape(predictions_raw.size(0), -1, 5, predictions_raw.size(-1)) # B,S,C,T
477
+ loss, where_most_certain, _ = maze_loss(predictions, certainties, targets, use_most_certain=True)
478
+ all_predictions_list.append(predictions.argmax(2).detach().cpu().numpy()) # B,S,C,T -> argmax class -> B,S,T
479
+ pred_at_certain = predictions.argmax(2)[torch.arange(predictions.size(0), device=predictions.device), :, where_most_certain] # B,S
480
+ all_predictions_most_certain_list.append(pred_at_certain.detach().cpu().numpy())
481
+
482
+ elif args.model == 'lstm':
483
+ predictions_raw, certainties, _ = model(inputs)
484
+ predictions = predictions_raw.reshape(predictions_raw.size(0), -1, 5, predictions_raw.size(-1)) # B,S,C,T
485
+ loss, where_most_certain, _ = maze_loss(predictions, certainties, targets, use_most_certain=False) # where = -1
486
+ all_predictions_list.append(predictions.argmax(2).detach().cpu().numpy()) # B,S,C,T
487
+ pred_at_certain = predictions.argmax(2)[torch.arange(predictions.size(0), device=predictions.device), :, where_most_certain] # B,S (at last tick)
488
+ all_predictions_most_certain_list.append(pred_at_certain.detach().cpu().numpy())
489
+
490
+ elif args.model == 'ff':
491
+ predictions_raw = model(inputs) # B, S*C
492
+ predictions = predictions_raw.reshape(predictions_raw.size(0), -1, 5) # B,S,C
493
+ loss, where_most_certain, _ = maze_loss(predictions.unsqueeze(-1), None, targets, use_most_certain=False) # where = -1
494
+ all_predictions_list.append(predictions.argmax(2).detach().cpu().numpy()) # B,S
495
+ all_predictions_most_certain_list.append(predictions.argmax(2).detach().cpu().numpy()) # B,S (same as above for FF)
496
+
497
+
498
+ all_losses.append(loss.item())
499
+
500
+ if args.n_test_batches != -1 and inferi >= args.n_test_batches -1 : break
501
+ pbar_inner.set_description(f'Computing metrics for train (Batch {inferi+1})')
502
+ pbar_inner.update(1)
503
+
504
+ all_targets = np.concatenate(all_targets_list) # N, S
505
+ all_predictions = np.concatenate(all_predictions_list) # N, S, T or N, S
506
+ all_predictions_most_certain = np.concatenate(all_predictions_most_certain_list) # N, S
507
+
508
+ train_losses.append(np.mean(all_losses))
509
+ # Calculate per step/tick accuracy averaged over batches
510
+ if args.model in ['ctm', 'lstm']:
511
+ # all_predictions shape (N, S, T), all_targets shape (N, S) -> compare targets to each tick prediction
512
+ train_accuracies.append(np.mean(all_predictions == all_targets[:,:,np.newaxis], axis=0)) # Mean over N -> (S, T)
513
+ else: # FF
514
+ # all_predictions shape (N, S), all_targets shape (N, S)
515
+ train_accuracies.append(np.mean(all_predictions == all_targets, axis=0)) # Mean over N -> (S,)
516
+
517
+ # Calculate accuracy at chosen step/tick ("most certain") averaged over all steps and batches
518
+ train_accuracies_most_certain.append((all_targets == all_predictions_most_certain).mean()) # Scalar
519
+ # Calculate full maze accuracy at chosen step/tick averaged over batches
520
+ train_accuracies_most_certain_permaze.append((all_targets == all_predictions_most_certain).reshape(all_targets.shape[0], -1).all(-1).mean()) # Scalar
521
+
522
+
523
+ # TEST METRICS
524
+ pbar.set_description('Tracking: Computing TEST metrics')
525
+ loader = torch.utils.data.DataLoader(test_data, batch_size=args.batch_size_test, shuffle=True, num_workers=num_workers_test)
526
+ all_targets_list = []
527
+ all_predictions_list = []
528
+ all_predictions_most_certain_list = []
529
+ all_losses = []
530
+
531
+ with tqdm(total=len(loader), initial=0, leave=False, position=1, dynamic_ncols=True) as pbar_inner:
532
+ for inferi, (inputs, targets) in enumerate(loader):
533
+ inputs = inputs.to(device)
534
+ targets = targets.to(device)
535
+ all_targets_list.append(targets.detach().cpu().numpy())
536
+
537
+ # Model-specific forward, reshape, loss for evaluation
538
+ if args.model == 'ctm':
539
+ predictions_raw, certainties, _ = model(inputs)
540
+ predictions = predictions_raw.reshape(predictions_raw.size(0), -1, 5, predictions_raw.size(-1)) # B,S,C,T
541
+ loss, where_most_certain, _ = maze_loss(predictions, certainties, targets, use_most_certain=True)
542
+ all_predictions_list.append(predictions.argmax(2).detach().cpu().numpy()) # B,S,T
543
+ pred_at_certain = predictions.argmax(2)[torch.arange(predictions.size(0), device=predictions.device), :, where_most_certain] # B,S
544
+ all_predictions_most_certain_list.append(pred_at_certain.detach().cpu().numpy())
545
+
546
+ elif args.model == 'lstm':
547
+ predictions_raw, certainties, _ = model(inputs)
548
+ predictions = predictions_raw.reshape(predictions_raw.size(0), -1, 5, predictions_raw.size(-1)) # B,S,C,T
549
+ loss, where_most_certain, _ = maze_loss(predictions, certainties, targets, use_most_certain=False) # where = -1
550
+ all_predictions_list.append(predictions.argmax(2).detach().cpu().numpy()) # B,S,T
551
+ pred_at_certain = predictions.argmax(2)[torch.arange(predictions.size(0), device=predictions.device), :, where_most_certain] # B,S (at last tick)
552
+ all_predictions_most_certain_list.append(pred_at_certain.detach().cpu().numpy())
553
+
554
+ elif args.model == 'ff':
555
+ predictions_raw = model(inputs) # B, S*C
556
+ predictions = predictions_raw.reshape(predictions_raw.size(0), -1, 5) # B,S,C
557
+ loss, where_most_certain, _ = maze_loss(predictions.unsqueeze(-1), None, targets, use_most_certain=False) # where = -1
558
+ all_predictions_list.append(predictions.argmax(2).detach().cpu().numpy()) # B,S
559
+ all_predictions_most_certain_list.append(predictions.argmax(2).detach().cpu().numpy()) # B,S (same as above for FF)
560
+
561
+
562
+ all_losses.append(loss.item())
563
+
564
+ if args.n_test_batches != -1 and inferi >= args.n_test_batches -1: break
565
+ pbar_inner.set_description(f'Computing metrics for test (Batch {inferi+1})')
566
+ pbar_inner.update(1)
567
+
568
+ all_targets = np.concatenate(all_targets_list)
569
+ all_predictions = np.concatenate(all_predictions_list)
570
+ all_predictions_most_certain = np.concatenate(all_predictions_most_certain_list)
571
+
572
+ test_losses.append(np.mean(all_losses))
573
+ # Calculate per step/tick accuracy
574
+ if args.model in ['ctm', 'lstm']:
575
+ test_accuracies.append(np.mean(all_predictions == all_targets[:,:,np.newaxis], axis=0)) # -> (S, T)
576
+ else: # FF
577
+ test_accuracies.append(np.mean(all_predictions == all_targets, axis=0)) # -> (S,)
578
+
579
+ # Calculate "most certain" accuracy
580
+ test_accuracies_most_certain.append((all_targets == all_predictions_most_certain).mean()) # Scalar
581
+ # Calculate full maze accuracy
582
+ test_accuracies_most_certain_permaze.append((all_targets == all_predictions_most_certain).reshape(all_targets.shape[0], -1).all(-1).mean()) # Scalar
583
+
584
+
585
+ # --- Plotting ---
586
+ # Accuracy Plot (Handling different dimensions)
587
+ figacc = plt.figure(figsize=(10, 10))
588
+ axacc_train = figacc.add_subplot(211)
589
+ axacc_test = figacc.add_subplot(212)
590
+ cm = sns.color_palette("viridis", as_cmap=True)
591
+
592
+ # Plot per step/tick accuracy
593
+ # train_accuracies is List[(S, T)] or List[(S,)]
594
+ # We need to average over S dimension for plotting
595
+ train_acc_plot = [np.mean(acc_s) for acc_s in train_accuracies] # (S, T) or (S,) arrays -> List[Scalar] after mean
596
+ test_acc_plot = [np.mean(acc_s) for acc_s in test_accuracies] # (S, T) or (S,) arrays -> List[Scalar] after mean
597
+
598
+ axacc_train.plot(iters, train_acc_plot, 'g-', alpha=0.5, label='Avg Step Acc')
599
+ axacc_test.plot(iters, test_acc_plot, 'g-', alpha=0.5, label='Avg Step Acc')
600
+
601
+
602
+ # Plot most certain accuracy
603
+ axacc_train.plot(iters, train_accuracies_most_certain, 'k--', alpha=0.7, label='Most Certain (Avg Step)')
604
+ axacc_test.plot(iters, test_accuracies_most_certain, 'k--', alpha=0.7, label='Most Certain (Avg Step)')
605
+ # Plot full maze accuracy
606
+ axacc_train.plot(iters, train_accuracies_most_certain_permaze, 'r-', alpha=0.6, label='Full Maze')
607
+ axacc_test.plot(iters, test_accuracies_most_certain_permaze, 'r-', alpha=0.6, label='Full Maze')
608
+
609
+ axacc_train.set_title('Train Accuracy')
610
+ axacc_test.set_title('Test Accuracy')
611
+ axacc_train.legend(loc='lower right')
612
+ axacc_test.legend(loc='lower right')
613
+ axacc_train.set_xlim([0, args.training_iterations])
614
+ axacc_test.set_xlim([0, args.training_iterations])
615
+ axacc_train.set_ylim([0, 1]) # Set Ylim for accuracy
616
+ axacc_test.set_ylim([0, 1])
617
+
618
+ figacc.tight_layout()
619
+ figacc.savefig(f'{args.log_dir}/accuracies.png', dpi=150)
620
+ plt.close(figacc)
621
+
622
+ # Loss Plot
623
+ figloss = plt.figure(figsize=(10, 5))
624
+ axloss = figloss.add_subplot(111)
625
+ axloss.plot(iters, train_losses, 'b-', linewidth=1, alpha=0.8, label=f'Train: {train_losses[-1]:.4f}')
626
+ axloss.plot(iters, test_losses, 'r-', linewidth=1, alpha=0.8, label=f'Test: {test_losses[-1]:.4f}')
627
+ axloss.legend(loc='upper right')
628
+ axloss.set_xlim([0, args.training_iterations])
629
+ axloss.set_ylim(bottom=0)
630
+
631
+ figloss.tight_layout()
632
+ figloss.savefig(f'{args.log_dir}/losses.png', dpi=150)
633
+ plt.close(figloss)
634
+
635
+ # --- Visualization Section (Conditional) ---
636
+ if args.model in ['ctm', 'lstm']:
637
+ # try:
638
+ inputs_viz, targets_viz = next(iter(testloader))
639
+ inputs_viz = inputs_viz.to(device)
640
+ targets_viz = targets_viz.to(device)
641
+ # Find longest path in batch for potentially better visualization
642
+ longest_index = (targets_viz!=4).sum(-1).argmax() # Action 4 assumed padding/end
643
+
644
+ # Track internal states
645
+ predictions_viz_raw, certainties_viz, _, pre_activations_viz, post_activations_viz, attention_tracking_viz = model(inputs_viz, track=True)
646
+
647
+ # Reshape predictions (assuming raw is B, D, T)
648
+ predictions_viz = predictions_viz_raw.reshape(predictions_viz_raw.size(0), -1, 5, predictions_viz_raw.size(-1)) # B, S, C, T
649
+
650
+ att_shape = (model.kv_features.shape[2], model.kv_features.shape[3])
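+ # Reshape the flattened attention weights back onto the backbone's spatial grid, presumably (ticks, batch, heads, H, W), for the maze GIF below.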
651
+ attention_tracking_viz = attention_tracking_viz.reshape(
652
+ attention_tracking_viz.shape[0],
653
+ attention_tracking_viz.shape[1], -1, att_shape[0], att_shape[1])
654
+
655
+ # Plot dynamics (common plotting function)
656
+ plot_neural_dynamics(post_activations_viz, 100, args.log_dir, axis_snap=True)
657
+
658
+ # Create maze GIF (task-specific plotting)
659
+ make_maze_gif((inputs_viz[longest_index].detach().cpu().numpy()+1)/2,
660
+ predictions_viz[longest_index].detach().cpu().numpy(), # Pass reshaped B,S,C,T -> S,C,T
661
+ targets_viz[longest_index].detach().cpu().numpy(), # S
662
+ attention_tracking_viz[:, longest_index], # Pass T, (H), H, W
663
+ args.log_dir)
664
+ # except Exception as e:
665
+ # print(f"Visualization failed for model {args.model}: {e}")
666
+ # --- End Visualization ---
667
+
668
+ model.train() # Switch back to train mode
669
+
670
+
671
+ # Save model checkpoint
672
+ if (bi % args.save_every == 0 or bi == args.training_iterations - 1) and bi != start_iter:
673
+ pbar.set_description('Saving model checkpoint...')
674
+ checkpoint_data = {
675
+ 'model_state_dict': model.state_dict(),
676
+ 'optimizer_state_dict': optimizer.state_dict(),
677
+ 'scheduler_state_dict': scheduler.state_dict(),
678
+ 'scaler_state_dict': scaler.state_dict(), # Save scaler state
679
+ 'iteration': bi,
680
+ # Save all tracked metrics
681
+ 'train_losses': train_losses,
682
+ 'test_losses': test_losses,
683
+ 'train_accuracies': train_accuracies, # List of (S, T) or (S,) arrays
684
+ 'test_accuracies': test_accuracies, # List of (S, T) or (S,) arrays
685
+ 'train_accuracies_most_certain': train_accuracies_most_certain, # List of scalars
686
+ 'test_accuracies_most_certain': test_accuracies_most_certain, # List of scalars
687
+ 'train_accuracies_most_certain_permaze': train_accuracies_most_certain_permaze, # List of scalars
688
+ 'test_accuracies_most_certain_permaze': test_accuracies_most_certain_permaze, # List of scalars
689
+ 'iters': iters,
690
+ 'args': args, # Save args used for this run
691
+ # RNG states
692
+ 'torch_rng_state': torch.get_rng_state(),
693
+ 'numpy_rng_state': np.random.get_state(),
694
+ 'random_rng_state': random.getstate(),
695
+ }
696
+ torch.save(checkpoint_data, f'{args.log_dir}/checkpoint.pt')
697
+
698
+ pbar.update(1)
tasks/mazes/train_distributed.py ADDED
@@ -0,0 +1,782 @@
1
+ import argparse
2
+ import os
3
+ import random
4
+ import gc
5
+
6
+ import matplotlib.pyplot as plt
7
+ import numpy as np
8
+ import seaborn as sns
9
+ sns.set_style('darkgrid')
10
+ import torch
11
+ if torch.cuda.is_available():
12
+ # Allow TF32 matmuls for faster training on supported GPUs
13
+ torch.set_float32_matmul_precision('high')
14
+ import torch.distributed as dist
15
+ from torch.nn.parallel import DistributedDataParallel as DDP
16
+ from torch.utils.data.distributed import DistributedSampler
17
+ from utils.samplers import FastRandomDistributedSampler
18
+ from tqdm.auto import tqdm
19
+
20
+ # Data/Task Specific Imports
21
+ from data.custom_datasets import MazeImageFolder
22
+
23
+ # Model Imports
24
+ from models.ctm import ContinuousThoughtMachine
25
+ from models.lstm import LSTMBaseline
26
+ from models.ff import FFBaseline
27
+
28
+ # Plotting/Utils Imports
29
+ from tasks.mazes.plotting import make_maze_gif
30
+ from tasks.image_classification.plotting import plot_neural_dynamics
31
+ from utils.housekeeping import set_seed, zip_python_code
32
+ from utils.losses import maze_loss
33
+ from utils.schedulers import WarmupCosineAnnealingLR, WarmupMultiStepLR, warmup
34
+
35
+ import torchvision
36
+ torchvision.disable_beta_transforms_warning()
37
+
38
+ import warnings
39
+ warnings.filterwarnings("ignore", message="using precomputed metric; inverse_transform will be unavailable")
40
+ warnings.filterwarnings('ignore', message='divide by zero encountered in power', category=RuntimeWarning)
41
+ warnings.filterwarnings(
42
+ "ignore",
43
+ "Corrupt EXIF data",
44
+ UserWarning,
45
+ r"^PIL\.TiffImagePlugin$"
46
+ )
47
+ warnings.filterwarnings(
48
+ "ignore",
49
+ "UserWarning: Metadata Warning",
50
+ UserWarning,
51
+ r"^PIL\.TiffImagePlugin$"
52
+ )
53
+ warnings.filterwarnings(
54
+ "ignore",
55
+ "UserWarning: Truncated File Read",
56
+ UserWarning,
57
+ r"^PIL\.TiffImagePlugin$"
58
+ )
59
+
60
+
61
+ def parse_args():
62
+ parser = argparse.ArgumentParser()
63
+
64
+ # Model Selection
65
+ parser.add_argument('--model', type=str, required=True, choices=['ctm', 'lstm', 'ff'], help='Model type to train.')
66
+
67
+ # Model Architecture
68
+ parser.add_argument('--d_model', type=int, default=512, help='Dimension of the model.')
69
+ parser.add_argument('--dropout', type=float, default=0.0, help='Dropout rate.')
70
+ parser.add_argument('--backbone_type', type=str, default='resnet34-2', help='Type of backbone featuriser.')
71
+ # CTM / LSTM specific
72
+ parser.add_argument('--d_input', type=int, default=128, help='Dimension of the input (CTM, LSTM).')
73
+ parser.add_argument('--heads', type=int, default=8, help='Number of attention heads (CTM, LSTM).')
74
+ parser.add_argument('--iterations', type=int, default=75, help='Number of internal ticks (CTM, LSTM).')
75
+ parser.add_argument('--positional_embedding_type', type=str, default='none',
76
+ help='Type of positional embedding (CTM, LSTM).', choices=['none',
77
+ 'learnable-fourier',
78
+ 'multi-learnable-fourier',
79
+ 'custom-rotational'])
80
+ # CTM specific
81
+ parser.add_argument('--synapse_depth', type=int, default=8, help='Depth of U-NET model for synapse. 1=linear, no unet (CTM only).')
82
+ parser.add_argument('--n_synch_out', type=int, default=32, help='Number of neurons to use for output synch (CTM only).')
83
+ parser.add_argument('--n_synch_action', type=int, default=32, help='Number of neurons to use for observation/action synch (CTM only).')
84
+ parser.add_argument('--neuron_select_type', type=str, default='random-pairing', help='Protocol for selecting neuron subset (CTM only).')
85
+ parser.add_argument('--n_random_pairing_self', type=int, default=0, help='Number of neurons paired self-to-self for synch (CTM only).')
86
+ parser.add_argument('--memory_length', type=int, default=25, help='Length of the pre-activation history for NLMS (CTM only).')
87
+ parser.add_argument('--deep_memory', action=argparse.BooleanOptionalAction, default=True, help='Use deep memory (CTM only).')
88
+ parser.add_argument('--memory_hidden_dims', type=int, default=32, help='Hidden dimensions of the memory if using deep memory (CTM only).')
89
+ parser.add_argument('--dropout_nlm', type=float, default=None, help='Dropout rate for NLMs specifically. Unset to match dropout on the rest of the model (CTM only).')
90
+ parser.add_argument('--do_normalisation', action=argparse.BooleanOptionalAction, default=False, help='Apply normalization in NLMs (CTM only).')
91
+ # LSTM specific
92
+ parser.add_argument('--num_layers', type=int, default=2, help='Number of LSTM stacked layers (LSTM only).')
93
+
94
+ # Task Specific Args
95
+ parser.add_argument('--maze_route_length', type=int, default=100, help='Length to truncate targets.')
96
+ parser.add_argument('--cirriculum_lookahead', type=int, default=5, help='How far to look ahead for the curriculum.')
97
+
98
+ # Training
99
+ parser.add_argument('--batch_size', type=int, default=16, help='Batch size for training (per GPU).')
100
+ parser.add_argument('--batch_size_test', type=int, default=64, help='Batch size for testing (per GPU).')
101
+ parser.add_argument('--lr', type=float, default=1e-4, help='Learning rate for the model.')
102
+ parser.add_argument('--training_iterations', type=int, default=100001, help='Number of training iterations.')
103
+ parser.add_argument('--warmup_steps', type=int, default=5000, help='Number of warmup steps.')
104
+ parser.add_argument('--use_scheduler', action=argparse.BooleanOptionalAction, default=True, help='Use a learning rate scheduler.')
105
+ parser.add_argument('--scheduler_type', type=str, default='cosine', choices=['multistep', 'cosine'], help='Type of learning rate scheduler.')
106
+ parser.add_argument('--milestones', type=int, default=[8000, 15000, 20000], nargs='+', help='Learning rate scheduler milestones.')
107
+ parser.add_argument('--gamma', type=float, default=0.1, help='Learning rate scheduler gamma for multistep.')
108
+ parser.add_argument('--weight_decay', type=float, default=0.0, help='Weight decay factor.')
109
+ parser.add_argument('--weight_decay_exclusion_list', type=str, nargs='+', default=[], help='List to exclude from weight decay. Typically good: bn, ln, bias, start')
110
+ parser.add_argument('--num_workers_train', type=int, default=0, help='Num workers training.')
111
+ parser.add_argument('--gradient_clipping', type=float, default=-1, help='Gradient norm clipping value (-1 to disable).')
112
+ parser.add_argument('--use_custom_sampler', action=argparse.BooleanOptionalAction, default=False, help='Use custom fast sampler to avoid reshuffling.')
113
+ parser.add_argument('--do_compile', action=argparse.BooleanOptionalAction, default=False, help='Try to compile model components.')
114
+
115
+ # Logging and Saving
116
+ parser.add_argument('--log_dir', type=str, default='logs/scratch', help='Directory for logging.')
117
+ parser.add_argument('--dataset', type=str, default='mazes-medium', help='Dataset to use.', choices=['mazes-medium', 'mazes-large'])
118
+ parser.add_argument('--save_every', type=int, default=1000, help='Save checkpoints every this many iterations.')
119
+ parser.add_argument('--seed', type=int, default=412, help='Random seed.')
120
+ parser.add_argument('--reload', action=argparse.BooleanOptionalAction, default=False, help='Reload from disk?')
121
+ parser.add_argument('--reload_model_only', action=argparse.BooleanOptionalAction, default=False, help='Reload only the model from disk?') # Default False based on user edit
122
+ parser.add_argument('--strict_reload', action=argparse.BooleanOptionalAction, default=False, help='Should use strict reload for model weights.')
123
+ parser.add_argument('--ignore_metrics_when_reloading', action=argparse.BooleanOptionalAction, default=False, help='Ignore metrics when reloading (for debugging)?')
124
+
125
+ # Tracking
126
+ parser.add_argument('--track_every', type=int, default=1000, help='Track metrics every this many iterations.')
127
+ parser.add_argument('--n_test_batches', type=int, default=2, help='How many minibatches to approx metrics. Set to -1 for full eval')
128
+
129
+ # Precision
130
+ parser.add_argument('--use_amp', action=argparse.BooleanOptionalAction, default=False, help='AMP autocast.')
131
+
132
+ args = parser.parse_args()
133
+ return args
134
+
135
+ # --- DDP Setup Functions ---
136
+ def setup_ddp():
137
+ if 'RANK' not in os.environ:
138
+ os.environ['RANK'] = '0'
139
+ os.environ['WORLD_SIZE'] = '1'
140
+ os.environ['MASTER_ADDR'] = 'localhost'
141
+ os.environ['MASTER_PORT'] = '12356' # Different port from image classification
142
+ os.environ['LOCAL_RANK'] = '0'
143
+ print("Running in non-distributed mode (simulated DDP setup).")
144
+ if not torch.cuda.is_available() or int(os.environ['WORLD_SIZE']) == 1:
145
+ dist.init_process_group(backend='gloo')
146
+ print("Initialized process group with Gloo backend for single/CPU process.")
147
+ rank = int(os.environ['RANK'])
148
+ world_size = int(os.environ['WORLD_SIZE'])
149
+ local_rank = int(os.environ['LOCAL_RANK'])
150
+ return rank, world_size, local_rank
151
+
152
+ dist.init_process_group(backend='nccl')
153
+ rank = int(os.environ['RANK'])
154
+ world_size = int(os.environ['WORLD_SIZE'])
155
+ local_rank = int(os.environ['LOCAL_RANK'])
156
+ if torch.cuda.is_available():
157
+ torch.cuda.set_device(local_rank)
158
+ print(f"Rank {rank} setup on GPU {local_rank}")
159
+ else:
160
+ print(f"Rank {rank} setup on CPU")
161
+ return rank, world_size, local_rank
162
+
163
+ def cleanup_ddp():
164
+ if dist.is_initialized():
165
+ dist.destroy_process_group()
166
+ print("DDP cleanup complete.")
167
+
168
+ def is_main_process(rank):
169
+ return rank == 0
170
+ # --- End DDP Setup ---
171
+
172
+
173
+ if __name__=='__main__':
174
+
175
+ args = parse_args()
176
+
177
+ rank, world_size, local_rank = setup_ddp()
178
+
179
+ set_seed(args.seed + rank, False)
180
+
181
+ # Rank 0 handles directory creation and initial logging
182
+ if is_main_process(rank):
183
+ if not os.path.exists(args.log_dir): os.makedirs(args.log_dir)
184
+ zip_python_code(f'{args.log_dir}/repo_state.zip')
185
+ with open(f'{args.log_dir}/args.txt', 'w') as f:
186
+ print(args, file=f)
187
+ if world_size > 1: dist.barrier()
188
+
189
+
190
+ assert args.dataset in ['mazes-medium', 'mazes-large']
191
+
192
+ # Setup Device
193
+ if torch.cuda.is_available():
194
+ device = torch.device(f'cuda:{local_rank}')
195
+ else:
196
+ device = torch.device('cpu')
197
+ if world_size > 1: warnings.warn("Running DDP on CPU is not recommended.")
198
+
199
+ if is_main_process(rank):
200
+ print(f'Main process (Rank {rank}): Using device {device}. World size: {world_size}. Model: {args.model}')
201
+
202
+
203
+ prediction_reshaper = [args.maze_route_length, 5]
204
+ args.out_dims = args.maze_route_length * 5
205
+
206
+ # --- Model Definition (Conditional) ---
207
+ model_base = None # Base model before DDP wrapping
208
+ if args.model == 'ctm':
209
+ model_base = ContinuousThoughtMachine(
210
+ iterations=args.iterations,
211
+ d_model=args.d_model,
212
+ d_input=args.d_input,
213
+ heads=args.heads,
214
+ n_synch_out=args.n_synch_out,
215
+ n_synch_action=args.n_synch_action,
216
+ synapse_depth=args.synapse_depth,
217
+ memory_length=args.memory_length,
218
+ deep_nlms=args.deep_memory,
219
+ memory_hidden_dims=args.memory_hidden_dims,
220
+ do_layernorm_nlm=args.do_normalisation,
221
+ backbone_type=args.backbone_type,
222
+ positional_embedding_type=args.positional_embedding_type,
223
+ out_dims=args.out_dims,
224
+ prediction_reshaper=prediction_reshaper,
225
+ dropout=args.dropout,
226
+ dropout_nlm=args.dropout_nlm,
227
+ neuron_select_type=args.neuron_select_type,
228
+ n_random_pairing_self=args.n_random_pairing_self,
229
+ ).to(device)
230
+ elif args.model == 'lstm':
231
+ model_base = LSTMBaseline(
232
+ num_layers=args.num_layers,
233
+ iterations=args.iterations,
234
+ d_model=args.d_model,
235
+ d_input=args.d_input,
236
+ heads=args.heads,
237
+ backbone_type=args.backbone_type,
238
+ positional_embedding_type=args.positional_embedding_type,
239
+ out_dims=args.out_dims,
240
+ prediction_reshaper=prediction_reshaper,
241
+ dropout=args.dropout,
242
+ ).to(device)
243
+ elif args.model == 'ff':
244
+ model_base = FFBaseline(
245
+ d_model=args.d_model,
246
+ backbone_type=args.backbone_type,
247
+ out_dims=args.out_dims,
248
+ dropout=args.dropout,
249
+ ).to(device)
250
+ else:
251
+ raise ValueError(f"Unknown model type: {args.model}")
252
+
253
+ # Use pseudo-input *before* DDP wrapping
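+ # (Assumption: lazy submodules need to be materialised before DDP wraps the model, otherwise parameter broadcasting would see uninitialised weights.)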
254
+ try:
255
+ # Determine pseudo input shape based on dataset
256
+ h_w = 39 if args.dataset in ['mazes-small', 'mazes-medium'] else 99 # Example dimensions
257
+ pseudo_inputs = torch.zeros((1, 3, h_w, h_w), device=device).float()
258
+ model_base(pseudo_inputs)
259
+ except Exception as e:
260
+ print(f"Warning: Pseudo forward pass failed: {e}")
261
+
262
+ if is_main_process(rank):
263
+ print(f'Total params: {sum(p.numel() for p in model_base.parameters() if p.requires_grad)}')
264
+
265
+ # Wrap model with DDP
266
+ if device.type == 'cuda' and world_size > 1:
267
+ model = DDP(model_base, device_ids=[local_rank], output_device=local_rank)
268
+ elif device.type == 'cpu' and world_size > 1:
269
+ model = DDP(model_base)
270
+ else:
271
+ model = model_base
272
+ # --- End Model Definition ---
273
+
274
+
275
+ # Data Loading (After model setup to allow pseudo pass first)
276
+ dataset_mean = [0,0,0]
277
+ dataset_std = [1,1,1]
278
+ which_maze = args.dataset.split('-')[-1]
279
+ data_root = f'data/mazes/{which_maze}'
280
+
281
+ train_data = MazeImageFolder(root=f'{data_root}/train/', which_set='train', maze_route_length=args.maze_route_length)
282
+ test_data = MazeImageFolder(root=f'{data_root}/test/', which_set='test', maze_route_length=args.maze_route_length)
283
+
284
+ train_sampler = (FastRandomDistributedSampler(train_data, num_replicas=world_size, rank=rank, seed=args.seed, epoch_steps=int(10e10))
285
+ if args.use_custom_sampler else
286
+ DistributedSampler(train_data, num_replicas=world_size, rank=rank, shuffle=True, seed=args.seed))
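+ # epoch_steps is set very large so the custom sampler effectively never restarts an epoch, avoiding reshuffles (see --use_custom_sampler help).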
287
+ test_sampler = DistributedSampler(test_data, num_replicas=world_size, rank=rank, shuffle=False, seed=args.seed)
288
+
289
+ num_workers_test = 1
290
+ trainloader = torch.utils.data.DataLoader(train_data, batch_size=args.batch_size, sampler=train_sampler,
291
+ num_workers=args.num_workers_train, pin_memory=True, drop_last=True)
292
+ testloader = torch.utils.data.DataLoader(test_data, batch_size=args.batch_size_test, sampler=test_sampler,
293
+ num_workers=num_workers_test, pin_memory=True, drop_last=False)
294
+
295
+
296
+ # Optimizer and scheduler
297
+ decay_params = []
298
+ no_decay_params = []
299
+ no_decay_names = []
300
+ for name, param in model.named_parameters():
301
+ if not param.requires_grad:
302
+ continue # Skip parameters that don't require gradients
303
+ if any(exclusion_str in name for exclusion_str in args.weight_decay_exclusion_list):
304
+ no_decay_params.append(param)
305
+ no_decay_names.append(name)
306
+ else:
307
+ decay_params.append(param)
308
+ if len(no_decay_names) and is_main_process(rank):
309
+ print(f'WARNING, excluding: {no_decay_names}')
310
+
311
+ # Optimizer and scheduler (Common setup)
312
+ if len(no_decay_names) and args.weight_decay!=0:
313
+ optimizer = torch.optim.AdamW([{'params': decay_params, 'weight_decay':args.weight_decay},
314
+ {'params': no_decay_params, 'weight_decay':0}],
315
+ lr=args.lr,
316
+ eps=1e-8 if not args.use_amp else 1e-6)
317
+ else:
318
+ optimizer = torch.optim.AdamW(model.parameters(),
319
+ lr=args.lr,
320
+ eps=1e-8 if not args.use_amp else 1e-6,
321
+ weight_decay=args.weight_decay)
322
+
323
+ warmup_schedule = warmup(args.warmup_steps)
324
+ scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_schedule.step)
325
+ if args.use_scheduler:
326
+ if args.scheduler_type == 'multistep':
327
+ scheduler = WarmupMultiStepLR(optimizer, warmup_steps=args.warmup_steps, milestones=args.milestones, gamma=args.gamma)
328
+ elif args.scheduler_type == 'cosine':
329
+ scheduler = WarmupCosineAnnealingLR(optimizer, args.warmup_steps, args.training_iterations, warmup_start_lr=1e-20, eta_min=1e-7)
330
+ else:
331
+ raise NotImplementedError
332
+
333
+
334
+ # Metrics tracking (Rank 0 stores history)
335
+ start_iter = 0
336
+ iters = []
337
+ train_losses, test_losses = [], []
338
+ train_accuracies, test_accuracies = [], [] # Avg Step Acc (scalar list)
339
+ train_accuracies_most_certain, test_accuracies_most_certain = [], [] # Avg Step Acc @ Certain tick (scalar list)
340
+ train_accuracies_most_certain_permaze, test_accuracies_most_certain_permaze = [], [] # Full Maze Acc @ Certain tick (scalar list)
341
+
342
+
343
+ scaler = torch.amp.GradScaler("cuda" if device.type == 'cuda' else "cpu", enabled=args.use_amp)
344
+
345
+ # Reloading Logic
346
+ if args.reload:
347
+ map_location = device
348
+ chkpt_path = f'{args.log_dir}/checkpoint.pt'
349
+ if os.path.isfile(chkpt_path):
350
+ print(f'Rank {rank}: Reloading from: {chkpt_path}')
351
+ if not args.strict_reload: print('WARNING: not using strict reload for model weights!')
352
+
353
+ checkpoint = torch.load(chkpt_path, map_location=map_location, weights_only=False)
354
+
355
+ model_to_load = model.module if isinstance(model, DDP) else model
356
+ state_dict = checkpoint['model_state_dict']
357
+ has_module_prefix = all(k.startswith('module.') for k in state_dict)
358
+ is_wrapped = isinstance(model, DDP)
359
+
360
+ if has_module_prefix and not is_wrapped:
361
+ state_dict = {k.partition('module.')[2]: v for k,v in state_dict.items()}
362
+ elif not has_module_prefix and is_wrapped:
363
+ load_result = model_to_load.load_state_dict(state_dict, strict=args.strict_reload)
364
+ print(f" Loaded state_dict. Missing: {load_result.missing_keys}, Unexpected: {load_result.unexpected_keys}")
365
+ state_dict = None # Prevent loading again
366
+
367
+ if state_dict is not None:
368
+ load_result = model_to_load.load_state_dict(state_dict, strict=args.strict_reload)
369
+ print(f" Loaded state_dict. Missing: {load_result.missing_keys}, Unexpected: {load_result.unexpected_keys}")
370
+
371
+
372
+
373
+ if not args.reload_model_only:
374
+ print(f'Rank {rank}: Reloading optimizer, scheduler, scaler, iteration.')
375
+ optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
376
+ scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
377
+ scaler.load_state_dict(checkpoint['scaler_state_dict'])
378
+ start_iter = checkpoint['iteration']
379
+
380
+ if is_main_process(rank) and not args.ignore_metrics_when_reloading:
381
+ print(f'Rank {rank}: Reloading metrics history.')
382
+ iters = checkpoint['iters']
383
+ train_losses = checkpoint['train_losses']
384
+ test_losses = checkpoint['test_losses']
385
+ train_accuracies = checkpoint['train_accuracies'] # Reloading simplified avg step acc list
386
+ test_accuracies = checkpoint['test_accuracies']
387
+ train_accuracies_most_certain = checkpoint['train_accuracies_most_certain']
388
+ test_accuracies_most_certain = checkpoint['test_accuracies_most_certain']
389
+ train_accuracies_most_certain_permaze = checkpoint['train_accuracies_most_certain_permaze']
390
+ test_accuracies_most_certain_permaze = checkpoint['test_accuracies_most_certain_permaze']
391
+ elif is_main_process(rank) and args.ignore_metrics_when_reloading:
392
+ print(f'Rank {rank}: Ignoring metrics history upon reload.')
393
+ else:
394
+ print(f'Rank {rank}: Only reloading model weights!')
395
+
396
+ if is_main_process(rank) and 'torch_rng_state' in checkpoint and not args.reload_model_only:
397
+ print(f'Rank {rank}: Loading RNG states.')
398
+ torch.set_rng_state(checkpoint['torch_rng_state'].cpu())
399
+ np.random.set_state(checkpoint['numpy_rng_state'])
400
+ random.setstate(checkpoint['random_rng_state'])
401
+
402
+ del checkpoint
403
+ gc.collect()
404
+ if torch.cuda.is_available():
405
+ torch.cuda.empty_cache()
406
+ print(f"Rank {rank}: Reload finished, starting from iteration {start_iter}")
407
+ else:
408
+ print(f"Rank {rank}: Checkpoint not found at {chkpt_path}, starting from scratch.")
409
+
410
+
411
+ if world_size > 1: dist.barrier()
412
+
413
+
414
+ # Conditional Compilation
415
+ if args.do_compile:
416
+ if is_main_process(rank): print('Compiling model components...')
417
+ model_to_compile = model.module if isinstance(model, DDP) else model
418
+ if hasattr(model_to_compile, 'backbone'):
419
+ model_to_compile.backbone = torch.compile(model_to_compile.backbone, mode='reduce-overhead', fullgraph=True)
420
+ if args.model == 'ctm':
421
+ model_to_compile.synapses = torch.compile(model_to_compile.synapses, mode='reduce-overhead', fullgraph=True)
422
+ if world_size > 1: dist.barrier()
423
+ if is_main_process(rank): print('Compilation finished.')
424
+
425
+
426
+ # --- Training Loop ---
427
+ model.train()
428
+ pbar = tqdm(total=args.training_iterations, initial=start_iter, leave=False, position=0, dynamic_ncols=True, disable=not is_main_process(rank))
429
+
430
+ iterator = iter(trainloader)
431
+
432
+ for bi in range(start_iter, args.training_iterations):
433
+
434
+ # --- Evaluation and Plotting (Rank 0 + Aggregation) ---
435
+ if bi % args.track_every == 0 and (bi != 0 or args.reload_model_only):
436
+ model.eval()
437
+ with torch.inference_mode():
438
+
439
+ # --- Distributed Evaluation ---
440
+ if is_main_process(rank): iters.append(bi) # Track iterations on rank 0
441
+
442
+ # Initialize accumulators on device
443
+ total_train_loss = torch.tensor(0.0, device=device)
444
+ total_train_correct_certain = torch.tensor(0.0, device=device) # Sum correct steps @ certain tick
445
+ total_train_mazes_solved = torch.tensor(0.0, device=device) # Sum solved mazes @ certain tick
446
+ total_train_steps = torch.tensor(0.0, device=device) # Total steps evaluated (B * S)
447
+ total_train_mazes = torch.tensor(0.0, device=device) # Total mazes evaluated (B)
448
+
449
+ # TRAIN METRICS
450
+ train_eval_sampler = DistributedSampler(train_data, num_replicas=world_size, rank=rank, shuffle=False)
451
+ train_eval_loader = torch.utils.data.DataLoader(train_data, batch_size=args.batch_size_test, sampler=train_eval_sampler, num_workers=num_workers_test, pin_memory=True)
452
+
453
+ pbar_inner_desc = 'Eval Train (Rank 0)' if is_main_process(rank) else None
454
+ with tqdm(total=len(train_eval_loader), desc=pbar_inner_desc, leave=False, position=1, dynamic_ncols=True, disable=not is_main_process(rank)) as pbar_inner:
455
+ for inferi, (inputs, targets) in enumerate(train_eval_loader):
456
+ inputs = inputs.to(device, non_blocking=True)
457
+ targets = targets.to(device, non_blocking=True) # B, S
458
+ batch_size = inputs.size(0)
459
+ seq_len = targets.size(1)
460
+
461
+ loss_eval = None
462
+ pred_at_certain = None # Shape B, S
463
+ if args.model == 'ctm':
464
+ predictions_raw, certainties, _ = model(inputs)
465
+ predictions = predictions_raw.reshape(batch_size, -1, 5, predictions_raw.size(-1)) # B,S,C,T
466
+ loss_eval, where_most_certain, _ = maze_loss(predictions, certainties, targets, use_most_certain=True)
467
+ pred_at_certain = predictions.argmax(2)[torch.arange(batch_size, device=device), :, where_most_certain]
468
+ elif args.model == 'lstm':
469
+ predictions_raw, certainties, _ = model(inputs)
470
+ predictions = predictions_raw.reshape(batch_size, -1, 5, predictions_raw.size(-1)) # B,S,C,T
471
+ loss_eval, where_most_certain, _ = maze_loss(predictions, certainties, targets, use_most_certain=False) # where = -1
472
+ pred_at_certain = predictions.argmax(2)[torch.arange(batch_size, device=device), :, where_most_certain]
473
+ elif args.model == 'ff':
474
+ predictions_raw = model(inputs) # B, S*C
475
+ predictions = predictions_raw.reshape(batch_size, -1, 5) # B,S,C
476
+ loss_eval, where_most_certain, _ = maze_loss(predictions.unsqueeze(-1), None, targets, use_most_certain=False) # where = -1
477
+ pred_at_certain = predictions.argmax(2)
478
+
479
+ # Accumulate metrics
480
+ total_train_loss += loss_eval * batch_size # Sum losses
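+ # (Assumes maze_loss returns a batch-mean loss; multiplying by batch_size converts it to a sum so the all_reduce below yields a true average.)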
481
+ correct_steps = (pred_at_certain == targets) # B, S boolean
482
+ total_train_correct_certain += correct_steps.sum() # Sum correct steps across batch
483
+ total_train_mazes_solved += correct_steps.all(dim=-1).sum() # Sum mazes where all steps are correct
484
+ total_train_steps += batch_size * seq_len
485
+ total_train_mazes += batch_size
486
+
487
+ if args.n_test_batches != -1 and inferi >= args.n_test_batches -1: break
488
+ pbar_inner.update(1)
489
+
490
+ # Aggregate Train Metrics
491
+ if world_size > 1:
492
+ dist.all_reduce(total_train_loss, op=dist.ReduceOp.SUM)
493
+ dist.all_reduce(total_train_correct_certain, op=dist.ReduceOp.SUM)
494
+ dist.all_reduce(total_train_mazes_solved, op=dist.ReduceOp.SUM)
495
+ dist.all_reduce(total_train_steps, op=dist.ReduceOp.SUM)
496
+ dist.all_reduce(total_train_mazes, op=dist.ReduceOp.SUM)
497
+
498
+ # Calculate final Train metrics on Rank 0
499
+ if is_main_process(rank) and total_train_mazes > 0:
500
+ avg_train_loss = total_train_loss.item() / total_train_mazes.item() # Avg loss per maze/sample
501
+ avg_train_acc_step = total_train_correct_certain.item() / total_train_steps.item() # Avg correct step %
502
+ avg_train_acc_maze = total_train_mazes_solved.item() / total_train_mazes.item() # Avg full maze solved %
503
+ train_losses.append(avg_train_loss)
504
+ train_accuracies_most_certain.append(avg_train_acc_step)
505
+ train_accuracies_most_certain_permaze.append(avg_train_acc_maze)
506
+ # train_accuracies list remains unused/placeholder for this simplified metric structure
507
+ print(f"Iter {bi} Train Metrics (Agg): Loss={avg_train_loss:.4f}, StepAcc={avg_train_acc_step:.4f}, MazeAcc={avg_train_acc_maze:.4f}")
508
+
509
+ # TEST METRICS
510
+ total_test_loss = torch.tensor(0.0, device=device)
511
+ total_test_correct_certain = torch.tensor(0.0, device=device)
512
+ total_test_mazes_solved = torch.tensor(0.0, device=device)
513
+ total_test_steps = torch.tensor(0.0, device=device)
514
+ total_test_mazes = torch.tensor(0.0, device=device)
515
+
516
+ pbar_inner_desc = 'Eval Test (Rank 0)' if is_main_process(rank) else None
517
+ with tqdm(total=len(testloader), desc=pbar_inner_desc, leave=False, position=1, dynamic_ncols=True, disable=not is_main_process(rank)) as pbar_inner:
518
+ for inferi, (inputs, targets) in enumerate(testloader):
519
+ inputs = inputs.to(device, non_blocking=True)
520
+ targets = targets.to(device, non_blocking=True)
521
+ batch_size = inputs.size(0)
522
+ seq_len = targets.size(1)
523
+
524
+ loss_eval = None
525
+ pred_at_certain = None
526
+ if args.model == 'ctm':
527
+ predictions_raw, certainties, _ = model(inputs)
528
+ predictions = predictions_raw.reshape(batch_size, -1, 5, predictions_raw.size(-1))
529
+ loss_eval, where_most_certain, _ = maze_loss(predictions, certainties, targets, use_most_certain=True)
530
+ pred_at_certain = predictions.argmax(2)[torch.arange(batch_size, device=device), :, where_most_certain]
531
+ elif args.model == 'lstm':
532
+ predictions_raw, certainties, _ = model(inputs)
533
+ predictions = predictions_raw.reshape(batch_size, -1, 5, predictions_raw.size(-1))
534
+ loss_eval, where_most_certain, _ = maze_loss(predictions, certainties, targets, use_most_certain=False)
535
+ pred_at_certain = predictions.argmax(2)[torch.arange(batch_size, device=device), :, where_most_certain]
536
+ elif args.model == 'ff':
537
+ predictions_raw = model(inputs)
538
+ predictions = predictions_raw.reshape(batch_size, -1, 5)
539
+ loss_eval, where_most_certain, _ = maze_loss(predictions.unsqueeze(-1), None, targets, use_most_certain=False)
540
+ pred_at_certain = predictions.argmax(2)
541
+
542
+ total_test_loss += loss_eval * batch_size
543
+ correct_steps = (pred_at_certain == targets)
544
+ total_test_correct_certain += correct_steps.sum()
545
+ total_test_mazes_solved += correct_steps.all(dim=-1).sum()
546
+ total_test_steps += batch_size * seq_len
547
+ total_test_mazes += batch_size
548
+
549
+ if args.n_test_batches != -1 and inferi >= args.n_test_batches -1: break
550
+ pbar_inner.update(1)
551
+
552
+ # Aggregate Test Metrics
553
+ if world_size > 1:
554
+ dist.all_reduce(total_test_loss, op=dist.ReduceOp.SUM)
555
+ dist.all_reduce(total_test_correct_certain, op=dist.ReduceOp.SUM)
556
+ dist.all_reduce(total_test_mazes_solved, op=dist.ReduceOp.SUM)
557
+ dist.all_reduce(total_test_steps, op=dist.ReduceOp.SUM)
558
+ dist.all_reduce(total_test_mazes, op=dist.ReduceOp.SUM)
559
+
560
+ # Calculate and Plot final Test metrics on Rank 0
561
+ if is_main_process(rank) and total_test_mazes > 0:
562
+ avg_test_loss = total_test_loss.item() / total_test_mazes.item()
563
+ avg_test_acc_step = total_test_correct_certain.item() / total_test_steps.item()
564
+ avg_test_acc_maze = total_test_mazes_solved.item() / total_test_mazes.item()
565
+ test_losses.append(avg_test_loss)
566
+ test_accuracies_most_certain.append(avg_test_acc_step)
567
+ test_accuracies_most_certain_permaze.append(avg_test_acc_maze)
568
+ print(f"Iter {bi} Test Metrics (Agg): Loss={avg_test_loss:.4f}, StepAcc={avg_test_acc_step:.4f}, MazeAcc={avg_test_acc_maze:.4f}\n")
569
+
570
+ # --- Plotting ---
571
+ figacc = plt.figure(figsize=(10, 10))
572
+ axacc_train = figacc.add_subplot(211)
573
+ axacc_test = figacc.add_subplot(212)
574
+
575
+ # Plot Avg Step Accuracy
576
+ axacc_train.plot(iters, train_accuracies_most_certain, 'k-', alpha=0.7, label=f'Avg Step Acc ({train_accuracies_most_certain[-1]:.3f})')
577
+ axacc_test.plot(iters, test_accuracies_most_certain, 'k-', alpha=0.7, label=f'Avg Step Acc ({test_accuracies_most_certain[-1]:.3f})')
578
+ # Plot Full Maze Accuracy
579
+ axacc_train.plot(iters, train_accuracies_most_certain_permaze, 'r-', alpha=0.6, label=f'Full Maze Acc ({train_accuracies_most_certain_permaze[-1]:.3f})')
580
+ axacc_test.plot(iters, test_accuracies_most_certain_permaze, 'r-', alpha=0.6, label=f'Full Maze Acc ({test_accuracies_most_certain_permaze[-1]:.3f})')
581
+
582
+ axacc_train.set_title('Train Accuracy (Aggregated)')
583
+ axacc_test.set_title('Test Accuracy (Aggregated)')
584
+ axacc_train.legend(loc='lower right')
585
+ axacc_test.legend(loc='lower right')
586
+ axacc_train.set_xlim([0, args.training_iterations])
587
+ axacc_test.set_xlim([0, args.training_iterations])
588
+ axacc_train.set_ylim([0, 1])
589
+ axacc_test.set_ylim([0, 1])
590
+
591
+ figacc.tight_layout()
592
+ figacc.savefig(f'{args.log_dir}/accuracies.png', dpi=150)
593
+ plt.close(figacc)
594
+
595
+ # Loss Plot
596
+ figloss = plt.figure(figsize=(10, 5))
597
+ axloss = figloss.add_subplot(111)
598
+ axloss.plot(iters, train_losses, 'b-', linewidth=1, alpha=0.8, label=f'Train (Agg): {train_losses[-1]:.4f}')
599
+ axloss.plot(iters, test_losses, 'r-', linewidth=1, alpha=0.8, label=f'Test (Agg): {test_losses[-1]:.4f}')
600
+ axloss.legend(loc='upper right')
601
+ axloss.set_xlabel("Iteration")
602
+ axloss.set_ylabel("Loss")
603
+ axloss.set_xlim([0, args.training_iterations])
604
+ axloss.set_ylim(bottom=0)
605
+ figloss.tight_layout()
606
+ figloss.savefig(f'{args.log_dir}/losses.png', dpi=150)
607
+ plt.close(figloss)
608
+ # --- End Plotting ---
609
+
610
+
611
+ # --- Visualization (Rank 0, Conditional) ---
612
+ if is_main_process(rank) and args.model in ['ctm', 'lstm']:
613
+ # try:
614
+ model_module = model.module if isinstance(model, DDP) else model
615
+ # Use a consistent batch for viz if possible, or just next batch
616
+ inputs_viz, targets_viz = next(iter(testloader))
617
+ inputs_viz = inputs_viz.to(device)
618
+ targets_viz = targets_viz.to(device)
619
+ longest_index = (targets_viz!=4).sum(-1).argmax() # 4 assumed padding
620
+
621
+ pbar.set_description('Tracking (Rank 0): Viz Fwd Pass')
622
+ predictions_viz_raw, _, _, _, post_activations_viz, attention_tracking_viz = model_module(inputs_viz, track=True)
623
+ predictions_viz = predictions_viz_raw.reshape(predictions_viz_raw.size(0), -1, 5, predictions_viz_raw.size(-1))
624
+
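+ # Reshape the flattened attention weights back onto the spatial grid of the backbone's key/value feature map for visualization.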
625
+ att_shape = (model_module.kv_features.shape[2], model_module.kv_features.shape[3])
626
+ attention_tracking_viz = attention_tracking_viz.reshape(
627
+ attention_tracking_viz.shape[0],
628
+ attention_tracking_viz.shape[1], -1, att_shape[0], att_shape[1])
629
+
630
+ pbar.set_description('Tracking (Rank 0): Dynamics Plot')
631
+ plot_neural_dynamics(post_activations_viz, 100, args.log_dir, axis_snap=True)
632
+
633
+ pbar.set_description('Tracking (Rank 0): Maze GIF')
634
+ if attention_tracking_viz is not None:
635
+ make_maze_gif((inputs_viz[longest_index].detach().cpu().numpy()+1)/2,
636
+ predictions_viz[longest_index].detach().cpu().numpy(),
637
+ targets_viz[longest_index].detach().cpu().numpy(),
638
+ attention_tracking_viz[:, longest_index],
639
+ args.log_dir)
640
+ # else:
641
+ # print("Skipping maze GIF due to attention shape issue.")
642
+
643
+ # except Exception as e_viz:
644
+ # print(f"Rank 0 visualization failed: {e_viz}")
645
+ # --- End Visualization ---
646
+
647
+ gc.collect()
648
+ if torch.cuda.is_available():
649
+ torch.cuda.empty_cache()
650
+ if world_size > 1: dist.barrier()
651
+ model.train()
652
+ # --- End Evaluation Block ---
653
+
654
+
655
+
656
+
657
+ if hasattr(train_sampler, 'set_epoch'): # Check if sampler has set_epoch
658
+ train_sampler.set_epoch(bi)
659
+
660
+ current_lr = optimizer.param_groups[-1]['lr']
661
+
662
+ try:
663
+ inputs, targets = next(iterator)
664
+ except StopIteration:
665
+ iterator = iter(trainloader)
666
+ inputs, targets = next(iterator)
667
+
668
+ inputs = inputs.to(device, non_blocking=True)
669
+ targets = targets.to(device, non_blocking=True)
670
+
671
+ # Defaults for logging
672
+ loss = torch.tensor(0.0, device=device) # Need loss defined for logging scope
673
+ accuracy_finegrained = 0.0
674
+ where_most_certain_val = -1.0
675
+ where_most_certain_std = 0.0
676
+ where_most_certain_min = -1
677
+ where_most_certain_max = -1
678
+ upto_where_mean = -1.0
679
+ upto_where_std = 0.0
680
+ upto_where_min = -1
681
+ upto_where_max = -1
682
+
683
+ with torch.autocast(device_type="cuda" if device.type == 'cuda' else "cpu", dtype=torch.float16, enabled=args.use_amp):
684
+ if args.do_compile: torch.compiler.cudagraph_mark_step_begin()
685
+
686
+ if args.model == 'ctm':
687
+ predictions_raw, certainties, _ = model(inputs)
688
+ predictions = predictions_raw.reshape(predictions_raw.size(0), -1, 5, predictions_raw.size(-1)) # B,S,C,T
689
+ loss, where_most_certain, upto_where = maze_loss(predictions, certainties, targets, cirriculum_lookahead=args.cirriculum_lookahead, use_most_certain=True)
690
+ with torch.no_grad(): # Calculate local accuracy for logging
691
+ accuracy_finegrained = (predictions.argmax(2)[torch.arange(predictions.size(0), device=device), :, where_most_certain] == targets).float().mean().item()
692
+ elif args.model == 'lstm':
693
+ predictions_raw, certainties, _ = model(inputs)
694
+ predictions = predictions_raw.reshape(predictions_raw.size(0), -1, 5, predictions_raw.size(-1)) # B,S,C,T
695
+ loss, where_most_certain, upto_where = maze_loss(predictions, certainties, targets, cirriculum_lookahead=args.cirriculum_lookahead, use_most_certain=False) # where = -1
696
+ with torch.no_grad():
697
+ accuracy_finegrained = (predictions.argmax(2)[torch.arange(predictions.size(0), device=device), :, where_most_certain] == targets).float().mean().item()
698
+ elif args.model == 'ff':
699
+ predictions_raw = model(inputs) # B, S*C
700
+ predictions = predictions_raw.reshape(predictions_raw.size(0), -1, 5) # B,S,C
701
+ loss, where_most_certain, upto_where = maze_loss(predictions.unsqueeze(-1), None, targets, cirriculum_lookahead=args.cirriculum_lookahead, use_most_certain=False) # where = -1
702
+ with torch.no_grad():
703
+ accuracy_finegrained = (predictions.argmax(2) == targets).float().mean().item()
704
+
705
+ # Extract stats from loss outputs
706
+ if torch.is_tensor(where_most_certain):
707
+ where_most_certain_val = where_most_certain.float().mean().item()
708
+ where_most_certain_std = where_most_certain.float().std().item()
709
+ where_most_certain_min = where_most_certain.min().item()
710
+ where_most_certain_max = where_most_certain.max().item()
711
+ elif isinstance(where_most_certain, int):
712
+ where_most_certain_val = float(where_most_certain); where_most_certain_min = where_most_certain; where_most_certain_max = where_most_certain
713
+ if isinstance(upto_where, (np.ndarray, list)) and len(upto_where) > 0:
714
+ upto_where_mean = np.mean(upto_where); upto_where_std = np.std(upto_where); upto_where_min = np.min(upto_where); upto_where_max = np.max(upto_where)
715
+
716
+ # Backprop / Step
717
+ scaler.scale(loss).backward()
718
+ if args.gradient_clipping!=-1:
719
+ scaler.unscale_(optimizer)
720
+ torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=args.gradient_clipping)
721
+ scaler.step(optimizer)
722
+ scaler.update()
723
+ optimizer.zero_grad(set_to_none=True)
724
+ scheduler.step()
725
+
726
+ # --- Aggregation and Logging (Rank 0) ---
727
+ loss_log = loss.detach()
728
+ if world_size > 1: dist.all_reduce(loss_log, op=dist.ReduceOp.AVG)
729
+
730
+ if is_main_process(rank):
731
+ pbar_desc = f'Loss(avg)={loss_log.item():.3f} Acc(loc)={accuracy_finegrained:.3f} LR={current_lr:.6f}'
732
+ if args.model in ['ctm', 'lstm'] or torch.is_tensor(where_most_certain):
733
+ pbar_desc += f' Cert={where_most_certain_val:.2f}'#+-{where_most_certain_std:.2f}' # Removed std for brevity
734
+ if isinstance(upto_where, (np.ndarray, list)) and len(upto_where) > 0:
735
+ pbar_desc += f' Path={upto_where_mean:.1f}'#+-{upto_where_std:.1f}'
736
+ pbar.set_description(f'{args.model.upper()} {pbar_desc}')
737
+ # --- End Aggregation and Logging ---
738
+
739
+
740
+
741
+
742
+
743
+ # --- Checkpointing (Rank 0) ---
744
+ if (bi % args.save_every == 0 or bi == args.training_iterations - 1) and bi != start_iter and is_main_process(rank):
745
+ pbar.set_description('Rank 0: Saving checkpoint...')
746
+ save_path = f'{args.log_dir}/checkpoint.pt'
747
+ model_state_to_save = model.module.state_dict() if isinstance(model, DDP) else model.state_dict()
748
+
749
+ checkpoint_data = {
750
+ 'model_state_dict': model_state_to_save,
751
+ 'optimizer_state_dict': optimizer.state_dict(),
752
+ 'scheduler_state_dict': scheduler.state_dict(),
753
+ 'scaler_state_dict': scaler.state_dict(),
754
+ 'iteration': bi,
755
+ 'train_losses': train_losses,
756
+ 'test_losses': test_losses,
757
+ 'train_accuracies': train_accuracies, # Saving simplified scalar list
758
+ 'test_accuracies': test_accuracies, # Saving simplified scalar list
759
+ 'train_accuracies_most_certain': train_accuracies_most_certain,
760
+ 'test_accuracies_most_certain': test_accuracies_most_certain,
761
+ 'train_accuracies_most_certain_permaze': train_accuracies_most_certain_permaze,
762
+ 'test_accuracies_most_certain_permaze': test_accuracies_most_certain_permaze,
763
+ 'iters': iters,
764
+ 'args': args,
765
+ 'torch_rng_state': torch.get_rng_state(),
766
+ 'numpy_rng_state': np.random.get_state(),
767
+ 'random_rng_state': random.getstate(),
768
+ }
769
+ torch.save(checkpoint_data, save_path)
770
+ # --- End Checkpointing ---
771
+
772
+
773
+ if world_size > 1: dist.barrier()
774
+
775
+ if is_main_process(rank):
776
+ pbar.update(1)
777
+ # --- End Training Loop ---
778
+
779
+ if is_main_process(rank):
780
+ pbar.close()
781
+
782
+ cleanup_ddp()
tasks/parity/README.md ADDED
@@ -0,0 +1,16 @@
1
+ # Parity
2
+
3
+ ## Training
4
+ To reproduce the parity training runs used in the paper, run the bash scripts from the root of the repository. For example, to train the 75-iteration, 25-memory-length CTM, run:
5
+
6
+ ```
7
+ bash tasks/parity/scripts/train_ctm_75_25.sh
8
+ ```
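+ 
+ The other scripts in `tasks/parity/scripts/` appear to follow the same `train_ctm_<iterations>_<memory_length>.sh` naming convention; for example, a 50-iteration, 25-memory-length run would (assuming that variant is present) be launched with:
+ 
+ ```
+ bash tasks/parity/scripts/train_ctm_50_25.sh
+ ```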
9
+
10
+
11
+ ## Analysis
12
+ To run the analysis, first make sure that checkpoints are saved in the log directory (specified by the `--log_dir` argument). Checkpoints can be obtained either by running the training code or by downloading them from [this link](https://drive.google.com/file/d/1itUS5_i9AyUo_7awllTx8X0PXYw9fnaG/view?usp=drive_link).
13
+
14
+ ```
15
+ python -m tasks.parity.analysis.run --log_dir <PATH_TO_LOG_DIR>
16
+ ```
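+ 
+ The analysis entry point loads checkpoints for you. To load a trained parity CTM yourself (e.g. in a notebook), a minimal sketch using the helper defined in `tasks/parity/analysis/run.py` is shown below; the checkpoint filename is illustrative.
+ 
+ ```
+ import torch
+ from tasks.parity.analysis.run import build_model_from_checkpoint_path
+ 
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ # Point this at a checkpoint file inside your log directory.
+ model, model_args = build_model_from_checkpoint_path(
+     "<PATH_TO_LOG_DIR>/checkpoint_200000.pt", "ctm", device=device)
+ model.eval()
+ ```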
tasks/parity/analysis/make_blog_gifs.py ADDED
@@ -0,0 +1,263 @@
1
+
2
+ import torch
3
+ import os
4
+ import math
5
+ import imageio
6
+ import numpy as np
7
+ import matplotlib.pyplot as plt
8
+ from matplotlib.patches import FancyArrowPatch
9
+ from scipy.special import softmax
10
+ import matplotlib.cm as cm
11
+ from data.custom_datasets import ParityDataset
12
+ import umap
13
+ from tqdm import tqdm
14
+
15
+
16
+ from models.utils import reshape_predictions
17
+ from tasks.parity.utils import reshape_inputs
18
+ from tasks.parity.analysis.run import build_model_from_checkpoint_path
19
+
20
+ from tasks.image_classification.analysis.build_imagenet_viz_blog import save_frames_to_mp4
21
+
22
+
23
+ def make_parity_gif(
24
+ predictions,
25
+ targets,
26
+ post_activations,
27
+ attention_weights,
28
+ inputs_to_model,
29
+ save_path,
30
+ umap_positions,
31
+ umap_point_scaler=1.0,
32
+ ):
33
+ batch_index = 0
34
+ figscale = 0.32
35
+ n_steps, n_heads, seqLen = attention_weights.shape[:3]
36
+ grid_side = int(np.sqrt(seqLen))
37
+ frames = []
38
+
39
+ inputs_this_batch = inputs_to_model[:, batch_index]
40
+ preds_this_batch = predictions[batch_index]
41
+ targets_this_batch = targets[batch_index]
42
+ post_act_this_batch = post_activations[:, batch_index]
43
+
44
+ # build a flexible mosaic
45
+ mosaic = [
46
+ [f"att_0", f"in_0", "probs", "probs", "target", "target"],
47
+ [f"att_1", f"in_1", "probs", "probs", "target", "target"],
48
+ ]
49
+ for h in range(2, n_heads):
50
+ mosaic.append(
51
+ [f"att_{h}", f"in_{h}", "umap", "umap",
52
+ "umap", "umap"]
53
+ )
54
+
55
+ for t in range(n_steps):
56
+ rows = len(mosaic)
57
+ cell_size = figscale * 4
58
+ fig_h = rows * cell_size
59
+
60
+ fig, ax = plt.subplot_mosaic(
61
+ mosaic,
62
+ figsize=(6 * cell_size, fig_h),
63
+ constrained_layout=False,
64
+ gridspec_kw={'wspace': 0.05, 'hspace': 0.05}, # small gaps
65
+ )
66
+ # restore a little margin
67
+ fig.subplots_adjust(left=0.02, right=0.98, top=0.98, bottom=0.02)
68
+
69
+ # probabilities heatmap
70
+ logits_t = preds_this_batch[:, :, t]
71
+ probs_t = softmax(logits_t, axis=1)[:, 0].reshape(grid_side, grid_side)
72
+ ax["probs"].imshow(probs_t, cmap="gray", vmin=0, vmax=1)
73
+ ax["probs"].axis("off")
74
+
75
+ # target overlay
76
+ ax["target"].imshow(
77
+ targets_this_batch.reshape(grid_side, grid_side),
78
+ cmap="gray_r", vmin=0, vmax=1
79
+ )
80
+ ax["target"].axis("off")
81
+ ax["target"].grid(which="minor", color="black", linestyle="-", linewidth=0.5)
82
+
83
+ z = post_act_this_batch[t]
84
+ low, high = np.percentile(z, 5), np.percentile(z, 95)
85
+ z_norm = np.clip((z - low) / (high - low), 0, 1)
86
+ point_sizes = (np.abs(z_norm - 0.5) * 100 + 5) * umap_point_scaler
87
+ cmap = plt.get_cmap("Spectral")
88
+ ax["umap"].scatter(
89
+ umap_positions[:, 0],
90
+ umap_positions[:, 1],
91
+ s=point_sizes,
92
+ c=cmap(z_norm),
93
+ alpha=0.8
94
+ )
95
+ ax["umap"].axis("off")
96
+
97
+
98
+ # normalize attention
99
+ att_t = attention_weights[t, :, :]
100
+ a_min, a_max = att_t.min(), att_t.max()
101
+ if not np.isclose(a_min, a_max):
102
+ att_t = (att_t - a_min) / (a_max - a_min + 1e-8)
103
+ else:
104
+ att_t = np.zeros_like(att_t)
105
+
106
+ # input image for arrows
107
+ img_t = inputs_this_batch[t].transpose(1, 2, 0)
108
+
109
+ if t == 0:
110
+ route_history = [[] for _ in range(n_heads)]
111
+
112
+ img_h, img_w = img_t.shape[:2]
113
+ cell_h = img_h // grid_side
114
+ cell_w = img_w // grid_side
115
+
116
+ for h in range(n_heads):
117
+ head_map = att_t[h].reshape(grid_side, grid_side)
118
+ ax[f"att_{h}"].imshow(head_map, cmap="viridis", vmin=0, vmax=1)
119
+ ax[f"att_{h}"].axis("off")
120
+ ax[f"in_{h}"].imshow(img_t, cmap="gray", vmin=0, vmax=1)
121
+ ax[f"in_{h}"].axis("off")
122
+
123
+ # track argmax center
124
+ flat_idx = np.argmax(head_map)
125
+ gy, gx = divmod(flat_idx, grid_side)
126
+ cx = int((gx + 0.5) * cell_w)
127
+ cy = int((gy + 0.5) * cell_h)
128
+ route_history[h].append((cx, cy))
129
+
130
+ cmap_steps = plt.colormaps.get_cmap("Spectral")
131
+ colors = [cmap_steps(i / (n_steps - 1)) for i in range(n_steps)]
132
+ for i in range(len(route_history[h]) - 1):
133
+ x0, y0 = route_history[h][i]
134
+ x1, y1 = route_history[h][i + 1]
135
+ color = colors[i]
136
+ is_last = (i == len(route_history[h]) - 2)
137
+ style = '->' if is_last else '-'
138
+ lw = 2.0 if is_last else 1.6
139
+ alpha = 1.0 if is_last else 0.9
140
+ scale = 10 if is_last else 1
141
+
142
+ # draw arrow
143
+ arr = FancyArrowPatch(
144
+ (x0, y0), (x1, y1),
145
+ arrowstyle=style,
146
+ linewidth=lw,
147
+ mutation_scale=scale,
148
+ alpha=alpha,
149
+ facecolor=color,
150
+ edgecolor=color,
151
+ shrinkA=0, shrinkB=0,
152
+ capstyle='round', joinstyle='round',
153
+ zorder=3 if is_last else 2,
154
+ clip_on=False,
155
+ )
156
+ ax[f"in_{h}"].add_patch(arr)
157
+
158
+ ax[f"in_{h}"].scatter(
159
+ x1, y1,
160
+ marker='x',
161
+ s=40,
162
+ color=color,
163
+ linewidths=lw,
164
+ zorder=4
165
+ )
166
+
167
+ canvas = fig.canvas
168
+ canvas.draw()
169
+ frame = np.frombuffer(canvas.buffer_rgba(), dtype=np.uint8)
170
+ w, h = canvas.get_width_height()
171
+ frames.append(frame.reshape(h, w, 4)[..., :3])
172
+ plt.close(fig)
173
+
174
+ # save gif
175
+ imageio.mimsave(f"{save_path}/activation.gif", frames, fps=15, loop=0)
176
+
177
+ # save mp4
178
+ save_frames_to_mp4(
179
+ [fm[:, :, ::-1] for fm in frames], # RGB→BGR
180
+ f"{save_path}/activation.mp4",
181
+ fps=15,
182
+ gop_size=1,
183
+ preset="slow"
184
+ )
185
+
186
+ def run_umap(model, testloader):
187
+ all_post_activations = []
188
+ point_counts = 150
189
+ sampled = 0
190
+ with tqdm(total=point_counts, desc="Collecting UMAP data") as pbar:
191
+ for inputs, _ in testloader:
192
+ for i in range(inputs.size(0)):
193
+ if sampled >= point_counts:
194
+ break
195
+ input_i = inputs[i].unsqueeze(0).to(device)
196
+ _, _, _, _, post_activations, _ = model(input_i, track=True)
197
+ all_post_activations.append(post_activations)
198
+ sampled += 1
199
+ pbar.update(1)
200
+ if sampled >= point_counts:
201
+ break
202
+
203
+ stacked = np.stack(all_post_activations, 1)
204
+ umap_features = stacked.reshape(-1, stacked.shape[-1])
205
+ reducer = umap.UMAP(
206
+ n_components=2,
207
+ n_neighbors=20,
208
+ min_dist=1,
209
+ spread=1,
210
+ metric='cosine',
211
+ local_connectivity=1
212
+ )
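+ # Note: the transpose below means each UMAP point is one neuron, embedded by its activation trace over ticks and sampled inputs.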
213
+ positions = reducer.fit_transform(umap_features.T)
214
+ return positions
215
+
216
+
217
+ def run_model_and_make_gif(checkpoint_path, save_path, device):
218
+
219
+ parity_sequence_length = 64
220
+ iterations = 75
221
+
222
+ test_data = ParityDataset(sequence_length=parity_sequence_length, length=10000)
223
+ testloader = torch.utils.data.DataLoader(test_data, batch_size=256, shuffle=True, num_workers=0, drop_last=False)
224
+
225
+
226
+ model, _ = build_model_from_checkpoint_path(checkpoint_path, "ctm", device=device)
227
+
228
+ input = torch.randint(0, 2, (64,), dtype=torch.float32, device=device) * 2 - 1
229
+ input = input.unsqueeze(0)
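+ # The target below is the cumulative parity: the count of -1 entries seen so far, modulo 2.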
230
+
231
+ target = torch.cumsum((input == -1).to(torch.long), dim=1) % 2
232
+ target = target.unsqueeze(0)
233
+
234
+ positions = run_umap(model, testloader)
235
+
236
+ model.eval()
237
+ with torch.inference_mode():
238
+ predictions, _, _, _, post_activations, attention = model(input, track=True)
239
+ predictions = reshape_predictions(predictions, prediction_reshaper=[parity_sequence_length, 2])
240
+ input_images = reshape_inputs(input, iterations, grid_size=int(math.sqrt(parity_sequence_length)))
241
+
242
+ make_parity_gif(
243
+ predictions=predictions.detach().cpu().numpy(),
244
+ targets=target.detach().cpu().numpy(),
245
+ post_activations=post_activations,
246
+ attention_weights=attention.squeeze(1).squeeze(2),
247
+ inputs_to_model=input_images,
248
+ save_path=save_path,
249
+ umap_positions=positions,
250
+ umap_point_scaler=1.0,
251
+ )
252
+
253
+
254
+
255
+ if __name__ == "__main__":
256
+
257
+ CHECKPOINT_PATH = "checkpoints/parity/run1/ctm_75_25/checkpoint_200000.pt"
258
+ SAVE_PATH = f"tasks/parity/analysis/outputs/blog_gifs/"
259
+ os.makedirs(SAVE_PATH, exist_ok=True)
260
+
261
+ device = "cuda" if torch.cuda.is_available() else "cpu"
262
+
263
+ run_model_and_make_gif(CHECKPOINT_PATH, SAVE_PATH, device)
tasks/parity/analysis/run.py ADDED
@@ -0,0 +1,269 @@
1
+ import torch
2
+ import numpy as np
3
+ import argparse
4
+ import multiprocessing
5
+ from tqdm import tqdm
6
+ import math
7
+ import os
8
+ import csv
9
+ from utils.housekeeping import set_seed
10
+ from data.custom_datasets import ParityDataset
11
+ from tasks.parity.utils import prepare_model, reshape_attention_weights, reshape_inputs, get_where_most_certain
12
+ from tasks.parity.plotting import plot_attention_trajectory, plot_input, plot_target, plot_probabilities, plot_prediction, plot_accuracy_training, create_attentions_heatmap_gif, create_accuracies_heatmap_gif, create_stacked_gif, plot_training_curve_all_runs, plot_accuracy_thinking_time, make_parity_gif, plot_lstm_last_and_certain_accuracy
13
+ from models.utils import compute_normalized_entropy, reshape_predictions, get_latest_checkpoint_file, get_checkpoint_files, load_checkpoint, get_model_args_from_checkpoint, get_all_log_dirs
14
+ from tasks.image_classification.plotting import plot_neural_dynamics
15
+
16
+ import seaborn as sns
17
+ sns.set_palette("hls")
18
+ sns.set_style('darkgrid')
19
+
20
+ def parse_args():
21
+ parser = argparse.ArgumentParser(description='Parity Analysis')
22
+ parser.add_argument('--log_dir', type=str, default='checkpoints/parity', help='Directory to save logs.')
23
+ parser.add_argument('--batch_size_test', type=int, default=128, help='batch size for testing')
24
+ parser.add_argument('--scale_training_curve', type=float, default=0.6, help='Scaling factor for plots.')
25
+ parser.add_argument('--scale_heatmap', type=float, default=0.4, help='Scaling factor for heatmap plots.')
26
+ parser.add_argument('--scale_training_index_accuracy', type=float, default=0.4, help='Scaling factor for training index accuracy plots.')
27
+ parser.add_argument('--seed', type=int, default=0, help='Random seed for reproducibility.')
28
+ parser.add_argument('--device', type=int, nargs='+', default=[-1], help='List of GPU(s) to use. Set to -1 to use CPU.')
29
+ parser.add_argument('--model_type', type=str, choices=['ctm', 'lstm'], default='ctm', help='Type of model to analyze (ctm or lstm).')
30
+ return parser.parse_args()
31
+
32
+ def calculate_corrects(predictions, targets):
33
+ predicted_labels = predictions.argmax(2)
34
+ accuracy = (predicted_labels == targets.unsqueeze(-1))
35
+ return accuracy.detach().cpu().numpy()
36
+
37
+ def get_corrects_per_element_at_most_certain_time(predictions, certainty, targets):
38
+ where_most_certain = get_where_most_certain(certainty)
39
+ corrects = (predictions.argmax(2)[torch.arange(predictions.size(0), device=predictions.device),:,where_most_certain] == targets).float()
40
+ return corrects.detach().cpu().numpy()
41
+
42
+ def calculate_entropy_average_over_batch(normalized_entropy_per_elements):
43
+ normalized_entropy_per_elements_avg_batch = normalized_entropy_per_elements.mean(axis=1)
44
+ return normalized_entropy_per_elements_avg_batch
45
+
46
+ def calculate_thinking_time_average_over_batch(normalized_entropy_per_elements):
47
+ first_occurrence = calculate_thinking_time(normalized_entropy_per_elements)
48
+ average_thinking_time = np.mean(first_occurrence, axis=0)
49
+ return average_thinking_time
50
+
51
+ def calculate_thinking_time(normalized_entropy_per_elements, finish_type="min", entropy_threshold=0.1):
52
+ if finish_type == "min":
53
+ min_entropy_time = np.argmin(normalized_entropy_per_elements, axis=0)
54
+ return min_entropy_time
55
+ elif finish_type == "threshold":
56
+ T, B, S = normalized_entropy_per_elements.shape
57
+ below_threshold = normalized_entropy_per_elements < entropy_threshold
58
+ first_occurrence = np.argmax(below_threshold, axis=0)
59
+ no_true = ~np.any(below_threshold, axis=0)
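+ # Elements whose entropy never drops below the threshold are assigned the maximum thinking time T.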
60
+ first_occurrence[no_true] = T
61
+ return first_occurrence
62
+
63
+ def test_handcrafted_examples(model, args, run_model_spefic_save_dir, device):
64
+ test_cases = []
65
+ all_even_input = torch.full((args.parity_sequence_length,), 1.0, dtype=torch.float32, device=device)
66
+ all_even_target = torch.zeros_like(all_even_input, dtype=torch.long)
67
+ test_cases.append((all_even_input, all_even_target))
68
+
69
+ all_odd_input = torch.full((args.parity_sequence_length,), -1.0, dtype=torch.float32, device=device)
70
+ all_odd_target = torch.cumsum((all_odd_input == -1).to(torch.long), dim=0) % 2
71
+ test_cases.append((all_odd_input, all_odd_target))
72
+
73
+ random_input = torch.randint(0, 2, (args.parity_sequence_length,), dtype=torch.float32, device=device) * 2 - 1
74
+ random_target = torch.cumsum((random_input == -1).to(torch.long), dim=0) % 2
75
+ test_cases.append((random_input, random_target))
76
+
77
+ for i, (inputs, targets) in enumerate(test_cases):
78
+ inputs = inputs.unsqueeze(0)
79
+ targets = targets.unsqueeze(0)
80
+ filename = f"eval_handcrafted_{i}"
81
+ extend_inference_time = False
82
+ handcraft_dir = f"{run_model_spefic_save_dir}/handcrafted_examples/{i}"
83
+ os.makedirs(handcraft_dir, exist_ok=True)
84
+
85
+ model.eval()
86
+ with torch.inference_mode():
87
+ if extend_inference_time:
88
+ model.iterations = model.iterations * 2
89
+ predictions, certainties, synchronisation, pre_activations, post_activations, attention = model(inputs, track=True)
90
+ predictions = reshape_predictions(predictions, prediction_reshaper=[args.parity_sequence_length, 2])
91
+ input_images = reshape_inputs(inputs, args.iterations, grid_size=int(math.sqrt(args.parity_sequence_length)))
92
+
93
+ plot_neural_dynamics(post_activations, 100, handcraft_dir, axis_snap=False)
94
+
95
+ process = multiprocessing.Process(
96
+ target=make_parity_gif,
97
+ args=(
98
+ predictions.detach().cpu().numpy(),
99
+ certainties.detach().cpu().numpy(),
100
+ targets.detach().cpu().numpy(),
101
+ pre_activations,
102
+ post_activations,
103
+ reshape_attention_weights(attention),
104
+ input_images,
105
+ f"{handcraft_dir}/eval_output_val_{0}_iter_{0}.gif",
106
+ ))
107
+ process.start()
108
+
109
+
110
+ input_images = input_images.squeeze(1).squeeze(1)
111
+ attention = attention.squeeze(1)
112
+
113
+ for h in range(args.heads):
114
+ plot_attention_trajectory(attention[:, h, :, :], certainties, input_images, handcraft_dir, filename + f"_head_{h}", args)
115
+
116
+ plot_attention_trajectory(attention.mean(1), certainties, input_images, handcraft_dir, filename, args)
117
+ plot_input(input_images, handcraft_dir, filename)
118
+ plot_target(targets, handcraft_dir, filename, args)
119
+ plot_probabilities(predictions, certainties, handcraft_dir, filename, args)
120
+ plot_prediction(predictions, certainties, handcraft_dir, filename, args)
121
+
122
+ if extend_inference_time:
123
+ model.iterations = model.iterations // 2
124
+ model.train()
125
+ pass
126
+
127
+ def build_model_from_checkpoint_path(checkpoint_path, model_type, device="cpu"):
128
+ checkpoint = load_checkpoint(checkpoint_path, device)
129
+ model_args = get_model_args_from_checkpoint(checkpoint)
130
+ model = prepare_model([model_args.parity_sequence_length, 2], model_args, device)
131
+ model.load_state_dict(checkpoint["model_state_dict"], strict=False)
132
+ return model, model_args
133
+
134
+ def analyze_trained_model(run_model_spefic_save_dir, args, device):
135
+ with torch.no_grad():
136
+
137
+ latest_checkpoint_path = get_latest_checkpoint_file(args.log_dir)
138
+ model, model_args = build_model_from_checkpoint_path(latest_checkpoint_path, args.model_type, device=device)
139
+ model.eval()
140
+ model_args.log_dir = args.log_dir
141
+ test_data = ParityDataset(sequence_length=model_args.parity_sequence_length, length=10000)
142
+ testloader = torch.utils.data.DataLoader(test_data, batch_size=args.batch_size_test, shuffle=True, num_workers=0, drop_last=False)
143
+
144
+ corrects, corrects_at_most_certain_times, entropys, attentions = [], [], [], []
145
+
146
+ for inputs, targets in testloader:
147
+ inputs = inputs.to(device)
148
+ targets = targets.to(device)
149
+ predictions, certainties, synchronisation, pre_activations, post_activations, attention = model(inputs, track=True)
150
+ predictions = reshape_predictions(predictions, prediction_reshaper=[model_args.parity_sequence_length, 2])
151
+ corrects_batch = calculate_corrects(predictions, targets)
152
+ corrects_at_most_certain_time_batch = get_corrects_per_element_at_most_certain_time(predictions, certainties, targets)
153
+ corrects.append(corrects_batch)
154
+ corrects_at_most_certain_times.append(corrects_at_most_certain_time_batch)
155
+ attentions.append(attention)
156
+
157
+ test_handcrafted_examples(model, model_args, run_model_spefic_save_dir, device)
158
+
159
+ overall_mean_accuracy = np.mean(np.vstack(corrects_at_most_certain_times))
160
+ overall_std_accuracy = np.std(np.mean(np.vstack(corrects_at_most_certain_times), axis=1))
161
+
162
+ return overall_mean_accuracy, overall_std_accuracy, model_args.iterations
163
+
164
+ def analyze_training(run_model_spefic_save_dir, args, device):
165
+ checkpoint_files = get_checkpoint_files(args.log_dir)
166
+ all_accuracies = []
167
+ all_accuracies_at_most_certain_time = []
168
+ all_average_thinking_times = []
169
+ all_std_thinking_times = []
170
+ all_attentions = []
171
+ for checkpoint_path in checkpoint_files:
172
+ model, model_args = build_model_from_checkpoint_path(checkpoint_path, args.model_type, device=device)
173
+ model_args.log_dir = run_model_spefic_save_dir
174
+ test_data = ParityDataset(sequence_length=model_args.parity_sequence_length, length=1000)
175
+ testloader = torch.utils.data.DataLoader(test_data, batch_size=args.batch_size_test, shuffle=True, num_workers=0, drop_last=False)
176
+ corrects = []
177
+ corrects_at_most_certain_times = []
178
+ thinking_times = []
179
+ attentions = []
180
+
181
+ for inputs, targets in testloader:
182
+ inputs = inputs.to(device)
183
+ targets = targets.to(device)
184
+ predictions, certainties, synchronisation, pre_activations, post_activations, attention = model(inputs, track=True)
185
+ predictions = reshape_predictions(predictions, prediction_reshaper=[model_args.parity_sequence_length, 2])
186
+ attention = reshape_attention_weights(attention)
187
+
188
+ corrects_batch = calculate_corrects(predictions, targets)
189
+ corrects_at_most_certain_time_batch = get_corrects_per_element_at_most_certain_time(predictions, certainties, targets)
190
+ entropy_per_element = compute_normalized_entropy(predictions.permute(0,3,1,2), reduction='none').detach().cpu().numpy()
191
+ thinking_times_batch = np.argmin(entropy_per_element, axis=1)
192
+
193
+ corrects.append(corrects_batch)
194
+ corrects_at_most_certain_times.append(corrects_at_most_certain_time_batch)
195
+ thinking_times.append(thinking_times_batch)
196
+ attentions.append(attention)
197
+
198
+ checkpoint_average_accuracies = np.mean(np.concatenate(corrects, axis=0), axis=0).transpose(1,0)
199
+ all_accuracies.append(checkpoint_average_accuracies)
200
+
201
+ stacked_corrects_at_most_certain_times = np.vstack(corrects_at_most_certain_times)
202
+ checkpoint_average_accuracy_at_most_certain_time = np.mean(stacked_corrects_at_most_certain_times, axis=0)
203
+ all_accuracies_at_most_certain_time.append(checkpoint_average_accuracy_at_most_certain_time)
204
+
205
+ checkpoint_thinking_times = np.concatenate(thinking_times, axis=0)
206
+ checkpoint_average_thinking_time = np.mean(checkpoint_thinking_times, axis=0)
207
+ checkpoint_std_thinking_time = np.std(checkpoint_thinking_times, axis=0)
208
+ all_average_thinking_times.append(checkpoint_average_thinking_time)
209
+ all_std_thinking_times.append(checkpoint_std_thinking_time)
210
+
211
+ checkpoint_average_attentions = np.mean(np.concatenate(attentions, axis=1), axis=1)
212
+ all_attentions.append(checkpoint_average_attentions)
213
+
214
+ plot_accuracy_training(all_accuracies_at_most_certain_time, args.scale_training_index_accuracy, run_model_spefic_save_dir, args=model_args)
215
+ create_attentions_heatmap_gif(all_attentions, args.scale_heatmap, run_model_spefic_save_dir, model_args)
216
+ create_accuracies_heatmap_gif(np.array(all_accuracies), all_average_thinking_times, all_std_thinking_times, args.scale_heatmap, run_model_spefic_save_dir, model_args)
217
+ create_stacked_gif(run_model_spefic_save_dir)
218
+
219
+ def get_accuracy_and_loss_from_checkpoint(checkpoint):
220
+ training_iteration = checkpoint.get('training_iteration', 0)
221
+ train_losses = checkpoint.get('train_losses', [])
222
+ test_losses = checkpoint.get('test_losses', [])
223
+ train_accuracies = checkpoint.get('train_accuracies_most_certain', [])
224
+ test_accuracies = checkpoint.get('test_accuracies_most_certain', [])
225
+ return training_iteration, train_losses, test_losses, train_accuracies, test_accuracies
226
+
227
+ if __name__ == "__main__":
228
+
229
+ args = parse_args()
230
+
231
+ device = f'cuda:{args.device[0]}' if args.device[0] != -1 else 'cpu'
232
+
233
+ set_seed(args.seed)
234
+
235
+ save_dir = "tasks/parity/analysis/outputs"
236
+ os.makedirs(save_dir, exist_ok=True)
237
+
238
+ accuracy_csv_file_path = os.path.join(save_dir, "accuracy.csv")
239
+ if os.path.exists(accuracy_csv_file_path):
240
+ os.remove(accuracy_csv_file_path)
241
+
242
+ all_runs_log_dirs = get_all_log_dirs(args.log_dir)
243
+
244
+ plot_training_curve_all_runs(all_runs_log_dirs, save_dir, args.scale_training_curve, device, x_max=200_000)
245
+ plot_lstm_last_and_certain_accuracy(all_folders=all_runs_log_dirs, save_path=f"{save_dir}/lstm_final_vs_certain_accuracy.png", scale=args.scale_training_curve)
246
+
247
+ progress_bar = tqdm(all_runs_log_dirs, desc="Analyzing Runs", dynamic_ncols=True)
248
+ for folder in progress_bar:
249
+
250
+ run, model_name = folder.strip("/").split("/")[-2:]
251
+
252
+ run_model_spefic_save_dir = f"{save_dir}/{model_name}/{run}"
253
+ os.makedirs(run_model_spefic_save_dir, exist_ok=True)
254
+
255
+ args.log_dir = folder
256
+ progress_bar.set_description(f"Analyzing Trained Model at {folder}")
257
+
258
+ accuracy_mean, accuracy_std, num_iterations = analyze_trained_model(run_model_spefic_save_dir, args, device)
259
+
260
+ with open(accuracy_csv_file_path, mode='a', newline='') as file:
261
+ writer = csv.writer(file)
262
+ if file.tell() == 0:
263
+ writer.writerow(["Run", "Overall Mean Accuracy", "Overall Std Accuracy", "Num Iterations"])
264
+ writer.writerow([folder, accuracy_mean, accuracy_std, num_iterations])
265
+
266
+ progress_bar.set_description(f"Analyzing Training at {folder}")
267
+ analyze_training(run_model_spefic_save_dir, args, device)
268
+
269
+ plot_accuracy_thinking_time(accuracy_csv_file_path, scale=args.scale_training_curve, output_dir=save_dir)
tasks/parity/plotting.py ADDED
@@ -0,0 +1,896 @@
1
+ import os
2
+ import seaborn as sns
3
+ import numpy as np
4
+ import pandas as pd
5
+ from collections import defaultdict
6
+ from matplotlib.lines import Line2D
7
+ import matplotlib as mpl
8
+ import matplotlib.pyplot as plt
9
+ import matplotlib.patheffects as path_effects
10
+ from matplotlib.ticker import FuncFormatter
11
+ from scipy.special import softmax
12
+ import imageio.v2 as imageio
13
+ from PIL import Image
14
+ import math
15
+ import re
16
+ sns.set_style('darkgrid')
17
+ mpl.use('Agg')
18
+
19
+ from tasks.parity.utils import get_where_most_certain, parse_folder_name
20
+ from models.utils import get_latest_checkpoint_file, load_checkpoint, get_model_args_from_checkpoint, get_accuracy_and_loss_from_checkpoint
21
+ from tasks.image_classification.plotting import save_frames_to_mp4
22
+
23
+ def make_parity_gif(predictions, certainties, targets, pre_activations, post_activations, attention_weights, inputs_to_model, filename):
24
+
25
+ # Config
26
+ batch_index = 0
27
+ n_neurons_to_visualise = 16
28
+ figscale = 0.28
29
+ n_steps = len(pre_activations)
30
+ frames = []
31
+ heatmap_cmap = sns.color_palette("viridis", as_cmap=True)
32
+
33
+ these_pre_acts = pre_activations[:, batch_index, :] # Shape: (T, H)
34
+ these_post_acts = post_activations[:, batch_index, :] # Shape: (T, H)
35
+ these_inputs = inputs_to_model[:, batch_index, :, :, :] # Shape: (T, C, H, W)
36
+ these_predictions = predictions[batch_index, :, :, :] # Shape: (d, C, T)
37
+ these_certainties = certainties[batch_index, :, :] # Shape: (C, T)
38
+ these_attention_weights = attention_weights[:, batch_index, :, :]
39
+
40
+ # Create mosaic layout
41
+ mosaic = [['img_data', 'img_data', 'attention', 'attention', 'probs', 'probs', 'target', 'target'] for _ in range(2)] + \
42
+ [['img_data', 'img_data', 'attention', 'attention', 'probs', 'probs', 'target', 'target'] for _ in range(2)] + \
43
+ [['certainty', 'certainty', 'certainty', 'certainty', 'certainty', 'certainty', 'certainty', 'certainty']] + \
44
+ [[f'trace_{ti}', f'trace_{ti}', f'trace_{ti}', f'trace_{ti}', f'trace_{ti}', f'trace_{ti}', f'trace_{ti}', f'trace_{ti}'] for ti in range(n_neurons_to_visualise)]
45
+
46
+ for stepi in range(n_steps):
47
+ fig_gif, axes_gif = plt.subplot_mosaic(mosaic=mosaic, figsize=(31*figscale*8/4, 76*figscale))
48
+
49
+ # Plot predictions
50
+ d = these_predictions.shape[0]
51
+ grid_side = int(np.sqrt(d))
52
+ logits = these_predictions[:, :, stepi]
53
+
54
+ probs = softmax(logits, axis=1)
55
+ probs_grid = probs[:, 0].reshape(grid_side, grid_side)
56
+ axes_gif["probs"].imshow(probs_grid, cmap='viridis', interpolation='nearest', vmin=0, vmax=1)
57
+ axes_gif["probs"].axis('off')
58
+ axes_gif["probs"].set_title('Probabilties')
59
+
60
+ # Create and show attention heatmap
61
+ this_input_gate = these_attention_weights[stepi]
62
+ gate_min, gate_max = np.nanmin(this_input_gate), np.nanmax(this_input_gate)
63
+ if not np.isclose(gate_min, gate_max):
64
+ normalized_gate = (this_input_gate - gate_min) / (gate_max - gate_min + 1e-8)
65
+ else:
66
+ normalized_gate = np.zeros_like(this_input_gate)
67
+ attention_weights_heatmap = heatmap_cmap(normalized_gate)[:,:,:3]
68
+
69
+ # Show heatmaps
70
+ axes_gif['attention'].imshow(attention_weights_heatmap, vmin=0, vmax=1)
71
+ axes_gif['attention'].axis('off')
72
+ axes_gif['attention'].set_title('Attention')
73
+
74
+
75
+ # Plot target
76
+ target_grid = targets[batch_index].reshape(grid_side, grid_side)
77
+ axes_gif["target"].imshow(target_grid, cmap='viridis_r', interpolation='nearest', vmin=0, vmax=1)
78
+ axes_gif["target"].axis('off')
79
+ axes_gif["target"].set_title('Target')
80
+
81
+ # Add certainty plot
82
+ axes_gif['certainty'].plot(np.arange(n_steps), these_certainties[1], 'k-', linewidth=2)
83
+ axes_gif['certainty'].set_xlim([0, n_steps-1])
84
+ axes_gif['certainty'].axvline(x=stepi, color='black', linewidth=1, alpha=0.5)
85
+ axes_gif['certainty'].set_xticklabels([])
86
+ axes_gif['certainty'].set_yticklabels([])
87
+ axes_gif['certainty'].grid(False)
88
+
89
+ # Plot neuron traces
90
+ for neuroni in range(n_neurons_to_visualise):
91
+ ax = axes_gif[f'trace_{neuroni}']
92
+
93
+ pre_activation = these_pre_acts[:, neuroni]
94
+ post_activation = these_post_acts[:, neuroni]
95
+
96
+ ax_pre = ax.twinx()
97
+
98
+ pre_min, pre_max = np.min(pre_activation), np.max(pre_activation)
99
+ post_min, post_max = np.min(post_activation), np.max(post_activation)
100
+
101
+ ax_pre.plot(np.arange(n_steps), pre_activation,
102
+ color='grey',
103
+ linestyle='--',
104
+ linewidth=1,
105
+ alpha=0.4,
106
+ label='Pre-activation')
107
+
108
+ color = 'blue' if neuroni % 2 else 'red'
109
+ ax.plot(np.arange(n_steps), post_activation,
110
+ color=color,
111
+ linestyle='-',
112
+ linewidth=2,
113
+ alpha=1.0,
114
+ label='Post-activation')
115
+
116
+ ax.set_xlim([0, n_steps-1])
117
+ ax_pre.set_xlim([0, n_steps-1])
118
+
119
+ if pre_min != pre_max:
120
+ ax_pre.set_ylim([pre_min, pre_max])
121
+ if post_min != post_max:
122
+ ax.set_ylim([post_min, post_max])
123
+
124
+ ax.axvline(x=stepi, color='black', linewidth=1, alpha=0.5)
125
+
126
+ ax.set_xticklabels([])
127
+ ax.set_yticklabels([])
128
+ ax.grid(False)
129
+
130
+ ax_pre.set_xticklabels([])
131
+ ax_pre.set_yticklabels([])
132
+ ax_pre.grid(False)
133
+
134
+ # Show input image
135
+ this_image = these_inputs[stepi].transpose(1, 2, 0)
136
+ axes_gif['img_data'].imshow(this_image, cmap='viridis', vmin=0, vmax=1)
137
+ axes_gif['img_data'].grid(False)
138
+ axes_gif['img_data'].set_xticks([])
139
+ axes_gif['img_data'].set_yticks([])
140
+ axes_gif['img_data'].set_title('Input')
141
+
142
+ # Save frames
143
+ fig_gif.tight_layout(pad=0.1)
144
+ if stepi == 0:
145
+ fig_gif.savefig(filename.split('.gif')[0]+'_frame0.png', dpi=100)
146
+ if stepi == 1:
147
+ fig_gif.savefig(filename.split('.gif')[0]+'_frame1.png', dpi=100)
148
+ if stepi == n_steps-1:
149
+ fig_gif.savefig(filename.split('.gif')[0]+'_frame-1.png', dpi=100)
150
+
151
+ # Convert to frame
152
+ canvas = fig_gif.canvas
153
+ canvas.draw()
154
+ image_numpy = np.frombuffer(canvas.buffer_rgba(), dtype='uint8')
155
+ image_numpy = image_numpy.reshape(*reversed(canvas.get_width_height()), 4)[:,:,:3]
156
+ frames.append(image_numpy)
157
+ plt.close(fig_gif)
158
+
159
+ imageio.mimsave(filename, frames, fps=15, loop=100)
160
+
161
+ pass
162
+
163
+
164
+ def plot_attention_trajectory(attention, certainties, input_images, save_dir, filename, args):
165
+ where_most_certain = get_where_most_certain(certainties)
166
+ grid_size = int(math.sqrt(args.parity_sequence_length))
167
+ trajectory = [np.unravel_index(np.argmax(attention[t]), (grid_size, grid_size)) for t in range(args.iterations)]
168
+ x_coords, y_coords = zip(*trajectory)
169
+
170
+ plt.figure(figsize=(5, 5))
171
+ plt.imshow(input_images[0], cmap="gray", origin="upper", vmin=0.2, vmax=0.8, interpolation='nearest')
172
+
173
+ ax = plt.gca()
174
+ nrows, ncols = input_images[0].shape
175
+ ax.set_xticks(np.arange(-0.5, ncols, 1), minor=True)
176
+ ax.set_yticks(np.arange(-0.5, nrows, 1), minor=True)
177
+ ax.grid(which='minor', color='black', linestyle='-', linewidth=1)
178
+ ax.tick_params(which="minor", size=0)
179
+ ax.set_axisbelow(False)
180
+ plt.xticks([])
181
+ plt.yticks([])
182
+
183
+ cmap = plt.get_cmap("plasma")
184
+ norm_time = np.linspace(0, 1, len(trajectory))
185
+
186
+ for i in range(len(trajectory) - 1):
187
+ x1, y1 = x_coords[i], y_coords[i]
188
+ x2, y2 = x_coords[i + 1], y_coords[i + 1]
189
+ color = cmap(norm_time[i])
190
+ line, = plt.plot([y1, y2], [x1, x2], color=color, linewidth=6, alpha=0.5, zorder=4)
191
+ line.set_path_effects([
192
+ path_effects.Stroke(linewidth=8, foreground='white'),
193
+ path_effects.Normal()
194
+ ])
195
+
196
+ for i, (x, y) in enumerate(trajectory):
197
+ plt.scatter(y, x, color=cmap(norm_time[i]), s=100, edgecolor='white', linewidth=1.5, zorder=5)
198
+
199
+ most_certain_point = trajectory[where_most_certain]
200
+
201
+ plt.plot(most_certain_point[1], most_certain_point[0],
202
+ marker='x', markersize=18, markeredgewidth=5,
203
+ color='white', linestyle='', zorder=6)
204
+ plt.plot(most_certain_point[1], most_certain_point[0],
205
+ marker='x', markersize=15, markeredgewidth=3,
206
+ color=cmap(norm_time[where_most_certain]), linestyle='', zorder=7)
207
+
208
+ plt.tight_layout()
209
+ plt.savefig(f"{save_dir}/{filename}_traj.png", dpi=300, bbox_inches='tight', pad_inches=0)
210
+ plt.savefig(f"{save_dir}/{filename}_traj.pdf", format='pdf', bbox_inches='tight', pad_inches=0)
211
+ plt.show()
212
+ plt.close()
213
+
214
+ def plot_input(input_images, save_dir, filename):
215
+
216
+ plt.figure(figsize=(5, 5))
217
+ plt.imshow(input_images[0], cmap="gray", origin="upper", vmin=0.2, vmax=0.8, interpolation='nearest')
218
+
219
+ ax = plt.gca()
220
+ nrows, ncols = input_images[0].shape
221
+ ax.set_xticks(np.arange(-0.5, ncols, 1), minor=True)
222
+ ax.set_yticks(np.arange(-0.5, nrows, 1), minor=True)
223
+ ax.grid(which='minor', color='black', linestyle='-', linewidth=1)
224
+ ax.tick_params(which="minor", size=0)
225
+ ax.set_axisbelow(False)
226
+ plt.xticks([])
227
+ plt.yticks([])
228
+
229
+ plt.tight_layout()
230
+ plt.savefig(f"{save_dir}/{filename}_input.png", dpi=300, bbox_inches='tight', pad_inches=0)
231
+ plt.savefig(f"{save_dir}/{filename}_input.pdf", format='pdf', bbox_inches='tight', pad_inches=0)
232
+ plt.show()
233
+ plt.close()
234
+
235
+ def plot_target(targets, save_dir, filename, args):
236
+ grid_size = int(math.sqrt(args.parity_sequence_length))
237
+ targets_grid = targets[0].reshape(grid_size, grid_size).detach().cpu().numpy()
238
+ plt.figure(figsize=(5, 5))
239
+ plt.imshow(targets_grid, cmap="gray_r", origin="upper", vmin=0.2, vmax=0.8, interpolation='nearest')
240
+ ax = plt.gca()
241
+ nrows, ncols = targets_grid.shape
242
+ ax.set_xticks(np.arange(-0.5, ncols, 1), minor=True)
243
+ ax.set_yticks(np.arange(-0.5, nrows, 1), minor=True)
244
+ ax.grid(which='minor', color='black', linestyle='-', linewidth=1)
245
+ ax.tick_params(which="minor", size=0)
246
+ ax.set_axisbelow(False)
247
+ plt.xticks([])
248
+ plt.yticks([])
249
+ plt.tight_layout()
250
+ plt.savefig(f"{save_dir}/{filename}_target.png", dpi=300, bbox_inches='tight', pad_inches=0)
251
+ plt.savefig(f"{save_dir}/{filename}_target.pdf", format='pdf', bbox_inches='tight', pad_inches=0)
252
+ plt.show()
253
+ plt.close()
254
+
255
+ def plot_probabilities(predictions, certainties, save_dir, filename, args):
256
+ grid_size = int(math.sqrt(args.parity_sequence_length))
257
+ where_most_certain = get_where_most_certain(certainties)
258
+ predictions_most_certain = predictions[0, :, :, where_most_certain].detach().cpu().numpy()
259
+ probs = softmax(predictions_most_certain, axis=1)
260
+ probs_grid = probs[:, 0].reshape(grid_size, grid_size)
261
+ plt.figure(figsize=(5, 5))
262
+ plt.imshow(probs_grid, cmap="gray", origin="upper", vmin=0.2, vmax=0.8, interpolation='nearest')
263
+ ax = plt.gca()
264
+ nrows, ncols = probs_grid.shape
265
+ ax.set_xticks(np.arange(-0.5, ncols, 1), minor=True)
266
+ ax.set_yticks(np.arange(-0.5, nrows, 1), minor=True)
267
+ ax.grid(which='minor', color='black', linestyle='-', linewidth=1)
268
+ ax.tick_params(which="minor", size=0)
269
+ ax.set_axisbelow(False)
270
+ plt.xticks([])
271
+ plt.yticks([])
272
+ plt.tight_layout()
273
+ plt.savefig(f"{save_dir}/{filename}_probs.png", dpi=300, bbox_inches='tight', pad_inches=0)
274
+ plt.savefig(f"{save_dir}/{filename}_probs.pdf", format='pdf', bbox_inches='tight', pad_inches=0)
275
+ plt.show()
276
+ plt.close()
277
+
278
+ def plot_prediction(predictions, certainties, save_dir, filename, args):
279
+ grid_size = int(math.sqrt(args.parity_sequence_length))
280
+ where_most_certain = get_where_most_certain(certainties)
281
+ predictions_most_certain = predictions[0, :, :, where_most_certain].detach().cpu().numpy()
282
+ class_grid = np.argmax(predictions_most_certain, axis=1).reshape(grid_size, grid_size)
283
+
284
+ plt.figure(figsize=(5, 5))
285
+ plt.imshow(class_grid, cmap="gray_r", origin="upper", vmin=0, vmax=1, interpolation='nearest')
286
+
287
+ ax = plt.gca()
288
+ nrows, ncols = class_grid.shape
289
+ ax.set_xticks(np.arange(-0.5, ncols, 1), minor=True)
290
+ ax.set_yticks(np.arange(-0.5, nrows, 1), minor=True)
291
+ ax.grid(which='minor', color='black', linestyle='-', linewidth=1)
292
+ ax.tick_params(which="minor", size=0)
293
+ ax.set_axisbelow(False)
294
+ plt.xticks([])
295
+ plt.yticks([])
296
+
297
+ plt.tight_layout()
298
+ plt.savefig(f"{save_dir}/{filename}_prediction.png", dpi=300, bbox_inches='tight', pad_inches=0)
299
+ plt.savefig(f"{save_dir}/{filename}_prediction.pdf", format='pdf', bbox_inches='tight', pad_inches=0)
300
+ plt.show()
301
+ plt.close()
302
+
303
+ def plot_accuracy_heatmap(overall_accuracies_avg, average_thinking_time, std_thinking_time, scale, save_path, args):
304
+ fig, ax = plt.subplots(figsize=(scale*10, scale*5))
305
+ im = ax.imshow(overall_accuracies_avg.T * 100, aspect='auto', cmap="viridis", origin='lower', extent=[0, args.iterations-1, 0, args.parity_sequence_length-1], vmin=50, vmax=100)
306
+ cbar = fig.colorbar(im, ax=ax, format="%.1f")
307
+ cbar.set_label("Accuracy (%)")
308
+ ax.errorbar(average_thinking_time, np.arange(args.parity_sequence_length), xerr=std_thinking_time, fmt='ko', markersize=2, capsize=2, elinewidth=1, label="Min. Entropy")
309
+ ax.set_xlabel("Time Step")
310
+ ax.set_ylabel("Sequence Index")
311
+ ax.set_xlim(0, args.iterations-1)
312
+ ax.set_ylim(0, args.parity_sequence_length-1)
313
+ ax.grid(False)
314
+ ax.legend(loc="upper left")
315
+ fig.tight_layout(pad=0.1)
316
+ fig.savefig(save_path, dpi=300, bbox_inches="tight")
317
+ fig.savefig(save_path.replace(".png", ".pdf"), format='pdf', bbox_inches="tight")
318
+ plt.close(fig)
319
+
320
+ def plot_attention_heatmap(overall_attentions_avg, scale, save_path, vmin=None, vmax=None):
321
+ overall_attentions_avg = overall_attentions_avg.reshape(overall_attentions_avg.shape[0], -1)
322
+ fig, ax = plt.subplots(figsize=(scale*10, scale*5))
323
+ im = ax.imshow(overall_attentions_avg.T, aspect='auto', cmap="viridis", origin='lower', extent=[0, overall_attentions_avg.shape[0]-1, 0, overall_attentions_avg.shape[1]-1], vmin=vmin, vmax=vmax)
324
+ cbar = fig.colorbar(im, ax=ax, format=FuncFormatter(lambda x, _: f"{x:05.2f}"))
325
+ cbar.set_label("Attention Weight")
326
+ ax.set_xlabel("Time Step")
327
+ ax.set_ylabel("Sequence Index")
328
+ ax.set_xlim(0, overall_attentions_avg.shape[0]-1)
329
+ ax.set_ylim(0, overall_attentions_avg.shape[1]-1)
330
+ ax.grid(False)
331
+ fig.tight_layout(pad=0.1)
332
+ fig.savefig(save_path, dpi=300, bbox_inches="tight")
333
+ fig.savefig(save_path.replace(".png", ".pdf"), format='pdf', bbox_inches="tight")
334
+ plt.close(fig)
335
+
336
+ def create_accuracies_heatmap_gif(all_accuracies, all_average_thinking_times, all_std_thinking_times, scale, save_dir, args):
337
+ heatmap_components_dir = os.path.join(save_dir, "accuracy_heatmaps")
338
+ os.makedirs(heatmap_components_dir, exist_ok=True)
339
+
340
+ image_paths = []
341
+
342
+ for i, (accuracies, avg_thinking_time, std_thinking_time) in enumerate(zip(all_accuracies, all_average_thinking_times, all_std_thinking_times)):
343
+ save_path = os.path.join(heatmap_components_dir, f"frame_{i:04d}.png")
344
+ plot_accuracy_heatmap(accuracies, avg_thinking_time, std_thinking_time, scale, save_path, args)
345
+ image_paths.append(save_path)
346
+
347
+ gif_path = os.path.join(save_dir, "accuracy_heatmap.gif")
348
+ with imageio.get_writer(gif_path, mode='I', duration=0.3) as writer:
349
+ for image_path in image_paths:
350
+ image = imageio.imread(image_path)
351
+ writer.append_data(image)
352
+
353
+ def create_attentions_heatmap_gif(all_attentions, scale, save_path, args):
354
+ heatmap_components_dir = os.path.join(args.log_dir, "attention_heatmaps")
355
+ os.makedirs(heatmap_components_dir, exist_ok=True)
356
+
357
+ global_min = min(attentions.min() for attentions in all_attentions)
358
+ global_max = max(attentions.max() for attentions in all_attentions)
359
+
360
+ image_paths = []
361
+
362
+ for i, attentions in enumerate(all_attentions):
363
+ save_path_component = os.path.join(heatmap_components_dir, f"frame_{i:04d}.png")
364
+ plot_attention_heatmap(attentions, scale, save_path_component, vmin=global_min, vmax=global_max)
365
+ image_paths.append(save_path_component)
366
+
367
+ gif_path = os.path.join(save_path, "attention_heatmap.gif")
368
+ with imageio.get_writer(gif_path, mode='I', duration=0.3) as writer:
369
+ for image_path in image_paths:
370
+ image = imageio.imread(image_path)
371
+ writer.append_data(image)
372
+
373
+ def create_stacked_gif(save_path, y_shift=200):
374
+ accuracy_gif_path = os.path.join(save_path, "accuracy_heatmap.gif")
375
+ attention_gif_path = os.path.join(save_path, "attention_heatmap.gif")
376
+ stacked_gif_path = os.path.join(save_path, "stacked_heatmap.gif")
377
+
378
+ accuracy_reader = imageio.get_reader(accuracy_gif_path)
379
+ attention_reader = imageio.get_reader(attention_gif_path)
380
+
381
+ accuracy_frames = [Image.fromarray(frame) for frame in accuracy_reader]
382
+ attention_frames = [Image.fromarray(frame) for frame in attention_reader]
383
+
384
+ assert len(accuracy_frames) == len(attention_frames), "Mismatch in frame counts between accuracy and attention GIFs"
385
+
386
+ stacked_frames = []
387
+ for acc_frame, att_frame in zip(accuracy_frames, attention_frames):
388
+ acc_width, acc_height = acc_frame.size
389
+ att_width, att_height = att_frame.size
390
+
391
+ # Create base canvas
392
+ stacked_height = acc_height + att_height - y_shift
393
+ stacked_width = max(acc_width, att_width)
394
+
395
+ stacked_frame = Image.new("RGB", (stacked_width, stacked_height), color=(255, 255, 255))
396
+
397
+ # Paste attention frame first, shifted up
398
+ stacked_frame.paste(att_frame, (0, 0)) # Paste at top
399
+ stacked_frame.paste(acc_frame, (0, att_height - y_shift)) # Shift accuracy up by overlap
400
+
401
+ stacked_frames.append(stacked_frame)
402
+
403
+ stacked_frames[0].save(
404
+ stacked_gif_path,
405
+ save_all=True,
406
+ append_images=stacked_frames[1:],
407
+ duration=300,
408
+ loop=0
409
+ )
410
+
411
+ save_frames_to_mp4(
412
+ [np.array(fm)[:, :, ::-1] for fm in stacked_frames],
413
+ f"{stacked_gif_path.replace('gif', 'mp4')}",
414
+ fps=15,
415
+ gop_size=1,
416
+ preset="slow"
417
+ )
418
+
419
+
420
+ def plot_accuracy_training(all_accuracies, scale, run_model_spefic_save_dir, args):
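+ # Plot accuracy per sequence index for each checkpoint, with line colour indicating training progress.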
421
+ scale = 0.5  # NOTE: overrides the scale argument with a fixed value for this plot
422
+ seq_indices = range(args.parity_sequence_length)
423
+ fig, ax = plt.subplots(figsize=(scale*10, scale*5))
424
+ cmap = plt.get_cmap("viridis")
425
+
426
+ for i, acc in enumerate(all_accuracies):
427
+ color = cmap(i / max(len(all_accuracies) - 1, 1))  # guard against division by zero when only one checkpoint is plotted
428
+ ax.plot(seq_indices, acc*100, color=color, alpha=0.7, linewidth=1)
429
+
430
+ num_checkpoints = 5
431
+ checkpoint_percentages = np.linspace(0, 100, num_checkpoints)
432
+
433
+ sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin=0, vmax=100))
434
+ sm.set_array([])
435
+ cbar = fig.colorbar(sm, ax=ax)
436
+ cbar.set_label("Training Progress (%)")
437
+ cbar.set_ticks(checkpoint_percentages)
438
+ cbar.set_ticklabels([f"{int(p)}%" for p in checkpoint_percentages])
439
+
440
+ ax.set_xlabel("Sequence Index")
441
+ ax.set_ylabel("Accuracy (%)")
442
+ ax.set_xticks([0, 16, 32, 48, 63])
443
+ ax.grid(True, alpha=0.5)
444
+ ax.set_xlim(0, args.parity_sequence_length - 1)
445
+
446
+ fig.tight_layout(pad=0.1)
447
+ fig.savefig(f"{run_model_spefic_save_dir}/accuracy_vs_seq_element.png", dpi=300, bbox_inches="tight")
448
+ fig.savefig(f"{run_model_spefic_save_dir}/accuracy_vs_seq_element.pdf", format='pdf', bbox_inches="tight")
449
+ plt.close(fig)
450
+
451
+
452
+ def plot_loss_all_runs(training_data, evaluate_every, save_path="train_loss_comparison_parity.png", step=1, scale=1.0, x_max=None):
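+ # Plot mean +/- std training loss across runs, grouped by model type and number of internal ticks.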
453
+ fig, ax = plt.subplots(figsize=(scale * 10, scale * 5))
454
+
455
+ grouped = defaultdict(list)
456
+ label_map = {}
457
+ linestyle_map = {}
458
+ iters_map = {}
459
+ model_map = {}
460
+
461
+ for folder, data in training_data.items():
462
+ label, model_type, iters = parse_folder_name(folder)
463
+ if iters is None:
464
+ continue
465
+
466
+ key = f"{model_type}_{iters}"
467
+ grouped[key].append(data["train_losses"])
468
+ label_map[key] = f"{model_type}, {iters} Iters."
469
+ linestyle_map[key] = "--" if model_type == "LSTM" else "-"
470
+ iters_map[key] = iters
471
+ model_map[key] = model_type
472
+
473
+ unique_iters = sorted(set(iters_map.values()))
474
+ base_colors = sns.color_palette("hls", n_colors=len(unique_iters))
475
+ color_lookup = {iters: base_colors[i] for i, iters in enumerate(unique_iters)}
476
+
477
+ legend_entries = []
478
+ global_max_x = 0
479
+ for key in sorted(grouped.keys(), key=lambda k: (iters_map[k], model_map[k])):
480
+ runs = grouped[key]
481
+ if not runs:
482
+ continue
483
+
484
+ iters = iters_map[key]
485
+ color = color_lookup[iters]
486
+ linestyle = linestyle_map[key]
487
+
488
+ min_len = min(len(r) for r in runs)
489
+ trimmed = np.array([r[:min_len] for r in runs])[:, ::step]
490
+
491
+ mean = np.mean(trimmed, axis=0)
492
+ std = np.std(trimmed, axis=0)
493
+ x = np.arange(len(mean)) * step * evaluate_every
494
+ group_max_x = len(mean) * step * evaluate_every
495
+ global_max_x = max(global_max_x, group_max_x)
496
+
497
+ line, = ax.plot(x, mean, color=color, linestyle=linestyle, label=label_map[key])
498
+ ax.fill_between(x, mean - std, mean + std, alpha=0.1, color=color)
499
+
500
+ legend_entries.append((line, label_map[key]))
501
+
502
+ ax.set_xlabel("Training Iterations")
503
+ ax.set_ylabel("Loss")
504
+ ax.grid(True, alpha=0.5)
505
+
506
+ style_legend = [
507
+ Line2D([0], [0], color='black', linestyle='-', label='CTM'),
508
+ Line2D([0], [0], color='black', linestyle='--', label='LSTM')
509
+ ]
510
+ color_legend = [
511
+ Line2D([0], [0], color=color_lookup[it], linestyle='-', label=f"{it} Iters.")
512
+ for it in unique_iters
513
+ ]
514
+
515
+ if not x_max:
516
+ x_max = global_max_x
517
+
518
+ ax.set_xlim([0, x_max])
519
+ ax.set_ylim(bottom=0)
520
+ ax.set_xticks(np.arange(0, x_max + 1, 50000))
521
+ ax.legend(handles=color_legend + style_legend, loc="upper left")
522
+ fig.tight_layout(pad=0.1)
523
+ fig.savefig(save_path, dpi=300)
524
+ fig.savefig(save_path.replace("png", "pdf"), format='pdf')
525
+ plt.close(fig)
526
+
527
+ def plot_accuracy_all_runs(training_data, evaluate_every, save_path="test_accuracy_comparison_parity.png", step=1, scale=1.0, smooth=False, x_max=None):
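+ # Plot mean +/- std test accuracy across runs, grouped by model type and number of internal ticks, with optional moving-average smoothing.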
528
+ fig, ax = plt.subplots(figsize=(scale * 10, scale * 5))
529
+
530
+ grouped = defaultdict(list)
531
+ label_map = {}
532
+ linestyle_map = {}
533
+ iters_map = {}
534
+ model_map = {}
535
+
536
+ for folder, data in training_data.items():
537
+ label, model_type, iters = parse_folder_name(folder)
538
+ if iters is None:
539
+ continue
540
+
541
+ key = f"{model_type}_{iters}"
542
+ grouped[key].append(data["test_accuracies"])
543
+ label_map[key] = f"{model_type}, {iters} Iters."
544
+ linestyle_map[key] = "--" if model_type == "LSTM" else "-"
545
+ iters_map[key] = iters
546
+ model_map[key] = model_type
547
+
548
+ unique_iters = sorted(set(iters_map.values()))
549
+ base_colors = sns.color_palette("hls", n_colors=len(unique_iters))
550
+ color_lookup = {iters: base_colors[i] for i, iters in enumerate(unique_iters)}
551
+
552
+ legend_entries = []
553
+ global_max_x = 0
554
+
555
+ for key in sorted(grouped.keys(), key=lambda k: (iters_map[k], model_map[k])):
556
+ runs = grouped[key]
557
+ if not runs:
558
+ continue
559
+
560
+ iters = iters_map[key]
561
+ model = model_map[key]
562
+ color = color_lookup[iters]
563
+ linestyle = linestyle_map[key]
564
+
565
+ min_len = min(len(r) for r in runs)
566
+ trimmed = np.array([r[:min_len] for r in runs])[:, ::step]
567
+
568
+ mean = np.mean(trimmed, axis=0) * 100
569
+ std = np.std(trimmed, axis=0) * 100
570
+
571
+ if smooth:
572
+ window_size = max(1, int(0.05 * len(mean)))
573
+ if window_size % 2 == 0:
574
+ window_size += 1
575
+ kernel = np.ones(window_size) / window_size
576
+
577
+ smoothed_mean = np.convolve(mean, kernel, mode='same')
578
+ smoothed_std = np.convolve(std, kernel, mode='same')
579
+
580
+ valid_start = window_size // 2
581
+ valid_end = len(mean) - window_size // 2
582
+ valid_length = valid_end - valid_start
583
+
584
+ mean = smoothed_mean[valid_start:valid_end]
585
+ std = smoothed_std[valid_start:valid_end]
586
+ x = np.arange(valid_length) * step * evaluate_every
587
+ group_max_x = valid_length * step * evaluate_every
588
+ else:
589
+ x = np.arange(len(mean)) * step * evaluate_every
590
+ group_max_x = len(mean) * step * evaluate_every
591
+
592
+ global_max_x = max(global_max_x, group_max_x)
593
+
594
+ line, = ax.plot(x, mean, color=color, linestyle=linestyle, label=label_map[key])
595
+ ax.fill_between(x, mean - std, mean + std, alpha=0.1, color=color)
596
+ legend_entries.append((line, label_map[key]))
597
+
598
+ if smooth or x_max is None:
599
+ x_max = global_max_x
600
+
601
+ ax.set_xlim([0, x_max])
602
+ ax.set_ylim(top=100)
603
+ ax.set_xticks(np.arange(0, x_max + 1, 50000))
604
+ ax.set_xlabel("Training Iterations")
605
+ ax.set_ylabel("Accuracy (%)")
606
+ ax.grid(True, alpha=0.5)
607
+
608
+ style_legend = [
609
+ Line2D([0], [0], color='black', linestyle='-', label='CTM'),
610
+ Line2D([0], [0], color='black', linestyle='--', label='LSTM')
611
+ ]
612
+ color_legend = [
613
+ Line2D([0], [0], color=color_lookup[it], linestyle='-', label=f"{it} Iters.")
614
+ for it in unique_iters
615
+ ]
616
+ ax.legend(handles=color_legend + style_legend, loc="upper left")
617
+
618
+ fig.tight_layout(pad=0.1)
619
+ fig.savefig(save_path, dpi=300)
620
+ fig.savefig(save_path.replace("png", "pdf"), format='pdf')
621
+ plt.close(fig)
622
+
623
+ def extract_run_name(folder, run_index=None):
624
+ # Try to extract from parent folder
625
+ parent = os.path.basename(os.path.dirname(folder))
626
+ match = re.search(r'run(\d+)', parent, re.IGNORECASE)
627
+ if match:
628
+ return f"Run {int(match.group(1))}"
629
+ # Try current folder name
630
+ basename = os.path.basename(folder)
631
+ match = re.search(r'run(\d+)', basename, re.IGNORECASE)
632
+ if match:
633
+ return f"Run {int(match.group(1))}"
634
+ # Fallback: use run index
635
+ if run_index is not None:
636
+ return f"Run {run_index + 1}"
637
+ raise ValueError(f"Could not extract run number from: {folder}")
638
+
639
+ def plot_loss_individual_runs(training_data, evaluate_every, save_dir, scale=1.0, x_max=None):
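+ # Plot the training loss of each run separately, one figure per model/iteration (and memory length, for CTMs) group.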
640
+
641
+ grouped = defaultdict(list)
642
+ label_map = {}
643
+ iters_map = {}
644
+ model_map = {}
645
+
646
+ base_colors = sns.color_palette("hls", n_colors=3)
647
+ color_lookup = {f"Run {i+1}": base_colors[i] for i in range(3)}
648
+
649
+ for i, (folder, data) in enumerate(training_data.items()):
650
+ checkpoint = load_checkpoint(get_latest_checkpoint_file(folder), device="cpu")
651
+ model_args = get_model_args_from_checkpoint(checkpoint)
652
+ label, model_type, iters = parse_folder_name(folder)
653
+ if iters is None:
654
+ continue
655
+
656
+ if model_type.lower() == "ctm":
657
+ memory_length = getattr(model_args, "memory_length", None)
658
+ if memory_length is None:
659
+ raise ValueError(f"CTM model missing memory_length in checkpoint args from: {folder}")
660
+ key = f"{model_type}_{iters}_{memory_length}".lower()
661
+ else:
662
+ key = f"{model_type}_{iters}".lower()
663
+
664
+ run_name = extract_run_name(folder, run_index=i)
665
+ grouped[key].append((run_name, data["train_losses"]))
666
+ label_map[key] = f"{model_type}, {iters} Iters."
667
+ iters_map[key] = iters
668
+ model_map[key] = model_type
669
+
670
+ for key, runs in grouped.items():
671
+ fig, ax = plt.subplots(figsize=(scale * 10, scale * 5))
672
+ for run_name, losses in runs:
673
+ x = np.arange(len(losses)) * evaluate_every
674
+ color = color_lookup.get(run_name, 'gray')
675
+ ax.plot(x, losses, label=run_name, color=color, alpha=0.7)
676
+
677
+ ax.set_xlabel("Training Iterations")
678
+ ax.set_ylabel("Loss")
679
+ ax.set_ylim(bottom=-0.01)
680
+ ax.grid(True, alpha=0.5)
681
+ if x_max:
682
+ ax.set_xlim([0, x_max])
683
+ ax.set_xticks(np.arange(0, x_max + 1, 50000))
684
+ ax.legend()
685
+ fig.tight_layout(pad=0.1)
686
+
687
+ subdir = os.path.join(save_dir, key)
688
+ os.makedirs(subdir, exist_ok=True)
689
+ fname = os.path.join(subdir, f"individual_runs_loss_{key}.png")
690
+ fig.savefig(fname, dpi=300)
691
+ fig.savefig(fname.replace("png", "pdf"), format="pdf")
692
+ plt.close(fig)
693
+
694
+ def plot_accuracy_individual_runs(training_data, evaluate_every, save_dir, scale=1.0, smooth=False, x_max=None):
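+ # Plot the test accuracy of each run separately, one figure per model/iteration (and memory length, for CTMs) group.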
695
+
696
+ grouped = defaultdict(list)
697
+ label_map = {}
698
+ iters_map = {}
699
+ model_map = {}
700
+
701
+ base_colors = sns.color_palette("hls", n_colors=3)
702
+ color_lookup = {f"Run {i+1}": base_colors[i] for i in range(3)}
703
+
704
+ for i, (folder, data) in enumerate(training_data.items()):
705
+ checkpoint = load_checkpoint(get_latest_checkpoint_file(folder), device="cpu")
706
+ model_args = get_model_args_from_checkpoint(checkpoint)
707
+ label, model_type, iters = parse_folder_name(folder)
708
+ if iters is None:
709
+ continue
710
+
711
+ if model_type.lower() == "ctm":
712
+ memory_length = getattr(model_args, "memory_length", None)
713
+ if memory_length is None:
714
+ raise ValueError(f"CTM model missing memory_length in checkpoint args from: {folder}")
715
+ key = f"{model_type}_{iters}_{memory_length}".lower()
716
+ else:
717
+ key = f"{model_type}_{iters}".lower()
718
+
719
+ run_name = extract_run_name(folder, run_index=i)
720
+ grouped[key].append((run_name, data["test_accuracies"]))
721
+ label_map[key] = f"{model_type}, {iters} Iters."
722
+ iters_map[key] = iters
723
+ model_map[key] = model_type
724
+
725
+ for key, runs in grouped.items():
726
+ fig, ax = plt.subplots(figsize=(scale * 10, scale * 5))
727
+ for run_name, acc in runs:
728
+ acc = np.array(acc) * 100
729
+ if smooth:
730
+ window_size = max(1, int(0.05 * len(acc)))
731
+ if window_size % 2 == 0:
732
+ window_size += 1
733
+ kernel = np.ones(window_size) / window_size
734
+ acc = np.convolve(acc, kernel, mode="same")
735
+
736
+ x = np.arange(len(acc)) * evaluate_every
737
+ color = color_lookup.get(run_name, 'gray')
738
+ ax.plot(x, acc, label=run_name, color=color, alpha=0.7)
739
+
740
+ ax.set_xlabel("Training Iterations")
741
+ ax.set_ylabel("Accuracy (%)")
742
+ ax.set_ylim([50, 101])
743
+ ax.grid(True, alpha=0.5)
744
+ if x_max:
745
+ ax.set_xlim([0, x_max])
746
+ ax.set_xticks(np.arange(0, x_max + 1, 50000))
747
+ ax.legend()
748
+ fig.tight_layout(pad=0.1)
749
+
750
+ subdir = os.path.join(save_dir, key)
751
+ os.makedirs(subdir, exist_ok=True)
752
+ fname = os.path.join(subdir, f"individual_runs_accuracy_{key}.png")
753
+ fig.savefig(fname, dpi=300)
754
+ fig.savefig(fname.replace("png", "pdf"), format="pdf")
755
+ plt.close(fig)
756
+
757
+ def plot_training_curve_all_runs(all_folders, save_dir, scale, device, smooth=False, x_max=None, plot_individual_runs=True):
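+ # Load the latest checkpoint from every run folder and produce the combined (and optionally per-run) loss and accuracy plots.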
758
+
759
+ all_folders = [folder for folder in all_folders if "certain" not in folder]
760
+
761
+ training_data = {}
762
+ evaluation_intervals = []
763
+ for folder in all_folders:
764
+ latest_checkpoint_path = get_latest_checkpoint_file(folder)
765
+ if latest_checkpoint_path:
766
+ checkpoint = load_checkpoint(latest_checkpoint_path, device=device)
767
+ model_args = get_model_args_from_checkpoint(checkpoint)
768
+ evaluation_intervals.append(model_args.track_every)
769
+
770
+ _, train_losses, test_losses, train_accuracies, test_accuracies = get_accuracy_and_loss_from_checkpoint(checkpoint, device=device)
771
+ training_data[folder] = {
772
+ "train_losses": train_losses,
773
+ "test_losses": test_losses,
774
+ "train_accuracies": train_accuracies,
775
+ "test_accuracies": test_accuracies
776
+ }
777
+ else:
778
+ print(f"No checkpoint found for {folder}")
779
+
780
+ assert len(evaluation_intervals) > 0, "No valid checkpoints found."
781
+ assert all(interval == evaluation_intervals[0] for interval in evaluation_intervals), "Evaluation intervals are not consistent across runs."
782
+
783
+ evaluate_every = evaluation_intervals[0]
784
+
785
+ if plot_individual_runs:
786
+ plot_loss_individual_runs(training_data, evaluate_every, save_dir=save_dir, scale=scale, x_max=x_max)
787
+ plot_accuracy_individual_runs(training_data, evaluate_every, save_dir=save_dir, scale=scale, smooth=smooth, x_max=x_max)
788
+
789
+ plot_loss_all_runs(training_data, evaluate_every, save_path=f"{save_dir}/loss_comparison.png", scale=scale, x_max=x_max)
790
+ plot_accuracy_all_runs(training_data, evaluate_every, save_path=f"{save_dir}/accuracy_comparison.png", scale=scale, smooth=smooth, x_max=x_max)
791
+
792
+ return training_data
793
+
794
+ def plot_accuracy_thinking_time(csv_path, scale, output_dir="analysis/parity"):
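+ # Plot mean accuracy (with pooled std) against the number of internal ticks for CTM and LSTM runs, read from the summary CSV.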
795
+ if not os.path.exists(csv_path):
796
+ raise FileNotFoundError(f"CSV file not found: {csv_path}")
797
+
798
+ df = pd.read_csv(csv_path)
799
+ df["RunName"] = df["Run"].apply(lambda x: os.path.basename(os.path.dirname(x)))
800
+ df["Model"] = df["Run"].apply(lambda x: "CTM" if "ctm" in x.lower() else "LSTM")
801
+
802
+ grouped = df.groupby(["Model", "Num Iterations"])
803
+ summary = grouped.agg(
804
+ mean_accuracy=("Overall Mean Accuracy", "mean"),
805
+ std_accuracy=("Overall Std Accuracy", lambda x: np.sqrt(np.mean(x**2)))
806
+ ).reset_index()
807
+
808
+ summary["mean_accuracy"] *= 100
809
+ summary["std_accuracy"] *= 100
810
+
811
+ fig, ax = plt.subplots(figsize=(scale*5, scale*5))
812
+
813
+ for model in ("CTM", "LSTM"):
814
+ subset = summary[summary["Model"] == model].sort_values(by="Num Iterations")
815
+ linestyle = "-" if model == "CTM" else "--"
816
+ ax.errorbar(
817
+ subset["Num Iterations"],
818
+ subset["mean_accuracy"],
819
+ yerr=subset["std_accuracy"],
820
+ linestyle=linestyle,
821
+ color="black",
822
+ marker='.',
823
+ label=model,
824
+ capsize=3,
825
+ elinewidth=1,
826
+ errorevery=1
827
+ )
828
+
829
+ ax.set_xlabel("Internal Ticks")
830
+ ax.set_ylabel("Accuracy (%)")
831
+ custom_lines = [
832
+ Line2D([0], [0], color='black', linestyle='-', label='CTM'),
833
+ Line2D([0], [0], color='black', linestyle='--', label='LSTM')
834
+ ]
835
+ ax.legend(handles=custom_lines, loc="lower right")
836
+ ax.grid(True, alpha=0.5)
837
+
838
+ os.makedirs(output_dir, exist_ok=True)
839
+ output_path_png = os.path.join(output_dir, "accuracy_vs_thinking_time.png")
840
+ fig.tight_layout(pad=0.1)
841
+ fig.savefig(output_path_png, dpi=300)
842
+ fig.savefig(output_path_png.replace("png", "pdf"), format='pdf')
843
+ plt.close(fig)
844
+
845
+
846
+ def plot_lstm_last_and_certain_accuracy(all_folders, save_path="lstm_last_and_certain_accuracy.png", scale=1.0, step=1, x_max=None):
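+ # Compare LSTM runs that use the final iteration's output ("Final") against runs trained with --use_most_certain_with_lstm ("Certain").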
847
+
848
+ tags = ["lstm_10", "lstm_10_certain", "lstm_25", "lstm_25_certain"]
849
+ folders = [f for f in all_folders if os.path.basename(os.path.normpath(f)).lower() in tags]  # exact match so e.g. "lstm_100" is not caught by the "lstm_10" tag
850
+
851
+ training_data, eval_intervals = {}, []
852
+ for f in folders:
853
+ cp = get_latest_checkpoint_file(f)
854
+ if not cp:
855
+ print(f"⚠️ No checkpoint in {f}")
856
+ continue
857
+ ckpt = load_checkpoint(cp, device="cpu")
858
+ args = get_model_args_from_checkpoint(ckpt)
859
+ eval_intervals.append(args.track_every)
860
+ _, _, _, _, acc = get_accuracy_and_loss_from_checkpoint(ckpt)
861
+ iters = "25" if "25" in f.lower() else "10"
862
+ label = "Certain" if "certain" in f.lower() else "Final"
863
+ training_data.setdefault((iters, label), []).append(acc)
864
+
865
+ assert training_data and all(i == eval_intervals[0] for i in eval_intervals), "Missing or inconsistent eval intervals."
866
+ evaluate_every = eval_intervals[0]
867
+
868
+ keys = sorted(training_data.keys())
869
+ colors = sns.color_palette("hls", n_colors=len(keys))
870
+ style_map = {key: ("--" if key[1] == "Certain" else "-") for key in keys}
871
+ color_map = {key: colors[i] for i, key in enumerate(keys)}
872
+
873
+ fig, ax = plt.subplots(figsize=(scale * 10, scale * 5))
874
+ max_x = 0
875
+
876
+ for key in keys:
877
+ runs = training_data[key]
878
+ min_len = min(len(r) for r in runs)
879
+ trimmed = np.stack([r[:min_len] for r in runs], axis=0)[:, ::step]
880
+ mean, std = np.mean(trimmed, 0) * 100, np.std(trimmed, 0) * 100
881
+ x = np.arange(len(mean)) * step * evaluate_every
882
+ ax.plot(x, mean, color=color_map[key], linestyle=style_map[key],
883
+ label=f"{key[0]} Iters, {key[1]}", linewidth=2, alpha=0.7)
884
+ ax.fill_between(x, mean - std, mean + std, color=color_map[key], alpha=0.1)
885
+ max_x = max(max_x, x[-1])
886
+
887
+ ax.set_xlim([0, x_max or max_x])
888
+ ax.set_xticks(np.arange(0, (x_max or max_x) + 1, 50000))
889
+ ax.set_xlabel("Training Iterations")
890
+ ax.set_ylabel("Accuracy (%)")
891
+ ax.grid(True, alpha=0.5)
892
+ ax.legend(loc="lower right")
893
+ fig.tight_layout(pad=0.1)
894
+ fig.savefig(save_path, dpi=300)
895
+ fig.savefig(save_path.replace("png", "pdf"), format="pdf")
896
+ plt.close(fig)
tasks/parity/scripts/train_ctm_100_50.sh ADDED
@@ -0,0 +1,46 @@
1
+ #!/bin/bash
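+ # CTM on 64-bit parity: 100 internal ticks, memory length 50.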
2
+ RUN=1
3
+ ITERATIONS=100
4
+ MEMORY_LENGTH=50
5
+ LOG_DIR="logs/parity/run${RUN}/ctm_${ITERATIONS}_${MEMORY_LENGTH}"
6
+ SEED=$((RUN - 1))
7
+
8
+ python -m tasks.parity.train \
9
+ --log_dir $LOG_DIR \
10
+ --seed $SEED \
11
+ --iterations $ITERATIONS \
12
+ --memory_length $MEMORY_LENGTH \
13
+ --parity_sequence_length 64 \
14
+ --n_test_batches 20 \
15
+ --d_model 1024 \
16
+ --d_input 512 \
17
+ --n_synch_out 32 \
18
+ --n_synch_action 32 \
19
+ --synapse_depth 1 \
20
+ --heads 8 \
21
+ --memory_hidden_dims 16 \
22
+ --dropout 0.0 \
23
+ --deep_memory \
24
+ --no-do_normalisation \
25
+ --positional_embedding_type="custom-rotational-1d" \
26
+ --backbone_type="parity_backbone" \
27
+ --no-full_eval \
28
+ --weight_decay 0.0 \
29
+ --gradient_clipping 0.9 \
30
+ --use_scheduler \
31
+ --scheduler_type "cosine" \
32
+ --milestones 0 0 0 \
33
+ --gamma 0 \
34
+ --dataset "parity" \
35
+ --batch_size 64 \
36
+ --batch_size_test 256 \
37
+ --lr=0.0001 \
38
+ --training_iterations 200001 \
39
+ --warmup_steps 500 \
40
+ --track_every 1000 \
41
+ --save_every 10000 \
42
+ --no-reload \
43
+ --no-reload_model_only \
44
+ --device 0 \
45
+ --no-use_amp \
46
+ --neuron_select_type "random"
tasks/parity/scripts/train_ctm_10_5.sh ADDED
@@ -0,0 +1,46 @@
1
+ #!/bin/bash
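+ # CTM on 64-bit parity: 10 internal ticks, memory length 5.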
2
+ RUN=1
3
+ ITERATIONS=10
4
+ MEMORY_LENGTH=5
5
+ LOG_DIR="logs/parity/run${RUN}/ctm_${ITERATIONS}_${MEMORY_LENGTH}"
6
+ SEED=$((RUN - 1))
7
+
8
+ python -m tasks.parity.train \
9
+ --log_dir $LOG_DIR \
10
+ --seed $SEED \
11
+ --iterations $ITERATIONS \
12
+ --memory_length $MEMORY_LENGTH \
13
+ --parity_sequence_length 64 \
14
+ --n_test_batches 20 \
15
+ --d_model 1024 \
16
+ --d_input 512 \
17
+ --n_synch_out 32 \
18
+ --n_synch_action 32 \
19
+ --synapse_depth 1 \
20
+ --heads 8 \
21
+ --memory_hidden_dims 16 \
22
+ --dropout 0.0 \
23
+ --deep_memory \
24
+ --no-do_normalisation \
25
+ --positional_embedding_type="custom-rotational-1d" \
26
+ --backbone_type="parity_backbone" \
27
+ --no-full_eval \
28
+ --weight_decay 0.0 \
29
+ --gradient_clipping 0.9 \
30
+ --use_scheduler \
31
+ --scheduler_type "cosine" \
32
+ --milestones 0 0 0 \
33
+ --gamma 0 \
34
+ --dataset "parity" \
35
+ --batch_size 64 \
36
+ --batch_size_test 256 \
37
+ --lr=0.0001 \
38
+ --training_iterations 200001 \
39
+ --warmup_steps 500 \
40
+ --track_every 1000 \
41
+ --save_every 10000 \
42
+ --no-reload \
43
+ --no-reload_model_only \
44
+ --device 0 \
45
+ --no-use_amp \
46
+ --neuron_select_type "random"
tasks/parity/scripts/train_ctm_1_1.sh ADDED
@@ -0,0 +1,46 @@
1
+ #!/bin/bash
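+ # CTM on 64-bit parity: 1 internal tick, memory length 1.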
2
+ RUN=1
3
+ ITERATIONS=1
4
+ MEMORY_LENGTH=1
5
+ LOG_DIR="logs/parity/run${RUN}/ctm_${ITERATIONS}_${MEMORY_LENGTH}"
6
+ SEED=$((RUN - 1))
7
+
8
+ python -m tasks.parity.train \
9
+ --log_dir $LOG_DIR \
10
+ --seed $SEED \
11
+ --iterations $ITERATIONS \
12
+ --memory_length $MEMORY_LENGTH \
13
+ --parity_sequence_length 64 \
14
+ --n_test_batches 20 \
15
+ --d_model 1024 \
16
+ --d_input 512 \
17
+ --n_synch_out 32 \
18
+ --n_synch_action 32 \
19
+ --synapse_depth 1 \
20
+ --heads 8 \
21
+ --memory_hidden_dims 16 \
22
+ --dropout 0.0 \
23
+ --deep_memory \
24
+ --no-do_normalisation \
25
+ --positional_embedding_type="custom-rotational-1d" \
26
+ --backbone_type="parity_backbone" \
27
+ --no-full_eval \
28
+ --weight_decay 0.0 \
29
+ --gradient_clipping 0.9 \
30
+ --use_scheduler \
31
+ --scheduler_type "cosine" \
32
+ --milestones 0 0 0 \
33
+ --gamma 0 \
34
+ --dataset "parity" \
35
+ --batch_size 64 \
36
+ --batch_size_test 256 \
37
+ --lr=0.0001 \
38
+ --training_iterations 200001 \
39
+ --warmup_steps 500 \
40
+ --track_every 1000 \
41
+ --save_every 10000 \
42
+ --no-reload \
43
+ --no-reload_model_only \
44
+ --device 0 \
45
+ --no-use_amp \
46
+ --neuron_select_type "random"
tasks/parity/scripts/train_ctm_25_10.sh ADDED
@@ -0,0 +1,46 @@
1
+ #!/bin/bash
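+ # CTM on 64-bit parity: 25 internal ticks, memory length 10.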
2
+ RUN=1
3
+ ITERATIONS=25
4
+ MEMORY_LENGTH=10
5
+ LOG_DIR="logs/parity/run${RUN}/ctm_${ITERATIONS}_${MEMORY_LENGTH}"
6
+ SEED=$((RUN - 1))
7
+
8
+ python -m tasks.parity.train \
9
+ --log_dir $LOG_DIR \
10
+ --seed $SEED \
11
+ --iterations $ITERATIONS \
12
+ --memory_length $MEMORY_LENGTH \
13
+ --parity_sequence_length 64 \
14
+ --n_test_batches 20 \
15
+ --d_model 1024 \
16
+ --d_input 512 \
17
+ --n_synch_out 32 \
18
+ --n_synch_action 32 \
19
+ --synapse_depth 1 \
20
+ --heads 8 \
21
+ --memory_hidden_dims 16 \
22
+ --dropout 0.0 \
23
+ --deep_memory \
24
+ --no-do_normalisation \
25
+ --positional_embedding_type="custom-rotational-1d" \
26
+ --backbone_type="parity_backbone" \
27
+ --no-full_eval \
28
+ --weight_decay 0.0 \
29
+ --gradient_clipping 0.9 \
30
+ --use_scheduler \
31
+ --scheduler_type "cosine" \
32
+ --milestones 0 0 0 \
33
+ --gamma 0 \
34
+ --dataset "parity" \
35
+ --batch_size 64 \
36
+ --batch_size_test 256 \
37
+ --lr=0.0001 \
38
+ --training_iterations 200001 \
39
+ --warmup_steps 500 \
40
+ --track_every 1000 \
41
+ --save_every 10000 \
42
+ --no-reload \
43
+ --no-reload_model_only \
44
+ --device 0 \
45
+ --no-use_amp \
46
+ --neuron_select_type "random"
tasks/parity/scripts/train_ctm_50_25.sh ADDED
@@ -0,0 +1,46 @@
1
+ #!/bin/bash
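+ # CTM on 64-bit parity: 50 internal ticks, memory length 25.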
2
+ RUN=1
3
+ ITERATIONS=50
4
+ MEMORY_LENGTH=25
5
+ LOG_DIR="logs/parity/run${RUN}/ctm_${ITERATIONS}_${MEMORY_LENGTH}"
6
+ SEED=$((RUN - 1))
7
+
8
+ python -m tasks.parity.train \
9
+ --log_dir $LOG_DIR \
10
+ --seed $SEED \
11
+ --iterations $ITERATIONS \
12
+ --memory_length $MEMORY_LENGTH \
13
+ --parity_sequence_length 64 \
14
+ --n_test_batches 20 \
15
+ --d_model 1024 \
16
+ --d_input 512 \
17
+ --n_synch_out 32 \
18
+ --n_synch_action 32 \
19
+ --synapse_depth 1 \
20
+ --heads 8 \
21
+ --memory_hidden_dims 16 \
22
+ --dropout 0.0 \
23
+ --deep_memory \
24
+ --no-do_normalisation \
25
+ --positional_embedding_type="custom-rotational-1d" \
26
+ --backbone_type="parity_backbone" \
27
+ --no-full_eval \
28
+ --weight_decay 0.0 \
29
+ --gradient_clipping 0.9 \
30
+ --use_scheduler \
31
+ --scheduler_type "cosine" \
32
+ --milestones 0 0 0 \
33
+ --gamma 0 \
34
+ --dataset "parity" \
35
+ --batch_size 64 \
36
+ --batch_size_test 256 \
37
+ --lr=0.0001 \
38
+ --training_iterations 200001 \
39
+ --warmup_steps 500 \
40
+ --track_every 1000 \
41
+ --save_every 10000 \
42
+ --no-reload \
43
+ --no-reload_model_only \
44
+ --device 0 \
45
+ --no-use_amp \
46
+ --neuron_select_type "random"
tasks/parity/scripts/train_ctm_75_25.sh ADDED
@@ -0,0 +1,46 @@
1
+ #!/bin/bash
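+ # CTM on 64-bit parity: 75 internal ticks, memory length 25.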
2
+ RUN=1
3
+ ITERATIONS=75
4
+ MEMORY_LENGTH=25
5
+ LOG_DIR="logs/parity/run${RUN}/ctm_${ITERATIONS}_${MEMORY_LENGTH}"
6
+ SEED=$((RUN - 1))
7
+
8
+ python -m tasks.parity.train \
9
+ --log_dir $LOG_DIR \
10
+ --seed $SEED \
11
+ --iterations $ITERATIONS \
12
+ --memory_length $MEMORY_LENGTH \
13
+ --parity_sequence_length 64 \
14
+ --n_test_batches 20 \
15
+ --d_model 1024 \
16
+ --d_input 512 \
17
+ --n_synch_out 32 \
18
+ --n_synch_action 32 \
19
+ --synapse_depth 1 \
20
+ --heads 8 \
21
+ --memory_hidden_dims 16 \
22
+ --dropout 0.0 \
23
+ --deep_memory \
24
+ --no-do_normalisation \
25
+ --positional_embedding_type="custom-rotational-1d" \
26
+ --backbone_type="parity_backbone" \
27
+ --no-full_eval \
28
+ --weight_decay 0.0 \
29
+ --gradient_clipping 0.9 \
30
+ --use_scheduler \
31
+ --scheduler_type "cosine" \
32
+ --milestones 0 0 0 \
33
+ --gamma 0 \
34
+ --dataset "parity" \
35
+ --batch_size 64 \
36
+ --batch_size_test 256 \
37
+ --lr=0.0001 \
38
+ --training_iterations 200001 \
39
+ --warmup_steps 500 \
40
+ --track_every 1000 \
41
+ --save_every 10000 \
42
+ --no-reload \
43
+ --no-reload_model_only \
44
+ --device 0 \
45
+ --no-use_amp \
46
+ --neuron_select_type "random"
tasks/parity/scripts/train_lstm_1.sh ADDED
@@ -0,0 +1,39 @@
1
+ #!/bin/bash
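+ # LSTM baseline on 64-bit parity: 1 iteration.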
2
+ RUN=1
3
+ ITERATIONS=1
4
+ MODEL_TYPE="lstm"
5
+ LOG_DIR="logs/parity/run${RUN}/${MODEL_TYPE}_${ITERATIONS}"
6
+ SEED=$((RUN - 1))
7
+
8
+ python -m tasks.parity.train \
9
+ --log_dir $LOG_DIR \
10
+ --seed $SEED \
11
+ --iterations $ITERATIONS \
12
+ --model_type $MODEL_TYPE \
13
+ --parity_sequence_length 64 \
14
+ --n_test_batches 20 \
15
+ --d_model 669 \
16
+ --d_input 512 \
17
+ --heads 8 \
18
+ --dropout 0.0 \
19
+ --positional_embedding_type="custom-rotational-1d" \
20
+ --backbone_type="parity_backbone" \
21
+ --no-full_eval \
22
+ --weight_decay 0.0 \
23
+ --gradient_clipping -1 \
24
+ --use_scheduler \
25
+ --scheduler_type "cosine" \
26
+ --milestones 0 0 0 \
27
+ --gamma 0 \
28
+ --dataset "parity" \
29
+ --batch_size 64 \
30
+ --batch_size_test 256 \
31
+ --lr=0.0001 \
32
+ --training_iterations 200001 \
33
+ --warmup_steps 500 \
34
+ --track_every 1000 \
35
+ --save_every 10000 \
36
+ --no-reload \
37
+ --no-reload_model_only \
38
+ --device 0 \
39
+ --no-use_amp
tasks/parity/scripts/train_lstm_10.sh ADDED
@@ -0,0 +1,39 @@
1
+ #!/bin/bash
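+ # LSTM baseline on 64-bit parity: 10 iterations.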
2
+ RUN=1
3
+ ITERATIONS=10
4
+ MODEL_TYPE="lstm"
5
+ LOG_DIR="logs/parity/run${RUN}/${MODEL_TYPE}_${ITERATIONS}"
6
+ SEED=$((RUN - 1))
7
+
8
+ python -m tasks.parity.train \
9
+ --log_dir $LOG_DIR \
10
+ --seed $SEED \
11
+ --iterations $ITERATIONS \
12
+ --model_type $MODEL_TYPE \
13
+ --parity_sequence_length 64 \
14
+ --n_test_batches 20 \
15
+ --d_model 686 \
16
+ --d_input 512 \
17
+ --heads 8 \
18
+ --dropout 0.0 \
19
+ --positional_embedding_type="custom-rotational-1d" \
20
+ --backbone_type="parity_backbone" \
21
+ --no-full_eval \
22
+ --weight_decay 0.0 \
23
+ --gradient_clipping -1 \
24
+ --use_scheduler \
25
+ --scheduler_type "cosine" \
26
+ --milestones 0 0 0 \
27
+ --gamma 0 \
28
+ --dataset "parity" \
29
+ --batch_size 64 \
30
+ --batch_size_test 256 \
31
+ --lr=0.0001 \
32
+ --training_iterations 200001 \
33
+ --warmup_steps 500 \
34
+ --track_every 1000 \
35
+ --save_every 10000 \
36
+ --no-reload \
37
+ --no-reload_model_only \
38
+ --device 0 \
39
+ --no-use_amp
tasks/parity/scripts/train_lstm_100.sh ADDED
@@ -0,0 +1,39 @@
1
+ #!/bin/bash
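+ # LSTM baseline on 64-bit parity: 100 iterations.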
2
+ RUN=1
3
+ ITERATIONS=100
4
+ MODEL_TYPE="lstm"
5
+ LOG_DIR="logs/parity/run${RUN}/${MODEL_TYPE}_${ITERATIONS}"
6
+ SEED=$((RUN - 1))
7
+
8
+ python -m tasks.parity.train \
9
+ --log_dir $LOG_DIR \
10
+ --seed $SEED \
11
+ --iterations $ITERATIONS \
12
+ --model_type $MODEL_TYPE \
13
+ --parity_sequence_length 64 \
14
+ --n_test_batches 20 \
15
+ --d_model 857 \
16
+ --d_input 512 \
17
+ --heads 8 \
18
+ --dropout 0.0 \
19
+ --positional_embedding_type="custom-rotational-1d" \
20
+ --backbone_type="parity_backbone" \
21
+ --no-full_eval \
22
+ --weight_decay 0.0 \
23
+ --gradient_clipping -1 \
24
+ --use_scheduler \
25
+ --scheduler_type "cosine" \
26
+ --milestones 0 0 0 \
27
+ --gamma 0 \
28
+ --dataset "parity" \
29
+ --batch_size 64 \
30
+ --batch_size_test 256 \
31
+ --lr=0.0001 \
32
+ --training_iterations 200001 \
33
+ --warmup_steps 500 \
34
+ --track_every 1000 \
35
+ --save_every 10000 \
36
+ --no-reload \
37
+ --no-reload_model_only \
38
+ --device 0 \
39
+ --no-use_amp
tasks/parity/scripts/train_lstm_10_certain.sh ADDED
@@ -0,0 +1,40 @@
1
+ #!/bin/bash
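+ # LSTM baseline on 64-bit parity: 10 iterations, using the most-certain-iteration output (--use_most_certain_with_lstm).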
2
+ RUN=3
3
+ ITERATIONS=10
4
+ MODEL_TYPE="lstm"
5
+ LOG_DIR="logs/parity/run${RUN}/${MODEL_TYPE}_${ITERATIONS}_certain"
6
+ SEED=$((RUN - 1))
7
+
8
+ python -m tasks.parity.train \
9
+ --log_dir $LOG_DIR \
10
+ --seed $SEED \
11
+ --iterations $ITERATIONS \
12
+ --model_type $MODEL_TYPE \
13
+ --parity_sequence_length 64 \
14
+ --n_test_batches 20 \
15
+ --d_model 686 \
16
+ --d_input 512 \
17
+ --heads 8 \
18
+ --dropout 0.0 \
19
+ --positional_embedding_type="custom-rotational-1d" \
20
+ --backbone_type="parity_backbone" \
21
+ --no-full_eval \
22
+ --weight_decay 0.0 \
23
+ --gradient_clipping -1 \
24
+ --use_scheduler \
25
+ --scheduler_type "cosine" \
26
+ --milestones 0 0 0 \
27
+ --gamma 0 \
28
+ --dataset "parity" \
29
+ --batch_size 64 \
30
+ --batch_size_test 256 \
31
+ --lr=0.0001 \
32
+ --training_iterations 200001 \
33
+ --warmup_steps 500 \
34
+ --track_every 1000 \
35
+ --save_every 10000 \
36
+ --no-reload \
37
+ --no-reload_model_only \
38
+ --device 0 \
39
+ --no-use_amp \
40
+ --use_most_certain_with_lstm \
tasks/parity/scripts/train_lstm_25.sh ADDED
@@ -0,0 +1,39 @@
 
1
+ #!/bin/bash
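+ # LSTM baseline on 64-bit parity: 25 iterations.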
2
+ RUN=1
3
+ ITERATIONS=25
4
+ MODEL_TYPE="lstm"
5
+ LOG_DIR="logs/parity/run${RUN}/${MODEL_TYPE}_${ITERATIONS}"
6
+ SEED=$((RUN - 1))
7
+
8
+ python -m tasks.parity.train \
9
+ --log_dir $LOG_DIR \
10
+ --seed $SEED \
11
+ --iterations $ITERATIONS \
12
+ --model_type $MODEL_TYPE \
13
+ --parity_sequence_length 64 \
14
+ --n_test_batches 20 \
15
+ --d_model 706 \
16
+ --d_input 512 \
17
+ --heads 8 \
18
+ --dropout 0.0 \
19
+ --positional_embedding_type="custom-rotational-1d" \
20
+ --backbone_type="parity_backbone" \
21
+ --no-full_eval \
22
+ --weight_decay 0.0 \
23
+ --gradient_clipping -1 \
24
+ --use_scheduler \
25
+ --scheduler_type "cosine" \
26
+ --milestones 0 0 0 \
27
+ --gamma 0 \
28
+ --dataset "parity" \
29
+ --batch_size 64 \
30
+ --batch_size_test 256 \
31
+ --lr=0.0001 \
32
+ --training_iterations 200001 \
33
+ --warmup_steps 500 \
34
+ --track_every 1000 \
35
+ --save_every 10000 \
36
+ --no-reload \
37
+ --no-reload_model_only \
38
+ --device 0 \
39
+ --no-use_amp \
tasks/parity/scripts/train_lstm_25_certain.sh ADDED
@@ -0,0 +1,40 @@
 
1
+ #!/bin/bash
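+ # LSTM baseline on 64-bit parity: 25 iterations, using the most-certain-iteration output (--use_most_certain_with_lstm).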
2
+ RUN=3
3
+ ITERATIONS=25
4
+ MODEL_TYPE="lstm"
5
+ LOG_DIR="logs/parity/run${RUN}/${MODEL_TYPE}_${ITERATIONS}_certain"
6
+ SEED=$((RUN - 1))
7
+
8
+ python -m tasks.parity.train \
9
+ --log_dir $LOG_DIR \
10
+ --seed $SEED \
11
+ --iterations $ITERATIONS \
12
+ --model_type $MODEL_TYPE \
13
+ --parity_sequence_length 64 \
14
+ --n_test_batches 20 \
15
+ --d_model 706 \
16
+ --d_input 512 \
17
+ --heads 8 \
18
+ --dropout 0.0 \
19
+ --positional_embedding_type="custom-rotational-1d" \
20
+ --backbone_type="parity_backbone" \
21
+ --no-full_eval \
22
+ --weight_decay 0.0 \
23
+ --gradient_clipping -1 \
24
+ --use_scheduler \
25
+ --scheduler_type "cosine" \
26
+ --milestones 0 0 0 \
27
+ --gamma 0 \
28
+ --dataset "parity" \
29
+ --batch_size 64 \
30
+ --batch_size_test 256 \
31
+ --lr=0.0001 \
32
+ --training_iterations 200001 \
33
+ --warmup_steps 500 \
34
+ --track_every 1000 \
35
+ --save_every 10000 \
36
+ --no-reload \
37
+ --no-reload_model_only \
38
+ --device 0 \
39
+ --no-use_amp \
40
+ --use_most_certain_with_lstm