Emergent Compositional Communication for Latent World Properties

Tomasz Kaszyński, 2026

Summary

Neural agents with different vision backbones develop shared compositional languages about physical properties through a discrete Gumbel-Softmax bottleneck. Each message position self-organizes to encode a specific physical property (elasticity, friction). The protocol achieves 91.5% accuracy on unseen collision outcomes and 85.6% on real camera footage (Physics 101 dataset).

Model Architecture

CompositionalSender — the core trainable module:

TemporalEncoder:
  Conv1d(384 → 256, k=3) → ReLU
  Conv1d(256 → 128, k=3) → ReLU
  AdaptiveAvgPool1d(1)
  Linear(128 → 128) → ReLU

Message Heads (×2):
  Linear(128 → 8) → Gumbel-Softmax(τ=1.0)

Output: 2 discrete tokens per agent, each ∈ {0, ..., 7}
  • Input: Frozen DINOv2-S features (384-dim per frame)
  • Bottleneck: 2 heads × vocab 8 = 16-dim one-hot message per agent
  • Sender params: 412,176
  • Receiver params: 12,610
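The sender parameter count can be verified by hand from the layer shapes listed above:

```python
# Sanity check: sender parameter count from the layer shapes above (weights + biases)
conv1 = 384 * 256 * 3 + 256   # Conv1d(384 -> 256, k=3)
conv2 = 256 * 128 * 3 + 128   # Conv1d(256 -> 128, k=3)
fc    = 128 * 128 + 128       # Linear(128 -> 128)
heads = 2 * (128 * 8 + 8)     # two Linear(128 -> 8) message heads
total = conv1 + conv2 + fc + heads
print(total)  # 412176
```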

Checkpoints

| File | Description | Size |
|------|-------------|------|
| phase54b_model.pt | Main result model (DINOv2 features, 2-agent, 2×8 bottleneck) | 3.5 MB |
| phase54c_model.pt | Best multi-seed variant | 3.5 MB |
| phase54c_seed42_model.pt | Seed 42 | 3.3 MB |
| phase54c_seed123_model.pt | Seed 123 | 3.3 MB |
| phase54c_seed456_model.pt | Seed 456 | 3.3 MB |
| phase54c_seed789_model.pt | Seed 789 | 3.3 MB |
| phase54c_seed1337_model.pt | Seed 1337 | 3.3 MB |
| phase87_phys101_spring_features.pt | Pre-extracted DINOv2 features for Physics 101 spring (206 clips) | 3.2 MB |

Checkpoint Format

Each .pt file is a dictionary with keys:

{
    "sender_2x8": <state_dict>,      # CompositionalSender weights
    "receiver_2x8": <state_dict>,    # CompositionalReceiver weights
    "sender_1x64": <state_dict>,     # Alternative 1×64 bottleneck sender
    "receiver_1x64": <state_dict>,   # Alternative 1×64 receiver
}
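The 2×8 and 1×64 variants span the same 64-message code space: a two-token message can be flattened into a single symbol. The mapping below is illustrative and assumed for exposition; the repository may order the tokens differently:

```python
# Flatten a 2x8 message (t0, t1), each token in {0..7}, into a single code in {0..63}
def message_to_code(t0, t1, vocab_size=8):
    return t0 * vocab_size + t1

def code_to_message(code, vocab_size=8):
    return divmod(code, vocab_size)

print(message_to_code(3, 7))  # 31
print(code_to_message(31))    # (3, 7)
```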

Usage

import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalEncoder(nn.Module):
    def __init__(self, hidden_dim=128, input_dim=384, n_frames=4):
        super().__init__()
        ks = min(3, n_frames)
        self.temporal = nn.Sequential(
            nn.Conv1d(input_dim, 256, kernel_size=ks, padding=ks // 2), nn.ReLU(),
            nn.Conv1d(256, 128, kernel_size=ks, padding=ks // 2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.fc = nn.Sequential(nn.Linear(128, hidden_dim), nn.ReLU())
    def forward(self, x):
        return self.fc(self.temporal(x.permute(0, 2, 1)).squeeze(-1))

class CompositionalSender(nn.Module):
    def __init__(self, hidden_dim=128, input_dim=384, vocab_size=8, n_heads=2):
        super().__init__()
        self.encoder = TemporalEncoder(hidden_dim, input_dim)
        self.vocab_size = vocab_size
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, vocab_size) for _ in range(n_heads)])
    def forward(self, x, tau=1.0):
        h = self.encoder(x)
        tokens = []
        for head in self.heads:
            logits = head(h)
            if self.training:
                # Straight-through Gumbel-Softmax: differentiable discrete sampling
                one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
                tokens.append(one_hot.argmax(dim=-1))
            else:
                # Deterministic tokens at inference time
                tokens.append(logits.argmax(dim=-1))
        return torch.stack(tokens, dim=-1)  # [batch, n_heads]

# Load
ckpt = torch.load("phase54c_model.pt", map_location="cpu")
sender = CompositionalSender(hidden_dim=128, input_dim=384, vocab_size=8, n_heads=2)
sender.load_state_dict(ckpt["sender_2x8"])
sender.eval()

# Run on DINOv2 features: [batch, n_frames, 384]
features = torch.randn(1, 4, 384)  # Replace with real DINOv2 features
tokens = sender(features)
print(f"Discrete physics code: {tokens}")  # e.g., tensor([[3, 7]])

Training Details

  • Dataset: Physics 101 ramp scenario (surface friction + elasticity)
  • Backbone: Frozen DINOv2-S (dinov2_vits14, 21M params, not included)
  • Training: 400 epochs, Adam (sender lr=1e-3, receiver lr=3e-3)
  • Gumbel-Softmax: τ annealed from 3.0 → 1.0, hard after epoch 30
  • Iterated learning: Receiver reset every 40 epochs (3 parallel receivers)
  • Entropy regularization: coefficient=0.03 when entropy < 0.1
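The temperature settings above can be sketched as a schedule function. This is a minimal reconstruction assuming a linear anneal over the first 30 epochs; the exact schedule shape is not stated here:

```python
def gumbel_tau(epoch, tau_start=3.0, tau_end=1.0, anneal_epochs=30):
    """Linearly anneal tau from tau_start to tau_end; switch to hard
    (straight-through) sampling once the anneal is complete."""
    if epoch >= anneal_epochs:
        return tau_end, True   # hard sampling
    frac = epoch / anneal_epochs
    return tau_start + frac * (tau_end - tau_start), False

print(gumbel_tau(0))   # (3.0, False)
print(gumbel_tau(15))  # (2.0, False)
print(gumbel_tau(30))  # (1.0, True)
```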

Key Results

  • 91.5% accuracy on unseen collision outcomes (80 random seeds)
  • 85.6% accuracy on real camera footage (Physics 101)
  • PosDis = 0.999 — near-perfect positional disentanglement
  • 25× compression with 94% predictive performance retained
  • Works across V-JEPA 2, DINOv2, and CLIP ViT-L/14
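PosDis (positional disentanglement) measures, for each message position, how much more mutual information it carries about its best-matching attribute than about the runner-up, normalized by the position's entropy (Chaabouni et al., 2020). The numpy sketch below is a reconstruction of that standard metric, not the authors' code:

```python
import numpy as np

def mutual_info(x, y):
    """I(X;Y) in nats between two discrete label arrays."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == a) * np.mean(y == b)))
    return mi

def entropy(x):
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def posdis(messages, attributes):
    """messages: [N, n_pos] int tokens; attributes: [N, n_attr] ground-truth labels."""
    scores = []
    for j in range(messages.shape[1]):
        s = messages[:, j]
        mis = sorted((mutual_info(s, attributes[:, k])
                      for k in range(attributes.shape[1])), reverse=True)
        h = entropy(s)
        if h > 0:
            scores.append((mis[0] - mis[1]) / h)
    return float(np.mean(scores))

# Perfect disentanglement: each position encodes exactly one attribute
attrs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5)
print(posdis(attrs.copy(), attrs))  # 1.0
```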

Citation

@article{kaszynski2026emergent,
  title={Emergent Compositional Communication for Latent World Properties},
  author={Kaszy{\'n}ski, Tomasz},
  journal={arXiv preprint arXiv:2604.03266},
  year={2026}
}