Owl IDM v4

Inverse Dynamics Model (IDM) that predicts keyboard and mouse inputs from gameplay video.

Model Description

  • Input: Sequence of RGB frames (128x128), normalized to [-1, 1]
  • Output:
    • Button predictions (7 outputs): W, A, S, D, Space, LShift, LCtrl
    • Mouse movement (dx, dy in pixels)
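
The input format above can be produced from raw decoded frames roughly as follows. This is a minimal sketch, not the pipeline's own preprocessing: the helper name frames_to_model_input, the bilinear resize, and the assumption that frames arrive as uint8 RGB in [T, H, W, 3] layout are all illustrative choices that the card does not specify.

```python
import torch
import torch.nn.functional as F


def frames_to_model_input(frames_uint8: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: uint8 RGB frames [T, H, W, 3] -> float [1, T, 3, 128, 128] in [-1, 1]."""
    if frames_uint8.dtype != torch.uint8:
        raise TypeError("expected uint8 frames with values in [0, 255]")
    # Channels-last to channels-first: [T, H, W, 3] -> [T, 3, H, W]
    x = frames_uint8.permute(0, 3, 1, 2).float()
    # Resize to the 128x128 resolution the model expects (resize mode is an assumption)
    x = F.interpolate(x, size=(128, 128), mode="bilinear", align_corners=False)
    # Map [0, 255] to [-1, 1]; clamp guards against tiny floating-point overshoot
    x = (x / 127.5 - 1.0).clamp_(-1.0, 1.0)
    # Add the batch dimension: [1, T, 3, 128, 128]
    return x.unsqueeze(0)
```

If the pipeline already resizes or normalizes internally, this step would be redundant, so it is worth checking the repository before applying it.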

Architecture

The architecture is based on the OpenAI VPT IDM, with some general improvements.

  • Backbone: Conv3D temporal mixer → ResNet spatial encoder → learned spatial pooling
  • Temporal model: Transformer (d_model=1024, 12 layers)
  • Window size: 32 frames
  • Model size: not reported
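
To make the 32-frame window concrete, the sketch below computes the frame spans that would cover a longer clip. How the released pipeline actually chunks long videos is not documented here; the function name window_spans, the default stride, and the choice to shift the final window back so it ends on the last frame are all assumptions for illustration.

```python
def window_spans(num_frames: int, window: int = 32, stride: int = 16):
    """Hypothetical helper: return (start, end) frame spans of length `window` covering a clip."""
    if num_frames < window:
        raise ValueError("clip is shorter than one window")
    # Regularly spaced window starts
    starts = list(range(0, num_frames - window + 1, stride))
    # If the last regular window stops short of the clip end, add one aligned to the end
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]
```

With stride equal to the window size the spans tile the clip without overlap; a smaller stride gives overlapping windows, which leaves room to merge predictions from frames that have context on both sides.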

Training

  • Dataset: FPS gameplay recordings
  • Preprocessing: Frames scaled to [-1, 1]; mouse deltas scaled with log1p
  • Loss: BCE with class-balancing pos_weight for buttons, Huber for mouse
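
The mouse scaling and loss described above can be sketched as follows. This is an illustration under stated assumptions, not the training code: the signed form of the log1p transform, the negatives-over-positives rule for pos_weight, the equal weighting of the two loss terms, and all function names are hypothetical. The actual settings live in config.yml.

```python
import torch
import torch.nn.functional as F


def scale_mouse(delta_px: torch.Tensor) -> torch.Tensor:
    """Signed log1p compression of pixel deltas (assumed form of the log1p scaling)."""
    return torch.sign(delta_px) * torch.log1p(delta_px.abs())


def unscale_mouse(scaled: torch.Tensor) -> torch.Tensor:
    """Inverse of scale_mouse: recover pixel deltas from the compressed values."""
    return torch.sign(scaled) * torch.expm1(scaled.abs())


def balanced_pos_weight(button_targets: torch.Tensor) -> torch.Tensor:
    """Per-button weight = negatives / positives (one common class-balancing choice)."""
    pos = button_targets.sum(dim=0)
    neg = button_targets.shape[0] - pos
    return neg / pos.clamp(min=1.0)  # avoid division by zero for never-pressed buttons


def idm_loss(button_logits, button_targets, mouse_pred, mouse_target_px, pos_weight):
    """BCE with pos_weight on button logits plus Huber on log1p-scaled mouse deltas."""
    bce = F.binary_cross_entropy_with_logits(
        button_logits, button_targets, pos_weight=pos_weight
    )
    huber = F.huber_loss(mouse_pred, scale_mouse(mouse_target_px))
    return bce + huber  # equal weighting is an assumption
```

Compressing the deltas keeps rare, very large mouse flicks from dominating the regression target, while the inverse transform restores predictions to pixels.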

Usage

Installation

pip install git+https://github.com/overworld/owl-idm-3.git

Inference

from owl_idms import InferencePipeline
import torch

pipeline = InferencePipeline.from_pretrained(
    "Overworld/owl-idm-4",
    device="cuda"
)

# video: [batch, frames, channels, height, width] in range [-1, 1]
video = torch.rand(1, 256, 3, 128, 128) * 2 - 1  # random placeholder in [-1, 1)

button_preds, mouse_preds = pipeline(video)
# button_preds: [1, 256, 7] bool   (order: W, A, S, D, Space, LShift, LCtrl)
# mouse_preds:  [1, 256, 2] float  (dx, dy) in pixels

# Check which buttons are pressed at frame 100
for label, pressed in zip(pipeline.button_labels, button_preds[0, 100]):
    if pressed:
        print(f"{label} pressed")
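
Per-frame booleans are often easier to work with as press intervals. The sketch below is a hypothetical post-processing helper, not part of the owl_idms package. It takes plain nested lists, so predictions for one clip could be passed as button_preds[0].tolist(). The label order is copied from the output description above.

```python
BUTTON_LABELS = ["W", "A", "S", "D", "Space", "LShift", "LCtrl"]


def press_intervals(frames, labels=BUTTON_LABELS):
    """Hypothetical helper: per-frame booleans -> {label: [(start, end_exclusive), ...]}."""
    intervals = {label: [] for label in labels}
    open_since = {label: None for label in labels}  # frame where the current press began
    for t, row in enumerate(frames):
        for label, pressed in zip(labels, row):
            if pressed and open_since[label] is None:
                open_since[label] = t  # press starts
            elif not pressed and open_since[label] is not None:
                intervals[label].append((open_since[label], t))  # press ends
                open_since[label] = None
    # Close any press still held at the end of the clip
    for label in labels:
        if open_since[label] is not None:
            intervals[label].append((open_since[label], len(frames)))
    return intervals
```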

Model Files

  • config.yml: Full training configuration
  • model.pt: EMA model weights (state_dict, ready for load_state_dict)
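
Since model.pt is described as a plain state_dict, its size can be inspected without building the model. The sketch below is a hedged illustration: the helper names are hypothetical, and the total counts every tensor in the file, so it may include non-trainable buffers as well as parameters.

```python
import torch


def count_parameters(state_dict) -> int:
    """Sum the element counts of all tensors in a state_dict (buffers included)."""
    return sum(t.numel() for t in state_dict.values() if torch.is_tensor(t))


def load_and_count(path: str = "model.pt") -> int:
    """Load a state_dict from disk onto the CPU and return its total element count."""
    state_dict = torch.load(path, map_location="cpu")
    return count_parameters(state_dict)


# Example usage, assuming model.pt has been downloaded to the working directory:
# print(f"{load_and_count('model.pt') / 1e6:.1f}M tensor elements")
```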

License

MIT License
