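# PSI-0.5 Usage Guide

PSI-0.5 is a promptable physical world model. It accepts notation strings such
as `rgb0->rgb1`, `rgb0,f01->f01,rgb1`, and `rgb0,c01->rgb1`, then fills in the
requested missing visual variables. In the examples in this guide, the
variables to the left of `->` are passed to `generate` as keyword arguments,
and the variables to the right are returned in the order listed.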
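
To make the notation concrete, here is a minimal sketch of how such a string can be split into supplied and requested variables. It reflects a reading of the notation inferred from the examples in this guide, not the model's own parser; `parse_prompt` is a hypothetical helper and is not part of the PSI-0.5 API.

```python
def parse_prompt(prompt: str) -> tuple[list[str], list[str]]:
    """Split a notation string into (given, requested) variable names.

    Hypothetical helper for illustration only; the model's remote code
    may parse prompts differently.
    """
    if prompt.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {prompt!r}")
    left, right = prompt.split("->")
    given = [name.strip() for name in left.split(",") if name.strip()]
    requested = [name.strip() for name in right.split(",") if name.strip()]
    if not given or not requested:
        raise ValueError(f"both sides of '->' need at least one variable: {prompt!r}")
    return given, requested


print(parse_prompt("rgb0,f01->f01,rgb1"))
# (['rgb0', 'f01'], ['f01', 'rgb1'])
```

Note that a variable such as `f01` can appear on both sides: in the sparse flow example it is supplied as a sparse prompt and returned as a dense prediction.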

## Install

```bash
conda create -n psi-demos python=3.10 -y
conda activate psi-demos
pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu126
pip install transformers huggingface-hub einops h5py tiktoken numpy pillow opencv-python gradio matplotlib scipy
```

The PyTorch command above installs the CUDA 12.6 wheel used on the ccn2 A40
nodes. For other machines, install the PyTorch build recommended for your
driver/platform first.
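
After installing, a quick check that every package from the commands above is importable can save a failed model load later. The sketch below uses only the standard library; the pip-name to import-name mapping is the usual one for these packages, and `missing_packages` is a hypothetical helper, not something shipped with PSI-0.5.

```python
from importlib.util import find_spec

# pip package name -> top-level import name, for the install commands above.
REQUIRED = {
    "torch": "torch",
    "transformers": "transformers",
    "huggingface-hub": "huggingface_hub",
    "einops": "einops",
    "h5py": "h5py",
    "tiktoken": "tiktoken",
    "numpy": "numpy",
    "pillow": "PIL",
    "opencv-python": "cv2",
    "gradio": "gradio",
    "matplotlib": "matplotlib",
    "scipy": "scipy",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


missing = missing_packages()
print("missing packages:", ", ".join(missing) if missing else "none")
```

This only confirms that the modules can be located; it does not verify versions or that the installed PyTorch build matches your CUDA driver.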

## Load With Transformers

```python
from PIL import Image
from transformers import AutoModel

# trust_remote_code=True runs the model code shipped in the repository.
predictor = AutoModel.from_pretrained(
    "StanfordNeuroAILab/psi0_5",
    trust_remote_code=True,
    device="cuda:0",
)

# Predict the next frame (rgb1) from a single input frame (rgb0).
rgb1 = predictor.generate(
    "rgb0->rgb1",
    rgb0=Image.open("scene.png").convert("RGB"),
    seed=1110,
    temp=1.0,
    top_k=1000,
    top_p=1.0,
)
rgb1.save("scene_next.png")
```

## Sparse Flow Prompt

```python
from PIL import Image
from transformers import AutoModel

predictor = AutoModel.from_pretrained(
    "StanfordNeuroAILab/psi0_5",
    trust_remote_code=True,
    device="cuda:0",
)

rgb0 = Image.open("block_slide_rgb0.png").convert("RGB")
# Build a sparse flow prompt from one pair of pixel coordinates.
f01 = predictor.sparse_flow_prompt([((70, 221), (168, 221))], rgb0.size)

# Outputs come back in the order listed after "->": dense flow, then rgb1.
dense_flow, rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=rgb0,
    f01=f01,
    seed=1110,
    num_seq_patches=256,
)
```
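
The point pairs passed to `sparse_flow_prompt` are easier to reason about as displacement vectors. The sketch below assumes each pair is `((x0, y0), (x1, y1))` in pixel coordinates, read as a start point and an end point, and that the size argument is `(width, height)` as returned by PIL's `Image.size`. The pair semantics are an assumption that is not confirmed by this guide, `displacements` is a hypothetical helper, and the 512x512 image size is made up for the example.

```python
def displacements(point_pairs, image_size):
    """Convert ((x0, y0), (x1, y1)) pairs into (dx, dy) vectors.

    Assumes (x, y) pixel coordinates and a (width, height) image size.
    Raises ValueError if any point lies outside the image.
    """
    width, height = image_size
    result = []
    for (x0, y0), (x1, y1) in point_pairs:
        for x, y in ((x0, y0), (x1, y1)):
            if not (0 <= x < width and 0 <= y < height):
                raise ValueError(f"point {(x, y)} is outside a {width}x{height} image")
        result.append((x1 - x0, y1 - y0))
    return result


# The pair from the example above, with an assumed 512x512 image.
print(displacements([((70, 221), (168, 221))], (512, 512)))
# [(98, 0)]
```

Under these assumptions, the prompt in the example describes a purely horizontal motion of 98 pixels, which matches the `block_slide` file name.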

## Depth, Flow, And RGB

```python
import numpy as np
from PIL import Image

# Assumes `predictor` has been loaded as in the sections above.
rgb0 = Image.open("billiards_rgb0.png").convert("RGB")
depth0 = np.load("billiards_d0_meters.npy").astype(np.float32)
f01 = predictor.sparse_flow_prompt([((392, 171), (238, 94))], rgb0.size)

# Outputs follow the order after "->": dense flow, depth at frame 1, rgb1.
dense_flow, depth1, rgb1 = predictor.generate(
    "rgb0,d0,f01->f01,d1,rgb1",
    rgb0=rgb0,
    d0=depth0,
    f01=f01,
    seed=1110,
    num_seq_patches=256,
)
```

## Camera-Conditioned Novel View Synthesis

```python
from PIL import Image

# Assumes `predictor` has been loaded as in the sections above.
camera = {
    "fov_x": 60.0,
    "fov_y": 60.0,
    "euler_angles": [0.0, -0.12, 0.0],
    "translation": [0.10, 0.0, 0.04],
}

rgb1 = predictor.generate(
    "rgb0,c01->rgb1",
    rgb0=Image.open("coffee_mug_000.png").convert("RGB"),
    c01=camera,
    seed=1110,
)
```
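
Because the camera prompt is a plain dictionary, a malformed entry (a missing key, a two-element rotation) may only surface deep inside generation. The sketch below builds the same dictionary with basic shape checks. The key names are taken from the example above; everything else, including the `make_camera` helper and the specific checks, is an assumption for illustration. Units and rotation conventions are not stated in this guide, so the sketch does not interpret them.

```python
import math


def make_camera(fov_x, fov_y, euler_angles, translation):
    """Build a camera dict shaped like the example above, with basic checks.

    Hypothetical helper: it validates shapes and finiteness only, and makes
    no claim about units or axis conventions.
    """
    euler_angles = [float(v) for v in euler_angles]
    translation = [float(v) for v in translation]
    if len(euler_angles) != 3:
        raise ValueError("euler_angles must have exactly 3 components")
    if len(translation) != 3:
        raise ValueError("translation must have exactly 3 components")
    values = [float(fov_x), float(fov_y), *euler_angles, *translation]
    if not all(math.isfinite(v) for v in values):
        raise ValueError("camera parameters must be finite numbers")
    if fov_x <= 0 or fov_y <= 0:
        raise ValueError("fields of view must be positive")
    return {
        "fov_x": float(fov_x),
        "fov_y": float(fov_y),
        "euler_angles": euler_angles,
        "translation": translation,
    }


camera = make_camera(60.0, 60.0, [0.0, -0.12, 0.0], [0.10, 0.0, 0.04])
print(sorted(camera))
```

The resulting dictionary can be passed as `c01=camera` exactly as in the example above.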

## Advanced Paths

All runtime files needed by Transformers remote code live at the repository
root. The release manifest lists the default checkpoint and tokenizer assets for
reproducibility.

PSI-0.5 is a modestly sized model that has not undergone any post-training yet,
and some of its rollouts diverge. We recommend unrestricted sampling for flow
prediction and `top_p=0.9`, `top_k=1000` for RGB rendering. Careful prompting
can significantly improve generations, and simple harnesses such as those in the
provided Gradio app can be used to steer the model much more effectively. We
believe this approach can scale to more comprehensive models of the world while
keeping the same highly controllable API.
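
For readers unfamiliar with the sampling parameters recommended above, the sketch below shows what top-k and top-p (nucleus) filtering do to a probability distribution. It is a generic pure-Python illustration of the standard technique, applied as top-k first and then top-p on the renormalized result. It is not PSI-0.5's implementation, and `filter_top_k_top_p` is a hypothetical name. "Unrestricted sampling" is read here as no top-k limit with `top_p=1.0`, which is an assumption.

```python
def filter_top_k_top_p(probs, top_k=None, top_p=1.0):
    """Zero out entries outside the top-k / top-p set and renormalize.

    probs: list of non-negative probabilities.
    top_k: keep at most this many of the most likely entries (None = no limit).
    top_p: keep the smallest prefix of the sorted entries whose cumulative
           probability reaches top_p.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    mass = sum(probs[i] for i in order)
    if mass <= 0:
        raise ValueError("probabilities must have positive total mass")

    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass  # renormalized after the top-k cut
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / kept_mass
    return filtered


probs = [0.5, 0.3, 0.15, 0.05]
print(filter_top_k_top_p(probs))                         # unrestricted: all entries kept
print(filter_top_k_top_p(probs, top_k=1000, top_p=0.9))  # least likely entry removed
```

With `top_p=0.9`, the low-probability tail is dropped before sampling, which is a common way to trade some diversity for fewer implausible outputs.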