Physics-IQ Benchmark:
Do generative video models learn physical principles from watching videos?

Saman Motamed¹*, Laura Culp², Kevin Swersky², Priyank Jaini²†, Robert Geirhos²†
1: INSAIT, Sofia University    2: Google DeepMind
* Work done while at Google DeepMind.
† Joint last authors.
Paper PDF 📃 · Code · Podcast 📻

TL;DR

We develop the Physics-IQ benchmark and score, which reveal that current generative video models lack physical understanding despite sometimes achieving visual realism. Use our benchmark and dataset to assess your video model's physics understanding!
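As a rough illustration of how the benchmark can be used, the sketch below conditions a model on the opening frames of each real test video and compares the generated continuation against the real continuation. Everything here is an assumption for illustration: the directory layout, the frame split, and the `generate_continuation` hook are hypothetical placeholders, and plain per-pixel MSE stands in for the actual Physics-IQ score, which aggregates several metrics.

```python
# Minimal evaluation sketch (illustrative only). The file layout, the 8-frame
# conditioning split, and generate_continuation() are hypothetical; per-pixel
# MSE below is a stand-in for the real Physics-IQ score, not the paper's metric.
from pathlib import Path

import imageio.v3 as iio
import numpy as np


def generate_continuation(conditioning: np.ndarray, num_frames: int) -> np.ndarray:
    """Plug in your video model here (single-frame i2v or multiframe)."""
    raise NotImplementedError


def evaluate(test_dir: Path, num_conditioning_frames: int = 8) -> float:
    per_video_mse = []
    for video_path in sorted(test_dir.glob("*.mp4")):
        # Read all frames as a (T, H, W, C) float array in [0, 1].
        frames = np.stack(list(iio.imiter(video_path))).astype(np.float32) / 255.0
        conditioning = frames[:num_conditioning_frames]
        real_future = frames[num_conditioning_frames:]
        generated = generate_continuation(conditioning, num_frames=len(real_future))
        per_video_mse.append(float(np.mean((generated - real_future) ** 2)))
    # Lower is better for this stand-in metric (unlike the Physics-IQ score,
    # where higher is better).
    return float(np.mean(per_video_mse))
```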

[Example videos, two per category: solid mechanics, fluid dynamics, optics, thermodynamics, and magnetism.]

Our benchmark tests the physics understanding of generative video models across five categories: solid mechanics, fluid dynamics, optics, thermodynamics, and magnetism.

Abstract 📣

AI video generation is undergoing a revolution, with quality and realism advancing rapidly. These advances have led to a passionate scientific debate: do video models learn “world models” that discover the laws of physics, or are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality?

We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, such as fluid dynamics, optics, solid mechanics, magnetism, and thermodynamics. We find that across a range of current models (Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet), physical understanding is severely limited. At the same time, some test cases are already solved successfully; for instance, fluid dynamics scenarios tend to work better than solid mechanics scenarios. This indicates that acquiring certain physical principles from observation alone may be possible, but significant challenges remain. While we expect rapid advances ahead, our work demonstrates that visual realism does not imply physical understanding.
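The leaderboard below distinguishes two conditioning regimes per model. As a hedged sketch of the distinction (the frame counts are our illustrative assumption, not the paper's exact protocol): an i2v variant is conditioned on a single frame, while a multiframe variant is conditioned on a short clip of preceding frames.

```python
# Illustrative sketch of the two conditioning regimes referenced in the
# leaderboard. The 8-frame clip length is an assumption for illustration only.
import numpy as np


def select_conditioning(frames: np.ndarray, mode: str) -> np.ndarray:
    """frames: (T, H, W, C) real frames preceding the continuation."""
    if mode == "i2v":           # image-to-video: a single conditioning frame
        return frames[-1:]
    if mode == "multiframe":    # a short clip of preceding frames
        return frames[-8:]
    raise ValueError(f"unknown conditioning mode: {mode!r}")
```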

Podcast 📻

On a run and want to get the gist of our paper? Listen to the following podcast!

Leaderboard 🥇

Model                           Physics-IQ Score
VideoPoet (multiframe) 🏆       24.1%
VideoPoet (i2v)                 18.0%
Lumiere (i2v)                   17.1%
Runway Gen 3 (i2v)              16.2%
Lumiere (multiframe)            15.7%
Stable Video Diffusion (i2v)    13.5%
Pika 1.0 (i2v)                   9.2%
Sora (i2v)                       8.8%

Examples 🎞️

[Video grid: for each of four example scenarios, real test frames are shown alongside generated continuations from VideoPoet (multiframe), VideoPoet (i2v), Sora (i2v), Pika 1.0 (i2v), Runway Gen 3 (i2v), Lumiere (multiframe), and Lumiere (i2v).]

BibTeX 😊

If you use our dataset and metrics, please cite our work:

@article{motamed2025physics,
  title={Do generative video models learn physical principles from watching videos?},
  author={Motamed, Saman and Culp, Laura and Swersky, Kevin and Jaini, Priyank and Geirhos, Robert},
  year={2025},
}