Moonlake: Causal World Models should be Multimodal, Interactive, and Efficient — with Chris Manning and Fan-yun Sun
April 2, 2026
AI Summary
5 min read
Moonlake, founded by Fan-yun Sun, Chris Manning, and Sharon Lin, develops causal world models for multimodal, interactive simulations. These models emphasize action-conditioned predictions over mere video generation, aiming for efficiency through structured abstractions rather than pixel-level scale alone. The guests explain how this addresses gaps in current AI, particularly for embodied agents and gaming.
Core Definition of World Models
World models must predict the consequences of actions over extended time scales, not just the next video frame. Video generators like Sora or Genie 2 produce visually impressive outputs but lack 3D understanding, object persistence, or causal reasoning—pins in a bowling game do not reliably fall when struck, and interactions fail. True world models require action-conditioning: given an input like "throw ball," the system simulates changes in geometry, physics, scores, audio cues, and timers. Observational video data complicates this, as actions must be inferred; simulations provide known actions but demand reasoning for human-like generality.
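As a rough illustration of what action-conditioning means here, the toy sketch below treats a bowling world as structured state (pins, score, clock, audio cue) and advances it only in response to an explicit action. Everything in it is a hypothetical stand-in invented for illustration: the names `BowlingState`, `Throw`, `step`, and `rollout`, and the rule that a throw knocks down the aimed-at pin and its two neighbours. It is not Moonlake's model or API, and a real world model would learn its transition rule rather than hard-code it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BowlingState:
    """Abstract world state: no pixels, only the facts that must stay consistent."""
    pins_standing: frozenset[int]  # pin numbers still upright
    score: int
    clock_s: float                 # elapsed game time in seconds
    last_sound: str                # audio cue produced by the last action


@dataclass(frozen=True)
class Throw:
    """Hypothetical action: aim the ball at one pin number."""
    target_pin: int


def step(state: BowlingState, action: Throw) -> BowlingState:
    """Action-conditioned transition: next state depends on state AND action.

    Toy physics (an assumption): the ball knocks down the target pin and its
    immediate neighbours, if they are still standing.
    """
    in_path = {action.target_pin - 1, action.target_pin, action.target_pin + 1}
    knocked = state.pins_standing & in_path
    return BowlingState(
        pins_standing=state.pins_standing - knocked,  # knocked pins stay down
        score=state.score + len(knocked),
        clock_s=state.clock_s + 5.0,                  # each throw takes 5 s here
        last_sound="crash" if knocked else "silence",
    )


def rollout(state: BowlingState, actions: list[Throw]) -> list[BowlingState]:
    """Predict the consequences of a whole action sequence, not just one frame."""
    trajectory = [state]
    for action in actions:
        state = step(state, action)
        trajectory.append(state)
    return trajectory


initial = BowlingState(frozenset(range(1, 11)), score=0, clock_s=0.0, last_sound="")
trajectory = rollout(initial, [Throw(5), Throw(5)])
```

In this sketch the second throw at the same spot hits nothing, because the pins knocked down by the first throw persist as fallen in the state. That kind of consistency across geometry, score, audio, and time is what the summary says a pure next-frame video generator does not guarantee.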
What you'll learn
- 1 (01:34) **Guest Introductions and Moonlake Origins** - Hosts welcome Fan-yun Sun and Chris Manning, discuss team background and motivation for interactive world models from NVIDIA research and embodied AI needs
- 2 (02:25) **Chris Manning's Vision Shift** - Manning explains move from NLP success to multimodal worlds, critiques vision field's pixel-level stagnation despite scale
- 3 (06:38) **Defining World Models** - Distinguishes true world models (action-conditioned, causal prediction) from video generators like Sora/Veo
- 4 (08:44) **Scaling Video Data Limitations** - Action inference hard from observational videos; simulation enables known actions but data efficiency key
- 5 (11:58) **Human-Like Efficiency and Abstraction** - Advocates abstracted representations for real-time, long-term consistency like neuroscience shows
- 6 (14:37) **Reasoning Model Showcase** - Reasoning traces integrate geometry, physics, affordances for consistent worlds unlike prompt-only LLMs
- 7 (16:02) **Philosophical Clash with Yann LeCun/JEPA** - Language/symbols as cognitive tools vaulted human intelligence over vision-dominant chimps
Show Notes
We’ve been on a bit of a mini World Models series over the last quarter: from introducing the topic with Yi Tay, to exploring Marble with World Labs’ Fei-Fei Li and Justin Johnson, to previewing World Models learned from massive gaming datasets with General Intuition’s Pim de Witte (who has now written down their approach to World Models with Not Boring), to discussing the Cosmos World Model with Andrew White of Edison Scientific on our new Science pod, to writing up our own theses on Adversarial World Models. Meanwhile Nvidia, Waymo and Tesla have published their own approaches, Google has released Genie 3, and Yann LeCun has raised $1B for AMI and published LeWorldModel.
Today’s guests take a radically different approach to World Modeling from every player we just mentioned. While Genie 3 is impressive, its many flaws demonstrate the issues with that approach: terrain clipping, noninteractivity (single player, no physics, and no objects moving other than the player), and a maximum of 60 seconds of immersion.
Moonlake AI (inspired by the DreamWorks logo) is the diametric opposite: immediately multiplayer, incredibly interactive, indefinite lifetime, capable of MANY different kinds of world models by simulating environments, predicting outcomes, and planning over long horizons. This is enabled by bootstrapping from game engines and training custom age