[NeurIPS Best Paper] 1000 Layer Networks for Self-Supervised RL — Kevin Wang et al, Princeton

January 2, 2026

AI Summary

🎙️ The Voices & The Context

The Format: A casual conference-side interview on the Latent Space podcast with academic guests, recorded around a fresh Best Paper award at NeurIPS 2025. It blends excitement over the breakthrough with technical deep dives into reinforcement learning. The tone is technical and triumphant, like fresh PhDs geeking out after a victory lap.
The Key Players:
- Jeron (Host): Energetic interviewer steering the chat with probing questions on RL scaling and real-world apps, injecting poker analogies and robotics curiosity.
- Kevin (Princeton undergrad, project lead): Fresh grad who kickstarted the research from a seminar, explains the self-supervised RL recipe with infectious enthusiasm.
- Ishaan (PhD collaborator): Dives into architecture tweaks like residuals and LayerNorm, highlights efficiency gains.
- Ben (Princeton prof/advisor): Skeptical mentor who greenlit the "crazy" deep nets idea despite past failures, offers nuanced caveats on RL vs. self-supervised boundaries.

What you'll learn

1 (00:00) **🎙️ Introduction: Kevin, Ishaan, and Ben (Princeton ML researchers)**
2 (01:30) **Project Origins and Initial Skepticism**
3 (03:30) **Project Overview: Scaling Deep Nets in RL**
4 (05:00) **Key Innovations Unlocking Scaling**
5 (07:00) **Core Insight: Objective > Architecture Alone**
6 (09:00) **Deep Dive on Contrastive Objective**
7 (11:00) **Applications and Domains**

Show Notes

From undergraduate research seminars at Princeton to a Best Paper award at NeurIPS 2025, Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzcinski, and Benjamin Eysenbach defied conventional wisdom by scaling reinforcement learning networks to 1,000 layers deep, unlocking performance gains that the RL community thought impossible.

We caught up with the team live at NeurIPS to dig into the story behind RL1000. Deep networks have worked in language and vision but failed in RL for over a decade, and the reason is not just depth but the objective. The team discovered that self-supervised RL (learning representations of states, actions, and future states via contrastive learning) scales where value-based methods collapse. Three architectural tricks made it work: residual connections, layer normalization, and a shift from regression to classification. Scaling depth is also more parameter-efficient than scaling width, because parameter count grows linearly with depth but quadratically with width.

JAX and GPU-accelerated environments let them collect hundreds of millions of transitions in hours, the data abundance that unlocked scaling in the first place. They describe a "critical depth" phenomenon where performance doesn't just improve but multiplies once you cross 15M+ transitions and add the right architectural components. This isn't just "make networks bigger" but a fundamental shift in RL objectives: their code doesn't have a line saying "maximize rewards"; it's pure self-supervised representation learning.

We also cover how deep-teacher, shallow-student distillation could unlock deployment at scale (train frontier capabilities with 1,000 layers, distill down to efficient inference models), the robotics implications (goal-conditioned RL without human supervision or demonstrations, scaling architecture instead of scaling manual data collection), and their thesis that RL is finally ready to scale like language and vision, not by throwing compute at value functions, but by borrowing the self-supervised, representation-learning paradigms that made the rest of deep learning work.
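
To make the depth-versus-width point concrete, here is a rough sketch that counts the weights and biases of a plain fully connected network. The layer sizes (32 inputs, 256 or 512 hidden units, 64 outputs) and the function name are hypothetical values chosen for illustration, not figures from the paper or the episode.

```python
def mlp_param_count(input_dim, width, depth, output_dim):
    """Count weights and biases in a plain MLP with `depth` hidden layers."""
    dims = [input_dim] + [width] * depth + [output_dim]
    # Each layer contributes a (d_in x d_out) weight matrix plus d_out biases.
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

# Hypothetical sizes, chosen only to show the trend.
base = mlp_param_count(input_dim=32, width=256, depth=4, output_dim=64)
deeper = mlp_param_count(input_dim=32, width=256, depth=8, output_dim=64)
wider = mlp_param_count(input_dim=32, width=512, depth=4, output_dim=64)

print(f"2x depth: {deeper / base:.2f}x parameters")
print(f"2x width: {wider / base:.2f}x parameters")
```

Doubling depth adds more hidden-to-hidden matrices of the same size, so the count roughly doubles. Doubling width enlarges every hidden-to-hidden matrix in both dimensions, so the count approaches four times the baseline.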

We discuss:

The self-supervised RL objective: instead of learning value functions (noisy, biased, spurious), they learn representations where states along the same trajectory are pushed together and states along different trajectories are pushed apart, turning RL into a classification problem
Why naive scaling failed: doubling depth alone degraded performance, but doubling again with residual connections and layer norm made performance skyrocket in one environment, revealing the "critical depth" phenomenon
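
The "push together, push apart" objective in the first item can be sketched as a batch classification loss. This is a minimal illustration assuming an InfoNCE-style formulation with dot-product similarity. The function names and the hand-written toy embeddings are assumptions made for this example, not the authors' code, and a real implementation would produce the embeddings with deep encoders.

```python
import math

def dot(u, v):
    # Similarity between two embeddings (assumed here: a plain dot product).
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(anchor_embs, future_embs):
    """InfoNCE-style loss: anchor i must pick future i out of the whole batch.

    anchor_embs[i] stands in for an embedded (state, action) pair and
    future_embs[i] for an embedded state reached later on the same trajectory.
    Every other future in the batch acts as a negative from another trajectory.
    """
    total = 0.0
    for i, anchor in enumerate(anchor_embs):
        logits = [dot(anchor, future) for future in future_embs]
        # Numerically stable log-sum-exp over all candidate futures.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching future (index i) as the label.
        total += log_norm - logits[i]
    return total / len(anchor_embs)

# Toy batch of two trajectories with hand-written 2-D embeddings.
anchors = [[5.0, 0.0], [0.0, 5.0]]
matched = [[5.0, 0.0], [0.0, 5.0]]     # futures paired with their own anchors
mismatched = [[0.0, 5.0], [5.0, 0.0]]  # futures swapped across trajectories

aligned_loss = contrastive_loss(anchors, matched)
swapped_loss = contrastive_loss(anchors, mismatched)
print(aligned_loss, swapped_loss)
```

The loss is small when each anchor scores its own future highest and large when futures from the other trajectory score higher, which is the sense in which the objective is a classification problem with no reward term.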

More from this podcast: Latent Space: The AI Engineer Podcast