Latent Space: The AI Engineer Podcast

SAM 3: The Eyes for AI — Nikhila & Pengchuan (Meta Superintelligence), ft. Joseph Nelson (Roboflow)

December 18, 2025

AI Summary

🎙️ The Voices & The Context

  • The Format: Panel discussion with demos and interviews. The expert-led episode blends casual demo walkthroughs with deep technical dives into Meta's SAM 3 launch, tying open-source AI advances to real-world computer vision breakthroughs in a collaborative, forward-looking atmosphere.
  • The Key Players:
    • Nikhila Ravi: Meta researcher who has led the SAM series for nearly four years, sharing its evolution from SAM 1 to SAM 3.
    • Pengchuan Zhang: Meta researcher with nine years in computer vision, focused on SAM 3's data engine and benchmarks.
    • Joseph Nelson: Roboflow CEO and co-host, highlighting production impact and tools like RF-DETR.
    • swyx: Latent Space co-host, probing technical details and future trends.

What you'll learn

  • 1 **(00:00) 🎙️ Introduction: Nikhila Ravi & Pengchuan Zhang (Meta SAM Team Researchers)**
  • 2 **(05:27) SAM 3 Demo: Concept Prompts for Images & Video**
  • 3 **(09:13) SAM 3 Speed & Scalability**
  • 4 **(10:51) SAM 3 Development: From POC to Atomic Concepts**
  • 5 **(13:42) Real-World SAM Impact via Roboflow Stats**
  • 6 **(17:17) Real-World Eval as Ultimate Metric**
  • 7 **(18:28) Fine-Tuning SAM 3: Low-Data Adaptation**

Show Notes

As with all demo-heavy podcasts, and especially vision AI ones, we encourage watching along on our YouTube (and tossing us an upvote/subscribe if you like!)

From SAM 1's 11-million-image data engine to SAM 2's memory-based video tracking, the Segment Anything project from Meta Superintelligence Labs (MSL) has redefined what's possible in computer vision. Now SAM 3 takes the next leap, concept segmentation: prompting with natural language like "yellow school bus" or "tablecloth" to detect, segment, and track every instance across images and video, in real time, with human-level exhaustivity. And with the latest SAM Audio (https://x.com/aiatmeta/status/2000980784425931067?s=46), SAM can now even segment audio output!

We sat down with Nikhila Ravi (SAM lead at Meta) and Pengchuan Zhang (SAM 3 researcher) alongside Joseph Nelson (CEO, Roboflow) to unpack how SAM 3 unifies interactive segmentation, open-vocabulary detection, video tracking, and more into a single model that runs in 30ms on images and scales to real-time video on multi-GPU setups. We dig into the data engine that automated exhaustive annotation, cutting it from two minutes per image to 25 seconds using AI verifiers fine-tuned on Llama, and the new SACO (Segment Anything with Concepts) benchmark, which covers 200,000+ unique concepts versus the previous 1.2k. We also cover how SAM 3 separates recognition from localization with a presence token, why decoupling the detector and tracker was critical to preserving object identity in video, and how SAM 3 Agents unlock complex visual reasoning by pairing SAM 3 with multimodal LLMs like Gemini. Finally, we look at the real-world impact: 106 million smart polygons created on Roboflow, saving humanity an estimated 130+ years of labeling time across fields from cancer research to underwater trash cleanup to autonomous vehicle perception.
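
For a rough sense of scale, here is a back-of-the-envelope sketch of what the headline numbers above imply. It is plain arithmetic on the figures quoted in these notes, not anything from the SAM 3 or Roboflow codebases. The helper names are hypothetical, the "130+ years" is treated as exactly 130 years of continuous wall-clock time (an assumption), and the throughput figure assumes one image at a time with no batching.

```python
# Back-of-the-envelope arithmetic on figures quoted in the show notes.
# All helper names are illustrative; none come from SAM 3 or Roboflow code.

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: continuous wall-clock years


def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the 'after' process is than the 'before' one."""
    return before_s / after_s


def images_per_second(latency_ms: float) -> float:
    """Sequential throughput implied by a per-image latency (no batching)."""
    return 1000.0 / latency_ms


def seconds_saved_per_item(years_saved: float, items: int) -> float:
    """Average time saved per item implied by a total saving."""
    return years_saved * SECONDS_PER_YEAR / items


# Data engine: two minutes per image down to 25 seconds.
annotation_speedup = speedup(120, 25)

# Image inference: 30 ms per image.
throughput = images_per_second(30)

# Roboflow: 106 million smart polygons, 130+ years saved (lower bound used).
per_polygon_s = seconds_saved_per_item(130, 106_000_000)

# SACO benchmark: 200,000+ unique concepts vs. the previous 1.2k.
concept_ratio = 200_000 / 1_200

print(f"annotation speedup:      {annotation_speedup:.1f}x")
print(f"image throughput:        {throughput:.1f} images/s")
print(f"time saved per polygon:  {per_polygon_s:.1f} s")
print(f"concept vocabulary gain: {concept_ratio:.0f}x")
```

Under those assumptions, the quoted numbers work out to roughly a 4.8x annotation speedup, about 33 images per second, a bit under 40 seconds saved per polygon, and a concept vocabulary more than 160 times larger. Because the source figures are stated as "130+" and "200,000+", the last two are lower bounds.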

We discuss:

  • What SAM 3 is: a unified model for concept-prompted segmentation, detection, and tracking in images and video using atomic visual concepts like "purple umbrella" or "watering can"

  • How concept prompts work: short text phrases that find all instances of a category without manual clicks, plus visual exemplars (boxes, clicks) to refine and adapt on the fly

  • Real-time performance: 30ms per image (with 100 detected objects on an H200); for video, 10 objects on 2×H200, 28 on 4×, and 64 on 8×, with parallel inference and "fast mode" tracking

  • The SACO benchmark: 200,000+ unique c