Lenny's Podcast: Product | Career | Growth

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

September 25, 2025

AI Summary

Hamel Husain and Shreya Shankar explain evals as a systematic process for measuring and improving AI applications, rooted in error analysis of real user traces rather than upfront tests. Using a real example from Nurture Boss, an AI assistant for property managers, they demonstrate how evals reveal hidden issues like poor human handoffs or garbled text conversations, turning chaotic logs into prioritized fixes. The process is high-ROI, addictive for builders, and grounded in established data science techniques like open coding.

Starting with error analysis

Begin by sampling ~100 production traces from an observability tool like Braintrust or LangSmith. A domain expert—often the product manager acting as "benevolent dictator"—reviews them manually, noting the first upstream error per trace, such as "should have handed off to human" or "hallucinated virtual tour." Avoid committees or LLMs here; human context spots product smells that AI misses. Continue until theoretical saturation, where no new error types emerge, typically after 40-100 traces.
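The sampling and stopping rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `patience` threshold (how many consecutive traces without a new error type count as saturation), and the `min_reviewed` floor are hypothetical choices, not something the episode prescribes. The notes themselves still come from a human reviewer; the code only does the bookkeeping.

```python
import random
from collections import Counter


def sample_traces(trace_ids, k=100, seed=0):
    """Draw up to k trace IDs uniformly at random, reproducibly (seed is an assumption)."""
    rng = random.Random(seed)
    return rng.sample(trace_ids, min(k, len(trace_ids)))


def review_until_saturation(notes, patience=20, min_reviewed=40):
    """Tally reviewer notes in order and stop once no new error type has appeared
    for `patience` consecutive traces (after at least `min_reviewed` traces).

    notes: list of (trace_id, note) pairs, where note is the reviewer's label for
    the first upstream error in that trace, or None if the trace looked fine.
    Returns (number_of_traces_reviewed, Counter of error types seen).
    """
    seen = Counter()
    since_new = 0  # traces reviewed since the last previously unseen error type
    for reviewed, (_, note) in enumerate(notes, start=1):
        if note is not None and note not in seen:
            since_new = 0
        else:
            since_new += 1
        if note is not None:
            seen[note] += 1
        if reviewed >= min_reviewed and since_new >= patience:
            return reviewed, seen
    return len(notes), seen
```

In practice the trace IDs would be exported from whatever observability tool is in use, and the resulting counter gives a first rough view of which error types are most common.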

What you'll learn

1. **[05:04] Evals Defined and Why They Matter**
2. **[06:36] Evals vs Unit Tests Spectrum**
3. **[10:06] Live Error Analysis Demo (Nurture Boss AI Assistant)**
4. **[17:07] Open Coding Process and Benevolent Dictator**
5. **[31:42] Axial Coding: Synthesize Errors with LLMs**
6. **[44:40] Prioritize via Pivot Tables and Decide on Evals**
7. **[48:31] Building/Validating LLM-as-Judge Evals**

Show Notes

Hamel Husain and Shreya Shankar teach the world’s most popular course on AI evals and have trained over 2,000 PMs and engineers (including many teams at OpenAI and Anthropic). In this conversation, they demystify the process of developing effective evals, walk through real examples, and share practical techniques that’ll help you improve your AI product.

What you’ll learn:

1. WTF evals are

2. Why they’ve become the most important new skill for AI product builders

3. A step-by-step walkthrough of how to create an effective eval

4. A deep dive into error analysis, open coding, and axial coding

5. Code-based evals vs. LLM-as-judge

6. The most common pitfalls and how to avoid them

7. Practical tips for implementing evals with minimal time investment (30 minutes per week after initial setup)

8. Insight into the debate between “vibes” and systematic evals

—

Brought to you by:

Fin—The #1 AI agent for customer service