Lenny's Podcast: Product | Career | Growth
Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

September 25, 2025

AI Summary

Hamel Husain and Shreya Shankar explain evals as a systematic process for measuring and improving AI applications, rooted in error analysis of real user traces rather than upfront tests. Using a real example from Nurture Boss, an AI assistant for property managers, they demonstrate how evals reveal hidden issues like poor human handoffs or garbled text conversations, turning chaotic logs into prioritized fixes. The process is high-ROI, addictive for builders, and grounded in established data science techniques like open coding.

Starting with error analysis

Begin by sampling ~100 production traces from an observability tool like Braintrust or LangSmith. A domain expert—often the product manager acting as "benevolent dictator"—reviews them manually, noting the first upstream error per trace, such as "should have handed off to human" or "hallucinated virtual tour." Avoid committees or LLMs here; human context spots product smells that AI misses. Continue until theoretical saturation, where no new error types emerge, typically after 40-100 traces.
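The sampling and stopping rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the speakers' tooling: the function names, the `patience` threshold, and the assumption that each reviewer note has already been shortened to a consistent label (or `None` when the trace had no error) are all hypothetical choices made for the example.

```python
import random
from collections import Counter

def sample_traces(traces, k=100, seed=0):
    """Draw a reproducible random sample of up to k traces for manual review."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))

def reviews_until_saturation(notes, patience=20):
    """Return the number of traces reviewed when `patience` consecutive
    traces have produced no new error label, or None if that never happens.

    `notes` holds one entry per reviewed trace: a short label for the first
    upstream error, or None if the trace looked fine (an assumed format).
    """
    seen = set()
    since_new = 0
    for reviewed, note in enumerate(notes, start=1):
        if note is not None and note not in seen:
            seen.add(note)      # a new error type resets the counter
            since_new = 0
        else:
            since_new += 1      # nothing new learned from this trace
        if since_new >= patience:
            return reviewed
    return None

def tally(notes):
    """Count how often each error label occurs, most frequent first."""
    return Counter(n for n in notes if n is not None).most_common()

# Hypothetical notes from a very small review session.
notes = ["should have handed off to human", None,
         "hallucinated virtual tour",
         "should have handed off to human", None, None]
print(reviews_until_saturation(notes, patience=3))
print(tally(notes))
```

The counts from `tally` are the kind of summary a pivot table would give when deciding which failure modes to fix first; the right `patience` value is a judgment call that depends on how varied the traces are.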

What you'll learn

1. [05:04] Evals Defined and Why They Matter
2. [06:36] Evals vs Unit Tests Spectrum
3. [10:06] Live Error Analysis Demo (Nurture Boss AI Assistant)
4. [17:07] Open Coding Process and Benevolent Dictator
5. [31:42] Axial Coding: Synthesize Errors with LLMs
6. [44:40] Prioritize via Pivot Tables and Decide on Evals
7. [48:31] Building/Validating LLM-as-Judge Evals

Show Notes

Hamel Husain and Shreya Shankar teach the world’s most popular course on AI evals and have trained over 2,000 PMs and engineers (including many teams at OpenAI and Anthropic). In this conversation, they demystify the process of developing effective evals, walk through real examples, and share practical techniques that’ll help you improve your AI product.

What you’ll learn:

1. WTF evals are

2. Why they’ve become the most important new skill for AI product builders

3. A step-by-step walkthrough of how to create an effective eval

4. A deep dive into error analysis, open coding, and axial coding

5. Code-based evals vs. LLM-as-judge

6. The most common pitfalls and how to avoid them

7. Practical tips for implementing evals with minimal time investment (30 minutes per week after initial setup)

8. Insight into the debate between “vibes” and systematic evals

Brought to you by:

Fin—The #1 AI agent for customer service

Dscout—The UX platform to capture insights at every stage: from ideation to production

Mercury—The art of simplified finances

Where to find Shreya Shankar

• X: https://x.com/sh_reya

• LinkedIn: https://www.linkedin.com/in/shrshnk/

• Website: https://www.sh-reya.com/

• Maven course: https://bit.ly/4myp27m

Where to find Hamel Husain

• X: https://x.com/HamelHusain

• LinkedIn: https://www.linkedin.com/in/hamelhusain/

• Website: https://hamel.dev/

• Maven course: https://bit.ly/4myp27m

In this episode, we cover:

(00:00) Introduction to Hamel and Shreya

(04:57) What are evals?

(09:56) Demo: Examining real traces from a property management AI assistant

(16:51) Writing notes on errors

(23:54) Why LLMs can’t replace humans in the initial error analysis

(25:16) The concept of a “benevolent dictator” in the eval process

(28:07) Theoretical saturation: when to stop

(31:39) Using axial codes to help categorize and synthesize error notes

(44:39) The results

(46:06) Building an LLM-as-judge to evaluate specific failure modes

(48:31) The difference be
