⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

February 23, 2026

AI Summary

🎙️ The Voices & The Context

The Format: Casual chat and interview in the OpenAI studio, blending technical deep dives with collaborative discussion of AI benchmarks.
The Key Players:
- Mia Glaese: VP of Research at OpenAI, overseeing the Codex (coding models), human data, and alignment teams; key to creating SWE-Bench Verified and evolving evals.
- Olivia Watkins: Member of the Frontier Evals team; collaborated on SWE-Bench Verified and new benchmarks like GDP-Bench; provides frontline insights on model performance.
- Host: OpenAI insider who facilitates the conversation, probing with inside jokes and future-focused questions; great chemistry, with playful banter amid expertise.
The Vibe: Educational and optimistic, with a dash of self-deprecating humor about AI progress stalling; feels like an insider peek into cutting-edge AI evolution.

🗝️ Key Themes & Topics

The discussion revolves around the lifecycle of AI coding benchmarks: from breakthroughs to breakdowns, pushing for harder, fairer tests amid rapid model advances.


What you'll learn

1 (00:00) **🎙️ Introduction: Olivia and Mia from OpenAI**
2 (00:56) **SWE-Bench Verified: Original Creation and Massive Effort**
3 (04:18) **Issues with SWE-Bench Verified: Contamination and Unfair Tests**
4 (08:16) **Benchmark Saturation and Evolution**
5 (10:28) **SWE-Bench Pro: The Next Coding Benchmark**
6 (12:31) **Ideal Capabilities for Future Coding Benchmarks**
7 (15:27) **Evaluating Qualitative Aspects: Human vs. LLM Judging**

Show Notes

Olivia Watkins (Frontier Evals team) and Mia Glaese (VP of Research at OpenAI, leading the Codex, human data, and alignment teams) discuss a new blog post (https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/) arguing that SWE-Bench Verified, long treated as a key “North Star” coding benchmark, has become saturated and highly contaminated, making it less useful for measuring real coding progress.

SWE-Bench Verified originated as a major OpenAI-led cleanup of the original Princeton SWE-Bench benchmark, including a large human review effort with nearly 100 software engineers and multiple independent reviews to curate ~500 higher-quality tasks. But recent findings show that many remaining failures can reflect unfair or overly narrow tests (e.g., requiring specific naming or unspecified implementation details) rather than true model inability. The findings also include examples suggesting contamination, such as models recalling repository-specific implementation details or task identifiers.

From now on, OpenAI plans to stop reporting SWE-Bench Verified and instead focus on SWE-Bench Pro (from Scale), which is harder, more diverse (more repos and languages), includes longer tasks (1–4 hours and 4+ hours), and shows substantially less evidence of contamination under their “contamination auditor agent” analysis.

We also discuss what future coding/agent benchmarks should measure beyond pass/fail tests (longer-horizon tasks, open-ended design decisions, code quality/maintainability, and real-world product-building), along with the tradeoffs between fast automated grading and human-intensive evaluation.

00:00 Meet the Frontier Evals Team
00:56 Why SWE Bench Stalled
01:47 How Verified Was Built
04:32 Contamination In The Wild
06:16 Unfair Tests And Narrow Specs
08:40 When Benchmarks Saturate
10:28 Switching To SWE Bench Pro
12:31 What Great Coding Evals Measure
18:17 Beyond Tests Dollars And Autonomy
21:49 Preparedness And Future Directions

Get full access to Latent.Space at www.latent.space/subscribe