[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang
December 31, 2025
AI Summary
🎙️ The Voices & The Context
- The Format: A casual host-and-guest interview, recorded live at NeurIPS 2025, on the fast-moving world of AI coding benchmarks. It mixes insider updates with forward-looking debate about evaluation methods, in a technical and enthusiastic tone that reflects benchmark creators pushing AI capabilities in real-world coding scenarios.
- The Key Players:
- Guest: John Yang – Creator of SWE-bench, a pivotal coding benchmark; Stanford PhD student working with Diyi Yang, now advancing CodeClash for long-horizon AI development evaluation; known for the SWE-bench Multimodal and SWE-bench Multilingual extensions and for arena-style evaluations such as programming tournaments.
What you'll learn
- 1 `(00:00)` **🎙️ Introduction: John Yang**
- 2 `(00:20)` **SWE-bench Updates and Ecosystem**
- 3 `(03:08)` **Code Clash: Multi-Agent Programming Tournaments**
- 4 `(05:57)` **Ophir Gordon Group Highlights**
- 5 `(07:21)` **Notable 2025 Coding Benchmarks**
- 6 `(08:21)` **User Simulators and Agent Environments**
- 7 `(09:12)` **Tau-bench Controversy and Impossible Tasks**
Show Notes
John Yang created SWE-bench in a Princeton basement and has since shipped CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual. Over the last year and a half he has watched his benchmark become the de facto standard for evaluating AI coding agents, trusted by Cognition (Devin), OpenAI, Anthropic, and every major lab racing to solve software engineering at scale. We caught up with John live at NeurIPS 2025 to dig into the state of code evals heading into 2026.

On SWE-bench itself, we cover why it went from ignored at its October 2023 release to the industry standard after Devin's launch (Walden emailed John two weeks before the big reveal), how the benchmark evolved from Django-heavy to nine languages across 40 repos (including JavaScript, Rust, Java, C, and Ruby), and why unit tests are a limiting form of verification. Long-running agent tournaments might be the future: in CodeClash, agents maintain codebases, compete in arenas, and iterate over multiple rounds.

We also get into the proliferation of SWE-bench variants (SWE-bench Pro, SWE-bench Live, SWE-Efficiency, AlgoTune, SciCode) and how benchmark authors now justify their splits with curation techniques instead of just "more repos"; why Tau-bench's "impossible tasks" controversy is a feature, not a bug (intentionally including impossible tasks flags cheating); the tension between long autonomy (5-hour runs) and interactivity (Cognition's emphasis on fast back-and-forth); and how Terminal-bench unlocked creativity by letting PhD students and non-coders design environments beyond GitHub issues and PRs.

Finally, we discuss the academic data problem: companies like Cognition and Cursor have rich user interaction data, while academics need user simulators or compelling products like LMArena to get similar signal. John also shares his vision for CodeClash as a testbed for human-AI collaboration: freeze model capability, vary the collaboration setup (solo agent, multi-agent, human+agent), and measure how interaction patterns change as models climb the ladder from code completion to full codebase reasoning.
We discuss:
- John's path: Princeton → SWE-bench (October 2023) → Stanford PhD with Diyi Yang and the Iris Group, focusing on code evals, human-AI collaboration, and long-running agent benchmarks
- The SWE-bench origin story: released October 2023, mostly ignored until Cognition's Devin launch kicked off the arms race (Walden emailed John two weeks before: "we have a good number")
- SWE-bench Verified: the curated, high-quality split that became the stand
From Latent Space: The AI Engineer Podcast