Artificial Analysis: The Independent LLM Analysis House, with George Cameron and Micah Hill-Smith

January 8, 2026

AI Summary

🎙️ The Voices & The Context

The Format: Casual interview-style podcast chat with live demo of charts and benchmarks.
The Key Players:
- Host: Swyx, AI podcaster from Latent Space, who spotlighted the project early.
- Guests: George Cameron and Micah Hill-Smith, founders of Artificial Analysis, the leading independent AI model benchmarking site. Micah is from Sydney (now in SF); George is an Australian based in SF. They're the "presumptive new Gartner of AI" for unbiased metrics.
The Vibe: Educational yet fun and insider-y—excited geek-out on AI evals, trends, with banter on "shenanigans," rumors, and full-circle origin nods.

🗝️ Key Themes & Topics

Deep dive into Artificial Analysis (AA), the go-to site for independent AI model benchmarks across intelligence, speed, cost, and now hallucination/openness.

Topic 1: Origin & Business Model. Started as a 2023 side project for developers building with LLMs (Micah's legal AI tool forced him to weigh model trade-offs). Blew up via tweets (shoutout to Swyx's early mention). Now 20+ people, sustained by enterprise reports, workshops, and private benchmarks; no pay-to-play on the public site.
Topic 2: Benchmarking Challenges & Rigor. Why they run evals themselves: labs cherry-pick prompts, and prompt formats bias results (e.g., Google's 32 CoT examples). They control variance and repeat runs to get 95% confidence intervals; a "mystery shopper" policy tests private

What you'll learn

1 (00:00) **🎙️ Introduction: George and Micah (Artificial Analysis)**
2 (01:20) **Business Model**
3 (04:16) **Origin Story**
4 (07:15) **Benchmarking Challenges**
5 (13:04) **Independence Measures**
6 (16:14) **AI Grant Experience**
7 (19:10) **Intelligence Index Evolution**

Show Notes

Don’t miss George’s AIE talk: https://www.youtube.com/watch?v=sRpqPgKeXNk

---

George Cameron and Micah Hill-Smith have spent two years taking Artificial Analysis from a side project in a Sydney basement to the independent gold standard for AI benchmarking, trusted by developers, enterprises, and every major lab to navigate the exploding landscape of models, providers, and capabilities. The platform answers the questions no one else will: Which model is actually best for your use case? What are the real speed-cost trade-offs? And how open is "open" really?

We discuss:

The origin story: built as a side project in 2023 while Micah was building a legal AI assistant, launched publicly in January 2024, and went viral after Swyx's retweet
Why they run evals themselves: labs prompt models differently, cherry-pick chain-of-thought examples (Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU), and self-report inflated numbers
The mystery shopper policy: they register accounts under a domain other than their own and run intelligence + performance benchmarks incognito, so labs cannot serve a different model on private endpoints than the one the public gets
How they make money: enterprise benchmarking insights subscription (standardized reports on model deployment, serverless vs. managed vs. leasing chips) and private custom benchmarking for AI companies (no one pays to be on the public leaderboard)
The Intelligence Index (V3): synthesizes 10 eval datasets (MMLU, GPQA, agentic benchmarks, long-context reasoning) into a single score, with 95% confidence intervals via repeated runs
Omniscience Index (hallucination rate): scores models from -100 to +100, penalizing incorrect answers and rewarding "I don't know"; Claude models lead with the lowest hallucination rates despite not always being the smartest
GDPval-AA: their version of OpenAI's GDPval (44 white-collar tasks with spreadsheets, PDFs, PowerPoints), run through their Stirrup agent harness (up to 100 turns, code execution, web search, file system) and graded by Gemini 3 Pro as an LLM judge (they tested the judge extensively and found no self-preference bias)
The Openness Index: scores models 0-18 on transparency of pre-training data, po