[State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency — Josh McGrath, OpenAI

December 31, 2025

🎙️ The Voices & The Context

The Format: An interview-style podcast, technical and insider-driven, with light host-guest chemistry. It covers the realities of AI model development at OpenAI, mixing deep dives on post-training techniques with casual banter about future trends.
The Key Players:
- Guest: Josh McGrath, a post-training researcher at OpenAI specializing in thinking models, search tools, and RL systems. He worked on GPT-4.1 and the new shopping model, and moved from pre-training data curation to post-training in pursuit of higher-impact behavioral improvements.

🗝️ Key Themes & Topics

The episode unpacks OpenAI's post-training evolution, from RL challenges to new model releases, emphasizing data quality over optimization tweaks, token efficiency, and the blurring lines between research and systems engineering.

Topic 1: Post-Training vs. Pre-Training Shift – Josh explains his move from pre-training, where wins are compute-efficiency gains of around 3%, to post-training, where behavior can change by around 40%. He highlights that RL is more complex, with more moving parts such as task-specific grading infrastructure, describes late-night debugging across internal and external code, and notes the rise of tools like Codex for rapid code understanding.
Topic 2: New Models and Interaction Paradigms – Discussion on the shopping model

What you'll learn

1 (00:00) **🎙️ Introduction: Josh from OpenAI**
2 (01:00) **Josh's Path to Post-Training**
3 (01:52) **RL Engineering Challenges**
4 (04:37) **Shopping Model Release Insights**
5 (06:24) **Deep Research vs. o1 Thinking (High Reasoning)**
6 (07:11) **Personality Toggles and User Preferences**
7 (08:25) **Post-Training Progress in 2024-2025**

Show Notes

From pre-training data curation to shipping GPT-4o, o1, o3, and now GPT-5 thinking and the shopping model, Josh McGrath has lived through the full arc of OpenAI's post-training evolution: from the PPO vs DPO debates of 2023 to today's RLVR era, where the real innovation isn't optimization methods but data quality, signal trust, and token efficiency. We sat down with Josh at NeurIPS 2025 to dig into the state of post-training heading into 2026. Topics include why RLHF and RLVR are both just policy gradient methods (the difference is the input data, not the math); how GRPO from DeepSeek Math was underappreciated as a shift toward more trustworthy reward signals (math answers you can verify vs. human preference you can't); why token efficiency matters more than wall-clock time (GPT-5 to 5.1 bumped evals and slashed tokens); and how Codex has changed his workflow so much that he feels "trapped" by 40-minute design sessions followed by 15-minute agent sprints. We also cover the infrastructure chaos of scaling RL ("way more moving parts than pre-training"); why long context will keep climbing but agents + graph walks might matter more than 10M-token windows; the shopping model as a test bed for interruptibility and chain-of-thought transparency; why personality toggles (Anton vs Clippy) are a real differentiator users care about; and his thesis that the education system isn't producing enough people who can do both distributed systems and ML research, the exact skill set required to push the frontier when the bottleneck moves every few weeks.
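
The point that RLHF and RLVR share the same policy-gradient math and differ only in where the reward comes from can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the four-answer bandit, the names `policy_gradient_step` and `make_preference_reward`, and the Gaussian noise standing in for an imperfect preference model. It is not OpenAI's training setup or DeepSeek's GRPO implementation; the group-normalized advantage only borrows the GRPO idea of scoring each sample relative to the other samples in its group.

```python
import math
import random

# Toy setup (hypothetical): the "policy" picks one of four answers to "7 + 7".
CANDIDATES = ["12", "14", "16", "18"]
CORRECT = "14"


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def group_relative_advantages(rewards):
    """GRPO-style baseline: normalize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]


def policy_gradient_step(logits, reward_fn, rng, group_size=8, lr=0.5):
    """One REINFORCE-style update. The reward source is the only plug-in."""
    probs = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=probs, k=group_size)
    rewards = [reward_fn(a) for a in actions]
    advantages = group_relative_advantages(rewards)
    grad = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        for i in range(len(logits)):
            indicator = 1.0 if i == a else 0.0
            # d/d logit_i of log pi(a) for a softmax policy is indicator - probs[i].
            grad[i] += adv * (indicator - probs[i]) / group_size
    return [l + lr * g for l, g in zip(logits, grad)]


def verifiable_reward(action):
    """RLVR-style signal: an exact check against a known answer."""
    return 1.0 if CANDIDATES[action] == CORRECT else 0.0


def make_preference_reward(rng, noise=0.3):
    """RLHF-style stand-in: right on average, but noisy (an assumption)."""
    def reward(action):
        return verifiable_reward(action) + rng.gauss(0.0, noise)
    return reward


def train(reward_fn, seed=0, steps=200):
    """Run the same update loop regardless of which reward is supplied."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)
    for _ in range(steps):
        logits = policy_gradient_step(logits, reward_fn, rng)
    return softmax(logits)


if __name__ == "__main__":
    print("verifiable:", train(verifiable_reward))
    print("preference:", train(make_preference_reward(random.Random(1))))
```

Both calls to `train` run the identical update; only `reward_fn` changes. In this toy, the difference between the two regimes lives entirely in how much the reward signal can be trusted.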

We discuss:

Josh's path: pre-training data curation → post-training researcher at OpenAI, shipping GPT-4o, o1, o3, GPT-5 thinking, and the shopping model
Why he switched from pre-training to post-training: "Do I want to make 3% compute efficiency wins, or change behavior by 40%?"
The RL infrastructure challenge: way more moving parts than pre-training (tasks, grading setups, external partners), and why babysitting runs at 12:30am means jumping into unfamiliar code constantly
How Codex has changed his workflow: 40-minute design sessions compressed into 15-minute agent sprints, and the strange "trapped" feeling of waiting for the agent to finish
The RLHF vs RLVR debate: both are policy gradient methods, the real difference is data quality and signal trust (human preference vs.