August 2025, Revisited: GPT-5 and the Router That Decided How Hard to Think
Eric Greene, June 11, 2026
This post is part of our Three-Year Retrospective series: thirty-six posts, one per month, looking back at what actually mattered in software engineering. This one covers August 2025.
On August 7, 2025, OpenAI released GPT-5. After more than two years of speculation — GPT-4 had landed in March 2023, and everything since had carried deliberately un-five-like names — the most anticipated version number in software finally shipped. What arrived was less a single model than a system: a fast, efficient model for most queries, a deeper "thinking" model for hard problems, and a real-time router that decided, per request, which one you needed. The model-picker dropdown that power users had been agonizing over was supposed to disappear.
Unification was the actual headline
It's easy to forget how fragmented the OpenAI lineup was by mid-2025. GPT-4o for general chat, the o-series (o1, o3, o4-mini) for reasoning, each with different strengths, latencies, and quirks — choosing correctly was genuine expertise, and most users chose wrong. GPT-5's pitch was that the system should make that decision for you, based on the conversation's complexity, tool needs, and your explicit intent ("think hard about this" actually routed you to the reasoning model).
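To make the routing idea concrete, here is a toy sketch of a per-request router. It is not OpenAI's router, whose internals were never published in this form; the `Request` fields, the trigger phrases, and the thresholds are all hypothetical, chosen only to mirror the three signals named above (complexity, tool needs, explicit intent).

```python
from dataclasses import dataclass

# Hypothetical phrases treated as an explicit request for deeper reasoning.
INTENT_PHRASES = ("think hard", "think carefully", "take your time")


@dataclass
class Request:
    """A simplified request; every field here is an illustrative assumption."""
    prompt: str
    needs_tools: bool = False
    turns_so_far: int = 0


def route(request: Request) -> str:
    """Return "thinking" or "fast" using a crude, made-up heuristic."""
    text = request.prompt.lower()

    # Explicit intent wins outright.
    if any(phrase in text for phrase in INTENT_PHRASES):
        return "thinking"

    # Crude complexity proxy: tool use, long prompts, long conversations.
    score = 0
    if request.needs_tools:
        score += 1
    if len(text.split()) > 200:
        score += 1
    if request.turns_so_far > 10:
        score += 1

    # Two or more signals send the request to the deeper model.
    return "thinking" if score >= 2 else "fast"
```

A real router would presumably learn these decisions from usage data rather than hard-code them; the point of the sketch is only that the choice moves from the user's dropdown into the system.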
For developers, the API told a cleaner story: gpt-5, gpt-5-mini, and gpt-5-nano, with a reasoning-effort dial instead of a model zoo. That dial turned out to be the durable idea. Within a year, nearly every frontier lab exposed some version of "how hard should the model think?" as a first-class parameter, and prompt-and-parameter tuning for reasoning effort became a normal part of building with LLMs.
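A minimal sketch of what treating effort as a first-class parameter can look like in application code. The model names come from the paragraph above; the effort level names and the shape of the returned dictionary are assumptions for illustration, so check your provider's API reference before sending anything like this over the wire.

```python
from typing import Any

MODELS = ("gpt-5", "gpt-5-mini", "gpt-5-nano")         # names from the post
EFFORT_LEVELS = ("minimal", "low", "medium", "high")   # assumed level names


def build_request(prompt: str, model: str = "gpt-5",
                  effort: str = "medium") -> dict[str, Any]:
    """Build a request payload with an explicit reasoning-effort setting.

    The payload layout is a hypothetical, provider-style shape. No network
    call is made here; the function only validates and assembles fields.
    """
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort!r}")
    return {
        "model": model,
        "input": prompt,
        "reasoning": {"effort": effort},
    }
```

Validating the effort value in one place means a typo fails loudly in your own code instead of surfacing as a rejected request or a silently applied default.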
What it meant at the keyboard
The coding improvements were real — GPT-5 posted strong results on SWE-bench Verified and was noticeably better at long, multi-step agentic tasks, front-end work, and debugging across large repositories. But our honest read at the time, teaching through that autumn, was that the workflow impact mattered more than the benchmark delta. Two things changed in practice.
First, routing made cost and latency conversations concrete. Teams that had defaulted everything to the most capable model started asking which calls in their pipeline actually needed deep reasoning, because the platform itself was now modeling that question. We saw LLM application bills restructured around effort tiers within months.
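The arithmetic behind that restructuring is simple enough to sketch. The pipeline steps, call volumes, and per-call costs below are invented for illustration (the costs are in arbitrary relative units, not real prices); the comparison is between running every call at the highest effort and assigning each step the tier it needs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical relative cost per call at each effort tier (arbitrary units).
COST_PER_CALL = {"low": 1.0, "medium": 3.0, "high": 10.0}


@dataclass(frozen=True)
class Step:
    """One stage of an LLM pipeline, with its assigned effort tier."""
    name: str
    calls_per_day: int
    effort: str


def daily_cost(steps: Sequence[Step],
               override_effort: Optional[str] = None) -> float:
    """Sum the daily cost, optionally forcing every step to one tier."""
    total = 0.0
    for step in steps:
        effort = override_effort or step.effort
        total += step.calls_per_day * COST_PER_CALL[effort]
    return total


# Made-up pipeline: high-volume cheap steps, low-volume expensive ones.
PIPELINE = [
    Step("classify_ticket", 10_000, "low"),
    Step("draft_reply", 2_000, "medium"),
    Step("root_cause_analysis", 100, "high"),
]

tiered = daily_cost(PIPELINE)                               # per-step tiers
everything_high = daily_cost(PIPELINE, override_effort="high")
```

With these invented numbers the tiered pipeline costs 17,000 units a day against 121,000 for everything at high effort. The real ratio depends entirely on your traffic mix and your provider's pricing, which is why the exercise is worth doing per pipeline.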
Second, the launch was a lesson in deployment humility. The router misbehaved in the first days (OpenAI's own leadership admitted that an autoswitcher malfunction made the model seem "way dumber" at launch), and user backlash over the abrupt retirement of GPT-4o forced OpenAI to restore legacy model access within a week. The most sophisticated AI company in the world had underestimated how attached users were to a specific model's behavior. Every team building on LLMs took the note: model changes are breaking changes, even when the new model is better.
Another leapfrog, the same advice
GPT-5 retook the frontier for a while, and then — as had become the rhythm — competitors answered within months. By late 2025 the leaderboard had churned again. The teams that handled this well were the ones who had already learned the lesson we kept repeating in this series: keep an evaluation set drawn from your own codebase, treat the model as a swappable dependency, and rerun your evals when the frontier moves. August 2025 was the loudest frontier jump of the year, and it changed nothing about that advice except to confirm it.
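That advice fits in a few dozen lines. The sketch below treats a model as any callable from prompt to text, runs a small evaluation set against it, and lists the cases a candidate model fails that the baseline passed. The two "models" are hard-coded stand-ins and the checks are deliberately trivial; in practice the callables would wrap real API clients and the cases would come from your own codebase.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# A model is anything that maps a prompt to text, so it is easy to swap.
Model = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    """One evaluation case: a prompt plus a pass/fail check on the output."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def run_evals(model: Model, cases: Sequence[EvalCase]) -> dict[str, bool]:
    """Run every case against the model and record pass/fail by case name."""
    return {case.name: case.check(model(case.prompt)) for case in cases}


def regressions(baseline: dict[str, bool],
                candidate: dict[str, bool]) -> list[str]:
    """Names of cases the baseline passed but the candidate did not."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )


# Stand-in models for illustration; real ones would call an API.
def old_model(prompt: str) -> str:
    return "def add(a, b):\n    return a + b"


def new_model(prompt: str) -> str:
    return "add = lambda a, b: a + b"


CASES = [
    EvalCase("defines_add", "Write add(a, b).", lambda out: "add" in out),
    EvalCase("uses_def", "Write add(a, b).", lambda out: out.startswith("def ")),
]

baseline_results = run_evals(old_model, CASES)
candidate_results = run_evals(new_model, CASES)
broken = regressions(baseline_results, candidate_results)
```

Reporting regressions rather than only an aggregate pass rate matters here: a new model can score higher overall and still break the one behavior your application depends on.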
Looking back from June 2026
The version number turned out to matter less than the architecture. The router idea — one entry point, variable depth of thought behind it — is now how essentially every frontier system works, and the reasoning-effort parameter GPT-5 mainstreamed is a standard knob in production LLM applications. The 4o backlash, meanwhile, gave the industry its canonical case study in model deprecation gone wrong; deprecation timelines and behavioral-compatibility testing got noticeably more careful afterward. GPT-5 was a good model. The system around it was the lasting contribution.
If your team is still deciding how to put frontier models to work — and how to keep your footing when the frontier moves again — Working with Frontier Coding Models builds the evaluation and portability habits that make leapfrogs routine, and LLM Application Development with Python covers routing, reasoning-effort tuning, and cost engineering in production applications.