September 2024, Revisited: OpenAI o1 and the Arrival of Reasoning Models

Eric Greene, June 11, 2026

This post is part of our Three-Year Retrospective series: thirty-six posts, one per month, looking back at what actually mattered in software engineering. This one covers September 2024.

On September 12, 2024, OpenAI released o1-preview and o1-mini, and for the first time mainstream developers met a model that thought before it answered. Not metaphorically — o1 was trained with reinforcement learning to produce a long internal chain of reasoning before emitting its response, and you watched the seconds tick by while it did. After two years of models getting faster and chattier, here was one that got slower and, on a specific class of problems, dramatically smarter.

Test-time compute: a new axis to scale on

The conceptual shift mattered more than the model. Until o1, the industry's scaling story was about training: bigger models, more data, better post-training. o1 demonstrated a second axis: spend more compute at inference time, on a per-question basis, and accuracy on hard reasoning problems climbs. The benchmark results made the case loudly. OpenAI reported performance on competition mathematics (AIME) and competitive programming (Codeforces) that earlier models couldn't approach, with o1 ranking among strong human competitors (the 89th percentile on Codeforces, by OpenAI's own numbers).

There were unfamiliar mechanics to absorb. The reasoning happened in hidden tokens you were billed for but never saw: at best you got a summary, not the chain itself. The API arrived with real constraints: no system prompts at first, no streaming, long and variable latencies. And prompting habits had to change. The "think step by step" incantations we'd all internalized were now counterproductive, because the model already did that internally. Telling o1 how to think generally made it worse; telling it precisely what you wanted made it better.
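
To make the billing point concrete, here is a minimal sketch of the arithmetic, assuming hidden reasoning tokens are charged at the same rate as visible output tokens. The Usage fields, the request_cost helper, and every price and token count below are placeholders chosen for easy arithmetic, not OpenAI's actual API fields or price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Token counts for one request (illustrative field names)."""
    prompt_tokens: int
    visible_output_tokens: int
    reasoning_tokens: int  # hidden: billed, but never shown to the caller


def request_cost(usage: Usage, input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the request cost in dollars.

    Assumption: reasoning tokens are billed at the output-token rate.
    Prices are expressed in dollars per million tokens.
    """
    billed_output = usage.visible_output_tokens + usage.reasoning_tokens
    total = usage.prompt_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000


# Placeholder numbers, not real prices or measured token counts.
usage = Usage(prompt_tokens=2_000, visible_output_tokens=500, reasoning_tokens=4_500)
cost = request_cost(usage, input_price_per_m=10.0, output_price_per_m=40.0)
hidden_share = usage.reasoning_tokens / (
    usage.visible_output_tokens + usage.reasoning_tokens
)
print(f"cost=${cost:.2f}, hidden share of billed output={hidden_share:.0%}")
# prints: cost=$0.22, hidden share of billed output=90%
```

With these made-up numbers, nine tenths of the billed output is text the caller never sees, which is why per-request cost became hard to predict from the visible answer alone.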

What slower-but-smarter was actually good for

The practical question that fall was where a slow, expensive, deliberate model earned its keep when fast models handled everyday completion and chat perfectly well. A consensus emerged quickly, and it held up:

Hard bugs. Race conditions, heisenbugs, the failure that only reproduces under load. Pasting a full investigation — logs, suspect code, what you'd ruled out — into o1 and letting it reason for thirty seconds frequently surfaced hypotheses that faster models, and tired humans, missed.
Code review of tricky changes. Concurrency, cryptography, query planning, anything where the bug is in the interaction between lines rather than in any single line. o1's willingness to actually trace through state made it a different kind of reviewer.
Design analysis. Asking for the failure modes of a proposed architecture, or the edge cases in a migration plan, played to the model's strength: long-horizon reasoning over a problem you could state completely in the prompt.

What it was not for was just as important. Autocomplete, boilerplate, quick refactors — anywhere latency mattered — stayed with conventional models. September 2024 is when "model routing" entered working vocabulary: fast model by default, reasoning model when the problem deserves it.
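
The routing rule above can be sketched in a few lines. This is an illustration under stated assumptions rather than a description of any particular product: the model names, the task labels, and the 30-second threshold are all hypothetical, and a real router would likely classify tasks with something richer than a string label.

```python
from dataclasses import dataclass

FAST_MODEL = "fast-model"            # hypothetical model name
REASONING_MODEL = "reasoning-model"  # hypothetical model name

# Task kinds the post lists as worth the wait; the labels are illustrative.
DELIBERATE_TASKS = {"hard_bug", "tricky_review", "design_analysis"}


@dataclass(frozen=True)
class Request:
    task: str                # e.g. "autocomplete" or "hard_bug"
    latency_budget_s: float  # how long the caller is willing to wait


def choose_model(req: Request, min_reasoning_latency_s: float = 30.0) -> str:
    """Fast model by default; reasoning model only when the task
    deserves it and the caller can afford the wait."""
    deserves_it = req.task in DELIBERATE_TASKS
    can_wait = req.latency_budget_s >= min_reasoning_latency_s
    return REASONING_MODEL if deserves_it and can_wait else FAST_MODEL


print(choose_model(Request("autocomplete", 0.3)))  # fast-model
print(choose_model(Request("hard_bug", 120.0)))    # reasoning-model
print(choose_model(Request("hard_bug", 2.0)))      # fast-model
```

The design choice worth noting is the default: the expensive path has to be earned on two counts, so an unrecognized task or a tight latency budget falls back to the fast model.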

The evaluation problem gets harder

o1 also sharpened a problem we'd been teaching around all year: how do you know any of this is working? Reasoning models made benchmark numbers even less transferable to your own workload — a model that wins competition math may or may not be better at your legacy codebase's bugs. The teams that handled the o1 transition well were the ones with a small, curated set of their own hard problems to test against. Everyone else was choosing models by vibes and screenshots.
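
A private evaluation set does not need much machinery. The sketch below shows one possible shape, assuming each case pairs a prompt with a pass/fail check. Case, pass_rate, the two toy cases, and both stub models are hypothetical, and the stubs return canned strings so the example runs offline. Their scores say nothing about any real model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the answer is acceptable


def pass_rate(model: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases whose answer passes its check."""
    if not cases:
        raise ValueError("evaluation set is empty")
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


# Two toy cases standing in for a team's own hard problems.
CASES = [
    Case("race", "Why does the counter drift under load?",
         lambda answer: "lock" in answer.lower()),
    Case("migration", "What breaks if we drop the column first?",
         lambda answer: "rollback" in answer.lower()),
]


# Stub models with canned answers; swap in real API calls to use this.
def stub_fast(prompt: str) -> str:
    return "Add a lock around the increment."


def stub_reasoning(prompt: str) -> str:
    if "counter" in prompt:
        return "The increment is not atomic; take a lock."
    return "Old readers fail and rollback becomes impossible."


for name, model in [("fast", stub_fast), ("reasoning", stub_reasoning)]:
    print(f"{name}: {pass_rate(model, CASES):.0%}")
```

Keyword checks like these are crude. In practice the check function is where most of the effort goes, whether that means running a test suite against generated code or having a reviewer grade the answer.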

Looking back from June 2026

Test-time compute went from novelty to default. Every major lab shipped reasoning models within months — and the open-weight world followed in January 2025, which is a story for that month's post. By 2026 the sharp line between "chat models" and "reasoning models" has mostly dissolved into a dial: how much thinking do you want to pay for on this request? The routing instinct that o1 forced on us — match the compute to the problem — turned out to be the durable skill.

We cover when and how to reach for reasoning models, and how to prompt them differently, in Working with Frontier Coding Models. If your team is still choosing models by leaderboard screenshot, Evaluating AI Coding Assistants and LLM Apps is where we build the private evaluation sets that make these decisions defensible.