June 2024, Revisited: Claude 3.5 Sonnet and the Month Model Choice Became an Engineering Decision
Eric Greene, June 11, 2026
This post is part of our Three-Year Retrospective series: thirty-six posts, one per month, looking back at what actually mattered in software engineering. This one covers June 2024.
On June 20, 2024, Anthropic released Claude 3.5 Sonnet. On paper it was a mid-tier model (Sonnet, not Opus), released just over three months after the Claude 3 family. In practice, it outperformed Anthropic's own flagship, Claude 3 Opus, and led competing models on most of the benchmarks engineers cared about, and it did so while running at roughly twice the speed of Claude 3 Opus. For a lot of teams we worked with, this was the month "which model are we using?" stopped being an idle question and became an engineering decision with a review cadence.
The coding jump was the story
The headline numbers were strong across the board — graduate-level reasoning, vision, the usual benchmark suite — but the result that circulated in engineering Slack channels was coding. Claude 3.5 Sonnet hit 92% on HumanEval, and on Anthropic's internal agentic coding evaluation it solved 64% of problems against Claude 3 Opus's 38%. That second number mattered more than the first: it wasn't measuring whether the model could write a function from a docstring, but whether it could understand an existing codebase, plan a fix, and carry out a multi-step edit.
Developers felt the difference immediately. The model was noticeably better at holding a large code context, at following instructions about style and constraints, and at producing diffs that compiled on the first try. Through the summer of 2024, "just try it on 3.5 Sonnet" became the standard reply to complaints about AI coding quality, and many developers quietly switched their default model for the first time.
Artifacts: the chat window grows a workbench
The same announcement included Artifacts — a side panel in Claude.ai where generated code, HTML pages, SVG diagrams, and React components rendered live next to the conversation. It sounds like a UI feature, and it was, but it changed the texture of working with a model. Code stopped being something you copied out of a chat transcript and pasted somewhere else to evaluate; it became something you watched run, critiqued, and iterated on in place.
For us as instructors, Artifacts was the first mainstream glimpse of a pattern that would define the next two years: the model's output as a live, editable workspace rather than a wall of text. Every agentic coding tool that followed owes something to the realization, in June 2024, that the feedback loop is the product.
Leapfrogging becomes the normal rhythm
The deeper lesson of June 2024 wasn't about one model. It was that the frontier had changed hands — GPT-4 and GPT-4o had defined "best available" for over a year, and now they didn't — and that this would keep happening. Teams that had hardcoded a single provider into their tooling, their prompts, and their evaluation habits suddenly had a concrete reason to regret it.
The practical advice we started giving that summer still holds: treat the model as a dependency, not an identity. Keep your prompts portable, keep a small evaluation set of tasks from your own codebase, and rerun it when a frontier release lands. The teams that did this in 2024 absorbed every subsequent leapfrog — and there were many — as a configuration change. The teams that didn't spent weeks re-discovering their own requirements each time.
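To make "the model as a dependency" concrete, here is a minimal sketch of the kind of harness we mean. It is an illustration under stated assumptions, not a reference implementation: every name in it (ModelFn, EvalCase, run_eval, pick_model, the stub models) is hypothetical, and the stubs stand in for whatever provider SDK calls you would wrap behind the same prompt-in, text-out signature.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a "model" is any function from prompt text to completion text.
# Real provider calls would be wrapped behind this signature.
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the model output is acceptable


def run_eval(model: ModelFn, cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case against one model and record pass/fail per case."""
    results: Dict[str, bool] = {}
    for case in cases:
        try:
            results[case.name] = bool(case.check(model(case.prompt)))
        except Exception:
            # A crash in the model call or the check counts as a failure.
            results[case.name] = False
    return results


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty result set."""
    return sum(results.values()) / len(results) if results else 0.0


def pick_model(registry: Dict[str, ModelFn], cases: List[EvalCase]) -> str:
    """Return the registry key with the highest pass rate (ties: first registered)."""
    return max(registry, key=lambda name: pass_rate(run_eval(registry[name], cases)))


# Hypothetical stand-ins for an incumbent and a newly released model.
def stub_incumbent(prompt: str) -> str:
    return "TODO"


def stub_candidate(prompt: str) -> str:
    return "def add(a, b):\n    return a + b"


# In practice these would be tasks drawn from your own codebase.
CASES = [
    EvalCase("defines-add", "Write add(a, b) in Python.", lambda out: "def add" in out),
    EvalCase("no-placeholder", "Write add(a, b) in Python.", lambda out: "TODO" not in out),
]

REGISTRY = {"incumbent": stub_incumbent, "candidate": stub_candidate}

if __name__ == "__main__":
    for name, model in REGISTRY.items():
        print(name, pass_rate(run_eval(model, CASES)))
    print("default model:", pick_model(REGISTRY, CASES))
```

The design choice worth noticing is that nothing outside the registry knows which provider is in use. When a new release lands, you register one more entry, rerun the cases, and the "migration" is whichever key pick_model returns.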
Looking back from June 2026
The leapfrog cadence never slowed down: the frontier changed hands repeatedly across 2024 and 2025, exactly as that June suggested it would, and "evaluate models against your own workload" went from contrarian advice to table stakes. Claude 3.5 Sonnet itself had a remarkably long run as a developer favorite — it was still many teams' coding default well into 2025 — and Artifacts' render-it-live pattern is now so universal that it's hard to remember chat interfaces without it. June 2024 is, in hindsight, the month the AI coding era stopped being a single-vendor story.
If your team is still treating model choice as a set-and-forget decision, that's exactly the habit we work on in Working with Frontier Coding Models, where we build the evaluation harnesses and portability practices that turn each new frontier release into an opportunity instead of a migration.