March 2025, Revisited: Kafka 4.0 Says Goodbye to ZooKeeper

Eric Greene, June 11, 2026

This post is part of our Three-Year Retrospective series: thirty-six posts, one per month, looking back at what actually mattered in software engineering. This one covers March 2025.

On March 18, 2025, Apache Kafka 4.0 shipped, and with it ended one of the longest goodbyes in open-source infrastructure: ZooKeeper was gone. Not deprecated, not optional — removed. Every Kafka 4.0 cluster runs in KRaft mode, with metadata managed by Kafka's own Raft-based quorum. For anyone who had spent a decade operating Kafka, this was the release where the answer to "and how's your ZooKeeper ensemble?" finally became "what ZooKeeper ensemble?"

Why killing ZooKeeper took six years

KIP-500 — "Replace ZooKeeper with a Self-Managed Metadata Quorum" — dated back to 2019, and the slow burn was justified. ZooKeeper wasn't just a dependency; it was where Kafka kept its brain: controller election, topic metadata, ACLs, configuration. It was also Kafka's most notorious operational burden — a second distributed system with its own quorum mechanics, its own failure modes, its own tuning folklore, which most teams understood far less well than Kafka itself.

KRaft moved metadata into Kafka itself, stored as an internal event log replicated by Raft among controller nodes. The benefits were concrete rather than cosmetic: one system to deploy, secure, monitor, and upgrade; dramatically faster controller failover, because a standby controller has already replicated the metadata log and does not need to reload state from ZooKeeper; and metadata capacity that scales as a log rather than as a tree of znodes, lifting practical ceilings on partition counts per cluster. KRaft had been production-ready for new clusters since 3.3, and ZooKeeper mode had been deprecated since 3.5; 4.0 simply removed the old path entirely.
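To make the failover argument concrete, here is a toy sketch in Python of metadata kept as an ordered log and replayed into an in-memory image. It is not Kafka's implementation: the record shapes, the MetadataImage class, and the catch_up helper are all invented for illustration. The only point is that a standby that has been following the log needs just the tail on failover, while a cold start replays everything.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataImage:
    """Hypothetical in-memory view of cluster metadata, built from log records."""
    topics: dict = field(default_factory=dict)  # topic name -> partition count
    applied_offset: int = -1                    # offset of the last record applied

    def apply(self, offset, record):
        kind, name, value = record
        if kind == "create_topic":
            self.topics[name] = value
        elif kind == "add_partitions":
            self.topics[name] += value
        elif kind == "delete_topic":
            self.topics.pop(name, None)
        self.applied_offset = offset


def catch_up(image, log):
    """Apply only the records the image has not seen yet; return how many."""
    start = image.applied_offset + 1
    for offset in range(start, len(log)):
        image.apply(offset, log[offset])
    return len(log) - start


# Invented record format: (kind, topic name, value).
log = [
    ("create_topic", "orders", 6),
    ("create_topic", "payments", 3),
    ("add_partitions", "orders", 6),
    ("delete_topic", "payments", None),
]

# A standby controller that has already followed the first three records.
standby = MetadataImage()
catch_up(standby, log[:3])

# On failover it only applies the tail of the log.
records_needed = catch_up(standby, log)

# A cold start has to replay the whole log to reach the same state.
cold = MetadataImage()
records_cold = catch_up(cold, log)
```

Both images end in the same state; the difference is how much work stands between a failure and a controller that is ready to serve.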

The upgrade had a hard prerequisite

The operational fine print mattered more than usual, and it's the part we drilled in courses that spring. There was no direct upgrade from a ZooKeeper-based cluster to 4.0. The required path was a bridge: upgrade to a 3.9.x release, perform the documented KRaft migration — provision controllers, migrate the metadata, decommission ZooKeeper — and only then move to 4.0. Teams that had deferred the KRaft migration for years suddenly had a forcing function.
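The ordering rule above is simple enough to write down as a checklist generator. The sketch below is a hypothetical planning helper, not a Kafka tool: upgrade_steps and its step strings are made up, and it encodes only the rule described in this post. It does not model any version constraints that apply to clusters already running KRaft, so check the official upgrade notes for those.

```python
def upgrade_steps(version, mode):
    """Return the ordered steps to reach Kafka 4.0.

    version: (major, minor) tuple of the running brokers, e.g. (3, 6)
    mode:    "zookeeper" or "kraft"
    """
    if mode not in ("zookeeper", "kraft"):
        raise ValueError(f"unknown metadata mode: {mode!r}")

    if version >= (4, 0):
        return []  # already there

    steps = []
    if mode == "zookeeper":
        # No direct path from ZooKeeper mode: go through the bridge release.
        if version < (3, 9):
            steps.append("upgrade brokers to 3.9.x (bridge release)")
        steps += [
            "provision KRaft controllers",
            "migrate metadata from ZooKeeper to KRaft",
            "decommission ZooKeeper",
        ]
    steps.append("upgrade to 4.0")
    return steps


plan = upgrade_steps((3, 6), "zookeeper")
for number, step in enumerate(plan, start=1):
    print(f"{number}. {step}")
```

For a ZooKeeper-based 3.6 cluster this prints five steps, with the 4.0 upgrade last; a cluster already on KRaft gets a single step.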

The release also took out the trash accumulated across the 3.x era. Old protocol API versions were removed (KIP-896), setting a floor on client ages — very old consumers and producers simply couldn't talk to a 4.0 broker. Java requirements rose: brokers now demanded Java 17, clients and Streams Java 11. And a stack of long-deprecated tools and flags vanished. None of this was difficult individually; collectively it meant 4.0 was an upgrade you planned, with a client-version audit first.
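A client-version audit can start as something this small. The sketch below assumes you already have an inventory of Java client applications; the inventory format and the audit_clients helper are hypothetical. The Java 11 floor for clients comes from the paragraph above. The 2.1 client-version floor is our reading of the 4.0 upgrade notes and is not stated in this post, so it is left as a parameter; verify it against the official notes before relying on it.

```python
def parse_version(text):
    """Turn '2.8.1' into (2, 8) so versions compare numerically."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def audit_clients(inventory, min_client="2.1", min_client_java=11):
    """Return (app name, reason) pairs for apps that would block a 4.0 upgrade.

    min_client is an assumed floor (see the note above); adjust as needed.
    """
    floor = parse_version(min_client)
    problems = []
    for app in inventory:
        if parse_version(app["client_version"]) < floor:
            problems.append(
                (app["name"],
                 f"client {app['client_version']} is older than {min_client}")
            )
        if app["java"] < min_client_java:
            problems.append(
                (app["name"],
                 f"Java {app['java']} is older than {min_client_java}")
            )
    return problems


# Made-up inventory of Java client applications.
inventory = [
    {"name": "billing-consumer", "client_version": "3.7.0", "java": 17},
    {"name": "legacy-etl", "client_version": "0.10.2", "java": 8},
    {"name": "audit-producer", "client_version": "2.8.1", "java": 8},
]

findings = audit_clients(inventory)
for name, reason in findings:
    print(f"{name}: {reason}")
```

Comparing parsed tuples rather than raw strings matters here: as a string, "0.10.2" would sort after "0.9", which is exactly the kind of mistake that hides an ancient client.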

Queues for Kafka, at last

The headline new feature was share groups (KIP-932), shipping in early access: queue-like semantics on ordinary Kafka topics. Where a classic consumer group caps useful parallelism at the partition count, because each partition is assigned to at most one consumer in the group and read in order, share groups let many consumers cooperatively pull from the same partitions, with per-record acknowledgment and redelivery. In other words: the work-queue pattern that had been pushing teams to run RabbitMQ or SQS alongside Kafka could now, at least in preview, live on infrastructure they already operated. Alongside it, the new consumer rebalance protocol (KIP-848) reached GA, replacing stop-the-world rebalances with broker-coordinated incremental ones. For large consumer groups this was an enormous quality-of-life improvement, and for most applications adopting it took nothing more than a client config flip (group.protocol=consumer).
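The difference between the two consumption models is easier to see in miniature. The following is a toy in-memory model, not the Kafka client API: ToyShareQueue, its method names, and the max_deliveries limit are all invented for illustration. It shows the parallelism cap of a classic group and the acquire, acknowledge, and release-for-redelivery cycle that gives share groups their queue-like behavior.

```python
from collections import deque


def useful_consumers(consumers, partitions):
    """Classic group: consumers beyond the partition count sit idle."""
    return min(consumers, partitions)


class ToyShareQueue:
    """Hypothetical model of per-record acknowledgment and redelivery."""

    def __init__(self, record_ids, max_deliveries=3):
        self._available = deque(record_ids)
        self._in_flight = set()
        self._deliveries = {rid: 0 for rid in record_ids}
        self._max_deliveries = max_deliveries  # assumed limit, for illustration
        self.done = []   # records processed successfully
        self.dead = []   # records that ran out of delivery attempts

    def acquire(self):
        """Hand the next available record to whichever worker asks first."""
        if not self._available:
            return None
        rid = self._available.popleft()
        self._deliveries[rid] += 1
        self._in_flight.add(rid)
        return rid

    def acknowledge(self, rid):
        """Mark one record as done, independently of its neighbors."""
        self._in_flight.remove(rid)
        self.done.append(rid)

    def release(self, rid):
        """Give a record back for redelivery, unless attempts are exhausted."""
        self._in_flight.remove(rid)
        if self._deliveries[rid] >= self._max_deliveries:
            self.dead.append(rid)
        else:
            self._available.append(rid)

    def delivery_count(self, rid):
        return self._deliveries[rid]


queue = ToyShareQueue(["r0", "r1", "r2"])
first = queue.acquire()     # worker A takes r0
second = queue.acquire()    # worker B takes r1 at the same time
queue.acknowledge(first)    # A succeeds
queue.release(second)       # B fails; r1 becomes available again
third = queue.acquire()     # r2 is delivered next
queue.acknowledge(third)
retry = queue.acquire()     # r1 is redelivered
queue.acknowledge(retry)
```

Note what the model gives up: records complete in the order r0, r2, r1, so the per-partition ordering of a classic group is traded for parallelism and independent retries. That trade is the heart of the queue-versus-log decision.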

Looking back from June 2026

The transition went about as smoothly as removing a distributed system's brain can go, which is to say that the loud failures came almost entirely from teams that skipped the bridge-release documentation. KRaft-only Kafka proved itself boring in the best way, and the operational simplification was real: a whole genre of ZooKeeper-tuning runbooks quietly became historical documents. Share groups matured through subsequent 4.x releases toward production readiness, and "do we still need a separate queue?" became a live architecture question. Mostly, though, March 2025 stands as a model for how mature open-source projects retire load-bearing infrastructure: announce early, migrate gradually, remove decisively.

For teams building on this stack, our Distributed Task Automation with Python, Kafka, and Celery course covers exactly the queue-versus-log architecture decisions that share groups reopened, and Distributed Task Automation with Python, Faust, and Kafka works through stream processing patterns on modern KRaft-era Kafka.