Designing Multi-Agent Systems

Class Duration

14 hours of live training delivered over 2-3 days to accommodate your scheduling needs.

Student Prerequisites

Professional software development experience in Python or TypeScript
Familiarity with LLM API usage and function/tool calling

Target Audience

Software engineers and architects designing or building systems where multiple AI agents collaborate on complex, long-horizon tasks. Relevant for teams building internal automation platforms, enterprise AI assistants, or agentic pipelines that go beyond single-agent workflows.

Description

Multi-agent systems can take on tasks that exceed a single agent's context and capability limits, but they introduce new challenges in orchestration, state management, error recovery, and observability. This course covers the architectural patterns and practical techniques for building reliable multi-agent systems: planner/worker decomposition, agent hand-off protocols, shared memory and state passing, failure detection and recovery, evaluation, and the safety considerations unique to autonomous agent collaboration. Labs build progressively more complex multi-agent pipelines in TypeScript and Python against real model backends.

Learning Outcomes

Describe the key multi-agent architectural patterns: orchestrator, planner/worker, pipeline, and network topologies.
Implement agent hand-off with clear task scope, context transfer, and completion signaling.
Design shared memory and state stores for multi-agent coordination.
Apply failure detection, retry, and escalation patterns to agent pipelines.
Build a planner/worker system with dynamic task decomposition.
Evaluate multi-agent system behavior using trace analysis and task-completion metrics.
Apply safety boundaries: capability scoping, confirmation gates, and human-in-the-loop escalation.

Training Materials

Comprehensive courseware is distributed online at the start of class. All students receive a downloadable MP4 recording of the training.

Software Requirements

Python 3.12+ or Node.js 20+, API keys for at least one frontier model, and Git.

Training Topics

Why Multi-Agent Systems

Task classes that benefit from multiple agents
Limits of single-agent context and capability
Tradeoffs: complexity, cost, and latency

Architectural Patterns

Orchestrator/worker pattern
Planner/executor decomposition
Pipeline (sequential) agents
Peer/network agent collaboration
Choosing the right topology

Agent Hand-Off Protocols

Task scope and acceptance criteria definition
Context package design for hand-offs
Completion signaling and result validation
Partial completion and resumption
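
As a rough illustration of the hand-off topics above, the sketch below models a context package with acceptance criteria and a result that signals complete or partial status. All names (HandoffPackage, HandoffResult, validate_result) are hypothetical and chosen for this example only; expressing acceptance criteria as a list of required output keys is a simplifying assumption, not a prescribed protocol.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    COMPLETE = "complete"
    PARTIAL = "partial"
    FAILED = "failed"


@dataclass(frozen=True)
class HandoffPackage:
    """What the sending agent gives the receiving agent (hypothetical schema)."""
    task_id: str
    scope: str                         # what the receiver is asked to do
    required_keys: tuple[str, ...]     # acceptance criteria: keys the output must contain
    context: dict = field(default_factory=dict)  # facts the receiver needs


@dataclass
class HandoffResult:
    """What the receiving agent sends back."""
    task_id: str
    status: Status
    output: dict = field(default_factory=dict)
    remaining: tuple[str, ...] = ()    # work left over, used for resumption


def validate_result(package: HandoffPackage, result: HandoffResult) -> list[str]:
    """Return a list of problems; an empty list means the result is accepted."""
    problems = []
    if result.task_id != package.task_id:
        problems.append("task_id mismatch")
    if result.status is Status.COMPLETE:
        missing = [k for k in package.required_keys if k not in result.output]
        if missing:
            problems.append(f"missing output keys: {missing}")
    elif result.status is Status.PARTIAL and not result.remaining:
        problems.append("partial result must list remaining work")
    return problems
```

A caller would build a HandoffPackage, pass it to the receiving agent, and run validate_result on whatever comes back before treating the task as done; a partial result with a non-empty remaining list can be turned into a follow-up package.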

Shared Memory and State

In-process vs. external state stores
Shared context formats and schemas
Concurrent write safety
Memory pruning for long-running systems
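
One common way to approach concurrent write safety is optimistic concurrency: each key carries a version, and a write succeeds only if the writer saw the latest version. The sketch below is a minimal in-process illustration under that assumption; SharedState and VersionConflict are hypothetical names, and an external store would need its own compare-and-set mechanism.

```python
import threading


class VersionConflict(Exception):
    """Raised when a writer's view of a key is stale."""


class SharedState:
    """In-process key/value store with per-key versions (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); unknown keys read as (0, None)."""
        with self._lock:
            return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        """Write only if the stored version still equals expected_version."""
        with self._lock:
            current, _ = self._data.get(key, (0, None))
            if current != expected_version:
                raise VersionConflict(
                    f"{key}: expected v{expected_version}, found v{current}"
                )
            self._data[key] = (current + 1, value)
            return current + 1
```

An agent that hits VersionConflict would typically re-read the key, merge its change with the newer value, and try again, rather than overwriting another agent's update.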

Failure Detection and Recovery

Detecting stuck, looping, or incorrect agents
Retry strategies per agent type
Escalation to human-in-the-loop
Graceful degradation when an agent fails
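
The detection and retry topics above can be sketched with two small helpers: a heuristic that flags an agent repeating the same action, and a bounded retry wrapper that hands off to an escalation callback once attempts are exhausted. The function names, the window size, and the repeat threshold are assumptions made for illustration; real systems usually tune these per agent type.

```python
from collections import Counter


def is_looping(actions, window=6, max_repeats=3):
    """Heuristic: True if any action appears max_repeats+ times in the last `window` actions."""
    recent = actions[-window:]
    return any(count >= max_repeats for count in Counter(recent).values())


def run_with_retry(step, max_attempts=3, escalate=None):
    """Call step(attempt) up to max_attempts times.

    If every attempt raises, pass the last error to `escalate` (for example,
    a human-in-the-loop hook) when one is given; otherwise re-raise it.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return step(attempt)
        except Exception as exc:  # broad on purpose: any agent failure counts
            last_error = exc
    if escalate is not None:
        return escalate(last_error)
    raise last_error
```

Here step stands in for a single agent invocation and actions for a list of action names pulled from its trace. A production version would likely add backoff between attempts and distinguish retryable errors from permanent ones.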

Dynamic Task Decomposition

Planner agent design: input → task graph
Dependency resolution and parallel dispatch
Handling plan revisions mid-execution
Task graph visualization and debugging
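
Dependency resolution and parallel dispatch can be illustrated with Python's standard-library graphlib module. The sketch below assumes the planner emits a task graph as a dict mapping each task name to the set of tasks it depends on; the function name dispatch_batches and the task names in the usage note are hypothetical.

```python
from graphlib import TopologicalSorter


def dispatch_batches(task_graph):
    """Group tasks into batches whose members can run in parallel.

    task_graph maps each task to the set of tasks it depends on.
    Every task in a batch has all of its dependencies in earlier batches.
    prepare() raises graphlib.CycleError if the graph contains a cycle.
    """
    sorter = TopologicalSorter(task_graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        # In a real system, workers would run here and report back one by one.
        sorter.done(*ready)
    return batches
```

For a graph where "report" depends on "analyze", and "analyze" depends on "fetch_a" and "fetch_b", the function returns three batches: the two fetch tasks together, then "analyze", then "report". Marking tasks done individually as workers finish, instead of per batch, lets downstream tasks start as soon as their own dependencies complete.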

Evaluation and Observability

Tracing multi-agent execution end-to-end
Task-completion metrics and success criteria
Intermediate step quality evaluation
Cost attribution across agent roles

Safety Boundaries

Capability scoping per agent role
Confirmation gates for destructive actions
Human-in-the-loop escalation triggers
Audit logs for fully autonomous pipelines
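
A minimal sketch of how capability scoping, confirmation gates, and audit logging might fit together is shown below. It assumes each agent role receives a toolbox restricted to an allow-list, and that destructive tools require approval from a confirm callback (which could prompt a human). Tool, ScopedToolbox, and the audit tuple layout are hypothetical names and shapes used only for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[str], str]
    destructive: bool = False  # destructive tools need confirmation


class ScopedToolbox:
    """Exposes only the tools a role is allowed to use, and logs every call."""

    def __init__(self, role, tools, allowed, confirm):
        self.role = role
        self._tools = {t.name: t for t in tools if t.name in allowed}
        self._confirm = confirm   # callback: (role, tool_name, arg) -> bool
        self.audit_log = []       # entries: (role, tool_name, arg, outcome)

    def call(self, name, arg):
        tool = self._tools.get(name)
        if tool is None:
            self.audit_log.append((self.role, name, arg, "out_of_scope"))
            raise PermissionError(f"{self.role} may not call {name}")
        if tool.destructive and not self._confirm(self.role, name, arg):
            self.audit_log.append((self.role, name, arg, "denied"))
            raise PermissionError(f"{name} was not confirmed")
        result = tool.run(arg)
        self.audit_log.append((self.role, name, arg, "executed"))
        return result
```

Because out-of-scope and denied calls are logged as well as executed ones, the audit log records what an agent attempted, not only what it was permitted to do.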

Workshop

Build a planner/worker pipeline for a realistic task
Failure recovery exercise
Q&A session