Multi-agent demos exploded in 2023. The ones that survived contact with production all had the same traits: supervisor patterns, typed shared state, and ruthless failure handling.
2023 was the year multi-agent demos went viral. Diagrams with swarms of specialised agents passing tasks to each other looked impressive in conference talks and Twitter threads. But many of those architectures died within weeks of being exposed to real traffic. The LLMs missed tool calls, retries cascaded into loops, and the swarm that looked so elegant in a diagram became impossible to debug at 2 AM. The systems that lasted looked almost boring by comparison - and that was exactly the point.
Over two years of multi-agent work across finance, legal, and enterprise SaaS clients, we found that architectural simplicity consistently beat cleverness. Teams that resisted the urge to build elaborate agent meshes - and instead spent that energy on failure handling and observability - shipped reliable systems that people actually trusted.
The Supervisor Pattern Wins on Reliability
In our production work across 2023–25, the architecture we came back to over and over was simple: one supervisor agent that decomposes tasks and routes work, plus a small set of specialised workers. Workers never talk to each other directly. All coordination goes through the supervisor, which makes reasoning about failures much easier.
The peer-to-peer mesh model - where agents hand off work to each other in chains - creates a debugging nightmare. When something fails three handoffs deep, it is hard to reconstruct what state was passed, what each agent assumed, and which one made the decision that led to a bad outcome. With a supervisor, every decision has a single logged home.
Why this matters
Because every decision flows through one place, you can ask 'why did the system do this?' and actually answer with evidence rather than guesswork.
What a Good Supervisor Does
- Receives the user's goal or task and breaks it into sub-tasks with explicit outputs
- Routes each sub-task to the right specialist worker based on type and context
- Collects results and handles retries when workers fail or return unexpected shapes
- Knows when to escalate to a human rather than retrying indefinitely
- Records every routing decision and result with enough context to replay later
Supervisors do not need to be large, expensive models. In many of our deployments, a mid-tier model with a clear routing prompt outperformed a more powerful model with vague instructions. The supervisor's job is coordination, not deep reasoning.
Shared State as a First-Class API
The biggest source of subtle bugs in 2024-era agent systems was shared state. One agent wrote shape A, another expected shape B, and the mismatch only appeared as a vague failure downstream - sometimes two or three steps later, when context was gone and the error message was meaningless.
The fix was treating shared state exactly like an API between services: versioned, typed, and validated at every boundary. When we moved clients to typed Pydantic schemas for agent state in LangGraph, the number of subtle state-related failures dropped dramatically.
1. Define a single state schema upfront - include all fields every agent might read or write.
2. Use strict typing (TypedDict or Pydantic) - catch mismatches at type-check time, not at runtime.
3. Version the state schema - when you add new fields, add them as optional with defaults.
4. Validate on entry and exit for each agent node - surface problems at boundaries, not downstream.
5. Never let an agent write freeform JSON into a shared dict - structured state only.
Failure Handling: The Work Most Teams Skip
Most multi-agent architectures in 2023 had basically no failure handling beyond a generic try/except at the top level. This worked fine in demos where the happy path was scripted. It failed badly in production where tool calls time out, APIs return unexpected shapes, and LLMs occasionally refuse to follow instructions.
The patterns that held up across the systems we built and reviewed were relatively simple, yet consistently missing from the codebases of teams moving fast:
- Classify failures before retrying - a timeout is different from a malformed response, which is different from a rate limit
- Cap retries per node, not just at the top level - cascading retries in an agent graph can amplify cost and latency by an order of magnitude
- Define explicit fallback states - when a node fails terminally, transition to a known, safe partial result rather than propagating corruption
- Log the full input/output for every failure - not just the error code, but what the agent received and what it tried to do
- Test failure paths explicitly in staging - inject tool failures, malformed responses, and slow calls before going live
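The first three bullets can be sketched together. The exception classes, retry cap, and backoff schedule below are illustrative assumptions, not any framework's API - the point is the shape: classify before retrying, cap per node, and fall back to a known safe state:

```python
# Per-node retry handling: classify the failure, cap retries per node,
# back off exponentially, and end in an explicit fallback state.
import time

class ToolTimeout(Exception): pass
class MalformedResponse(Exception): pass
class RateLimited(Exception): pass

RETRYABLE = (ToolTimeout, RateLimited)  # a malformed response needs a fix, not a retry
MAX_RETRIES_PER_NODE = 3

def run_node(node, payload, sleep=time.sleep):
    for attempt in range(MAX_RETRIES_PER_NODE + 1):
        try:
            return node(payload)
        except RETRYABLE as exc:
            if attempt == MAX_RETRIES_PER_NODE:
                # Known, safe partial result instead of propagating corruption.
                return {"status": "fallback", "reason": type(exc).__name__}
            sleep(2 ** attempt)  # 1s, 2s, 4s
        except MalformedResponse as exc:
            # Retrying won't help; transition straight to the fallback state.
            return {"status": "fallback", "reason": type(exc).__name__}

calls = []
def flaky(payload):
    calls.append(payload)
    if len(calls) < 3:
        raise ToolTimeout()
    return {"status": "ok"}

print(run_node(flaky, {"q": "demo"}, sleep=lambda s: None))  # → {'status': 'ok'}
```

Capping retries inside `run_node` rather than only at the top level is what prevents the cascading amplification the second bullet warns about.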
Observability: Making Multi-Agent Systems Debuggable
A multi-agent system that you cannot observe is essentially a black box. You can see inputs and outputs, but you cannot understand what happened in between - which makes tuning, debugging, and incident response nearly impossible.
The minimum viable observability stack for a multi-agent system includes three layers: structured traces per run (what each node received, returned, and how long it took), aggregate dashboards per node (error rates, latency distributions, retries), and full run replay for incident investigation.
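The first layer - a structured trace per node transition - takes very little code to start. This sketch assumes nothing beyond the standard library; the trace fields and the `sink` list are illustrative, not any vendor's schema:

```python
# Minimal structured per-node trace: what each node received, what it
# returned, and how long it took, keyed by run id so a run can be replayed.
import json
import time
import uuid

def traced(run_id, node_name, node, payload, sink):
    start = time.monotonic()
    result = node(payload)
    sink.append(json.dumps({
        "run_id": run_id,
        "node": node_name,
        "input": payload,
        "output": result,
        "duration_ms": round((time.monotonic() - start) * 1000, 2),
    }))
    return result

trace = []                      # stand-in for a real log sink
run_id = str(uuid.uuid4())
traced(run_id, "summariser", lambda p: {"summary": p["text"][:10]},
       {"text": "hello world"}, trace)
print(json.loads(trace[0])["node"])  # → summariser
```

With every transition captured this way, the aggregate dashboards and full-run replay in the other two layers become queries over the same records rather than separate instrumentation.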
Tools that helped in 2023–25
LangSmith, Langfuse, and Arize all offered multi-agent tracing by 2024. The specific tool mattered less than the discipline of capturing every node transition with structured metadata. Teams that logged consistently spent a fraction of the time on incident response compared to those that didn't.
How Many Agents Do You Actually Need?
One of the clearest patterns across our engagements: teams almost always over-specified the number of agents in their initial designs. A 10-agent architecture drawn in a workshop would often reduce to 3 or 4 after implementation, because many of the proposed agents were doing things that a single, well-prompted worker could handle more simply.
A useful heuristic: a new agent is justified when it needs a meaningfully different context window, a different model, a different set of tools, or a different failure mode from an existing agent. If none of those apply, it is probably just a function.
Practical Checklist Before Going Live
1. Every agent node has a typed input and output schema - no freeform dicts.
2. The supervisor has explicit routing logic and a documented fallback path for each worker type.
3. All tool calls have timeouts and retry caps with exponential backoff.
4. Full run traces are captured in a structured format and retained for at least 30 days.
5. A staging suite exercises failure paths: tool timeouts, unexpected response shapes, and budget overruns.
6. There is a clear human escalation path - not a generic error page.
Multi-agent systems built with this checklist in mind did not impress in demos the same way swarm architectures did. But they ran in production for months without waking anyone up at 3 AM - and that was the real measure of success.
Key takeaways
- The supervisor pattern proved far more debuggable than peer-to-peer agent meshes
- Shared state must be explicit and versioned, not ad-hoc JSON blobs
- Partial failure handling matters more than clever planning strategies
- Replayable traces are essential for incident response and tuning
- Most teams over-estimated how many agents they really needed