Multi-Agent Orchestration: Architectures, Frameworks, and Tradeoffs
When and how to build multi-agent systems. Architectures, frameworks (LangGraph, CrewAI, AutoGen), MCP for tools, observability, and a real contract-analysis case study.
- PUBLISHED
- April 22, 2026
- READ TIME
- 10 MIN
- AUTHOR
- ONE FREQUENCY
Multi-agent systems are the most over-prescribed pattern in AI engineering. The honest truth is that for the majority of tasks, a single well-prompted agent with the right tools beats a multi-agent crew on cost, latency, and reliability. This post lays out when multi-agent actually wins, which architecture fits which problem, how the major frameworks compare, and what changes in your observability and operations once agents start talking to each other.
We end with a real case study: single-agent versus multi-agent on the same contract analysis task. Spoiler: it is closer than the multi-agent hype suggests, and the choice depends on factors that have nothing to do with the model.
The architecture spectrum
Four canonical multi-agent shapes, ordered roughly by control structure:
Hierarchical (orchestrator plus sub-agents): a top-level orchestrator decides which sub-agent handles which sub-task and synthesizes the outputs. The sub-agents do not talk to each other directly; everything routes through the orchestrator. This is the dominant pattern in production (a minimal sketch follows these four descriptions). Anthropic's Claude Code follows this shape: the main agent dispatches sub-agents for specific phases.
Peer-to-peer collaboration: agents converse directly with each other, often through a shared message bus or a "groupchat" abstraction. Each agent has a role and the conversation ends when a termination condition is met. AutoGen's classic pattern. Higher emergence, harder to constrain.
Swarm or voting: N identical or near-identical agents tackle the same task in parallel, results are aggregated by voting or by a separate aggregator. Useful for high-stakes classification, jury-style decisions, or when you want output diversity to surface false negatives. Same shape as the parallelization-voting workflow pattern, just with full agents instead of single LLM calls.
Debate: two or more agents argue opposing positions, and a judge agent (or a deterministic rule) picks the winner. Used in alignment research and occasionally in production for adversarial review (security review by a "red team" agent versus a "defender" agent). Expensive and slow.
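To make the dominant hierarchical shape concrete, here is a minimal sketch; `call` stands in for whatever LLM client you use, and the dispatch loop is deliberately simplified:

```python
from typing import Callable

# Hypothetical stand-in for a real LLM call (e.g. one Anthropic Messages
# request): (system_prompt, user_content) -> response text.
LLMCall = Callable[[str, str], str]

def hierarchical_run(task: str, call: LLMCall, sub_agents: dict[str, str]) -> str:
    """Orchestrator dispatches each sub-agent, then synthesizes the returns."""
    results: dict[str, str] = {}
    for name, system_prompt in sub_agents.items():
        # Sub-agents never see each other's output; everything routes
        # through the orchestrator's growing context.
        results[name] = call(system_prompt, task)
    report = "\n\n".join(f"## {name}\n{out}" for name, out in results.items())
    return call("Synthesize these sub-agent reports into one answer.", report)
```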
When monolithic beats multi-agent
Most of the time. Specifically:
- The task fits in a single context window with reasonable headroom (under 60 percent of the window after system prompt and tools)
- Sub-tasks are tightly coupled and require shared state
- Latency budget is under 30 seconds per request
- Your team has not yet operationalized a single-agent baseline
The cost math is brutal for multi-agent. A 3-agent hierarchical setup typically costs 5x to 8x a comparable single-agent run because the orchestrator's context grows with each sub-agent return, and the sub-agents re-establish context on every call.
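To see where that multiplier comes from, toy arithmetic with illustrative token counts (assumptions, not measurements):

```python
# Toy arithmetic: why a 3-agent hierarchy multiplies input tokens.
single_agent_input = 40_000   # one context: system prompt + task + tools

orchestrator_ctx = 10_000     # orchestrator system prompt + task
sub_agent_context = 30_000    # each sub-agent re-establishes context
sub_agent_return = 4_000      # each return is appended to the orchestrator

multi_input = 0
for _ in range(3):
    multi_input += sub_agent_context   # the sub-agent call itself
    multi_input += orchestrator_ctx    # orchestrator re-reads its growing context
    orchestrator_ctx += sub_agent_return

print(multi_input / single_agent_input)  # ~3.3x on input tokens alone
```

Add the orchestrator's planning and synthesis turns over the full accumulated context, plus output tokens at higher per-token prices, and the 5x to 8x range is easy to hit.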
Multi-agent makes sense when at least one of these is true:
- Total context needed exceeds the window even with prompt caching and retrieval
- Sub-tasks are genuinely independent and can parallelize
- Different sub-tasks need different specializations (tools, models, system prompts) that materially diverge
- You need separate trust boundaries (a privileged sub-agent that can write to a database versus a read-only research sub-agent)
Mapping to real frameworks
| Framework | Architectural fit | Strengths | Where it struggles |
|-----------|-------------------|-----------|--------------------|
| LangGraph | Hierarchical, conditional graphs | Best-in-class state management via StateGraph, native checkpointing, time-travel debugging | Steep learning curve, verbose for simple cases |
| CrewAI | Role-based hierarchical | Fast prototyping, clean abstractions for "crews", good for non-engineers | Less control over execution, weaker observability hooks |
| AutoGen (Microsoft) | Peer-to-peer, conversational | Mature multi-agent conversation patterns, AutoGen Studio for low-code | Conversation can run away without strict termination conditions |
| LlamaIndex Agents | Hierarchical with strong retrieval | Best when retrieval is the core capability | Less mature for non-RAG agentic tasks |
| Anthropic SDK + custom | Anything | Maximum control, minimum dependencies | You build the orchestration plumbing |
A useful heuristic: if you can sketch your agent topology on a whiteboard in under 5 minutes and it has fewer than 6 nodes, LangGraph is usually the right pick. If your domain experts (non-engineers) need to author and tune the agents, CrewAI wins. If you need adversarial or debate patterns out of the box, AutoGen.
LangGraph StateGraph in practice
LangGraph models the agent system as a directed graph of nodes (functions) and edges (transitions). State flows through the graph as a TypedDict.
```python
from typing import List, TypedDict

from langgraph.checkpoint.postgres import PostgresSaver  # langgraph-checkpoint-postgres
from langgraph.graph import END, StateGraph


class ContractState(TypedDict):
    contract_text: str
    clauses_extracted: List[dict]
    risks_identified: List[dict]
    summary: str
    needs_human_review: bool


def extract_clauses(state: ContractState) -> dict:
    # Call the extraction agent; return only the state slice this node owns.
    return {"clauses_extracted": [...]}


def analyze_risks(state: ContractState) -> dict:
    # Call the risk analyst agent against clauses_extracted.
    return {"risks_identified": [...]}


def summarize(state: ContractState) -> dict:
    return {
        "summary": "...",
        "needs_human_review": any(r["severity"] >= 4 for r in state["risks_identified"]),
    }


def route_after_summary(state: ContractState):
    return "human_review" if state["needs_human_review"] else END


graph = StateGraph(ContractState)
graph.add_node("extract", extract_clauses)
graph.add_node("analyze", analyze_risks)
graph.add_node("summarize", summarize)
graph.add_node("human_review", lambda s: s)

graph.set_entry_point("extract")
graph.add_edge("extract", "analyze")
graph.add_edge("analyze", "summarize")
graph.add_conditional_edges("summarize", route_after_summary)
graph.add_edge("human_review", END)

# Persist every state transition to Postgres.
app = graph.compile(checkpointer=PostgresSaver(...))
```
The checkpointer is the killer feature. Every state transition is persisted, which means: pause and resume across days, time-travel to a prior state and replay with a different prompt, and recover from infrastructure failures mid-trajectory.
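Assuming the current LangGraph checkpoint API (`get_state_history`, `update_state`, and resuming with a `None` input), here is a hedged sketch of what that buys you; `msa_text` and `revised_clauses` are hypothetical inputs:

```python
# One thread_id scopes one contract's trajectory in the checkpoint store.
config = {"configurable": {"thread_id": "contract-1234"}}

# First run: every node transition is checkpointed under this thread.
result = app.invoke({"contract_text": msa_text}, config)

# Later, even days later: walk prior checkpoints and replay from one.
history = list(app.get_state_history(config))
after_extract = next(s for s in history if s.next == ("analyze",))
app.update_state(after_extract.config, {"clauses_extracted": revised_clauses})
result = app.invoke(None, after_extract.config)  # resume from the edited checkpoint
```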
MCP for tool standardization across agents
The Model Context Protocol (MCP), released by Anthropic in late 2024 and since adopted by OpenAI, Google, and the major IDEs, solves a real problem in multi-agent systems: each sub-agent needs the same tool set, but defining tools per-agent leads to drift.
With MCP, you stand up tools as MCP servers (one per resource: Postgres, GitHub, Stripe, internal API) and any agent that speaks MCP can connect. Concretely:
- Sub-agents in different frameworks can share the same Stripe MCP server
- Tool versioning happens at the server, not in each agent's prompt
- Permissions and auth are handled by the server, not by the agent
- Observability captures MCP calls uniformly
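As a sketch of what "one server per resource" looks like, here is a minimal MCP server using the official Python SDK's FastMCP helper (assumes `pip install mcp`; the tool and clause data are illustrative):

```python
from mcp.server.fastmcp import FastMCP

# Illustrative clause library; a real server would query your template store.
TEMPLATE_CLAUSES = {
    "liability": "Liability is capped at fees paid in the prior 12 months.",
}

mcp = FastMCP("contract-tools")

@mcp.tool()
def search_template_clause(category: str) -> str:
    """Return the standard-template clause for a risk category."""
    return TEMPLATE_CLAUSES.get(category, "No template clause on file.")

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; any MCP-speaking agent can connect
```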
If you are starting a multi-agent project in 2026, designing your tools as MCP servers from day one will save you significant refactoring later.
State management and inter-agent communication
Three legitimate patterns:
Shared state object: the LangGraph approach. Single source of truth, every node reads and writes a slice. Simple, debuggable. Default choice.
Message passing: each agent has an inbox; the orchestrator routes messages. Useful for genuinely asynchronous workflows or when agents run on different hosts. Adds complexity.
Blackboard: a shared key-value store that all agents read and write. Classic AI architecture, useful when you have many agents and emergent collaboration patterns. Hard to debug at scale.
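For the message-passing shape, a minimal sketch with asyncio queues as inboxes (agent behavior is stubbed for illustration):

```python
import asyncio

async def agent(name: str, inbox: asyncio.Queue, results: asyncio.Queue) -> None:
    while True:
        msg = await inbox.get()
        if msg is None:            # shutdown signal from the orchestrator
            return
        results.put_nowait((name, f"{name} handled: {msg}"))

async def orchestrate(tasks: dict[str, str]) -> list[tuple[str, str]]:
    results: asyncio.Queue = asyncio.Queue()
    inboxes = {name: asyncio.Queue() for name in tasks}
    workers = [asyncio.create_task(agent(n, q, results)) for n, q in inboxes.items()]
    for name, task in tasks.items():
        inboxes[name].put_nowait(task)      # orchestrator routes each message
    replies = [await results.get() for _ in tasks]
    for q in inboxes.values():
        q.put_nowait(None)
    await asyncio.gather(*workers)
    return replies

# asyncio.run(orchestrate({"extractor": "pull clauses", "analyst": "rate risks"}))
```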
For most production systems, shared state via a typed schema is the right answer. Resist the urge to build a "general agent framework" before you have shipped a single use case.
Observability across agent boundaries
The default observability tooling (a single trace per LLM call) breaks the moment you go multi-agent. You need:
- Distributed tracing: every sub-agent invocation is a span under a parent trace. OpenTelemetry-compatible tools (Langfuse, Arize Phoenix, Datadog LLM Observability, PostHog LLM analytics) all work.
- Token attribution per agent: track which agent consumed how many tokens and at what cost (see the sketch after this list). Without this, cost regressions are invisible.
- Tool call audit: every MCP tool call logged with agent identity, parameters, and result.
- State diffs: on every state transition, capture before and after. The cheapest way to debug a "the agent did something weird in step 4" report.
- Trajectory replay: ability to re-run a failed trajectory from a checkpoint with a modified prompt or tool.
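A hedged sketch of the first two items, per-agent spans with token attribution via OpenTelemetry (the attribute names are our own convention, and `call_llm` is a stub for your model client):

```python
from dataclasses import dataclass

from opentelemetry import trace

tracer = trace.get_tracer("agents")

@dataclass
class AgentResponse:          # stand-in for a real LLM client response
    text: str
    input_tokens: int
    output_tokens: int

def call_llm(agent: str, task: str) -> AgentResponse:
    # Stub: wire this to your model provider.
    return AgentResponse(text="...", input_tokens=0, output_tokens=0)

def run_sub_agent(name: str, task: str) -> str:
    # One span per sub-agent invocation, nested under the caller's trace.
    with tracer.start_as_current_span(f"agent.{name}") as span:
        resp = call_llm(name, task)
        span.set_attribute("agent.name", name)
        span.set_attribute("llm.usage.input_tokens", resp.input_tokens)
        span.set_attribute("llm.usage.output_tokens", resp.output_tokens)
        return resp.text
```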
Budget for observability tooling on day one. The number of multi-agent systems running blind in production is alarming, and incident response without traces is guesswork.
Case study: contract analysis, single agent vs multi-agent
The task: given a 40-page MSA, produce a risk report with clause extractions, deviations from a standard template, and a recommended action.
Single-agent build:
- One agent, Claude Opus 4.5
- System prompt includes the standard template and the risk categories
- Tools: `extract_pdf_text`, `search_template_clause`, `write_report`
- Wall clock: 22 seconds
- Tokens: 38k input, 6k output (with prompt caching enabled)
- Cost per contract: ~$0.42
- Quality (LLM-judge vs human gold standard): 87 percent agreement
Multi-agent build (LangGraph, hierarchical):
- Orchestrator (Claude Sonnet 4) plans the analysis
- Extractor agent (Claude Sonnet 4) pulls clauses by category
- Risk analyst agent (Claude Opus 4.5) evaluates each clause vs template, in parallel for each of 6 categories
- Synthesizer agent (Claude Opus 4.5) writes the report
- Wall clock: 38 seconds (parallelism helps on the risk analyst stage)
- Tokens: 95k input, 14k output across all agents
- Cost per contract: ~$0.94
- Quality: 91 percent agreement
The multi-agent version wins on quality (4 percentage points) and loses on cost (2.2x) and latency (1.7x). The break-even decision turns on three factors (a small break-even sketch follows the list):
- Volume: at 50 contracts a day, the cost delta is ~$26 daily, trivial. At 5,000 contracts, it is $2,600 daily, real money.
- Quality bar: if legal cares about the 4-point quality lift (catches more high-severity deviations), the multi-agent build pays for itself in one missed risk avoided.
- Latency tolerance: if the contract review is async (overnight batch), the latency is irrelevant. If it is interactive, 38 seconds may be too long.
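The break-even arithmetic from those numbers, under the single assumption that the quality lift translates linearly into caught deviations:

```python
# Case-study numbers from above.
cost_delta = 0.94 - 0.42     # extra spend per contract (~$0.52)
quality_lift = 0.91 - 0.87   # 4 extra catches per 100 contracts

# Multi-agent pays off when one extra caught deviation is worth at least:
break_even = cost_delta / quality_lift
print(f"${break_even:.0f} per caught high-severity deviation")  # ~$13
```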
In most production deployments we have seen, the right answer is the single-agent build until volume or quality requirements force the move. Premature multi-agent is the most common over-engineering trap in this space.
Failure modes specific to multi-agent
Four failure modes are unique to multi-agent systems and worth designing for explicitly.
Cascading hallucinations. A sub-agent's hallucinated output becomes input to the next sub-agent, which treats it as ground truth. By the time the orchestrator sees the final result, the hallucination is buried under three layers of confident-sounding analysis. Mitigation: every sub-agent output that flows to another agent should include explicit uncertainty markers ("This clause was extracted with low confidence because the text was OCR'd") and downstream agents should respect them.
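One lightweight way to carry those markers between agents, with field names that are ours rather than any standard:

```python
from dataclasses import dataclass

@dataclass
class SubAgentResult:
    content: str
    confidence: float     # 0.0-1.0, set by the producing agent
    caveats: list[str]    # e.g. ["source text was OCR'd"]

def render_for_downstream(r: SubAgentResult) -> str:
    # Make uncertainty impossible for the next agent to miss.
    flag = "LOW CONFIDENCE" if r.confidence < 0.5 else "OK"
    notes = "; ".join(r.caveats) or "none"
    return f"[{flag}] {r.content}\n(caveats: {notes})"
```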
Context fragmentation. The orchestrator sees a summary of what each sub-agent did, not the full trajectory. Critical detail gets lost in the summarization step. Mitigation: keep full sub-agent trajectories accessible to the orchestrator via a retrieval step, not just a summary. LangGraph's checkpoint store makes this practical.
Tool conflict. Two sub-agents both write to the same resource (a database row, a Sheet, a Jira ticket) and clobber each other's work. Mitigation: serialize writes through the orchestrator, or use idempotent tool operations with optimistic locking. The latter is more work but scales better.
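For the optimistic-locking route, a minimal sketch against a relational store (assumes a `version` column on the shared table; shown with sqlite-style DB-API placeholders):

```python
def update_report(conn, row_id: int, new_body: str, expected_version: int) -> bool:
    # The write only lands if nobody else bumped the version first.
    cur = conn.execute(
        "UPDATE reports SET body = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_body, row_id, expected_version),
    )
    return cur.rowcount == 1  # False: another agent wrote first; re-read and retry
```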
Runaway conversations. Without strict termination conditions, peer-to-peer agents can loop indefinitely or wander off-task. Mitigation: every multi-agent workflow needs a wall-clock budget, a max-turns counter, and an "escape hatch" instruction in every agent's system prompt: "if you have been called more than 5 times without progress, return a final answer and stop."
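A minimal enforcement sketch for those budgets; `step` and `finalize` are hypothetical callables your workflow supplies:

```python
import time

MAX_TURNS, MAX_SECONDS = 12, 120  # illustrative budgets

def run_with_budget(step, state, finalize):
    # `step` runs one agent turn and reports whether the task is done;
    # `finalize` forces a best-effort final answer.
    start = time.monotonic()
    for turn in range(MAX_TURNS):
        state, done = step(state, turn)
        if done:
            return state
        if time.monotonic() - start > MAX_SECONDS:
            break  # wall-clock budget exhausted
    return finalize(state)  # escape hatch: stop looping, answer anyway
```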
Decision checklist
- [ ] You have shipped a single-agent baseline and measured its quality and cost
- [ ] You have identified the specific sub-task that needs a different model, tool set, or trust boundary
- [ ] You have chosen a framework that matches your architecture (LangGraph for hierarchical, AutoGen for peer-to-peer, CrewAI for role-based)
- [ ] Your tools are exposed via MCP or have a clean abstraction that makes them easy to share
- [ ] You have wired distributed tracing before going to production
- [ ] You have token budgets and wall-clock caps enforced in code
- [ ] You have a checkpoint or replay mechanism for debugging mid-trajectory failures
- [ ] You have run side-by-side evaluation against the single-agent baseline and the multi-agent build wins on a metric you can defend
For the metrics behind that side-by-side evaluation, agent-observability-metrics covers SLI design. If governance or policy on multi-agent autonomy is the open question, ai-governance-framework-template is the place to start.
Next steps
If you are considering a multi-agent system, run the single-agent baseline first and let the numbers tell you whether the additional complexity is justified. We help engineering teams design agent topologies and pick the right framework for the workload. Reach out if you want a structured architecture review before you commit to a build.
Ready to ship the next outcome?
One Frequency Consulting brings 25+ years of technology leadership and military discipline to every engagement. First call is operator-grade scoping — sixty minutes, no charge.