Scaling Agentic Systems: Cost, Latency, and Token Economics
Real numbers on agent cost and latency at scale. Token math, cache pricing, routing, batching, and a per-task calculator for a 50k ticket triage agent.
- Published: April 20, 2026
- Read time: 11 min
- Author: One Frequency
Most teams discover agent economics the hard way: the prototype costs $0.02 per task in the demo and $0.40 per task in production once you turn on retrieval, tool calls, and longer context. At 50,000 tasks a month, that is a $20,000 line item on a workload your CFO had not budgeted for.
This article gives you the numbers you need to plan agent cost before you ship, and the techniques teams use to reduce per-task cost by 5-10x without losing quality. It ends with a real cost-per-task calculator for a customer service triage agent at 50,000 tickets per month.
The token math you actually need
Every model bills on input and output tokens, but the asymmetry is huge. Output tokens cost roughly 4-5x more than input tokens on most modern frontier models. An agent that reads 10,000 tokens of context and produces 200 tokens of decision is cheap. An agent that reads 2,000 tokens and produces a 4,000-token report is expensive.
Approximate published prices as of mid-2026 (always verify against current provider pricing pages before committing):
| Model | Input ($/1M) | Output ($/1M) | Cache write | Cache read |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4.5 | $3 | $15 | 1.25x input | 0.1x input |
| Claude Opus 4.5 | $15 | $75 | 1.25x input | 0.1x input |
| GPT-5 | $5 | $20 | 1x input | 0.5x input |
| GPT-5 mini | $0.50 | $2 | 1x input | 0.5x input |
| Gemini 2.5 Pro | $2.50 | $10 | implicit | implicit discount |
| Gemini 2.5 Flash | $0.30 | $1.20 | implicit | implicit discount |
The cache columns matter more than most teams realize. Anthropic's prompt cache gives you roughly a 90% discount on cached input tokens. OpenAI's cache gives roughly 50%. Gemini does implicit caching with automatic detection. If your agent has a 15,000-token system prompt and tool definitions that do not change across requests, caching turns that fixed cost from $0.045 per request (Sonnet) into $0.0045 per request. Across 50,000 requests, that is $2,250 versus $225.
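To sanity-check your own numbers, the arithmetic fits in a few lines. This is a minimal sketch using the approximate Sonnet figures from the table; the prices and multipliers are assumptions you should replace with current published rates.

```python
# Rough cache-savings arithmetic for a stable prompt prefix.
# Prices and multipliers are the approximate Sonnet figures from the table
# above; verify against current provider pricing before relying on them.

INPUT_PER_MTOK = 3.00      # $ per 1M input tokens
CACHE_WRITE_MULT = 1.25    # first request pays a write premium
CACHE_READ_MULT = 0.10     # warm-cache hits pay ~10% of the input price

def monthly_prefix_cost(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Cost of re-sending a fixed prompt prefix across a month of requests."""
    per_request = prefix_tokens / 1_000_000 * INPUT_PER_MTOK
    if not cached:
        return per_request * requests
    # One cache write, the rest are reads (assumes the cache stays warm).
    return per_request * CACHE_WRITE_MULT + per_request * CACHE_READ_MULT * (requests - 1)

print(monthly_prefix_cost(15_000, 50_000, cached=False))  # ~2250
print(monthly_prefix_cost(15_000, 50_000, cached=True))   # ~225
```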
Per-request cost components
A real agent request has more cost surface than just the LLM call. Sketch it out:
- Input tokens for system prompt and tool definitions (cacheable)
- Input tokens for user request and dynamic context (mostly not cacheable)
- Output tokens for reasoning and final response
- Each tool call: output tokens for the model's tool arguments, plus the tool's response fed back as input tokens on the next turn
- Retrieval cost: embedding generation, vector search, re-rank
- Storage and egress: minor for most workloads, real at petabyte scale
A typical multi-tool agent task with 5 tool calls might look like:
- 1st turn: 12k input (cached system) + 2k input (user) + 500 output = ~$0.043
- 2nd-6th turns: 14k input (each tool result adds context) + 500 output each
- Final answer: 800 output
On Claude Sonnet 4.5 with caching enabled, that totals roughly $0.18-$0.24 per task. Without caching, $0.35-$0.45. Without caching and on Opus, $1.50-$2.20.
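If you want to model this yourself before committing to a provider, a back-of-envelope function is enough. This is a sketch under the Sonnet prices from the table above; the turn count, tool-result sizes, and cache discount are assumptions to swap for your own measurements.

```python
# Back-of-envelope cost for a multi-turn, multi-tool agent task.
# Assumes Sonnet-style pricing from the table above; adjust for your model.

INPUT = 3.00 / 1e6       # $ per input token
OUTPUT = 15.00 / 1e6     # $ per output token
CACHE_READ = 0.10        # cached prefix billed at ~10% of the input rate

def task_cost(cached_prefix: int, variable_ctx: int, tool_turns: int,
              tool_result_tokens: int, output_per_turn: int,
              final_output: int, use_cache: bool) -> float:
    cost, ctx = 0.0, variable_ctx
    for turn in range(tool_turns + 1):          # tool turns plus the final answer turn
        prefix_rate = INPUT * (CACHE_READ if use_cache else 1.0)
        cost += cached_prefix * prefix_rate     # system prompt + tool schemas, sent every turn
        cost += ctx * INPUT                     # user request + accumulated tool results
        cost += (final_output if turn == tool_turns else output_per_turn) * OUTPUT
        ctx += tool_result_tokens               # each tool result grows the context
    return cost

# 5 tool calls, ~1.5k tokens per tool result: roughly $0.17 cached and $0.37
# uncached under these assumptions, in the same ballpark as the ranges above.
print(round(task_cost(12_000, 2_000, 5, 1_500, 500, 800, use_cache=True), 3))
print(round(task_cost(12_000, 2_000, 5, 1_500, 500, 800, use_cache=False), 3))
```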
Latency budgets
Cost is half the picture. Latency drives user experience and limits how many agent invocations you can chain.
Rough latency expectations (single inference call, ignoring tool round trips):
- Claude Sonnet 4.5: p50 ~1.5s, p99 ~6s for 500-token output
- Claude Opus 4.5: p50 ~3s, p99 ~12s for 500-token output
- GPT-5: p50 ~2s, p99 ~8s
- GPT-5 mini: p50 ~700ms, p99 ~3s
- Gemini 2.5 Flash: p50 ~600ms, p99 ~2.5s
- Gemini 2.5 Pro: p50 ~2s, p99 ~7s
Tool calls add round-trip latency: model decides, returns tool call, your code executes the tool (often 100ms-2s for an API call), result is fed back to model. A 5-tool-call agent on Sonnet might have a 15-30 second p50 wall-clock time. Users will tolerate that only if you stream intermediate state.
Streaming changes the perceived latency. Time-to-first-token on most providers is under a second. If you can stream tokens or stream tool call events to the user, the agent feels responsive even when total latency is high.
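A minimal streaming sketch using the Anthropic Python SDK is below; the model ID and prompt are placeholders, and the same pattern exists in the OpenAI and Gemini SDKs.

```python
# Stream tokens to the user so perceived latency tracks time-to-first-token,
# not total generation time. Model ID and prompt are placeholders.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with client.messages.stream(
    model="claude-sonnet-4-5",     # placeholder; use whatever model your route selects
    max_tokens=600,
    messages=[{"role": "user", "content": "Summarize this ticket in two sentences."}],
) as stream:
    for text in stream.text_stream:          # yields text deltas as they arrive
        print(text, end="", flush=True)
```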
Routing strategies that cut cost
The cheapest model that gets the job done wins. The trick is knowing which job needs which model.
Cheap model first, escalate on uncertainty. Run Haiku, GPT-5 mini, or Gemini Flash first. If the cheap model produces a confidence signal (structured output with confidence score, or refusal, or "I am not sure"), escalate to a frontier model. For workloads where 70% of cases are routine, this can cut cost 60-80%.
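A sketch of the pattern, assuming your provider wrappers return a structured result with a confidence field (that field is an illustrative convention, not a provider API):

```python
# Cheap-first routing: run the cheap tier, escalate only on low confidence.
from typing import Callable

CONFIDENCE_FLOOR = 0.8   # tune against your eval set

def route(request: str,
          call_cheap: Callable[[str], dict],
          call_frontier: Callable[[str], dict]) -> dict:
    """call_cheap / call_frontier stand in for your own provider wrappers; each is
    assumed to return parsed structured output with `confidence` and `refused` fields."""
    cheap = call_cheap(request)
    if cheap.get("confidence", 0.0) >= CONFIDENCE_FLOOR and not cheap.get("refused", False):
        return cheap                  # routine case: stay on the cheap tier
    return call_frontier(request)     # uncertain or refused: pay for the frontier model
```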
Task-typed routing. Classify the incoming request and route by class. Summarization and classification tasks go to Flash or mini. Reasoning, coding, and long-context tasks go to Sonnet, GPT-5, or Pro. Edge cases or executive escalations go to Opus.
Cascaded fallback. Primary model, cheap fallback if primary is rate limited, expensive fallback if quality is critical. Use circuit breakers to flip between tiers based on observed quality and availability.
Cached system prompt across routes. If you keep the system prompt structure identical across cheap and expensive routes (just changing the model behind it), you preserve cache hit rates within each tier; caches are per-model, so each route warms its own copy.
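In code, task-typed routing is often nothing more than a lookup table, with one shared system prompt so every tier's cache stays warm. The tier names below are placeholders for your own model IDs.

```python
# Task-typed routing table. Keys come from a cheap classifier or heuristics;
# values are placeholders for the model IDs behind each tier.
ROUTES = {
    "summarization":  "cheap-tier",
    "classification": "cheap-tier",
    "coding":         "mid-tier",
    "long_context":   "mid-tier",
    "executive_escalation": "frontier-tier",
}

def pick_tier(task_type: str) -> str:
    return ROUTES.get(task_type, "mid-tier")   # unknown types default to the mid tier

# One system prompt shared across routes. Each tier warms its own cache copy,
# but identical structure keeps every tier's hit rate high.
SYSTEM_PROMPT = "You are a support triage agent. Follow the policy below. ..."
```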
Batching for async workloads
If your workload is not user-facing real-time, batch APIs cut cost in half or more.
- Anthropic Message Batches API: 50% discount, 24-hour SLA
- OpenAI Batch API: 50% discount, 24-hour SLA
- Gemini Batch: 50% discount, similar SLA
Batch is the right answer for:
- Nightly enrichment of CRM records
- Backfilling embeddings or summaries
- Document classification at scale
- Eval suite runs against golden sets
Batch is the wrong answer for:
- User-facing chat
- Real-time agent decisions
- Anything with a sub-minute SLA
Build your platform so the same agent code can run synchronously or via batch with a config flag. The cost savings compound across many use cases.
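A sketch of what that flag can look like; run_sync and enqueue_for_batch are hypothetical helpers standing in for a direct provider call and for whatever queue or JSONL file you later submit to the provider's batch endpoint.

```python
# One agent entry point, two execution modes behind a config flag.
import os
from typing import Callable

BATCH_MODE = os.getenv("AGENT_BATCH_MODE", "false").lower() == "true"

def handle_task(task: dict,
                run_sync: Callable[[dict], dict],
                enqueue_for_batch: Callable[[dict], str]) -> object:
    if BATCH_MODE:
        # Roughly half price, but results land within the provider's 24-hour window.
        return enqueue_for_batch(task)
    # Full price, immediate result: required for user-facing paths.
    return run_sync(task)
```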
Prompt caching for system prompts and tool definitions
The single highest-leverage optimization for most agent platforms. If you have a stable system prompt and stable tool definitions, mark them as cacheable.
Cache hits require the prefix to be identical, byte for byte. Practical rules:
- Put stable content (system instructions, tool schemas, few-shot examples, constant context) at the front of the prompt.
- Mark cache breakpoints at the boundary between stable and variable content.
- Reuse the same prompt structure across requests so prefixes match.
- Keep cache TTL in mind. Anthropic caches expire after 5 minutes by default with an extended 1-hour option. If your traffic is too sparse for the cache to be warm, the math changes.
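With the Anthropic API, marking the breakpoint looks roughly like this; the model ID, tool schema, and prompt text are placeholders, and other providers expose caching differently (OpenAI caches eligible prefixes automatically, Gemini implicitly).

```python
# Stable prefix (tool schemas + system prompt) marked cacheable; per-ticket
# content stays in the messages list, after the cache breakpoint.
from anthropic import Anthropic

client = Anthropic()

STABLE_SYSTEM_PROMPT = "...the ~12k-token triage policy, examples, and instructions..."
ticket_text = "Customer reports a duplicate charge on their March invoice."

response = client.messages.create(
    model="claude-sonnet-4-5",          # placeholder model ID
    max_tokens=600,
    tools=[{                            # illustrative tool schema; part of the cached prefix
        "name": "lookup_customer",
        "description": "Fetch the customer record for a ticket.",
        "input_schema": {"type": "object",
                         "properties": {"customer_id": {"type": "string"}}},
    }],
    system=[{
        "type": "text",
        "text": STABLE_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},   # breakpoint: everything up to here is cached
    }],
    messages=[{"role": "user", "content": ticket_text}],   # variable, never cached
)
```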
A 15,000-token cached system prompt costs roughly:
- First request (cache write): 1.25x normal input cost
- Subsequent requests within TTL: 0.1x normal input cost (Anthropic) or 0.5x (OpenAI)
Across 50,000 multi-turn tasks per month with a warm cache (each task re-sends the cached prefix on every turn), the savings on Sonnet are roughly $4,000-$6,000 per month versus no caching, for a single agent.
Real example: 50k ticket triage agent
A customer service triage agent: read a support ticket, classify it (billing, technical, account, churn risk), generate a suggested response, and either send it directly (low-risk classes) or route to a human (high-risk).
Assumptions:
- 50,000 tickets per month
- 800-token average ticket content
- 12,000-token system prompt + tool schemas (cacheable)
- 1,500-token customer history (per-ticket variable)
- 3 tool calls average (lookup customer, lookup recent orders, check entitlement)
- 600-token final output (classification + suggested response)
- 80% of cases handled by cheap tier, 20% escalated to frontier
Tier 1: Gemini 2.5 Flash, 40,000 tickets
Per ticket:
- Input: 12k cached + 2.3k variable + ~3k tool outputs across turns = ~17.3k tokens. With implicit caching, effective input cost is roughly $0.0035 per ticket.
- Output: ~700 tokens across turns = ~$0.00084 per ticket.
- Total: ~$0.0043 per ticket.
40,000 tickets * $0.0043 = $172 per month.
Tier 2: Claude Sonnet 4.5, 10,000 tickets
Per ticket:
- Input cached: 12k * $0.30/1M = $0.0036
- Input variable: 5.3k * $3/1M = $0.0159
- Output: 700 * $15/1M = $0.0105
- Total: ~$0.030 per ticket
10,000 tickets * $0.030 = $300 per month.
Plus retrieval, tool execution, observability
- Embedding lookups: ~$50/month
- Tool execution compute: ~$200/month
- Observability stack (Langfuse / Helicone tier): ~$300/month
Total
~$1,020 per month for 50,000 tickets = $0.0204 per ticket.
For comparison, if you ran every ticket on Claude Opus 4.5 with no caching and no routing:
- ~$1.80 per ticket
- 50,000 tickets * $1.80 = $90,000 per month
The routing and caching choices are not nice-to-haves. They are the difference between a viable business case and an immediate shutdown.
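Here is the promised calculator, so you can rerun the example with your own token counts and prices. The implicit-cache discount on the Flash tier is approximated as 50% off the prefix, which is an assumption; everything else mirrors the numbers above.

```python
# Per-ticket and monthly cost for the routed triage agent described above.
def ticket_cost(cached_in, variable_in, out, in_price, out_price, cached_price):
    """Token counts in tokens, prices in $ per 1M tokens."""
    return (cached_in * cached_price + variable_in * in_price + out * out_price) / 1e6

# Tier 1: Gemini 2.5 Flash; implicit caching approximated as 50% off the 12k prefix.
flash = ticket_cost(12_000, 5_300, 700, in_price=0.30, out_price=1.20, cached_price=0.15)
# Tier 2: Claude Sonnet 4.5; explicit cache read at 0.1x the input price.
sonnet = ticket_cost(12_000, 5_300, 700, in_price=3.00, out_price=15.00, cached_price=0.30)

fixed = 50 + 200 + 300                          # embeddings, tool compute, observability
total = 40_000 * flash + 10_000 * sonnet + fixed
print(round(flash, 4), round(sonnet, 4))        # ~0.0042 and ~0.03 per ticket
print(round(total), round(total / 50_000, 4))   # ~$1,019/month, ~$0.02 per ticket
```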
Cost optimization checklist
| Lever | Typical savings |
| --- | --- |
| Prompt caching for stable prefix | 40-70% on input tokens |
| Cheap-tier routing for routine cases | 50-80% overall |
| Batch API for async workloads | 50% |
| Output token discipline (concise schemas, stop sequences) | 20-40% on output |
| Context compression / summarization | 20-40% on input growth |
| Tool result trimming (drop unused fields) | 10-30% on multi-turn input |
| Eval-driven model downgrade | 30-60% when feasible |
Observability is the prerequisite
You cannot optimize what you cannot measure. Every recommendation above requires per-task visibility into input tokens, output tokens, cache hits, model used, latency, and outcome. If you have not stood up agent observability and per-task metrics, do that before you start optimizing. Otherwise you are guessing.
Cost engineering for agents borrows directly from the broader discipline of cloud infrastructure cost optimization: right-sizing, reserved capacity, request consolidation, and continuous review.
Latency optimization beyond model choice
Model selection is one lever. The rest of the latency budget is yours to spend or waste.
Parallel tool calls. If two tools are independent (lookup customer + lookup recent orders), fire them in parallel. OpenAI's parallel tool calls and Anthropic's parallel tool use both support this. You can shave 30-50% off the agent's wall-clock time on tool-heavy workflows.
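A sketch with asyncio and two stand-in tools; the lookups and their latencies are illustrative.

```python
# Run independent tool calls concurrently; wall-clock time is bounded by the
# slowest call instead of the sum of all calls.
import asyncio

async def lookup_customer(customer_id: str) -> dict:
    await asyncio.sleep(0.4)                   # stand-in for a ~400ms API call
    return {"customer_id": customer_id, "tier": "pro"}

async def lookup_recent_orders(customer_id: str) -> list:
    await asyncio.sleep(0.6)                   # stand-in for a ~600ms API call
    return [{"order_id": "A-1", "status": "shipped"}]

async def gather_context(customer_id: str):
    # Sequential would take ~1.0s; concurrent takes ~0.6s.
    return await asyncio.gather(
        lookup_customer(customer_id),
        lookup_recent_orders(customer_id),
    )

customer, orders = asyncio.run(gather_context("cust_123"))
```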
Speculative execution. For predictable next steps, kick off the likely tool call before the model fully decides. If the model picks a different path, you waste the speculative call. If it picks the predicted one, you skip the round-trip wait.
Edge inference. For very latency-sensitive use cases, smaller models hosted closer to the user (Cloudflare Workers AI, regional inference endpoints) cut tens to hundreds of milliseconds. Not every workload tolerates the quality drop.
Connection pooling and HTTP/2. Keep persistent connections to the model provider. Cold TLS handshakes add 100-300ms per request that you should not be paying repeatedly.
Aggressive output schemas. If you constrain output to a 200-token JSON schema, you pay for 200 tokens of generation time instead of 2,000. Force-stop with stop sequences. Use structured output where the provider supports it.
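One way to enforce that discipline is to define the schema once and reuse it everywhere. The fields below are illustrative; pass the schema through whichever structured-output mechanism your provider supports.

```python
# A compact output schema: a few hundred tokens of JSON instead of a free-form essay.
from typing import Literal
from pydantic import BaseModel, Field

class TriageDecision(BaseModel):
    category: Literal["billing", "technical", "account", "churn_risk"]
    confidence: float = Field(ge=0.0, le=1.0)
    route_to_human: bool
    suggested_response: str = Field(max_length=600)   # hard cap on the one verbose field
```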
Cost monitoring you should have on day one
Token economics are not a one-time analysis. They drift. A prompt change adds 500 tokens to every request. A new tool returns verbose JSON. A retrieval system starts pulling 30 chunks instead of 10. Without continuous monitoring, your per-task cost creeps up unnoticed.
Minimum monitoring set:
- Cost per task by agent, by model, by tier
- Token count distribution by request type (p50, p90, p99 input and output)
- Cache hit rate for prompt-caching-enabled agents
- Cost per outcome (cost per resolved ticket, cost per completed task)
- Budget alerts at 50%, 80%, 100% of monthly budget per agent
Most teams underestimate the value of "cost per outcome" until they have it. A more expensive model that doubles success rate may be cheaper per resolved ticket. A cheaper model that gets the right answer 70% of the time costs you twice in retries, escalations, and customer dissatisfaction.
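The arithmetic is simple enough to keep next to your dashboards. The numbers below are illustrative, not measurements; the point is the shape of the comparison.

```python
# Cost per resolved ticket: charge failed attempts for the escalation they trigger.
def cost_per_resolved(cost_per_task: float, resolution_rate: float,
                      escalation_cost: float = 0.0) -> float:
    return (cost_per_task + (1 - resolution_rate) * escalation_cost) / resolution_rate

cheap    = cost_per_resolved(0.02, resolution_rate=0.50, escalation_cost=0.05)
frontier = cost_per_resolved(0.05, resolution_rate=0.95)
print(round(cheap, 3), round(frontier, 3))   # ~0.09 vs ~0.053: the pricier model wins per outcome
```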
Common cost pitfalls
A short list of things that quietly burn money:
- Tool outputs that include kilobytes of irrelevant fields. Trim before returning.
- Retries that compound on the same expensive call. Use exponential backoff with caps (see the sketch after this list).
- Streaming consumers that never close, holding context alive. Set hard timeouts.
- Eval suites running against frontier models on every CI commit. Use cheap models for fast evals, frontier for nightly.
- Long context windows kept full when summarization would do. Treat context like RAM.
- Multiple agents calling the same expensive tool. Cache tool results within a task.
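A minimal backoff wrapper for the retry item above; in practice you would catch only transient errors rather than every exception.

```python
# Exponential backoff with a cap and a hard attempt budget, plus jitter so
# simultaneous retries do not stampede the same endpoint.
import random
import time

def call_with_backoff(fn, max_attempts: int = 4, base: float = 0.5, cap: float = 8.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:                  # in production, catch only transient errors
            if attempt == max_attempts - 1:
                raise                      # budget exhausted: surface the failure
            delay = min(cap, base * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay))   # add jitter
```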
Next steps
If you are about to scale an agent past 10,000 tasks per day and have not modeled the unit economics, this is the right week to do it. We help teams build cost-aware agent platforms with routing, caching, and observability from day one. Talk to us before the bill surprises your finance team.
Ready to ship the next outcome?
One Frequency Consulting brings 25+ years of technology leadership and military discipline to every engagement. First call is operator-grade scoping — sixty minutes, no charge.