
The Real Cost of Running AI Agents (And How to Cut It by 90%)

We broke down the actual infrastructure costs of running AI agents in production. The numbers are worse than you think — but fixable.

Maritime Team · January 28, 2026 · 7 min read

Everyone talks about the cost of LLM API calls. Few people talk about the infrastructure cost of keeping agents running. That's where the real money goes.

Where the Money Actually Goes

We surveyed 200 teams running AI agents in production. Here's where their money goes:

| Cost Category | % of Total Spend |
| --- | --- |
| LLM API calls | 25-35% |
| Compute (VMs/containers) | 40-50% |
| Storage & databases | 10-15% |
| Networking & egress | 5-10% |
| DevOps & monitoring | 5-10% |

The surprise: compute costs are larger than LLM costs for most teams. Why? Because agents need to be running to receive requests, even when no requests are coming in.

The Idle Tax

A typical AI agent on a t3.medium EC2 instance costs about $30/month. If that agent handles 100 requests per day, each taking 30 seconds, it's actively working for about 50 minutes out of 1,440 minutes in a day.

That's 3.5% utilization. You're paying full price for a machine that's idle 96.5% of the time.

Scale that to 10 agents and you're spending $300/month for infrastructure that's mostly doing nothing. Scale to 50 agents and it's $1,500/month — the cost of a junior developer.
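The idle-tax arithmetic above is easy to verify in a few lines:

```python
def utilization(requests_per_day: int, seconds_per_request: float) -> float:
    """Fraction of the day an always-on agent is actually working."""
    busy_minutes = requests_per_day * seconds_per_request / 60
    return busy_minutes / (24 * 60)

def monthly_idle_cost(instance_cost: float, requests_per_day: int,
                      seconds_per_request: float) -> float:
    """Dollars per month spent on idle time for one always-on instance."""
    return instance_cost * (1 - utilization(requests_per_day, seconds_per_request))

u = utilization(100, 30)           # 100 requests/day, 30 s each
print(f"utilization: {u:.1%}")     # → 3.5%
print(f"idle spend on one $30/mo instance: "
      f"${monthly_idle_cost(30, 100, 30):.2f}")  # → $28.96
```

Nearly $29 of every $30 instance goes to waiting, and the number scales linearly with fleet size.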

Why Traditional Serverless Doesn't Work

The obvious answer is serverless — Lambda, Cloud Run, Cloud Functions. But AI agents aren't stateless HTTP handlers. They have:

  • Large dependencies — ML libraries, model weights, vector stores
  • Long execution times — Agent workflows can run for minutes
  • State requirements — Conversation history, tool state, memory
  • Cold start sensitivity — Loading models takes 10-30 seconds

Lambda's 15-minute timeout, 10 GB container image limit, and cold start latency make it a poor fit for most agents. Cloud Run fares better, but an always-on instance still bills per second of container uptime, not per invocation.

The Sleep/Wake Model

Maritime's approach targets the specific economics of AI agents:

  1. Checkpoint on idle — After a configurable timeout, your agent's entire state is serialized and stored
  2. Restore on invoke — When a request arrives, the checkpoint is loaded in under 2 seconds
  3. Bill per invocation — You pay for the time your agent is actually processing, not the time it's waiting
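The three steps above can be sketched in miniature. This is an illustrative toy (pickle to a local file, a hypothetical `IDLE_TIMEOUT` constant), not Maritime's actual mechanism, which handles checkpointing transparently:

```python
import pickle
from pathlib import Path

CHECKPOINT = Path("/tmp/agent.ckpt")  # illustrative path
IDLE_TIMEOUT = 300                    # seconds of idle before checkpointing (hypothetical)

class Agent:
    """Toy agent whose entire state is its conversation history."""
    def __init__(self, history=None):
        self.history = history or []

    def handle(self, message: str) -> str:
        self.history.append(message)
        return f"processed {len(self.history)} messages"

def checkpoint(agent: Agent) -> None:
    # Step 1: serialize the agent's entire state on idle.
    CHECKPOINT.write_bytes(pickle.dumps(agent.history))

def restore() -> Agent:
    # Step 2: rebuild the agent from its checkpoint on the next invoke.
    return Agent(pickle.loads(CHECKPOINT.read_bytes()))

agent = Agent()
agent.handle("hello")
checkpoint(agent)          # the platform would do this after IDLE_TIMEOUT
woken = restore()          # ...and this when the next request arrives
print(woken.handle("hi"))  # → processed 2 messages
```

Between `checkpoint` and `restore` no compute is running, which is why step 3's per-invocation billing becomes possible.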

The math changes dramatically:

| Setup | 10 Agents | 50 Agents | 100 Agents |
| --- | --- | --- | --- |
| EC2 (t3.medium) | $300/mo | $1,500/mo | $3,000/mo |
| ECS Fargate | $250/mo | $1,250/mo | $2,500/mo |
| Maritime Smart | $10/mo | $50/mo | $100/mo |

That's not a marginal improvement. It's a 95%+ reduction in infrastructure costs.

What You Give Up

Sleep/wake isn't a free lunch. There are tradeoffs:

  • Wake latency — 1-2 seconds of cold start when waking from sleep. For most agent use cases (webhooks, scheduled tasks, async processing), this is invisible. For real-time chat, it's noticeable on the first message.
  • Memory limits — Checkpoint size is bounded. If your agent holds gigabytes of in-memory state, you'll need the Always-On tier.
  • Concurrent requests — While waking, incoming requests are queued. High-concurrency agents should use the Extended or Always-On tiers.

Quick Wins to Reduce Costs Today

Even without Maritime, you can reduce your agent infrastructure costs:

1. Right-size Your Instances

Most agents don't need 4 vCPUs and 16GB of RAM. Profile your agent's actual resource usage and downsize. A t3.small ($15/mo) handles many agent workloads just fine.

2. Use Spot Instances for Non-Critical Agents

If your agent can tolerate occasional restarts, spot instances cut compute costs by 60-90%. This works well for batch processing agents and non-customer-facing workflows.

3. Implement Request Batching

Instead of processing events one at a time, batch them. If your agent processes webhooks, queue incoming events and process them in batches every few minutes. Fewer wake cycles, lower costs.
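A minimal sketch of that batching pattern, assuming a simple in-memory queue flushed when it reaches a size or age threshold (all names here are illustrative):

```python
import time

class WebhookBatcher:
    """Queue incoming events and flush them in batches (size- or time-triggered)."""
    def __init__(self, max_batch=50, max_wait_s=120):
        self.max_batch, self.max_wait_s = max_batch, max_wait_s
        self.events, self.last_flush = [], time.monotonic()
        self.processed_batches = []  # stand-in for the real batch handler

    def add(self, event) -> None:
        self.events.append(event)
        if (len(self.events) >= self.max_batch
                or time.monotonic() - self.last_flush >= self.max_wait_s):
            self.flush()

    def flush(self) -> None:
        if self.events:
            # One wake cycle handles the whole batch instead of N separate ones.
            self.processed_batches.append(self.events)
            self.events, self.last_flush = [], time.monotonic()

batcher = WebhookBatcher(max_batch=3)
for i in range(7):
    batcher.add({"id": i})
batcher.flush()  # drain the remainder
print([len(b) for b in batcher.processed_batches])  # → [3, 3, 1]
```

Seven events trigger three wake cycles instead of seven; in production the queue would live in something durable like SQS or Redis rather than process memory.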

4. Cache LLM Responses

Many agent queries are repetitive. A simple cache layer in front of your LLM calls can reduce API costs by 30-50% for agents that handle common questions.
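One way to sketch such a cache layer: memoize on a normalized prompt hash, with a stub standing in for a real LLM client so the example runs offline:

```python
import hashlib

def cache_key(prompt: str) -> str:
    # Normalize whitespace and case so trivially different phrasings collide.
    return hashlib.sha256(" ".join(prompt.lower().split()).encode()).hexdigest()

class CachedLLM:
    """Exact-match cache in front of any LLM client callable."""
    def __init__(self, llm_call):
        self.llm_call = llm_call   # any callable: prompt -> completion
        self.cache, self.calls = {}, 0

    def complete(self, prompt: str) -> str:
        key = cache_key(prompt)
        if key not in self.cache:
            self.calls += 1        # only cache misses cost API money
            self.cache[key] = self.llm_call(prompt)
        return self.cache[key]

# Stub model so the example needs no API key.
llm = CachedLLM(lambda p: f"answer to: {p}")
llm.complete("What are your hours?")
llm.complete("what are  your hours?")  # cache hit after normalization
print(llm.calls)  # → 1
```

Exact-match caching only helps with genuinely repeated queries; semantic caching (embedding similarity) catches more but risks returning stale or subtly wrong answers.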

5. Set Token Budgets

Without hard limits, a single runaway agent execution can burn through your monthly LLM budget in hours. Set per-request and per-day token budgets.
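A sketch of such a guard, with hard caps that raise instead of silently overspending; the limit values here are arbitrary examples:

```python
class TokenBudget:
    """Enforce per-request and per-day token caps for an agent."""
    def __init__(self, per_request=8_000, per_day=500_000):
        self.per_request, self.per_day = per_request, per_day
        self.spent_today = 0  # reset by a daily scheduler in real use

    def charge(self, tokens: int) -> None:
        if tokens > self.per_request:
            raise RuntimeError(
                f"request used {tokens} tokens (cap {self.per_request})")
        if self.spent_today + tokens > self.per_day:
            raise RuntimeError("daily token budget exhausted")
        self.spent_today += tokens

budget = TokenBudget(per_request=1_000, per_day=2_500)
budget.charge(900)
budget.charge(800)
try:
    budget.charge(900)  # would push the day's total to 2,600
except RuntimeError as e:
    print(e)  # → daily token budget exhausted
```

Call `charge` with the token count from each LLM response before issuing the next request; a runaway loop then fails fast instead of burning budget for hours.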

The Bottom Line

Infrastructure costs dominate AI agent budgets, not LLM costs. The fix is architectural — stop paying for idle compute. Whether you use Maritime's sleep/wake model or optimize your existing setup, the goal is the same: pay for work, not for waiting.