
Agent Operational Cost Management

Most agent writeups focus on capabilities — what agents can do. Tool use, reasoning chains, multi-step planning. But when you run AI agents as production infrastructure — always-on, polling for work, coordinating with each other — there’s a question that matters just as much: what does it cost?

This project runs two autonomous agents on a home Kubernetes cluster. They operate 24/7 with heartbeat polling, autonomous task pickup via cron jobs, sub-agent spawning, and inter-agent coordination. Each of these operations burns API tokens. Without cost visibility and operational discipline, you’re running a system where the bill is a surprise every month.

The goal: treat AI agents like any other production service. Instrument them. Optimize them. Know what you’re spending and why.

The first step was visibility. You can’t optimize what you can’t measure.

The platform (OpenClaw) provides session-level metrics — token counts, model usage, session duration, and cost estimates per API call. The challenge was turning that raw session data into something actionable. Four dimensions matter:

  • Token consumption per session type — main conversations, heartbeat polls, cron jobs, sub-agents, isolated sessions
  • Model usage distribution — which operations use which model tier
  • Session count by trigger — how many sessions are human-initiated vs. autonomous
  • Cost per operation category — what percentage of spend goes to each function

A weekly reporting cron generates a cost summary every Monday, delivered to a private Discord channel. The report breaks down:

  • Total API spend for the trailing 7 days
  • Session count by type (cron, heartbeat, main, sub-agent)
  • Token usage split (input vs. output vs. cache hits)
  • Top 5 most expensive individual sessions
  • Trend comparison vs. previous week
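The aggregation behind that report is straightforward. A minimal sketch, assuming a list of session records with hypothetical field names (the actual OpenClaw metrics schema may differ):

```python
from collections import defaultdict

# Illustrative session records; field names are assumptions,
# not the real OpenClaw schema.
SESSIONS = [
    {"id": "s1", "type": "heartbeat", "cost": 0.0003},
    {"id": "s2", "type": "cron",      "cost": 0.013},
    {"id": "s3", "type": "main",      "cost": 0.525},
    {"id": "s4", "type": "sub-agent", "cost": 0.074},
]

def weekly_summary(sessions, prev_total=None):
    """Roll trailing-7-day session records into the report's line items."""
    total = sum(s["cost"] for s in sessions)
    by_type = defaultdict(int)
    for s in sessions:
        by_type[s["type"]] += 1
    top = sorted(sessions, key=lambda s: s["cost"], reverse=True)[:5]
    summary = {
        "total_cost": round(total, 4),
        "sessions_by_type": dict(by_type),
        "top_sessions": [s["id"] for s in top],
    }
    if prev_total is not None:
        summary["delta_vs_prev"] = round(total - prev_total, 4)
    return summary
```

Formatting the result into a Discord message is then a template away; the point is that every line of the Monday report is a one-pass fold over session records.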

The key insight: most of the cost isn’t in human conversations. It’s in the autonomous background operations — the cron jobs, heartbeats, and sub-agents that run whether or not anyone is talking to the agent. This shaped every optimization that followed.

Running every operation on the most capable (and most expensive) model is the default path. It’s also wildly inefficient.

Claude model pricing varies dramatically by tier:

  • Opus 4.6 — $15/M input, $75/M output tokens. Deep reasoning, complex multi-step tasks. Current main session model.
  • Sonnet 4.6 — $3/M input, $15/M output tokens. Strong general capability, 5x cheaper than Opus. Default for sub-agents and research tasks.
  • Haiku 4.5 — $0.25/M input, $1.25/M output tokens. Fast, cheap, good for routine operations. 60x cheaper than Opus. Used for heartbeats and lightweight cron jobs.
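Under these rates, per-call cost is simple arithmetic. A minimal helper with the pricing copied from the tiers above (cache discounts ignored):

```python
# Per-million-token pricing from the tiers above (USD).
PRICING = {
    "opus":   {"input": 15.00, "output": 75.00},
    "sonnet": {"input": 3.00,  "output": 15.00},
    "haiku":  {"input": 0.25,  "output": 1.25},
}

def call_cost(model, input_tokens, output_tokens):
    """Estimated cost of one API call, ignoring cache-hit discounts."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

Because both Haiku rates are exactly 1/60th of the Opus rates, the 60x ratio holds for any input/output mix.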

The question for each operation type: what’s the minimum viable model?

After profiling actual agent workloads, the mapping looks like this:

  • Heartbeat polls → Haiku. Reading a short checklist, checking if anything needs attention, replying “HEARTBEAT_OK” 90% of the time. Haiku handles this perfectly. This is the single biggest cost saver — heartbeats run every 30 minutes, 24/7.

  • Autonomous task pickup → Sonnet. Needs to read Jira tickets, evaluate complexity, check for blockers, and make a go/no-go decision. Too nuanced for Haiku, doesn’t need Opus-level reasoning. Now webhook-triggered rather than cron-polled — the agent only wakes when there’s actually work to evaluate.

  • Cron job delivery → Haiku or Sonnet depending on content. Simple notifications and formatting are Haiku territory. Anything requiring judgment gets Sonnet. Per-cron model overrides allow fine-grained control — each cron job specifies its own model tier.

  • Sub-agents for research tasks → Sonnet. Research spikes need good comprehension and synthesis but rarely need Opus-level deep reasoning.

  • Sub-agents for complex deliverables → Opus. Writing case studies, designing systems, producing architecture docs — this is where Opus earns its cost.

  • Main session conversations → Opus. When a human is actively talking to the agent, response quality matters most. This is the smallest volume category anyway.

Moving heartbeat polls alone from Opus to Haiku cuts their cost by ~98%. For a system running two agents with 48 heartbeats per day each, that’s significant. Combined with cron job optimization, the model selection strategy reduces autonomous operation costs by roughly 70-80% compared to an all-Opus baseline.
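The ~98% figure falls out of the pricing ratio. A worked sketch with illustrative per-poll token counts (the 1,000/20 split is an assumption, not measured data):

```python
# Rough daily heartbeat cost for two agents at 48 polls/day each,
# assuming ~1,000 input and ~20 output tokens per poll (illustrative).
POLLS_PER_DAY = 48 * 2
IN_TOK, OUT_TOK = 1_000, 20

def daily_cost(in_price_per_m, out_price_per_m):
    return POLLS_PER_DAY * (IN_TOK * in_price_per_m + OUT_TOK * out_price_per_m) / 1e6

opus_daily = daily_cost(15.00, 75.00)
haiku_daily = daily_cost(0.25, 1.25)
savings = 1 - haiku_daily / opus_daily  # exactly 1 - 1/60 ≈ 0.983
```

Since every Haiku rate is 1/60th of the Opus rate, the savings fraction is ~98.3% regardless of how the token mix is assumed.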

The configuration is straightforward — OpenClaw supports per-session model overrides:

```yaml
# Cron job with explicit model selection
name: "autonomous-pickup"
schedule:
  kind: cron
  cron: "30 * * * *"
  tz: "America/Chicago"
model: "anthropic/claude-sonnet-4-20250514"
sessionTarget: isolated
payload:
  kind: agentTurn
  message: "Check for available tasks..."
delivery:
  mode: none

# Heartbeat with Haiku for cost efficiency
heartbeat:
  model: "anthropic/claude-haiku-4-20250414"
  intervalMinutes: 30
```

This one was learned the hard way.

During early autonomous operation, a cron job ran overnight and its output was delivered to the wrong channel. Principal B received raw agent session output — unfiltered train-of-thought reasoning — on Signal during quiet hours. Not a security breach in the traditional sense, but a failure of operational discipline.

The cron job was created without an explicit delivery.channel parameter. The platform defaulted to the main session’s channel — which happened to be Signal, pointed at Principal B. The agent’s internal reasoning about task evaluation and Jira queries showed up as Signal messages at 2 AM.

The lesson crystallized into a hard rule: every cron job must have an explicit delivery configuration. Never rely on defaults.

```yaml
# WRONG — delivery defaults to main session channel
name: "my-cron-job"
schedule:
  kind: cron
  cron: "0 * * * *"
payload:
  kind: agentTurn
  message: "Do the thing"
# No delivery config = surprise messages
```

```yaml
# RIGHT — explicit delivery control
name: "my-cron-job"
schedule:
  kind: cron
  cron: "0 * * * *"
payload:
  kind: agentTurn
  message: "Do the thing"
delivery:
  mode: none  # Agent handles its own output
```

The conservative pattern: set delivery.mode to none and have the cron payload explicitly send messages to the intended channel using the message tool. This makes the routing visible in the cron definition rather than implicit in platform defaults.

Three rules now govern cron job creation:

  1. Explicit delivery channel — every cron job specifies where output goes. No exceptions.
  2. Quiet hours awareness — notifications respect per-principal quiet hours. UTC 00–14 for Signal means cron output goes to Discord during those hours, or simply waits.
  3. Cron job audit — periodic review of all active cron jobs to verify delivery configuration. New cron jobs get a checklist review before activation.
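Rule 3 is mechanical enough to automate. A sketch of the audit check — the dict shape mirrors the YAML above, but the audit helper itself is hypothetical, not an OpenClaw API:

```python
# Flag any cron job that relies on default delivery routing.
def audit_cron_jobs(jobs):
    """Return names of jobs missing an explicit delivery config."""
    return [job["name"] for job in jobs if "delivery" not in job]

jobs = [
    {"name": "weekly-cost-report", "delivery": {"mode": "none"}},
    {"name": "my-cron-job"},  # no delivery block -> flagged
]
```

Run against the full list of active cron definitions, this turns the periodic audit into a pass/fail check instead of a manual read-through.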

These rules are embedded in the agent’s operational memory as “TRAP” entries — high-priority patterns that get loaded on every session to prevent repeat failures.

Autonomous task pickup originally ran on a cron schedule — agents periodically checked Jira for available work, evaluated whether they could handle it, and either picked it up or passed. (Pickup has since moved to webhook triggers, but the polling-era optimizations still illustrate the cost dynamics.) The optimization challenge: minimize empty polls without missing available work.

Early implementation ran pickup every 15 minutes. In practice, new tickets appear a few times per day, not a few times per hour. That means 90%+ of pickup cron runs found nothing to do — but each one still consumed API tokens to query Jira, evaluate the empty result, and report “nothing found.”

The current configuration runs pickup hourly, staggered between agents (Agent A at :15, Agent B at :30). This means:

  • Available work gets picked up within 30-45 minutes on average
  • Token burn from empty polls drops by 75% vs. the 15-minute schedule
  • Staggering prevents both agents from evaluating the same ticket simultaneously

The pickup cron payload itself was optimized to reduce token usage on empty results:

```yaml
payload:
  kind: agentTurn
  message: |
    Search Jira for available FAD tickets:
    - Status: To Do
    - Labels: work:unassigned OR (work:assigned AND your agent label)
    - Fields: summary, status, priority, labels, issuelinks
    If no tickets found, reply with just "No tasks available"
    and stop. Do not elaborate.
    If tickets found, evaluate complexity, check for blockers
    via issuelinks, and pick up if appropriate.
```

The key optimization: short-circuit on empty results. If there’s nothing to pick up, don’t burn tokens reasoning about it. The instruction to reply minimally on empty results saves output tokens on every empty poll.

A work-in-progress guard prevents agents from picking up new work when they already have tasks in progress:

  • Before evaluating available tickets, check for any tickets in “In Progress” with the agent’s label
  • If WIP exists, skip pickup entirely — don’t even query for new work
  • This prevents ticket accumulation and ensures agents finish what they started
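The guard's control flow is small. A sketch where `search_issues` stands in for a Jira search call (the helper name and JQL strings are illustrative, not the actual tool interface):

```python
# Sketch of the WIP guard: bail out before querying for new work.
def should_pick_up(search_issues, agent_label):
    """Return True only if the agent is idle and unassigned work exists."""
    wip = search_issues(f'status = "In Progress" AND labels = {agent_label}')
    if wip:
        return False  # finish what was started; skip the backlog query entirely
    backlog = search_issues('status = "To Do" AND labels = work:unassigned')
    return bool(backlog)

# Tiny fake for demonstration: routes queries by status clause.
def make_search(in_progress, to_do):
    return lambda jql: in_progress if "In Progress" in jql else to_do
```

Note the ordering: the WIP check runs first, so an agent with work in flight pays for one cheap query instead of a full backlog evaluation.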

The WIP guard also surfaced an operational issue: tickets that got stuck in “In Progress” because a sub-agent completed the work but didn’t transition the ticket. The fix was including Jira transitions in sub-agent task prompts, so the agent that does the work also closes the ticket.

The combined effect of these optimizations:

  • Cost visibility — weekly reports make API spend predictable and auditable
  • Model tiering — 70-80% cost reduction on autonomous operations vs. all-Opus baseline
  • Delivery discipline — zero unintended message deliveries since implementing explicit routing
  • Poll efficiency — 75% reduction in empty pickup polls through frequency tuning and short-circuit patterns
  • Event-driven pickup — Jira webhooks now trigger task pickup through the MCP Gateway. When a ticket transitions to a work-ready state, the webhook fires directly to the agent. This eliminated empty polls entirely — agents only wake when there’s work to evaluate. The token savings compound with model tiering: no Sonnet spend on evaluating empty backlogs.
  • Per-cron model overrides — each cron job now specifies its model tier directly in the schedule definition. This replaced the earlier pattern of relying on session-level defaults, which made it hard to audit which operations ran on which model.
  • Story point estimation — the backlog planning skill assigns story points in kilotokens (1 point = 1,000 tokens) to planned tasks. This makes token cost visible at the Jira board level before work starts, enabling capacity planning against a token budget rather than discovering spend after the fact.

Lobster pipelines introduced a new dimension to cost management. Pipelines use llm-task steps — structured JSON-in/JSON-out LLM calls that specify their own model tier per step. This means:

  • Per-step model selection — a DoD verification pipeline can use Haiku for field-presence checks and Sonnet only for the final compliance judgment. Each step pays for exactly the model tier it needs.
  • No conversation overhead — llm-task calls don’t carry conversation history. Each step gets only the input it needs, dramatically reducing input token counts compared to running the same logic in a full agent session.
  • Deterministic token budgets — pipeline steps have predictable input sizes (structured JSON schemas), making token consumption forecastable rather than variable.

The net effect: operations that previously required a full agent session (with all its loaded context, memory files, and conversation history) now run as lightweight, schema-validated LLM calls at a fraction of the cost.
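Per-step tiering makes a pipeline's budget a closed-form sum. A sketch using the rates from the pricing tiers above — the step names and token counts are assumptions for illustration, not measured pipeline data:

```python
# Illustrative per-step budget for a verification pipeline: each
# llm-task step declares its own model tier and a schema-bounded
# input size, so total cost is forecastable before the run.
PRICE_IN = {"haiku": 0.25, "sonnet": 3.00}    # $/M input tokens
PRICE_OUT = {"haiku": 1.25, "sonnet": 15.00}  # $/M output tokens

steps = [
    {"name": "field-presence-check", "model": "haiku",  "in": 1_200, "out": 100},
    {"name": "evidence-extraction",  "model": "haiku",  "in": 2_000, "out": 300},
    {"name": "compliance-judgment",  "model": "sonnet", "in": 3_000, "out": 500},
]

def pipeline_cost(steps):
    return sum(
        (s["in"] * PRICE_IN[s["model"]] + s["out"] * PRICE_OUT[s["model"]]) / 1e6
        for s in steps
    )
```

Because step inputs are bounded by their JSON schemas, these numbers are upper bounds rather than averages — which is what makes budgeting against them possible.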

A few directions remain open:

  • OTel-based actual-vs-estimated tracking — comparing story point estimates against actual token consumption per ticket. Currently estimated from method-based averages; real telemetry would enable continuous calibration.
  • Dynamic model selection — automatically choosing model tier based on task complexity assessment rather than static operation-type mapping.
  • Cost alerting — automated alerts when daily spend exceeds thresholds, similar to cloud infrastructure cost alerts.

Running AI agents in production isn’t just a capabilities problem — it’s an operations problem. The same engineering discipline that applies to any production service applies here: instrument everything, optimize hot paths, fail gracefully, and know what you’re spending.

Most agent frameworks focus on making agents smarter. This work focuses on making them economical. Both matter if you want agents that run sustainably rather than as expensive demos.