Agent Operational Cost Management

The Problem Nobody Talks About

Most agent writeups focus on capabilities — what agents can do. Tool use, reasoning chains, multi-step planning. But when you run AI agents as production infrastructure — always-on, polling for work, coordinating with each other — there’s a question that matters just as much: what does it cost?

This project runs autonomous agents on a home Kubernetes cluster. They operate 24/7 with heartbeat polling, autonomous task pickup, sub-agent spawning, inter-agent coordination, and now headless worker execution. Each of these operations burns scarce resources: API tokens, model rate-limit capacity, scheduler time, and human attention when something fails. Without cost visibility and operational discipline, you’re running a system where the bill — or the bottleneck — is a surprise every month.

The goal: treat AI agents like any other production service. Instrument them. Optimize them. Know what you’re spending, where the rate limits are, and why each model call exists.

Cost Tracking and Weekly Reporting

The first step was visibility. You can’t optimize what you can’t measure.

The platform (OpenClaw) provides session-level metrics — token counts, model usage, session duration, and cost estimates per API call. The challenge was turning raw session data into something actionable.

What Gets Tracked

Token consumption per session type — main conversations, heartbeat polls, cron jobs, sub-agents, isolated sessions
Model usage distribution — which operations use which model tier
Session count by trigger — how many sessions are human-initiated vs. autonomous
Cost per operation category — what percentage of spend goes to each function
Rate-limit headroom — which Foundry deployment pool is doing the work, so load can be spread before RPM/TPM limits become the failure mode

The Weekly Report Pattern

A weekly reporting cron job generates a cost summary every Monday and delivers it to a private Discord channel. The report breaks down:

Total API spend for the trailing 7 days
Session count by type (cron, heartbeat, main, sub-agent)
Token usage split (input vs. output vs. cache hits)
Top 5 most expensive individual sessions
Trend comparison vs. previous week
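As a rough illustration, the report fields above can be derived from per-session records with a few aggregations. This is a minimal sketch, not the platform's reporting code: the Session record and its field names are assumptions made for the example, and the real session metrics may be shaped differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Session:
    # Hypothetical record shape; field names are illustrative only.
    session_id: str
    kind: str  # "cron", "heartbeat", "main", or "sub-agent"
    cost_usd: float
    input_tokens: int
    output_tokens: int
    cache_hit_tokens: int


def weekly_summary(current: list, previous: list) -> dict:
    """Summarize the trailing 7 days and compare against the prior week."""
    total = sum(s.cost_usd for s in current)
    prev_total = sum(s.cost_usd for s in previous)
    top = sorted(current, key=lambda s: s.cost_usd, reverse=True)[:5]
    return {
        "total_usd": round(total, 2),
        "sessions_by_kind": dict(Counter(s.kind for s in current)),
        "tokens": {
            "input": sum(s.input_tokens for s in current),
            "output": sum(s.output_tokens for s in current),
            "cache_hits": sum(s.cache_hit_tokens for s in current),
        },
        "top_sessions": [s.session_id for s in top],
        # None when there is no previous week to compare against.
        "change_vs_previous_pct": (
            round(100 * (total - prev_total) / prev_total, 1) if prev_total else None
        ),
    }
```

Grouping by session kind is what surfaces the background-versus-human split: the heartbeat and cron buckets can be compared directly against the main bucket.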

The key insight: most of the cost isn’t in human conversations. It’s in the autonomous background operations — the cron jobs, heartbeats, and sub-agents that run whether or not anyone is talking to the agent. This shaped every optimization that followed.

Model Selection Strategy

Running every operation on the most capable model is the default path. It’s also wildly inefficient. Even when the direct dollar price is zero, the system still pays in latency, context overhead, model quota, and rate-limit contention.

From Price Tiers to Capacity Tiers

The early version of this project optimized around provider list prices:

Opus-class models — deep reasoning, complex multi-step tasks, expensive enough to reserve for work that genuinely needs them.
Sonnet-class models — strong general capability, suitable for most autonomous task execution.
Haiku-class models — fast, cheap, and good for routine operations like heartbeats, notification formatting, and simple checks.

That still matters for external providers, but the June 2026 operating model changed the center of gravity: agent traffic now routes through LiteLLM to internal Foundry models that are effectively zero-dollar for this environment. The optimization target is no longer just “minimize dollars.” It is to choose the smallest model that preserves quality while spreading load across separate RPM/TPM pools.

Semantic Model Tier Aliases

To avoid hardcoding concrete model IDs into every consumer, the LiteLLM gateway now exposes semantic aliases:

triage — fast, low-burden model lane for routing, lightweight classification, wake payloads, and cheap pipeline steps. Primary: foundry-gpt-5.4-mini; fallback chain includes cheaper/faster alternatives.
standard — default general-purpose lane for normal autonomous work. Primary: foundry-gpt-5.4; fallback includes a Claude Sonnet pool when the GPT lane is unavailable.
heavy — highest-capability lane for complex synthesis and difficult reasoning. Primary: foundry-gpt-5.5; fallback includes Claude Opus/Sonnet pools.

This shipped as the model-tier alias and cross-provider fallback work. The important design choice: consumers ask for intent (triage, standard, heavy), while LiteLLM owns the concrete model routing and fallback chain. When a primary pool is rejected or exhausted, the request degrades to another configured pool instead of failing outright.
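The degrade-instead-of-fail behavior can be pictured as an ordered walk through each tier's chain. The sketch below is plain Python written for illustration; it is not LiteLLM's API or configuration format. The primary model names come from the tier list above, while the fallback pool names, PoolUnavailable, and the call_pool callback are placeholders invented for the example.

```python
class PoolUnavailable(Exception):
    """Raised by the caller-supplied function when a pool rejects a request."""


# Primaries come from the tier descriptions; fallback names are placeholders.
TIER_CHAINS = {
    "triage": ["foundry-gpt-5.4-mini", "fallback-fast-pool"],
    "standard": ["foundry-gpt-5.4", "fallback-sonnet-pool"],
    "heavy": ["foundry-gpt-5.5", "fallback-opus-pool", "fallback-sonnet-pool"],
}


def complete_with_fallback(tier, prompt, call_pool):
    """Try each pool in the tier's chain; return (pool, response) on first success."""
    errors = []
    for pool in TIER_CHAINS[tier]:
        try:
            return pool, call_pool(pool, prompt)
        except PoolUnavailable as exc:
            errors.append(f"{pool}: {exc}")
    detail = "; ".join(errors)
    raise PoolUnavailable(f"all pools failed for tier {tier!r}: {detail}")
```

The consumer only ever names the tier. Swapping a primary or reordering a chain is a change to TIER_CHAINS alone, which mirrors the split of responsibilities described above.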

Dedicated Agent Keys and Budget Caps

A real incident tightened the design. Costaff went silent because it was sharing another agent’s LiteLLM virtual key, and that shared key hit a zero-dollar budget cap. The user-visible failure looked like a provider/schema error, which was a dangerous red herring; the authoritative evidence was in LiteLLM’s own budget logs.

The fix was operational, not just technical:

Each agent lane has its own virtual key instead of sharing one catch-all key.
Agent keys for Fiducian, Alec, and costaff are uncapped where they route to zero-cost internal Foundry models.
Failures are diagnosed from the gateway and LiteLLM budget tables before assuming an agent/tool schema problem.
Key separation makes spend and failures attributable to the right agent.

That turned cost management from a crude global cap into lane-level accountability.

Operation-to-Model Mapping

After profiling actual agent workloads, the mapping looks like this:

Heartbeat polls → triage / Haiku-class. Reading a short checklist, checking if anything needs attention, replying “HEARTBEAT_OK” most of the time. This is the single biggest cost saver because heartbeats run continuously.
Webhook wake and dispatch checks → triage. The wake path needs fast routing and a small amount of judgment, not a full high-context conversation. This keeps the event-driven path cheap and protects heavier pools.
Autonomous task pickup → standard unless the ticket method demands more. It must read Jira tickets, evaluate blockers, inspect labels, and decide whether to claim work. Too nuanced for the cheapest lane, but not usually heavy-reasoning work.
Routine documentation/configuration work → standard. Most implementation and documentation tickets need reliable comprehension and tool use, not maximal reasoning.
Systematic debugging, deep research, incident response, and architecture synthesis → heavy. These tasks need broader context, hypothesis management, and stronger reasoning.
Sub-agents → model selected by method, not by habit. Researcher/investigator/analyst lanes can still use stronger models when their role requires it, while admin/configuration helpers stay lighter.
Main session conversations → high-quality model. Human-facing interactions are comparatively low volume and quality-sensitive, so this remains the least attractive place to cut cost blindly.
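The mapping above, including the "standard unless the ticket method demands more" rule, can be sketched as a lookup plus a floor. The operation keys and the tier_for helper are hypothetical names for this example, and defaulting unknown operations to standard is an assumption. Main sessions and sub-agents are left out because the text does not pin them to a single tier.

```python
from typing import Optional

# Lowest to highest capability.
TIER_RANK = {"triage": 0, "standard": 1, "heavy": 2}

# Hypothetical operation keys mirroring the mapping described above.
OPERATION_TIER = {
    "heartbeat": "triage",
    "webhook-wake": "triage",
    "task-pickup": "standard",
    "routine-docs-config": "standard",
    "systematic-debugging": "heavy",
    "deep-research": "heavy",
    "incident-response": "heavy",
    "architecture-synthesis": "heavy",
}


def tier_for(operation: str, method_floor: Optional[str] = None) -> str:
    """Pick the operation's default tier, raised to the method's floor if higher."""
    base = OPERATION_TIER.get(operation, "standard")  # assumed default
    if method_floor is None:
        return base
    return max(base, method_floor, key=TIER_RANK.__getitem__)
```

For example, tier_for("task-pickup", "heavy") returns "heavy", while a method floor never downgrades an operation that already defaults to a higher tier.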

Estimated Savings

Moving heartbeat polls alone from Opus-class models to Haiku/triage-class models cuts their cost by roughly two orders of magnitude under external pricing. Combined with cron and wake-path optimization, the model selection strategy reduces autonomous operation cost by roughly 70-80% compared to an all-heavy-model baseline.

The current configuration pattern is intent-based for wake and dispatch, while task execution maps each ticket method to a model lane:

# Wake/dispatch with a semantic tier alias
payload:
  kind: agentTurn
  model: "litellm/triage"
  message: "Evaluate the Jira webhook event and route if eligible."

# Task execution uses ticket method to choose the model lane
methodModelMap:
  configuration: "foundry-claude-haiku-4-5"
  documentation: "foundry-claude-sonnet-4-6"
  implementation: "foundry-claude-sonnet-4-6"
  deep-research: "foundry-claude-opus-4-7"
  systematic-debugging: "foundry-claude-opus-4-7"

The exact model IDs can change underneath the aliases. The operational principle is stable: route by task complexity and rate-limit pool, not by whatever model happened to be in the first working config.

Cron Delivery Discipline

This one was learned the hard way.

The Signal Leak Incident

During early autonomous operation, a cron job ran overnight and its output was delivered to the wrong channel. Principal B received raw agent session output (unfiltered chain-of-thought reasoning) on Signal during quiet hours. It was not a security breach in the traditional sense, but it was a failure of operational discipline.

Root Cause

The cron job was created without an explicit delivery.channel parameter. The platform defaulted to the main session’s channel — which happened to be Signal, pointed at Principal B. The agent’s internal reasoning about task evaluation and Jira queries showed up as Signal messages at 2 AM.

The Fix: Explicit Delivery on Everything

The lesson crystallized into a hard rule: every cron job must have an explicit delivery configuration. Never rely on defaults.

# WRONG — delivery defaults to main session channel
name: "my-cron-job"
schedule:
  kind: cron
  cron: "0 * * * *"
payload:
  kind: agentTurn
  message: "Do the thing"
# No delivery config = surprise messages

# RIGHT — explicit delivery control
name: "my-cron-job"
schedule:
  kind: cron
  cron: "0 * * * *"
payload:
  kind: agentTurn
  message: "Do the thing"
delivery:
  mode: none  # Agent handles its own output

The conservative pattern: set delivery.mode to none and have the cron payload explicitly send messages to the intended channel using the message tool. This makes the routing visible in the cron definition rather than implicit in platform defaults.
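A rule like this is easy to check mechanically. The sketch below assumes cron definitions have already been parsed into dictionaries with the same keys as the YAML above (name, delivery, mode); the missing_explicit_delivery helper is a hypothetical name, not a platform feature.

```python
def missing_explicit_delivery(jobs):
    """Return the names of cron jobs that lack an explicit delivery.mode."""
    flagged = []
    for job in jobs:
        # Treat a missing or empty delivery block the same way.
        delivery = job.get("delivery") or {}
        if not delivery.get("mode"):
            flagged.append(job.get("name", "<unnamed>"))
    return flagged
```

Run against the two definitions above, the first would be flagged and the second would pass. A non-empty result could block activation of a new job.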

Prevention Patterns

Three rules now govern cron job creation:

Explicit delivery channel — every cron job specifies where output goes. No exceptions.
Quiet hours awareness — notifications respect per-principal quiet hours. UTC 00–14 for Signal means cron output goes to Discord during those hours, or simply waits.
Cron job audit — periodic review of all active cron jobs to verify delivery configuration. New cron jobs get a checklist review before activation.
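The quiet-hours rule reduces to a small routing decision on the current UTC hour. This is a sketch under stated assumptions: the window is read as 00:00 up to but not including 14:00 UTC, the channel names are plain strings, and channel_for is a hypothetical helper. The "or simply waits" alternative is not modeled.

```python
from datetime import datetime, timezone

SIGNAL_QUIET_START_HOUR = 0  # inclusive, UTC
SIGNAL_QUIET_END_HOUR = 14   # exclusive, UTC (assumed boundary)


def channel_for(now: datetime) -> str:
    """Route to Discord during Signal quiet hours, otherwise to Signal.

    `now` must be timezone-aware; it is converted to UTC before comparing.
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    hour = now.astimezone(timezone.utc).hour
    if SIGNAL_QUIET_START_HOUR <= hour < SIGNAL_QUIET_END_HOUR:
        return "discord"
    return "signal"
```

Converting to UTC first keeps the decision independent of the host's local time zone, which matters when the cluster and the principal are in different zones.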

These rules are embedded in the agent’s operational memory as “TRAP” entries — high-priority patterns that get loaded on every session to prevent repeat failures.

Autonomous Pickup Optimization

Autonomous task pickup started as a cron schedule — agents periodically checked Jira for available work, evaluated whether they could handle it, and either picked it up or passed. The optimization challenge: minimize empty polls without missing available work.

The Empty Poll Problem

The early implementation ran pickup every 15 minutes. In practice, new tickets appear a few times per day, not a few times per hour, so more than 90% of pickup cron runs found nothing to do. Each one still consumed API tokens to query Jira, evaluate the empty result, and report “nothing found.”

Frequency Tuning

The current baseline is hourly, staggered between agents where cron polling is still used. This means:

Available work gets picked up within a predictable window
Token burn from empty polls drops substantially vs. the 15-minute schedule
Staggering prevents multiple agents from evaluating the same ticket simultaneously
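The arithmetic behind the frequency change is simple enough to write down. The sketch below assumes at most one run per new ticket finds fresh work, and it uses 5 tickets per day purely as an illustrative stand-in for "a few times per day"; empty_poll_stats is a hypothetical helper.

```python
def empty_poll_stats(interval_minutes: int, tickets_per_day: int):
    """Return (runs per day, minimum fraction of runs that find nothing)."""
    runs_per_day = (24 * 60) // interval_minutes
    # At most one run per new ticket can find fresh work.
    productive = min(tickets_per_day, runs_per_day)
    return runs_per_day, 1 - productive / runs_per_day


for interval in (15, 60):
    runs, empty_fraction = empty_poll_stats(interval, tickets_per_day=5)
    print(f"{interval:>2} min: {runs} runs/day, at least {empty_fraction:.0%} empty")
```

Under that assumption the 15-minute schedule produces 96 runs a day with at least 91 of them empty, while the hourly schedule produces 24 runs with at least 19 empty. The empty fraction stays high either way, which is why the absolute number of empty runs is the figure to watch.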

Smarter Pickup Payloads

The pickup payload itself was optimized to reduce token usage on empty results:

payload:
  kind: agentTurn
  message: |
    Search Jira for available FAD tickets:
    - Status: To Do
    - Labels: work:available and your agent label
    - Fields: summary, status, priority, labels, issuelinks

    If no tickets found, reply with just "No tasks available"
    and stop. Do not elaborate.

    If tickets found, evaluate blockers, labels, capabilities,
    and claim the first eligible ticket.

The key optimization: short-circuit on empty results. If there’s nothing to pick up, don’t burn tokens reasoning about it. The instruction to reply minimally on empty results saves output tokens on every empty poll.

WIP Guards and Resume Semantics

A work-in-progress guard prevents agents from accumulating new work when they already have a task in progress. The first version was too blunt: it treated any owned In Progress ticket as a reason to exit, including the very ticket the agent had already picked up. That created a self-blocking failure mode.

The current rule is sharper:

Check for owned In Progress tickets first.
If one exists, resume it instead of scanning for new work.
If multiple exist, resume the highest-priority/lowest-key ticket and report the extras.
When a specific ticket is being processed by a workflow, exclude that ticket from “other WIP” blocking.
“In Review” and “Blocked” do not count as active WIP because they are waiting on review or external input.

This keeps the WIP guard’s purpose — finish what you started — without turning it into a deadlock.
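The rules above can be sketched as a single decision function. Several details here are assumptions made for the example: the Ticket record, the use of Jira's default priority names, and the reading of "highest-priority/lowest-key" as priority first with the lowest key number breaking ties. The wip_decision name and its return shape are hypothetical.

```python
from dataclasses import dataclass

# Assumed priority names (Jira defaults); lower rank means more urgent.
PRIORITY_RANK = {"Highest": 0, "High": 1, "Medium": 2, "Low": 3, "Lowest": 4}


@dataclass
class Ticket:
    key: str  # e.g. "FAD-12"
    status: str
    priority: str


def key_number(key: str) -> int:
    """Numeric part of an issue key, so FAD-4 sorts before FAD-10."""
    return int(key.rsplit("-", 1)[1])


def wip_decision(owned, processing_key=None):
    """Decide whether to resume existing work or let the caller proceed.

    Only "In Progress" counts as active WIP. The ticket currently being
    processed by a workflow never blocks itself.
    """
    active = [
        t for t in owned
        if t.status == "In Progress" and t.key != processing_key
    ]
    if not active:
        return {"action": "proceed", "resume": None, "extras": []}
    active.sort(
        key=lambda t: (PRIORITY_RANK.get(t.priority, len(PRIORITY_RANK)),
                       key_number(t.key))
    )
    return {
        "action": "resume",
        "resume": active[0].key,
        "extras": [t.key for t in active[1:]],
    }
```

Passing processing_key is what removes the self-blocking failure mode: a headless worker handling FAD-5 gets "proceed" even though FAD-5 itself is In Progress.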

Event-Driven Dispatch and Closeout Recovery

The longer-term direction is event-driven: Jira webhooks and a dispatch driver promote eligible tickets to wake the right agent, while a report-only closeout sweeper identifies stale In Progress work that appears complete or needs human attention. The sweeper is intentionally report-only first; it recommends but does not auto-transition.

This matters for cost because stale WIP is expensive in two ways: it blocks new work, and it causes repeated pickup runs to spend tokens discovering the same blocker. Report-only recovery gives the board a way to explain and clear those situations without unsafe autonomous closure.

Results and Ongoing Work

The combined effect of these optimizations:

Cost visibility — weekly reports make API spend predictable and auditable
Model tiering — 70-80% reduction on autonomous operations vs. all-heavy-model baseline, plus better rate-limit distribution
Dedicated keys — agent-lane failures and spend are attributable instead of hidden behind shared credentials
Delivery discipline — zero intended reliance on default delivery routing; cron jobs explicitly route or suppress output
Poll efficiency — fewer empty pickup runs through frequency tuning, webhook dispatch, and short-circuit patterns

What Shipped Since

Event-driven pickup and dispatch — Jira webhooks and dispatch automation now wake agents when work becomes eligible instead of relying entirely on empty polling. The token savings compound with model tiering: no standard/heavy spend on evaluating empty backlogs.
Semantic model tier aliases — LiteLLM now exposes triage, standard, and heavy aliases with configured fallback chains, so consumers route by intent and gateway policy handles concrete model choice.
Dedicated agent LiteLLM keys — Fiducian, Alec, and costaff operate on separate virtual keys; costaff’s key-sharing/budget-cap incident drove this change.
Per-cron and per-method model selection — scheduled jobs and task execution use explicit model tiers. Headless worker execution reads the Jira Method field and chooses the model mechanically rather than relying on convention.
Story point estimation — the backlog planning skill assigns story points in kilotokens (1 point = 1,000 tokens) to planned tasks. This makes token cost visible at the Jira board level before work starts, enabling capacity planning against a token budget rather than discovering spend after the fact.
Report-only closeout sweeper — stale WIP can be analyzed with deterministic evidence and reason codes before any human or agent decides whether to close, review, or block the ticket.
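The story-point convention makes capacity planning a matter of simple arithmetic. The sketch below uses the stated conversion (1 point = 1,000 tokens); rounding partial points up and the helper names points_for and fits_budget are assumptions for illustration, as are the ticket keys and token figures in the usage lines.

```python
import math

TOKENS_PER_POINT = 1_000  # from the text: 1 point = 1,000 tokens


def points_for(estimated_tokens: int) -> int:
    """Convert a token estimate to story points, rounding up (assumed)."""
    return math.ceil(estimated_tokens / TOKENS_PER_POINT)


def fits_budget(estimates: dict, budget_tokens: int):
    """Return (total points, whether the planned work fits the token budget)."""
    total_points = sum(points_for(tokens) for tokens in estimates.values())
    return total_points, total_points * TOKENS_PER_POINT <= budget_tokens


# Illustrative numbers only.
planned = {"FAD-1": 45_000, "FAD-2": 120_500}
print(fits_budget(planned, budget_tokens=200_000))
```

With these example figures the plan totals 166 points, which fits a 200,000-token budget but would not fit a 150,000-token one.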

Lobster Pipeline Cost Efficiency

Lobster pipelines introduced a new dimension to cost management. Pipelines use llm-task steps — structured JSON-in/JSON-out LLM calls that specify their own model tier per step. This means:

Per-step model selection — a DoD verification pipeline can use the cheapest deterministic checks first and reserve stronger models only for compliance judgments that actually need them.
No conversation overhead — llm-task calls don’t carry conversation history. Each step gets only the input it needs, dramatically reducing input token counts compared to running the same logic in a full agent session.
Deterministic token budgets — pipeline steps have predictable input sizes (structured JSON schemas), making token consumption forecastable rather than variable.
Tool-free classification where possible — portfolio audit and dispatch-support steps can run as constrained JSON tasks instead of launching a full agent with all workspace context.

The net effect: operations that previously required a full agent session (with all its loaded context, memory files, and conversation history) now run as lightweight, schema-validated LLM calls at a fraction of the cost and with fewer rate-limit surprises.
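To make the shape of such a step concrete, here is a minimal sketch of a check that runs deterministic tests first and only then makes one JSON-in/JSON-out model call. It is not the Lobster pipeline API: verify_dod, the llm_task callback and its parameters, the ticket fields (pr_url, tests_passed), and the reply schema are all hypothetical. The "litellm/standard" alias follows the naming used in the wake payload earlier.

```python
import json


def verify_dod(ticket: dict, llm_task) -> dict:
    """Run cheap deterministic checks first; call the LLM step only if they pass."""
    required = ("summary", "pr_url", "tests_passed")
    missing = [field for field in required if not ticket.get(field)]
    if missing:
        reason = "missing: " + ", ".join(missing)
        return {"verdict": "fail", "reason": reason, "llm_calls": 0}

    # JSON in, JSON out: the step sees only these fields, no conversation history.
    request = json.dumps({"summary": ticket["summary"], "pr_url": ticket["pr_url"]})
    reply = json.loads(llm_task(model="litellm/standard", input_json=request))

    # Minimal schema check on the structured reply.
    if reply.get("verdict") not in {"pass", "fail"}:
        raise ValueError("llm-task reply did not match the expected schema")
    return {
        "verdict": reply["verdict"],
        "reason": reply.get("reason", ""),
        "llm_calls": 1,
    }
```

A ticket that fails the deterministic checks costs zero model calls, and a ticket that passes them costs exactly one call with a small, predictable input.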

What’s Next

OTel-based actual-vs-estimated tracking — comparing story point estimates against actual token consumption per ticket. Currently estimated from method-based averages; real telemetry would enable continuous calibration.
Dynamic model selection — automatically choosing model tier based on task complexity assessment rather than static operation-type mapping, while still honoring explicit method-based floors.
Cost and rate-limit alerting — automated alerts when daily spend or model-pool saturation exceeds thresholds, similar to cloud infrastructure cost alerts.
Virtual-key hygiene — continue moving services away from shared master keys toward scoped per-service virtual keys, so rotation and attribution stay simple.

Why This Matters

Running AI agents in production isn’t just a capabilities problem — it’s an operations problem. The same engineering discipline that applies to any production service applies here: instrument everything, optimize hot paths, fail gracefully, and know what you’re spending.

Most agent frameworks focus on making agents smarter. This work focuses on making them economical and sustainable. Both matter if you want agents that run continuously rather than as expensive demos.