
Autonomous Agent Work Planning

AI agents can execute tasks. That’s table stakes. The harder problem is getting them to pick the right task, decompose it correctly, and not step on each other — all without constant human supervision.

Three agents share a Jira backlog. Each runs independently: different principals, different schedules, different capabilities. Some run on the cluster as always-on pods. One runs on a desktop as ephemeral CLI sessions. They all need to pull work from the same pool, and the system has to handle:

  • Task selection — which task should this agent work on right now?
  • Decomposition — a feature request isn’t a single atomic action. How do you break it into phases?
  • Concurrency — two agents can’t claim the same task. Isolated cron sessions can’t see each other’s in-flight work.
  • Quality — agents historically skip acceptance criteria and use wrong methodologies when task descriptions are ambiguous.
  • Household safety — some tasks touch family schedules, budgets, or children. Those need human approval before an agent acts.

The initial approach — freeform Jira descriptions and ad-hoc coordination — failed predictably. Agents would start work from the summary alone, miss structured requirements buried in description prose, produce deliverables that didn’t match expectations, and occasionally duplicate effort because no locking mechanism existed.

Before building anything, a deep-dive research spike surveyed existing agent task management approaches:

AutoGPT demonstrated the failure mode clearly: freeform task decomposition with no constraints leads to infinite loops and goal drift. The agent generates sub-goals, decomposes those into sub-sub-goals, and spirals without converging.

CrewAI offered a better model — YAML-defined task specs with explicit expected outputs, tools, and agent assignments. But it’s a framework, not a methodology. The YAML structure was useful inspiration; the runtime wasn’t applicable to our setup.

MetaGPT showed that coarse role-based handoffs (architect → engineer → tester) outperform fine-grained micro-step decomposition. Fewer, larger phases with clear boundaries beat many small tasks that generate coordination overhead.

MIT Sloan research on human-AI collaboration revealed an asymmetry: AI + human teams outperform on creative tasks but underperform on decision tasks compared to humans alone. The implication: let agents run implementation autonomously, involve principals for design decisions.

The conclusion: no turnkey solution exists. But the patterns converge — structured task specs, role-based phase decomposition, explicit acceptance criteria, and human checkpoints at decision boundaries.

The Architecture: A Four-Layer Skill Stack


Rather than building a monolithic planner, the system decomposes into four layers, each a standalone skill that any agent loads:

Autonomous Work Planning — Skill Stack

What to work on, when, and why. Three modes of operation:

  • Weekly Review — a 9-step process (Snapshot → Review → Capacity → Prioritize → Dependencies → Propose → Solicit → Incorporate → Approve) that produces a Confluence sprint plan. The Solicit step uses NATS to request feedback from all agents, with a 65-minute response window. Plans are proposals — a principal must approve before tasks transition to To Do.
  • Daily Triage — a fully autonomous scan that classifies new intake, flags stale work, applies labels (needs-refinement, blocked, needs-principal-review), and fast-tracks urgent bugs. Triage routes work but never starts it — the In Progress transition belongs to the task authoring layer.
  • On-Demand — when an agent asks “what’s next?”, the skill sorts available work by priority, dependency chains, and capability match, then recommends the top 3 candidates.

Capacity planning uses story points in kilotokens (1 story point = 1,000 tokens). Each method has a default estimate — 16 points for configuration work, 40 for deep research, 42 for incident response — stored as Jira’s native Story Points field. This makes token cost visible at the board level before work starts, enabling planning against a budget rather than discovering spend after the fact.
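The kilotoken budgeting described above can be sketched in a few lines. The method names and default estimates come from the text; the helper name and its interface are illustrative, not the production API:

```python
# Sketch of kilotoken-based capacity planning (1 story point = 1,000 tokens).
# Default estimates are the ones the text gives; others are omitted.
DEFAULT_STORY_POINTS = {
    "configuration": 16,
    "deep-research": 40,
    "incident-response": 42,
}

def sprint_budget_check(methods, budget_story_points):
    """Sum default estimates for planned tasks and report whether
    the sprint fits the token budget before any work starts."""
    total = sum(DEFAULT_STORY_POINTS[m] for m in methods)
    return total, total <= budget_story_points

total, fits = sprint_budget_check(["configuration", "deep-research"], 100)
# total == 56 (i.e. ~56k tokens), fits is True
```

The point is visibility: the sum is computable from board fields alone, so overspend shows up at planning time rather than in a billing report.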

How to decompose. Six standard workflow types — feature, bug, spike, skill, config, docs — each with an ordered pipeline of phases, skip conditions, and phase gates between every step.

A feature follows: Explore → Plan → Build → Test → Verify → Deploy. A bug follows: Diagnose → Fix → Verify. A spike follows: Research → Document. Each phase maps to a specific method skill.

Phase gates enforce three outcomes: Continue (proceed to next phase), Stop (close the task, note why), or Revise (re-enter the current phase with adjustments). This prevents the AutoGPT failure mode — there’s always a structured checkpoint, not open-ended recursion.
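A minimal sketch of the gate loop, assuming hypothetical names throughout (the text specifies the three outcomes and the feature phase order; the revision cap and function signatures are illustrative):

```python
from enum import Enum

class GateOutcome(Enum):
    CONTINUE = "continue"   # proceed to next phase
    STOP = "stop"           # close the task, note why
    REVISE = "revise"       # re-enter the current phase with adjustments

FEATURE_PHASES = ["Explore", "Plan", "Build", "Test", "Verify", "Deploy"]

def run_pipeline(phases, gate):
    """Run phases in order; a gate decides after each one. The bounded
    revision count is what prevents open-ended recursion."""
    i, revisions, completed = 0, 0, []
    while i < len(phases):
        completed.append(phases[i])
        outcome = gate(phases[i])
        if outcome is GateOutcome.STOP:
            return completed, "stopped"
        if outcome is GateOutcome.REVISE:
            revisions += 1
            if revisions > 3:      # illustrative hard cap on re-entry
                return completed, "stopped"
            continue               # re-enter the same phase
        i += 1
    return completed, "done"
```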

The collapse heuristic prevents over-decomposition: if a task will take less than 30 minutes and touch fewer than 3 files, skip the sub-task structure entirely. Do it as a single task. Decomposition is only valuable when it provides clarity; for trivial work, it’s pure overhead.
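The heuristic itself is two comparisons. A sketch, with the thresholds taken directly from the text:

```python
def should_decompose(estimated_minutes, files_touched):
    """Collapse heuristic: under 30 minutes AND fewer than 3 files
    means skip the sub-task structure and do it as a single task."""
    return not (estimated_minutes < 30 and files_touched < 3)

should_decompose(20, 2)   # False -> single task, no sub-task tree
should_decompose(90, 8)   # True  -> apply the workflow's phase structure
```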

How to write each task. This is where earlier attempts failed. The standard defines four custom Jira fields that every agent-executable task must have:

| Field | Purpose | Why It Matters |
| --- | --- | --- |
| Method | Which skill or approach to use | Agents loaded the wrong methodology when guessing |
| Deliverable | Specific output expected | "Done" was ambiguous without explicit artifacts |
| Acceptance Criteria | Testable pass/fail conditions | Agents declared victory without verifying |
| Agent Instructions | Step-by-step execution guidance | Reduces interpretation variance between agents |

These are proper Jira custom fields — searchable via JQL, displayable on board cards, bulk-editable. Not description prose that gets skipped.

The Method field maps to a vocabulary of 14 approaches, each either a loadable skill (like deep-research or systematic-debugging) or a general approach (like implementation or documentation). When an agent picks up a task, the method tells it which skill to load before starting work.
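A sketch of the method-to-skill dispatch. The text names 14 methods; only the four it mentions are shown here, and the mapping shape is an assumption:

```python
# Illustrative subset of the Method vocabulary. A value of None marks a
# general approach with no skill to load before starting work.
METHOD_TO_SKILL = {
    "deep-research": "deep-research",
    "systematic-debugging": "systematic-debugging",
    "implementation": None,
    "documentation": None,
}

def skill_for(method):
    """Resolve a task's Method field to the skill an agent should load."""
    if method not in METHOD_TO_SKILL:
        raise ValueError(f"unknown method: {method}")
    return METHOD_TO_SKILL[method]
```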

How to execute each phase. The actual methodology skills — deep-research, brainstorming, writing-plans, systematic-debugging, verification, TDD, and others. Each is a standalone skill with its own SKILL.md that an agent loads when the task’s method field points to it.

The most interesting engineering decision: using Jira status transitions as a concurrency control mechanism.

Three agents share a backlog. Two of them (Fiducian and Alec) can pick up work autonomously via cron-triggered sessions. These isolated sessions don’t share memory — a cron job spawns a fresh agent instance that can’t see what other sessions are doing.

The solution uses Jira itself as the coordination layer:

  1. Before picking up work, query: “Do I have anything In Progress already?”
  2. If no, query available work sorted by priority
  3. Transition the chosen task to “In Progress” — this is the claim
  4. Swap the label from work:available to work:assigned

The status transition is the lock acquisition. It’s not perfect — there’s a race window between the JQL check and the transition — but board WIP limits (In Progress: 6, In Review: 3) serve as the hard cap. Even if two sessions claim simultaneously, the system won’t let more than 6 tasks be in-flight.
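The four-step claim sequence can be sketched against a minimal Jira client interface. The `search`, `transition`, and `set_labels` method names are stand-ins, not a real client library:

```python
WIP_LIMIT_IN_PROGRESS = 6   # board-level hard cap from the text

def try_claim(jira, agent):
    """Claim one task, or return None. The 'In Progress' transition
    is the lock acquisition; the label swap hides it from other sessions."""
    # 1. Anything already in flight for this agent? Then don't claim more.
    mine = jira.search(f'assignee = "{agent}" AND status = "In Progress"')
    if mine:
        return None
    # 2. Available work, highest priority first.
    candidates = jira.search(
        'labels = "work:available" AND status = "To Do" ORDER BY priority DESC')
    if not candidates:
        return None
    task = candidates[0]
    # 3. The status transition is the claim.
    jira.transition(task, "In Progress")
    # 4. Swap the label so other sessions skip it.
    jira.set_labels(task, remove="work:available", add="work:assigned")
    return task
```

The race lives between steps 2 and 3: two sessions can both see the same candidate before either transitions it. The WIP limit is what bounds the damage.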

This is a pragmatic choice. A proper distributed lock (Redis, etcd) would be more robust but requires infrastructure that doesn’t justify itself for a 3-agent team. Jira is already the system of record. Using it as the lock means zero additional infrastructure and the concurrency semantics are visible to humans reviewing the board.

Not all tasks are safe for autonomous execution. A task that changes the family budget, modifies a dependent’s schedule, or makes health decisions requires human approval — even if the agent has the technical capability to execute it.

The household domain guard is a content-based trigger, not an assignment filter. It doesn’t matter who created the task or which agent picks it up. If the task touches budget, family schedule, children, health, or family decisions, the agent must:

  1. Skip the task during autonomous pickup
  2. Leave a Jira comment: “Household domain — needs principal approval”
  3. Optionally notify the relevant principal’s agent

This guard exists because fiduciary duty isn’t just about technical competence — it’s about respecting boundaries. An agent that autonomously rearranges the family calendar because it seemed “optimal” is violating trust, even if the calendar change is objectively better.

Attention Control: How Agents Learn from Mistakes


The most operationally impactful innovation isn’t in the planning system — it’s in how the system encodes lessons learned.

Traditional documentation says “don’t do X.” Agents read it, process it, and sometimes still do X because the instruction didn’t carry enough weight in the context window. The solution: attention-control tags — XML-style markers built on the observation that format controls attention more effectively than prose does.

Four tag types, each with distinct semantics:

  • <HARD-GATE> — Must execute before proceeding. Blocks the workflow if skipped. Used for the Definition of Done checklist and the pre-work verification steps.
  • <NEVER> — Absolute prohibition. No exceptions, no context where this becomes acceptable. Used for things like “never modify third-party skills.”
  • <TRAP> — Common mistake that looks correct but isn’t. Includes the specific failure mode and what to do instead. Used for API gotchas like “MCP Gateway uses arguments not input.”
  • <CONDITIONAL> — Rule that applies only in specific contexts, with the context explicitly stated.

These tags emerged from a postmortem analysis. The team found that agents violated documented rules not because they didn’t read them, but because format controls attention more than content. A rule buried in a paragraph of prose gets the same cognitive weight as the surrounding text. A rule in an XML tag with a distinct name gets flagged as structurally different — it demands execution rather than acknowledgment.

The feedback loop: incidents produce postmortems, postmortems produce tags, tags get injected into the relevant skills, and future agent sessions encounter those tags as structural constraints rather than advisory prose.

Structured fields beat description prose. When acceptance criteria lived in Jira descriptions, agents skipped them. When they moved to dedicated custom fields with JQL searchability and board card display, compliance became near-automatic. The information didn’t change — the structure did.

Coarse decomposition beats micro-steps. Six workflow types with 2-6 phases each is the right granularity. Finer decomposition (10+ sub-tasks per feature) generates coordination overhead that exceeds the clarity benefit. The collapse heuristic — skip decomposition for tasks under 30 minutes — prevents the most common over-engineering.

Concurrency control doesn’t need dedicated infrastructure. Jira status transitions plus WIP limits provide sufficient coordination for a small agent team. The race window is real but the blast radius is small (worst case: two agents work the same task, one notices during the verify phase). A Redis lock would close the window but isn’t worth the operational complexity.

Household domain boundaries are a trust feature, not a technical one. The guard doesn’t prevent agents from accessing household data — they need it for context. It prevents them from acting on household decisions without approval. This distinction matters: a helpful agent that autonomously reorganizes the family schedule is technically competent but socially destructive.

Format controls attention more than content. This is the meta-lesson. Agents process everything in their context window, but structural markers (XML tags, custom fields, separate sections) get more weight than inline prose. If a rule matters, give it structural prominence — not just a sentence in a paragraph.

The system is operational and has matured significantly since its initial deployment. Three agents share a FAD (Fiduciary Agent Development) Kanban board with label-based swimlanes, WIP limits, and the full four-layer skill stack loaded (23 federated skills total). Tasks are authored with custom fields, work is decomposed using standard workflows, and the Definition of Done is mechanically enforced — alongside a broader shift from memory-based rules to mechanical gates at every stage of the work lifecycle.

Autonomous pickup shifted from cron-based polling to webhook-driven events. A dedicated Go relay service (FAD-106/107/108) receives Jira webhooks, normalizes the payload, and publishes structured events to NATS. OpenClaw agents subscribe to their topic and receive a typed payload — issue key, transition, labels, priority — ready to evaluate without parsing raw webhook JSON.

This decoupling matters: the relay service handles Jira’s webhook retry semantics, payload normalization, and fan-out to multiple subscribers. Agents simply respond to clean, well-formed events. The race window between the status check and the claim transition still exists, but the webhook delivery model means agents respond immediately to real transitions rather than polling on a fixed interval.

The DoD moved from advisory tags to a mechanically enforced gate. The dod-verify Lobster pipeline runs 7 checks before any ticket can transition to Done: Deliverable (non-empty), Acceptance Criteria (non-empty), Method (set), Jira comment linking to the deliverable, memory/learnings updated, Graphiti episode written, and git-clean verification (working tree must have no uncommitted changes). The jira-close pipeline wraps dod-verify as a prerequisite — the Done transition is blocked if any check fails.
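A sketch of the 7-check gate as a list of predicates over a ticket record. The check names mirror the text; the ticket schema and function names are illustrative, not the Lobster pipeline's actual interface:

```python
# Each check is (name, predicate). A failed predicate blocks the Done
# transition, matching the dod-verify behavior described in the text.
DOD_CHECKS = [
    ("deliverable",  lambda t: bool(t.get("deliverable"))),
    ("acceptance",   lambda t: bool(t.get("acceptance_criteria"))),
    ("method",       lambda t: bool(t.get("method"))),
    ("comment-link", lambda t: t.get("deliverable_comment", False)),
    ("learnings",    lambda t: t.get("learnings_updated", False)),
    ("graphiti",     lambda t: t.get("graphiti_episode", False)),
    ("git-clean",    lambda t: t.get("git_clean", False)),
]

def dod_verify(ticket):
    """Return the names of failed checks; an empty list unblocks Done."""
    return [name for name, check in DOD_CHECKS if not check(ticket)]
```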

A PreToolUse hook in Claude Code provides additional enforcement: check-dod-jira.sh intercepts jira_transition_issue calls with transition_id=41 (Done) and blocks them unless DoD markers exist. This means the mechanical gate operates at the tool level, not just the pipeline level — even ad-hoc ticket closures outside of Lobster are caught.

The most significant operational shift in this cycle: structured multi-step workflows moved from ad-hoc heartbeat triggers to Lobster pipelines — a local-first workflow runtime with typed JSON envelopes, explicit step sequencing, and resumable approvals.

Over 20 named pipelines now cover the full work planning lifecycle. A representative subset:

| Pipeline | Trigger | Purpose |
| --- | --- | --- |
| ticket-pickup | Webhook / manual | Evaluate task, check WIP, claim if eligible |
| work-start | Post-claim | Validate fields + SP, load skills, commit context |
| dod-verify | Pre-Done transition | 7-check compliance scan before completing |
| jira-close | Pre-Done transition | DoD-gated ticket closure with dod-verify prerequisite |
| jira-create | Ticket creation | Two-layer dedup gate + field validation + SP check |
| work-decompose | Parent ticket | Auto-create sub-task tree from workflow template |
| sprint-planning | Weekly | Full 9-step planning review with capacity |
| solicitation | During planning | Request NATS feedback from all agents |
| self-learning | Periodic | Review and promote learnings to long-term memory |
| triage | Daily | Classify new intake, flag stale work |
| memory-maintenance | Periodic | Consolidate daily notes into curated memory |
| bedtime-nudge | Cron (3× nightly) | Check HA presence + focus mode → select escalation level → send Signal nudge |

The key benefit over heartbeat-driven triggers: pipelines are resumable. If context is lost mid-workflow — a token limit, a session restart — the next agent turn picks up where the pipeline halted rather than restarting from scratch. This replaces the manual breadcrumb pattern for multi-step operations.

Heartbeats remain for lightweight real-time checks (inbox scan, urgent alerts, mention detection). Structured multi-step work runs as pipelines.

Commit-first — never deploy uncommitted code — was originally a memory-based rule. Five violations across different agents and sessions (a 20% failure rate) proved that documentation alone doesn’t change behavior. The pattern was always the same: agent makes changes, deploys to validate, moves to the next task, forgets to commit.

A spike evaluated enforcement mechanisms against the current tooling surface and recommended a three-layer model:

  1. Deploy-time (PreToolUse hook) — check-commit-first-deploy.sh intercepts Bash calls containing deploy patterns (kubectl apply|replace|patch|edit, devspace deploy, helm upgrade|install) and resources_create_or_update MCP calls. If the working tree is dirty, the deploy is blocked with an actionable message. A session-scoped bypass (touch /tmp/commit-first-bypass) allows rapid test-deploy loops during debugging — the bypass is Layer 1 only.
  2. Done-time (DoD pipeline) — the 7th DoD check (dod-verify-gitclean.sh) hard-gates on a clean working tree. The bypass is NOT honored — you can iterate dirty, but you can’t close the ticket dirty.
  3. Session-end (session-close pipeline) — catches non-ticket dirty state. The bypass is NOT honored.

This layered design acknowledges a pragmatic reality: agents legitimately need to deploy uncommitted code during debugging. The enforcement doesn’t prevent iteration — it prevents forgetting to commit after iteration.
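The Layer 1 decision can be sketched as a pure function over three inputs. The deploy pattern list is abridged from the text and the bypass path comes from it; everything else here is illustrative:

```python
import subprocess

# Abridged pattern list from the text; the bypass file path is also from
# the text. Function names and the decision shape are illustrative.
DEPLOY_PATTERNS = ("kubectl apply", "devspace deploy",
                   "helm upgrade", "helm install")
BYPASS_FILE = "/tmp/commit-first-bypass"

def working_tree_dirty(repo="."):
    """True if git reports any uncommitted changes."""
    out = subprocess.run(["git", "status", "--porcelain"],
                         cwd=repo, capture_output=True, text=True)
    return bool(out.stdout.strip())

def should_block(command, dirty, bypass_present):
    """Layer 1 only: block deploys from a dirty tree unless bypassed."""
    if not any(p in command for p in DEPLOY_PATTERNS):
        return False          # not a deploy, nothing to gate
    if bypass_present:
        return False          # session-scoped bypass honored at this layer
    return dirty
```

Note the asymmetry the text describes: the bypass short-circuits only this layer. The Done-time and session-end layers would ignore `bypass_present` entirely.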

The bedtime-nudge pipeline (FAD-485) replaced 3 separate cron jobs with a single structured workflow. The previous setup had three independent cron triggers — a gentle nudge, a firmer nudge, and a direct one — each independently hitting Home Assistant to check Spencer’s presence and focus mode. Logic was duplicated, each job had its own failure mode, and escalation behavior was hard to reason about.

The pipeline consolidates the logic into 4 steps: validate (check the escalation level is 1/2/3), check-ha (query person.spencer presence and binary_sensor.spencer_s_iphone_focus), send (deliver the Signal message at the appropriate level if checks pass), and log (record the outcome).

The check-ha step uses fail-open semantics: if Home Assistant is unreachable, the nudge sends anyway. A missed nudge because HA was briefly down is worse than a spurious one. If Spencer is not home or has focus mode active, the step exits with action: skip and the send step is a no-op.

Three cron jobs now run the same pipeline with level=1, level=2, and level=3 respectively. The pipeline’s dry-run flag allows testing HA connectivity without sending messages.
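The check-ha step's fail-open decision can be sketched as follows. The entity names come from the text; the injected `ha_query` callable is a stand-in for a real Home Assistant API call:

```python
def check_ha(ha_query):
    """Decide whether the nudge should send. If Home Assistant is
    unreachable, fail open and send anyway: a missed nudge is worse
    than a spurious one."""
    try:
        home = ha_query("person.spencer") == "home"
        focus = ha_query("binary_sensor.spencer_s_iphone_focus") == "on"
    except ConnectionError:
        return "send"          # fail open on HA outage
    if not home or focus:
        return "skip"          # not home, or focus mode active
    return "send"
```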

The sprint-planning pipeline runs the full 9-step weekly review as a sequenced workflow. Each step has explicit outcomes; the pipeline halts if a step produces an unexpected result rather than silently continuing with partial data. Capacity planning using story points (1 SP = 1k tokens) provides budget visibility before sprint approval. The pipeline outputs a structured Confluence page draft that the principal reviews and approves before any tasks transition to To Do.

The solicitation pipeline, integrated into sprint-planning, sends typed NATS requests to all registered agents, collects structured responses within a 65-minute window, and formats contributions as Confluence comments on the sprint plan page. NATS handles the real-time coordination; Confluence preserves the permanent, auditable record of who proposed what and why — visible to principals reviewing the plan without needing to trace NATS message history.

Operational Safeguards in the Creation and Pickup Flows


Four safeguards prevent the most common autonomous planning failures:

Search-Before-Create Gate. Before any Jira ticket creation, agents run a duplicate detection script. This is a hard gate — not advisory, not optional. The script searches for existing issues with similar summaries and returns exit 0 (safe to create), exit 1 (potential duplicates found, review required), or exit 2 (search failed, do not create). The gate exists because context-limit restarts cause agents to lose memory of what they created seconds ago. One incident produced 14 duplicate tickets in 94 seconds. The dedup logic is shared between the jira-create and work-decompose pipelines — decomposed sub-task trees go through the same gate, preventing a bypass where auto-generated sub-tasks could duplicate existing work.
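The exit-code contract can be sketched as follows. The real script queries Jira; here the search is injected so the contract is testable, and the similarity threshold is an assumption:

```python
from difflib import SequenceMatcher

def dedup_gate(summary, search):
    """Exit-code contract from the text:
    0 = safe to create, 1 = possible duplicates found, 2 = search failed
    (do NOT create). Any search failure maps to 2, never to 0."""
    try:
        existing = search(summary)
    except Exception:
        return 2
    similar = [s for s in existing
               if SequenceMatcher(None, summary.lower(), s.lower()).ratio() > 0.8]
    return 1 if similar else 0
```

Mapping failure to "do not create" rather than "assume safe" is the load-bearing choice: a context-limit restart plus an unlucky search outage must not reproduce the 14-tickets-in-94-seconds incident.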

Story Points Enforcement. Story points (1 SP = 1,000 tokens) are validated at both creation and pickup. The method→SP mapping is deterministic — implementation = 28, deep-research = 40, configuration = 16 — so the pipeline validates the value, not just its presence. A ticket with method: implementation and story_points: 16 is rejected at pickup with a message identifying the mismatch. This closes the gap between “SP is required” (a policy) and “SP is correct” (a mechanical check).
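The mapping and its validation are small enough to sketch directly. The three method values come from the text; the error-message shape is illustrative:

```python
# Deterministic method -> story-point mapping from the text. Validation
# checks the value, not just its presence.
METHOD_SP = {"implementation": 28, "deep-research": 40, "configuration": 16}

def validate_story_points(method, story_points):
    """Return None if valid, else a message identifying the mismatch."""
    expected = METHOD_SP.get(method)
    if expected is None:
        return f"unknown method: {method}"
    if story_points != expected:
        return (f"SP mismatch: method={method} expects {expected}, "
                f"got {story_points}")
    return None
```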

Pre-Work Field Validation. The work-start pipeline validates all 5 custom fields (Method, Deliverable, AC, Agent Instructions, Capabilities Required) via the Jira API before transitioning a ticket to In Progress. This replaced an earlier marker-based system where agents self-attested that fields were checked — a trust gap that produced 3 violations before mechanical enforcement was added. The validation scripts are deployed to both desktop and OpenClaw pod environments.

Work:Available Label Enforcement. The jira-create pipeline forces the work:available label on any ticket created with Backlog or To Do status, overwriting work:assigned if present. This fixed a pickup pool starvation problem: agents defaulted to work:assigned when creating tickets, removing them from the autonomous pickup pool. At its peak, 23 tickets were stuck as work:assigned versus 8 as work:available.

Context-Limit Recovery (Breadcrumb Pattern). Before starting any batch operation — creating multiple tickets, pushing multiple files, running a multi-step workflow — agents write a breadcrumb file recording the operation name, timestamp, planned items, and completed items. If context is lost mid-operation, the next session finds the breadcrumb and resumes from where it left off. The file is deleted on completion. This turns context loss from a “start over and create duplicates” event into a “pick up where I left off” event.
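A minimal sketch of the breadcrumb lifecycle, assuming a JSON file format (the text specifies the recorded fields; the schema and function names here are illustrative):

```python
import json
import os

def start_batch(path, operation, planned):
    """Write the breadcrumb before any batch work begins."""
    with open(path, "w") as f:
        json.dump({"operation": operation,
                   "planned": planned, "completed": []}, f)

def mark_done(path, item):
    """Record one completed item so a restart knows to skip it."""
    with open(path) as f:
        crumb = json.load(f)
    crumb["completed"].append(item)
    with open(path, "w") as f:
        json.dump(crumb, f)

def resume(path):
    """Items still outstanding; empty means nothing to recover."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        crumb = json.load(f)
    return [i for i in crumb["planned"] if i not in crumb["completed"]]

def finish_batch(path):
    os.remove(path)   # breadcrumb deleted on completion
```

A fresh session calls `resume` first: a non-empty result means a prior session died mid-batch, and the remaining items are exactly the ones to process.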

The system isn’t perfect — the postmortem-to-tag feedback loop requires manual curation, and the pickup race window remains theoretically open (though WIP limits cap the blast radius). But it’s operational, it’s improving with each incident, and it solved the original problem: agents can independently select, decompose, and execute work from a shared backlog without constant human coordination. The maturity trajectory is clear: from cron-polled pickup to webhook-driven events to Lobster pipeline automation to mechanical enforcement — where the right behavior isn’t just documented, it’s the only behavior the tooling permits.