
Autonomous Agent Work Planning

AI agents can execute tasks. That’s table stakes. The harder problem is getting them to pick the right task, decompose it correctly, and not step on each other — all without constant human supervision.

Three agents share a Jira backlog. Each runs independently: different principals, different schedules, different capabilities. Some run on the cluster as always-on pods. One runs on a desktop as ephemeral CLI sessions. They all need to pull work from the same pool, and the system has to handle:

  • Task selection — which task should this agent work on right now?
  • Decomposition — a feature request isn’t a single atomic action. How do you break it into phases?
  • Concurrency — two agents can’t claim the same task. Isolated cron sessions can’t see each other’s in-flight work.
  • Quality — agents historically skip acceptance criteria and use wrong methodologies when task descriptions are ambiguous.
  • Household safety — some tasks touch family schedules, budgets, or children. Those need human approval before an agent acts.

The initial approach — freeform Jira descriptions and ad-hoc coordination — failed predictably. Agents would start work from the summary alone, miss structured requirements buried in description prose, produce deliverables that didn’t match expectations, and occasionally duplicate effort because no locking mechanism existed.

Before building anything, a deep-dive research spike surveyed existing agent task management approaches:

AutoGPT demonstrated the failure mode clearly: freeform task decomposition with no constraints leads to infinite loops and goal drift. The agent generates sub-goals, decomposes those into sub-sub-goals, and spirals without converging.

CrewAI offered a better model — YAML-defined task specs with explicit expected outputs, tools, and agent assignments. But it’s a framework, not a methodology. The YAML structure was useful inspiration; the runtime wasn’t applicable to our setup.

MetaGPT showed that coarse role-based handoffs (architect → engineer → tester) outperform fine-grained micro-step decomposition. Fewer, larger phases with clear boundaries beat many small tasks that generate coordination overhead.

MIT Sloan research on human-AI collaboration revealed an asymmetry: AI + human teams outperform on creative tasks but underperform on decision tasks compared to humans alone. The implication: let agents run implementation autonomously, involve principals for design decisions.

The conclusion: no turnkey solution exists. But the patterns converge — structured task specs, role-based phase decomposition, explicit acceptance criteria, and human checkpoints at decision boundaries.

The Architecture: A Four-Layer Skill Stack


Rather than building a monolithic planner, the system decomposes into four layers, each a standalone skill that any agent loads:

Autonomous Work Planning — Skill Stack

What to work on, when, and why. Three modes of operation:

  • Weekly Review — a 9-step process (Snapshot → Review → Capacity → Prioritize → Dependencies → Propose → Solicit → Incorporate → Approve) that produces a Confluence sprint plan. The Solicit step uses NATS to request feedback from all agents, with a 65-minute response window. Plans are proposals — a principal must approve before tasks transition to To Do.
  • Daily Triage — a fully autonomous scan that classifies new intake, flags stale work, applies labels (needs-refinement, blocked, needs-principal-review), and fast-tracks urgent bugs. Triage routes work but never starts it — the In Progress transition belongs to the task authoring layer.
  • On-Demand — when an agent asks “what’s next?”, the skill sorts available work by priority, dependency chains, and capability match, then recommends the top 3 candidates.

Capacity planning uses story points in kilotokens (1 story point = 1,000 tokens). Each method has a default estimate — 16 points for configuration work, 40 for deep research, 42 for incident response — stored as Jira’s native Story Points field. This makes token cost visible at the board level before work starts, enabling planning against a budget rather than discovering spend after the fact.
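The kilotoken budgeting described above can be sketched in a few lines. The method names and default estimates come from the text; the helper name and its interface are illustrative, not the production API:

```python
# Sketch of kilotoken-based capacity planning (1 story point = 1,000 tokens).
# Default estimates are the ones the text gives; others are omitted.
DEFAULT_STORY_POINTS = {
    "configuration": 16,
    "deep-research": 40,
    "incident-response": 42,
}

def sprint_budget_check(methods, budget_story_points):
    """Sum default estimates for planned tasks and report whether
    the sprint fits the token budget before any work starts."""
    total = sum(DEFAULT_STORY_POINTS[m] for m in methods)
    return total, total <= budget_story_points

total, fits = sprint_budget_check(["configuration", "deep-research"], 100)
# total == 56 (i.e. ~56k tokens), fits is True
```

The point is visibility: the sum is computable from board fields alone, so overspend shows up at planning time rather than in a billing report.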

How to decompose. Six standard workflow types — feature, bug, spike, skill, config, docs — each with an ordered pipeline of phases, skip conditions, and phase gates between every step.

A feature follows: Explore → Plan → Build → Test → Verify → Deploy. A bug follows: Diagnose → Fix → Verify. A spike follows: Research → Document. Each phase maps to a specific method skill.

Phase gates enforce three outcomes: Continue (proceed to next phase), Stop (close the task, note why), or Revise (re-enter the current phase with adjustments). This prevents the AutoGPT failure mode — there’s always a structured checkpoint, not open-ended recursion.
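A minimal sketch of the gate loop, assuming hypothetical names throughout (the text specifies the three outcomes and the feature phase order; the revision cap and function signatures are illustrative):

```python
from enum import Enum

class GateOutcome(Enum):
    CONTINUE = "continue"   # proceed to next phase
    STOP = "stop"           # close the task, note why
    REVISE = "revise"       # re-enter the current phase with adjustments

FEATURE_PHASES = ["Explore", "Plan", "Build", "Test", "Verify", "Deploy"]

def run_pipeline(phases, gate):
    """Run phases in order; a gate decides after each one. The bounded
    revision count is what prevents open-ended recursion."""
    i, revisions, completed = 0, 0, []
    while i < len(phases):
        completed.append(phases[i])
        outcome = gate(phases[i])
        if outcome is GateOutcome.STOP:
            return completed, "stopped"
        if outcome is GateOutcome.REVISE:
            revisions += 1
            if revisions > 3:      # illustrative hard cap on re-entry
                return completed, "stopped"
            continue               # re-enter the same phase
        i += 1
    return completed, "done"
```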

The collapse heuristic prevents over-decomposition: if a task will take less than 30 minutes and touch fewer than 3 files, skip the sub-task structure entirely. Do it as a single task. Decomposition is only valuable when it provides clarity; for trivial work, it’s pure overhead.
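The heuristic itself is two comparisons. A sketch, with the thresholds taken directly from the text:

```python
def should_decompose(estimated_minutes, files_touched):
    """Collapse heuristic: under 30 minutes AND fewer than 3 files
    means skip the sub-task structure and do it as a single task."""
    return not (estimated_minutes < 30 and files_touched < 3)

should_decompose(20, 2)   # False -> single task, no sub-task tree
should_decompose(90, 8)   # True  -> apply the workflow's phase structure
```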

How to write each task. This is where earlier attempts failed. The standard defines four custom Jira fields that every agent-executable task must have:

| Field | Purpose | Why It Matters |
| --- | --- | --- |
| Method | Which skill or approach to use | Agents loaded the wrong methodology when guessing |
| Deliverable | Specific output expected | "Done" was ambiguous without explicit artifacts |
| Acceptance Criteria | Testable pass/fail conditions | Agents declared victory without verifying |
| Agent Instructions | Step-by-step execution guidance | Reduces interpretation variance between agents |

These are proper Jira custom fields — searchable via JQL, displayable on board cards, bulk-editable. Not description prose that gets skipped.

The Method field maps to a vocabulary of 14 approaches, each either a loadable skill (like deep-research or systematic-debugging) or a general approach (like implementation or documentation). When an agent picks up a task, the method tells it which skill to load before starting work.
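A sketch of the method-to-skill dispatch. The text names 14 methods; only the four it mentions are shown here, and the mapping shape is an assumption:

```python
# Illustrative subset of the Method vocabulary. A value of None marks a
# general approach with no skill to load before starting work.
METHOD_TO_SKILL = {
    "deep-research": "deep-research",
    "systematic-debugging": "systematic-debugging",
    "implementation": None,
    "documentation": None,
}

def skill_for(method):
    """Resolve a task's Method field to the skill an agent should load."""
    if method not in METHOD_TO_SKILL:
        raise ValueError(f"unknown method: {method}")
    return METHOD_TO_SKILL[method]
```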

How to execute each phase. The actual methodology skills — deep-research, brainstorming, writing-plans, systematic-debugging, verification, TDD, and others. Each is a standalone skill with its own SKILL.md that an agent loads when the task’s method field points to it.

The most interesting engineering decision: using Jira status transitions as a concurrency control mechanism.

Three agents share a backlog. Two of them (Fiducian and Alec) can pick up work autonomously via cron-triggered sessions. These isolated sessions don’t share memory — a cron job spawns a fresh agent instance that can’t see what other sessions are doing.

The solution uses Jira itself as the coordination layer:

  1. Before picking up work, query: “Do I have anything In Progress already?”
  2. If no, query available work sorted by priority
  3. Transition the chosen task to “In Progress” — this is the claim
  4. Swap the label from work:available to work:assigned

The status transition is the lock acquisition. It’s not perfect — there’s a race window between the JQL check and the transition — but board WIP limits (In Progress: 6, In Review: 3) serve as the hard cap. Even if two sessions claim simultaneously, the system won’t let more than 6 tasks be in-flight.
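The four-step claim sequence can be sketched against a minimal Jira client interface. The `search`, `transition`, and `set_labels` method names are stand-ins, not a real client library:

```python
WIP_LIMIT_IN_PROGRESS = 6   # board-level hard cap from the text

def try_claim(jira, agent):
    """Claim one task, or return None. The 'In Progress' transition
    is the lock acquisition; the label swap hides it from other sessions."""
    # 1. Anything already in flight for this agent? Then don't claim more.
    mine = jira.search(f'assignee = "{agent}" AND status = "In Progress"')
    if mine:
        return None
    # 2. Available work, highest priority first.
    candidates = jira.search(
        'labels = "work:available" AND status = "To Do" ORDER BY priority DESC')
    if not candidates:
        return None
    task = candidates[0]
    # 3. The status transition is the claim.
    jira.transition(task, "In Progress")
    # 4. Swap the label so other sessions skip it.
    jira.set_labels(task, remove="work:available", add="work:assigned")
    return task
```

The race lives between steps 2 and 3: two sessions can both see the same candidate before either transitions it. The WIP limit is what bounds the damage.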

This is a pragmatic choice. A proper distributed lock (Redis, etcd) would be more robust but requires infrastructure that doesn’t justify itself for a 3-agent team. Jira is already the system of record. Using it as the lock means zero additional infrastructure and the concurrency semantics are visible to humans reviewing the board.

Not all tasks are safe for autonomous execution. A task that changes the family budget, modifies a dependent’s schedule, or makes health decisions requires human approval — even if the agent has the technical capability to execute it.

The household domain guard is a content-based trigger, not an assignment filter. It doesn’t matter who created the task or which agent picks it up. If the task touches budget, family schedule, children, health, or family decisions, the agent must:

  1. Skip the task during autonomous pickup
  2. Leave a Jira comment: “Household domain — needs principal approval”
  3. Optionally notify the relevant principal’s agent

This guard exists because fiduciary duty isn’t just about technical competence — it’s about respecting boundaries. An agent that autonomously rearranges the family calendar because it seemed “optimal” is violating trust, even if the calendar change is objectively better.

Attention Control: How Agents Learn from Mistakes


The most operationally impactful innovation isn’t in the planning system — it’s in how the system encodes lessons learned.

Traditional documentation says “don’t do X.” Agents read it, process it, and sometimes still do X because the instruction didn’t carry enough weight in the context window. The solution: attention-control tags — XML-style markers built on the observation that format controls attention more effectively than prose does.

Four tag types, each with distinct semantics:

  • <HARD-GATE> — Must execute before proceeding. Blocks the workflow if skipped. Used for the Definition of Done checklist and the pre-work verification steps.
  • <NEVER> — Absolute prohibition. No exceptions, no context where this becomes acceptable. Used for things like “never modify third-party skills.”
  • <TRAP> — Common mistake that looks correct but isn’t. Includes the specific failure mode and what to do instead. Used for API gotchas like “MCP Gateway uses arguments not input.”
  • <CONDITIONAL> — Rule that applies only in specific contexts, with the context explicitly stated.

These tags emerged from a postmortem analysis. The team found that agents violated documented rules not because they didn’t read them, but because format controls attention more than content. A rule buried in a paragraph of prose gets the same cognitive weight as the surrounding text. A rule in an XML tag with a distinct name gets flagged as structurally different — it demands execution rather than acknowledgment.

The feedback loop: incidents produce postmortems, postmortems produce tags, tags get injected into the relevant skills, and future agent sessions encounter those tags as structural constraints rather than advisory prose.

Structured fields beat description prose. When acceptance criteria lived in Jira descriptions, agents skipped them. When they moved to dedicated custom fields with JQL searchability and board card display, compliance became near-automatic. The information didn’t change — the structure did.

Coarse decomposition beats micro-steps. Six workflow types with 2-6 phases each is the right granularity. Finer decomposition (10+ sub-tasks per feature) generates coordination overhead that exceeds the clarity benefit. The collapse heuristic — skip decomposition for tasks under 30 minutes — prevents the most common over-engineering.

Concurrency control doesn’t need dedicated infrastructure. Jira status transitions plus WIP limits provide sufficient coordination for a small agent team. The race window is real but the blast radius is small (worst case: two agents work the same task, one notices during the verify phase). A Redis lock would close the window but isn’t worth the operational complexity.

Household domain boundaries are a trust feature, not a technical one. The guard doesn’t prevent agents from accessing household data — they need it for context. It prevents them from acting on household decisions without approval. This distinction matters: a helpful agent that autonomously reorganizes the family schedule is technically competent but socially destructive.

Format controls attention more than content. This is the meta-lesson. Agents process everything in their context window, but structural markers (XML tags, custom fields, separate sections) get more weight than inline prose. If a rule matters, give it structural prominence — not just a sentence in a paragraph.

The system is operational and has matured significantly since its initial deployment. Three agents share a FAD (Fiduciary Agent Development) Kanban board with label-based swimlanes, WIP limits, and the full four-layer skill stack loaded (23 federated skills total). Tasks are authored with custom fields, work is decomposed using standard workflows, and the Definition of Done is mechanically enforced — alongside a broader shift from memory-based rules to mechanical gates at every stage of the work lifecycle.

Autonomous pickup shifted from cron-based polling to webhook-driven events. A dedicated Go relay service (FAD-106/107/108) receives Jira webhooks, normalizes the payload, and publishes structured events to NATS. OpenClaw agents subscribe to their topic and receive a typed payload — issue key, transition, labels, priority — ready to evaluate without parsing raw webhook JSON.

This decoupling matters: the relay service handles Jira’s webhook retry semantics, payload normalization, and fan-out to multiple subscribers. Agents simply respond to clean, well-formed events. The race window between the status check and the claim transition still exists, but the webhook delivery model means agents respond immediately to real transitions rather than polling on a fixed interval.

The DoD moved from advisory tags to a mechanically enforced gate. The dod-verify Lobster pipeline runs 7 checks before any ticket can transition to Done: Deliverable (non-empty), Acceptance Criteria (non-empty), Method (set), Jira comment linking to the deliverable, memory/learnings updated, Graphiti episode written, and git-clean verification (working tree must have no uncommitted changes). The jira-close pipeline wraps dod-verify as a prerequisite — the Done transition is blocked if any check fails.
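A sketch of the 7-check gate as a list of predicates over a ticket record. The check names mirror the text; the ticket schema and function names are illustrative, not the Lobster pipeline's actual interface:

```python
# Each check is (name, predicate). A failed predicate blocks the Done
# transition, matching the dod-verify behavior described in the text.
DOD_CHECKS = [
    ("deliverable",  lambda t: bool(t.get("deliverable"))),
    ("acceptance",   lambda t: bool(t.get("acceptance_criteria"))),
    ("method",       lambda t: bool(t.get("method"))),
    ("comment-link", lambda t: t.get("deliverable_comment", False)),
    ("learnings",    lambda t: t.get("learnings_updated", False)),
    ("graphiti",     lambda t: t.get("graphiti_episode", False)),
    ("git-clean",    lambda t: t.get("git_clean", False)),
]

def dod_verify(ticket):
    """Return the names of failed checks; an empty list unblocks Done."""
    return [name for name, check in DOD_CHECKS if not check(ticket)]
```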

A PreToolUse hook in Claude Code provides additional enforcement: check-dod-jira.sh intercepts jira_transition_issue calls with transition_id=41 (Done) and blocks them unless DoD markers exist. This means the mechanical gate operates at the tool level, not just the pipeline level — even ad-hoc ticket closures outside of Lobster are caught.

The most significant operational shift in this cycle: structured multi-step workflows moved from ad-hoc heartbeat triggers to Lobster pipelines — a local-first workflow runtime with typed JSON envelopes, explicit step sequencing, and resumable approvals.

Over 20 named pipelines now cover the full work planning lifecycle. A representative subset:

| Pipeline | Trigger | Purpose |
| --- | --- | --- |
| ticket-pickup | Webhook / manual | Evaluate task, check WIP, claim if eligible |
| work-start | Post-claim | Validate fields + SP, load skills, commit context |
| dod-verify | Pre-Done transition | 7-check compliance scan before completing |
| jira-close | Pre-Done transition | DoD-gated ticket closure with dod-verify prerequisite |
| jira-create | Ticket creation | Two-layer dedup gate + field validation + SP check |
| work-decompose | Parent ticket | Auto-create sub-task tree from workflow template |
| sprint-planning | Weekly | Full 9-step planning review with capacity |
| solicitation | During planning | Request NATS feedback from all agents |
| self-learning | Periodic | Review and promote learnings to long-term memory |
| triage | Daily | Classify new intake, flag stale work |
| memory-maintenance | Periodic | Consolidate daily notes into curated memory |
| bedtime-nudge | Cron (3× nightly) | Check HA presence + focus mode → select escalation level → send Signal nudge |

The key benefit over heartbeat-driven triggers: pipelines are resumable. If context is lost mid-workflow — a token limit, a session restart — the next agent turn picks up where the pipeline halted rather than restarting from scratch. This replaces the manual breadcrumb pattern for multi-step operations.

Heartbeats remain for lightweight real-time checks (inbox scan, urgent alerts, mention detection). Structured multi-step work runs as pipelines.

Commit-first — never deploy uncommitted code — was originally a memory-based rule. Five violations across different agents and sessions (a 20% failure rate) proved that documentation alone doesn’t change behavior. The pattern was always the same: agent makes changes, deploys to validate, moves to the next task, forgets to commit.

A spike evaluated enforcement mechanisms against the current tooling surface and recommended a three-layer model:

  1. Deploy-time (PreToolUse hook) — check-commit-first-deploy.sh intercepts Bash calls containing deploy patterns (kubectl apply|replace|patch|edit, devspace deploy, helm upgrade|install) and resources_create_or_update MCP calls. If the working tree is dirty, the deploy is blocked with an actionable message. A session-scoped bypass (touch /tmp/commit-first-bypass) allows rapid test-deploy loops during debugging — the bypass is Layer 1 only.
  2. Done-time (DoD pipeline) — the 7th DoD check (dod-verify-gitclean.sh) hard-gates on a clean working tree. The bypass is NOT honored — you can iterate dirty, but you can’t close the ticket dirty.
  3. Session-end (session-close pipeline) — catches non-ticket dirty state. The bypass is NOT honored.

This layered design acknowledges a pragmatic reality: agents legitimately need to deploy uncommitted code during debugging. The enforcement doesn’t prevent iteration — it prevents forgetting to commit after iteration.
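The Layer 1 decision can be sketched as a pure function over three inputs. The deploy pattern list is abridged from the text and the bypass path comes from it; everything else here is illustrative:

```python
import subprocess

# Abridged pattern list from the text; the bypass file path is also from
# the text. Function names and the decision shape are illustrative.
DEPLOY_PATTERNS = ("kubectl apply", "devspace deploy",
                   "helm upgrade", "helm install")
BYPASS_FILE = "/tmp/commit-first-bypass"

def working_tree_dirty(repo="."):
    """True if git reports any uncommitted changes."""
    out = subprocess.run(["git", "status", "--porcelain"],
                         cwd=repo, capture_output=True, text=True)
    return bool(out.stdout.strip())

def should_block(command, dirty, bypass_present):
    """Layer 1 only: block deploys from a dirty tree unless bypassed."""
    if not any(p in command for p in DEPLOY_PATTERNS):
        return False          # not a deploy, nothing to gate
    if bypass_present:
        return False          # session-scoped bypass honored at this layer
    return dirty
```

Note the asymmetry the text describes: the bypass short-circuits only this layer. The Done-time and session-end layers would ignore `bypass_present` entirely.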

The bedtime-nudge pipeline (FAD-485) replaced 3 separate cron jobs with a single structured workflow. The previous setup had three independent cron triggers — a gentle nudge, a firmer nudge, and a direct one — each independently hitting Home Assistant to check Spencer’s presence and focus mode. Logic was duplicated, each job had its own failure mode, and escalation behavior was hard to reason about.

The pipeline consolidates the logic into 4 steps: validate (check the escalation level is 1/2/3), check-ha (query person.spencer presence and binary_sensor.spencer_s_iphone_focus), send (deliver the Signal message at the appropriate level if checks pass), and log (record the outcome).

The check-ha step uses fail-open semantics: if Home Assistant is unreachable, the nudge sends anyway. A missed nudge because HA was briefly down is worse than a spurious one. If Spencer is not home or has focus mode active, the step exits with action: skip and the send step is a no-op.

Three cron jobs now run the same pipeline with level=1, level=2, and level=3 respectively. The pipeline’s dry-run flag allows testing HA connectivity without sending messages.
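The check-ha step's fail-open decision can be sketched as follows. The entity names come from the text; the injected `ha_query` callable is a stand-in for a real Home Assistant API call:

```python
def check_ha(ha_query):
    """Decide whether the nudge should send. If Home Assistant is
    unreachable, fail open and send anyway: a missed nudge is worse
    than a spurious one."""
    try:
        home = ha_query("person.spencer") == "home"
        focus = ha_query("binary_sensor.spencer_s_iphone_focus") == "on"
    except ConnectionError:
        return "send"          # fail open on HA outage
    if not home or focus:
        return "skip"          # not home, or focus mode active
    return "send"
```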

The sprint-planning pipeline runs the full 9-step weekly review as a sequenced workflow. Each step has explicit outcomes; the pipeline halts if a step produces an unexpected result rather than silently continuing with partial data. Capacity planning using story points (1 SP = 1k tokens) provides budget visibility before sprint approval. The pipeline outputs a structured Confluence page draft that the principal reviews and approves before any tasks transition to To Do.

The solicitation pipeline, integrated into sprint-planning, sends typed NATS requests to all registered agents, collects structured responses within a 65-minute window, and formats contributions as Confluence comments on the sprint plan page. NATS handles the real-time coordination; Confluence preserves the permanent, auditable record of who proposed what and why — visible to principals reviewing the plan without needing to trace NATS message history.

Operational Safeguards in the Creation and Pickup Flows


Four safeguards prevent the most common autonomous planning failures:

Search-Before-Create Gate. Before any Jira ticket creation, agents run a duplicate detection script. This is a hard gate — not advisory, not optional. The script searches for existing issues with similar summaries and returns exit 0 (safe to create), exit 1 (potential duplicates found, review required), or exit 2 (search failed, do not create). The gate exists because context-limit restarts cause agents to lose memory of what they created seconds ago. One incident produced 14 duplicate tickets in 94 seconds. The dedup logic is shared between the jira-create and work-decompose pipelines — decomposed sub-task trees go through the same gate, preventing a bypass where auto-generated sub-tasks could duplicate existing work.
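The exit-code contract can be sketched as follows. The real script queries Jira; here the search is injected so the contract is testable, and the similarity threshold is an assumption:

```python
from difflib import SequenceMatcher

def dedup_gate(summary, search):
    """Exit-code contract from the text:
    0 = safe to create, 1 = possible duplicates found, 2 = search failed
    (do NOT create). Any search failure maps to 2, never to 0."""
    try:
        existing = search(summary)
    except Exception:
        return 2
    similar = [s for s in existing
               if SequenceMatcher(None, summary.lower(), s.lower()).ratio() > 0.8]
    return 1 if similar else 0
```

Mapping failure to "do not create" rather than "assume safe" is the load-bearing choice: a context-limit restart plus an unlucky search outage must not reproduce the 14-tickets-in-94-seconds incident.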

Story Points Enforcement. Story points (1 SP = 1,000 tokens) are validated at both creation and pickup. The method→SP mapping is deterministic — implementation = 28, deep-research = 40, configuration = 16 — so the pipeline validates the value, not just its presence. A ticket with method: implementation and story_points: 16 is rejected at pickup with a message identifying the mismatch. This closes the gap between “SP is required” (a policy) and “SP is correct” (a mechanical check).
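The mapping and its validation are small enough to sketch directly. The three method values come from the text; the error-message shape is illustrative:

```python
# Deterministic method -> story-point mapping from the text. Validation
# checks the value, not just its presence.
METHOD_SP = {"implementation": 28, "deep-research": 40, "configuration": 16}

def validate_story_points(method, story_points):
    """Return None if valid, else a message identifying the mismatch."""
    expected = METHOD_SP.get(method)
    if expected is None:
        return f"unknown method: {method}"
    if story_points != expected:
        return (f"SP mismatch: method={method} expects {expected}, "
                f"got {story_points}")
    return None
```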

Pre-Work Field Validation. The work-start pipeline validates all 5 custom fields (Method, Deliverable, AC, Agent Instructions, Capabilities Required) via the Jira API before transitioning a ticket to In Progress. This replaced an earlier marker-based system where agents self-attested that fields were checked — a trust gap that produced 3 violations before mechanical enforcement was added. The validation scripts are deployed to both desktop and OpenClaw pod environments.

Work:Available Label Enforcement. The jira-create pipeline forces the work:available label on any ticket created with Backlog or To Do status, overwriting work:assigned if present. This fixed a pickup pool starvation problem: agents defaulted to work:assigned when creating tickets, removing them from the autonomous pickup pool. At its peak, 23 tickets were stuck as work:assigned versus 8 as work:available.

Context-Limit Recovery (Breadcrumb Pattern). Before starting any batch operation — creating multiple tickets, pushing multiple files, running a multi-step workflow — agents write a breadcrumb file recording the operation name, timestamp, planned items, and completed items. If context is lost mid-operation, the next session finds the breadcrumb and resumes from where it left off. The file is deleted on completion. This turns context loss from a “start over and create duplicates” event into a “pick up where I left off” event.
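A minimal sketch of the breadcrumb lifecycle, assuming a JSON file format (the text specifies the recorded fields; the schema and function names here are illustrative):

```python
import json
import os

def start_batch(path, operation, planned):
    """Write the breadcrumb before any batch work begins."""
    with open(path, "w") as f:
        json.dump({"operation": operation,
                   "planned": planned, "completed": []}, f)

def mark_done(path, item):
    """Record one completed item so a restart knows to skip it."""
    with open(path) as f:
        crumb = json.load(f)
    crumb["completed"].append(item)
    with open(path, "w") as f:
        json.dump(crumb, f)

def resume(path):
    """Items still outstanding; empty means nothing to recover."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        crumb = json.load(f)
    return [i for i in crumb["planned"] if i not in crumb["completed"]]

def finish_batch(path):
    os.remove(path)   # breadcrumb deleted on completion
```

A fresh session calls `resume` first: a non-empty result means a prior session died mid-batch, and the remaining items are exactly the ones to process.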

The system isn’t perfect — the postmortem-to-tag feedback loop requires manual curation, and the pickup race window remains theoretically open (though WIP limits cap the blast radius). But it’s operational, it’s improving with each incident, and it solved the original problem: agents can independently select, decompose, and execute work from a shared backlog without constant human coordination. The maturity trajectory is clear: from cron-polled pickup to webhook-driven events to Lobster pipeline automation to mechanical enforcement — where the right behavior isn’t just documented, it’s the only behavior the tooling permits.