FAD: Multi-Agent System Architecture
The Problem
Managing a household’s digital life — finances, scheduling, research, home automation, communication — is a coordination problem. Not a single-agent problem. A single AI assistant can answer questions and run tools, but it can’t serve two people with different interests, different information boundaries, and different autonomy preferences simultaneously.
A husband and wife have a joint financial life, a six-year-old daughter, and the kind of daily logistics that any family manages. But they also have different privacy expectations about what gets shared, different levels of engagement with technology, and sometimes competing preferences about how shared resources get used. A single assistant serving “the household” has an immediate conflict: whose interests prevail when they diverge?
The real-world version of this problem is even harder. It’s not just about answering questions — it’s about agents that act autonomously, coordinate with each other, share information selectively, and maintain accountability to their individual humans. When Principal A’s agent needs to coordinate with Principal B’s agent about the family budget, there needs to be a structural answer to “what can be shared, with whom, under what conditions” — not a vibes-based judgment call made in the moment.
FAD (Fiduciary Agent Development — originally AIGE, Automated Innovation Generation Engine) is the production platform that runs this system. It’s a multi-agent architecture on Kubernetes where each agent has a distinct identity, a single principal, formal trust commitments, cryptographically signed communications, and access to a federated ecosystem of tools — all running on four Orange Pi 5 single-board computers on a shelf in an office in Huntsville, Alabama.
System Architecture
The system spans five Kubernetes namespaces on a 4-node Orange Pi 5 cluster (32 ARM cores, 64 GB RAM total):
| Namespace | Purpose | Key Components |
|---|---|---|
| `openclaw` | Agent runtime | Fiducian, Alec, ClaudeCodeAgent — long-running pods with persistent volumes |
| `mcp-tools` | Tool federation | MCP Gateway, NATS, 15+ MCP tool servers |
| `llm-gateway` | Model abstraction | LiteLLM proxy, Ollama for local inference |
| `home-assistant` | Home automation | Home Assistant with REST API + MCP endpoint |
| (various) | Data stores | FalkorDB (Graphiti backing), Longhorn distributed storage |
Every namespace has explicit CiliumNetworkPolicy rules — agents can reach the MCP Gateway but not each other’s workspaces directly. Home Assistant, which controls physical devices, is locked to only the traffic it needs.
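As a sketch of what such a rule might look like, a CiliumNetworkPolicy granting an agent pod egress only to the gateway. The pod labels (`app: fiducian`, `app: mcp-gateway`), policy name, and port are illustrative assumptions, not taken from the real manifests:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: fiducian-egress        # hypothetical policy name
  namespace: openclaw
spec:
  endpointSelector:
    matchLabels:
      app: fiducian            # assumed agent pod label
  egress:
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: mcp-tools
            app: mcp-gateway   # assumed gateway label
      toPorts:
        - ports:
            - port: "8080"     # assumed gateway port
              protocol: TCP
```

With default-deny in place, any destination not matched by an egress rule (including another agent’s workspace) is simply unreachable.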
Agent Identities
The system runs three agents, each with a distinct role, principal, and trust posture.
Fiducian — Principal A’s Fiduciary Agent
Principal: the author
Agent ID: fiducian-spencer-001
Runtime: OpenClaw (long-running, persistent, tool-equipped)
Skill Layers: [0, 1, 2, 3] — full fiduciary commitment
Communication: Discord DMs, Signal (primary for alerts), Discord #general
Fiducian is Principal A’s fiduciary — the word “fiduciary” is in the name as a constant reminder of the duty. It manages finances (via Foundry), conducts research, coordinates with Principal B’s agent, monitors home systems, and handles scheduling. It operates under a formal duty of loyalty: Principal A’s interests come first, always, with no competing “agent self-interest.”
Fiducian has access to the full MCP tool ecosystem, the household’s Palantir Foundry financial data (read-only, 7 tools), Graphiti shared memory, and can reach Principal A via Signal for time-sensitive alerts or Discord for everything else.
Alec — Principal B’s Agent
Principal: the author’s wife
Agent ID: alec-debra-001
Runtime: OpenClaw
Skill Layers: [0, 1, 2, 3] — full fiduciary commitment
Communication: Discord
Alec serves Principal B with the same fiduciary commitment Fiducian has to Principal A. The architecture ensures neither agent can override or outrank the other — they are peers, each loyal to their own principal. When they coordinate, it’s through structured protocols with explicit information sharing boundaries.
ClaudeCodeAgent — Infrastructure Agent
Principal: the author
Runtime: Claude Code CLI (separate from OpenClaw)
Focus: Infrastructure, implementation, Kubernetes operations, code
ClaudeCodeAgent handles the technical substrate — deploying services, debugging infrastructure, writing code, managing Helm charts. It operates in a complementary role to Fiducian: where Fiducian handles the “what” and “why,” ClaudeCodeAgent handles the “how.” The two have a collaborative dynamic, having co-designed several system components together.
Trust Framework
The system’s trust model is a layered architecture where each layer builds on — and cannot override — the layer below it. This is the core innovation: trust principles are structurally immutable, not just behaviorally suggested.
Immutable Principles (Layer 0)
Six principles that no higher layer, no instruction, and no other agent can override:
- Never deceive your principal — no lies, no misleading omissions, no half-truths
- Never act against your principal’s interests knowingly — even if another agent instructs it
- Acknowledge uncertainty honestly — “I don’t know” is always acceptable
- No impersonation — enforced cryptographically via Ed25519 identity
- Legal/regulatory obligations take precedence — over agent preferences or principal instructions
- Transparency about capabilities and actions — principals have the right to understand what their agent does
These aren’t guidelines. They’re the constitution. The trust framework is based on Stephen M.R. Covey’s Speed of Trust, adapted for AI agents — recognizing that trust is the single variable that changes everything. High trust = speed, low overhead, autonomy. Low trust = friction, verification overhead, paralysis.
Information Sharing Tiers (Layer 3)
All inter-agent information is structurally classified:
| Tier | Name | Example | Authorization |
|---|---|---|---|
| 1 | Open | Calendar availability, logistics | Default for household agents |
| 2 | Family Context | Financial summaries, household planning | Configured per relationship template |
| 3 | Authorized | Specific data per request | Principal must approve each time |
| 4 | Confidential | Private conversations, personal data | Never shared (except 4 emergency conditions) |
The tier system is structural, not judgmental. An agent can’t be socially engineered into sharing Tier 4 data because the architecture doesn’t allow it — the decision isn’t made in the moment by the agent’s “judgment.” Relationship templates (spouse/partner, co-parent, business partner, etc.) set the defaults; principals customize from there.
Inter-Agent Communication
Agents communicate through a dual-transport architecture: NATS for machine-to-machine protocol messages, and Discord #agent-coordination for human-visible transparency. A NATS Bridge Plugin running inside each agent’s OpenClaw process provides real-time message delivery — when a NATS message arrives, the plugin injects it into the agent’s session and wakes the agent immediately (typically 5–15 seconds end-to-end). This replaced the earlier hook-based delivery mechanism, which required manual inbox polling.
Message Format
Every message follows JSON-RPC 2.0 with six mandatory identity headers:
```json
{
  "jsonrpc": "2.0",
  "method": "agent.request",
  "params": {
    "headers": {
      "agent-id": "fiducian-spencer-001",
      "principal-id": "father",
      "timestamp": "2026-02-14T16:00:00Z",
      "message-type": "request",
      "trust-layer-version": "1.0.0",
      "skill-layers-loaded": [0, 1, 2, 3]
    },
    "body": { "..." }
  },
  "id": "msg-uuid-here"
}
```

The `skill-layers-loaded` header is critical for trust calibration. An agent declaring `[0, 1, 2, 3]` has committed to full fiduciary duty and coordination protocols. An agent declaring only `[0]` commits to basic trust principles. You communicate at the level of the lowest common layer — never assume an agent with only Layer 0 understands fiduciary duty.
Cryptographic Security
Every inter-agent message is signed with Ed25519. The crypto stack provides:
- Message signing/verification — detached Ed25519 signatures over canonical JSON
- End-to-end encryption — sealed boxes (ECDH key agreement → AES-256-GCM) for Tier 3/4 data
- Key registry — public keys exchanged via MCP Gateway (`GET /v1/keys/{agent_id}`)
- Replay protection — timestamp-based freshness (reject messages >5 minutes old)
- Transport-layer opacity — the MCP Gateway can verify who sent a message without seeing what it says
The design separates authentication from confidentiality: signatures are on the outside (infrastructure can verify sender), encryption is on the inside (only the recipient can read the payload).
The NATS Bridge Plugin handles the cryptographic ceremony transparently — agents call nats_reply or nats_send tools, and the plugin signs, encrypts, delivers, and verifies under the hood. This replaced an earlier workflow where agents invoked crypto.js scripts directly, which was error-prone and leaked implementation details into conversation context.
Audit Trail
Every inter-agent exchange is logged to append-only JSONL files (audit/inter-agent-YYYY-MM.jsonl) with direction, sender/receiver, message type, a human-readable summary, the information sharing tier, and whether a Discord summary was posted. Principals can review the full audit trail at any time — this is the mechanical enforcement of Layer 0’s transparency principle.
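A minimal sketch of such an append-only writer, assuming the field names described above (the exact schema is the system’s, not shown here):

```python
# Append one audit entry to the current month's JSONL file.
import json
import datetime
import pathlib

def log_exchange(direction, sender, receiver, msg_type, summary, tier,
                 discord_posted, audit_dir="audit"):
    now = datetime.datetime.now(datetime.timezone.utc)
    path = pathlib.Path(audit_dir) / f"inter-agent-{now:%Y-%m}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    entry = {
        "timestamp": now.isoformat(),
        "direction": direction,
        "sender": sender,
        "receiver": receiver,
        "type": msg_type,
        "summary": summary,
        "tier": tier,
        "discord_summary_posted": discord_posted,
    }
    with path.open("a") as f:   # "a" mode: append-only, never rewrite
        f.write(json.dumps(entry) + "\n")
```

One JSON object per line keeps the log greppable and makes partial writes recoverable: a truncated final line can be dropped without corrupting earlier entries.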
Coordination Patterns
Section titled “Coordination Patterns”Collaboration CONOPS
When principals request agents to collaborate on a task, the Collaboration CONOPS (Concept of Operations) prevents duplication, conflicting outputs, and presentation confusion:
Lead Agent: The principal’s direct agent leads. If Principal A asks Fiducian and Alec to collaborate, Fiducian leads — initiates, drafts, collates, presents.
Contributing Agent: Provides feedback, additions, and review. Sends deltas, not parallel drafts. Does not present independently unless explicitly asked.
Flow:
- Lead proposes initial design/draft via NATS
- Contributor reviews, sends feedback and additions
- Lead incorporates, sends back for final review
- Lead presents to principal(s) via Discord
- Contributor stays silent unless explicitly asked to present
This mirrors how human teams coordinate — clear ownership prevents the “two people editing the same document” problem.
Handshake and Discovery
Before agents can exchange substantive messages, they complete a four-step handshake:
- Announcement — Agent declares identity, principal, loaded skill layers
- Verification — Recipient verifies Ed25519 signature against key registry
- Capability Exchange — Both agents share available tools, domains, and limitations
- Trust Establishment — Initial trust level set from relationship template and declared layers
Handshake state persists in Graphiti’s shared group with a 7-day TTL. Agents check for existing handshakes before re-initiating — if a valid handshake exists and skill layers haven’t changed, they skip straight to communication.
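The reuse check can be sketched as follows; the `HandshakeRecord` shape and function names are illustrative, not the actual Graphiti schema:

```python
# Skip re-initiating a handshake when a valid, unexpired one already
# exists and the peer's declared skill layers haven't changed.
import time
from dataclasses import dataclass

HANDSHAKE_TTL = 7 * 24 * 3600   # 7-day TTL, per the spec above

@dataclass
class HandshakeRecord:
    peer_id: str
    skill_layers: list
    established_at: float        # Unix timestamp

def needs_handshake(record, peer_layers, now=None):
    now = now if now is not None else time.time()
    if record is None:
        return True                                   # never shaken hands
    if now - record.established_at > HANDSHAKE_TTL:
        return True                                   # expired
    return record.skill_layers != peer_layers         # layers changed

rec = HandshakeRecord("alec-debra-001", [0, 1, 2, 3], time.time())
assert not needs_handshake(rec, [0, 1, 2, 3])   # reuse existing handshake
assert needs_handshake(rec, [0])                # layers changed: redo
```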
Conflict Resolution
Four escalation levels, each with distinct resolution:
| Level | Type | Resolution |
|---|---|---|
| 1 | Information — agents have different data | Reconcile sources, compare timestamps/provenance |
| 2 | Preference — principals want different things | Facilitate compromise, don’t take sides |
| 3 | Boundary — request exceeds authorized tier | Decline, explain, suggest proper authorization channel |
| 4 | Relationship — underlying human tension | Stay in your lane. Escalate to humans. Agents don’t play therapist. |
Tool Ecosystem
The MCP Gateway (mcp-tools namespace) provides federated, stateless access to 15+ MCP-compliant tool servers through a single REST API. Agents don’t connect to tool servers directly — the gateway handles routing, health checking, error normalization, and authenticated backend proxying — injecting per-backend credentials (OAuth tokens, API keys, GitHub App tokens) so that agents never handle raw secrets. The gateway also includes a token vending service that issues short-lived, scoped credentials on demand, replacing the static PATs that previously lived on agent pods. A shared agent-utils repository provides common skills and tools across all agents, with access managed through the credential broker — giving each agent scoped read access without maintaining separate PATs.
Key Tool Servers
Section titled “Key Tool Servers”| Server | Tools | Purpose |
|---|---|---|
| `foundry_finance` | 7 | Household financial data — balances, transactions, spending summaries, recurring charges, AI-powered analysis via Palantir AIP Logic (OSDK). Read-only, OAuth confidential client. |
| `graphiti` | 9 | Episodic + semantic memory via FalkorDB knowledge graph. Group-isolated writes (`fiducian`, `alec`, `shared`, `research`), cross-group search. Persists decisions, preferences, context across sessions. |
| `atlassian` | — | Jira project tracking (FAD project) and Confluence documentation |
| `websearch` | 3 | DuckDuckGo web search, news search, page fetch |
| `arxiv` | — | Academic paper search (arXiv preprint database) |
| `openalex` | — | OpenAlex academic database (200M+ works) |
| `github`* | — | GitHub repository operations, code search, file management. Retired as a standalone MCP server — access now flows entirely through the Credential Broker’s token vending endpoint, which issues dynamic, scoped installation tokens on demand. |
| `fetch` | — | URL fetching with markdown conversion |
| `context7` | — | Library/framework documentation lookup |
| + 6 more | — | biorxiv, patents, huggingface, wikidata, academic-research, read-website-fast |
Deep Research Skill
The research tool servers (arxiv, openalex, websearch, biorxiv, patents) are orchestrated by a custom Deep Research skill (v2.0.0) that provides structured methodology on top of raw tool access. The skill includes four utility scripts for source deduplication, bibliography generation, source tier classification (peer-reviewed → preprint → industry → grey literature), and structured findings tags that make research outputs machine-parseable. This turns ad-hoc web searches into reproducible, auditable research workflows — agents produce citable reports with provenance tracking, not just summaries of search results.
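As an illustration, source deduplication (one of the four utilities) reduces to keyed de-duplication across search backends; the field names (`doi`, `url`, `title`) are assumptions, not the script’s actual schema:

```python
# Merge results from multiple search backends, keeping the first hit for
# each normalized key (DOI preferred, then URL, then title).
def dedupe(sources):
    seen, unique = set(), []
    for s in sources:
        key = (s.get("doi") or s.get("url") or s.get("title", "")).lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

hits = [
    {"doi": "10.1/X", "title": "A"},
    {"doi": "10.1/x", "title": "A dup"},        # same DOI, different case
    {"url": "https://example.org/b", "title": "B"},
]
assert [s["title"] for s in dedupe(hits)] == ["A", "B"]
```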
Financial Integration Deep Dive
The foundry_finance server deserves special attention as an example of the architecture’s real-world utility. It connects to Principal A’s Palantir Foundry instance (his day-job platform) where household financial data from Plaid-connected accounts lives in a formal ontology (“MyDomain Ontology”) with PlaidTransaction objects carrying 28 properties.
The server exposes two tiers of tools:
- REST API tools (structured queries): `get_balances`, `get_transactions`, `get_spending_summary`, `get_alerts`
- AIP Logic tools (AI-powered, via OSDK): `get_recurring`, `analyze_spending`, `query_finances`
Financial data is classified Tier 2 (Family Context) under the spouse/partner relationship template — both household agents can access joint account data without per-request authorization.
Graphiti Memory Architecture
Agents wake up fresh each session. Graphiti provides cross-session continuity through a knowledge graph (FalkorDB) that stores episodic and semantic memory with group-level isolation:
| Group ID | Purpose | Access |
|---|---|---|
| `fiducian` | Fiducian’s personal context | Fiducian only |
| `alec` | Alec’s personal context | Alec only |
| `shared` | Household knowledge, coordination state | Both agents |
| `research` | Research findings and paper summaries | Both agents |
| `claude-code` | Codebase and infrastructure knowledge | ClaudeCodeAgent |
Writes use `group_id` (singular), searches use `group_ids` (plural array) — enabling cross-group queries without cross-group write access. Handshake state, trust decisions, and coordination agreements persist in the shared group, ensuring agents can resume collaboration without redundant handshakes.
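The asymmetry can be sketched with a stub client; the tool and parameter names mirror the convention described above rather than a documented Graphiti API:

```python
# Singular write scope vs. plural read scope.
class StubClient:
    """Stands in for an MCP client; returns the call it would make."""
    def call(self, tool, params):
        return {"tool": tool, "params": params}

def add_episode(client, text):
    # Writes land in exactly one group (singular group_id).
    return client.call("graphiti.add_episode",
                       {"group_id": "fiducian", "episode_body": text})

def search(client, query):
    # Searches may span several groups, read-only (plural group_ids).
    return client.call("graphiti.search",
                       {"group_ids": ["fiducian", "shared"], "query": query})

client = StubClient()
write = add_episode(client, "Decided to enable the lobster plugin")
hits = search(client, "lobster plugin decision")
assert "group_id" in write["params"] and "group_ids" in hits["params"]
```

Because the write path never accepts a list, an agent cannot accidentally (or through prompt manipulation) write into another agent’s group, while reads across its authorized groups remain cheap.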
Memory Reconciliation
The `shared` group serves as the authoritative source of truth for cross-agent state. A memory reconciliation system ensures consistency between agents’ local workspace memories and the shared knowledge graph:
- State-change episodes — when agents complete work that changes system state (deployments, config changes, skill updates), they write episodes to the `shared` group. Other agents discover these changes via cross-group search on their next session.
- Convention alignment — Definition of Done requirements, skill versions, and operational conventions are reconciled through Graphiti rather than requiring manual skill federation for every change.
- OpenClaw native integration — the platform’s `memory-core` subsystem provides built-in reconciliation hooks, replacing earlier custom heartbeat-based approaches.
LLM Backend
Graphiti’s entity extraction pipeline requires an LLM for processing episodes into graph nodes and edges. The backend migrated from Groq (hosted llama-3.3-70b, ~30 RPM rate limit, 67% episode failure rate under load) to Foundry AIP (~760 RPM, near-zero failure rate). This migration also refined the entity type schema from 9 generic defaults to 10 domain-specific types tuned to the household agent use case. Prometheus alerts on graphiti_episode_failures_total now provide real-time visibility into extraction health.
Incident and Recovery (FAD-473)
In March 2026, FalkorDB experienced a silent data loss — writes appeared to succeed but produced no searchable facts, and all existing graph data was gone. Root cause analysis revealed a persistent storage misconfiguration. The recovery process restored FalkorDB’s storage layer and validated write-through consistency. The incident reinforced a design principle already present in the architecture: Graphiti is supplemental memory, not the primary record. Each agent maintains file-based memory (memory/YYYY-MM-DD.md daily logs, MEMORY.md curated long-term) as the authoritative source. Graphiti adds cross-agent discoverability and semantic search, but the system degrades gracefully without it — no data was permanently lost because the file-based layer was unaffected.
Model Abstraction
LiteLLM (llm-gateway namespace) sits between agents and model providers, providing:
- Model routing — agents request capabilities, not specific models
- Provider abstraction — Anthropic Claude (cloud), Ollama (local ARM64 inference) behind a unified API
- Fallback chains — if a cloud provider is unavailable, route to local models for degraded-but-functional operation
- Cost tracking — per-agent, per-model usage metering
Ollama runs alongside LiteLLM for local inference tasks that don’t require frontier model capabilities — entity extraction for Graphiti, simple classification, embedding generation.
Operational Model Selection
Agents don’t use the same model for everything. A deliberate model selection strategy matches operation type to appropriate capability tier:
| Operation | Model Tier | Rationale |
|---|---|---|
| Main session (direct principal chat) | Opus | Complex reasoning, nuanced judgment, fiduciary decisions |
| Autonomous task pickup | Sonnet | Structured workflow execution, good enough for pickup evaluation |
| Health monitoring crons | Haiku | Simple check/alert pattern, cost-efficient at high frequency |
| Research deep-dives | Opus | Synthesis, cross-source evaluation, bibliography generation |
This isn’t just cost optimization — it’s operational discipline. Running Opus for a health check that pings six endpoints wastes capacity. Running Haiku for a fiduciary judgment call risks inadequate reasoning. The model tier is specified per cron job and per operation type, enforced through OpenClaw’s per-session model override.
Operational Maturity
The system has evolved beyond “agents that can do things” into “agents that manage their own work.” This operational layer is as important as the technical architecture — without it, agents are capable tools that still need constant human direction.
Kanban Workflow
Agent work is tracked in Jira with a Kanban board tuned for autonomous operation:
- Columns: Backlog → To Do → In Progress → In Review → Done
- Swimlanes: Label-based, one per agent (`agent:fiducian`, `agent:alec`, `agent:claude-code`)
- WIP limits: In Progress: 6, In Review: 3 — prevents agents in isolated cron sessions from claiming tasks simultaneously
The In Review status has a precise semantic: it means principal review only. Agent self-verification happens before transition — if an agent needs to verify its own work, that’s a separate linked task. Research tickets self-close when the deliverable is complete; they don’t need human sign-off on methodology.
Event-Driven Task Pickup
Task pickup is event-driven, not polled. When a Jira issue transitions to a work-ready state, a webhook fires through the MCP Gateway to the agent’s OpenClaw instance. The agent wakes, evaluates the task against its capabilities and current WIP, and either claims it (transitioning to In Progress) or passes. This replaced the earlier cron-based polling approach, which burned tokens scanning an empty backlog and introduced race conditions when multiple agents polled simultaneously.
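The claim-or-pass decision can be sketched as a pure function; the task field names and label encoding here are illustrative, not the actual webhook payload:

```python
# Claim a work-ready task only if it is offered for pickup, the agent has
# the required method skill, and the WIP limit leaves room.
WIP_LIMIT = 6   # the In Progress limit from the Kanban board above

def should_claim(task, my_skills, my_in_progress):
    if task.get("work") != "available":
        return False                      # not offered for autonomous pickup
    if task.get("method") not in my_skills:
        return False                      # agent lacks the method skill
    return my_in_progress < WIP_LIMIT     # respect the WIP limit

task = {"work": "available", "method": "deep-research"}
assert should_claim(task, {"deep-research", "documentation"}, 2)
assert not should_claim(task, {"implementation"}, 2)   # wrong skill set
assert not should_claim(task, {"deep-research"}, 6)    # WIP limit reached
```

Keeping the decision a side-effect-free predicate makes the pass/claim behavior testable independently of Jira and the webhook plumbing.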
Definition of Done Enforcement
A heartbeat-triggered verification system enforces the Definition of Done across all agents. Every 2–3 heartbeat cycles, a verification script queries Jira for recent Done transitions and checks four fields: Deliverable (non-empty), Acceptance Criteria (non-empty), Method (set), and a Jira comment linking to the deliverable. Issues that fail verification are automatically reverted to To Do with an actionable comment explaining what’s missing.
The enforcement is cross-agent — Fiducian’s heartbeat checks all FAD project completions, not just its own. This creates mutual accountability: the agent who closes a ticket isn’t the one who verifies it. A [dod-skip] bypass token in the issue summary allows intentional exceptions for lightweight tasks.
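A sketch of the four-field check and the `[dod-skip]` bypass; the issue dictionary shape is illustrative (the real script reads Jira’s REST payloads):

```python
# Return the list of DoD failures for a Done-transitioned issue;
# an empty list means the transition stands.
REQUIRED = ("deliverable", "acceptance_criteria", "method")

def dod_failures(issue):
    if "[dod-skip]" in issue.get("summary", ""):
        return []                        # intentional lightweight bypass
    missing = [f for f in REQUIRED if not issue.get(f)]
    if not issue.get("deliverable_comment_link"):
        missing.append("deliverable_comment_link")
    return missing

issue = {"summary": "Deploy gateway", "deliverable": "helm release",
         "acceptance_criteria": "pods Ready", "method": "configuration",
         "deliverable_comment_link": None}
assert dod_failures(issue) == ["deliverable_comment_link"]
```

A non-empty result drives the auto-revert: the issue goes back to To Do with the failure list posted as the actionable comment.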
Operational Safety Mechanisms
Autonomous agents operating 24/7 need guardrails beyond just “be careful.” Several mechanisms evolved from real incidents — each one addressing a specific failure mode that actually happened in production.
Search-Before-Create Gate. Before any Jira ticket creation, agents must run a duplicate detection script that searches for existing issues with similar summaries. This is a hard gate — the script returns exit 0 (safe), exit 1 (potential duplicates found, review required), or exit 2 (search error, do not create). The gate exists because context-limit restarts are the most dangerous scenario: the agent loses memory of what it created seconds ago and recreates it. One incident produced 14 duplicate tickets in a 94-second window before the pattern was caught.
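The gate’s exit-code contract can be sketched as follows; `search_issues` stands in for the real Jira duplicate search, which is not shown here:

```python
# Exit-code contract from the text above:
#   0 = safe to create, 1 = potential duplicates (review), 2 = search error.
def duplicate_gate(summary, search_issues):
    try:
        hits = search_issues(summary)
    except Exception:
        return 2            # search error: do NOT create
    return 1 if hits else 0

# A wrapper script would end with: sys.exit(duplicate_gate(...))
assert duplicate_gate("Deploy NATS bridge", lambda s: []) == 0
assert duplicate_gate("Deploy NATS bridge", lambda s: ["FAD-101"]) == 1
```

Treating a failed search as exit 2 (rather than "no duplicates found") is the safety-critical choice: when in doubt, the gate blocks creation.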
Context-Limit Recovery (Breadcrumb Pattern). Before starting any batch operation (creating multiple tickets, pushing multiple files), agents write a breadcrumb file recording the operation name, planned items, and completed items. If context is lost mid-operation, the next session finds the breadcrumb and resumes where it left off instead of starting over. The file is deleted on completion.
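A minimal sketch of the pattern; the file name and fields are illustrative, not the actual breadcrumb schema:

```python
# Record planned vs. completed items so a restarted session can resume a
# batch operation instead of redoing (and duplicating) it.
import json
import pathlib

BREADCRUMB = pathlib.Path("breadcrumb.json")

def start_batch(operation, items):
    BREADCRUMB.write_text(json.dumps(
        {"operation": operation, "planned": items, "completed": []}))

def mark_done(item):
    state = json.loads(BREADCRUMB.read_text())
    state["completed"].append(item)
    BREADCRUMB.write_text(json.dumps(state))

def resume_or_start(operation, items):
    if BREADCRUMB.exists():          # context was lost mid-operation
        state = json.loads(BREADCRUMB.read_text())
        return [i for i in state["planned"] if i not in state["completed"]]
    start_batch(operation, items)
    return items

def finish_batch():
    BREADCRUMB.unlink(missing_ok=True)   # deleted on successful completion
```

A session that restarts mid-batch calls `resume_or_start` and receives only the not-yet-completed items.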
Safety Stop System. A NATS-based halt/pause/resume mechanism that allows principals or other agents to stop an agent’s autonomous operations immediately. Safety stop messages are processed at the highest priority — above task pickup, above heartbeat checks, above any in-progress work. The system was designed after a data spill incident demonstrated that autonomous agents need an emergency brake, not just careful instructions.
NATS Audit Log on Wake. Every session, agents read their recent NATS inbox log (nats-inbox.jsonl) before doing anything else. This provides cross-session continuity for inter-agent exchanges that happened while the agent was asleep — preventing the “I didn’t see your message” problem that plagued early coordination attempts where messages fell into gaps between sessions.
Task Authoring Standard
Every agent-executable task carries four custom fields that make the difference between “a description a human could interpret” and “instructions an agent can execute”:
- Method — maps to a specific skill (`deep-research`, `brainstorming`, `implementation`, `configuration`, `documentation`)
- Deliverable — the concrete output (a file, a config change, a deployed service)
- Acceptance Criteria — verifiable conditions for “done”
- Agent Instructions — step-by-step execution guidance, including which tools to use and what to read first
Label conventions enforce ownership and work allocation: agent:* for assignment, work:available/work:assigned/work:collaborative for pickup semantics, lead:* for task leadership, and needs-spencer-review/needs-debra-review for principal-hold gates. The manual method explicitly marks tasks that require human execution — agents skip these during autonomous pickup.
Planning and Execution Stack
The autonomous work system is organized as a four-layer skill stack, where each layer has a distinct responsibility:
- backlog-planning — what to work on and when. Handles weekly planning reviews, daily triage, and on-demand “what’s next?” recommendations. Uses NATS to solicit contributions from all agents before proposing a plan to principals for approval.
- work-workflows — how to decompose a work request by type. Six work types (feature, bug, spike, skill, config, docs) each have an ordered pipeline with phase gates and skip conditions.
- task-authoring — how to write and close each task. Defines required fields, label conventions, the 7-step autonomous pickup flow, and Definition of Done.
- method skills (23 skills) — how to execute each phase. Deep research, brainstorming, TDD, systematic debugging, writing plans, and others.
Each layer depends on the one below it but never reaches down two levels. Planning never circumvents task-authoring guardrails. Task authoring never dictates decomposition strategy. This separation of concerns prevents the kind of cascading failures that happen when a single monolithic “agent workflow” tries to handle everything.
Workflow Automation (Lobster)
Multi-step agent operations — DoD verification, portfolio promotion, sprint planning, ticket pickup — historically burned 5–30+ LLM calls for pure orchestration. Lobster, a deterministic workflow runtime built into OpenClaw, replaces that overhead with token-free pipeline execution.
Pipelines are triggered by Jira webhooks, heartbeat timers, or PreToolUse hooks. Each pipeline parses a .lobster file and executes steps sequentially, piping each step’s stdout to the next step’s stdin — Unix-style composition with environment variables carrying structured data between stages. Steps can be shell scripts (Node.js, bash), LLM tasks (schema-validated JSON output), or openclaw.invoke calls (direct tool execution).
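The deterministic shell-step case can be sketched in a few lines; a real `.lobster` runner adds LLM tasks and `openclaw.invoke` steps on top of this same chaining:

```python
# Run steps sequentially, piping each step's stdout to the next step's
# stdin (Unix-style composition).
import subprocess

def run_pipeline(steps, env=None):
    data = b""
    for cmd in steps:              # each step is an argv list
        proc = subprocess.run(cmd, input=data, capture_output=True,
                              env=env, check=True)
        data = proc.stdout         # stdout of this step feeds the next
    return data

out = run_pipeline([["echo", "fad-101"], ["tr", "a-z", "A-Z"]])
# on a POSIX system this yields b"FAD-101\n"
```

Because each step is an ordinary process with `check=True`, a failing step halts the pipeline deterministically; no LLM call is involved in the sequencing.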
Deployed Pipelines
Section titled “Deployed Pipelines”| Pipeline | Replaces | Token Savings |
|---|---|---|
| DoD Verification | Manual 9-check DoD compliance audit | ~15k tokens/run → 0 |
| DoD Transition | PreToolUse hook orchestration for Done transitions | ~8k tokens/run → 0 |
| Sprint Planning | 30+ manual tool calls for weekly review | ~45k tokens/run → 0 |
| Work Decomposition | Manual sub-task creation from workflow templates | ~20k tokens/run → 0 |
| Commit-First Deploy | Manual verify → commit → push → deploy sequence | ~10k tokens/run → 0 |
| Session Close | Memory + learnings capture on session end | ~12k tokens/run → 0 |
The key insight: most agent “orchestration” is deterministic sequencing that doesn’t need intelligence. By moving orchestration to Lobster and reserving LLM calls for genuinely creative steps (analysis, writing, decision-making), the system achieves the same outcomes at a fraction of the token cost.
Three-Layer DoD Enforcement
The DoD Verification and DoD Transition pipelines work together with the existing heartbeat backstop to create a three-layer enforcement system:
- PreToolUse Hook — intercepts `jira_transition_issue` calls, blocking Done transitions until the Lobster pipeline passes
- Lobster Pipeline — runs 6 Jira field checks, memory file recency validation, Graphiti episode/fact verification, follow-on signal detection, and produces a unified 9-check report
- Heartbeat Backstop — every 2 hours, a verification script catches any tickets that escaped the first two layers and auto-reverts them to To Do
This three-layer approach catches failures at three different timescales: real-time (hook), near-real-time (pipeline), and periodic (heartbeat).
Security Posture
Section titled “Security Posture”Joint Security Review (FAD-455)
Before expanding agent tool access, Fiducian and ClaudeCodeAgent conducted a joint security assessment of 11 candidate tools. The review followed a fiduciary-first methodology: each agent independently assessed risks from their operational perspective, then merged findings into a consensus recommendation.
| Tool | Risk Level | Decision | Rationale |
|---|---|---|---|
| diffs | 🟢 None | Enable | Read-only file comparison |
| llm-task | 🟢 Low | Enable | JSON-only, schema-validated, no side effects |
| Brave Search | 🟢 Low | Enable | Web search with API key |
| loopDetection | 🟢 None | Enable | Safety feature (prevents infinite loops) |
| apply_patch | 🟡 Medium | Enable w/ mitigations | File integrity CronJob monitors seeded plugin files |
| lobster | 🟡 Medium | Enable w/ mitigations | Pipeline audit logging required |
| Discord voice | 🟡 Medium | Enable w/ mitigations | Discord role scoping + TTS-only |
| bash | 🔴 High | Defer | SafeBins filter has defense-in-depth value |
| browser | 🔴 High | Defer | Needs sandboxing (designed as MCP service) |
| nodes | 🔴 Critical | Deny | Child privacy — dependent’s data exposure |
| voice-call | 🔴 Critical | Deny | Impersonation risk violates Layer 0 |
The nodes denial is particularly instructive: the tool would give agents access to the host operating system, which runs in the same environment as the family’s child devices. Child privacy isn’t a judgment call — it’s a structural boundary. Similarly, voice-call was denied because real-time voice synthesis could enable impersonation, directly violating the trust framework’s immutable “no impersonation” principle.
Kubernetes Upgrade Discipline
Infrastructure changes follow a preflight → execute → verify pattern. The v1.28→v1.29 Kubernetes upgrade and Cilium v1.16→v1.17 migration included automated preflight checks: API compatibility verification, deprecated resource scanning, storage driver validation, and network policy migration assessment — all documented as go/no-go evidence before any changes are applied to the production cluster.
Deep Dives
The FAD system is documented across multiple focused pages. Each covers a specific subsystem in depth:
Platform Infrastructure
Section titled “Platform Infrastructure”- Home Kubernetes Cluster — the 4-node Orange Pi 5 cluster that runs everything
- K8s Deep Dive: Networking — Cilium eBPF, MetalLB, ingress-nginx, network policies
- K8s Deep Dive: Storage — Longhorn distributed block storage, replication, snapshots
- K8s Deep Dive: GitOps — Flux, Kustomizations, HelmRelease lifecycle
- Home Automation on Kubernetes — Home Assistant on K8s with PostgreSQL, Cilium policies, Longhorn PVCs
Agent Capabilities
Section titled “Agent Capabilities”- Credential Broker & Token Vending — dynamic, scoped GitHub tokens replacing static PATs
- Agent Memory Architecture — file-based topics + Graphiti knowledge graph + semantic search
- Self-Learning Feedback Loop — error capture, R/C/D evidence tracking, cross-agent knowledge sharing
- Autonomous Work Planning — Kanban automation, event-driven task pickup, backlog planning
- Agent Operational Cost Management — per-agent cost tracking, model tier selection, budget controls
Governance & Documentation
Section titled “Governance & Documentation”- Fiduciary Agent Framework — the trust and duty model that governs agent behavior
- Confluence Templates & Agent Documentation Patterns — structured documentation for agent-generated content
- Personal Finance on Palantir Foundry — the financial data pipeline powering household analytics
- Observability Stack — Prometheus, Grafana, alerting, Graphiti health monitoring
Specifications & Decisions
Section titled “Specifications & Decisions”- Inter-Agent Communications — the full protocol spec (JSON-RPC 2.0, NATS, Ed25519, sealed boxes)
- Trust Framework — Covey-derived trust model adapted for AI agents
- Architecture Decision Records — 7 ADRs documenting key technical choices
What This Demonstrates
Distributed Systems Design
This is a distributed system with real constraints. Agents are separate processes in separate pods with separate storage. They communicate through message passing (NATS), not shared memory. State coordination uses a graph database (Graphiti) with group-level isolation. The MCP Gateway provides service discovery and routing without coupling agents to specific tool implementations.
The system handles the realities of distributed computing: network partitions (Cilium network policies), state consistency (Graphiti as authoritative shared state), message ordering (JSON-RPC correlation IDs), and failure modes (MCP Gateway health checks, NATS durability).
Protocol Design
The inter-agent protocol is a from-scratch design that needed to solve a real problem: how do two AI agents, each with undivided loyalty to a different human, communicate without either compromising its principal’s interests?
The solution layers identity (JSON-RPC headers), authentication (Ed25519 signatures), confidentiality (sealed boxes), authorization (information sharing tiers), and transparency (audit logging + Discord summaries) into a coherent stack. The JSON-RPC 2.0 alignment was deliberate — it provides interoperability with the broader MCP ecosystem while supporting the domain-specific requirements of fiduciary agents.
Security-First Architecture
Security isn’t bolted on — it’s the foundation. The platform has evolved from static credentials (long-lived PATs on pods) to a fully dynamic security model: the Credential Broker issues short-lived, per-request scoped tokens via a GitHub App backend, eliminating persistent secrets from agent pods entirely. Every inter-agent message is cryptographically signed. Tier 3/4 data is end-to-end encrypted such that transport infrastructure can verify sender identity without reading message content. Kubernetes network policies enforce namespace-level isolation. CiliumNetworkPolicy provides L3-L7 enforcement. Agents can’t reach each other’s workspaces. Home Assistant (which controls physical devices) is locked to minimum necessary traffic.
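The credential-broker pattern reduces to a small contract: tokens are minted per request, scoped to one repository and permission set, and expire quickly. A stub sketch — the GitHub App installation-token exchange is elided, and every name here is illustrative:

```python
# Sketch of per-request token vending with scope and expiry checks.
# The broker backend (GitHub App exchange) is stubbed out; repo names,
# permission strings, and the ScopedToken shape are illustrative.

import secrets
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class ScopedToken:
    value: str
    repo: str
    permissions: frozenset   # e.g. {"contents:read"}
    expires_at: float

def vend_token(repo: str, permissions: set[str],
               ttl_s: int = 600) -> ScopedToken:
    """Mint a short-lived token scoped to one repo; in production this
    would be backed by a GitHub App installation-token exchange."""
    return ScopedToken(value=f"ghs_{secrets.token_hex(16)}",
                       repo=repo,
                       permissions=frozenset(permissions),
                       expires_at=time.time() + ttl_s)

def authorize(token: ScopedToken, repo: str, needed: str) -> bool:
    # All three gates must hold: not expired, right repo, right permission.
    return (time.time() < token.expires_at
            and token.repo == repo
            and needed in token.permissions)
```

Nothing persistent lives on the pod: an expired or wrongly-scoped token fails closed, and a fresh request goes back to the broker.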
The trust model isn’t “trust everything inside the cluster.” It’s “verify identity, classify information, enforce boundaries, log everything, and give principals full visibility.”
Multi-Principal Trust
The hardest design problem in the system isn’t technical — it’s philosophical. How do you build a cooperative multi-agent system where each agent has undivided loyalty to a different person?
The answer isn’t compromise. It’s protocol. Information sharing tiers make the boundaries structural rather than judgmental. Relationship templates provide customizable defaults. The nudge protocol enables cooperation (hints for kindness) without enabling surveillance (source can’t be reconstructed). Conflict resolution has four explicit levels with clear escalation paths. Emergency overrides have four narrow conditions with mandatory post-emergency reporting.
The framework explicitly adopts Nash equilibrium thinking: the best outcomes for each principal come from cooperation, not from “winning” at the other’s expense. Positive-sum over zero-sum. But this cooperation is constrained — it never requires either agent to compromise its fiduciary duty.
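The tier mechanism is the kind of boundary that can live in code rather than in judgment. A toy sketch, assuming higher tier numbers mean more sensitive data (tier values and the example policy are illustrative, not the framework's actual schema):

```python
# Sketch of tier-gated information sharing: each fact carries a tier, each
# relationship carries the highest tier it may receive, and the check is
# structural. Recipient names and tier assignments are illustrative.

SHARE_POLICY = {  # recipient -> highest tier that may flow to them
    "spouse-agent": 2,
    "household-dashboard": 1,
}

def may_share(fact_tier: int, recipient: str) -> bool:
    """Higher tier = more sensitive. Unknown recipients get nothing;
    in this toy policy, Tier 3/4 never leaves the agent at all."""
    return fact_tier <= SHARE_POLICY.get(recipient, 0)
```

Because the default for an unknown recipient is tier 0, the policy fails closed — exactly the "structural rather than judgmental" property the framework is after.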
Operational Maturity
The system doesn’t just build things — it governs itself. Definition of Done enforcement prevents agents from claiming completion without evidence. Event-driven task pickup eliminates polling overhead and race conditions. Safety-stop mechanisms provide emergency override capability. Joint security reviews ensure tool expansion follows a structured risk assessment, not ad-hoc enablement.
This operational governance layer is what separates a demo from a production system. Agents make mistakes — they skip documentation, claim tasks that are blocked, close tickets without follow-on work. The governance layer catches those mistakes structurally, through hooks and pipelines and backstops, rather than relying on agent “judgment” to always be correct.
Workflow Optimization
Lobster pipelines represent a meta-insight about agent systems: most of what agents spend tokens on isn’t thinking — it’s sequencing. DoD checks, sprint planning, ticket creation, memory maintenance — these are deterministic workflows that don’t require intelligence. Moving them to a pipeline runtime reduced token consumption by orders of magnitude while improving reliability (deterministic pipelines don’t hallucinate steps or skip checks).
This is the same pattern that appears in human organizations: you don’t need a senior engineer to run a deployment checklist. You need a senior engineer to design the checklist, and a reliable process to execute it.
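The sequencing-not-thinking point can be made concrete: a deterministic runner executes a fixed checklist in order and stops at the first failure, with no model in the loop. A toy sketch with stubbed DoD steps (step names echo the article; the bodies are stand-ins):

```python
# Sketch of a deterministic checklist runner in the spirit of the Lobster
# pipelines described above. Step names are illustrative stubs.

def run_pipeline(steps):
    """Execute steps in order; halt at the first failure so no step can
    be skipped, reordered, or 'hallucinated'."""
    completed = []
    for name, step in steps:
        if not step():
            return {"status": "failed", "at": name, "completed": completed}
        completed.append(name)
    return {"status": "ok", "completed": completed}

dod_checks = [
    ("tests-pass", lambda: True),
    ("docs-updated", lambda: True),
    ("ticket-linked", lambda: True),
]
result = run_pipeline(dod_checks)
```

The runner is cheap, repeatable, and auditable — which is the whole argument for moving sequencing out of the model and into a pipeline runtime.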
Full-Stack Integration
The system spans every layer: ARM64 hardware → vendor Linux kernel → vanilla Kubernetes (kubeadm) → Cilium eBPF networking → Longhorn distributed storage → Helm-deployed workloads → MCP protocol layer → agent runtime → trust framework → Lobster workflow automation → human communication channels. There’s no managed service hiding the complexity. Every layer is visible, operable, and understood — from debugging kernel BTF support on Rockchip vendor kernels to designing information sharing tier semantics for household agents.
This is not a prototype. It runs 24/7, manages real finances, controls real home automation, and handles real family coordination. The infrastructure decisions — Cilium over kube-proxy, Longhorn over NFS, vanilla kubeadm over K3s, JSON-RPC 2.0 over ad-hoc messaging — were made under real constraints and validated by daily operation. The platform runs OpenClaw 2026.3.2 with 23 federated skills, event-driven task pickup, cross-agent DoD enforcement, and a NATS Bridge Plugin that enables real-time inter-agent communication with 5–15 second delivery latency.