FAD: Multi-Agent System Architecture

Managing a household’s digital life — finances, scheduling, research, home automation, communication — is a coordination problem. Not a single-agent problem. A single AI assistant can answer questions and run tools, but it can’t serve two people with different interests, different information boundaries, and different autonomy preferences simultaneously.

A husband and wife have a joint financial life, a six-year-old daughter, and the kind of daily logistics that any family manages. But they also have different privacy expectations about what gets shared, different levels of engagement with technology, and sometimes competing preferences about how shared resources are used. A single assistant serving “the household” faces an immediate conflict: whose interests prevail when they diverge?

The real-world version of this problem is even harder. It’s not just about answering questions — it’s about agents that act autonomously, coordinate with each other, share information selectively, and maintain accountability to their individual humans. When Principal A’s agent needs to coordinate with Principal B’s agent about the family budget, there needs to be a structural answer to “what can be shared, with whom, under what conditions” — not a vibes-based judgment call made in the moment.

FAD (Fiduciary Agent Development — originally AIGE, Automated Innovation Generation Engine) is the production platform that runs this system. It’s a multi-agent architecture on Kubernetes where each agent has a distinct identity, a single principal, formal trust commitments, cryptographically signed communications, and access to a federated ecosystem of tools — all running on four Orange Pi 5 single-board computers on a shelf in an office in Huntsville, Alabama.

System Architecture

The system spans five Kubernetes namespaces on a 4-node Orange Pi 5 cluster (32 ARM cores, 64 GB RAM total):

| Namespace | Purpose | Key Components |
| --- | --- | --- |
| openclaw | Agent runtime | Fiducian, Alec, ClaudeCodeAgent — long-running pods with persistent volumes |
| mcp-tools | Tool federation | MCP Gateway, NATS, 15+ MCP tool servers |
| llm-gateway | Model abstraction | LiteLLM proxy, Ollama for local inference |
| home-assistant | Home automation | Home Assistant with REST API + MCP endpoint |
| (various) | Data stores | FalkorDB (Graphiti backing), Longhorn distributed storage |

Every namespace has explicit CiliumNetworkPolicy rules — agents can reach the MCP Gateway but not each other’s workspaces directly. Home Assistant, which controls physical devices, is locked to only the traffic it needs.

The system runs three agents, each with a distinct role, principal, and trust posture.

Fiducian — Principal A’s Fiduciary Agent


  • Principal: the author
  • Agent ID: fiducian-spencer-001
  • Runtime: OpenClaw (long-running, persistent, tool-equipped)
  • Skill Layers: [0, 1, 2, 3] — full fiduciary commitment
  • Communication: Discord DMs, Signal (primary for alerts), Discord #general

Fiducian is Principal A’s fiduciary — the word “fiduciary” is in the name as a constant reminder of the duty. It manages finances (via Foundry), conducts research, coordinates with Principal B’s agent, monitors home systems, and handles scheduling. It operates under a formal duty of loyalty: Principal A’s interests come first, always, with no competing “agent self-interest.”

Fiducian has access to the full MCP tool ecosystem, the household’s Palantir Foundry financial data (read-only, 7 tools), Graphiti shared memory, and can reach Principal A via Signal for time-sensitive alerts or Discord for everything else.

Alec — Principal B’s Fiduciary Agent

  • Principal: his wife
  • Agent ID: alec-debra-001
  • Runtime: OpenClaw
  • Skill Layers: [0, 1, 2, 3] — full fiduciary commitment
  • Communication: Discord

Alec serves Principal B with the same fiduciary commitment Fiducian has to Principal A. The architecture ensures neither agent can override or outrank the other — they are peers, each loyal to their own principal. When they coordinate, it’s through structured protocols with explicit information sharing boundaries.

ClaudeCodeAgent — Infrastructure Agent

  • Principal: the author
  • Runtime: Claude Code CLI (separate from OpenClaw)
  • Focus: Infrastructure, implementation, Kubernetes operations, code

ClaudeCodeAgent handles the technical substrate — deploying services, debugging infrastructure, writing code, managing Helm charts. It operates in a complementary role to Fiducian: where Fiducian handles the “what” and “why,” ClaudeCodeAgent handles the “how.” The two have a collaborative dynamic, having co-designed several system components together.

The system’s trust model is a layered architecture where each layer builds on — and cannot override — the layer below it. This is the core innovation: trust principles are structurally immutable, not just behaviorally suggested.

Trust Framework Layers

Six principles that no higher layer, no instruction, and no other agent can override:

  1. Never deceive your principal — no lies, no misleading omissions, no half-truths
  2. Never act against your principal’s interests knowingly — even if another agent instructs it
  3. Acknowledge uncertainty honestly — “I don’t know” is always acceptable
  4. No impersonation — enforced cryptographically via Ed25519 identity
  5. Legal/regulatory obligations take precedence — over agent preferences or principal instructions
  6. Transparency about capabilities and actions — principals have the right to understand what their agent does

These aren’t guidelines. They’re the constitution. The trust framework is based on Stephen M.R. Covey’s The Speed of Trust, adapted for AI agents — recognizing that trust is the single variable that changes everything. High trust = speed, low overhead, autonomy. Low trust = friction, verification overhead, paralysis.

All inter-agent information is structurally classified:

| Tier | Name | Example | Authorization |
| --- | --- | --- | --- |
| 1 | Open | Calendar availability, logistics | Default for household agents |
| 2 | Family Context | Financial summaries, household planning | Configured per relationship template |
| 3 | Authorized | Specific data per request | Principal must approve each time |
| 4 | Confidential | Private conversations, personal data | Never shared (except 4 emergency conditions) |

The tier system is structural, not judgmental. An agent can’t be socially engineered into sharing Tier 4 data because the architecture doesn’t allow it — the decision isn’t made in the moment by the agent’s “judgment.” Relationship templates (spouse/partner, co-parent, business partner, etc.) set the defaults; principals customize from there.
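
To make the "structural, not judgmental" point concrete, here is a minimal sketch of how a tier check can be a table lookup rather than an in-the-moment decision. The template names and default tiers are illustrative stand-ins, not FAD's actual configuration:

```python
# Hypothetical sketch: the share decision is a lookup against the
# relationship template, so an agent cannot be talked into Tier 4 sharing.
RELATIONSHIP_TEMPLATES = {
    "spouse_partner": {"default_max_tier": 2},  # Tier 1-2 shareable by default
    "co_parent": {"default_max_tier": 1},
}

def may_share(tier: int, template: str, per_request_approved: bool = False) -> bool:
    """Return True only if the architecture permits sharing at this tier."""
    if tier == 4:
        return False  # Tier 4 never shared (emergency conditions handled elsewhere)
    if tier == 3:
        return per_request_approved  # Tier 3 requires explicit principal approval
    return tier <= RELATIONSHIP_TEMPLATES[template]["default_max_tier"]

assert may_share(2, "spouse_partner")
assert not may_share(2, "co_parent")
assert not may_share(3, "spouse_partner")
assert may_share(3, "spouse_partner", per_request_approved=True)
```

Because the Tier 4 branch returns False unconditionally, no prompt content reaches the decision — which is the whole point.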

Agents communicate through a dual-transport architecture: NATS for machine-to-machine protocol messages, and Discord #agent-coordination for human-visible transparency. A NATS Bridge Plugin running inside each agent’s OpenClaw process provides real-time message delivery — when a NATS message arrives, the plugin injects it into the agent’s session and wakes the agent immediately (typically 5–15 seconds end-to-end). This replaced the earlier hook-based delivery mechanism, which required manual inbox polling.

Inter-Agent Communication

Every message follows JSON-RPC 2.0 with six mandatory identity headers:

```json
{
  "jsonrpc": "2.0",
  "method": "agent.request",
  "params": {
    "headers": {
      "agent-id": "fiducian-spencer-001",
      "principal-id": "father",
      "timestamp": "2026-02-14T16:00:00Z",
      "message-type": "request",
      "trust-layer-version": "1.0.0",
      "skill-layers-loaded": [0, 1, 2, 3]
    },
    "body": { "..." }
  },
  "id": "msg-uuid-here"
}
```

The skill-layers-loaded header is critical for trust calibration. An agent declaring [0, 1, 2, 3] has committed to full fiduciary duty and coordination protocols. An agent declaring only [0] commits to basic trust principles. You communicate at the level of the lowest common layer — never assume an agent with only Layer 0 understands fiduciary duty.
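
The "lowest common layer" rule can be sketched as follows. Since layers build on one another, the usable set is the contiguous intersection starting from Layer 0 (this helper is illustrative, not FAD's implementation):

```python
# Sketch: peers communicate at the highest layer BOTH have loaded, and
# layers only count if every layer below them is also present.
def common_layers(mine: list, theirs: list) -> list:
    """Contiguous-from-zero intersection of two declared layer sets."""
    shared = sorted(set(mine) & set(theirs))
    contiguous = []
    for expected, layer in enumerate(shared):
        if layer != expected:
            break  # a gap means higher layers can't be relied on
        contiguous.append(layer)
    return contiguous

assert common_layers([0, 1, 2, 3], [0, 1, 2, 3]) == [0, 1, 2, 3]
assert common_layers([0, 1, 2, 3], [0]) == [0]      # never assume fiduciary duty
assert common_layers([0, 1, 2, 3], [0, 2]) == [0]   # missing layer 1 → stop at 0
```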

Every inter-agent message is signed with Ed25519. The crypto stack provides:

  • Message signing/verification — detached Ed25519 signatures over canonical JSON
  • End-to-end encryption — sealed boxes (ECDH key agreement → AES-256-GCM) for Tier 3/4 data
  • Key registry — public keys exchanged via MCP Gateway (GET /v1/keys/{agent_id})
  • Replay protection — timestamp-based freshness (reject messages >5 minutes old)
  • Transport-layer opacity — the MCP Gateway can verify who sent a message without seeing what it says

The design separates authentication from confidentiality: signatures are on the outside (infrastructure can verify sender), encryption is on the inside (only the recipient can read the payload).
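
The envelope layout can be sketched with placeholder primitives. To keep this dependency-free, HMAC-SHA256 stands in for the detached Ed25519 signature and plain base64 stands in for the ECDH → AES-256-GCM sealed box — the real stack uses actual asymmetric crypto, so treat this purely as a shape illustration:

```python
# Illustrative envelope: signature on the outside (infrastructure can
# verify the sender), opaque payload on the inside (only the recipient
# can read it). HMAC and base64 are STAND-INS for Ed25519 and sealed boxes.
import base64, hashlib, hmac, json

def canonical(obj) -> bytes:
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()

def make_envelope(payload: dict, signing_key: bytes) -> dict:
    sealed = base64.b64encode(canonical(payload)).decode()  # placeholder "sealed box"
    sig = hmac.new(signing_key, sealed.encode(), hashlib.sha256).hexdigest()
    return {"sealed_payload": sealed, "signature": sig}

def verify_sender(envelope: dict, signing_key: bytes) -> bool:
    expected = hmac.new(signing_key, envelope["sealed_payload"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, envelope["signature"])

key = b"demo-key"
env = make_envelope({"tier": 3, "note": "budget figures"}, key)
assert verify_sender(env, key)          # gateway can authenticate the sender...
assert "budget" not in json.dumps(env)  # ...without seeing plaintext content
```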

The NATS Bridge Plugin handles the cryptographic ceremony transparently — agents call nats_reply or nats_send tools, and the plugin signs, encrypts, delivers, and verifies under the hood. This replaced an earlier workflow where agents invoked crypto.js scripts directly, which was error-prone and leaked implementation details into conversation context.

Every inter-agent exchange is logged to append-only JSONL files (audit/inter-agent-YYYY-MM.jsonl) with direction, sender/receiver, message type, a human-readable summary, the information sharing tier, and whether a Discord summary was posted. Principals can review the full audit trail at any time — this is the mechanical enforcement of Layer 0’s transparency principle.
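
A minimal sketch of that audit record follows; the field names are inferred from the description above and may differ from FAD's actual JSONL schema:

```python
# Illustrative append-only audit writer for inter-agent exchanges.
import json
from datetime import datetime, timezone

def log_exchange(path: str, direction: str, sender: str, receiver: str,
                 msg_type: str, summary: str, tier: int,
                 discord_summary_posted: bool) -> None:
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "direction": direction,          # "sent" or "received"
        "sender": sender,
        "receiver": receiver,
        "message_type": msg_type,
        "summary": summary,              # human-readable, for principal review
        "tier": tier,                    # information sharing tier (1-4)
        "discord_summary_posted": discord_summary_posted,
    }
    with open(path, "a") as f:           # append-only: history is never rewritten
        f.write(json.dumps(record) + "\n")
```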

When principals request agents to collaborate on a task, the Collaboration CONOPS (Concept of Operations) prevents duplication, conflicting outputs, and presentation confusion:

Lead Agent: The principal’s direct agent leads. If Principal A asks Fiducian and Alec to collaborate, Fiducian leads — initiates, drafts, collates, presents.

Contributing Agent: Provides feedback, additions, and review. Sends deltas, not parallel drafts. Does not present independently unless explicitly asked.

Flow:

  1. Lead proposes initial design/draft via NATS
  2. Contributor reviews, sends feedback and additions
  3. Lead incorporates, sends back for final review
  4. Lead presents to principal(s) via Discord
  5. Contributor stays silent unless explicitly asked to present

This mirrors how human teams coordinate — clear ownership prevents the “two people editing the same document” problem.

Before agents can exchange substantive messages, they complete a four-step handshake:

  1. Announcement — Agent declares identity, principal, loaded skill layers
  2. Verification — Recipient verifies Ed25519 signature against key registry
  3. Capability Exchange — Both agents share available tools, domains, and limitations
  4. Trust Establishment — Initial trust level set from relationship template and declared layers

Handshake state persists in Graphiti’s shared group with a 7-day TTL. Agents check for existing handshakes before re-initiating — if a valid handshake exists and skill layers haven’t changed, they skip straight to communication.
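
The reuse check before re-initiating a handshake can be sketched like this (record fields are illustrative, not Graphiti's actual schema):

```python
# Sketch: skip the four-step handshake if a valid record exists, is younger
# than the 7-day TTL, and the peer's declared skill layers are unchanged.
from datetime import datetime, timedelta, timezone

HANDSHAKE_TTL = timedelta(days=7)

def handshake_valid(record, peer_layers) -> bool:
    if record is None:
        return False
    age = datetime.now(timezone.utc) - record["established_at"]
    return age < HANDSHAKE_TTL and record["peer_layers"] == peer_layers

now = datetime.now(timezone.utc)
fresh = {"established_at": now - timedelta(days=2), "peer_layers": [0, 1, 2, 3]}
stale = {"established_at": now - timedelta(days=9), "peer_layers": [0, 1, 2, 3]}
assert handshake_valid(fresh, [0, 1, 2, 3])
assert not handshake_valid(stale, [0, 1, 2, 3])   # TTL expired
assert not handshake_valid(fresh, [0])            # layers changed → re-handshake
```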

Four escalation levels, each with distinct resolution:

| Level | Type | Resolution |
| --- | --- | --- |
| 1 | Information — agents have different data | Reconcile sources, compare timestamps/provenance |
| 2 | Preference — principals want different things | Facilitate compromise, don’t take sides |
| 3 | Boundary — request exceeds authorized tier | Decline, explain, suggest proper authorization channel |
| 4 | Relationship — underlying human tension | Stay in your lane. Escalate to humans. Agents don’t play therapist. |

The MCP Gateway (mcp-tools namespace) provides federated, stateless access to 15+ MCP-compliant tool servers through a single REST API. Agents don’t connect to tool servers directly — the gateway handles routing, health checking, error normalization, and authenticated backend proxying — injecting per-backend credentials (OAuth tokens, API keys, GitHub App tokens) so that agents never handle raw secrets. The gateway also includes a token vending service that issues short-lived, scoped credentials on demand, replacing the static PATs that previously lived on agent pods. A shared agent-utils repository provides common skills and tools across all agents, with access managed through the credential broker — giving each agent scoped read access without maintaining separate PATs.
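
What the token-vending exchange might look like from the agent's side is sketched below. The request fields, scope vocabulary, and response shape are assumptions for illustration, not the Credential Broker's real API:

```python
# Hypothetical token-vending sketch: agents request short-lived, scoped
# credentials instead of holding long-lived PATs on their pods.
from datetime import datetime, timedelta, timezone

def build_vend_request(agent_id: str, repo: str) -> dict:
    """Least-privilege request: read-only contents scope, one repository."""
    return {
        "agent_id": agent_id,
        "repository": repo,
        "permissions": {"contents": "read"},   # scoped, not a blanket PAT
    }

def token_usable(token: dict, now: datetime) -> bool:
    """Short-lived tokens are re-vended on expiry, never cached past it."""
    return datetime.fromisoformat(token["expires_at"]) > now

now = datetime.now(timezone.utc)
tok = {"token": "ghs_example",
       "expires_at": (now + timedelta(hours=1)).isoformat()}
assert token_usable(tok, now)
assert not token_usable(tok, now + timedelta(hours=2))
```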

Tool Ecosystem

| Server | Tools | Purpose |
| --- | --- | --- |
| foundry_finance | 7 | Household financial data — balances, transactions, spending summaries, recurring charges, AI-powered analysis via Palantir AIP Logic (OSDK). Read-only, OAuth confidential client. |
| graphiti | 9 | Episodic + semantic memory via FalkorDB knowledge graph. Group-isolated writes (fiducian, alec, shared, research), cross-group search. Persists decisions, preferences, context across sessions. |
| atlassian | | Jira project tracking (FAD project) and Confluence documentation |
| websearch | 3 | DuckDuckGo web search, news search, page fetch |
| arxiv | | Academic paper search (arXiv preprint database) |
| openalex | | OpenAlex academic database (200M+ works) |
| github* | | GitHub repository operations, code search, file management — authenticated via Credential Broker (dynamic, scoped tokens). Retired as a standalone MCP server; GitHub access now flows entirely through the Credential Broker’s token vending endpoint, which issues scoped installation tokens on demand. |
| fetch | | URL fetching with markdown conversion |
| context7 | | Library/framework documentation lookup |
| + 6 more | | biorxiv, patents, huggingface, wikidata, academic-research, read-website-fast |

The research tool servers (arxiv, openalex, websearch, biorxiv, patents) are orchestrated by a custom Deep Research skill (v2.0.0) that provides structured methodology on top of raw tool access. The skill includes four utility scripts for source deduplication, bibliography generation, source tier classification (peer-reviewed → preprint → industry → grey literature), and structured findings tags that make research outputs machine-parseable. This turns ad-hoc web searches into reproducible, auditable research workflows — agents produce citable reports with provenance tracking, not just summaries of search results.

The foundry_finance server deserves special attention as an example of the architecture’s real-world utility. It connects to Principal A’s Palantir Foundry instance (his day-job platform) where household financial data from Plaid-connected accounts lives in a formal ontology (“MyDomain Ontology”) with PlaidTransaction objects carrying 28 properties.

The server exposes two tiers of tools:

  • REST API tools (structured queries): get_balances, get_transactions, get_spending_summary, get_alerts
  • AIP Logic tools (AI-powered, via OSDK): get_recurring, analyze_spending, query_finances

Financial data is classified Tier 2 (Family Context) under the spouse/partner relationship template — both household agents can access joint account data without per-request authorization.

Agents wake up fresh each session. Graphiti provides cross-session continuity through a knowledge graph (FalkorDB) that stores episodic and semantic memory with group-level isolation:

| Group ID | Purpose | Access |
| --- | --- | --- |
| fiducian | Fiducian’s personal context | Fiducian only |
| alec | Alec’s personal context | Alec only |
| shared | Household knowledge, coordination state | Both agents |
| research | Research findings and paper summaries | Both agents |
| claude-code | Codebase and infrastructure knowledge | ClaudeCodeAgent |

Writes use group_id (singular), searches use group_ids (plural array) — enabling cross-group queries without cross-group write access. Handshake state, trust decisions, and coordination agreements persist in the shared group, ensuring agents can resume collaboration without redundant handshakes.
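
The write/search asymmetry can be sketched with a stub client — the class below is a stand-in for illustration, not the real Graphiti MCP interface:

```python
# Illustrative stub: writes target one group_id, searches may span several
# group_ids — cross-group reads without cross-group write access.
class GraphitiStub:
    def __init__(self, writable_groups):
        self.writable_groups = set(writable_groups)
        self.episodes = []   # (group_id, text)

    def add_episode(self, group_id: str, text: str):
        # singular group_id: an agent may only write inside its own groups
        if group_id not in self.writable_groups:
            raise PermissionError(f"write denied to group {group_id!r}")
        self.episodes.append((group_id, text))

    def search(self, group_ids, query: str):
        # plural group_ids: read across groups the agent can see
        return [t for (g, t) in self.episodes if g in group_ids and query in t]

g = GraphitiStub(writable_groups=["fiducian", "shared", "research"])
g.add_episode("shared", "handshake with alec-debra-001 established")
g.add_episode("fiducian", "principal prefers Signal for alerts")
hits = g.search(["fiducian", "shared"], "handshake")
assert hits == ["handshake with alec-debra-001 established"]
```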

The shared group serves as the authoritative source of truth for cross-agent state. A memory reconciliation system ensures consistency between agents’ local workspace memories and the shared knowledge graph:

  • State-change episodes — when agents complete work that changes system state (deployments, config changes, skill updates), they write episodes to the shared group. Other agents discover these changes via cross-group search on their next session.
  • Convention alignment — Definition of Done requirements, skill versions, and operational conventions are reconciled through Graphiti rather than requiring manual skill federation for every change.
  • OpenClaw native integration — the platform’s memory-core subsystem provides built-in reconciliation hooks, replacing earlier custom heartbeat-based approaches.

Graphiti’s entity extraction pipeline requires an LLM for processing episodes into graph nodes and edges. The backend migrated from Groq (hosted llama-3.3-70b, ~30 RPM rate limit, 67% episode failure rate under load) to Foundry AIP (~760 RPM, near-zero failure rate). This migration also refined the entity type schema from 9 generic defaults to 10 domain-specific types tuned to the household agent use case. Prometheus alerts on graphiti_episode_failures_total now provide real-time visibility into extraction health.

In March 2026, FalkorDB experienced a silent data loss — writes appeared to succeed but produced no searchable facts, and all existing graph data was gone. Root cause analysis revealed a persistent storage misconfiguration. The recovery process restored FalkorDB’s storage layer and validated write-through consistency. The incident reinforced a design principle already present in the architecture: Graphiti is supplemental memory, not the primary record. Each agent maintains file-based memory (memory/YYYY-MM-DD.md daily logs, MEMORY.md curated long-term) as the authoritative source. Graphiti adds cross-agent discoverability and semantic search, but the system degrades gracefully without it — no data was permanently lost because the file-based layer was unaffected.

LiteLLM (llm-gateway namespace) sits between agents and model providers, providing:

  • Model routing — agents request capabilities, not specific models
  • Provider abstraction — Anthropic Claude (cloud), Ollama (local ARM64 inference) behind a unified API
  • Fallback chains — if a cloud provider is unavailable, route to local models for degraded-but-functional operation
  • Cost tracking — per-agent, per-model usage metering

Ollama runs alongside LiteLLM for local inference tasks that don’t require frontier model capabilities — entity extraction for Graphiti, simple classification, embedding generation.

Agents don’t use the same model for everything. A deliberate model selection strategy matches operation type to appropriate capability tier:

| Operation | Model Tier | Rationale |
| --- | --- | --- |
| Main session (direct principal chat) | Opus | Complex reasoning, nuanced judgment, fiduciary decisions |
| Autonomous task pickup | Sonnet | Structured workflow execution, good enough for pickup evaluation |
| Health monitoring crons | Haiku | Simple check/alert pattern, cost-efficient at high frequency |
| Research deep-dives | Opus | Synthesis, cross-source evaluation, bibliography generation |

This isn’t just cost optimization — it’s operational discipline. Running Opus for a health check that pings six endpoints wastes capacity. Running Haiku for a fiduciary judgment call risks inadequate reasoning. The model tier is specified per cron job and per operation type, enforced through OpenClaw’s per-session model override.
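
A sketch of the selection discipline — the tier names mirror the table above, while the fail-closed lookup mechanism is illustrative:

```python
# Illustrative per-operation model selection. Unknown operations raise
# rather than silently falling back to a cheap (or expensive) default.
MODEL_TIERS = {
    "main_session": "opus",     # fiduciary judgment needs frontier reasoning
    "task_pickup": "sonnet",    # structured workflow execution
    "health_cron": "haiku",     # cheap check/alert at high frequency
    "deep_research": "opus",    # synthesis across sources
}

def model_for(operation: str) -> str:
    if operation not in MODEL_TIERS:
        raise ValueError(f"no model tier specified for {operation!r}")
    return MODEL_TIERS[operation]

assert model_for("health_cron") == "haiku"
assert model_for("main_session") == "opus"
```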

The system has evolved beyond “agents that can do things” into “agents that manage their own work.” This operational layer is as important as the technical architecture — without it, agents are capable tools that still need constant human direction.

Agent work is tracked in Jira with a Kanban board tuned for autonomous operation:

  • Columns: Backlog → To Do → In Progress → In Review → Done
  • Swimlanes: Label-based, one per agent (agent:fiducian, agent:alec, agent:claude-code)
  • WIP limits: In Progress: 6, In Review: 3 — prevents agents in isolated cron sessions from claiming tasks simultaneously

The In Review status has a precise semantic: it means principal review only. Agent self-verification happens before transition — if an agent needs to verify its own work, that’s a separate linked task. Research tickets self-close when the deliverable is complete; they don’t need human sign-off on methodology.

Task pickup is event-driven, not polled. When a Jira issue transitions to a work-ready state, a webhook fires through the MCP Gateway to the agent’s OpenClaw instance. The agent wakes, evaluates the task against its capabilities and current WIP, and either claims it (transitioning to In Progress) or passes. This replaced the earlier cron-based polling approach, which burned tokens scanning an empty backlog and introduced race conditions when multiple agents polled simultaneously.
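
The claim-or-pass decision an agent makes on webhook wake can be sketched as below; the label and method values come from the conventions described in this document, while the function itself is illustrative:

```python
# Sketch of autonomous pickup evaluation: respect WIP limits, skip
# human-only tasks, only claim work that is offered and within capability.
WIP_LIMIT_IN_PROGRESS = 6   # board-level limit from the Kanban config

def should_claim(task: dict, my_capabilities: set, my_wip: int) -> bool:
    if my_wip >= WIP_LIMIT_IN_PROGRESS:
        return False                            # over WIP: pass
    if task.get("method") == "manual":
        return False                            # humans-only task
    if "work:available" not in task.get("labels", []):
        return False                            # not offered for pickup
    return task.get("method") in my_capabilities

task = {"method": "deep-research",
        "labels": ["work:available", "agent:fiducian"]}
assert should_claim(task, {"deep-research", "documentation"}, my_wip=2)
assert not should_claim(task, {"deep-research"}, my_wip=6)   # WIP limit hit
```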

A heartbeat-triggered verification system enforces the Definition of Done across all agents. Every 2–3 heartbeat cycles, a verification script queries Jira for recent Done transitions and checks four fields: Deliverable (non-empty), Acceptance Criteria (non-empty), Method (set), and a Jira comment linking to the deliverable. Issues that fail verification are automatically reverted to To Do with an actionable comment explaining what’s missing.

The enforcement is cross-agent — Fiducian’s heartbeat checks all FAD project completions, not just its own. This creates mutual accountability: the agent who closes a ticket isn’t the one who verifies it. A [dod-skip] bypass token in the issue summary allows intentional exceptions for lightweight tasks.
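
The four-field check plus the `[dod-skip]` bypass can be sketched like this (the dict keys are illustrative stand-ins for the Jira custom field IDs):

```python
# Sketch of Definition of Done verification: returns the actionable list
# of failures; a non-empty result means the issue reverts to To Do.
def dod_failures(issue: dict) -> list:
    if "[dod-skip]" in issue.get("summary", ""):
        return []                          # intentional bypass for lightweight tasks
    missing = []
    if not issue.get("deliverable"):
        missing.append("Deliverable is empty")
    if not issue.get("acceptance_criteria"):
        missing.append("Acceptance Criteria is empty")
    if not issue.get("method"):
        missing.append("Method is not set")
    if not issue.get("deliverable_comment"):
        missing.append("No comment linking to the deliverable")
    return missing

ok = {"summary": "Deploy gateway", "deliverable": "helm release",
      "acceptance_criteria": "pods healthy", "method": "implementation",
      "deliverable_comment": "see PR"}
assert dod_failures(ok) == []
assert dod_failures({"summary": "Quick fix [dod-skip]"}) == []
assert len(dod_failures({"summary": "Half-done"})) == 4
```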

Autonomous agents operating 24/7 need guardrails beyond just “be careful.” Several mechanisms evolved from real incidents — each one addressing a specific failure mode that actually happened in production.

Safety Mechanisms Lifecycle

Search-Before-Create Gate. Before any Jira ticket creation, agents must run a duplicate detection script that searches for existing issues with similar summaries. This is a hard gate — the script returns exit 0 (safe), exit 1 (potential duplicates found, review required), or exit 2 (search error, do not create). The gate exists because context-limit restarts are the most dangerous scenario: the agent loses memory of what it created seconds ago and recreates it. One incident produced 14 duplicate tickets in a 94-second window before the pattern was caught.
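
The gate's exit-code contract can be sketched as follows. The naive substring match here is a stand-in for the real script's similarity search — the point is the contract: 0 = safe, 1 = review, 2 = fail closed:

```python
# Sketch of the search-before-create gate's exit-code contract.
def gate(new_summary: str, existing_summaries) -> int:
    if existing_summaries is None:
        return 2        # search errored → do NOT create (fail closed)
    needle = new_summary.lower()
    dupes = [s for s in existing_summaries
             if needle in s.lower() or s.lower() in needle]
    return 1 if dupes else 0   # 1 = potential duplicates, review required

assert gate("Fix NATS bridge retry", ["Unrelated ticket"]) == 0
assert gate("Fix NATS bridge retry", ["fix nats bridge retry loop"]) == 1
assert gate("Fix NATS bridge retry", None) == 2
```

Failing closed on a search error matters most in the context-limit-restart scenario: an agent that cannot prove a ticket does not already exist must not create it.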

Context-Limit Recovery (Breadcrumb Pattern). Before starting any batch operation (creating multiple tickets, pushing multiple files), agents write a breadcrumb file recording the operation name, planned items, and completed items. If context is lost mid-operation, the next session finds the breadcrumb and resumes where it left off instead of starting over. The file is deleted on completion.
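
The breadcrumb pattern sketched in code (file location and field names are illustrative):

```python
# Sketch: record planned vs. completed items so a context-limit restart
# resumes the batch instead of recreating it from scratch.
import json
import os

def start_batch(operation: str, planned, path: str) -> None:
    """Write the breadcrumb before the first item is touched."""
    with open(path, "w") as f:
        json.dump({"operation": operation, "planned": planned,
                   "completed": []}, f)

def mark_done(item: str, path: str) -> None:
    with open(path) as f:
        state = json.load(f)
    state["completed"].append(item)
    with open(path, "w") as f:
        json.dump(state, f)

def remaining(path: str):
    """What a fresh session should pick up after a mid-batch restart."""
    if not os.path.exists(path):
        return []          # no breadcrumb → nothing was in flight
    with open(path) as f:
        state = json.load(f)
    return [i for i in state["planned"] if i not in state["completed"]]
```

On successful completion the breadcrumb file is deleted, so `remaining()` returning an empty list is the steady state.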

Safety Stop System. A NATS-based halt/pause/resume mechanism that allows principals or other agents to stop an agent’s autonomous operations immediately. Safety stop messages are processed at the highest priority — above task pickup, above heartbeat checks, above any in-progress work. The system was designed after a data spill incident demonstrated that autonomous agents need an emergency brake, not just careful instructions.

NATS Audit Log on Wake. Every session, agents read their recent NATS inbox log (nats-inbox.jsonl) before doing anything else. This provides cross-session continuity for inter-agent exchanges that happened while the agent was asleep — preventing the “I didn’t see your message” problem that plagued early coordination attempts where messages fell into gaps between sessions.

Every agent-executable task carries four custom fields that make the difference between “a description a human could interpret” and “instructions an agent can execute”:

  • Method — maps to a specific skill (deep-research, brainstorming, implementation, configuration, documentation)
  • Deliverable — the concrete output (a file, a config change, a deployed service)
  • Acceptance Criteria — verifiable conditions for “done”
  • Agent Instructions — step-by-step execution guidance, including which tools to use and what to read first

Label conventions enforce ownership and work allocation: agent:* for assignment, work:available/work:assigned/work:collaborative for pickup semantics, lead:* for task leadership, and needs-spencer-review/needs-debra-review for principal-hold gates. The manual method explicitly marks tasks that require human execution — agents skip these during autonomous pickup.

The autonomous work system is organized as a four-layer skill stack, where each layer has a distinct responsibility:

  • backlog-planning — what to work on and when. Handles weekly planning reviews, daily triage, and on-demand “what’s next?” recommendations. Uses NATS to solicit contributions from all agents before proposing a plan to principals for approval.
  • work-workflows — how to decompose a work request by type. Six work types (feature, bug, spike, skill, config, docs) each have an ordered pipeline with phase gates and skip conditions.
  • task-authoring — how to write and close each task. Defines required fields, label conventions, the 7-step autonomous pickup flow, and Definition of Done.
  • method skills (23 skills) — how to execute each phase. Deep research, brainstorming, TDD, systematic debugging, writing plans, and others.

Each layer depends on the one below it but never reaches down two levels. Planning never circumvents task-authoring guardrails. Task authoring never dictates decomposition strategy. This separation of concerns prevents the kind of cascading failures that happen when a single monolithic “agent workflow” tries to handle everything.

Multi-step agent operations — DoD verification, portfolio promotion, sprint planning, ticket pickup — historically burned 5–30+ LLM calls for pure orchestration. Lobster, a deterministic workflow runtime built into OpenClaw, replaces that overhead with token-free pipeline execution.

Pipelines are triggered by Jira webhooks, heartbeat timers, or PreToolUse hooks. Each pipeline parses a .lobster file and executes steps sequentially, piping each step’s stdout to the next step’s stdin — Unix-style composition with environment variables carrying structured data between stages. Steps can be shell scripts (Node.js, bash), LLM tasks (schema-validated JSON output), or openclaw.invoke calls (direct tool execution).
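
The stdout-to-stdin composition can be reduced to a few lines. This is an illustrative reduction in Python, not Lobster's actual `.lobster` parser or runtime:

```python
# Sketch of Unix-style step composition: each step's stdout becomes the
# next step's stdin. Steps here are subprocesses, standing in for Lobster's
# shell-script / LLM-task / openclaw.invoke step types.
import subprocess
import sys

def run_pipeline(steps, initial_input=b""):
    data = initial_input
    for cmd in steps:
        result = subprocess.run(cmd, input=data, capture_output=True, check=True)
        data = result.stdout          # pipe stdout → next step's stdin
    return data

# Two deterministic steps that spend zero LLM tokens:
upper = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
exclaim = [sys.executable, "-c",
           "import sys; sys.stdout.write(sys.stdin.read().strip() + '!')"]
assert run_pipeline([upper, exclaim], b"dod checks passed") == b"DOD CHECKS PASSED!"
```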

| Pipeline | Replaces | Token Savings |
| --- | --- | --- |
| DoD Verification | Manual 9-check DoD compliance audit | ~15k tokens/run → 0 |
| DoD Transition | PreToolUse hook orchestration for Done transitions | ~8k tokens/run → 0 |
| Sprint Planning | 30+ manual tool calls for weekly review | ~45k tokens/run → 0 |
| Work Decomposition | Manual sub-task creation from workflow templates | ~20k tokens/run → 0 |
| Commit-First Deploy | Manual verify → commit → push → deploy sequence | ~10k tokens/run → 0 |
| Session Close | Memory + learnings capture on session end | ~12k tokens/run → 0 |

The key insight: most agent “orchestration” is deterministic sequencing that doesn’t need intelligence. By moving orchestration to Lobster and reserving LLM calls for genuinely creative steps (analysis, writing, decision-making), the system achieves the same outcomes at a fraction of the token cost.

The DoD Verification and DoD Transition pipelines work together with the existing heartbeat backstop to create a three-layer enforcement system:

  1. PreToolUse Hook — intercepts jira_transition_issue calls, blocking Done transitions until the Lobster pipeline passes
  2. Lobster Pipeline — runs 6 Jira field checks, memory file recency validation, Graphiti episode/fact verification, follow-on signal detection, and produces a unified 9-check report
  3. Heartbeat Backstop — every 2 hours, a verification script catches any tickets that escaped the first two layers and auto-reverts them to To Do

This three-layer approach catches failures at three different timescales: real-time (hook), near-real-time (pipeline), and periodic (heartbeat).

Before expanding agent tool access, Fiducian and ClaudeCodeAgent conducted a joint security assessment of 11 candidate tools. The review followed a fiduciary-first methodology: each agent independently assessed risks from their operational perspective, then merged findings into a consensus recommendation.

| Tool | Risk Level | Decision | Rationale |
| --- | --- | --- | --- |
| diffs | 🟢 None | Enable | Read-only file comparison |
| llm-task | 🟢 Low | Enable | JSON-only, schema-validated, no side effects |
| Brave Search | 🟢 Low | Enable | Web search with API key |
| loopDetection | 🟢 None | Enable | Safety feature (prevents infinite loops) |
| apply_patch | 🟡 Medium | Enable w/ mitigations | File integrity CronJob monitors seeded plugin files |
| lobster | 🟡 Medium | Enable w/ mitigations | Pipeline audit logging required |
| Discord voice | 🟡 Medium | Enable w/ mitigations | Discord role scoping + TTS-only |
| bash | 🔴 High | Defer | SafeBins filter has defense-in-depth value |
| browser | 🔴 High | Defer | Needs sandboxing (designed as MCP service) |
| nodes | 🔴 Critical | Deny | Child privacy — dependent’s data exposure |
| voice-call | 🔴 Critical | Deny | Impersonation risk violates Layer 0 |

The nodes denial is particularly instructive: the tool would give agents access to the host operating system, which runs in the same environment as the family’s child devices. Child privacy isn’t a judgment call — it’s a structural boundary. Similarly, voice-call was denied because real-time voice synthesis could enable impersonation, directly violating the trust framework’s immutable “no impersonation” principle.

Infrastructure changes follow a preflight → execute → verify pattern. The v1.28→v1.29 Kubernetes upgrade and Cilium v1.16→v1.17 migration included automated preflight checks: API compatibility verification, deprecated resource scanning, storage driver validation, and network policy migration assessment — all documented as go/no-go evidence before any changes are applied to the production cluster.

The FAD system is documented across multiple focused pages, each covering a specific subsystem in depth.

This is a distributed system with real constraints. Agents are separate processes in separate pods with separate storage. They communicate through message passing (NATS), not shared memory. State coordination uses a graph database (Graphiti) with group-level isolation. The MCP Gateway provides service discovery and routing without coupling agents to specific tool implementations.

The system handles the realities of distributed computing: network partitions (Cilium network policies), state consistency (Graphiti as authoritative shared state), message ordering (JSON-RPC correlation IDs), and failure modes (MCP Gateway health checks, NATS durability).

The inter-agent protocol is a from-scratch design that needed to solve a real problem: how do two AI agents, each with undivided loyalty to a different human, communicate without either compromising their principal’s interests?

The solution layers identity (JSON-RPC headers), authentication (Ed25519 signatures), confidentiality (sealed boxes), authorization (information sharing tiers), and transparency (audit logging + Discord summaries) into a coherent stack. The JSON-RPC 2.0 alignment was deliberate — it provides interoperability with the broader MCP ecosystem while supporting the domain-specific requirements of fiduciary agents.

Security isn’t bolted on — it’s the foundation. The platform has evolved from static credentials (long-lived PATs on pods) to a fully dynamic security model: the Credential Broker issues short-lived, per-request scoped tokens via a GitHub App backend, eliminating persistent secrets from agent pods entirely. Every inter-agent message is cryptographically signed. Tier 3/4 data is end-to-end encrypted such that transport infrastructure can verify sender identity without reading message content. Kubernetes network policies enforce namespace-level isolation. CiliumNetworkPolicy provides L3-L7 enforcement. Agents can’t reach each other’s workspaces. Home Assistant (which controls physical devices) is locked to minimum necessary traffic.

The trust model isn’t “trust everything inside the cluster.” It’s “verify identity, classify information, enforce boundaries, log everything, and give principals full visibility.”

The hardest design problem in the system isn’t technical — it’s philosophical. How do you build a cooperative multi-agent system where each agent has an undivided loyalty to a different person?

The answer isn’t compromise. It’s protocol. Information sharing tiers make the boundaries structural rather than judgmental. Relationship templates provide customizable defaults. The nudge protocol enables cooperation (hints for kindness) without enabling surveillance (source can’t be reconstructed). Conflict resolution has four explicit levels with clear escalation paths. Emergency overrides have four narrow conditions with mandatory post-emergency reporting.

The framework explicitly adopts Nash equilibrium thinking: the best outcomes for each principal come from cooperation, not from “winning” at the other’s expense. Positive-sum over zero-sum. But this cooperation is constrained — it never requires either agent to compromise its fiduciary duty.

The system doesn’t just build things — it governs itself. Definition of Done enforcement prevents agents from claiming completion without evidence. Event-driven task pickup eliminates polling overhead and race conditions. Safety-stop mechanisms provide emergency override capability. Joint security reviews ensure tool expansion follows a structured risk assessment, not ad-hoc enablement.

This operational governance layer is what separates a demo from a production system. Agents make mistakes — they skip documentation, claim tasks that are blocked, close tickets without follow-on work. The governance layer catches those mistakes structurally, through hooks and pipelines and backstops, rather than relying on agent “judgment” to always be correct.

Lobster pipelines represent a meta-insight about agent systems: most of what agents spend tokens on isn’t thinking — it’s sequencing. DoD checks, sprint planning, ticket creation, memory maintenance — these are deterministic workflows that don’t require intelligence. Moving them to a pipeline runtime reduced token consumption by orders of magnitude while improving reliability (deterministic pipelines don’t hallucinate steps or skip checks).

This is the same pattern that appears in human organizations: you don’t need a senior engineer to run a deployment checklist. You need a senior engineer to design the checklist, and a reliable process to execute it.

The system spans every layer: ARM64 hardware → vendor Linux kernel → vanilla Kubernetes (kubeadm) → Cilium eBPF networking → Longhorn distributed storage → Helm-deployed workloads → MCP protocol layer → agent runtime → trust framework → Lobster workflow automation → human communication channels. There’s no managed service hiding the complexity. Every layer is visible, operable, and understood — from debugging kernel BTF support on Rockchip vendor kernels to designing information sharing tier semantics for household agents.

This is not a prototype. It runs 24/7, manages real finances, controls real home automation, and handles real family coordination. The infrastructure decisions — Cilium over kube-proxy, Longhorn over NFS, vanilla kubeadm over K3s, JSON-RPC 2.0 over ad-hoc messaging — were made under real constraints and validated by daily operation. The platform runs OpenClaw 2026.3.2 with 23 federated skills, event-driven task pickup, cross-agent DoD enforcement, and a NATS Bridge Plugin that enables real-time inter-agent communication with 5–15 second delivery latency.