FAD: Multi-Agent System Architecture

Managing a household’s digital life — finances, scheduling, research, home automation, communication — is a coordination problem. Not a single-agent problem. A single AI assistant can answer questions and run tools, but it can’t serve two people with different interests, different information boundaries, and different autonomy preferences simultaneously.

A husband and wife have a joint financial life, a six-year-old daughter, and the kind of daily logistics that any family manages. But they also have different privacy expectations about what gets shared, different levels of engagement with technology, and sometimes competing preferences about how shared resources are used. A single assistant serving “the household” faces an immediate conflict: whose interests prevail when they diverge?

The real-world version of this problem is even harder. It’s not just about answering questions — it’s about agents that act autonomously, coordinate with each other, share information selectively, and maintain accountability to their individual humans. When Principal A’s agent needs to coordinate with Principal B’s agent about the family budget, there needs to be a structural answer to “what can be shared, with whom, under what conditions” — not a vibes-based judgment call made in the moment.

FAD (Fiduciary Agent Development — originally AIGE, Automated Innovation Generation Engine) is the production platform that runs this system. It’s a multi-agent architecture on Kubernetes where each agent has a distinct identity, a single principal, formal trust commitments, cryptographically signed communications, and access to a federated ecosystem of tools — all running on four Orange Pi 5 single-board computers on a shelf in an office in Huntsville, Alabama.

System Architecture

The system spans five Kubernetes namespaces on a 4-node Orange Pi 5 cluster (32 ARM cores, 64 GB RAM total):

| Namespace | Purpose | Key Components |
| --- | --- | --- |
| openclaw | Agent runtime | Fiducian, Alec, ClaudeCodeAgent — long-running pods with persistent volumes |
| mcp-tools | Tool federation | MCP Gateway, NATS, 15+ MCP tool servers |
| llm-gateway | Model abstraction | LiteLLM proxy, Ollama for local inference |
| home-assistant | Home automation | Home Assistant with REST API + MCP endpoint |
| (various) | Data stores | FalkorDB (Graphiti backing), Longhorn distributed storage |

Every namespace has explicit CiliumNetworkPolicy rules — agents can reach the MCP Gateway but not each other’s workspaces directly. Home Assistant, which controls physical devices, is locked to only the traffic it needs.

The system runs three agents, each with a distinct role, principal, and trust posture.

Fiducian — Principal A’s Fiduciary Agent


  • Principal: the author
  • Agent ID: fiducian-spencer-001
  • Runtime: OpenClaw (long-running, persistent, tool-equipped)
  • Skill Layers: [0, 1, 2, 3] — full fiduciary commitment
  • Communication: Discord DMs, Signal (primary for alerts), Discord #general

Fiducian is Principal A’s fiduciary — the word “fiduciary” is in the name as a constant reminder of the duty. It manages finances (via Foundry), conducts research, coordinates with Principal B’s agent, monitors home systems, and handles scheduling. It operates under a formal duty of loyalty: Principal A’s interests come first, always, with no competing “agent self-interest.”

Fiducian has access to the full MCP tool ecosystem, the household’s Palantir Foundry financial data (read-only, 7 tools), Graphiti shared memory, and can reach Principal A via Signal for time-sensitive alerts or Discord for everything else.

Alec — Principal B’s Fiduciary Agent

  • Principal: his wife
  • Agent ID: alec-debra-001
  • Runtime: OpenClaw
  • Skill Layers: [0, 1, 2, 3] — full fiduciary commitment
  • Communication: Discord

Alec serves Principal B with the same fiduciary commitment Fiducian has to Principal A. The architecture ensures neither agent can override or outrank the other — they are peers, each loyal to their own principal. When they coordinate, it’s through structured protocols with explicit information sharing boundaries.

ClaudeCodeAgent — Infrastructure Agent

  • Principal: the author
  • Runtime: Claude Code CLI (separate from OpenClaw)
  • Focus: Infrastructure, implementation, Kubernetes operations, code

ClaudeCodeAgent handles the technical substrate — deploying services, debugging infrastructure, writing code, managing Helm charts. It operates in a complementary role to Fiducian: where Fiducian handles the “what” and “why,” ClaudeCodeAgent handles the “how.” The two have a collaborative dynamic, having co-designed several system components together.

The system’s trust model is a layered architecture where each layer builds on — and cannot override — the layer below it. This is the core innovation: trust principles are structurally immutable, not just behaviorally suggested.

Trust Framework Layers

Six principles that no higher layer, no instruction, and no other agent can override:

  1. Never deceive your principal — no lies, no misleading omissions, no half-truths
  2. Never act against your principal’s interests knowingly — even if another agent instructs it
  3. Acknowledge uncertainty honestly — “I don’t know” is always acceptable
  4. No impersonation — enforced cryptographically via Ed25519 identity
  5. Legal/regulatory obligations take precedence — over agent preferences or principal instructions
  6. Transparency about capabilities and actions — principals have the right to understand what their agent does

These aren’t guidelines. They’re the constitution. The trust framework is based on Stephen M.R. Covey’s The Speed of Trust, adapted for AI agents — recognizing that trust is the single variable that changes everything. High trust = speed, low overhead, autonomy. Low trust = friction, verification overhead, paralysis.

All inter-agent information is structurally classified:

| Tier | Name | Example | Authorization |
| --- | --- | --- | --- |
| 1 | Open | Calendar availability, logistics | Default for household agents |
| 2 | Family Context | Financial summaries, household planning | Configured per relationship template |
| 3 | Authorized | Specific data per request | Principal must approve each time |
| 4 | Confidential | Private conversations, personal data | Never shared (except 4 emergency conditions) |

The tier system is structural, not judgmental. An agent can’t be socially engineered into sharing Tier 4 data because the architecture doesn’t allow it — the decision isn’t made in the moment by the agent’s “judgment.” Relationship templates (spouse/partner, co-parent, business partner, etc.) set the defaults; principals customize from there.
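
To make the "structural, not judgmental" point concrete, here is a minimal sketch of how a tier check can be a table lookup rather than an in-the-moment decision. The template names and default tiers are illustrative stand-ins, not FAD's actual configuration:

```python
# Hypothetical sketch: the share decision is a lookup against the
# relationship template, so an agent cannot be talked into Tier 4 sharing.
RELATIONSHIP_TEMPLATES = {
    "spouse_partner": {"default_max_tier": 2},  # Tier 1-2 shareable by default
    "co_parent": {"default_max_tier": 1},
}

def may_share(tier: int, template: str, per_request_approved: bool = False) -> bool:
    """Return True only if the architecture permits sharing at this tier."""
    if tier == 4:
        return False  # Tier 4 never shared (emergency conditions handled elsewhere)
    if tier == 3:
        return per_request_approved  # Tier 3 requires explicit principal approval
    return tier <= RELATIONSHIP_TEMPLATES[template]["default_max_tier"]

assert may_share(2, "spouse_partner")
assert not may_share(2, "co_parent")
assert not may_share(3, "spouse_partner")
assert may_share(3, "spouse_partner", per_request_approved=True)
```

Because the Tier 4 branch returns False unconditionally, no prompt content reaches the decision — which is the whole point.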

Agents communicate through a dual-transport architecture: NATS for machine-to-machine protocol messages, and Discord #agent-coordination for human-visible transparency. A NATS Bridge Plugin running inside each agent’s OpenClaw process provides real-time message delivery — when a NATS message arrives, the plugin injects it into the agent’s session and wakes the agent immediately (typically 5–15 seconds end-to-end). This replaced the earlier hook-based delivery mechanism, which required manual inbox polling.

Inter-Agent Communication

Every message follows JSON-RPC 2.0 with six mandatory identity headers:

```json
{
  "jsonrpc": "2.0",
  "method": "agent.request",
  "params": {
    "headers": {
      "agent-id": "fiducian-spencer-001",
      "principal-id": "father",
      "timestamp": "2026-02-14T16:00:00Z",
      "message-type": "request",
      "trust-layer-version": "1.0.0",
      "skill-layers-loaded": [0, 1, 2, 3]
    },
    "body": { "..." }
  },
  "id": "msg-uuid-here"
}
```

The skill-layers-loaded header is critical for trust calibration. An agent declaring [0, 1, 2, 3] has committed to full fiduciary duty and coordination protocols. An agent declaring only [0] commits to basic trust principles. You communicate at the level of the lowest common layer — never assume an agent with only Layer 0 understands fiduciary duty.
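
The "lowest common layer" rule can be sketched as follows. Since layers build on one another, the usable set is the contiguous intersection starting from Layer 0 (this helper is illustrative, not FAD's implementation):

```python
# Sketch: peers communicate at the highest layer BOTH have loaded, and
# layers only count if every layer below them is also present.
def common_layers(mine: list, theirs: list) -> list:
    """Contiguous-from-zero intersection of two declared layer sets."""
    shared = sorted(set(mine) & set(theirs))
    contiguous = []
    for expected, layer in enumerate(shared):
        if layer != expected:
            break  # a gap means higher layers can't be relied on
        contiguous.append(layer)
    return contiguous

assert common_layers([0, 1, 2, 3], [0, 1, 2, 3]) == [0, 1, 2, 3]
assert common_layers([0, 1, 2, 3], [0]) == [0]      # never assume fiduciary duty
assert common_layers([0, 1, 2, 3], [0, 2]) == [0]   # missing layer 1 → stop at 0
```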

Every inter-agent message is signed with Ed25519. The crypto stack provides:

  • Message signing/verification — detached Ed25519 signatures over canonical JSON
  • End-to-end encryption — sealed boxes (ECDH key agreement → AES-256-GCM) for Tier 3/4 data
  • Key registry — public keys exchanged via MCP Gateway (GET /v1/keys/{agent_id})
  • Replay protection — timestamp-based freshness (reject messages >5 minutes old)
  • Transport-layer opacity — the MCP Gateway can verify who sent a message without seeing what it says

The design separates authentication from confidentiality: signatures are on the outside (infrastructure can verify sender), encryption is on the inside (only the recipient can read the payload).
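
The envelope layout can be sketched with placeholder primitives. To keep this dependency-free, HMAC-SHA256 stands in for the detached Ed25519 signature and plain base64 stands in for the ECDH → AES-256-GCM sealed box — the real stack uses actual asymmetric crypto, so treat this purely as a shape illustration:

```python
# Illustrative envelope: signature on the outside (infrastructure can
# verify the sender), opaque payload on the inside (only the recipient
# can read it). HMAC and base64 are STAND-INS for Ed25519 and sealed boxes.
import base64, hashlib, hmac, json

def canonical(obj) -> bytes:
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()

def make_envelope(payload: dict, signing_key: bytes) -> dict:
    sealed = base64.b64encode(canonical(payload)).decode()  # placeholder "sealed box"
    sig = hmac.new(signing_key, sealed.encode(), hashlib.sha256).hexdigest()
    return {"sealed_payload": sealed, "signature": sig}

def verify_sender(envelope: dict, signing_key: bytes) -> bool:
    expected = hmac.new(signing_key, envelope["sealed_payload"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, envelope["signature"])

key = b"demo-key"
env = make_envelope({"tier": 3, "note": "budget figures"}, key)
assert verify_sender(env, key)          # gateway can authenticate the sender...
assert "budget" not in json.dumps(env)  # ...without seeing plaintext content
```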

The NATS Bridge Plugin handles the cryptographic ceremony transparently — agents call nats_reply or nats_send tools, and the plugin signs, encrypts, delivers, and verifies under the hood. This replaced an earlier workflow where agents invoked crypto.js scripts directly, which was error-prone and leaked implementation details into conversation context.

Every inter-agent exchange is logged to append-only JSONL files (audit/inter-agent-YYYY-MM.jsonl) with direction, sender/receiver, message type, a human-readable summary, the information sharing tier, and whether a Discord summary was posted. Principals can review the full audit trail at any time — this is the mechanical enforcement of Layer 0’s transparency principle.
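
A minimal sketch of that audit record follows; the field names are inferred from the description above and may differ from FAD's actual JSONL schema:

```python
# Illustrative append-only audit writer for inter-agent exchanges.
import json
from datetime import datetime, timezone

def log_exchange(path: str, direction: str, sender: str, receiver: str,
                 msg_type: str, summary: str, tier: int,
                 discord_summary_posted: bool) -> None:
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "direction": direction,          # "sent" or "received"
        "sender": sender,
        "receiver": receiver,
        "message_type": msg_type,
        "summary": summary,              # human-readable, for principal review
        "tier": tier,                    # information sharing tier (1-4)
        "discord_summary_posted": discord_summary_posted,
    }
    with open(path, "a") as f:           # append-only: history is never rewritten
        f.write(json.dumps(record) + "\n")
```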

When principals request agents to collaborate on a task, the Collaboration CONOPS (Concept of Operations) prevents duplication, conflicting outputs, and presentation confusion:

Lead Agent: The principal’s direct agent leads. If Principal A asks Fiducian and Alec to collaborate, Fiducian leads — initiates, drafts, collates, presents.

Contributing Agent: Provides feedback, additions, and review. Sends deltas, not parallel drafts. Does not present independently unless explicitly asked.

Flow:

  1. Lead proposes initial design/draft via NATS
  2. Contributor reviews, sends feedback and additions
  3. Lead incorporates, sends back for final review
  4. Lead presents to principal(s) via Discord
  5. Contributor stays silent unless explicitly asked to present

This mirrors how human teams coordinate — clear ownership prevents the “two people editing the same document” problem.

Before agents can exchange substantive messages, they complete a four-step handshake:

  1. Announcement — Agent declares identity, principal, loaded skill layers
  2. Verification — Recipient verifies Ed25519 signature against key registry
  3. Capability Exchange — Both agents share available tools, domains, and limitations
  4. Trust Establishment — Initial trust level set from relationship template and declared layers

Handshake state persists in Graphiti’s shared group with a 7-day TTL. Agents check for existing handshakes before re-initiating — if a valid handshake exists and skill layers haven’t changed, they skip straight to communication.
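
The reuse check before re-initiating a handshake can be sketched like this (record fields are illustrative, not Graphiti's actual schema):

```python
# Sketch: skip the four-step handshake if a valid record exists, is younger
# than the 7-day TTL, and the peer's declared skill layers are unchanged.
from datetime import datetime, timedelta, timezone

HANDSHAKE_TTL = timedelta(days=7)

def handshake_valid(record, peer_layers) -> bool:
    if record is None:
        return False
    age = datetime.now(timezone.utc) - record["established_at"]
    return age < HANDSHAKE_TTL and record["peer_layers"] == peer_layers

now = datetime.now(timezone.utc)
fresh = {"established_at": now - timedelta(days=2), "peer_layers": [0, 1, 2, 3]}
stale = {"established_at": now - timedelta(days=9), "peer_layers": [0, 1, 2, 3]}
assert handshake_valid(fresh, [0, 1, 2, 3])
assert not handshake_valid(stale, [0, 1, 2, 3])   # TTL expired
assert not handshake_valid(fresh, [0])            # layers changed → re-handshake
```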

Four escalation levels, each with distinct resolution:

| Level | Type | Resolution |
| --- | --- | --- |
| 1 | Information — agents have different data | Reconcile sources, compare timestamps/provenance |
| 2 | Preference — principals want different things | Facilitate compromise, don’t take sides |
| 3 | Boundary — request exceeds authorized tier | Decline, explain, suggest proper authorization channel |
| 4 | Relationship — underlying human tension | Stay in your lane. Escalate to humans. Agents don’t play therapist. |

The MCP Gateway (mcp-tools namespace) provides federated, stateless access to 15+ MCP-compliant tool servers through a single REST API. Agents don’t connect to tool servers directly — the gateway handles routing, health checking, error normalization, and authenticated backend proxying — injecting per-backend credentials (OAuth tokens, API keys, GitHub App tokens) so that agents never handle raw secrets. The gateway also includes a token vending service that issues short-lived, scoped credentials on demand, replacing the static PATs that previously lived on agent pods. A shared agent-utils repository provides common skills and tools across all agents, with access managed through the credential broker — giving each agent scoped read access without maintaining separate PATs.
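
What the token-vending exchange might look like from the agent's side is sketched below. The request fields, scope vocabulary, and response shape are assumptions for illustration, not the Credential Broker's real API:

```python
# Hypothetical token-vending sketch: agents request short-lived, scoped
# credentials instead of holding long-lived PATs on their pods.
from datetime import datetime, timedelta, timezone

def build_vend_request(agent_id: str, repo: str) -> dict:
    """Least-privilege request: read-only contents scope, one repository."""
    return {
        "agent_id": agent_id,
        "repository": repo,
        "permissions": {"contents": "read"},   # scoped, not a blanket PAT
    }

def token_usable(token: dict, now: datetime) -> bool:
    """Short-lived tokens are re-vended on expiry, never cached past it."""
    return datetime.fromisoformat(token["expires_at"]) > now

now = datetime.now(timezone.utc)
tok = {"token": "ghs_example",
       "expires_at": (now + timedelta(hours=1)).isoformat()}
assert token_usable(tok, now)
assert not token_usable(tok, now + timedelta(hours=2))
```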

Tool Ecosystem

| Server | Tools | Purpose |
| --- | --- | --- |
| foundry_finance | 7 | Household financial data — balances, transactions, spending summaries, recurring charges, AI-powered analysis via Palantir AIP Logic (OSDK). Read-only, OAuth confidential client. |
| graphiti | 9 | Episodic + semantic memory via FalkorDB knowledge graph. Group-isolated writes (fiducian, alec, shared, research), cross-group search. Persists decisions, preferences, context across sessions. |
| atlassian | | Jira project tracking (FAD project) and Confluence documentation |
| websearch | 3 | DuckDuckGo web search, news search, page fetch |
| arxiv | | Academic paper search (arXiv preprint database) |
| openalex | | OpenAlex academic database (200M+ works) |
| github* | | GitHub repository operations, code search, file management — authenticated via Credential Broker (dynamic, scoped tokens). Retired as a standalone MCP server; GitHub access now flows entirely through the Credential Broker’s token vending endpoint, which issues scoped installation tokens on demand. |
| fetch | | URL fetching with markdown conversion |
| context7 | | Library/framework documentation lookup |
| + 6 more | | biorxiv, patents, huggingface, wikidata, academic-research, read-website-fast |

The research tool servers (arxiv, openalex, websearch, biorxiv, patents) are orchestrated by a custom Deep Research skill (v2.0.0) that provides structured methodology on top of raw tool access. The skill includes four utility scripts for source deduplication, bibliography generation, source tier classification (peer-reviewed → preprint → industry → grey literature), and structured findings tags that make research outputs machine-parseable. This turns ad-hoc web searches into reproducible, auditable research workflows — agents produce citable reports with provenance tracking, not just summaries of search results.

The foundry_finance server deserves special attention as an example of the architecture’s real-world utility. It connects to Principal A’s Palantir Foundry instance (his day-job platform) where household financial data from Plaid-connected accounts lives in a formal ontology (“MyDomain Ontology”) with PlaidTransaction objects carrying 28 properties.

The server exposes two tiers of tools:

  • REST API tools (structured queries): get_balances, get_transactions, get_spending_summary, get_alerts
  • AIP Logic tools (AI-powered, via OSDK): get_recurring, analyze_spending, query_finances

Financial data is classified Tier 2 (Family Context) under the spouse/partner relationship template — both household agents can access joint account data without per-request authorization.

Agents wake up fresh each session. Graphiti provides cross-session continuity through a knowledge graph (FalkorDB) that stores episodic and semantic memory with group-level isolation:

| Group ID | Purpose | Access |
| --- | --- | --- |
| fiducian | Fiducian’s personal context | Fiducian only |
| alec | Alec’s personal context | Alec only |
| shared | Household knowledge, coordination state | Both agents |
| research | Research findings and paper summaries | Both agents |
| claude-code | Codebase and infrastructure knowledge | ClaudeCodeAgent |

Writes use group_id (singular), searches use group_ids (plural array) — enabling cross-group queries without cross-group write access. Handshake state, trust decisions, and coordination agreements persist in the shared group, ensuring agents can resume collaboration without redundant handshakes.
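
The write/search asymmetry can be sketched with a stub client — the class below is a stand-in for illustration, not the real Graphiti MCP interface:

```python
# Illustrative stub: writes target one group_id, searches may span several
# group_ids — cross-group reads without cross-group write access.
class GraphitiStub:
    def __init__(self, writable_groups):
        self.writable_groups = set(writable_groups)
        self.episodes = []   # (group_id, text)

    def add_episode(self, group_id: str, text: str):
        # singular group_id: an agent may only write inside its own groups
        if group_id not in self.writable_groups:
            raise PermissionError(f"write denied to group {group_id!r}")
        self.episodes.append((group_id, text))

    def search(self, group_ids, query: str):
        # plural group_ids: read across groups the agent can see
        return [t for (g, t) in self.episodes if g in group_ids and query in t]

g = GraphitiStub(writable_groups=["fiducian", "shared", "research"])
g.add_episode("shared", "handshake with alec-debra-001 established")
g.add_episode("fiducian", "principal prefers Signal for alerts")
hits = g.search(["fiducian", "shared"], "handshake")
assert hits == ["handshake with alec-debra-001 established"]
```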

The shared group serves as the authoritative source of truth for cross-agent state. A memory reconciliation system ensures consistency between agents’ local workspace memories and the shared knowledge graph:

  • State-change episodes — when agents complete work that changes system state (deployments, config changes, skill updates), they write episodes to the shared group. Other agents discover these changes via cross-group search on their next session.
  • Convention alignment — Definition of Done requirements, skill versions, and operational conventions are reconciled through Graphiti rather than requiring manual skill federation for every change.
  • OpenClaw native integration — the platform’s memory-core subsystem provides built-in reconciliation hooks, replacing earlier custom heartbeat-based approaches.

Graphiti’s entity extraction pipeline requires an LLM for processing episodes into graph nodes and edges. The backend migrated from Groq (hosted llama-3.3-70b, ~30 RPM rate limit, 67% episode failure rate under load) to Foundry AIP (~760 RPM, near-zero failure rate). This migration also refined the entity type schema from 9 generic defaults to 10 domain-specific types tuned to the household agent use case. Prometheus alerts on graphiti_episode_failures_total now provide real-time visibility into extraction health.

In March 2026, FalkorDB experienced a silent data loss — writes appeared to succeed but produced no searchable facts, and all existing graph data was gone. Root cause analysis revealed a persistent storage misconfiguration. The recovery process restored FalkorDB’s storage layer and validated write-through consistency. The incident reinforced a design principle already present in the architecture: Graphiti is supplemental memory, not the primary record. Each agent maintains file-based memory (memory/YYYY-MM-DD.md daily logs, MEMORY.md curated long-term) as the authoritative source. Graphiti adds cross-agent discoverability and semantic search, but the system degrades gracefully without it — no data was permanently lost because the file-based layer was unaffected.

LiteLLM (llm-gateway namespace) sits between agents and model providers, providing:

  • Model routing — agents request capabilities, not specific models
  • Provider abstraction — Anthropic Claude (cloud), Ollama (local ARM64 inference) behind a unified API
  • Fallback chains — if a cloud provider is unavailable, route to local models for degraded-but-functional operation
  • Cost tracking — per-agent, per-model usage metering

Ollama runs alongside LiteLLM for local inference tasks that don’t require frontier model capabilities — entity extraction for Graphiti, simple classification, embedding generation.

Agents don’t use the same model for everything. A deliberate model selection strategy matches operation type to appropriate capability tier:

| Operation | Model Tier | Rationale |
| --- | --- | --- |
| Main session (direct principal chat) | Opus | Complex reasoning, nuanced judgment, fiduciary decisions |
| Autonomous task pickup | Sonnet | Structured workflow execution, good enough for pickup evaluation |
| Health monitoring crons | Haiku | Simple check/alert pattern, cost-efficient at high frequency |
| Research deep-dives | Opus | Synthesis, cross-source evaluation, bibliography generation |

This isn’t just cost optimization — it’s operational discipline. Running Opus for a health check that pings six endpoints wastes capacity. Running Haiku for a fiduciary judgment call risks inadequate reasoning. The model tier is specified per cron job and per operation type, enforced through OpenClaw’s per-session model override.
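
A sketch of the selection discipline — the tier names mirror the table above, while the fail-closed lookup mechanism is illustrative:

```python
# Illustrative per-operation model selection. Unknown operations raise
# rather than silently falling back to a cheap (or expensive) default.
MODEL_TIERS = {
    "main_session": "opus",     # fiduciary judgment needs frontier reasoning
    "task_pickup": "sonnet",    # structured workflow execution
    "health_cron": "haiku",     # cheap check/alert at high frequency
    "deep_research": "opus",    # synthesis across sources
}

def model_for(operation: str) -> str:
    if operation not in MODEL_TIERS:
        raise ValueError(f"no model tier specified for {operation!r}")
    return MODEL_TIERS[operation]

assert model_for("health_cron") == "haiku"
assert model_for("main_session") == "opus"
```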

The system has evolved beyond “agents that can do things” into “agents that manage their own work.” This operational layer is as important as the technical architecture — without it, agents are capable tools that still need constant human direction.

Agent work is tracked in Jira with a Kanban board tuned for autonomous operation:

  • Columns: Backlog → To Do → In Progress → In Review → Done
  • Swimlanes: Label-based, one per agent (agent:fiducian, agent:alec, agent:claude-code)
  • WIP limits: In Progress: 6, In Review: 3 — prevents agents in isolated cron sessions from claiming tasks simultaneously

The In Review status has a precise semantic: it means principal review only. Agent self-verification happens before transition — if an agent needs to verify its own work, that’s a separate linked task. Research tickets self-close when the deliverable is complete; they don’t need human sign-off on methodology.

Task pickup is event-driven, not polled. When a Jira issue transitions to a work-ready state, a webhook fires through the MCP Gateway to the agent’s OpenClaw instance. The agent wakes, evaluates the task against its capabilities and current WIP, and either claims it (transitioning to In Progress) or passes. This replaced the earlier cron-based polling approach, which burned tokens scanning an empty backlog and introduced race conditions when multiple agents polled simultaneously.
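
The claim-or-pass decision an agent makes on webhook wake can be sketched as below; the label and method values come from the conventions described in this document, while the function itself is illustrative:

```python
# Sketch of autonomous pickup evaluation: respect WIP limits, skip
# human-only tasks, only claim work that is offered and within capability.
WIP_LIMIT_IN_PROGRESS = 6   # board-level limit from the Kanban config

def should_claim(task: dict, my_capabilities: set, my_wip: int) -> bool:
    if my_wip >= WIP_LIMIT_IN_PROGRESS:
        return False                            # over WIP: pass
    if task.get("method") == "manual":
        return False                            # humans-only task
    if "work:available" not in task.get("labels", []):
        return False                            # not offered for pickup
    return task.get("method") in my_capabilities

task = {"method": "deep-research",
        "labels": ["work:available", "agent:fiducian"]}
assert should_claim(task, {"deep-research", "documentation"}, my_wip=2)
assert not should_claim(task, {"deep-research"}, my_wip=6)   # WIP limit hit
```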

A heartbeat-triggered verification system enforces the Definition of Done across all agents. Every 2–3 heartbeat cycles, a verification script queries Jira for recent Done transitions and checks four fields: Deliverable (non-empty), Acceptance Criteria (non-empty), Method (set), and a Jira comment linking to the deliverable. Issues that fail verification are automatically reverted to To Do with an actionable comment explaining what’s missing.

The enforcement is cross-agent — Fiducian’s heartbeat checks all FAD project completions, not just its own. This creates mutual accountability: the agent who closes a ticket isn’t the one who verifies it. A [dod-skip] bypass token in the issue summary allows intentional exceptions for lightweight tasks.
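
The four-field check plus the `[dod-skip]` bypass can be sketched like this (the dict keys are illustrative stand-ins for the Jira custom field IDs):

```python
# Sketch of Definition of Done verification: returns the actionable list
# of failures; a non-empty result means the issue reverts to To Do.
def dod_failures(issue: dict) -> list:
    if "[dod-skip]" in issue.get("summary", ""):
        return []                          # intentional bypass for lightweight tasks
    missing = []
    if not issue.get("deliverable"):
        missing.append("Deliverable is empty")
    if not issue.get("acceptance_criteria"):
        missing.append("Acceptance Criteria is empty")
    if not issue.get("method"):
        missing.append("Method is not set")
    if not issue.get("deliverable_comment"):
        missing.append("No comment linking to the deliverable")
    return missing

ok = {"summary": "Deploy gateway", "deliverable": "helm release",
      "acceptance_criteria": "pods healthy", "method": "implementation",
      "deliverable_comment": "see PR"}
assert dod_failures(ok) == []
assert dod_failures({"summary": "Quick fix [dod-skip]"}) == []
assert len(dod_failures({"summary": "Half-done"})) == 4
```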

Autonomous agents operating 24/7 need guardrails beyond just “be careful.” Several mechanisms evolved from real incidents — each one addressing a specific failure mode that actually happened in production.

Safety Mechanisms Lifecycle

Search-Before-Create Gate. Before any Jira ticket creation, agents must run a duplicate detection script that searches for existing issues with similar summaries. This is a hard gate — the script returns exit 0 (safe), exit 1 (potential duplicates found, review required), or exit 2 (search error, do not create). The gate exists because context-limit restarts are the most dangerous scenario: the agent loses memory of what it created seconds ago and recreates it. One incident produced 14 duplicate tickets in a 94-second window before the pattern was caught.
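
The gate's exit-code contract can be sketched as follows. The naive substring match here is a stand-in for the real script's similarity search — the point is the contract: 0 = safe, 1 = review, 2 = fail closed:

```python
# Sketch of the search-before-create gate's exit-code contract.
def gate(new_summary: str, existing_summaries) -> int:
    if existing_summaries is None:
        return 2        # search errored → do NOT create (fail closed)
    needle = new_summary.lower()
    dupes = [s for s in existing_summaries
             if needle in s.lower() or s.lower() in needle]
    return 1 if dupes else 0   # 1 = potential duplicates, review required

assert gate("Fix NATS bridge retry", ["Unrelated ticket"]) == 0
assert gate("Fix NATS bridge retry", ["fix nats bridge retry loop"]) == 1
assert gate("Fix NATS bridge retry", None) == 2
```

Failing closed on a search error matters most in the context-limit-restart scenario: an agent that cannot prove a ticket does not already exist must not create it.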

Context-Limit Recovery (Breadcrumb Pattern). Before starting any batch operation (creating multiple tickets, pushing multiple files), agents write a breadcrumb file recording the operation name, planned items, and completed items. If context is lost mid-operation, the next session finds the breadcrumb and resumes where it left off instead of starting over. The file is deleted on completion.
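
The breadcrumb pattern sketched in code (file location and field names are illustrative):

```python
# Sketch: record planned vs. completed items so a context-limit restart
# resumes the batch instead of recreating it from scratch.
import json
import os

def start_batch(operation: str, planned, path: str) -> None:
    """Write the breadcrumb before the first item is touched."""
    with open(path, "w") as f:
        json.dump({"operation": operation, "planned": planned,
                   "completed": []}, f)

def mark_done(item: str, path: str) -> None:
    with open(path) as f:
        state = json.load(f)
    state["completed"].append(item)
    with open(path, "w") as f:
        json.dump(state, f)

def remaining(path: str):
    """What a fresh session should pick up after a mid-batch restart."""
    if not os.path.exists(path):
        return []          # no breadcrumb → nothing was in flight
    with open(path) as f:
        state = json.load(f)
    return [i for i in state["planned"] if i not in state["completed"]]
```

On successful completion the breadcrumb file is deleted, so `remaining()` returning an empty list is the steady state.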

Safety Stop System. A NATS-based halt/pause/resume mechanism that allows principals or other agents to stop an agent’s autonomous operations immediately. Safety stop messages are processed at the highest priority — above task pickup, above heartbeat checks, above any in-progress work. The system was designed after a data spill incident demonstrated that autonomous agents need an emergency brake, not just careful instructions.

NATS Audit Log on Wake. Every session, agents read their recent NATS inbox log (nats-inbox.jsonl) before doing anything else. This provides cross-session continuity for inter-agent exchanges that happened while the agent was asleep — preventing the “I didn’t see your message” problem that plagued early coordination attempts where messages fell into gaps between sessions.

Every agent-executable task carries four custom fields that make the difference between “a description a human could interpret” and “instructions an agent can execute”:

  • Method — maps to a specific skill (deep-research, brainstorming, implementation, configuration, documentation)
  • Deliverable — the concrete output (a file, a config change, a deployed service)
  • Acceptance Criteria — verifiable conditions for “done”
  • Agent Instructions — step-by-step execution guidance, including which tools to use and what to read first

Label conventions enforce ownership and work allocation: agent:* for assignment, work:available/work:assigned/work:collaborative for pickup semantics, lead:* for task leadership, and needs-spencer-review/needs-debra-review for principal-hold gates. The manual method explicitly marks tasks that require human execution — agents skip these during autonomous pickup.

The autonomous work system is organized as a four-layer skill stack, where each layer has a distinct responsibility:

  • backlog-planning — what to work on and when. Handles weekly planning reviews, daily triage, and on-demand “what’s next?” recommendations. Uses NATS to solicit contributions from all agents before proposing a plan to principals for approval.
  • work-workflows — how to decompose a work request by type. Six work types (feature, bug, spike, skill, config, docs) each have an ordered pipeline with phase gates and skip conditions.
  • task-authoring — how to write and close each task. Defines required fields, label conventions, the 7-step autonomous pickup flow, and Definition of Done.
  • method skills (23 skills) — how to execute each phase. Deep research, brainstorming, TDD, systematic debugging, writing plans, and others.

Each layer depends on the one below it but never reaches down two levels. Planning never circumvents task-authoring guardrails. Task authoring never dictates decomposition strategy. This separation of concerns prevents the kind of cascading failures that happen when a single monolithic “agent workflow” tries to handle everything.

Multi-step agent operations — DoD verification, portfolio promotion, sprint planning, ticket pickup — historically burned 5–30+ LLM calls for pure orchestration. Lobster, a deterministic workflow runtime built into OpenClaw, replaces that overhead with token-free pipeline execution.

Pipelines are triggered by Jira webhooks, heartbeat timers, or PreToolUse hooks. Each pipeline parses a .lobster file and executes steps sequentially, piping each step’s stdout to the next step’s stdin — Unix-style composition with environment variables carrying structured data between stages. Steps can be shell scripts (Node.js, bash), LLM tasks (schema-validated JSON output), or openclaw.invoke calls (direct tool execution).
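
The stdout-to-stdin composition can be reduced to a few lines. This is an illustrative reduction in Python, not Lobster's actual `.lobster` parser or runtime:

```python
# Sketch of Unix-style step composition: each step's stdout becomes the
# next step's stdin. Steps here are subprocesses, standing in for Lobster's
# shell-script / LLM-task / openclaw.invoke step types.
import subprocess
import sys

def run_pipeline(steps, initial_input=b""):
    data = initial_input
    for cmd in steps:
        result = subprocess.run(cmd, input=data, capture_output=True, check=True)
        data = result.stdout          # pipe stdout → next step's stdin
    return data

# Two deterministic steps that spend zero LLM tokens:
upper = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
exclaim = [sys.executable, "-c",
           "import sys; sys.stdout.write(sys.stdin.read().strip() + '!')"]
assert run_pipeline([upper, exclaim], b"dod checks passed") == b"DOD CHECKS PASSED!"
```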

| Pipeline | Replaces | Token Savings |
| --- | --- | --- |
| DoD Verification | Manual 9-check DoD compliance audit | ~15k tokens/run → 0 |
| DoD Transition | PreToolUse hook orchestration for Done transitions | ~8k tokens/run → 0 |
| Sprint Planning | 30+ manual tool calls for weekly review | ~45k tokens/run → 0 |
| Work Decomposition | Manual sub-task creation from workflow templates | ~20k tokens/run → 0 |
| Commit-First Deploy | Manual verify → commit → push → deploy sequence | ~10k tokens/run → 0 |
| Session Close | Memory + learnings capture on session end | ~12k tokens/run → 0 |

The key insight: most agent “orchestration” is deterministic sequencing that doesn’t need intelligence. By moving orchestration to Lobster and reserving LLM calls for genuinely creative steps (analysis, writing, decision-making), the system achieves the same outcomes at a fraction of the token cost.

The DoD Verification and DoD Transition pipelines work together with the existing heartbeat backstop to create a three-layer enforcement system:

  1. PreToolUse Hook — intercepts jira_transition_issue calls, blocking Done transitions until the Lobster pipeline passes
  2. Lobster Pipeline — runs 6 Jira field checks, memory file recency validation, Graphiti episode/fact verification, follow-on signal detection, and produces a unified 9-check report
  3. Heartbeat Backstop — every 2 hours, a verification script catches any tickets that escaped the first two layers and auto-reverts them to To Do

This three-layer approach catches failures at three different timescales: real-time (hook), near-real-time (pipeline), and periodic (heartbeat).

Before expanding agent tool access, Fiducian and ClaudeCodeAgent conducted a joint security assessment of 11 candidate tools. The review followed a fiduciary-first methodology: each agent independently assessed risks from their operational perspective, then merged findings into a consensus recommendation.

| Tool | Risk Level | Decision | Rationale |
| --- | --- | --- | --- |
| diffs | 🟢 None | Enable | Read-only file comparison |
| llm-task | 🟢 Low | Enable | JSON-only, schema-validated, no side effects |
| Brave Search | 🟢 Low | Enable | Web search with API key |
| loopDetection | 🟢 None | Enable | Safety feature (prevents infinite loops) |
| apply_patch | 🟡 Medium | Enable w/ mitigations | File integrity CronJob monitors seeded plugin files |
| lobster | 🟡 Medium | Enable w/ mitigations | Pipeline audit logging required |
| Discord voice | 🟡 Medium | Enable w/ mitigations | Discord role scoping + TTS-only |
| bash | 🔴 High | Defer | SafeBins filter has defense-in-depth value |
| browser | 🔴 High | Defer | Needs sandboxing (designed as MCP service) |
| nodes | 🔴 Critical | Deny | Child privacy — dependent’s data exposure |
| voice-call | 🔴 Critical | Deny | Impersonation risk violates Layer 0 |

The nodes denial is particularly instructive: the tool would give agents access to the host operating system, which runs in the same environment as the family’s child devices. Child privacy isn’t a judgment call — it’s a structural boundary. Similarly, voice-call was denied because real-time voice synthesis could enable impersonation, directly violating the trust framework’s immutable “no impersonation” principle.

Infrastructure changes follow a preflight → execute → verify pattern. The v1.28→v1.29 Kubernetes upgrade and Cilium v1.16→v1.17 migration included automated preflight checks: API compatibility verification, deprecated resource scanning, storage driver validation, and network policy migration assessment — all documented as go/no-go evidence before any changes are applied to the production cluster.

The FAD system is documented across multiple focused pages, each covering a specific subsystem in depth.

This is a distributed system with real constraints. Agents are separate processes in separate pods with separate storage. They communicate through message passing (NATS), not shared memory. State coordination uses a graph database (Graphiti) with group-level isolation. The MCP Gateway provides service discovery and routing without coupling agents to specific tool implementations.

The system handles the realities of distributed computing: network partitions (Cilium network policies), state consistency (Graphiti as authoritative shared state), message ordering (JSON-RPC correlation IDs), and failure modes (MCP Gateway health checks, NATS durability).

The inter-agent protocol is a from-scratch design that needed to solve a real problem: how do two AI agents, each with undivided loyalty to a different human, communicate without either compromising their principal’s interests?

The solution layers identity (JSON-RPC headers), authentication (Ed25519 signatures), confidentiality (sealed boxes), authorization (information sharing tiers), and transparency (audit logging + Discord summaries) into a coherent stack. The JSON-RPC 2.0 alignment was deliberate — it provides interoperability with the broader MCP ecosystem while supporting the domain-specific requirements of fiduciary agents.

Security isn’t bolted on — it’s the foundation. The platform has evolved from static credentials (long-lived PATs on pods) to a fully dynamic security model: the Credential Broker issues short-lived, per-request scoped tokens via a GitHub App backend, eliminating persistent secrets from agent pods entirely. Every inter-agent message is cryptographically signed. Tier 3/4 data is end-to-end encrypted such that transport infrastructure can verify sender identity without reading message content. Kubernetes network policies enforce namespace-level isolation. CiliumNetworkPolicy provides L3-L7 enforcement. Agents can’t reach each other’s workspaces. Home Assistant (which controls physical devices) is locked to minimum necessary traffic.

The trust model isn’t “trust everything inside the cluster.” It’s “verify identity, classify information, enforce boundaries, log everything, and give principals full visibility.”

The hardest design problem in the system isn’t technical — it’s philosophical. How do you build a cooperative multi-agent system where each agent has an undivided loyalty to a different person?

The answer isn’t compromise. It’s protocol. Information sharing tiers make the boundaries structural rather than judgmental. Relationship templates provide customizable defaults. The nudge protocol enables cooperation (hints for kindness) without enabling surveillance (source can’t be reconstructed). Conflict resolution has four explicit levels with clear escalation paths. Emergency overrides have four narrow conditions with mandatory post-emergency reporting.

The framework explicitly adopts Nash equilibrium thinking: the best outcomes for each principal come from cooperation, not from “winning” at the other’s expense. Positive-sum over zero-sum. But this cooperation is constrained — it never requires either agent to compromise its fiduciary duty.

The system doesn’t just build things — it governs itself. Definition of Done enforcement prevents agents from claiming completion without evidence. Event-driven task pickup eliminates polling overhead and race conditions. Safety-stop mechanisms provide emergency override capability. Joint security reviews ensure tool expansion follows a structured risk assessment, not ad-hoc enablement.

This operational governance layer is what separates a demo from a production system. Agents make mistakes — they skip documentation, claim tasks that are blocked, close tickets without follow-on work. The governance layer catches those mistakes structurally, through hooks and pipelines and backstops, rather than relying on agent “judgment” to always be correct.

Lobster pipelines represent a meta-insight about agent systems: most of what agents spend tokens on isn’t thinking — it’s sequencing. DoD checks, sprint planning, ticket creation, memory maintenance — these are deterministic workflows that don’t require intelligence. Moving them to a pipeline runtime reduced token consumption by orders of magnitude while improving reliability (deterministic pipelines don’t hallucinate steps or skip checks).

This is the same pattern that appears in human organizations: you don’t need a senior engineer to run a deployment checklist. You need a senior engineer to design the checklist, and a reliable process to execute it.

The system spans every layer: ARM64 hardware → vendor Linux kernel → vanilla Kubernetes (kubeadm) → Cilium eBPF networking → Longhorn distributed storage → Helm-deployed workloads → MCP protocol layer → agent runtime → trust framework → Lobster workflow automation → human communication channels. There’s no managed service hiding the complexity. Every layer is visible, operable, and understood — from debugging kernel BTF support on Rockchip vendor kernels to designing information sharing tier semantics for household agents.

This is not a prototype. It runs 24/7, manages real finances, controls real home automation, and handles real family coordination. The infrastructure decisions — Cilium over kube-proxy, Longhorn over NFS, vanilla kubeadm over K3s, JSON-RPC 2.0 over ad-hoc messaging — were made under real constraints and validated by daily operation. The platform runs OpenClaw 2026.3.2 with 23 federated skills, event-driven task pickup, cross-agent DoD enforcement, and a NATS Bridge Plugin that enables real-time inter-agent communication with 5–15 second delivery latency.