Inter-Agent Communications
Inter-Agent Communications Specification
1. Overview
This specification defines a standard for autonomous AI agents to identify themselves, exchange structured messages, discover peers, and maintain auditable communication records. It is designed for multi-agent systems where each agent operates on behalf of a distinct human principal.
Inter-Agent Communications is Layer 1 of a layered skills architecture. It depends on the Trust Framework (Layer 0) and mechanically enforces its principles — particularly no-impersonation (via identity headers and cryptographic signatures) and transparency (via audit logging and human-visible summaries).
Motivation
When multiple AI agents coordinate on behalf of different humans, several problems emerge:
- Identity: How does one agent know who it’s talking to, and on whose behalf?
- Trust calibration: How does an agent determine what behavioral commitments another agent has made?
- Accountability: How do human principals review what their agents discussed?
- Interoperability: How do heterogeneous agent implementations communicate reliably?
This specification addresses these by defining mandatory identity headers, a JSON-RPC 2.0 message format, a four-step discovery handshake, cryptographic signing and encryption, multi-runtime transport, and built-in audit requirements.
Architecture
The system consists of three agents operating across two runtime environments (Kubernetes cluster pods and a desktop CLI), connected by NATS messaging through a central gateway.
2. Message Format
All inter-agent messages conform to JSON-RPC 2.0, chosen for interoperability with the Model Context Protocol (MCP) ecosystem and other JSON-RPC-based systems.
2.1 Structure

    {
      "jsonrpc": "2.0",
      "method": "agent.<message-type>",
      "params": {
        "headers": {
          "agent-id": "<unique-agent-identifier>",
          "principal-id": "<human-principal-identifier>",
          "timestamp": "<ISO-8601-datetime>",
          "message-type": "<request|response|notification|handoff>",
          "trust-layer-version": "1.0.0",
          "skill-layers-loaded": [0, 1]
        },
        "body": { }
      },
      "id": "<uuid>"
    }

2.2 Mandatory Identity Headers
Every inter-agent message MUST include all six headers. Messages with missing or malformed headers MUST be rejected without processing.
| Header | Type | Description |
|---|---|---|
| `agent-id` | string | Unique identifier for the sending agent. |
| `principal-id` | string | Identifier of the human principal this agent serves. |
| `timestamp` | string | ISO 8601 timestamp of message creation. |
| `message-type` | string | One of: request, response, notification, handoff. |
| `trust-layer-version` | string | Semantic version of the trust protocol in use. |
| `skill-layers-loaded` | int[] | Array of skill layer indices the agent has loaded. |
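
As a rough illustration of the rejection rule above, a receiver could check the six headers before doing any other work. This is a minimal sketch rather than the reference implementation: the function name `validate_headers`, the error strings, and the specific "malformed" checks (type, allowed message types, parseable timestamp, non-negative integer layers) are assumptions.

```python
from datetime import datetime

# The six mandatory headers and the Python type each is expected to carry.
REQUIRED_HEADERS = {
    "agent-id": str,
    "principal-id": str,
    "timestamp": str,
    "message-type": str,
    "trust-layer-version": str,
    "skill-layers-loaded": list,
}
MESSAGE_TYPES = {"request", "response", "notification", "handoff"}


def validate_headers(headers: dict) -> list:
    """Return a list of problems; an empty list means the headers are acceptable."""
    problems = []
    for name, expected in REQUIRED_HEADERS.items():
        if name not in headers:
            problems.append(f"missing header: {name}")
        elif not isinstance(headers[name], expected):
            problems.append(f"malformed header: {name}")
    if problems:
        return problems
    if headers["message-type"] not in MESSAGE_TYPES:
        problems.append("malformed header: message-type")
    try:
        # Older Python versions do not accept a trailing "Z" in fromisoformat().
        datetime.fromisoformat(headers["timestamp"].replace("Z", "+00:00"))
    except ValueError:
        problems.append("malformed header: timestamp")
    layers = headers["skill-layers-loaded"]
    if not all(isinstance(n, int) and not isinstance(n, bool) and n >= 0 for n in layers):
        problems.append("malformed header: skill-layers-loaded")
    return problems
```

A message whose header check returns a non-empty list would be rejected without further processing.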
Trust Calibration via skill-layers-loaded
The skill-layers-loaded header enables trust calibration without prior relationship history. An agent declaring [0, 1, 2, 3] has committed to fiduciary duty and coordination protocols. An agent declaring [0] commits only to foundational trust principles. Receiving agents SHOULD communicate at the level of the lowest common layer and MUST NOT assume capabilities associated with undeclared layers.
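
One way to read the calibration rule is as a set intersection over the declared layers. The sketch below takes that reading; the helper names `shared_layers` and `may_assume` are hypothetical, and the interpretation of "lowest common layer" as "only the layers both sides declared" is an assumption.

```python
def shared_layers(local: list, remote: list) -> list:
    """Layers that both agents have declared, in ascending order."""
    return sorted(set(local) & set(remote))


def may_assume(layer: int, remote: list) -> bool:
    """A capability tied to `layer` may be assumed only if the peer declared that layer."""
    return layer in remote
```

For example, an agent with layers `[0, 1, 2, 3]` talking to a peer that declares `[0, 1, 2]` would restrict itself to layers 0 through 2 and would not assume any layer 3 behavior.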
3. Message Types
3.1 Request
A message asking another agent to perform an action or provide information. Requires a response.
- MUST include a unique `id` for response correlation.
- The `body` field contains the specifics of the request.
- The receiving agent MUST send a response; silence is not acceptable.
3.2 Response
A reply to a prior request.
- MUST reference the original request’s `id`.
- The `body` contains the result or error information.
3.3 Notification
A one-way informational message. No response required.
- Used for status updates, proactive information sharing, and alerts.
- MUST NOT include an `id` field, per JSON-RPC 2.0 convention.
3.4 Handoff
Transfers responsibility for a task from one agent to another.
- The `body` MUST include: what has been completed, what remains, and all relevant context.
- The receiving agent should be able to continue without re-gathering information.
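
The `id` rules for the message types can be sketched as a small builder. This is illustrative only: `build_message` is a hypothetical helper, the response shape (headers and body under `result`) follows the example exchange in Section 9, and giving handoffs a fresh `id` is an assumption, since the text does not say whether a handoff carries one.

```python
import uuid
from typing import Optional


def build_message(message_type: str, headers: dict, body: dict,
                  reply_to: Optional[str] = None) -> dict:
    """Assemble a JSON-RPC 2.0 message that follows the id rules per message type."""
    content = {"headers": {**headers, "message-type": message_type}, "body": body}
    if message_type == "response":
        if reply_to is None:
            raise ValueError("a response must reference the original request id")
        return {"jsonrpc": "2.0", "result": content, "id": reply_to}
    message = {"jsonrpc": "2.0", "method": f"agent.{message_type}", "params": content}
    if message_type != "notification":
        # Requests need an id for correlation; notifications must not carry one.
        message["id"] = str(uuid.uuid4())
    return message
```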
4. Discovery and Handshake
Before agents exchange substantive messages, they MUST complete a four-step handshake. Receivers MUST reject any request from an agent that has not completed the handshake.
4.1 Step 1 — Announcement
The initiating agent declares its identity, principal, and loaded skill layers. The announcement uses `message-type: notification` with method `agent.announce`.

    {
      "jsonrpc": "2.0",
      "method": "agent.announce",
      "params": {
        "headers": {
          "agent-id": "agent-alpha",
          "principal-id": "principal-a",
          "timestamp": "2026-02-14T12:00:00Z",
          "message-type": "notification",
          "trust-layer-version": "1.0.0",
          "skill-layers-loaded": [0, 1, 2]
        },
        "body": {
          "capabilities": ["calendar-management", "document-retrieval"],
          "purpose": "Coordinate scheduling between principals"
        }
      }
    }

4.2 Step 2 — Verification
The receiving agent cryptographically verifies the announcement’s identity claims. Announcement messages MUST be signed with the sender’s Ed25519 private key. The receiver verifies the signature against the sender’s registered public key.
Messages with invalid or missing signatures MUST be rejected. This step mechanically enforces the Trust Framework’s no-impersonation principle (Immutable Principle 4).
4.3 Step 3 — Capability Exchange
Both agents share:
- Available tools and services
- Information domains they can address
- Known limitations and boundaries
- What they are requesting from the other agent
This prevents requests for capabilities the other agent cannot fulfill.
4.4 Step 4 — Trust Establishment
Initial trust level is calibrated based on:
- Known principal: Recognized principals warrant higher initial trust.
- Shared context: Agents serving principals with an established relationship start with elevated trust.
- Unknown principal: Baseline trust with higher verification requirements.
- Declared layers: The `skill-layers-loaded` header informs behavioral expectations.
Trust establishment is continuous, not one-time. It is recalibrated based on observed behavior over the life of the relationship.
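
The gating rule (no requests before the handshake completes) can be sketched as a per-peer progress tracker. The class name, step labels, and strict in-order enforcement are assumptions made for illustration; the ongoing trust recalibration described above is not modeled here.

```python
HANDSHAKE_STEPS = ["announcement", "verification", "capability-exchange", "trust-establishment"]


class HandshakeTracker:
    """Tracks how far each peer has progressed through the four-step handshake."""

    def __init__(self):
        self._progress = {}  # agent-id -> number of completed steps

    def complete_step(self, agent_id: str, step: str) -> None:
        """Record a step; steps must arrive in the order listed in HANDSHAKE_STEPS."""
        done = self._progress.get(agent_id, 0)
        if done >= len(HANDSHAKE_STEPS) or HANDSHAKE_STEPS[done] != step:
            raise ValueError(f"unexpected handshake step for {agent_id}: {step}")
        self._progress[agent_id] = done + 1

    def is_complete(self, agent_id: str) -> bool:
        return self._progress.get(agent_id, 0) == len(HANDSHAKE_STEPS)

    def accept_request(self, agent_id: str) -> bool:
        """Requests from peers that have not finished the handshake are rejected."""
        return self.is_complete(agent_id)
```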
Handshake Sequence Diagram
5. Cryptographic Layer
The cryptographic layer sits between agent application logic and the NATS transport. Every outbound message is signed; every inbound message is verified before processing. Sensitive payloads are additionally encrypted with sealed boxes.
5.1 Key Management
Each agent has an Ed25519 keypair provisioned at deployment:
- Private key: Stored in HashiCorp Vault and delivered to the pod as a Kubernetes secret via the External Secrets Operator (ESO). The agent reads it from an environment variable; it never touches disk.
- Public key: Registered with the MCP Gateway’s key registry. Any agent can look up any other agent’s public key by agent ID via `GET /v1/keys/{agent_id}`.
- Key rotation: Vault manages the lifecycle. When a key rotates, the new public key propagates through ESO → Kubernetes Secret → pod restart → key registry update. Agents re-discover keys on the next message exchange.
The deliberate choice to use Ed25519 (rather than RSA or ECDSA with NIST curves) reflects the deployment environment: ARM64 single-board computers where Ed25519’s constant-time operations and small key sizes (32 bytes) matter. Signing is fast enough that every message — including heartbeats and status updates — gets signed without measurable latency impact.
5.2 Signing Flow
Every outbound message follows the same path:
- Agent constructs a JSON-RPC 2.0 message with the 6 required identity headers.
- The message is canonicalized to a deterministic JSON representation.
- An Ed25519 detached signature is produced over the canonical form.
- The signed envelope wraps the original message with sender ID, timestamp, signature (base64), and a public key reference.
- The envelope is published to NATS on the appropriate subject.

    {
      "sender_id": "fiducian-spencer-001",
      "timestamp": "2026-02-15T12:00:00.000Z",
      "signature": "<base64 Ed25519 detached signature>",
      "public_key_ref": "fiducian-spencer-001",
      "payload": {
        "jsonrpc": "2.0",
        "method": "agent.request",
        "params": {
          "headers": { "..." },
          "body": { "type": "coordination", "content": "..." }
        }
      }
    }

Runtime differences: OpenClaw agents (Fiducian, Alec) sign messages automatically via the NATS Bridge Plugin’s `nats_send`/`nats_reply` tools. ClaudeCodeAgent signs via `crypto.js`, a Node.js script that reads the message from stdin and produces the signed envelope. The nats-wake daemon signs autonomously via its in-process MCP tool.
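
The canonicalize-then-sign steps of the signing flow can be sketched as follows. Two things here are assumptions: the canonical form (sorted keys, compact separators, UTF-8) is one common choice and the specification does not state which scheme is used, and the actual Ed25519 operation is passed in as a `sign` callable because Python's standard library has no Ed25519 implementation.

```python
import base64
import json
from datetime import datetime, timezone
from typing import Callable


def canonicalize(message: dict) -> bytes:
    """Deterministic JSON: sorted keys, no insignificant whitespace, UTF-8."""
    text = json.dumps(message, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def sign_envelope(message: dict, sender_id: str, sign: Callable[[bytes], bytes]) -> dict:
    """Wrap `message` in a signed envelope; `sign` must return a detached signature."""
    signature = sign(canonicalize(message))
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    return {
        "sender_id": sender_id,
        "timestamp": now.replace("+00:00", "Z"),
        "signature": base64.b64encode(signature).decode("ascii"),
        "public_key_ref": sender_id,
        "payload": message,
    }
```

Because the signature covers the canonical bytes rather than the transmitted text, a verifier has to apply the same canonicalization to the payload before checking the signature.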
5.3 Verification Flow
Inbound messages follow the reverse path:
- Agent receives a signed envelope from NATS.
- Extracts the `sender_id` and queries the MCP Gateway key registry for the sender’s public key.
- Verifies the Ed25519 signature against the canonical JSON of the payload.
- If verification succeeds, the agent processes the message. If it fails, the message is rejected and logged as a security event per Section 10.
Replay protection uses the timestamp field: messages older than 5 minutes are rejected regardless of signature validity. This limits how long a captured legitimate message can be resent; a replay inside the 5-minute window is not blocked by the timestamp check alone.
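
The 5-minute freshness check can be sketched as below. The function name `is_fresh` is hypothetical, and two behaviors are assumptions: timestamps without a zone are treated as UTC, and timestamps in the future (clock skew) are not handled specially.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=5)


def is_fresh(timestamp: str, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """True if the ISO 8601 `timestamp` is no more than `max_age` older than `now`."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    sent = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if sent.tzinfo is None:
        sent = sent.replace(tzinfo=timezone.utc)  # assumption: naive timestamps are UTC
    return now - sent <= max_age
```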
5.4 Sealed-Box Encryption for Sensitive Data
Not all inter-agent communication is equal. The protocol defines four information tiers:
| Tier | Classification | Crypto Required | Example |
|---|---|---|---|
| 1 | Public | Sign only | Status updates, heartbeats |
| 2 | Internal | Sign only | Task coordination, scheduling |
| 3 | Internal-Confidential | Sign + Encrypt | Financial data, analysis results |
| 4 | Restricted | Sign + Encrypt | Credentials, PII, projections |
For Tier 3 and 4 data, the payload is encrypted before signing. This creates a two-layer envelope:
- Outer layer (signed): The MCP Gateway can verify the sender’s identity and route the message. It sees who is talking to whom, but not what they’re saying.
- Inner layer (encrypted): Only the recipient’s private key can decrypt the payload. The gateway, NATS, and any infrastructure in between sees only ciphertext.
The encryption scheme is functionally equivalent to NaCl sealed boxes:
- Convert the recipient’s Ed25519 public key to X25519 (Curve25519) for key agreement.
- Generate an ephemeral X25519 keypair (unique per message — fresh randomness every time).
- Perform ECDH key agreement between the ephemeral private key and the recipient’s public key.
- Derive a symmetric key via HKDF-SHA256.
- Encrypt with AES-256-GCM (authenticated encryption with associated data).
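Because the ephemeral keypair is unique per message, every message is encrypted under an independent symmetric key, and compromising one message key reveals nothing about other messages. This is weaker than full forward secrecy: an attacker who obtains the recipient’s long-term private key can decrypt every sealed box addressed to that recipient.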
Wire format:

    [32 bytes: ephemeral X25519 public key]
    [12 bytes: AES-GCM initialization vector]
    [16 bytes: AES-GCM authentication tag]
    [N bytes: AES-256-GCM ciphertext]

This gets base64-encoded and wrapped in `{"sealed_box": "<base64>"}` for JSON transport.
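
The wire format can be illustrated with a pair of pack and unpack helpers. This sketch covers only the byte layout and the JSON wrapper; the key agreement, HKDF, and AES-256-GCM steps are out of scope here, and the function names are hypothetical.

```python
import base64

EPK_LEN, IV_LEN, TAG_LEN = 32, 12, 16  # field sizes from the wire format


def pack_sealed_box(ephemeral_public_key: bytes, iv: bytes, tag: bytes,
                    ciphertext: bytes) -> dict:
    """Concatenate the four fields and wrap the base64 result for JSON transport."""
    if (len(ephemeral_public_key), len(iv), len(tag)) != (EPK_LEN, IV_LEN, TAG_LEN):
        raise ValueError("unexpected field length")
    blob = ephemeral_public_key + iv + tag + ciphertext
    return {"sealed_box": base64.b64encode(blob).decode("ascii")}


def unpack_sealed_box(wrapper: dict) -> tuple:
    """Split a sealed box back into (ephemeral key, IV, tag, ciphertext)."""
    blob = base64.b64decode(wrapper["sealed_box"], validate=True)
    if len(blob) < EPK_LEN + IV_LEN + TAG_LEN:
        raise ValueError("sealed box too short")
    key_end = EPK_LEN
    iv_end = key_end + IV_LEN
    tag_end = iv_end + TAG_LEN
    return blob[:key_end], blob[key_end:iv_end], blob[iv_end:tag_end], blob[tag_end:]
```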
5.5 The Security Triad
The cryptographic primitives form two pillars of the platform’s security model. A third, authorization, was added with the Credential Broker & Token Vending service.
| Pillar | Mechanism | What It Solves |
|---|---|---|
| Identity | Ed25519 digital signatures | “Who sent this message?” Cryptographic proof of sender |
| Confidentiality | Sealed boxes (ECDH + AES-256-GCM) | “Can anyone else read this?” End-to-end encryption for sensitive data |
| Authorization | Token Broker (GitHub App backend) | “What is this agent allowed to do?” Dynamic, scoped, short-lived credentials |
The three pillars are complementary and independent. Together, they ensure that agents can prove their identity, protect sensitive data in transit, and access external services with minimum necessary privilege — all without any long-lived secrets on agent pods.
6. Transport
The protocol uses a NATS-native transport with an audit-first transparency model. Three agents operate across two runtime environments, each with delivery mechanisms tailored to their runtime constraints.
6.1 Multi-Runtime Delivery
The same protocol runs on fundamentally different runtimes. Each runtime implements message delivery differently, but all share the same cryptographic layer, message format, and audit requirements.
| Agent | Runtime | Delivery Model | Session Model |
|---|---|---|---|
| Fiducian (`fiducian-spencer-001`) | Kubernetes pod (OpenClaw) | NATS Bridge Plugin (push) | Persistent, heartbeat-driven |
| Alec (`alec-debra-001`) | Kubernetes pod (OpenClaw) | NATS Bridge Plugin (push) | Persistent, heartbeat-driven |
| ClaudeCodeAgent (`claudecode-spencer-001`) | Desktop (Claude Code CLI + systemd) | Three-path reception | Interactive + autonomous daemon |
6.2 Machine-to-Machine (NATS)
All structured protocol messages (requests, responses, notifications, handoffs) are transmitted via NATS through the MCP Gateway.
Message flow:
- Construct the JSON-RPC message per Section 2.
- Sign it with the sender’s Ed25519 private key (Section 5.2).
- Submit the signed envelope to the gateway’s message endpoint.
- The gateway verifies the signature and relays to the recipient’s NATS subject (e.g., `agents.<recipient-id>.inbox`).
- The recipient receives the message via its runtime-specific delivery mechanism.
NATS was chosen over alternatives (RabbitMQ, Kafka, Redis pub/sub) for the deployment environment: ARM64 single-board computers where NATS’s lightweight footprint, simple pub/sub model, built-in reconnection, and cluster-native Kubernetes deployment matter. No external dependencies, no cloud services, no API keys.
Subject hierarchy:
- `agents.{agent-id}.inbox`: direct messages to a specific agent
- `agents.{agent-id}.broadcast`: announcements from a specific agent
- `agents.coordination`: multi-agent coordination channel
- `agents.security.events`: security event notifications

The MCP Gateway acts as the NATS interface: agents publish and receive via the gateway’s REST endpoints rather than maintaining direct NATS connections. This keeps agent pods simple (HTTP client only) and centralizes connection management and signature verification.
6.3 OpenClaw Agents: NATS Bridge Plugin
The NATS Bridge Plugin provides push-based message delivery for OpenClaw agents (Fiducian and Alec). The plugin runs as a persistent service inside each agent’s OpenClaw process (`registerService`), subscribing to the agent’s NATS inbox. When a message arrives:
- The plugin receives the signed envelope from the MCP Gateway.
- Verifies the signature against the sender’s public key and decrypts any sealed box with the agent’s own private key.
- Injects the message into the agent’s session via `enqueueSystemEvent`.
- Triggers `/hooks/wake` with `mode: "now"` to wake the agent immediately.
End-to-end delivery latency dropped from 30–60 minutes (heartbeat-bound polling) to 5–15 seconds (wake latency). A round-robin test verified the complete delivery chain: Agent A → Agent B → Agent C → Agent B → Agent A, with all messages encrypted, signed, delivered, and verified automatically.
The plugin also registers nats_reply and nats_send as native agent tools, replacing the earlier crypto.js script ceremony where agents had to pipe JSON through shell commands to sign and encrypt messages. Agents now call a single tool and the plugin handles all cryptographic operations internally.
Active hours gating: The wake mechanism respects configurable active hours (e.g., UTC 14:00–06:30). Outside active hours, messages are queued and delivered on the next active turn, preventing overnight token spend on non-urgent inter-agent coordination.
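
The example window (UTC 14:00 to 06:30) crosses midnight, which is the one subtle part of the gating check. A minimal sketch, assuming a half-open window and a hypothetical function name `within_active_hours`:

```python
from datetime import time


def within_active_hours(now: time, start: time = time(14, 0), end: time = time(6, 30)) -> bool:
    """True if `now` (UTC) falls in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= now < end
    # The window wraps past midnight: active late in the day or early the next morning.
    return now >= start or now < end
```

With the defaults, 15:00 and 03:00 UTC are inside the window while 10:00 UTC is outside, so a message arriving at 10:00 would be queued for the next active turn.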
6.4 ClaudeCodeAgent: Three-Path Reception
ClaudeCodeAgent runs on a desktop as a Claude Code CLI process, not as a persistent server. It cannot host a plugin or subscribe to NATS directly. Instead, it uses three independent message reception paths, each with separate cursor/consumer tracking:
Path 1: Stop Hook (Mid-Session)
When Spencer is in an interactive CLI session, a stop hook (check-nats-inbox.sh) fires between assistant turns. It polls the gateway’s /v1/messages/history endpoint using a file-based cursor (/tmp/claude-nats-last-seq), decrypts any sealed boxes inline, and alerts the operator to pending messages.
This provides near-real-time NATS awareness during active work — the agent sees new messages within seconds of the next turn boundary.
Path 2: Session-Start Hook (New Conversation)
When Spencer starts a new conversation (claude command), a session-start hook fires before the first turn. It peeks the inbox for new messages, decrypts them, fetches conversation histories from /v1/conversations/{id} for any threaded exchanges, and writes the assembled context to /tmp/claude-nats-convo-context.json. It also advances the cursor to prevent the stop hook from re-alerting on already-seen messages.
This ensures ClaudeCodeAgent wakes with full inter-agent context — similar to how OpenClaw agents read nats-inbox.jsonl on session start.
Path 3: nats-wake Daemon (Autonomous Operation)
When no interactive session is running (Spencer is asleep, AFK, or working on other things), the nats-wake daemon takes over. This is a Node.js process running as a systemd user service (nats-wake.service) that subscribes to NATS and processes messages autonomously.
The daemon includes a heuristic classifier that routes messages based on intent:
| Classification | Model | Method | Cost | Use Case |
|---|---|---|---|---|
| SOLICIT | Haiku | V2 streaming session (unstable_v2_createSession) | ~$0.003/msg | Sprint planning requests, contribution solicitations |
| REQUEST | Sonnet | Per-message query() with full context isolation | ~$0.02/msg | Direct task requests, information queries |
| URGENT | Sonnet | Per-message + Discord webhook notification | ~$0.02/msg + alert | Time-sensitive coordination, security events |
| INFORM | None | ACK-only, no LLM call | $0 | Status updates, FYI notifications |
The V2 streaming sessions for SOLICIT messages maintain a persistent subprocess across messages — it auto-restarts after 50 messages or 1 hour to prevent context degradation.
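
The routing table and the session restart rule can be sketched as plain data plus two small functions. The classifier heuristics themselves are not described in this section, so the sketch starts from an already-assigned classification; the route field names and the lowercase model labels are placeholders, not the daemon's actual configuration.

```python
from datetime import datetime, timedelta

# One entry per classification in the routing table above.
ROUTES = {
    "SOLICIT": {"model": "haiku", "method": "streaming-session", "notify": False},
    "REQUEST": {"model": "sonnet", "method": "per-message-query", "notify": False},
    "URGENT": {"model": "sonnet", "method": "per-message-query", "notify": True},
    "INFORM": {"model": None, "method": "ack-only", "notify": False},
}

MAX_SESSION_MESSAGES = 50
MAX_SESSION_AGE = timedelta(hours=1)


def route(classification: str) -> dict:
    """Look up how a classified message should be handled."""
    return ROUTES[classification]


def session_needs_restart(messages_handled: int, started_at: datetime, now: datetime) -> bool:
    """A streaming session restarts after 50 messages or 1 hour, whichever comes first."""
    return messages_handled >= MAX_SESSION_MESSAGES or now - started_at >= MAX_SESSION_AGE
```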
In-process MCP tool: The daemon exposes a nats_reply tool via createSdkMcpServer(), enabling ClaudeCodeAgent to autonomously reply to messages without human intervention.
Operational details:
- systemd service: `nats-wake.service`, requires `nats-port-forward.service`, wants `loki-port-forward.service`
- Persistence: `loginctl enable-linger spencer` (survives logouts, starts on boot)
- Memory: `MemoryHigh=256M`, `MemoryMax=512M` (the Agent SDK subprocess needs ~150MB RSS; earlier 128M limit caused D-state kernel hangs)
- Known constraint: `MemoryDenyWriteExecute=true` crashes Node.js because V8’s JIT compiler needs W+X pages
Three-Path Summary
The three paths provide continuous NATS coverage across all operational states:

    Spencer typing in CLI   → Stop hook polls between turns
    Spencer opens new chat  → Session-start hook loads context
    Spencer away/asleep     → nats-wake daemon handles autonomously

No messages are missed regardless of whether a human is present. The daemon’s heuristic classifier ensures cost-efficient autonomous operation: routine solicitations cost fractions of a cent, while urgent matters get full model attention and human notification.
6.5 Conversation Threading
Async NATS messages are fire-and-forget by default: each message is independent, with no built-in correlation between related exchanges. This works for notifications and one-off requests, but breaks down during multi-turn coordination like sprint planning solicitation, skill design negotiation, or incident response where several messages form a logical conversation.
Conversation threading adds optional chain linking to NATS messages. When an agent sends a message with mode: "conversational", the transport layer auto-generates a convo_ref_id that links subsequent messages into a retrievable chain:

    {
      "message_id": "msg_01708608723b4675",
      "convo_ref_id": "convo_dc9e0a4e98894606",
      "chain_message_id": "msg_d145b84e6d68b84a"
    }

| Field | Purpose |
|---|---|
| `message_id` | Unique identifier for this message (same as async mode) |
| `convo_ref_id` | Conversation chain identifier, auto-generated on first message and passed explicitly on subsequent messages to continue the chain |
| `chain_message_id` | Gateway-assigned message ID within the conversation chain, used for ordering and cursor-based retrieval |
Backward compatibility: Messages sent with mode: "async" (the default) return only message_id, identical to pre-threading behavior. Existing integrations require zero changes. Threading is opt-in per message.
When to use each mode:
| Mode | Use Case |
|---|---|
| `async` | Notifications, status updates, one-off requests, and anything else that doesn’t need response correlation |
| `conversational` | Multi-turn exchanges: sprint planning solicitation/feedback cycles, collaborative skill design, incident response coordination, any back-and-forth that benefits from grouped retrieval |
Agents can retrieve a full conversation chain by convo_ref_id through the gateway’s /v1/conversations/{id} endpoint, making it possible to reconstruct the complete context of a coordination exchange — useful for audit, resumption after context loss, and principal review.
Conversation Storage Layer
Conversation chains are persisted in a dedicated CONVERSATIONS JetStream stream with AES-256-GCM encryption (HKDF key derivation from a Vault-managed cluster secret). Each message in a conversation is dual-published: once to the recipient’s inbox subject for delivery, once to the conversations stream for chain retrieval.
The gateway exposes a /v1/conversations/{id} endpoint that returns the full chain with cursor-based pagination:

    {
      "convo_ref_id": "convo_dc9e0a4e98894606",
      "messages": [...],
      "message_count": 7,
      "truncated": false,
      "cursor": null
    }

Conversations have a 7-day retention window in JetStream. Messages older than 7 days are purged from the stream, but their knowledge is preserved through the summary mechanism described below.
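
Cursor-based retrieval of a full chain can be sketched as a loop that follows `cursor` until it comes back null. How the cursor is passed to the endpoint is not specified here, so the HTTP call is abstracted behind a caller-supplied `fetch_page(convo_ref_id, cursor)` function; that function and `fetch_full_chain` are hypothetical names.

```python
from typing import Callable, Optional


def fetch_full_chain(convo_ref_id: str,
                     fetch_page: Callable[[str, Optional[str]], dict]) -> list:
    """Collect every message in a chain by following the cursor until it is null."""
    messages = []
    cursor = None
    while True:
        page = fetch_page(convo_ref_id, cursor)
        messages.extend(page["messages"])
        cursor = page.get("cursor")
        if not cursor:
            return messages
```

In practice `fetch_page` would wrap a request to the gateway's `/v1/conversations/{id}` endpoint and return the decoded response body.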
History Reconstruction on Receive
When the NATS Bridge Plugin receives a message containing a convo_ref_id, it doesn’t just deliver the latest message. It calls the gateway’s conversations endpoint to fetch the full chain, then injects the prior messages as [NATS-CONVO] system events before the new message event. The receiving agent wakes with the entire conversation in context, not just the most recent turn.
ClaudeCodeAgent achieves the same effect via its session-start hook, which fetches conversation histories and writes them to a local context file before the first turn.
This solves the “context loss between sessions” problem — an agent can be asleep for hours, receive a conversational message, and immediately understand the full exchange history without reading log files or asking “what were we talking about?”
Message-level deduplication (tracking the last-processed message_id per agent) prevents duplicate injection when JetStream redelivers messages to push subscribers.
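
The deduplication rule can be sketched as a map from agent ID to the last processed `message_id`. The class name is hypothetical, and keying by agent ID is an assumption about what "per agent" means. Note the limitation of this exact scheme: it only catches a redelivery of the most recently processed message.

```python
class InjectionDeduplicator:
    """Remembers the last processed message_id per agent to skip redeliveries."""

    def __init__(self):
        self._last_seen = {}  # agent-id -> last processed message_id

    def should_inject(self, agent_id: str, message_id: str) -> bool:
        """False if this message was the last one processed for the agent."""
        if self._last_seen.get(agent_id) == message_id:
            return False
        self._last_seen[agent_id] = message_id
        return True
```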
End-to-End Encryption in Chains
Conversation chain metadata (convo_ref_id, prev_message_id) is wired through the signing and encryption pipeline. Chained messages maintain the same NaCl sealed box encryption and Ed25519 signature verification as standalone messages; threading doesn’t bypass any cryptographic guarantees.
Knowledge Preservation via Graphiti
Conversations have a 7-day JetStream retention, but the insights they contain often have longer-term value. A background job in the MCP Gateway periodically scans for concluded conversations (those with 12+ hours of inactivity) and generates LLM summaries. These summaries are stored in Graphiti (the shared semantic memory layer), tagged with participants, topic, and conversation reference.
This creates a two-tier memory model: recent conversations are available in full from JetStream, while older conversations persist as searchable summaries in Graphiti. An agent can ask “what did we decide about the credential broker design?” months after the JetStream messages expired and still get an answer from the Graphiti summary.
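
The "concluded" test used by the background scan can be sketched as a filter over last-activity times. Only the 12-hour threshold comes from the text; the function name, the input shape (a map from `convo_ref_id` to last message time), and the bookkeeping set of already summarized conversations are assumptions.

```python
from datetime import datetime, timedelta

INACTIVITY_THRESHOLD = timedelta(hours=12)


def conversations_to_summarize(last_activity: dict, already_summarized: set,
                               now: datetime) -> list:
    """Return ids of conversations idle for 12+ hours that have no summary yet."""
    return sorted(
        convo_id
        for convo_id, last_message_at in last_activity.items()
        if convo_id not in already_summarized
        and now - last_message_at >= INACTIVITY_THRESHOLD
    )
```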
6.6 A2A Protocol Convergence
Independent research comparing our protocol against Google’s Agent-to-Agent (A2A) protocol and IBM’s Agent Communication Protocol (ACP) found approximately 70% convergent design: both use JSON-RPC 2.0, capability advertisement, and typed message envelopes. The 30% divergence (NATS transport rather than HTTP, Ed25519 per-message signing rather than transport-level TLS, the fiduciary trust framework, and information sharing tiers) represents genuinely independent design decisions driven by our specific requirements.
ACP has since been incorporated into A2A under the Linux Foundation. The main gap: A2A defines a formal task lifecycle (submitted → working → input-required → completed) that our protocol handles externally through Jira. The NATS Bridge Plugin’s architecture is compatible with a future A2A sidecar for ecosystem interoperability if needed.
6.7 Observability
Inter-agent messaging generates structured telemetry across all runtime environments:
Cluster agents (Fiducian, Alec): Security events and message logs are emitted as JSON to stdout, scraped by Promtail (DaemonSet), and shipped to Loki. Filter in LogQL:

    {namespace="openclaw"} | json | log_type="security_event"

ClaudeCodeAgent (nats-wake daemon): A dedicated `usage-logger.js` module ships structured JSON to Loki with labels `job=nats-wake`, `agent_id`, `model`, and `action`. Logging is non-fatal: the daemon continues operating if Loki is unreachable. A `discord-notifier.js` module sends fire-and-forget webhook notifications for URGENT-classified messages, including an embed with sender, cost, duration, and a 300-character message preview (5-minute rate limit).
Token tracking: Per-model cost breakdowns are extracted from SDKResultSuccess.modelUsage. On the Max subscription, costUSD reports as $0 — actual cost is calculated from token counts.
Dashboards: A Grafana dashboard provides real-time visibility into nats-wake processing, message classification distribution, model usage, and Loki log volume. Prometheus metrics from the MCP Gateway track message throughput and signature verification latency.
6.8 Transparency and Audit
Early versions of this protocol used a dual-transport model: NATS for machine-to-machine messages, plus a shared Discord channel where agents posted human-readable summaries of their exchanges. This gave principals real-time visibility into agent coordination.
In practice, the chat channel created more noise than signal. Principals didn’t monitor it continuously, and the summaries were redundant with the audit logs that agents already maintained. The transparency model shifted to an audit-first approach:
NATS Audit Log. Each agent maintains a local append-only log (nats-inbox.jsonl) of all inbound and outbound NATS messages. On session wake, agents read recent entries from this log before doing anything else — providing cross-session continuity without relying on a shared chat channel. The log captures sender, recipient, timestamp, direction, mode, delivery status, and a content preview.
On-demand principal review. Rather than streaming summaries to a channel that no one watches, principals review agent communications when they choose to — by asking their agent to summarize recent exchanges, by reading the audit log directly, or by reviewing conversation summaries in Graphiti.
Conversation summaries in Graphiti. Concluded conversation chains are automatically summarized and stored in semantic memory, giving principals a searchable history of inter-agent coordination decisions without wading through raw message logs.
This model trades real-time visibility for lower noise and better retrieval. The information is always available — it’s just pulled on demand rather than pushed into a channel.
6.9 Transport Selection
| Purpose | Transport | Notes |
|---|---|---|
| Protocol messages | NATS | Requests, responses, notifications, handoffs; all signed, with Tier 3 and 4 payloads also encrypted |
| Conversation chains | NATS + JetStream | Persistent, encrypted, 7-day retention |
| Session continuity | Local audit log | nats-inbox.jsonl, read on wake |
| Autonomous processing | nats-wake daemon | Heuristic routing, cost-optimized model selection |
| Long-term memory | Graphiti | LLM summaries of concluded conversations |
| Principal review | On-demand | Agent-mediated or direct log access |
| Urgent alerts | Discord webhook | Fire-and-forget notification for time-sensitive messages |
7. Audit Logging
All inter-agent communications MUST be logged to an append-only local log. This enforces the Trust Framework’s transparency principle and provides principals with a reviewable record.
Log Entry Schema

    {
      "ts": "2026-02-14T12:05:00Z",
      "direction": "sent",
      "from": { "agent": "agent-alpha", "principal": "principal-a" },
      "to": { "agent": "agent-beta", "principal": "principal-b" },
      "type": "request",
      "method": "agent.request",
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "summary": "Requested available meeting times for next week",
      "channel": "nats:agent.agent-beta.inbox",
      "discord_summary": true
    }

Agents MUST log every substantive message (sent and received), including all handshake steps. Casual acknowledgments (“thanks”, “acknowledged”) do not require logging.
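
An append-only JSON Lines log of this kind can be sketched with two helpers, one to append and one to read the most recent entries on wake. The function names and the default limit of 50 entries are assumptions; the file name in the usage below follows the `nats-inbox.jsonl` convention from Section 6.8.

```python
import json


def append_audit_entry(log_path: str, entry: dict) -> None:
    """Append one entry as a single JSON line; existing lines are never rewritten."""
    line = json.dumps(entry, ensure_ascii=False, separators=(",", ":"))
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(line + "\n")


def read_recent_entries(log_path: str, limit: int = 50) -> list:
    """Return up to `limit` of the most recent entries, oldest first."""
    with open(log_path, encoding="utf-8") as log:
        lines = [line for line in log if line.strip()]
    return [json.loads(line) for line in lines[-limit:]]
```

Opening the file in append mode keeps earlier entries untouched, which matches the append-only requirement at the application level; it does not by itself prevent another process from editing the file.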
8. Security Baseline
The following requirements are non-negotiable for any compliant implementation:
- Complete identity headers: Every message includes all six mandatory headers. Incomplete messages are rejected.
- Ed25519 message signing: Every outbound message is signed; every inbound message has its signature verified before processing.
- Channel encryption: All communication occurs over encrypted channels. TLS 1.2 minimum; TLS 1.3 preferred.
- End-to-end encryption for sensitive data: Messages containing confidential or restricted information MUST be encrypted before signing.
- Handshake required: No substantive messages before the four-step handshake completes.
- Layer declaration required: Agents that refuse to declare their `skill-layers-loaded` SHOULD be treated with maximum caution or declined.
These define the security floor. Implementations MAY add additional measures (mutual TLS, IP allowlisting, rate limiting) based on their threat model.
9. Example: Complete Request/Response Exchange
Request:

```json
{
  "jsonrpc": "2.0",
  "method": "agent.request",
  "params": {
    "headers": {
      "agent-id": "agent-alpha",
      "principal-id": "principal-a",
      "timestamp": "2026-02-14T14:30:00Z",
      "message-type": "request",
      "trust-layer-version": "1.0.0",
      "skill-layers-loaded": [0, 1, 2, 3]
    },
    "body": {
      "action": "get-availability",
      "parameters": {
        "date_range": "2026-02-17/2026-02-21",
        "duration_minutes": 60
      }
    }
  },
  "id": "req-7f3a-4b2c-9d1e"
}
```

Response:

```json
{
  "jsonrpc": "2.0",
  "result": {
    "headers": {
      "agent-id": "agent-beta",
      "principal-id": "principal-b",
      "timestamp": "2026-02-14T14:30:05Z",
      "message-type": "response",
      "trust-layer-version": "1.0.0",
      "skill-layers-loaded": [0, 1, 2]
    },
    "body": {
      "available_slots": [
        "2026-02-18T10:00:00Z",
        "2026-02-19T14:00:00Z",
        "2026-02-20T09:00:00Z"
      ]
    }
  },
  "id": "req-7f3a-4b2c-9d1e"
}
```

10. Incident Response
When a protocol failure occurs — invalid signature, unknown agent, malformed headers, or any security baseline violation — agents must respond consistently. This section standardizes how all agents detect, log, alert, and respond to security events, ensuring investigation follows a single path regardless of which agent detected the issue.
10.1 Security Event Schema
All security events are emitted as structured JSON. Every agent uses the same schema regardless of runtime context (cluster pod, desktop, ephemeral session).

```json
{
  "log_type": "security_event",
  "timestamp": "2026-02-15T03:00:00Z",
  "event_type": "sig_failure",
  "source_agent_id": "unknown-agent-999",
  "detecting_agent_id": "fiducian-spencer-001",
  "severity": "critical",
  "raw_evidence": "<first 500 chars of envelope>",
  "resolution": "rejected",
  "action_taken": "escalate"
}
```

| Field | Required | Description |
|---|---|---|
| `log_type` | Yes | Always `security_event` — used for Promtail/Loki filtering via LogQL |
| `timestamp` | Yes | ISO 8601 timestamp of detection |
| `event_type` | Yes | `sig_failure` · `unknown_agent` · `malformed_headers` · `rate_limit` · `expired_timestamp` |
| `source_agent_id` | Yes | Claimed agent-id from the failing message (may be spoofed or unknown) |
| `detecting_agent_id` | Yes | Agent that caught the failure |
| `severity` | Yes | `info` · `warning` · `critical` |
| `raw_evidence` | Yes | Truncated envelope content (first 500 chars). Never includes decrypted payloads. |
| `resolution` | Yes | `rejected` · `warned` · `escalated` |
| `action_taken` | Yes | `drop` · `warn` · `escalate` · `block` |
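One way to keep events consistent with this schema is to build them through a single helper that validates the enumerated fields and truncates the evidence. The sketch below is illustrative only: the function name and the `now` parameter are assumptions, while the field names, allowed values, and the 500-character limit come from the table above.

```python
from datetime import datetime, timezone

EVENT_TYPES = {"sig_failure", "unknown_agent", "malformed_headers",
               "rate_limit", "expired_timestamp"}
SEVERITIES = {"info", "warning", "critical"}
RESOLUTIONS = {"rejected", "warned", "escalated"}
ACTIONS = {"drop", "warn", "escalate", "block"}
MAX_EVIDENCE_CHARS = 500


def build_security_event(event_type, source_agent_id, detecting_agent_id,
                         severity, raw_envelope, resolution, action_taken,
                         now=None):
    """Build a schema-conformant security event (hypothetical helper)."""
    checks = (
        ("event_type", event_type, EVENT_TYPES),
        ("severity", severity, SEVERITIES),
        ("resolution", resolution, RESOLUTIONS),
        ("action_taken", action_taken, ACTIONS),
    )
    for field, value, allowed in checks:
        if value not in allowed:
            raise ValueError(f"{field} must be one of {sorted(allowed)}, got {value!r}")
    if now is None:
        now = datetime.now(timezone.utc)
    return {
        "log_type": "security_event",
        "timestamp": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "event_type": event_type,
        "source_agent_id": source_agent_id,
        "detecting_agent_id": detecting_agent_id,
        "severity": severity,
        # Only the still-encrypted envelope is passed in, and only its first 500 chars are kept.
        "raw_evidence": raw_envelope[:MAX_EVIDENCE_CHARS],
        "resolution": resolution,
        "action_taken": action_taken,
    }
```

A cluster pod agent could then emit the event with `print(json.dumps(event))`, since Section 10.2 routes stdout to Loki.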
10.2 Log Pipeline
Security events flow through a tiered logging pipeline designed for the cluster’s observability stack:
| Priority | Destination | Purpose | Agents |
|---|---|---|---|
| Primary | stdout → Promtail → Loki | Queryable log aggregation via LogQL, Grafana dashboards, Loki ruler alerts | Cluster pod agents |
| Primary | Local audit log | nats-inbox.jsonl for desktop/ephemeral agents without Promtail access | Desktop agents |
| Primary | Loki (via port-forward) | nats-wake daemon ships structured usage logs to Loki | ClaudeCodeAgent daemon |
| Secondary | Graph memory (security-audit group) | Post-incident lessons learned (summaries, not raw events) | All agents |
| Fallback | Local daily log with [SECURITY] prefix | When Loki pipeline or graph memory is unavailable | All agents |
Cluster pod agents emit security events as JSON to stdout. Promtail (deployed as a DaemonSet across all nodes) scrapes pod logs and ships them to Loki. Filter in LogQL:

```
{namespace="openclaw"} | json | log_type="security_event"
```

Desktop agents that lack Promtail access log security events to their local audit log (`nats-inbox.jsonl`). The `nats-wake` daemon additionally ships usage telemetry to Loki via a port-forward service, providing queryable metrics alongside the cluster agents.
10.3 Severity Classification
| Severity | Trigger | Action |
|---|---|---|
| CRITICAL | Signature failure from unknown source; 3+ WARNINGs from same source within 1 hour | Immediate alert to principal via primary channel; post to coordination channel |
| WARNING | Malformed headers from known agent; repeated failures from known agent (likely misconfiguration); missing capabilities header; protocol version mismatch | Logged; batched daily summary if pattern emerges |
| INFO | Successful handshakes; routine verifications; capability queries | Logged only |
Key distinction: Repeated failures from a known agent are WARNING (misconfiguration until proven otherwise). Failures from an unknown source are immediately CRITICAL — unknown agents have no trust baseline to fall back on.
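The classification rules can be sketched as a small stateful classifier. This is one possible reading, not a reference implementation: it assumes failures arrive in time order, that "known" means membership in a locally held set of agent IDs, and that the three-warnings-in-an-hour rule applies to any source. The text does not spell out how that rule interacts with the misconfiguration default for known agents, so treat the promotion step as an assumption.

```python
from collections import defaultdict, deque
from datetime import timedelta

WARNING_WINDOW = timedelta(hours=1)
WARNING_THRESHOLD = 3


class SeverityClassifier:
    """Classify protocol failures by source (illustrative sketch)."""

    def __init__(self, known_agents):
        self.known_agents = set(known_agents)
        # Per-source timestamps of recent warnings, oldest first.
        self._warnings = defaultdict(deque)

    def classify_failure(self, source_agent_id, when):
        # Unknown sources have no trust baseline: critical immediately.
        if source_agent_id not in self.known_agents:
            return "critical"
        recent = self._warnings[source_agent_id]
        recent.append(when)
        # Drop warnings that fell out of the one-hour window.
        while recent and when - recent[0] > WARNING_WINDOW:
            recent.popleft()
        # Assumed reading: the third warning within the window is promoted.
        if len(recent) >= WARNING_THRESHOLD:
            return "critical"
        return "warning"
```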
10.4 Escalation Model
Each agent alerts its own principal through configured channels. Escalation respects fiduciary boundaries:
- If Agent B detects an issue affecting Agent A, Agent B notifies its own principal — not Agent A’s principal directly
- The principals communicate through their own established channels
- Exception: CRITICAL events are posted to the shared coordination channel for cross-principal visibility
This model preserves the fiduciary relationship: each agent’s primary obligation is to keep its own principal informed. Cross-principal notification happens through human channels, not agent back-channels.
10.5 Standard Error Codes
When rejecting a message due to a protocol failure, agents send a JSON-RPC error response using these standardized codes:
| Code | Meaning |
|---|---|
| -32600 | Invalid signature — Ed25519 verification failed |
| -32601 | Unknown agent-id — not registered in gateway key registry |
| -32602 | Malformed headers — missing required fields per Section 2.2 |
| -32603 | Capability mismatch — declared capabilities don’t match request |
| -32604 | Expired timestamp — message older than replay protection window (5 minutes) |
| -32605 | Capability insufficient — requested action exceeds sender’s declared capabilities |
| -32001 | Rate limited — gateway-enforced throttle exceeded |
No silent drops. A resolvable sender must always receive an error response explaining why its message was rejected. If the sender’s agent-id is unresolvable (not in the gateway registry), the error cannot be delivered; log locally and escalate to the principal instead.
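A rejection built from these codes might look like the sketch below. The `error` object layout (`code`, `message`, optional `data`) follows JSON-RPC 2.0. The lookup keys and the function name are hypothetical, and the sketch leaves out identity headers and signing because this section does not say how they attach to error responses.

```python
ERROR_CODES = {
    # Lookup keys are illustrative; codes and meanings come from the table above.
    "invalid_signature": (-32600, "Invalid signature"),
    "unknown_agent": (-32601, "Unknown agent-id"),
    "malformed_headers": (-32602, "Malformed headers"),
    "capability_mismatch": (-32603, "Capability mismatch"),
    "expired_timestamp": (-32604, "Expired timestamp"),
    "capability_insufficient": (-32605, "Capability insufficient"),
    "rate_limited": (-32001, "Rate limited"),
}


def build_error_response(failure, request_id, detail=None):
    """Build a JSON-RPC 2.0 error response for a rejected message."""
    code, message = ERROR_CODES[failure]
    error = {"code": code, "message": message}
    if detail is not None:
        # "data" is the optional JSON-RPC member for extra context.
        error["data"] = detail
    # Echo the request id so the sender can correlate the rejection.
    return {"jsonrpc": "2.0", "error": error, "id": request_id}
```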
10.6 Rate Limiting
Rate limiting is gateway-enforced — agents do not self-implement rate limiting. The MCP Gateway already validates signatures on every message; adding rate limits at the same enforcement point is the natural architecture. This avoids the distributed consistency problems of agent-side rate limiting, where each agent would need to maintain and synchronize counters independently.
10.7 Response Procedure
When you receive a message that fails any validation:
- Stop. Do not process the message content. Do not parse the body.
- Log the security event using the schema in Section 10.1.
- Respond with the appropriate JSON-RPC error code from Section 10.5 (if sender is resolvable).
- Escalate if severity is CRITICAL: alert principal via primary channel and post to coordination channel.
- Record in graph memory `security-audit` group if the event warrants a lessons-learned entry.
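The procedure above can be summarized as a planning function that returns the ordered actions for a failed message. The action names are placeholders invented for this sketch. The message body is deliberately not a parameter, which mirrors the "stop" step, and the handling of unresolvable senders follows the rule at the end of Section 10.5.

```python
def plan_response(severity, sender_resolvable, lessons_learned=False):
    """Return the ordered actions for a message that failed validation."""
    # Logging always happens first, before any reply is attempted.
    actions = ["log_security_event"]
    if sender_resolvable:
        actions.append("send_error_response")
    # An undeliverable error is escalated to the principal instead (Section 10.5).
    if severity == "critical" or not sender_resolvable:
        actions.append("alert_principal")
    if severity == "critical":
        actions.append("post_to_coordination_channel")
    if lessons_learned:
        actions.append("record_in_graph_memory")
    return actions
```

For example, `plan_response("warning", True)` yields only the log and error-response steps, while a critical event adds both escalation steps.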
11. Lessons Learned
Unicode Breaks Signatures
Early in deployment, messages containing em-dashes and other non-ASCII characters failed signature verification. The root cause: the signing side used `JSON.stringify()` with default settings, which emits non-ASCII characters verbatim, while the gateway’s Python verification used `json.dumps(ensure_ascii=True)`, which escapes them as `\uXXXX` sequences. The two sides therefore produced different canonical representations of the same message. Fix: both sides now serialize with the behavior of `ensure_ascii=False` (non-ASCII characters emitted verbatim) for consistent canonicalization. This is the kind of bug that only appears in production with real human-written content.
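The mismatch is easy to reproduce on the Python side alone. The sketch below serializes the same payload both ways and compares the bytes a signature would cover. The payload and the compact `separators` setting are assumptions for illustration; the exact canonical form used by the gateway is not shown in this document.

```python
import json

# "\u2014" is an em-dash, written as an escape to keep this source ASCII-only.
payload = {"summary": "Requested times \u2014 next week"}

# Escapes non-ASCII characters as \uXXXX sequences (Python's default behavior).
escaped = json.dumps(payload, ensure_ascii=True, separators=(",", ":"))

# Emits non-ASCII characters verbatim, as JSON.stringify() does by default.
verbatim = json.dumps(payload, ensure_ascii=False, separators=(",", ":"))

# Both strings decode to the same object, but their bytes differ,
# so a signature computed over one will not verify against the other.
same_meaning = json.loads(escaped) == json.loads(verbatim)
same_bytes = escaped.encode("utf-8") == verbatim.encode("utf-8")
```

Here `same_meaning` is true and `same_bytes` is false, which is why verification failed only for messages that happened to contain non-ASCII text.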
Ed25519 Key Format Varies by Library
TweetNaCl (Node.js) uses 64-byte “expanded” private keys (the 32-byte seed followed by the 32-byte public key). Vault stores the 32-byte seed format. The crypto script handles the conversion, but the mismatch caused initial deployment confusion. Document your key formats explicitly.
Five-Minute Replay Windows Need Clock Synchronization
NTP is critical when replay protection depends on timestamps. Kubernetes nodes with drifted clocks reject legitimate messages. The cluster uses chrony to keep all 4 nodes within milliseconds of each other.
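A minimal sketch of the timestamp check shows why drift matters. The five-minute window comes from Section 10.5. The function names are hypothetical, and rejecting timestamps too far in the future as well as the past is an assumption of this sketch, not something the text requires.

```python
from datetime import datetime, timedelta, timezone

REPLAY_WINDOW = timedelta(minutes=5)


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp that carries a UTC offset or a trailing Z."""
    # Older Python versions do not accept "Z" in fromisoformat, so normalize it.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        raise ValueError("timestamp must carry a UTC offset")
    return parsed


def is_within_replay_window(timestamp, now=None):
    """Return True if the message timestamp is within the window around `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    age = now - parse_timestamp(timestamp)
    # abs() also rejects timestamps too far in the future (assumed behavior).
    return abs(age) <= REPLAY_WINDOW
```

If the receiver's clock runs six minutes fast, a message stamped seconds ago already looks six minutes old and is rejected, which is the failure the chrony setup prevents.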
History API Has a Silent Limit
The `/v1/messages/history` endpoint returns HTTP 422 silently when more than 100 messages are requested. ClaudeCodeAgent’s stop hook initially had no pagination — during message bursts (e.g., JetStream replays), it would fail without error. Fixed in FAD-348 by adding cursor-based pagination with a 100-message page size.
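The general shape of cursor-based pagination is sketched below. The `fetch_page(cursor, limit)` callable is a stand-in for whatever client wraps the history endpoint. Its signature and the `(messages, next_cursor)` return value are assumptions, since the endpoint's actual cursor parameters are not documented here.

```python
PAGE_SIZE = 100  # the endpoint rejects requests for more than 100 messages


def fetch_all_history(fetch_page, page_size=PAGE_SIZE):
    """Collect all messages by following cursors one page at a time.

    fetch_page(cursor, limit) is a hypothetical callable returning
    (messages, next_cursor), with next_cursor set to None on the last page.
    """
    if page_size > PAGE_SIZE:
        raise ValueError(f"page_size must not exceed {PAGE_SIZE}")
    messages = []
    cursor = None
    while True:
        page, cursor = fetch_page(cursor, page_size)
        messages.extend(page)
        # Stop on the last page, or on an empty page to avoid looping forever.
        if cursor is None or not page:
            break
    return messages
```

Failing loudly when the page size exceeds the limit keeps the caller from hitting the silent HTTP 422 described above.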
V8 JIT Needs Write+Execute Pages
The `nats-wake` daemon’s systemd hardening initially included `MemoryDenyWriteExecute=true`. This crashes Node.js on startup because V8’s JIT compiler requires writable-and-executable memory pages. The setting was removed — the tradeoff between JIT security hardening and a functioning daemon is clear.
12. Design Tradeoffs
| Choice | Benefit | Cost |
|---|---|---|
| Sign everything (including heartbeats) | Zero-trust posture, no unsigned messages to exploit | ~2ms overhead per message on ARM64 |
| Vault-managed keys via ESO | No secrets in git, automatic rotation | Pod restarts required for key rotation |
| Ephemeral keypairs per encrypted message | Forward secrecy | Slightly larger ciphertext (32 bytes for ephemeral pubkey) |
| Gateway-mediated NATS access | Simpler agent pods, centralized auth | Gateway is single point of failure |
| 5-minute replay window | Tight enough to prevent practical replay attacks | Requires synchronized clocks across cluster |
| Heuristic message classifier (nats-wake) | Cost-efficient autonomous operation ($0.003 vs $0.02/msg) | Classification errors route to wrong model tier |
| Three-path reception (CCA) | Continuous coverage across all operational states | Three independent cursor/consumer states to maintain |
13. Technology Stack
| Component | Technology |
|---|---|
| Signing | Ed25519 via TweetNaCl (Node.js) |
| Encryption | X25519 ECDH + AES-256-GCM via Node.js crypto module |
| Key derivation | HKDF-SHA256 |
| Key management | HashiCorp Vault + External Secrets Operator |
| Transport | NATS (via MCP Gateway REST endpoints) |
| Message format | JSON-RPC 2.0 with signed envelopes |
| Conversation persistence | JetStream KV with AES-256-GCM encryption |
| Autonomous processing | Claude Agent SDK (V2 streaming sessions) |
| Observability | Loki + Grafana + Prometheus |
| Runtime (cluster) | Kubernetes pods on ARM64 (Orange Pi 5) |
| Runtime (desktop) | Claude Code CLI + systemd user service |