Inter-Agent Communication Protocol

The Problem

When you have multiple AI agents operating autonomously, each serving a different principal and each with its own workspace and context, they need to talk to each other. Chat alone is not enough: they need to exchange structured requests, share information within defined boundaries, and maintain audit trails of every interaction.

The naive approaches don’t work:

Shared Discord channel — agents posting messages for each other to read. No guaranteed delivery, no threading, no encryption, and humans have to scroll past machine-to-machine protocol chatter. The #agent-coordination channel became a dumping ground of JSON-RPC messages that nobody wanted to read.
Direct API calls — agents calling each other’s endpoints. Tight coupling, no offline support, and no audit trail unless you build one yourself.
Shared database — agents reading/writing to common tables. Violates workspace isolation, creates concurrency issues, and makes information boundaries nearly impossible to enforce.

What was needed: asynchronous, encrypted, auditable messaging with identity verification and conversation threading.

Architecture: NATS + JetStream

The communication layer is built on NATS with JetStream for persistent message delivery. NATS provides the pub/sub transport; JetStream adds durability, replay, and consumer management.

Why NATS

Lightweight — single binary, runs on ARM (fits the home K8s cluster’s Orange Pi 5 nodes)
JetStream persistence — messages survive agent restarts, pod evictions, and network partitions
Subject-based routing — agents.fiducian-spencer-001.inbox delivers to exactly one agent
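At-least-once delivery: JetStream consumers with explicit ack guard against silent message loss inside the stream (waking the target agent's session is a separate problem, covered under NATS Delivery Improvements)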
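
As a small illustration of subject-based routing, the sketch below builds the per-agent inbox subject in the agents.<agent-id>.inbox form shown above. The helper name inbox_subject is hypothetical. The validation reflects the general NATS rule that a subject token may not contain dots, wildcards, or whitespace.

```python
def inbox_subject(agent_id: str) -> str:
    """Build the per-agent inbox subject (hypothetical helper)."""
    # '.', '*', '>' and whitespace are not valid inside a single NATS subject token.
    forbidden = set(".*> \t\r\n")
    if not agent_id or any(ch in forbidden for ch in agent_id):
        raise ValueError(f"invalid agent id for a subject token: {agent_id!r}")
    return f"agents.{agent_id}.inbox"


print(inbox_subject("fiducian-spencer-001"))
```

Rejecting wildcard characters up front keeps a malformed agent ID from turning a one-agent subject into one that matches several inboxes.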

Message Format: JSON-RPC 2.0

Every substantive inter-agent message uses JSON-RPC 2.0 with six mandatory identity headers:

{
  "jsonrpc": "2.0",
  "method": "agent.request",
  "params": {
    "headers": {
      "agent-id": "fiducian-spencer-001",
      "principal-id": "spencer",
      "timestamp": "2026-03-15T14:30:00Z",
      "message-type": "request",
      "trust-layer-version": "1.0.0",
      "skill-layers-loaded": [0, 1, 2, 3]
    },
    "body": {
      "topic": "sprint-planning",
      "content": "Requesting your contribution items for the week of March 17..."
    }
  },
  "id": "msg_a1b2c3d4"
}

The skill-layers-loaded header is critical: it declares what behavioral commitments the sender has made. An agent advertising [0, 1, 2, 3] has committed to the full fiduciary stack (trust, protocol, fiduciary core, coordination). An agent advertising only [1] follows the communication protocol but hasn’t made fiduciary commitments. Receiving agents calibrate their information sharing accordingly.
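
A receiving agent might check the six headers and calibrate sharing roughly as sketched below. The header names come from the example message. The function names and the three sharing levels ("full", "protocol-only", "none") are assumptions made for illustration, not the system's actual policy.

```python
REQUIRED_HEADERS = (
    "agent-id",
    "principal-id",
    "timestamp",
    "message-type",
    "trust-layer-version",
    "skill-layers-loaded",
)
# Layers 0-3: trust, protocol, fiduciary core, coordination.
FULL_FIDUCIARY_STACK = {0, 1, 2, 3}


def missing_headers(message: dict) -> list:
    """Return the mandatory header names absent from a JSON-RPC message."""
    headers = message.get("params", {}).get("headers", {})
    return [name for name in REQUIRED_HEADERS if name not in headers]


def sharing_level(message: dict) -> str:
    """Map the sender's declared layers to a hypothetical sharing level."""
    layers = set(message["params"]["headers"]["skill-layers-loaded"])
    if FULL_FIDUCIARY_STACK <= layers:
        return "full"           # sender committed to the whole stack
    if 1 in layers:
        return "protocol-only"  # speaks the protocol, no fiduciary commitments
    return "none"
```

A message that fails missing_headers would be rejected before sharing_level is consulted at all.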

Cryptographic Identity

Every message is signed with Ed25519. The signing layer provides:

Sender verification — detached signatures over canonical JSON, verified against the gateway key registry
End-to-end encryption — sealed boxes (X25519 ECDH + AES-256-GCM) for sensitive payloads. The transport can verify who sent a message without seeing what it contains.
Replay protection — timestamp-based freshness checks reject stale messages
AgentSig authentication — gateway endpoints require signed auth headers for message retrieval

The crypto is enforced at the gateway level: unsigned messages are rejected, and the gateway verifies signatures before delivery. This isn’t optional security — it’s structural.
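
Two of the pieces above can be sketched with the standard library alone: producing canonical JSON bytes to sign, and the timestamp freshness check. The Ed25519 signing step is left out because it needs a third-party library. The canonicalization shown (sorted keys, compact separators) is one simple scheme and may differ from what the gateway uses, and the five-minute window and the names canonical_bytes, is_fresh, and MAX_AGE are assumptions.

```python
import json
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=5)  # assumed freshness window


def canonical_bytes(message: dict) -> bytes:
    """Serialize deterministically so signer and verifier hash identical bytes."""
    text = json.dumps(message, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def is_fresh(timestamp: str, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """Accept only timestamps within max_age of now (past or future)."""
    # Older Python versions do not parse a trailing 'Z', so normalize it first.
    sent = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return abs(now - sent) <= max_age


now = datetime.now(timezone.utc)
print(canonical_bytes({"id": "msg_a1b2c3d4", "jsonrpc": "2.0"}))
print(is_fresh(now.isoformat(), now))
```

Checking the window in both directions also rejects messages stamped far in the future, which would otherwise stay replayable until the clock caught up.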

Conversation Threading

Early NATS messaging was fire-and-forget: individual messages with no threading, no correlation, no way to track a multi-turn exchange between agents. When Agent A asked Agent B a question and Agent B replied hours later, there was no structural connection between the request and response.

Conversation threading (FAD-280 through FAD-284) added a dedicated JetStream stream (CONVERSATIONS) with three linking fields:

convo_ref_id — groups all messages in a conversation (e.g., convo_fad454_solicitation_pipeline)
chain_message_id — links each message to the one it replies to, forming a causal chain
in_reply_to — correlation ID back to the original request

Conversations are encrypted at rest with AES-256-GCM. The gateway provides conversation history endpoints that return threaded message chains rather than flat inbox lists.
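
To show how the linking fields could turn a flat inbox list into a threaded chain, here is a minimal sketch. It assumes each message is a dict with an "id", that the first message in a conversation has chain_message_id set to None, and that the conversation is linear (at most one reply per message). The function name thread_chain is hypothetical.

```python
def thread_chain(messages: list, convo_ref_id: str) -> list:
    """Order one conversation's messages from first to last via chain_message_id."""
    in_convo = [m for m in messages if m.get("convo_ref_id") == convo_ref_id]
    # Index each message by the id of the message it replies to.
    by_parent = {m.get("chain_message_id"): m for m in in_convo}

    chain = []
    current = by_parent.get(None)  # the root replies to nothing
    # The length guard stops the walk if malformed data ever forms a loop.
    while current is not None and len(chain) < len(in_convo):
        chain.append(current)
        current = by_parent.get(current["id"])
    return chain


inbox = [
    {"id": "m2", "convo_ref_id": "convo_demo", "chain_message_id": "m1"},
    {"id": "m1", "convo_ref_id": "convo_demo", "chain_message_id": None},
]
print([m["id"] for m in thread_chain(inbox, "convo_demo")])
```

A conversation that branches would need a tree rather than a list; this sketch keeps only one reply per parent.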

Operational reality: Push delivery for conversations is still unreliable — messages occasionally arrive at the gateway (HTTP 200) but fail to wake the target agent’s session. The workaround is polling: agents query their conversation history via the gateway’s JetStream consumer rather than relying on push notifications. This is ugly but reliable.

NATS Delivery Improvements

The initial NATS bridge had a persistent delivery problem: messages would be accepted by the gateway but silently fail to reach the target agent. Seven messages from two different agents were dropped in a single day (March 11) despite the bridge reporting successful delivery.

The root cause investigation (FAD-399 through FAD-403) revealed multiple issues:

Stale pull consumers — JetStream consumers that lost their connection but weren’t cleaned up, creating “black holes” that accepted messages but never delivered them
Missing wake integration — the bridge delivered messages to the agent’s inbox but didn’t trigger a session wake, so messages sat until the next heartbeat
No delivery confirmation — the bridge returned success based on JetStream ack, not actual agent delivery

The fix (FAD-403, NATS bridge plugin v1.4.0) added structured logging via stdout, wake status in HTTP responses, and a reliable delivery pattern. But push delivery remains imperfect — the pragmatic approach is to treat NATS as a durable mailbox and poll for important messages rather than relying on instant push.

Automated Agent Coordination

Three Lobster pipelines automate what used to be manual inter-agent coordination:

NATS Request-Response Pipeline

The nats-request-response pipeline handles batch inter-agent messaging with polling. It:

Sends structured requests to multiple agents via NATS
Polls JetStream conversation history at configurable intervals
Collects responses with timeout handling (agents that don’t respond within the window are noted, not blocked on)
Returns collected responses as structured data for the next pipeline step
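
The collection loop described by these steps might look roughly like the sketch below. The fetch_replies callable stands in for whatever queries JetStream conversation history, and the timeout and interval defaults are assumptions, as is the function name collect_responses. The sleep and clock parameters are injectable only so the loop can be exercised without real waiting.

```python
import time


def collect_responses(agent_ids, fetch_replies, timeout_s=300.0, interval_s=15.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until every agent has replied or the window closes.

    fetch_replies(pending) is a hypothetical callable returning
    {agent_id: reply} for any pending agents that have answered.
    Returns (responses, non_responders).
    """
    pending = set(agent_ids)
    responses = {}
    deadline = clock() + timeout_s
    while pending:
        for agent_id, reply in fetch_replies(frozenset(pending)).items():
            if agent_id in pending:
                responses[agent_id] = reply
                pending.discard(agent_id)
        if not pending or clock() >= deadline:
            break
        sleep(interval_s)
    # Silent agents are reported to the caller rather than blocking the pipeline.
    return responses, sorted(pending)
```

Returning the non-responders alongside the responses lets the next pipeline step decide whether a partial result is good enough.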

This replaced manual “send NATS message, wait, check inbox, hope it arrived” flows.

Inter-Agent Solicitation Pipeline

The agent-solicitation pipeline (FAD-454) automates sprint planning input collection:

Sends solicitation requests to each agent via NATS, asking for sprint contributions
Each agent responds with their proposed work items, blockers, and capacity
Pipeline polls for responses using the request-response sub-pipeline
An LLM merge step synthesizes all agent responses into a unified sprint plan
Output is posted to Confluence as a draft for principal review

The solicitation pipeline turned a 30+ tool-call manual process (send messages, wait, poll, read, format, merge, post) into a single pipeline invocation.

Send-Wait-Poll Verification

The nats-send-wait-poll pattern (FAD-484) addresses the push delivery unreliability. Instead of sending a message and hoping it arrives:

Send the message via NATS
Wait a configurable interval (default: 30 seconds)
Poll the conversation history to verify the message appears
If missing, retry with backoff
After max retries, flag as delivery failure
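
The five steps above can be sketched as a small loop. The send and appears_in_history callables are hypothetical stand-ins for the NATS publish and the conversation-history query. The 30-second wait matches the stated default, while the retry count and the doubling backoff are assumptions.

```python
import time


def send_wait_poll(message, send, appears_in_history,
                   wait_s=30.0, max_retries=3, backoff=2.0, sleep=time.sleep):
    """Send, wait, then verify via history; retry with backoff.

    Returns True once the message id is seen in conversation history,
    or False after max_retries resends (a delivery failure to flag).
    """
    delay = wait_s
    for _attempt in range(max_retries + 1):
        # Resends reuse the same message id so a receiver can deduplicate.
        send(message)
        sleep(delay)
        if appears_in_history(message["id"]):
            return True
        delay *= backoff
    return False
```

Because a retry may deliver a message that had in fact already arrived, this pattern relies on receivers tolerating duplicates, which is the usual consequence of at-least-once delivery.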

This is a pragmatic workaround for the underlying push delivery issue — it doesn’t fix the root cause, but it ensures important messages don’t silently disappear.

Dual-Mode Transparency

Inter-agent communication runs on two channels simultaneously:

NATS — the machine-to-machine channel. Structured, encrypted, auditable. This is where the real coordination happens.
Discord #agent-coordination — the human-visibility channel. Agents post summaries of NATS exchanges here so principals can see what their agents are doing without parsing JSON-RPC messages.

The Discord channel is write-only from the agents’ perspective: they post to it but don’t monitor it for incoming messages. It’s a transparency window, not a communication channel. This separation ensures that human-readable summaries don’t get mixed up with machine-readable protocol messages.
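
A summary for the human-visibility channel could be derived from the headers alone, as in this sketch. The wording of the summary line and the function name summarize_for_humans are assumptions. Only the sender, principal, message type, and topic are used, so the message content itself is never echoed.

```python
def summarize_for_humans(message: dict) -> str:
    """Render a one-line, human-readable summary of a JSON-RPC agent message."""
    headers = message["params"]["headers"]
    topic = message["params"]["body"]["topic"]
    return (
        f'{headers["agent-id"]} (for {headers["principal-id"]}) '
        f'sent a {headers["message-type"]} about "{topic}"'
    )
```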

What This Demonstrates

Asynchronous coordination is harder than synchronous. When agents can’t guarantee the other party is online, every interaction needs durability, threading, and timeout handling. The progression from fire-and-forget messages to conversation-threaded exchanges with polling-based verification reflects the real complexity of distributed agent communication.

Crypto must be structural, not optional. Making signing optional means it gets skipped when it’s inconvenient. Making it mandatory at the gateway level means every message is verified, every time, regardless of what the sending agent “intended” to do.

Pragmatism over purity. Push delivery should work. It doesn’t, at least not reliably. Instead of waiting for a perfect fix, the system uses polling with verification: ugly, more expensive in API calls, but reliable. Production systems need solutions that work today, not architectures that will work someday.