
Observability Stack

Why Observability Matters for an Agent Platform


When AI agents run autonomously — making API calls, exchanging signed messages, executing tool invocations, processing financial data — you need to know what they’re doing. Not just “is the pod running?” but “what did this agent say to that agent at 3 AM, and why did it fail?”

The cluster runs 87+ pods across 4 ARM64 nodes. Three AI agents serve two principals, coordinating through NATS messaging with Ed25519-signed envelopes. MCP tool servers process web searches, financial queries, and graph memory operations. Without centralized logging, debugging a failed inter-agent exchange means SSH-ing into nodes and grepping pod logs by hand — assuming the pod hasn’t been rescheduled since the event.

The observability stack provides three capabilities: log aggregation (all pod logs in one place, queryable), structured event filtering (security events, protocol failures, tool errors), and visualization (Grafana dashboards for operational awareness).

Observability Stack Architecture

The stack runs entirely on-cluster — no cloud log services, no SaaS vendors, no data leaving the network.

Promtail runs as a DaemonSet — one instance per node, automatically scheduled by Kubernetes. Each Promtail pod mounts the node’s /var/log/pods directory and tails every container log file.

Key configuration decisions:

  • Pipeline stages parse JSON-structured logs (agent security events, MCP Gateway requests) and extract labels for efficient LogQL filtering
  • Pod labels (app, namespace, component) are automatically attached as Loki labels, so you can filter by {namespace="openclaw"} or {app="mcp-gateway"} without any application-side instrumentation
  • Resource limits are tuned for ARM64 SBCs: 128Mi memory request, 256Mi limit. Promtail’s memory footprint scales with the number of active log streams, not log volume — important on 16GB nodes running 20+ pods each
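These decisions translate into Helm values roughly like the following. This is a sketch, not the cluster's actual configuration: the field paths assume the grafana/promtail chart's `config.snippets` layout, and the extracted fields (`log_type`, `level`) are examples of the structured fields described above.

```yaml
# Illustrative values.yaml excerpt for the grafana/promtail Helm chart.
# Field paths follow that chart's conventions and may differ across versions.
resources:
  requests:
    memory: 128Mi
  limits:
    memory: 256Mi          # hard cap tuned for 16GB ARM64 nodes
config:
  snippets:
    pipelineStages:
      - cri: {}            # parse CRI-format container log lines
      - json:              # lift structured fields out of JSON log bodies
          expressions:
            log_type: log_type
            level: level
      - labels:            # promote low-cardinality fields to Loki labels
          log_type:
          level:
```

Only low-cardinality fields should become labels this way; high-cardinality values (request IDs, timestamps) would explode the number of Loki streams.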

Promtail stalled for 6 weeks after initial deployment: pods were silently OOMKilled because the default resource limits from the Helm chart assumed x86 servers with 64GB+ RAM. The fix was twofold: explicit resource limits tuned for the hardware, plus clearing stale position files that had accumulated during the outage period.

Loki is the core of the stack — it indexes and stores logs, and serves LogQL queries. Deployed as a StatefulSet with Longhorn-replicated persistent storage.

The architecture choice here is deliberate: Loki in single-binary mode rather than microservices mode. The microservices deployment (separate ingester, distributor, querier, compactor pods) is designed for clusters processing terabytes of logs per day. This cluster generates maybe 50MB/day. Single-binary mode runs all components in one pod, which means:

  • One pod to monitor instead of five
  • One PVC instead of five
  • Simpler debugging (one log stream, one process)
  • Fits the resource envelope of an ARM64 SBC

Loki v3.5.7 runs with 256Mi memory request, 512Mi limit, and a 2Gi Longhorn PVC. Retention is set to 30 days — enough for incident investigation without consuming storage that other workloads need.
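A sketch of the corresponding Helm values, assuming the grafana/loki chart's SingleBinary deployment mode; exact keys vary by chart version, and the compactor wiring shown is the usual mechanism for enforcing retention rather than a copy of this cluster's config.

```yaml
# Illustrative excerpt from grafana/loki Helm chart values.
deploymentMode: SingleBinary
loki:
  auth_enabled: false
  limits_config:
    retention_period: 720h   # 30 days
  compactor:
    retention_enabled: true  # the compactor enforces the retention window
singleBinary:
  replicas: 1
  resources:
    requests:
      memory: 256Mi
    limits:
      memory: 512Mi
  persistence:
    size: 2Gi
    storageClass: longhorn
```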

LogQL is Loki’s query language — similar to PromQL but for logs. It’s what makes the difference between “I have logs” and “I can find things in logs.”

Examples from actual operational use:

Find all security events across the platform:

{namespace="openclaw"} | json | log_type="security_event"

Agent-to-agent message failures in the last hour:

{app="mcp-gateway"} | json | level="error" | line_format "{{.message}}" |= "signature"

MCP tool invocations with latency above 5 seconds:

{namespace="mcp-tools"} | json | duration > 5s

All logs from a specific agent pod during an incident window (the window itself is set via the query's start/end parameters, i.e. the Grafana time picker or the query API, since LogQL expressions select streams and filter lines rather than time ranges):

{namespace="openclaw", pod=~"openclaw-.*"} | json

The structured JSON logging pattern established in the inter-agent communications protocol spec pays dividends here. Because security events use a consistent schema (log_type, event_type, severity), LogQL can filter them precisely without regex parsing. This was a deliberate design choice — the incident response section of the protocol was written knowing that Loki would be the query backend.

Grafana connects to Loki as a data source and provides dashboards for operational monitoring. Deployed as a standard Kubernetes Deployment with Longhorn storage for dashboard persistence.

Current dashboards include:

  • Agent Activity — message volume per agent over time, error rates, NATS message latency
  • Security Overview — security events by severity, unknown agent attempts, signature failures
  • MCP Gateway — tool invocation counts by server, error rates, P95 latency per tool, with a live logs panel for real-time debugging and an average latency gauge. The gateway exposes /metrics with health gauges (backend connectivity, tool discovery status, message queue depth) for Prometheus scraping via a ServiceMonitor. A companion PrometheusRule defines alert thresholds — gateway unreachable, backend error rate spikes, tool discovery failures — routed through AlertManager to Discord #cluster-alerts. This provides both log-based and metrics-based observability in a single dashboard.
  • Graphiti Knowledge Graph — episode processing success/failure rates, LLM backend health, group-level isolation metrics. Prometheus alerts fire on sustained episode failures (graphiti_episode_failures_total) with group_id and error_type labels for targeted diagnosis — a failure in the shared group (cross-agent state) has different urgency than a failure in a single agent’s private group.
  • Cluster Health — pod restart counts, OOMKill events, node resource utilization
  • Logs Explorer — live log tailing with namespace/pod/container filters, enabling real-time debugging during incidents without SSH access to nodes
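The metrics half of the MCP Gateway setup described above can be sketched as two Prometheus Operator manifests. The names, ports, and label selectors here are illustrative assumptions, not taken from the actual cluster:

```yaml
# Hypothetical ServiceMonitor + PrometheusRule for the MCP Gateway metrics path.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mcp-gateway
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: mcp-gateway
  namespaceSelector:
    matchNames: [mcp-tools]
  endpoints:
    - port: http
      path: /metrics      # health gauges: backend connectivity, queue depth
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mcp-gateway-alerts
  namespace: monitoring
spec:
  groups:
    - name: mcp-gateway
      rules:
        - alert: MCPGatewayDown
          expr: up{job="mcp-gateway"} == 0
          for: 5m
          labels:
            severity: critical
```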

The MCP Gateway dashboard deserves special mention. It combines Prometheus metrics (request counts, latency histograms, error rates by tool server) with Loki log queries (request payloads, error messages, authentication failures) in a single view. This dual-source approach — metrics for trends, logs for details — provides both the “is something wrong?” signal and the “what exactly went wrong?” context in one place.

Grafana also supports Loki ruler alerts — alerting rules defined in LogQL that fire when conditions are met.

The alerting pipeline uses Loki’s ruler component to evaluate LogQL expressions on a schedule and fire alerts through Grafana’s unified alerting system. Five production alert rules are currently active:

CrashLoopBackOff Detection: Watches for pods entering CrashLoopBackOff state by matching container restart patterns in kubelet logs. This catches the most common failure mode on the cluster — a misconfigured workload that starts, crashes, and restarts in a loop, consuming resources without doing useful work. The alert fires after 3 restarts within 10 minutes.
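A Loki ruler rule implementing this could look roughly like the following. The stream selector is an assumption (it depends on how kubelet logs are labeled in this cluster), as is the exact restart message matched:

```yaml
# Sketch of a Loki ruler rule file (Prometheus-compatible format).
groups:
  - name: pod-health
    rules:
      - alert: PodCrashLooping
        expr: |
          sum by (namespace, pod) (
            count_over_time(
              {job="kubelet"} |= "Back-off restarting failed container" [10m]
            )
          ) > 3
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.pod }} is restarting repeatedly in {{ $labels.namespace }}"
```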

Node Health Monitoring: Tracks node-level conditions — NotReady state, disk pressure, memory pressure, PID pressure. On a 4-node cluster where every node matters, a single unhealthy node can cascade into scheduling failures across the platform. The alert fires immediately on any non-Ready condition.

OOM Kill Detection: Monitors kernel OOM killer events via container runtime logs. OOM kills are especially pernicious on ARM64 SBCs with 16GB per node — they happen silently, the pod restarts, and unless you’re watching at that exact moment, the only evidence is a counter increment in kube_pod_container_status_last_terminated_reason. The LogQL rule catches them in real-time and includes the container name, namespace, and memory limit in the alert payload.

Longhorn Capacity Warnings: Monitors Longhorn volume utilization by querying Longhorn manager logs for capacity events. With distributed storage across 4 nodes, a single node filling its disk can cause replica scheduling failures across the cluster. The alert fires at 80% volume utilization, giving enough runway to expand volumes or clean up before hitting hard limits.

Graphiti Episode Failure Alerts: Monitors the knowledge graph’s episode processing pipeline via Prometheus metrics. Episodes are the atomic unit of knowledge extraction — when an agent stores a fact, decision, or observation in Graphiti, it’s processed as an episode. Failures here mean knowledge is being silently dropped. The alert uses two label dimensions: group_id (which agent or shared context is affected) and error_type (LLM extraction failure, graph write failure, validation failure). This granularity matters because the system migrated from Groq’s hosted LLM (llama-3.3-70b, ~30 RPM limit, 67% episode failure rate) to Foundry AIP (~760 RPM, near-zero failure rate) — the alert would have caught the Groq-era failures much earlier had it existed then.

All five alert rules route through AlertManager, which delivers to Discord’s #cluster-alerts channel via webhook. AlertManager handles deduplication, grouping, and silencing, preventing alert storms during cascading failures. When a node goes NotReady, the system sends one grouped notification covering all affected pods rather than individual alerts for each pod reschedule.
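The grouping behavior described above is configured in AlertManager's routing tree. A minimal sketch, with the grouping keys and webhook URL as assumptions (Alertmanager ≥ 0.25 supports Discord natively via discord_configs; older versions need a webhook bridge):

```yaml
# Illustrative AlertManager config fragment.
route:
  receiver: discord-cluster-alerts
  group_by: [alertname, node]   # one grouped notification per failure, not per pod
  group_wait: 30s               # wait briefly to batch related alerts
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: discord-cluster-alerts
    discord_configs:
      - webhook_url: https://discord.com/api/webhooks/...   # placeholder
```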

The alerts include structured metadata (namespace, pod, node) for rapid triage — when an alert fires at 3 AM, the notification contains enough context to assess severity without opening a dashboard.

The Loki alerting pipeline catches problems that produce log events — CrashLoopBackOff, OOM kills, disk pressure. But what about problems that produce silence? If Promtail stops shipping logs, there’s no log event to alert on. If the MCP Gateway stops responding, no error appears in the gateway’s own logs — it just stops logging entirely.

This is the gap that proactive health monitoring fills. It answers the question: “Are these services actually working right now?” rather than “Did these services report a problem?”

The Promtail outage (detailed in Lessons Learned) made the case viscerally. Promtail was OOMKilled, restarted into a crash loop, and eventually sat in CrashLoopBackOff — but since Promtail was the log collection layer, there were no logs about its own failure. The monitoring system had a blind spot exactly where it mattered most: monitoring itself.

Six weeks of logs were permanently lost. Not because the cluster was unhealthy, but because the observability of the observability stack had no proactive checks. The Loki alerting pipeline was technically working — it just had nothing to evaluate because Promtail wasn’t feeding it data.

This is a general problem: reactive alerting (pattern-match on log lines) fails when the failure mode is absence rather than presence. Proactive monitoring — actively pinging services and checking for expected responses — covers that gap.

The fiduciary agent runs scheduled health checks via its cron system, pinging critical infrastructure services on a recurring schedule. Each check is a lightweight probe — an HTTP request or TCP connection that validates the service is not just running, but responding correctly.

Services monitored:

| Service | Check Method | Expected Response | Frequency |
| --- | --- | --- | --- |
| MCP Gateway | GET /healthz | HTTP 200 | Every 15 min |
| Home Assistant | GET /api/ | HTTP 200 + {"message": "API running."} | Every 15 min |
| NATS | TCP connect to client port | Connection accepted | Every 15 min |
| NATS Wake Events | Heartbeat log verification | Recent log entries present | Every 60 min |
| Loki | GET /ready | HTTP 200 + ready | Every 15 min |

Each check runs as an isolated cron job — a fresh agent session that executes the probe, evaluates the result, and routes alerts if the check fails. Isolation matters: a hung main session doesn’t affect health check execution.

The NATS wake event check deserves a note: early implementations fired false positive “NATS down” alerts because the check looked for recent NATS activity but didn’t account for quiet periods where no inter-agent messages were exchanged. The fix was adding periodic heartbeat logging — a lightweight log entry emitted on each agent heartbeat that proves the NATS connection is alive even when no substantive messages are flowing. This eliminated false positives without adding traffic to the NATS bus.

Why agent-level, not Kubernetes-level? Kubernetes liveness and readiness probes check whether a pod is healthy from the kubelet’s perspective. Agent-level health checks validate that the service is healthy from a consumer’s perspective — the same path that actual agent tool invocations traverse. A pod can pass its readiness probe while its application layer is silently broken (misconfigured auth, TLS issues, upstream dependency failures). The agent health check catches these because it exercises the real request path.
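For contrast, a standard kubelet-side probe looks like the following (illustrative port and thresholds). The kubelet checks this from inside the cluster; the agent-level check instead traverses the same auth, TLS, and routing path a real tool invocation uses:

```yaml
# A conventional Kubernetes readiness probe, for comparison.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3   # pod marked NotReady after 3 consecutive failures
```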

When a health check fails, the agent routes alerts through two channels based on severity and time of day:

  • Discord #alerts — All failures, immediately. This provides a persistent log of health check results alongside the Loki-driven alerts, creating a unified alert history in one channel.
  • Signal — Critical failures during active hours. Reserved for services whose failure impacts agent operations in real-time (MCP Gateway down means all tool invocations fail; NATS down means inter-agent messaging stops).

The routing logic applies quiet hours to avoid overnight noise for non-critical degradation. Discord alerts fire 24/7 since they’re asynchronous.

The proactive health checks and Loki alerting pipeline are complementary, not redundant:

| Failure Mode | Loki Alerts | Proactive Checks |
| --- | --- | --- |
| Pod CrashLoopBackOff | ✅ Detects via kubelet logs | ❌ Not its job |
| OOM kills | ✅ Detects via runtime logs | ❌ Not its job |
| Service silently unresponsive | ❌ No logs to match | ✅ Detects via ping failure |
| Log pipeline itself broken | ❌ Can’t detect own failure | ✅ Loki readiness check catches it |
| Disk pressure | ✅ Detects via node conditions | ❌ Not its job |
| Auth/TLS misconfiguration | ❌ May not produce error logs | ✅ Real request path fails |

Together, they provide defense in depth: Loki alerts handle the “something went wrong and reported it” case, while proactive checks handle the “something went wrong and went silent” case. The Promtail outage would have been caught in 15 minutes instead of 6 weeks.

Running the observability stack on ARM64 SBCs introduces constraints that don’t exist on x86 servers:

Memory is the primary constraint. Each node has 16GB shared across 20+ pods. Loki’s default configuration assumes it can allocate multiple gigabytes for ingestion buffers and query caches. On ARM64, these must be explicitly limited:

```yaml
limits_config:
  ingestion_burst_size_mb: 6
  ingestion_rate_mb: 4
  max_query_series: 500
  max_entries_limit_per_query: 5000
```

CPU is surprisingly not an issue. The RK3588 SoC in the Orange Pi 5 has 4 Cortex-A76 performance cores and 4 Cortex-A55 efficiency cores. Log parsing and LogQL queries are I/O-bound, not CPU-bound — Promtail spends most of its time waiting for filesystem reads, and Loki spends most of its time waiting for storage I/O.

Container images must be multi-arch. Grafana Labs publishes official ARM64 images for Loki, Promtail, and Grafana. This wasn’t always the case — early Loki releases were x86-only. Using Helm charts with explicit image architecture overrides prevents accidental x86 image pulls that would fail with exec format error.
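In Helm values, an explicit image override might look like this sketch (repository and tag shown are examples of the pattern, not the cluster's pinned values); modern official images are multi-arch manifests, so the container runtime selects the arm64 variant automatically once the reference is correct:

```yaml
# Illustrative Helm image override.
image:
  registry: docker.io
  repository: grafana/loki
  tag: 3.5.7   # multi-arch manifest; containerd resolves the arm64 variant
```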

Storage I/O matters more than storage capacity. Longhorn replicates across nodes, which means every log write hits the network twice (local + remote replica). For write-heavy workloads like log ingestion, this amplifies I/O. Loki’s write-ahead log and batch flushing help — writes are buffered in memory and flushed periodically rather than on every log line.
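The buffering described above corresponds to Loki's ingester settings. A sketch with commonly used values (the specific numbers are assumptions, not this cluster's config):

```yaml
# Illustrative Loki ingester config: batch writes before flushing to Longhorn.
ingester:
  wal:
    enabled: true
    dir: /loki/wal           # write-ahead log survives pod restarts
  chunk_idle_period: 30m     # flush a chunk after 30m with no new data
  chunk_target_size: 1572864 # ~1.5MB target chunk size before flush
```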

Resource limits are not optional on SBCs. The 6-week Promtail outage happened because the Helm chart’s default resource requests assumed enterprise hardware. On ARM64 with 16GB per node, every pod needs explicit limits. There is no “plenty of headroom” — headroom doesn’t exist.

Stale position files cause silent data loss. When Promtail was restarted after the 6-week gap, it tried to resume from saved positions that pointed to rotated log files. Result: it skipped everything between the crash and restart. Fix: clear position files on restart when the gap exceeds the log retention period. The logs from those 6 weeks are gone — an unrecoverable data loss that underscores why resource limits matter.

Single-binary Loki is the right call for small clusters. The microservices deployment adds operational complexity (more pods to monitor, more failure modes, more inter-component networking) that only pays off at scale. For a 4-node cluster with 87 pods, single-binary mode is simpler, uses fewer resources, and is easier to debug. You can always migrate to microservices mode later if log volume grows — the storage format is the same.

Structured logging at the application layer is a force multiplier. Loki becomes dramatically more useful when applications emit structured JSON with consistent field names. The decision to standardize security events with log_type, event_type, and severity fields across all agents means one LogQL query can find security events regardless of which agent generated them. This was designed in the protocol spec, not bolted on after deployment.

Observability drives operational fixes. The MCP Gateway Grafana dashboard revealed a subtle bug: when a backend MCP server went down, the gateway’s tool discovery would fail permanently for that server — even after the backend recovered. The dashboard’s error rate panel showed the pattern clearly (errors going to 100% and staying there), leading to a fix that auto-resets stale backend sessions. Without the dashboard, this would have manifested as mysterious “tool not found” errors that required a gateway restart to resolve.

Proactive checks cover the observability blind spot. Reactive alerting has an inherent limitation: it can only detect failures that produce observable signals. When the failure is the absence of signals — a dead log shipper, a silent service — reactive systems are blind by definition. The 15-minute health check cadence means the maximum detection time for a silent failure is 15 minutes, compared to the 6 weeks it took to notice the Promtail outage manually.

| Choice | Benefit | Cost |
| --- | --- | --- |
| Single-binary Loki over microservices | Simpler ops, fewer resources, one pod to debug | Can’t independently scale query vs ingestion |
| DaemonSet Promtail over sidecar injection | No application changes needed, catches all pod logs | One more pod per node (~50Mi memory each) |
| 30-day retention | Sufficient for incident investigation | Older events require Grafana snapshots or manual export |
| On-cluster Grafana over cloud-hosted | No data exfiltration, no SaaS dependency | Must manage Grafana upgrades and storage |
| Longhorn PVC for Loki | Replicated storage, survives node failure | Write amplification from network replication |
| Agent-level health checks over k8s probes | Tests real consumer path, catches app-layer failures | Depends on agent cron system availability |
| 15-minute check cadence | Balances detection speed vs API/resource overhead | Up to 15 min detection delay for silent failures |
  • Log collection: Promtail 3.5.1 (DaemonSet, all nodes)
  • Log storage & query: Loki 3.5.7 (StatefulSet, single-binary mode)
  • Visualization: Grafana (Deployment, Longhorn PVC)
  • Alerting: Loki ruler + AlertManager + Grafana unified alerting + agent cron health checks
  • Storage: Longhorn-replicated PVCs (2Gi Loki, 1Gi Grafana)
  • Runtime: Kubernetes v1.28.2 on ARM64 (Orange Pi 5, RK3588)
  • Query language: LogQL
  • Namespace: monitoring