
ADR-002: Longhorn for Distributed Storage on ARM64

Status: Accepted
Date: 2025-07-01
Author: Spencer Fuller

The cluster consists of 4 Orange Pi 5 nodes (ARM64, Rockchip RK3588), each with limited local NVMe storage. Multiple workloads require persistent volumes: Home Assistant configuration and database, OpenClaw agent workspace and audit logs, security scan results, and Longhorn’s own metadata. Without distributed storage, a single node failure means data loss for any workload scheduled on that node.

Key requirements:

  • Data resilience — survive single-node failure without data loss
  • ARM64 native — no emulation or unsupported architectures
  • Kubernetes-native — integrate with StorageClass, PVC, and dynamic provisioning
  • Operationally simple — manageable by one person without dedicated storage expertise
  • Reasonable overhead — can’t dedicate half the cluster’s resources to storage infrastructure

Deploy Longhorn (a CNCF Sandbox project, currently v1.10.1) as the cluster's distributed block storage provider, with a default StorageClass configured for 2x replication.
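
For reference, the default StorageClass ends up looking roughly like the sketch below. The provisioner name driver.longhorn.io and the numberOfReplicas parameter are standard Longhorn CSI settings; the reclaim policy, volume binding mode, and staleReplicaTimeout shown here are illustrative assumptions rather than the exact values deployed.

```yaml
# Sketch of a default-class Longhorn StorageClass with 2x replication.
# Values other than numberOfReplicas are illustrative assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"        # two copies of every volume, on two different nodes
  staleReplicaTimeout: "2880"  # minutes before an errored replica is cleaned up
```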

  1. ARM64 native support. Longhorn publishes multi-arch images and has supported ARM64 since v1.1. On the Orange Pi 5 cluster, Longhorn runs without modification or workarounds — engine, replica, and manager components all run natively on aarch64.

  2. Built-in replication with tunable replica count. Setting numberOfReplicas: 2 (as in the StorageClass sketch above) means every volume is stored on 2 of the 4 nodes. This survives any single-node failure (hardware failure, kernel panic, power loss) while keeping storage overhead at 2x rather than 3x. For a 4-node cluster, 2x replication is the sweet spot: usable capacity is half of raw capacity, whereas 3x replication would leave only a third of it usable.

  3. Web UI for management. Longhorn ships with a dashboard that shows volume health, replica placement, node disk utilization, and backup status. For a single-operator homelab, this eliminates the need to memorize CLI commands for routine storage operations.

  4. Kubernetes-native integration. Longhorn registers as a CSI driver and provides a StorageClass. Workloads request storage via standard PVCs — no special annotations or sidecar containers needed. Dynamic provisioning just works (see the PVC sketch after this list).

  5. Handles node failures gracefully. When a node goes offline, Longhorn automatically rebuilds replicas on remaining healthy nodes (if configured). Volumes remain accessible from the surviving replica with no manual intervention.
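
As noted in point 4, a workload only needs an ordinary PVC. Below is a minimal sketch, using the secscan-results volume described under the consequences as the example and assuming the default StorageClass is named longhorn; the namespace is a placeholder.

```yaml
# Sketch of a standard PVC backed by Longhorn; no Longhorn-specific
# annotations or sidecars are required. The namespace is a placeholder.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: secscan-results
  namespace: secscan          # assumed namespace, for illustration only
spec:
  accessModes:
    - ReadWriteOnce           # a Longhorn block volume attaches to one node at a time
  storageClassName: longhorn  # the default class sketched earlier
  resources:
    requests:
      storage: 2Gi            # size noted in the consequences section
```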

Alternatives considered:

  • Rook-Ceph — The gold standard for distributed storage, but catastrophically heavy for SBCs. Ceph’s OSD, MON, and MGR daemons consume multiple GB of RAM each. On nodes with 16GB total and 4-8GB free, Ceph would starve workloads. Designed for 10+ node clusters with dedicated storage nodes.
  • OpenEBS — Promising architecture (especially Mayastor for NVMe-native storage), but ARM64 support was spotty at evaluation time. Community reports of build failures and missing multi-arch images made it a risk for a production homelab.
  • NFS (single server) — Simple and proven, but creates a single point of failure. If the NFS server node dies, every workload with a PV goes down, defeating the purpose of a multi-node cluster. Also, NFS write performance over the network is measurably worse than local block storage.
  • local-path-provisioner — Rancher’s lightweight local storage provisioner. Zero overhead and the fastest performance, but no replication: a node failure means data loss for any volume on that node. Acceptable for ephemeral or reconstructable data, but not for Home Assistant’s database or agent audit logs.

Positive consequences:

  • Every PVC in the cluster is automatically replicated across 2 nodes — no per-workload storage decisions needed
  • Node maintenance (OS updates, hardware swaps) is non-disruptive: drain the node, volumes failover to the surviving replica
  • Longhorn UI provides at-a-glance storage health without CLI gymnastics
  • Security scanning infrastructure uses a 2Gi Longhorn PVC (secscan-results) that survives node failures, ensuring scan results are always available for the interpretation layer

Trade-offs accepted:

  • 2x storage overhead — a 10GB volume consumes 20GB of aggregate cluster storage. On nodes with limited NVMe capacity, this means careful capacity planning
  • Write latency — synchronous replication adds latency to every write. For most homelab workloads (config files, databases with modest write rates), this is imperceptible. For write-heavy workloads, it’s measurable but acceptable
  • Longhorn manager resource consumption — the manager, engine, and replica pods consume memory on every node. Tuning resource requests/limits was necessary to avoid memory pressure on the smaller nodes
  • Backup story requires additional configuration — Longhorn supports S3-compatible backup targets, but setting up off-cluster backups is a separate effort (not yet implemented; a sketch of the intended configuration follows)
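
For the record, the not-yet-implemented backup configuration would look roughly like the following, assuming the Longhorn Helm chart's defaultSettings keys; the bucket, region, endpoint, and credential values are placeholders, not an existing deployment.

```yaml
# Sketch only: off-cluster backups are not yet implemented.
# Helm values for the longhorn chart; bucket, region, and secret name are placeholders.
defaultSettings:
  backupTarget: "s3://longhorn-backups@us-east-1/"        # s3://<bucket>@<region>/<optional-path>
  backupTargetCredentialSecret: "longhorn-backup-secret"  # secret in the longhorn-system namespace
```

The referenced secret would hold the S3 credentials, plus an endpoint override for S3-compatible stores; again, all values below are placeholders.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: longhorn-backup-secret
  namespace: longhorn-system
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "<access-key>"
  AWS_SECRET_ACCESS_KEY: "<secret-key>"
  AWS_ENDPOINTS: "https://s3.example.internal"  # only needed for non-AWS, S3-compatible endpoints
```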