ADR-006: External Secrets Operator + Vault over Kubernetes Secrets

Status: Accepted
Date: 2025-11-05
Author: Spencer Fuller

Context

The cluster manages a growing number of secrets: LLM provider API keys (Anthropic, OpenAI, Google), MCP tool credentials, database passwords, TLS certificates, Home Assistant tokens, and inter-agent Ed25519 signing keys. These secrets are consumed by workloads across multiple namespaces (mcp-tools, llm-gateway, openclaw, home-assistant, security).

Kubernetes Secrets are the default mechanism, but they have well-known limitations: values are base64-encoded (not encrypted), stored in etcd (encrypted at rest only if explicitly configured), and readable by anyone with RBAC read access to Secrets in the namespace. There’s no built-in rotation mechanism, no dedicated audit trail of access, and managing secrets across namespaces means duplicating Secret manifests or using complex RBAC rules.
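To make the "encoded, not encrypted" point concrete, here is a minimal Python sketch. The credential string is a placeholder for illustration only; the point is that reversing base64 requires no key at all.

```python
import base64

# Placeholder credential, used only for illustration.
api_key = "my-api-key"

# This is the form a value takes in the `data` field of a Secret manifest.
encoded = base64.b64encode(api_key.encode("utf-8")).decode("ascii")

# Anyone who can read the manifest can reverse it; no key is involved.
decoded = base64.b64decode(encoded).decode("utf-8")

print("encoded:", encoded)
print("decoded:", decoded)
```

The round trip succeeds for any reader of the manifest, which is why base64 offers formatting rather than confidentiality.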

For GitOps workflows, secrets are particularly problematic — you can’t commit Kubernetes Secret manifests to git (they contain raw credentials), which means secrets become manual, out-of-band operations that break the “everything in git” promise.

Key requirements:

Encryption at rest and in transit — secrets must be encrypted, not just encoded
Dynamic rotation — credentials should rotate on a schedule without redeploying workloads
Audit trail — who accessed which secret, when, from which namespace
GitOps compatibility — secret references (not values) should live in git alongside other manifests
Cross-namespace management — centralized secret store accessible from any namespace with fine-grained access control

Decision

Deploy HashiCorp Vault as the centralized secret store and External Secrets Operator (ESO) as the Kubernetes integration layer. Vault stores all secrets with encryption, access policies, and audit logging. ESO runs in-cluster and syncs secrets from Vault into Kubernetes Secrets via ExternalSecret CRDs — the CRDs live in git, the actual secret values never do.
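As a rough sketch of what "references, not values" looks like, the Python snippet below builds an ExternalSecret manifest as a dict and prints it as JSON (kubectl accepts JSON as well as YAML). The resource name, namespace, store name, Vault path, and key names are all hypothetical, and the apiVersion may differ depending on the ESO release installed.

```python
import json


def external_secret(name, namespace, store, vault_path, keys, refresh="1h"):
    """Build an ExternalSecret manifest that points at Vault data by path only."""
    return {
        # Assumption: v1beta1; newer ESO releases may serve a different version.
        "apiVersion": "external-secrets.io/v1beta1",
        "kind": "ExternalSecret",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "refreshInterval": refresh,
            "secretStoreRef": {"name": store, "kind": "ClusterSecretStore"},
            # Name of the Kubernetes Secret that ESO creates and keeps in sync.
            "target": {"name": name},
            # One entry per key: where to read in Vault, where to write in the Secret.
            "data": [
                {"secretKey": key, "remoteRef": {"key": vault_path, "property": key}}
                for key in keys
            ],
        },
    }


# Hypothetical example values; none of these names come from the ADR.
manifest = external_secret(
    name="llm-api-keys",
    namespace="llm-gateway",
    store="vault",
    vault_path="llm/providers",
    keys=["anthropic", "openai"],
)
print(json.dumps(manifest, indent=2))
```

Nothing in the output is sensitive: it names a Vault path and the keys to extract, so it can be reviewed in a pull request like any other manifest.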

Rationale

Vault provides encryption, not encoding. Kubernetes Secrets base64-encode values: echo "my-api-key" | base64 is not security, it’s formatting. Vault encrypts everything it writes to its storage backend with AES-256-GCM and only decrypts it for authenticated, authorized requests. Auto-unseal protects the key that guards that encryption; it is not the encryption itself. The difference between base64 and AES-256-GCM is the difference between a screen door and a vault door.
Dynamic secret rotation without workload redeployment. Vault supports dynamic secrets (short-lived credentials generated on demand) and automatic rotation policies. ESO periodically reconciles ExternalSecret CRDs: when Vault rotates a secret, ESO updates the corresponding Kubernetes Secret. Workloads that mount the Secret as a volume see the new value once the kubelet refreshes the mount, with no kubectl delete pod required. Workloads that read the Secret through environment variables (or a subPath mount) only see the new value after a restart, so rotation-sensitive workloads should use volume mounts. For API keys with provider-imposed rotation schedules, this is essential.
Comprehensive audit trail. Every secret access in Vault is logged: which identity requested which secret, from which IP, at what time, and whether access was granted or denied. This audit log feeds into the security monitoring pipeline. With plain Kubernetes Secrets, there’s no dedicated access log: you’d need to enable and parse Kubernetes API audit logs, which are verbose and mix secret access with every other API call.
GitOps-compatible via ExternalSecret CRDs. An ExternalSecret manifest specifies which Vault path to read, which keys to extract, and which Kubernetes Secret to create — but never contains the secret value itself. These CRDs commit safely to git alongside Deployments, Services, and ConfigMaps. ArgoCD or Flux applies the CRD, ESO reads from Vault, and the Kubernetes Secret materializes. The GitOps loop remains unbroken.
Centralized cross-namespace access control. Vault policies define which Kubernetes ServiceAccounts can access which secret paths. The openclaw namespace can read LLM API keys but not Home Assistant tokens. The home-assistant namespace can read its own tokens but not agent signing keys. This is fine-grained, centralized, and auditable — unlike Kubernetes RBAC for Secrets, which requires per-namespace Role/RoleBinding proliferation.
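The path-based access model described in the last point can be sketched in a few lines of Python. This is a deliberately simplified model, not Vault's policy engine: real policies are written in HCL, support more capabilities than read, and bind to ServiceAccounts through the Kubernetes auth method. The secret paths below are hypothetical.

```python
# Simplified model: each namespace maps to the path prefixes it may read.
# Paths are hypothetical and exist only for this illustration.
POLICIES = {
    "openclaw": ["secret/data/llm/"],
    "home-assistant": ["secret/data/home-assistant/"],
}


def can_read(namespace: str, path: str) -> bool:
    """Deny by default; allow only if the path falls under a granted prefix."""
    return any(path.startswith(prefix) for prefix in POLICIES.get(namespace, []))


checks = [
    ("openclaw", "secret/data/llm/anthropic"),
    ("openclaw", "secret/data/home-assistant/token"),
    ("home-assistant", "secret/data/home-assistant/token"),
    ("home-assistant", "secret/data/agents/signing-key"),
    ("mcp-tools", "secret/data/llm/anthropic"),
]
for namespace, path in checks:
    print(f"{namespace:15} {path:40} allowed={can_read(namespace, path)}")
```

The useful property is the default: a namespace with no matching prefix gets nothing, so every grant is an explicit, reviewable line in one place.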

Alternatives Considered

Alternative	Why Not
Plain Kubernetes Secrets	The default. Base64 encoding provides zero security — anyone with namespace read access sees raw credentials. No rotation mechanism (manual delete and recreate). No audit trail of who read which secret. No cross-namespace sharing without duplication. Etcd encryption-at-rest helps but doesn’t address access control, rotation, or auditability. Fine for development, insufficient for a platform managing API keys with real cost implications.
Bitnami Sealed Secrets	Improvement over plain Secrets — encrypts values with a cluster-specific key so encrypted manifests can safely live in git. But: no dynamic rotation (you re-encrypt and reapply manually), no audit trail of secret access (only creation/modification is tracked via git history), and no centralized cross-namespace management. Sealed Secrets solves the “secrets in git” problem but not the rotation, audit, or lifecycle problems.
Mozilla SOPS	File-level encryption for secret values using KMS, PGP, or age keys. Integrates with git workflows — encrypt files before commit, decrypt during CI/CD. But SOPS operates at the file level, not the runtime level. There’s no dynamic rotation (change the file, re-encrypt, redeploy), no access audit trail (it’s just file decryption), and no Kubernetes-native integration without additional tooling. SOPS is encryption for files; Vault is a secret management platform.

Consequences

Positive

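All secrets are encrypted at rest in Vault with AES-256-GCM, so a leaked Vault storage snapshot reveals nothing without the unseal key. The Kubernetes Secrets that ESO syncs still live in etcd, so an etcd backup is only protected if etcd encryption at rest is also enabled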
API key rotation happens on schedule without manual intervention, and workloads that mount secrets as volumes pick up new values without restarts
Audit logs record every secret request, granted or denied, giving a per-identity view of secret access across all namespaces
ExternalSecret CRDs in git make secret management declarative and reviewable — PRs show which secrets a workload needs without exposing what those secrets contain

Negative

Operational complexity. Vault is a stateful, security-critical service that requires careful deployment: unseal management, backup procedures, HA configuration, and version upgrades. It’s the most operationally demanding component in the cluster. If Vault is down, ESO can’t refresh secrets. Already-synced Kubernetes Secrets keep working with their last value, but workloads whose secrets have not been synced yet can’t start
Bootstrap chicken-and-egg. Vault needs to be running before ESO can sync secrets, but Vault itself needs storage (Longhorn PVC) and network (Cilium) to run. The initial cluster bootstrap requires manual secret seeding before the automated pipeline takes over
ESO reconciliation delay. ESO polls Vault on a configurable interval (default: 1 hour). A rotated secret doesn’t appear in Kubernetes instantly; for up to one interval the Kubernetes Secret still holds the old value. Shortening the interval increases Vault API load. For most secrets this is acceptable; for time-critical rotations, a refresh can be forced out of band, for example by changing an annotation on the ExternalSecret
Learning curve for Vault policies. Vault’s policy language (HCL) and auth method configuration (Kubernetes auth, AppRole) have a non-trivial learning curve. Writing correct policies that follow least-privilege without being overly restrictive requires understanding Vault’s path-based access model
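The reconciliation-delay trade-off above comes down to simple arithmetic, sketched here in Python. The model assumes one Vault read per ExternalSecret per refresh, which is a simplification (actual request counts depend on how many keys each ExternalSecret pulls and on the ESO version), and the count of 40 ExternalSecrets is a made-up figure.

```python
def vault_reads_per_hour(external_secrets: int, refresh_seconds: int) -> float:
    """Estimated Vault reads per hour, assuming one read per ExternalSecret per refresh."""
    return external_secrets * 3600 / refresh_seconds


def worst_case_staleness_seconds(refresh_seconds: int) -> int:
    """A secret rotated just after a sync stays stale until the next sync."""
    return refresh_seconds


# Hypothetical cluster size; compare the 1 hour default with shorter intervals.
EXTERNAL_SECRETS = 40
for interval in (3600, 300, 60):
    reads = vault_reads_per_hour(EXTERNAL_SECRETS, interval)
    stale = worst_case_staleness_seconds(interval)
    print(f"interval={interval:>4}s  worst-case stale={stale:>4}s  reads/hour={reads:.0f}")
```

Under these assumptions, load grows inversely with the interval: cutting staleness from an hour to a minute multiplies Vault reads by 60, which is why a short interval is better reserved for the few secrets that need it.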