K8s Deep Dive: GitOps
This is a companion to the Home Kubernetes Cluster overview and a sibling to the Networking Deep Dive and the Storage Deep Dive. Those pages cover how traffic flows and how data persists. This one covers how changes get to the cluster in the first place — and why “I’ll just kubectl apply it” is a trap.
Every piece of infrastructure described in the sibling pages — Cilium, Longhorn, MetalLB, ingress-nginx, cert-manager — is deployed and managed through a GitOps pipeline. No manual Helm installs, no imperative kubectl commands for steady-state configuration. Git is the single source of truth for what runs on this cluster.
Why GitOps
Three properties make GitOps worth the setup cost, especially for infrastructure managed by one person:
Auditability. git log is the audit trail. Every change to the cluster — a new Helm chart, a bumped version, a modified value — is a commit with a timestamp, author, and diff. When something breaks at 11 PM, I don’t have to remember what I changed. The commit history tells me exactly what changed, when, and why (assuming I wrote decent commit messages, which is its own discipline).
Reproducibility. The entire cluster can be reconstructed from a single Git repository. If all four nodes caught fire tomorrow, I could provision new hardware, bootstrap Kubernetes, point Flux at the repo, and walk away. Every namespace, every Helm release, every Kustomization would converge to the declared state. The recovery time is limited by hardware provisioning, not by remembering what was deployed where.
Safety. Pull requests are the review gate. On a solo project, PRs might seem like overhead — who am I reviewing for? Myself, it turns out. The diff in a PR forces me to read the change in context before it hits the cluster. And when a change goes wrong, git revert followed by a push is the rollback mechanism. No fumbling with helm rollback flags or trying to remember the previous values. Revert the commit, Flux reconciles, done.
The Stack
The GitOps pipeline has a clear flow: a GitHub repository holds the desired state, Flux watches that repo, and Kustomizations and HelmRelease custom resources translate the desired state into actual Kubernetes objects.
```mermaid
graph LR
  repo["GitHub Repo\nspencer2211/fiducian-kube"] -->|"SSH poll\n(1 min)"| flux["Flux\n(flux-system)"]
  flux -->|"Reconcile"| kust["Kustomizations"]
  kust -->|"Render manifests"| hr["HelmRelease CRs"]
  hr -->|"Helm install/upgrade"| workloads["Deployed Workloads\n(18 Helm releases)"]

  style repo fill:#f5f5f5,stroke:#333
  style flux fill:#e8f4fd,stroke:#2196F3
  style kust fill:#e8f4fd,stroke:#2196F3
  style hr fill:#fff3e0,stroke:#FF9800
  style workloads fill:#e8f5e9,stroke:#4CAF50
```

| Component | Role |
|---|---|
| GitHub (spencer2211/fiducian-kube) | Source of truth. All cluster state lives here. |
| Flux v2 | GitOps operator. Watches the repo and reconciles cluster state to match. |
| Kustomizations | Flux CRs that define which paths in the repo to apply and in what order. |
| HelmRelease CRs | Declarative Helm installs. Flux manages the Helm lifecycle — install, upgrade, rollback. |
| Vault + ESO | Secrets stay in Vault. External Secrets Operator syncs them into Kubernetes Secrets. |
Flux Configuration
Flux runs in the flux-system namespace. It was bootstrapped with flux bootstrap github, which sets up the initial GitRepository source, the self-managing Kustomization, and the deploy key on the GitHub repo.
GitRepository Source
The source configuration tells Flux where to find the desired state:
```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m0s
  ref:
    branch: production
  secretRef:
    name: flux-system
  url: ssh://git@github.com/spencer2211/fiducian-kube
```

A few things worth noting:
- SSH transport. The repo is accessed over SSH with a deploy key, not HTTPS with a token. Deploy keys are scoped to a single repository and can be read-only, which is the minimum privilege Flux needs. If the key is compromised, the blast radius is one repo, not my entire GitHub account.
- production branch. Not main. The production branch is the branch that Flux watches. Changes go through main first via PR, then get merged or promoted to production. This gives me a gate between “merged code” and “deployed code” when I want it, though in practice I often merge directly to production for infrastructure changes.
- 1-minute poll interval. Flux checks the repo every 60 seconds. For a homelab, this is a good balance — changes deploy within a minute of merging, without hammering GitHub’s API. Flux also supports webhook receivers for instant reconciliation (sketched below for reference), but polling is simpler and sufficient here.
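For reference, the webhook path would look roughly like this. A minimal sketch, assuming Flux’s notification controller is running; the token Secret name is illustrative:

```yaml
# Sketch only: this cluster polls instead. A Receiver exposes a webhook
# endpoint that GitHub push events can hit, triggering reconciliation
# immediately rather than waiting for the next poll.
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: github-receiver
  namespace: flux-system
spec:
  type: github
  events:
    - "push"
  secretRef:
    name: webhook-token   # hypothetical Secret holding the shared webhook token
  resources:
    - apiVersion: source.toolkit.fluxcd.io/v1
      kind: GitRepository
      name: flux-system
```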
Kustomizations
Flux uses three Kustomizations to manage different parts of the cluster:
| Kustomization | Path | Purpose |
|---|---|---|
| flux-system | clusters/production | Base Flux components and cluster-wide resources. Self-referential — Flux manages its own deployment. |
| llm-keda | apps/llm-keda | KEDA autoscaling configuration for the LLM inference stack. ScaledObjects and triggers for scaling Ollama replicas based on queue depth. |
| llm-ollama | apps/llm-ollama | Ollama LLM inference deployment. Model serving, resource limits, GPU scheduling. |
The flux-system Kustomization is the root — it’s what Flux bootstraps first, and it references everything else. The LLM-specific Kustomizations are separated because they have different reconciliation needs. The Ollama deployment changes frequently as I experiment with models and resource allocations, while the base Flux configuration is essentially static.
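To make that concrete, a Flux Kustomization is a small CR. Here is roughly what llm-ollama could look like; the interval and prune settings are my assumptions, not values from the repo:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: llm-ollama
  namespace: flux-system
spec:
  interval: 5m              # assumed: how often Flux re-checks and re-applies
  path: ./apps/llm-ollama   # the repo path from the table above
  prune: true               # assumed: delete cluster objects removed from Git
  sourceRef:
    kind: GitRepository
    name: flux-system
```

Prune is the setting that makes Git deletions real: remove a manifest from the path and Flux garbage-collects the corresponding cluster object.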
Helm Releases
Every workload on the cluster is deployed as a HelmRelease custom resource. Flux watches these CRs and manages the Helm lifecycle — installing charts, upgrading when the spec changes, and rolling back on failure.
There are currently 18 Helm releases across the cluster:
| Release | Namespace | Chart Version | App Version |
|---|---|---|---|
| cilium | cilium-system | 1.16.18 | 1.16.18 |
| longhorn | longhorn-system | 1.10.1 | v1.10.1 |
| cert-manager | cert-manager | v1.13.3 | v1.13.3 |
| ingress-nginx | ingress-nginx | 4.14.1 | 1.14.1 |
| metallb | metallb-system | 0.13.12 | v0.13.12 |
| vault | vault | 0.31.0 | 1.20.4 |
| home-assistant | home-assistant | 0.3.43 | 2026.2.1 |
| external-secrets | external-secrets-system | 1.1.1 | v1.1.1 |
| kube-prometheus | monitoring | 80.9.2 | v0.87.1 |
| grafana | monitoring | 10.4.0 | 12.3.0 |
| loki | monitoring | 6.46.0 | 3.5.7 |
| promtail | monitoring | 6.17.1 | 3.5.1 |
| jaeger | monitoring | 3.4.1 | 1.53.0 |
| keda | keda | 2.18.3 | 2.18.3 |
That’s 14 listed explicitly — the remaining 4 are application-specific releases in namespaces like openclaw, nats, graphiti, and llm-gateway that are covered in their respective project pages.
The releases fall into natural tiers:
- Infrastructure (Cilium, MetalLB, cert-manager, ingress-nginx, Longhorn) — the foundation everything else depends on. These change rarely and require careful testing when they do.
- Platform Services (Vault, External Secrets, KEDA) — shared capabilities consumed by applications. Medium change frequency.
- Observability (kube-prometheus, Grafana, Loki, Promtail, Jaeger) — the monitoring stack. Changes usually mean dashboard updates or retention tuning, not architectural shifts.
- Applications (Home Assistant, plus the workloads in other namespaces) — the things the cluster exists to run. Highest change frequency.
Anatomy of a HelmRelease
Here’s what a typical HelmRelease looks like, using Home Assistant as an example:
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: home-assistant
  namespace: home-assistant
spec:
  interval: 5m
  chart:
    spec:
      chart: home-assistant
      version: "0.3.43"
      sourceRef:
        kind: HelmRepository
        name: home-assistant
        namespace: flux-system
  values:
    image:
      repository: ghcr.io/home-assistant/home-assistant
      tag: "2026.2.1"
    hostNetwork: true
    persistence:
      enabled: true
      storageClass: longhorn
      size: 10Gi
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        memory: 1Gi
```

The structure is straightforward:
- chart.spec pins the Helm chart version. This is the version of the chart packaging, not the application itself.
- values is the equivalent of a values.yaml file — it configures the chart. The image tag, persistence settings, and resource limits all go here.
- interval: 5m means Flux checks this release every 5 minutes and reconciles if the actual state has drifted from the desired state.
- sourceRef points to a HelmRepository resource that Flux also manages — it’s the chart registry URL (sketched just after this list).
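For completeness, the HelmRepository that sourceRef points to is a small resource of its own. A sketch, with a placeholder registry URL since the real one isn’t shown here:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: home-assistant
  namespace: flux-system
spec:
  interval: 1h                                     # assumed: how often Flux re-fetches the chart index
  url: https://charts.example.com/home-assistant   # hypothetical chart registry URL
```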
When I want to upgrade Home Assistant, I change tag: "2026.2.1" to the new version, commit, push, and Flux handles the rolling update. No helm upgrade commands, no remembering which values I passed last time.
Deployment Workflow
Normal Changes
The standard deployment workflow is branch-based:
```mermaid
sequenceDiagram
  participant Dev as Developer
  participant Git as GitHub (fiducian-kube)
  participant Flux as Flux (flux-system)
  participant K8s as Kubernetes Cluster

  Dev->>Git: Create branch, modify HelmRelease
  Dev->>Git: Open PR, review diff
  Dev->>Git: Merge to production
  Git-->>Flux: Poll detects new commit (≤1 min)
  Flux->>K8s: Reconcile — helm upgrade
  K8s-->>Flux: Release status: deployed
  Flux-->>Git: Update status (GitOps Toolkit)
```

In practice for a solo operator, the PR step is often a quick self-review rather than a formal process. But the discipline of reading the diff before merging has caught real errors — typos in resource limits, wrong chart versions, accidentally removed values. The PR isn’t bureaucracy; it’s a forcing function for a second look.
Rollback
Rollback is the inverse of deployment:
- git revert <commit> — creates a new commit that undoes the change.
- git push — Flux detects the new commit within a minute.
- Flux reconciles — the cluster converges back to the previous state.
This is fundamentally safer than helm rollback because it operates on the source of truth. A helm rollback changes the cluster state without changing the Git repo, which means Git and the cluster are now out of sync. The next Flux reconciliation would re-apply the change you just rolled back. Git revert avoids this entirely — the revert is in Git, so Flux and the cluster agree on what the desired state should be.
Emergency Changes
Sometimes you need to fix something immediately. A pod is crash-looping, a misconfiguration is dropping traffic, something is on fire. The workflow:
- kubectl apply or kubectl edit — fix the immediate problem.
- Immediately commit the same change to Git — this is the critical step.
If you skip the second step, you have undocumented drift. The cluster state doesn’t match Git. The next Flux reconciliation (within a minute) will revert your emergency fix back to whatever Git says. Undocumented drift is a ticking time bomb — it might go off within the minute, or it might lie dormant until the next unrelated change triggers a full reconciliation. Either way, the fix you applied at 2 AM disappears silently.
The rule: if you kubectl apply something, you have 60 seconds to get the equivalent change into Git before Flux overwrites it. In practice, I usually suspend Flux reconciliation first (flux suspend kustomization <name>), make the fix, commit to Git, then resume reconciliation. Belt and suspenders.
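A variant worth knowing: because the Kustomization manifests live in Git too, the pause itself can be committed, which leaves a record in the history. A sketch, reusing the illustrative llm-ollama example from earlier:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: llm-ollama
  namespace: flux-system
spec:
  suspend: true             # Flux stops reconciling this Kustomization until removed
  interval: 5m
  path: ./apps/llm-ollama
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```

The trade-off is latency: flux suspend takes effect immediately, while the committed version waits for the next flux-system reconciliation to apply it.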
Secrets Management
The one thing that can’t live in Git is secrets. API keys, database passwords, TLS private keys — committing these to a repository, even a private one, is a security failure mode. If the repo is ever exposed, every secret is compromised.
The cluster solves this with HashiCorp Vault and the External Secrets Operator (ESO):
```mermaid
graph LR
  vault["HashiCorp Vault\nv1.20.4"] -->|"API"| eso["External Secrets\nOperator v1.1.1"]
  eso -->|"Sync"| secret["Kubernetes Secret"]
  secret -->|"Volume mount\nor env var"| pod["HelmRelease / Pod"]

  style vault fill:#fce4ec,stroke:#E91E63
  style eso fill:#fff3e0,stroke:#FF9800
  style secret fill:#e8f4fd,stroke:#2196F3
  style pod fill:#e8f5e9,stroke:#4CAF50
```

The flow:
- Vault stores the actual secret values. It’s deployed on the cluster itself via Helm (version 1.20.4), with its storage backend on a Longhorn volume. Vault is unsealed using a set of key shares — the unseal process is the one manual step after a full cluster restart.
- External Secrets Operator runs in the external-secrets-system namespace. It watches ExternalSecret custom resources, which declare “I need a Kubernetes Secret with key X, and the value comes from Vault path Y.”
- The ESO controller authenticates to Vault (using Kubernetes service account auth), reads the secret value, and creates or updates a standard Kubernetes Secret (the store configuration is sketched just after this list).
- Workloads consume the Kubernetes Secret normally — as environment variables or volume mounts. They don’t know or care that the value came from Vault.
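The store configuration referenced above lives in Git as well. A sketch of what it could look like; the Vault address, auth mount, role, and service account names are assumptions, not values from the repo:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: http://vault.vault.svc:8200   # in-cluster Vault service (assumed)
      path: secret                          # KV mount, matching the secret/data/... paths used below
      version: v2                           # KV v2 engine
      auth:
        kubernetes:
          mountPath: kubernetes             # Vault's Kubernetes auth mount (assumed)
          role: external-secrets            # Vault role bound to ESO's service account (assumed)
          serviceAccountRef:
            name: external-secrets              # assumed service account name
            namespace: external-secrets-system  # required for a cluster-scoped store
```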
The key benefit: the Git repository contains ExternalSecret resources that reference Vault paths, but never the actual secret values. The repo can be shared, reviewed, and even made public without exposing credentials. The secret values live exclusively in Vault, which has its own access control, audit logging, and rotation capabilities.
What This Looks Like in Practice
An ExternalSecret in the repo:
```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-credentials
  namespace: my-app
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: app-credentials
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: secret/data/my-app
        property: database_url
```

This is safe to commit. It says “create a Kubernetes Secret called app-credentials with a key DATABASE_URL whose value comes from secret/data/my-app in Vault.” The actual database URL never appears in Git.
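On the consuming side, nothing Vault-specific leaks into the workload spec. A minimal sketch (the deployment and image names are hypothetical) of a pod picking up the synced Secret:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: ghcr.io/example/my-app:latest   # hypothetical image
          envFrom:
            - secretRef:
                name: app-credentials            # the Secret ESO keeps in sync
```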
Monitoring and Observability
The entire monitoring stack is itself GitOps-managed — deployed and configured through HelmRelease resources, same as everything else. Monitoring the monitors through the same pipeline they monitor. It’s turtles all the way down, but it works.
| Component | Namespace | Chart Version | App Version | Purpose |
|---|---|---|---|---|
| kube-prometheus | monitoring | 80.9.2 | v0.87.1 | Prometheus + Alertmanager + default recording/alerting rules. The metrics backbone. |
| grafana | monitoring | 10.4.0 | 12.3.0 | Dashboards and visualization. Connects to Prometheus, Loki, and Jaeger as data sources. |
| loki | monitoring | 6.46.0 | 3.5.7 | Log aggregation. Receives logs from Promtail, stores on Longhorn (50 Gi volume). |
| promtail | monitoring | 6.17.1 | 3.5.1 | Log shipper. DaemonSet that tails container logs on every node and forwards to Loki. |
| jaeger | monitoring | 3.4.1 | 1.53.0 | Distributed tracing. Captures trace spans from instrumented services. |
This stack provides three pillars of observability:
- Metrics (Prometheus via kube-prometheus-stack) — numeric time-series data. CPU usage, memory consumption, request rates, error rates, Helm release status, Flux reconciliation duration. Prometheus scrapes metrics endpoints across the cluster every 30 seconds.
- Logs (Loki + Promtail) — structured and unstructured log data from every container. Promtail runs on every node as a DaemonSet, tails the container log files, and ships them to Loki. Queried through Grafana using LogQL. When a Flux reconciliation fails, the logs tell me why (an alert sketch for that case follows this list).
- Traces (Jaeger) — distributed request tracing for services that emit OpenTelemetry spans. Most useful for the AI agent workloads where a single user request fans out across multiple services (API gateway, agent runtime, tool execution, LLM inference). Less relevant for infrastructure components, but available when needed.
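Committing alert rules alongside the stack closes the loop on the failure case mentioned in the Logs bullet. A hedged sketch, assuming the kube-prometheus-stack PrometheusRule CRD and that Flux’s gotk_reconcile_condition metric is being scraped (how it is exposed varies by Flux version):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: flux-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus   # assumed: must match the chart's rule selector
spec:
  groups:
    - name: flux
      rules:
        - alert: FluxReconciliationFailure
          expr: max(gotk_reconcile_condition{status="False",type="Ready"}) by (namespace, name) != 0
          for: 10m
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.name }} has been failing to reconcile for 10m"
```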
All three data sources are configured in Grafana, which serves as the single pane of glass. I can go from a spike in a Prometheus metric to the relevant logs in Loki to the trace that shows where latency was introduced — all without leaving the browser.
The fact that the monitoring stack is itself managed through GitOps means upgrades, configuration changes, and dashboard provisioning all go through the same PR workflow. A new Grafana dashboard is a JSON file committed to Git, not a manual creation in the UI that gets lost when the pod restarts.
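With the Grafana chart’s dashboard sidecar enabled (an assumption about this deployment’s values), that JSON file is delivered as a labeled ConfigMap. Roughly:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: flux-overview-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # the label the sidecar watches for (chart default)
data:
  flux-overview.json: |
    { "title": "Flux Overview", "panels": [], "schemaVersion": 39 }
```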
Summary
| Aspect | Detail |
|---|---|
| GitOps Operator | Flux v2, flux-system namespace |
| Source Repo | spencer2211/fiducian-kube, production branch, SSH transport |
| Reconciliation | 1-minute poll interval, automatic convergence |
| Helm Releases | 18 total across infrastructure, platform, observability, and application tiers |
| Secrets | HashiCorp Vault v1.20.4 + External Secrets Operator v1.1.1 |
| Monitoring | kube-prometheus, Grafana, Loki, Promtail, Jaeger — all GitOps-managed |
| Deployment | Branch, PR, merge, Flux reconciles. Rollback via git revert. |
| Emergency | kubectl apply + immediate Git commit. Suspend Flux if needed. |
GitOps on a homelab might seem like overkill. It’s one person managing one cluster — why not just helm install and move on? The answer is the same reason I use version control for code: because my memory is unreliable, my 2 AM self makes questionable decisions, and the ability to see exactly what changed and roll it back without thinking is worth the upfront investment in the pipeline.
The cluster has been running this way since deployment, and the number of “what did I change that broke this” debugging sessions has been exactly zero. Every incident starts with git log and ends with either “ah, that commit” or “nothing changed in Git, so the problem is external.” That clarity alone justifies the approach.