Bare-Metal Kubernetes Storage: Longhorn Setup (Part 2)
In Part 1, I walked through setting up NFS storage with Unraid for my bare-metal Kubernetes cluster. That gave me two NFS storage classes (nfs-unraid and nfs-unraid-retain) with ReadWriteMany support. That solved the immediate problem of shared storage, but it introduced a single point of failure: my Unraid NAS. If that server goes down, everything depending on those NFS volumes stops working.
That's when I started looking at distributed storage solutions.
The Problem: I Still Had a Single Point of Failure
Don't get me wrong, NFS with Unraid is fantastic for certain workloads. I use it for config files, shared media libraries, and anything that needs ReadWriteMany access. But here's what kept me up at night:
- Database workloads: My PostgreSQL instances were on local-path storage (node-local, no redundancy)
- Application state: If a worker node died, pods would restart but lose their data
- No automatic failover: I couldn't just drain a node and expect volumes to follow pods elsewhere
I needed storage that could:
- Replicate data across multiple nodes automatically
- Survive node failures without manual intervention
- Move with pods when they reschedule
- Not depend on external infrastructure (no NAS required)
That's distributed block storage, and for Kubernetes, that means Longhorn.
What is Longhorn (and Why Should You Care)?
Longhorn is a distributed block storage system built specifically for Kubernetes. Think of it as turning your cluster's worker node disks into a resilient storage pool that works like this:
┌─────────────────────────────────────────┐
│  Pod requests 10GB volume (3 replicas)  │
└────────────────────┬────────────────────┘
                     │
          ┌──────────▼───────────┐
          │   Longhorn Manager   │  (Decides where replicas go)
          └──────────┬───────────┘
                     │
      ┌──────────────┼──────────────┐
      │              │              │
  ┌───▼────┐     ┌───▼────┐     ┌───▼────┐
  │Worker-1│     │Worker-2│     │Worker-3│
  │  10GB  │     │  10GB  │     │  10GB  │
  │Replica │◄───►│Replica │◄───►│Replica │  (Data synchronised)
  └────────┘     └────────┘     └────────┘
If Worker-2 dies, Longhorn automatically:
- Detects the failure
- Promotes one of the remaining replicas to primary
- Creates a new replica on another healthy node
- Your application keeps running (brief I/O pause during failover)
I learned this the hard way when I accidentally shut down a worker during testing. The PostgreSQL pod paused for about 10 seconds, reconnected, and kept running like nothing happened. That's when I knew this was the solution I needed.
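If you want to watch this happen yourself, the Longhorn CRDs make it easy to follow along. A quick sketch (exact column names can vary a bit between Longhorn versions): watch the volume drop from healthy to degraded and recover as the replica rebuilds.
# Watch volume state and robustness (healthy/degraded) during a failover
kubectl get volumes.longhorn.io -n longhorn-system -w
# Watch where each replica is running
kubectl get replicas.longhorn.io -n longhorn-system -o wide -w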
Distributed Storage vs. NFS: Which Do You Actually Need?
Here's how I think about my storage now:
| Storage Type | StorageClasses | Best For | Avoid For | Why I Use It |
|---|---|---|---|---|
| Local-path | local-path | Caches, temp data, build artifacts | Anything stateful | Fastest, but data dies with the node |
| NFS (Unraid) | nfs-unraid, nfs-unraid-retain | Config files, media, ReadWriteMany | Databases, high IOPS | Shared access, massive capacity, but single point of failure |
| Longhorn | longhorn, longhorn-gp, longhorn-ha, longhorn-io | Databases, stateful apps, HA workloads | Massive files (TB+) | Survives node failures, follows pods |
The sweet spot: Use all three. Longhorn for resilience, NFS for sharing, local-path for speed.
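In practice, "use all three" just means picking a storageClassName per PVC. A minimal example (the PVC name is a placeholder):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-app-data        # placeholder name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn    # swap for nfs-unraid or local-path depending on the workload
  resources:
    requests:
      storage: 5Gi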
Prerequisites: What You Need Before Installing
Longhorn uses iSCSI under the hood, so each worker node needs a few packages. I'm running Ubuntu 22.04 on my worker nodes; adjust the package names if you're on a different distro.
Quick Check: Do You Have What You Need?
SSH into one of your worker nodes and run:
# Check for required packages
dpkg -l | grep open-iscsi
dpkg -l | grep nfs-common
# Check if iscsid service exists
systemctl status iscsid
If you see "unit not found," you'll need to install the packages.
Installing Prerequisites on All Workers
I have 6 worker nodes, so I automated this with a quick loop. Update the IP range to match your cluster:
# From your local machine (not inside the cluster)
# Replace with your actual worker node IPs
for ip in 192.168.1.10 192.168.1.11 192.168.1.12 192.168.1.13 192.168.1.14 192.168.1.15; do
echo "=== Configuring $ip ==="
ssh ubuntu@$ip "sudo apt update && sudo apt install -y open-iscsi nfs-common"
ssh ubuntu@$ip "sudo systemctl enable iscsid && sudo systemctl start iscsid"
ssh ubuntu@$ip "sudo systemctl is-active iscsid" # Should return "active"
done
Why nfs-common? Longhorn can export volumes as NFS for ReadWriteMany scenarios. You don't have to use this feature, but it's nice to have the option.
Installing Longhorn via Helm
I prefer Helm for Longhorn because it makes upgrades and configuration changes much cleaner than raw YAML manifests.
Step 1: Add the Longhorn Helm Repository
helm repo add longhorn https://charts.longhorn.io
helm repo update
Step 2: Install Longhorn
This command sets a few important defaults:
helm install longhorn longhorn/longhorn \
--namespace longhorn-system \
--create-namespace \
--version 1.10.1 \
--set defaultSettings.defaultReplicaCount=2 \
--set defaultSettings.storageMinimalAvailablePercentage=15 \
--set defaultSettings.storageOverProvisioningPercentage=100
What those settings mean:
- defaultReplicaCount=2: Each volume gets 2 copies by default (survives 1 node failure)
- storageMinimalAvailablePercentage=15: Stop scheduling replicas if a node has less than 15% free space
- storageOverProvisioningPercentage=100: Allow volumes totaling 200% of actual space (thin provisioning)
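If you want to double-check that the chart actually applied these, the values map to Longhorn Setting objects (kebab-case names), so something like this should confirm them:
kubectl get settings.longhorn.io default-replica-count -n longhorn-system
kubectl get settings.longhorn.io storage-minimal-available-percentage -n longhorn-system
kubectl get settings.longhorn.io storage-over-provisioning-percentage -n longhorn-system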
Step 3: Wait for Pods to Start
Longhorn deploys a bunch of components. Watch them come up:
kubectl get pods -n longhorn-system -w
You'll see:
- longhorn-manager: One pod per worker node (DaemonSet)
- longhorn-ui: The web dashboard (2 replicas)
- longhorn-driver-deployer: Sets up the CSI driver
- longhorn-csi-plugin: One per node for volume attachment
Wait until everything shows Running. On my 6-node cluster, this took about 3 minutes.
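If you'd rather not babysit the watch, a single kubectl wait does the same job (adjust the timeout for slower clusters):
kubectl wait pod --all -n longhorn-system \
  --for=condition=Ready \
  --timeout=10m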
The Hotfix You Need to Apply (Seriously, Don't Skip This)
Longhorn v1.10.1 shipped with some critical bugs (nil-pointer crashes, volume migration issues, and replica balancing stalls). The Longhorn team released hotfix-2 to address these, but the upgrade process is a little tricky because Longhorn blocks "downgrades" by default.
Here's the one-liner that works:
helm upgrade longhorn longhorn/longhorn \
--namespace longhorn-system \
--reuse-values \
--set preUpgradeChecker.upgradeVersionCheck=false \
--set image.longhorn.manager.tag=v1.10.1-hotfix-2
Why this works:
- --reuse-values: Keeps your existing config (replica count, storage settings, etc.)
- upgradeVersionCheck=false: Disables the version validation that would block the "downgrade"
- image.longhorn.manager.tag: Switches to the hotfix image
Wait for the rollout:
kubectl rollout status daemonset/longhorn-manager -n longhorn-system
Verify the hotfix is running:
kubectl get daemonset longhorn-manager -n longhorn-system \
-o jsonpath='{.spec.template.spec.containers[0].image}'
Should output: longhornio/longhorn-manager:v1.10.1-hotfix-2
Understanding Storage Classes: Why Longhorn Creates Two Automatically
When you install Longhorn via Helm, it automatically creates two StorageClasses:
kubectl get storageclass | grep longhorn
Output:
longhorn (default) driver.longhorn.io Delete Immediate true 5m
longhorn-static driver.longhorn.io Delete Immediate true 5m
Here's what they're for:
1. longhorn (The Default)
This is your go-to storage class. When you create a PVC without specifying a storageClassName, it uses this.
Default settings (from my installation):
- 3 replicas (survives 2 node failures)
- Delete reclaim policy (volume deleted when PVC is deleted)
- ext4 filesystem
Check the details:
kubectl get storageclass longhorn -o yaml
You'll see numberOfReplicas: "3" in the parameters section.
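Or skip scrolling through the full YAML and pull just the parameters map:
kubectl get storageclass longhorn -o jsonpath='{.parameters}{"\n"}'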
2. longhorn-static
This is for edge cases where you're manually creating Longhorn volumes and want to bind them to specific PVCs. I haven't needed this yet, but it's there if you're migrating from another storage system.
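For completeness, the rough shape of static provisioning looks like this. It's a hedged sketch rather than something from my cluster: the PV points at an existing Longhorn volume by name (here migrated-vol, a placeholder) and uses the longhorn-static class so a matching PVC can bind to it.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: migrated-vol-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn-static
  csi:
    driver: driver.longhorn.io
    volumeHandle: migrated-vol   # must match the existing Longhorn volume name
    fsType: ext4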
Creating Custom Storage Classes (When and Why)
The default longhorn class works great for most things, but I created a few custom classes for specific use cases:
| Class | Replicas | Reclaim Policy | When I Use It |
|---|---|---|---|
| longhorn (default) | 3 | Delete | General workloads (databases, app state) |
| longhorn-gp | 2 | Delete | Less critical data where I want to save space |
| longhorn-ha | 3 | Retain | Critical data that must survive PVC deletion |
| longhorn-io | 1 | Delete | Apps with built-in replication (Redis Cluster, Cassandra) |
Here's how I created them:
Creating a 2-Replica Class (Space-Optimized)
Save this as longhorn-gp.yaml:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-gp
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "2880"
  fsType: "ext4"
  dataLocality: "best-effort" # Try to keep one replica on the pod's node
Apply it:
kubectl apply -f longhorn-gp.yaml
Why 2 replicas? My worker nodes have limited disk space (workers 1-3 have ~20GB free each). Using 2 replicas instead of 3 saves 33% of storage while still surviving a single node failure.
Creating a Retain-Policy Class (Critical Data)
For databases I can't afford to lose accidentally, I created this:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-ha
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain # Volume survives even if PVC is deleted
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
  fsType: "ext4"
  dataLocality: "best-effort"
The Retain policy saved me once: I accidentally deleted a PVC for a test database. With Delete policy, the data would've been gone instantly. With Retain, the Longhorn volume still existed. I just recreated the PVC and bound it to the orphaned volume. Crisis averted.
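For the record, the rebinding went roughly like this (a sketch; <pv-name> and the PVC name are placeholders). With Retain, the PV is left in Released state and still references the deleted PVC, so you clear that reference and create a new PVC pinned to the PV by name:
# Find the Released PV left behind by the deleted PVC
kubectl get pv | grep Released
# Clear the stale claimRef so the PV becomes Available again
kubectl patch pv <pv-name> -p '{"spec":{"claimRef":null}}'
Then a new PVC that names the volume explicitly:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-db-data   # placeholder
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-ha
  volumeName: <pv-name>    # the Released PV from above
  resources:
    requests:
      storage: 10Gi        # no larger than the PV's capacity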
Creating a Single-Replica Class (Maximum Performance)
For applications that handle their own replication (like CockroachDB, Cassandra, or Elasticsearch), you don't need Longhorn to replicate. It's just overhead:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-io
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "1"
  staleReplicaTimeout: "2880"
  fsType: "ext4"
  dataLocality: "disabled"
Use case: I run a 3-node Redis Cluster. Each Redis instance uses longhorn-io because Redis itself handles replication. No point in Longhorn duplicating that work.
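For what it's worth, here's roughly how that looks in a StatefulSet. This is a trimmed-down sketch, not my actual Redis manifest: each pod gets its own single-replica Longhorn volume via volumeClaimTemplates.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cluster   # placeholder
spec:
  serviceName: redis-cluster
  replicas: 3
  selector:
    matchLabels:
      app: redis-cluster
  template:
    metadata:
      labels:
        app: redis-cluster
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: longhorn-io   # Redis handles replication itself
        resources:
          requests:
            storage: 5Gi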
Exposing the Longhorn UI
Longhorn includes a web UI for monitoring volumes, replicas, and node storage. By default, it's only accessible inside the cluster. I exposed it using MetalLB (my LoadBalancer solution from the NFS guide).
Create longhorn-ui-lb.yaml:
apiVersion: v1
kind: Service
metadata:
  name: longhorn-frontend-lb
  namespace: longhorn-system
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
      name: http
  selector:
    app: longhorn-ui
Apply and get the IP:
kubectl apply -f longhorn-ui-lb.yaml
kubectl get svc longhorn-frontend-lb -n longhorn-system
In my case, MetalLB assigned an IP from my configured pool (the EXTERNAL-IP column in the output above), and the Longhorn dashboard is now reachable at that address.
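No MetalLB? A port-forward against the bundled longhorn-frontend service gets you the same dashboard without exposing anything on your network:
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# Then browse to http://localhost:8080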
What you'll see in the UI:
- Dashboard: Storage usage across nodes
- Volume: List of all Longhorn volumes and their replica distribution
- Node: Health status and available space per worker
- Setting: Global Longhorn configuration
Testing Longhorn: Does It Actually Work?
Theory is great, but I wanted to see failover in action.
Test 1: Create a Volume and Write Data
Save as test-longhorn.yaml:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-longhorn-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn # Use the default class
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-longhorn-pod
  namespace: default
spec:
  containers:
    - name: test
      image: nginx:alpine
      volumeMounts:
        - name: data
          mountPath: /data
      command: ["/bin/sh"]
      args: ["-c", "echo 'Testing Longhorn replication' > /data/test.txt && tail -f /dev/null"]
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-longhorn-pvc
Apply and verify:
kubectl apply -f test-longhorn.yaml
# Wait for pod to be running
kubectl get pod test-longhorn-pod -w
# Check that data was written
kubectl exec test-longhorn-pod -- cat /data/test.txt
# Output: Testing Longhorn replication
Test 2: Check Replica Distribution
# See which node the pod is running on
kubectl get pod test-longhorn-pod -o wide
# Check replica distribution in Longhorn
kubectl get volumes -n longhorn-system
kubectl get replicas -n longhorn-system | grep test-longhorn
You should see 3 replicas spread across different worker nodes. That's Longhorn doing its job.
Test 3: Simulate Node Failure (The Real Test)
This is where it gets fun. I wanted to see if Longhorn actually handled failover.
What I did:
- Noted which node the pod was running on (k8s-worker-6)
- SSH'd into the worker and ran sudo shutdown now
- Watched what happened
What actually happened:
The process wasn't as instant as I expected, but the failover did work:
- Node goes down: Worker shut down immediately
- Detection delay (~2-3 minutes): Kubernetes marked the node as NotReady
- Pod eviction (5+ minutes): Kubernetes waited for the default pod eviction timeout
- Pod rescheduled: Kubernetes placed it on a healthy node
- Volume reattached: Longhorn reattached the volume with data intact
- Pod started: Application came back online
Manual intervention needed: Because the node went down hard, the volume was stuck "attached" to the dead node. I had to delete the stale VolumeAttachment to allow it to reattach elsewhere:
# Find the attachment
kubectl get volumeattachments | grep <volume-id>
# Delete it to allow reattachment
kubectl delete volumeattachment <attachment-name>
Total failover time: several minutes end to end, and most of that was the Kubernetes pod eviction timeout, not Longhorn.
I checked the data:
kubectl exec test-longhorn-pod -- sh -c "cat /data/test.txt"
# Output: Testing Longhorn replication
Still there. No data loss. The replicas on the remaining nodes (worker-4 and worker-5) kept the data safe.
Important notes about this test:
- I was testing with a standalone Pod. In production, you'd use a Deployment or StatefulSet, which automatically recreates pods when nodes fail.
- The manual VolumeAttachment deletion was needed because the node went down abruptly. In a graceful shutdown scenario, this cleanup happens automatically.
- Longhorn did its job perfectly: the data was intact and accessible once the volume reattached. The delays were all Kubernetes-side, which has conservative timeouts to avoid prematurely killing pods on nodes that might come back.
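Those Kubernetes-side timeouts are tunable per workload. If a pod should fail over faster than the default five-minute eviction, you can shorten it with tolerations; here's a sketch of what that looks like (the 60-second value is arbitrary, not what I actually run):
apiVersion: v1
kind: Pod
metadata:
  name: fast-failover-demo   # placeholder
spec:
  tolerations:
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 60   # evict after 60s instead of the default 300s
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 60
  containers:
    - name: app
      image: nginx:alpine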
Check replica status:
kubectl get replicas -n longhorn-system | grep <volume-name>
You'll see replicas marked as "stopped" on the failed node, and "running" on the healthy nodes. When the failed node comes back online, Longhorn automatically rebuilds the replica.
Cleanup
kubectl delete -f test-longhorn.yaml
Longhorn automatically deletes the volume (because we used the Delete reclaim policy).
Storage Comparison: What I Use Where
After running Longhorn for a couple of weeks alongside NFS and local-path, here's how my storage setup looks:
| Application | StorageClass | Why |
|---|---|---|
| PostgreSQL | longhorn | HA, survives failures |
| Redis Cluster | longhorn-io | Fast, app-level HA |
| Odoo App Data | nfs-unraid | Shared across pods |
| Odoo PostgreSQL | nfs-unraid-retain | DB data, must retain |
| Config Maps | nfs-unraid | Easy to edit from NAS |
| Build Caches | local-path | Speed, don't care |
| Prometheus Data | longhorn-gp | HA but not critical |
Current capacity:
- Longhorn: ~309GB distributed across 6 workers
- NFS-Unraid: 10TB+ (with parity protection)
- Local-path: ~40GB per node (fast NVMe, no redundancy)
Monitoring Disk Usage: The One Thing That'll Bite You
Longhorn uses /var/lib/longhorn on each worker node by default. That's your root disk. I learned this the hard way when worker-2 started throwing disk pressure warnings.
My worker node disk situation:
- Workers 1-3: 40GB total (~20GB free after OS)
- Workers 4-6: 100GB total (~80GB free after OS)
Longhorn won't schedule replicas on nodes below 15% free space (our setting from installation). Keep an eye on this, especially if you're running on smaller VMs like I am.
Check disk usage from the Longhorn UI or:
kubectl get nodes.longhorn.io -n longhorn-system -o wide
Pro tip: If you have dedicated disks on your worker nodes, you can configure Longhorn to use those instead of the root disk. Check the Longhorn docs for "multiple disks per node."
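The Helm chart also exposes a default data path, so if each worker has a dedicated disk mounted somewhere like /mnt/longhorn (an assumption, adjust for your layout), pointing Longhorn at it can be as simple as:
helm upgrade longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --reuse-values \
  --set defaultSettings.defaultDataPath=/mnt/longhorn
Note this only applies to nodes and disks Longhorn registers afterwards; disks already configured on existing nodes stay where they are.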
Troubleshooting: Issues I Ran Into
Volume Stuck in "Attaching"
Symptom: Pod stays in ContainerCreating, volume never attaches.
What I did:
kubectl describe pvc <pvc-name>
kubectl describe volume <volume-name> -n longhorn-system
kubectl logs -n longhorn-system -l app=longhorn-manager --tail=100
Root cause (in my case): Worker node had run out of disk space. Longhorn couldn't create the replica.
Fix: Freed up space by cleaning old container images (docker system prune -a).
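My nodes still had Docker installed from an earlier life. If yours are containerd-only (typical for a recent kubeadm cluster), the equivalent cleanup would be something along these lines:
# Remove container images not referenced by any running container
sudo crictl rmi --prune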
Replicas Not Scheduling
Symptom: Volume created, but only 1 replica instead of 3.
What I did:
kubectl get nodes.longhorn.io -n longhorn-system
Root cause: Two of my worker nodes had less than 15% free space (below the threshold).
Fix: Adjusted the threshold temporarily:
kubectl edit settings.longhorn.io storage-minimal-available-percentage -n longhorn-system
# Changed from 15 to 10
Not ideal long-term, but it got me through until I expanded disk space on those VMs.
What's Next: Stateful Workloads and Failure Scenarios
At this point, you've got:
- NFS storage for shared files (Part 1)
- Longhorn for distributed, fault-tolerant block storage (this post)
- Multiple storage classes for different use cases
Now comes the interesting part: running actual stateful applications on Longhorn and seeing how it handles real-world failure scenarios.
Things I'm planning to cover in future posts:
- Stateful applications: Running PostgreSQL, Redis, and other databases on Longhorn
- Failure testing: What happens when a node crashes during a database write operation?
- Volume snapshots: Creating point-in-time backups before risky operations (like schema migrations or major upgrades)
- Backup strategies: Exploring Longhorn's backup feature with S3-compatible storage (Garage on my Unraid server rather than MinIO, because MinIO went full enterprise and abandoned their community, but hey, they'll still accept your bug reports on Slack!)
- Storage performance: How does Longhorn compare to local-path for database workloads?
I'm still learning how these pieces work together in production. Some of this might be Part 3, some might be separate deep dives. We'll see what breaks first and what's worth writing about.
Questions or issues? Drop a comment below. I'm happy to help troubleshoot. My cluster is still evolving, and I learn something new every week.
Cluster specs (for reference):
- 7 nodes: 1 control plane + 6 workers
- Kubernetes v1.34.2
- Proxmox VMs across 2 physical servers
- Calico CNI, MetalLB, NGINX Ingress
- Storage Classes:
- Longhorn: longhorn (default), longhorn-gp, longhorn-ha, longhorn-io, longhorn-static
- NFS-Unraid: nfs-unraid, nfs-unraid-retain
- Local: local-path