
TL;DR

A client has asked me to build a robust system to run containerized microservices within their LAN.

Restriction: they are giving me 2 machines and 6 data-disks, and no more hardware. I have to build "the best I can" with that.

My worries are storage and availability. Speed/performance is not an issue.

I'm thinking of:

  • On each machine, build a RAID-5 using the 3 data-disks, yielding one data volume per machine. ZFS, for example.
  • Tie the 2 machines together with a distributed filesystem. GlusterFS, for example.
  • Then use Kubernetes to create a 2-node cluster whose Persistent Volumes point to the distributed FS.
  • The fact that the Kubernetes cluster runs on the same hardware as the distributed filesystem is a mere coincidence.

The question is: is there any better solution given the client's restrictions?

Read the context to understand more.

Context

I'm designing a server architecture to run about 30 microservices locally for a radio station. No AWS, no cloud. We are talking about an on-premises server.

For the whole scenario speed is not an issue (traffic is low). The business drivers here are:

  • Data persistence (minimize the risk of losing data).
  • High availability (minimize the risk of a downtime).

If at any point they conflict, avoiding data loss takes precedence over high availability: in the limit, I can tell the journalists to stop working for a few minutes, but we can't lose the interview that was recorded earlier this morning.

What they currently run (don't cry, please!!)

They currently run everything on one server, with no containers and no data redundancy beyond backups. They experienced a disaster in 2018 and it took them 2 full days to recover. The radio had to stop all the employees from working, reinstall the full OS, reinstall all the applications by hand, recover all the data from the backups, test everything and then tell the journalists "back to writing news".

Of course this is not acceptable these days (it was not even acceptable in 2018).

The hardware they have for this project

To overcome this, they recently bought 2 servers, each with 1 system disk + 3 data disks (6 data-disks in total). Their initial idea for the data-disks was to build a local software RAID-5 across the 3 data-disks within each server.

Call the servers alpha and beta. The two machines are identical in CPU, RAM and system disk, as well as in the 3 data-disks, so the computers are exact clones. They will both run Ubuntu Linux.

Master-slave (client's idea) vs cluster (my proposal)

Their idea is to use alpha as the main server and make beta a "clone" of alpha, so that if alpha dies they can switch the clients over to beta in half an hour by manually reconfiguring them to point to the other IP.

Nevertheless, I think current technologies should let me create some sort of cluster in which both machines are alive and fully in sync, so that if either one breaks the clients experience zero downtime.

Their request

The radio station initially asked me to build one RAID on alpha via ZFS and another RAID on beta via ZFS, set up a bunch of Docker containers on alpha with --restart=always, and then point the clients of the services (running on the journalists' respective PCs) to alpha (think of services such as news writing, image uploading, audio recording, program scheduling, web publishing, media transcoding, local live-stream feed to the cloud, etc.).

About the storage, their initial thought was:

  • For photos and audios, make regular backups from alpha to beta.
  • For MySQL, have a master-master setup between alpha and beta, so beta mostly acts as a slave but is ready to be used in case alpha dies.

Then if alpha breaks, switch all the clients to beta.

I'm specifically interested in the storage part of the story.

What I'm thinking

Instead of "manually switching clients", I was thinking of using Kubernetes to build a cluster of 2 worker nodes. As I can't have separate hardware to act as the "Kubernetes master", I was thinking of making alpha and beta both act as redundant Kubernetes masters in addition to being workers.

So alpha would be a Kubernetes master for the alpha and beta nodes, and beta would be a redundant Kubernetes master for both alpha and beta as well.
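
A minimal sketch of how that stacked control plane could be bootstrapped with kubeadm (the endpoint name and the placeholder values are my assumptions, not something already decided; note that with only 2 control-plane nodes etcd has no tie-breaker, so losing either node can leave the control plane unavailable until it comes back):

```bash
# On alpha: create the first control-plane node.
# "cluster-endpoint" is an assumed DNS name / floating IP for the API server.
kubeadm init --control-plane-endpoint "cluster-endpoint:6443" --upload-certs

# On beta: join as a second control-plane node, using the values printed by the init above.
kubeadm join cluster-endpoint:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <key>

# Allow both control-plane nodes to also run regular workloads,
# since they are the only workers we have (the taint key differs on older releases).
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
```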

Up to here no problem.

Until we arrive at the storage.

Storage, persistent volumes, GlusterFS and ZFS

When it comes to Persistent Volumes in Kubernetes, the users launching their pods/containers need to trust that the data will not be lost. Some system administrator (in this case me) needs to "build" the redundancy underneath to ensure the volume "is" there with the proper data.

So this is what I was thinking:

1. A local ZFS layer

  • In the operating system of alpha (native to the system, forget Kubernetes for a second), use ZFS to build a RAID across the 3 data-disks (equal in size). So if each disk is, say, 1TB, there are 3TB in total, of which 2TB will be available in the data volume and 1TB is used under the hood for redundancy. Let's call the disks A1, A2 and A3, and the ZFS volume A.

  • In beta, replicate the structure. Disks B1, B2, B3. Let's call the ZFS volume B.

Up to here I'd have 2 independent servers, each protected against the failure of a single disk. No protection against a simultaneous 2-disk failure. No protection against a full node going down.
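
A minimal sketch of that layer on alpha (the pool name, dataset and device paths are assumptions; in practice I'd use the /dev/disk/by-id names), with the same structure repeated on beta as tank_b:

```bash
# Hypothetical device names for A1, A2 and A3; prefer /dev/disk/by-id paths in production.
zpool create tank_a raidz /dev/sdb /dev/sdc /dev/sdd

# A dataset that will later hold the GlusterFS brick.
zfs create -o mountpoint=/tank_a/brick tank_a/brick
zfs set compression=lz4 tank_a

# Verify the RAID-Z vdev and disk health.
zpool status tank_a
```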

2. A distributed GlusterFS layer

  • Then create GlusterFS across alpha and beta on top of the ZFS volumes. I understand that GlusterFS can give me some sort of mirroring configuration, so that the ZFS volumes A and B are mirrors of each other. So if A is 2TB and B is 2TB, the "total available storage" is also 2TB.

  • So adding up GlusterFS and ZFS at this point, out of the 6TB of total hardware capacity, 2TB are available to users and the other 4TB act as redundancy.

Up to here, I should have a "distributed disk" with much more redundancy, tolerating the failure of 2 disks and also a node failure.
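
A minimal sketch of that GlusterFS layer, assuming the ZFS datasets from step 1 are mounted at /tank_a/brick on alpha and /tank_b/brick on beta (the volume name and paths are assumptions):

```bash
# On both nodes
apt install glusterfs-server
systemctl enable --now glusterd

# On alpha: form the trusted pool and create a 2-way replicated volume.
gluster peer probe beta
gluster volume create gvol0 replica 2 \
    alpha:/tank_a/brick/gvol0 beta:/tank_b/brick/gvol0
gluster volume start gvol0

# Mount it through the native FUSE client (on both nodes, or on any client machine).
mount -t glusterfs alpha:/gvol0 /mnt/gvol0
```

Note that Gluster warns when creating a replica-2 volume precisely because it is prone to split-brain; a small arbiter brick on a third box (replica 3 arbiter 1) would remove that risk, which connects with the answer below.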

I see the protection against 2 disks failing working in the following manner:

  1. If the two failed disks belong to different volumes (say A2 and B3 fail), then each ZFS pool separately protects against that, and neither ZFS volume A nor B is disrupted (GlusterFS sees no change).
  2. If the 2 failing disks belong to the same node, then the whole volume fails. For example, a failure of A1 and A2 breaks A. But GlusterFS should be able to fall back to using "only 1 node" until the other becomes available (in this case, "use only B until A comes back again").

3. Kubernetes container runtime for service cluster + Kubernetes Persistent Volumes.

  • Finally, the Kubernetes Persistent Volumes would point to the GlusterFS volumes.

  • If I had 4 machines, I'd probably use 2 as Kubernetes nodes and 2 as networked storage for the cluster.

  • But we only have 2 physical machines, so Kubernetes will point its "persistent volumes" to "GlusterFS" exactly as if it were "another remote machine", staying agnostic to the fact that the volumes physically live on the same nodes (see the sketch after this list).
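
For illustration, a minimal sketch of that wiring (names, addresses and sizes are assumptions; also note that the in-tree glusterfs volume plugin shown here was deprecated and removed in recent Kubernetes releases, where a CSI driver or a plain hostPath on the locally mounted Gluster volume would be used instead):

```bash
# Hypothetical manifest: expose the GlusterFS volume gvol0 as a Persistent Volume.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Endpoints
metadata:
  name: glusterfs-cluster
subsets:
  - addresses:
      - ip: 192.168.1.11   # alpha (assumed LAN address)
      - ip: 192.168.1.12   # beta  (assumed LAN address)
    ports:
      - port: 1            # placeholder, required but unused
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-gvol0
spec:
  capacity:
    storage: 2Ti
  accessModes:
    - ReadWriteMany
  glusterfs:
    endpoints: glusterfs-cluster
    path: gvol0
    readOnly: false
EOF
```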

Question

I have never physically built a construct like this before. On paper, it works. I wonder if reality is different.

Given the constraints (2 machines, 6 data-disks), the question is:

  • Is this topology the best way to create a mini-cluster with zero downtime and data redundancy for the client?
  • If not, what changes should I apply, and why?
Xavi Montero

1 Answer


When you do clustering, you have to think about split-brain. For that you need 3 nodes.

I would prefer RAID10 instead of RAID5 (RAID-Z); in the case of ZFS, mostly for performance.

For MySQL/MariaDB I would use the Galera plugin for replication.
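
For example, a minimal Galera sketch for a two-node MariaDB cluster (the file path, cluster name and node names are assumptions; with only 2 nodes you would normally also run a garbd arbitrator elsewhere to keep quorum):

```bash
# On both nodes (MariaDB ships the Galera provider on Ubuntu).
cat > /etc/mysql/mariadb.conf.d/60-galera.cnf <<'EOF'
[galera]
wsrep_on                 = ON
wsrep_provider           = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name       = radio-cluster
wsrep_cluster_address    = gcomm://alpha,beta
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2
EOF

galera_new_cluster        # on alpha only: bootstrap the first node
systemctl start mariadb   # on beta: join the running cluster
```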

You will need cluster management software like ClusterLabs Pacemaker.
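
A minimal sketch with pcs (the cluster name and floating IP are assumptions, and the exact pcs syntax differs between versions; on a 2-node cluster you also need corosync's two-node quorum handling plus working fencing):

```bash
# On both nodes
apt install pacemaker corosync pcs
systemctl enable --now pcsd

# On one node: authenticate, create and start the cluster (pcs 0.10+ syntax).
pcs host auth alpha beta
pcs cluster setup radio-cluster alpha beta
pcs cluster start --all

# A floating service IP the journalists' clients point to;
# Pacemaker moves it to the surviving node on failure.
pcs resource create service-ip ocf:heartbeat:IPaddr2 \
    ip=192.168.1.10 cidr_netmask=24 op monitor interval=30s
```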

For storage I would also consider Ceph.

But there is another aspect of this setup: complexity. Where do you test it? Do you plan to automate the installation? Will your automation allow you to install the setup on VMs? How do you plan to configure fencing? Will you use a storage VLAN?

Do you plan to use a load balancer (e.g. HAProxy)?

Network redundancy? LACP, Spanning tree, OSPF/BGP...

What is the server load like? Maybe you can install the whole setup in VMs. You would still need 3 physical hosts, but you would have more flexibility.

And you need to write down documentation and scripts for various failure scenarios, including those caused by human errors.

With only 2 machines, for written data (storage, database) it's better to do a master-slave config where you write only on the master and keep the slave as a backup. Stateless services you can configure in active-active mode.

Mircea Vutcovici