
I'm making a Gedankenexperiment about deploying Postgres-XL on Kubernetes (k8s) where each datanode uses local, directly attached storage (DAS).


Imagine we have the following nodes:

  • 2x high-end machines with blazingly fast Optane DC SSDs and NVDIMMs. These nodes are registered with k8s and labelled as type: datanode.
  • 1x machine registered with k8s and labelled as type: GTM.
  • 2x machines registered with k8s and labelled as type: coordinator.

Let's say that each label creates a "node group" (e.g. the type: datanode label makes a group with two nodes).
Let's also suppose that each node has a /data mount point (in the host OS) that is mapped to its underlying fastest disk (or LVM logical volume).

I would deploy an architecture like this one:

                        --------------
                        |   gtm_0    |
                        --------------
                       / |          | \
                     /   |          |   \
                   /     |          |     \
                 /       |          |       \
               /         |          |         \
             /           |          |           \
           /             |          |             \
         /               |          |               \
       /       ------------        ------------      \
      |        | coord_0  |        | coord_1  |       |
      |        ------------        ------------       |
      |       /             \    /             \      |
      |     /                 \/                 \    |
------------      ------------/\------------      ------------
|  data_0  |     /                          \     |  data_1  |
------------ ----                            ---- ------------

This layout comes from a postgres-xl docker-compose setup.

I want to eventually scale each of the three node categories horizontally, and I need a stable name for each pod, so I would create three StatefulSets (each backed by the same headless service): ss-datanodes, ss-gtms and ss-coords. In each StatefulSet the pod template would select only the relevant nodes (e.g. the type: datanode nodes for the ss-datanodes pods) and use the appropriate image (e.g. the GTM image for the ss-gtms set).
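A minimal sketch of what the ss-datanodes StatefulSet could look like; the service name, labels, image name and port are assumptions for illustration, not tested values:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ss-datanodes
spec:
  serviceName: pgxl              # the shared headless Service (assumed name)
  replicas: 2
  selector:
    matchLabels:
      app: pgxl-datanode
  template:
    metadata:
      labels:
        app: pgxl-datanode
    spec:
      nodeSelector:
        type: datanode           # pin pods to the datanode node group
      containers:
        - name: datanode
          image: postgres-xl-datanode:latest   # hypothetical image name
          ports:
            - containerPort: 5432
```

The nodeSelector is what restricts each StatefulSet to its node group; the other two sets would differ only in the selector value and the image.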

This would allow k8s to "shuffle" the pods within the same node group.

However I don't want to use any NAS, each pod must only use DAS to maximize performance.
To do so I would create a StorageClass with no provisioner and with volumeBindingMode set to "WaitForFirstConsumer".
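Such a StorageClass might look like the following sketch (the name local-das is an assumption):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-das
provisioner: kubernetes.io/no-provisioner  # no dynamic provisioning
volumeBindingMode: WaitForFirstConsumer    # bind only once a pod is scheduled
```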
I would then create three PersistentVolumes; these PVs would be of type local, point to the /data mount point, and differ only in their nodeAffinity.
I would also create three PersistentVolumeClaims (PVCs) matching the companion PVs.
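One of those local PVs could be sketched like this; the PV name, capacity, StorageClass name and hostname are illustrative assumptions:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-datanode-0
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-das        # assumed StorageClass name
  local:
    path: /data                      # the host's fast-disk mount point
  nodeAffinity:                      # the only field that differs between PVs
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - datanode-host-0    # hypothetical node name
```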

Finally, I would add the PVCs to the pod templates in the StatefulSets.
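In a StatefulSet this is usually expressed with volumeClaimTemplates, which generate one PVC per pod replica with a stable name; a sketch, assuming a StorageClass named local-das:

```yaml
  # inside spec: of the ss-datanodes StatefulSet
  volumeClaimTemplates:
    - metadata:
        name: data                    # mounted by the container at its data dir
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: local-das   # assumed StorageClass name
        resources:
          requests:
            storage: 500Gi
```

With WaitForFirstConsumer binding, each generated PVC stays Pending until its pod is scheduled, and then binds to the local PV on that node.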

This should allow k8s to:

  • Schedule the components (GTM, Coordinator, DataNode) only in their node group.
  • Shuffle the pods in their node group.
  • Let the /data mount point be available on each pod.

I'm making the following assumption about PostgresXL:

  • The components can be brought up in any order (e.g. coord_0, data_1, gtm_0, data_0, coord_1).
  • Each component can handle the data written by another component of the same kind (e.g. the data_0 and data_1 pods could be swapped, each reusing the other one's data, and keep working). This doesn't seem unreasonable, as the only identity a pod keeps is its hostname.

Would this setup work for PostgresXL and what do I get by using k8s?

One thing k8s is useful for is the Horizontal Pod Autoscaler (HPA), which could be used to automatically add another datanode machine and scale ss-datanodes up.
However, all over the Internet I read that state management is hard within k8s. But this setup isn't hard, so I must be missing something.
