Posting this answer as the community wiki as the underlying topic of the question could be a bit wide.
Feel free to expand it.
Why can a GKE cluster lose data?
Without specific information on how exactly the application/workload was deployed on a GKE cluster, it could be hard to pinpoint the actual issue.
It's worth mentioning the following things:
- Workloads that are expected to store data (like databases) should be using Persistent Volumes. In case of a node failure, the data stored on a PV will not be lost, as it is stored on a different entity.
PersistentVolume
resources are used to manage durable storage in a cluster. In GKE
, a PersistentVolume
is typically backed by a persistent disk.
Cloud.google.com: Kubernetes Engine: Docs: Concepts: Persistent Volumes
There is a guide for deploying WordPress on GKE with Persistent Disks and Cloud SQL. It can be used as an example of deploying a workload with a PVC (Persistent Disk):
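As a minimal illustrative sketch (the names `example-pvc`, `example-pod` and the paths are assumptions, not taken from the question), a workload can request durable storage with a `PersistentVolumeClaim` and mount it:

```yaml
# A PersistentVolumeClaim; in GKE the default StorageClass
# dynamically provisions a Compute Engine persistent disk for it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc        # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
# A Pod mounting the claim. Data written under /var/lib/data
# survives node failures and re-creations because it lives on
# the persistent disk, not on the node's boot disk.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod        # hypothetical name
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /var/lib/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: example-pvc
```

A database chart or StatefulSet would mount its data directory the same way, via a `persistentVolumeClaim` volume.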
Modifications on the boot disk of a node VM do not persist across node re-creations. To preserve modifications across node re-creation, use a DaemonSet.
Cloud.google.com: Kubernetes Engine: Docs: How to: Node auto upgrade: Overview
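A minimal sketch of that approach (the name, image and command below are assumptions): a DaemonSet schedules a Pod on every node, including freshly re-created ones, so it can re-apply node-level modifications automatically:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-setup         # hypothetical name
spec:
  selector:
    matchLabels:
      app: node-setup
  template:
    metadata:
      labels:
        app: node-setup
    spec:
      containers:
        - name: setup
          image: busybox
          # Re-applies a node-level tweak and then stays running;
          # whenever a node is re-created, the DaemonSet starts a
          # new Pod there and the tweak is applied again.
          command: ["sh", "-c", "echo applying node tweak && sleep infinity"]
```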
Referring to the question asked:
I am new to GCP so pardon the ignorance.
I encourage you to visit the official documentation of GCP and GKE. You can find there a lot of information, guides and examples to follow:
Each node has a 100GB standard persistent disk allocated.
These disks are specifically used as boot disks for a GKE node and they shouldn't be used as a place to store data. You can use Persistent Volumes as mentioned earlier, or opt for a local SSD, about which you can read more by following the link below:
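For example (the cluster and pool names below are placeholders), a node pool whose nodes each get a local SSD can be created with `gcloud`:

```shell
# Create a node pool with one local SSD per node.
# Note: data on a local SSD is ephemeral and is lost when the
# node is deleted or repaired; use Persistent Volumes for data
# that must survive node re-creation.
gcloud container node-pools create ssd-pool \
    --cluster=my-cluster \
    --local-ssd-count=1
```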
However, I find every so often (has happened at least 3 time since august) that I boot up and the data is lost
A GKE cluster and its nodes cannot be turned off. What you can do is reduce (scale down) the number of nodes in a node pool. Did you mean that you connect to it?
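Scaling a node pool down (for example to zero, the closest equivalent of "turning off" the workers) can be done with `gcloud`; the cluster and pool names here are placeholders:

```shell
# Scale the default node pool of a cluster down to 0 nodes.
# The control plane keeps running; scale back up when needed.
gcloud container clusters resize my-cluster \
    --node-pool=default-pool \
    --num-nodes=0
```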
any firewall rules that had been put in place are reset to default.
You shouldn't reconfigure firewall rules on a GKE node itself. Instead you should use the GCP firewall located in Cloud Console (Web UI) -> VPC Network -> Firewall. A node re-creation due to a node upgrade or failure will reset any firewall rules set directly on the node.
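For example (the rule name, network, port and source range below are assumptions), a VPC-level rule created this way survives node re-creation, unlike rules set on the node itself:

```shell
# Create a firewall rule at the VPC level instead of editing
# iptables on a node; VPC rules are not affected when GKE
# re-creates a node.
gcloud compute firewall-rules create allow-my-app \
    --network=default \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:8080 \
    --source-ranges=0.0.0.0/0
```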
How can I:
- Stop the data in the DB from being erased
- prevent the firewall rules from being reset
Is this due to infrastructure upgrading?
You could consider (depending on your exact use case) using a GCE instance instead of a GKE cluster. GKE is a managed Kubernetes cluster designed to run containerized workloads, and some parts of it (for example, the control plane) are managed by Google.
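For completeness (the instance name, zone and machine type are placeholders), a standalone Compute Engine VM can be created like this; unlike a GKE node, you manage it yourself, can stop and start it, and changes on its disk persist across reboots:

```shell
# Create a standalone Compute Engine VM.
gcloud compute instances create my-vm \
    --zone=us-central1-a \
    --machine-type=e2-medium
```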
As for infrastructure upgrading, you could take a look at what happens when a cluster is upgraded by following the links below:
Additional reference: