
I've created a Deployment that can contain anywhere between 2 and ~25 containers, all working on a slice of a single larger logical unit of work. The containers peak anywhere from 700MB to 4GB of RAM, and my initial approach was to request 1G and limit to 4G. In the worst case (where more containers land at the 4GB end than the 700MB end), this takes a node down (or never schedules in the first place), even while 300-400% of the needed resources are free in aggregate elsewhere in the cluster.
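For reference, the resource split in question looks roughly like this; a minimal sketch with placeholder names and image, written in current apps/v1 syntax rather than whatever a 1.2-era manifest would have used:

```yaml
# Hypothetical Deployment showing the 1G request / 4G limit split
# described above; name, image, and replica count are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
spec:
  replicas: 25
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      containers:
      - name: worker
        image: example/worker:latest
        resources:
          requests:
            memory: "1Gi"   # what the scheduler bin-packs against
          limits:
            memory: "4Gi"   # what a container may actually grow to
```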

Watching a container or two slowly creep up in RAM and OOM the node down, rather than having the scheduler pluck the container off and relocate it, seems like a pretty glaring stability concern.

Having dug through literally years of debates on GitHub, the documentation, and the code itself, it's still unclear to me at which level of abstraction upon abstraction the scheduler even spreads containers at launch, or whether K8S takes any proactive steps at all once work has been deployed.

If a ReplicaSet (I believe that's the new, improved ReplicationController) will only keep reanimating containers until it kills the host, you have to set hard worst-case requests on every pod under its remit. For the larger jobs we run as a Deployment, that means 50%+ of RAM is wasted on over-provisioning 'just in case'.
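Concretely, those "hard worst-case requests" amount to setting the request equal to the limit; a sketch of just the container spec, with placeholder values:

```yaml
# Hypothetical worst-case sizing: request == limit, so the scheduler
# reserves the full 4Gi even when a container only ever uses ~700Mi.
containers:
- name: worker
  image: example/worker:latest
  resources:
    requests:
      memory: "4Gi"
    limits:
      memory: "4Gi"
```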

Isn't keeping over-provisioned resources around one of the problems we're trying to solve here?

I've used quite a few schedulers/resource managers over the years and don't recall a case where one job step (container, or whatever the analogy would be) was allowed to compromise the host itself rather than being force-migrated or just marked ineligible for scheduling outright.

Even though the docs advise against it, naked pods, or a 1 pod : 1 ReplicaSet mapping, seem to be the only way to keep work distributed (assuming containers checkpoint and exit often enough for the overall resource picture to be reconsidered).

I should also mention that this is hosted Google Container Engine (v1.2.2), and given what looked like several pages of flags one can launch K8S with, it's unclear whether this is an inherent issue, user error, or just how GKE has configured K8S. I'm really hoping for user error on this one.


1 Answer

To answer my own question, based on input from some quite helpful folks on the Kubernetes Slack channel:

-- My experience of a node failing because of containers OOM'ing is likely due to a secondary effect, as the resource manager is designed to prevent exactly this. The suggested culprit was actually the I/O subsystem becoming overloaded to the point of destabilizing the node, which, after some measurements, looks very likely.

In GKE, the OS, Docker, K8S, and any temporary directories the pods request are all on one non-local 100GB (by default, I believe) ext4 filesystem.

Most of the pods we'd spun up were requesting and writing to temporary directories, and the collective I/O overwhelmed the system to the point of becoming unresponsive and, in our case, locking up the OS itself.
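For anyone unfamiliar, this is the pattern I mean; a sketch with placeholder names, where the emptyDir scratch volume is backed by the node's boot filesystem unless you arrange otherwise:

```yaml
# Hypothetical pod spec: the emptyDir scratch volume lives on the
# node's single shared boot disk, so heavy writes here compete with
# the OS, Docker, and K8S logging for the same I/O.
apiVersion: v1
kind: Pod
metadata:
  name: scratch-writer
spec:
  containers:
  - name: worker
    image: example/worker:latest
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir: {}
```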

-- As an initial test, I set up my own K8S cluster with the OS on its own ext4 drive and Docker and ephemeral space in their own ZFS pools; the same deployment manifests do stress it, but come nowhere close to crashing the OS.

-- A workaround that has been proposed but not yet tested is to use Jobs and manage the dependencies between them with some coordinating process, presumably because this would spread the individual containers across the cluster. This may work, but it strikes me as papering over an underlying issue.
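For what it's worth, a sketch of what one of those Jobs might look like; name and image are placeholders, and the coordinating process that sequences the slices is left out entirely:

```yaml
# Hypothetical Job for one slice of the work; a coordinating process
# would create one of these per slice and watch for completion.
apiVersion: batch/v1
kind: Job
metadata:
  name: work-slice-1
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: example/worker:latest
        resources:
          requests:
            memory: "1Gi"
          limits:
            memory: "4Gi"
```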

While I've not yet measured assigning persistent disks for the scratch space we were using emptyDir for, I'm assuming this would also lessen the load on the primary disk and may be enough to mask the problem.
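The substitution I have in mind is roughly the following; a sketch assuming a PersistentVolumeClaim provisioned as a GCE persistent disk, with placeholder name and size:

```yaml
# Hypothetical claim for scratch space on its own persistent disk,
# so scratch I/O no longer shares the boot filesystem.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scratch-pd
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
```

The pod's volumes entry would then reference persistentVolumeClaim: claimName: scratch-pd where emptyDir: {} sits today.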

Unfortunately, the default GKE setup assumes sda will be able to handle the entire load of the OS, K8S logs, Docker, and scratch space, which apparently works for most folks, as I couldn't find another issue quite like ours.

Coming from bare metal, I'd hoped to avoid some of this low-level detail by having the cluster managed, but both Dataproc and GKE, so far at least, have me leaning heavily towards building the clusters out myself.

Hopefully this will help someone whose workload is amenable to the Job pattern or to mostly using provisioned disks.

I'm surprised any best practice would expect so much of the boot drive, and I'll flag this with support, as even 'regular' Compute Engine seems to discourage it given the default boot drive sizes.