I've created a Deployment which can contain anywhere between 2 and ~25 containers, all working on a slice of a single larger logical unit of work. The containers peak at anywhere from 700MB to 4GB of RAM, and my initial approach was to request 1G and limit at 4G. In the worst-case scenario (where more containers land near 4GB than stay near 700MB) this takes a node down (or the work won't schedule to begin with), even while 300-400% of the required resources are free in aggregate elsewhere in the cluster.
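For concreteness, the Deployment looks roughly like this (names, image and replica count are placeholders; the resource numbers are the real ones):

```yaml
apiVersion: extensions/v1beta1   # K8S 1.2.x
kind: Deployment
metadata:
  name: worker                   # placeholder
spec:
  replicas: 10                   # anywhere from 2 to ~25 in practice
  template:
    metadata:
      labels:
        app: worker
    spec:
      containers:
      - name: worker
        image: gcr.io/my-project/worker:latest   # placeholder
        resources:
          requests:
            memory: 1Gi          # typical steady-state footprint is ~700MB
          limits:
            memory: 4Gi          # observed worst-case footprint
```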
Watching a container or two slowly creep up in RAM and OOM the node, rather than having the scheduler pluck them off and relocate them, seems like a pretty glaring stability concern.
Having dug through literally years of GitHub debates, the documentation and the code itself, it's still unclear to me at which of the many layers of abstraction the scheduler actually spreads containers at launch, or whether K8S takes any proactive steps at all once work has been deployed.
If a ReplicaSet (I believe that's the new, improved ReplicationController) will only keep reanimating containers on the same node until it kills the host, then you have to set hard worst-case requests on every pod under its remit. For the larger jobs we run as a Deployment, this introduces a 50%+ RAM loss to over-provisioning 'just in case'.
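In other words, the only setting I've found that can't overcommit a node is to request the worst case up front, so the resources block above becomes something like:

```yaml
        resources:
          requests:
            memory: 4Gi   # reserve the worst case for every pod...
          limits:
            memory: 4Gi   # ...even though most of them sit near 700MB
```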
Isn't keeping around over-provisioned resources one of the problems we're trying to solve here?
I've used quite a few schedulers/resource managers over the years and don't recall a case where one job step (container, or whatever the analogy would be) was allowed to compromise the host itself rather than being force-migrated or simply marked ineligible for scheduling.
Even though the docs advise against the idea, naked pods, or 1 pod : 1 ReplicaSet, seem like the only way to keep work distributed (assuming the containers checkpoint and commit suicide often enough for the overall resource picture to be reconsidered).
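That is, something along these lines, one naked pod per slice of work, relaunched as pods exit (names and image are again placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-slice-07          # one naked pod per slice, placeholder name
spec:
  restartPolicy: OnFailure       # the container checkpoints and exits on its own
  containers:
  - name: worker
    image: gcr.io/my-project/worker:latest   # placeholder
    resources:
      requests:
        memory: 1Gi
      limits:
        memory: 4Gi
```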
I should also mention that this is the hosted Google Container Engine (v1.2.2), and given what looked like several pages of flags one can launch K8S with, it's unclear whether this is an inherent issue, user error, or just how GKE has configured K8S. I'm really hoping for user error on this one.