0

So ultimately what happens is that everything works fine, sometimes for days. However, once in a while when I deploy my code (contained within its own Docker container, with the images stored on Docker Hub), Kubernetes crashes, which causes everything else to crash. I haven't been able to find any rhyme or reason to it, and for the most part I've yet to find anything that actually fixes the issue. Usually it just starts working again for whatever reason, though at least once I deleted the whole instance group and started over, which worked.

Now, when I do a deployment, all I do is run the kubectl set image deployment command. It works most of the time, but once in a while weird stuff happens.
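For reference, it's just the standard image update, something like the following (the deployment name, container name, and image tag here are placeholders, not my actual ones):

```shell
# Update the container image on an existing deployment.
# "my-app" and "myuser/my-app:v2" are placeholder names.
kubectl set image deployment/my-app my-app=myuser/my-app:v2

# Optionally watch the rollout to see whether it actually completes
kubectl rollout status deployment/my-app
```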

More specifically, the weird stuff that happens is that if I try to go to https://<master node>/ui I get an error like this:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "no endpoints available for service \"kubernetes-dashboard\"",
  "reason": "ServiceUnavailable",
  "code": 503
}
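For anyone hitting the same error, a few commands that should show whether the dashboard pods are actually running and backing the service (the label selector may differ depending on your dashboard version, so treat it as an assumption):

```shell
# Check whether the dashboard service has any ready endpoints behind it.
# "no endpoints available" means this list is empty.
kubectl get endpoints kubernetes-dashboard --namespace=kube-system

# Look at the dashboard pods themselves; "describe" shows recent events
# such as restarts, evictions, or failed scheduling.
kubectl get pods --namespace=kube-system -l app=kubernetes-dashboard
kubectl describe pods --namespace=kube-system -l app=kubernetes-dashboard
```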

This is the output of kubectl cluster-info:

Kubernetes master is running at https://104.198.207.42
GLBCDefaultBackend is running at https://104.198.207.42/api/v1/proxy/namespaces/kube-system/services/default-http-backend
Heapster is running at https://104.198.207.42/api/v1/proxy/namespaces/kube-system/services/heapster
KubeDNS is running at https://104.198.207.42/api/v1/proxy/namespaces/kube-system/services/kube-dns
kubernetes-dashboard is running at https://104.198.207.42/api/v1/proxy/namespaces/kube-system/services/kubernetes-dashboard

And halfway through writing this it magically started working again, so I can't really paste any more output (or I don't know where to look for it, at least).

But if anyone has any ideas about what is causing this and how I can try to fix it next time it happens, that would be amazing. It's extremely frustrating that a deployment can randomly break things and cause me hours of downtime while I try aimlessly, and seemingly pointlessly, to fix it, only to have it randomly decide to work again.

Thanks for reading!

Kenyon
  • 121
  • 4
  • The error message "no endpoints available" means the pods that back the service are not ready, but updating an unrelated deployment shouldn't normally cause that. Do you have cluster autoscaling enabled? Are any nodes added/removed when you do the deployment? Are your nodes running close to capacity? – Tim Allclair Sep 07 '16 at 02:05
  • I am fairly certain it was caused by running out of memory on my nodes. Though you would think that some kind of notification would get sent out by default. Guess not. Also, I'm not sure why it killed the whole cluster and not just the node that exceeded the capacity limits. – Kenyon Sep 07 '16 at 02:20
  • You can always monitor your cluster resources using [Stackdriver](https://cloudplatform.googleblog.com/2015/12/monitoring-Container-Engine-with-Google-Cloud-Monitoring.html). Another good place to have a look for errors related to compute resources is the [Serial Console](https://cloud.google.com/compute/docs/troubleshooting#interacting_with_the_serial_console) – Carlos Nov 01 '16 at 20:53
  • I was using Stackdriver, but it wasn't reporting Out of Memory. It did just recently undergo a fairly major update, though, so maybe it's fixed now. – Kenyon Nov 01 '16 at 21:41
  • I was also thinking that if the issue was related just to the UI, the problem could have been unrelated to the resources in your nodes. If so a good way to go would have been checking the dashboard logs. – Carlos Nov 02 '16 at 20:53
  • Don't think it had anything to do with the UI, since everything was dying all together. – Kenyon Nov 02 '16 at 21:15
  • Were you ever able to resolve this issue? If so please consider posting a self-answer so the community can benefit. – Faizan Jan 02 '17 at 22:09
  • @Faizan I added the answer. It's been quite a few months so I don't remember the exact commands I used to diagnose my issue, but, that is what the problem was. – Kenyon Jan 03 '17 at 17:29
  • Actually, even though the UI doesn't work, the "Cabin" app works. Just saying. – Ravindranath Akila May 22 '17 at 23:56

2 Answers

2

So, in the interest of documentation in case anyone else has this issue: I had to upgrade to larger instances. It was ultimately because I was getting OOM (out of memory) errors.

I don't remember anymore how I found those errors, whether it was through kubectl logs or the gcloud command-line utility, but one of them eventually said there were "OOM" errors.
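Since I don't remember the exact commands, here are the usual places OOM kills and memory pressure show up; this is a sketch of where I would look, not necessarily what I ran at the time:

```shell
# Cluster events often record OOM kills and evictions
kubectl get events --all-namespaces

# Node conditions show MemoryPressure and recent OOM-related events
kubectl describe nodes

# On the node itself (via SSH), the kernel log records the OOM killer firing
dmesg | grep -i "out of memory"
```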

Kenyon
  • 121
  • 4
0

I was also facing the same issue. Whenever CPU utilization is near 100%, the Kubernetes dashboard gives the same error:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "no endpoints available for service \"kubernetes-dashboard\"",
  "reason": "ServiceUnavailable",
  "code": 503
}

And when I delete some of the dummy pods, it automatically starts working again.

The main thing is that I have 4 nodes, and most of the pods are scheduled on only 1-2 of them.
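To see that skew for yourself, these commands show which node each pod landed on and how loaded each node is (the grep pattern is approximate and depends on the kubectl version's output format):

```shell
# Show which node each pod is scheduled on
kubectl get pods -o wide

# Compare resource allocation per node; heavily loaded nodes
# stand out in the "Allocated resources" section
kubectl describe nodes | grep -A 4 "Allocated resources"
```

Setting CPU/memory requests on your pods helps the scheduler spread them across nodes instead of packing them onto one or two.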

Jenny D
  • 27,358
  • 21
  • 74
  • 110