We have a Horizontal Pod Autoscaler (HPA) running on a GKE cluster. Most of the time the autoscaler works perfectly fine, but from time to time (mostly during our customers' rush hours) it gets error code 503 from Stackdriver.

Here are the errors we encounter:

Failed request to stackdriver api: googleapi: Error 503: The service is currently unavailable., backendError

and

"apiserver received an error that is not an metav1.Status: &googleapi.Error{Code:503, Message:"The service is currently unavailable.", Body:"{\n  \"error\": {\n    \"code\": 503,\n    \"message\": \"The service is currently unavailable.\",\n    \"errors\": [\n      {\n        \"message\": \"The service is currently unavailable.\",\n        \"domain\": \"global\",\n        \"reason\": \"backendError\"\n      }\n    ],\n    \"status\": \"UNAVAILABLE\"\n  }\n}\n", Header:http.Header(nil), Errors:[]googleapi.ErrorItem{googleapi.ErrorItem{Reason:"backendError", Message:"The service is currently unavailable."}}}" 

Now I'm a bit puzzled: Google encourages using Stackdriver as a metrics source for HPAs (https://cloud.google.com/kubernetes-engine/docs/tutorials/external-metrics-autoscaling), but if it isn't 100% available or fault tolerant, the cluster is effectively broken, because the pods don't scale up and resources get exhausted.

Does anyone know how to work around this?

1 Answer

This is an API error, and you only see it during peak hours, when you are making a large number of requests to the Stackdriver API. At those times the API cannot handle all of the requests and becomes unavailable. However, that doesn't mean the pods will not scale up: the service is only temporarily unavailable, so the request is held and sent again. It may take a few minutes before requests succeed.
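
To illustrate the "send it again" idea for a client you control yourself, here is a minimal sketch of retrying on a 503 with exponential backoff and jitter. It is not how the HPA or the Stackdriver adapter works internally; `ServiceUnavailable`, `call_with_backoff` and `make_flaky_fetch` are hypothetical names made up for this example, and the fake fetch only simulates the error.

```python
import random
import time


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for an HTTP 503 response from a metrics API."""


def call_with_backoff(fetch, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call fetch(), retrying on ServiceUnavailable with backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))


def make_flaky_fetch(failures, value):
    """Build a fake fetch that raises 503 `failures` times, then returns `value`."""
    state = {"calls": 0}

    def fetch():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ServiceUnavailable("The service is currently unavailable.")
        return value

    fetch.state = state  # exposed so callers can inspect the call count
    return fetch


if __name__ == "__main__":
    fetch = make_flaky_fetch(failures=2, value=42)
    print(call_with_backoff(fetch), "after", fetch.state["calls"], "calls")
```

The jitter spreads retries out so that many clients hitting the same outage do not all retry at the same instant, which matters most during exactly the kind of peak-hour load described above.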

Aarti S
  • Thanks Aarti, but during that time the HPA is not scaling, therefore we don't have enough resources to handle our users' requests. What would be the best approach to get through this period of unavailability (as we do see a decrease in our service's responsiveness)? – kobymol Oct 11 '19 at 10:02
  • As this is a Stackdriver API issue, I don't have the access needed to investigate it further. You can report it in the Public Issue Tracker [1]. Someone from the Stackdriver API team will look into the issue and help you. [1] https://issuetracker.google.com – Aarti S Oct 11 '19 at 23:02
  • Thanks, I'll do so ... – kobymol Oct 12 '19 at 01:59