
We are running two separate subdomains, each on a separate external IP address, and each matched to its own Kubernetes nginx service. The configuration looks like this:

#--------------------
# config for administrative nginx ssl termination deployment and associated service


apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: admin-nginx
  labels:
    name: admin-nginx
spec:
  replicas: 1
  template:
    metadata:
      name: admin-nginx
      labels:
        name: admin-nginx
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: currentNodePool
      containers:
      - name: admin-nginx
        image: path/to/nginx-ssl-image:1
        ports:
        - name: admin-http
          containerPort: 80
        - name: admin-https
          containerPort: 443
---
apiVersion: v1
kind: Service
metadata:
  name: admin-nginx
spec:
  ports:
  - name: https
    port: 443
    targetPort: admin-https
    protocol: TCP
  - name: http
    port: 80
    targetPort: admin-http
    protocol: TCP
  selector:
    name: admin-nginx
  type: LoadBalancer


#--------------------
# config for our api's nginx ssl termination deployment and associated service


apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: public-nginx
  labels:
    name: public-nginx
spec:
  replicas: 2
  strategy:
    rollingUpdate:
      maxUnavailable: 0
  template:
    metadata:
      labels:
        name: public-nginx
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: currentNodePool
      containers:
      - name: public-nginx
        image: path/to/nginx-ssl-image:1
        ports:
        - name: public-http
          containerPort: 80
        - name: public-https
          containerPort: 443
---
apiVersion: v1
kind: Service
metadata:
  name: public-nginx
spec:
  ports:
  - name: https
    port: 443
    targetPort: public-https
    protocol: TCP
  - name: http
    port: 80
    targetPort: public-http
    protocol: TCP
  selector:
    name: public-nginx
  type: LoadBalancer


#--------------------

Inside our Kubernetes cluster, each of the nginx deployments sits in front of a custom API router/gateway we use internally. Each of these routers has a /health endpoint, for, ah, health checks. This will be important in a second.

Some of the details above have been elided; there is also a bit of configuration that makes nginx aware of the target service's address and port.
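To give a flavor of that elided bit (this is a hand-waved sketch, not our actual config; the service name admin-router, the port 8080, and the cert paths are placeholders), it amounts to an nginx server block, carried in a ConfigMap and mounted into the nginx pod, that terminates TLS and proxies to the internal router service:

apiVersion: v1
kind: ConfigMap
metadata:
  name: admin-nginx-conf
data:
  default.conf: |
    server {
      listen 80;
      listen 443 ssl;
      ssl_certificate     /etc/nginx/ssl/tls.crt;
      ssl_certificate_key /etc/nginx/ssl/tls.key;
      location / {
        # hand everything off to the internal API router service (placeholder name/port)
        proxy_pass http://admin-router:8080;
      }
    }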

The configuration above creates two load balancers, sort of. I guess, technically, it creates two forwarding rules, each with an associated external IP, and a target pool consisting of all the instances in our k8s cluster. This, generally speaking, should work fine. Each k8s-generated forwarding rule has an annotation in its description field, like this:

{"kubernetes.io/service-name":"default/admin-nginx"}

An associated firewall entry is created as well, with a similar annotation in its description field:

{"kubernetes.io/service-name":"default/admin-nginx", "kubernetes.io/service-ip":"external.ip.goes.here"}

Each external IP is then wired up to its subdomain via CloudFlare's DNS service.

Ideally, the way this should all work, and the way it had worked in the past, is as follows.

An incoming request to admin.ourdomain.com/health returns the health status page for everything handled by the API router deployment (well, the service pointing to the pods that implement that deployment, anyway) that deals with admin stuff. It does this by way of the nginx pod, which is pointed to by the nginx service, which is pointed to by the description annotation on the forwarding rule, which is reached by way of the GCE external IP address manager and the firewall, though I'm less clear on the ordering of that last part.

Like this:

server                               status  lookupMicros
https://adminservice0:PORT/health    Ok      910
https://adminservice1:PORT/health    Ok      100
https://adminservice2:PORT/health    Ok      200
https://adminservice3:PORT/health    Ok      876

And so on.

Meanwhile, a request to public.ourdomain.com/health should return pretty much the same thing, except for public services.

Like this:

server                          status  lookupMicros
https://service0:PORT/health    Ok      910
https://service1:PORT/health    Ok      100
https://service2:PORT/health    Ok      200
https://service3:PORT/health    Ok      876

Etc.

Pretty reasonable, right?

As best I understand it, the whole thing hinges on making sure a request to the admin subdomain, by way of the external address linked to the admin-annotated forwarding rule, eventually makes its way through GCE's network apparatus and into the Kubernetes cluster, somewhere. It shouldn't matter where in the cluster it ends up first, as all of the nodes are aware of what services exist and where they are.
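As an aside: the external-IP-to-Service binding can be made explicit by reserving a regional static address in GCE and setting spec.loadBalancerIP on the Service. We aren't doing that here, so the sketch below is illustrative only (placeholder address), but it captures the one-external-IP-per-Service mapping I'm describing:

apiVersion: v1
kind: Service
metadata:
  name: admin-nginx
spec:
  type: LoadBalancer
  # illustrative only: a static regional IP reserved in GCE ahead of time
  loadBalancerIP: external.ip.goes.here
  selector:
    name: admin-nginx
  ports:
  - name: https
    port: 443
    targetPort: admin-https
    protocol: TCP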

Except... That's not what I'm seeing now. Instead, what I'm seeing is this: every couple of refreshes on admin.ourdomain.com/health, which is definitely on a different IP address than the public subdomain, returns the health page for the public subdomain. That's bad.

On the bright side, I'm not seeing requests destined for the public subdomain's /health come back with results from the admin side, for whatever reason, but it's pretty disturbing anyway.

Whatever is going on, it might also be interesting to note that requests made to the wrong side, like admin.ourdomain.com/publicendpoint, are 404'd correctly. I'd imagine that's just because /health is the only endpoint that inherently belongs to the API router itself, and that this bolsters the case that whatever is happening is happening somewhere in the path from the GCE forwarding rule to the correct Kubernetes service.

So, I guess we finally get to the part where I ask a question. Here goes:

Why are requests through an external IP, associated with a forwarding rule targeting a particular Kubernetes service, intermittently being sent to the wrong Kubernetes service?

Any assistance or information on this issue would be greatly appreciated.

  • Do you have one instance group in your project which is serving several backends? If yes, make sure to specify different [port names](https://cloud.google.com/sdk/gcloud/reference/compute/instance-groups/get-named-ports) in the instance groups. – Faizan Mar 14 '17 at 00:02
  • Alas, no luck. No named ports are returned... it looks like named ports are for use with http(s) load balancers, and we are using network load balancers. – mdavids Mar 14 '17 at 22:23
  • Were you able to solve this issue? If so please consider to post an answer, it would benefit the community. – Carlos Apr 13 '17 at 17:00
  • No, sorry. I ended up running a single node cluster separately to handle our admin/metrics tools. Very frustrating. – mdavids Apr 14 '17 at 18:34

0 Answers