I have a single-machine (untainted) Kubernetes cluster on bare-metal CentOS.
I'm using nginx-ingress-controller as the gateway. The image comes from https://quay.io/repository/kubernetes-ingress-controller/nginx-ingress-controller. I was running version 0.13.0 and recently upgraded to 0.14.0 because of the crashes. Unfortunately, that didn't help.

The server works fine for about 3 days. After that, the ingress controller goes into the CrashLoopBackOff state.

I've set up a service that requests a URL handled by the cluster every minute. If the URL is not accessible, the service captures the logs and information about the pods and emails them to me. If the cluster does not recover within 5 minutes, the system is rebooted.
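
A minimal sketch of such a watchdog follows; it is not the actual service. The URL, the mail recipient, the state-file argument, and the use of curl, mail, and journalctl are all assumptions made for illustration.

```shell
# Hypothetical watchdog sketch. URL and MAIL_TO are placeholders.
URL="${URL:-https://example.com/}"
MAIL_TO="${MAIL_TO:-admin@example.com}"
MAX_DOWN_SECONDS=300

# Return 0 when the outage has lasted at least MAX_DOWN_SECONDS.
# Arguments: epoch second the outage started, current epoch second.
should_reboot() {
    down_since=$1
    now=$2
    [ $((now - down_since)) -ge "$MAX_DOWN_SECONDS" ]
}

# Return 0 if the URL answers with an HTTP success status within 10 seconds.
site_is_up() {
    curl --silent --fail --max-time 10 --output /dev/null "$1"
}

# Print recent kubelet logs and the state of the pods.
collect_report() {
    journalctl -u kubelet --since "10 minutes ago" --no-pager
    kubectl get pods --all-namespaces -o wide
    kubectl --namespace=ingress-nginx describe pods
}

# One check cycle, meant to be run once a minute.
# Argument: path of a file that remembers when the outage started.
check_once() {
    state_file=$1
    now=$(date +%s)
    if site_is_up "$URL"; then
        rm -f "$state_file"
        return 0
    fi
    if [ ! -f "$state_file" ]; then
        # First failed check: remember the time and send the report.
        echo "$now" > "$state_file"
        collect_report 2>&1 | mail -s "cluster unreachable" "$MAIL_TO"
    fi
    if should_reboot "$(cat "$state_file")" "$now"; then
        rm -f "$state_file"
        systemctl reboot
    fi
}
```

Calling check_once with a state-file path of your choice once a minute (from cron or a systemd timer, for example) gives the behaviour described above: a report on the first failed check and a reboot only if the outage lasts 5 minutes.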

Yesterday it failed again. The first time it recovered by itself after 2 minutes, so no reboot was performed. All pods were in the Running state. Here are the logs:

May 15 07:33:47 web-backend kubelet[2556]: E0515 07:33:47.845990 2556 remote_runtime.go:278] ContainerStatus "bdb53e54dbcc9250663b64db5827276efba7012b50edc1351195d34b87e46529" from runtime service failed: rpc error: code = Unknown desc = Error response from daemon: devmapper: Unknown device 516757f48afe0bd82be957abd70578bf46e5a89ccfb782f913acca08514fb58d
May 15 07:33:47 web-backend kubelet[2556]: E0515 07:33:47.849158 2556 kuberuntime_container.go:416] ContainerStatus for bdb53e54dbcc9250663b64db5827276efba7012b50edc1351195d34b87e46529 error: rpc error: code = Unknown desc = Error response from daemon: devmapper: Unknown device 516757f48afe0bd82be957abd70578bf46e5a89ccfb782f913acca08514fb58d
May 15 07:33:47 web-backend kubelet[2556]: E0515 07:33:47.849184 2556 kuberuntime_manager.go:874] getPodContainerStatuses for pod "nginx-ingress-controller-65b9795548-br445_ingress-nginx(0438c326-4d0e-11e8-ad63-005056b1f077)" failed: rpc error: code = Unknown desc = Error response from daemon: devmapper: Unknown device 516757f48afe0bd82be957abd70578bf46e5a89ccfb782f913acca08514fb58d
May 15 07:33:47 web-backend kubelet[2556]: E0515 07:33:47.850157 2556 generic.go:241] PLEG: Ignoring events for pod nginx-ingress-controller-65b9795548-br445/ingress-nginx: rpc error: code = Unknown desc = Error response from daemon: devmapper: Unknown device 516757f48afe0bd82be957abd70578bf46e5a89ccfb782f913acca08514fb58d
May 15 07:33:47 web-backend kubelet[2556]: E0515 07:33:47.851766 2556 pod_workers.go:186] Error syncing pod 0438c326-4d0e-11e8-ad63-005056b1f077 ("nginx-ingress-controller-65b9795548-br445_ingress-nginx(0438c326-4d0e-11e8-ad63-005056b1f077)"), skipping: rpc error: code = Unknown desc = Error response from daemon: devmapper: Unknown device 516757f48afe0bd82be957abd70578bf46e5a89ccfb782f913acca08514fb58d
May 15 07:33:57 web-backend kubelet[2556]: I0515 07:33:57.696417 2556 kuberuntime_manager.go:758] checking backoff for container "nginx-ingress-controller" in pod "nginx-ingress-controller-65b9795548-8nznl_ingress-nginx(eb6191db-4fab-11e8-bc40-005056b1f077)"
May 15 07:34:05 web-backend kubelet[2556]: W0515 07:34:05.779589 2556 prober.go:103] No ref for container "docker://2ac4271f1f2a9f515deb4c2d86465d3db5c23dae10858c39122879d67a458976" (nginx-ingress-controller-65b9795548-8nznl_ingress-nginx(eb6191db-4fab-11e8-bc40-005056b1f077):nginx-ingress-controller)
May 15 07:34:15 web-backend kubelet[2556]: W0515 07:34:15.784498 2556 prober.go:103] No ref for container "docker://2ac4271f1f2a9f515deb4c2d86465d3db5c23dae10858c39122879d67a458976" (nginx-ingress-controller-65b9795548-8nznl_ingress-nginx(eb6191db-4fab-11e8-bc40-005056b1f077):nginx-ingress-controller)
May 15 07:34:25 web-backend kubelet[2556]: W0515 07:34:25.820287 2556 prober.go:103] No ref for container "docker://2ac4271f1f2a9f515deb4c2d86465d3db5c23dae10858c39122879d67a458976" (nginx-ingress-controller-65b9795548-8nznl_ingress-nginx(eb6191db-4fab-11e8-bc40-005056b1f077):nginx-ingress-controller)
May 15 07:34:28 web-backend kubelet[2556]: W0515 07:34:28.577852 2556 pod_container_deletor.go:77] Container "e12f6be27df976112456671d640db9711cc5546127eee47fa52a0118db9fc573" not found in pod's containers

The second failure occurred 20 minutes later. This time it didn't recover, so the reboot was performed after 5 minutes.
Logs:

May 15 07:57:05 web-backend kubelet[2556]: I0515 07:57:05.946879 2556 kuberuntime_manager.go:768] Back-off 5m0s restarting failed container=nginx-ingress-controller pod=nginx-ingress-controller-65b9795548-8nznl_ingress-nginx(eb6191db-4fab-11e8-bc40-005056b1f077)
May 15 07:57:05 web-backend kubelet[2556]: E0515 07:57:05.946919 2556 pod_workers.go:186] Error syncing pod eb6191db-4fab-11e8-bc40-005056b1f077 ("nginx-ingress-controller-65b9795548-8nznl_ingress-nginx(eb6191db-4fab-11e8-bc40-005056b1f077)"), skipping: failed to "StartContainer" for "nginx-ingress-controller" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=nginx-ingress-controller pod=nginx-ingress-controller-65b9795548-8nznl_ingress-nginx(eb6191db-4fab-11e8-bc40-005056b1f077)"
May 15 07:57:11 web-backend kubelet[2556]: I0515 07:57:11.946715 2556 kuberuntime_manager.go:514] Container {Name:nginx-ingress-controller Image:quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.14.0 Command:[] Args:[/nginx-ingress-controller --default-backend-service=$(POD_NAMESPACE)/default-http-backend --configmap=$(POD_NAMESPACE)/nginx-configuration --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services --udp-services-configmap=$(POD_NAMESPACE)/udp-services --annotations-prefix=nginx.ingress.kubernetes.io] WorkingDir: Ports:[{Name:http HostPort:0 ContainerPort:80 Protocol:TCP HostIP:} {Name:https HostPort:0 ContainerPort:443 Protocol:TCP HostIP:}] EnvFrom:[] Env:[{Name:POD_NAME Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.name,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {Name:POD_NAMESPACE Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:nginx-ingress-serviceaccount-token-x87kw ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:10254,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:10,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:10254,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:0,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
May 15 07:57:11 web-backend kubelet[2556]: I0515 07:57:11.946852 2556 kuberuntime_manager.go:758] checking backoff for container "nginx-ingress-controller" in pod "nginx-ingress-controller-65b9795548-4ddm7_ingress-nginx(eb691c72-4fab-11e8-bc40-005056b1f077)"
May 15 07:57:11 web-backend kubelet[2556]: I0515 07:57:11.946975 2556 kuberuntime_manager.go:768] Back-off 5m0s restarting failed container=nginx-ingress-controller pod=nginx-ingress-controller-65b9795548-4ddm7_ingress-nginx(eb691c72-4fab-11e8-bc40-005056b1f077)
May 15 07:57:11 web-backend kubelet[2556]: E0515 07:57:11.947002 2556 pod_workers.go:186] Error syncing pod eb691c72-4fab-11e8-bc40-005056b1f077 ("nginx-ingress-controller-65b9795548-4ddm7_ingress-nginx(eb691c72-4fab-11e8-bc40-005056b1f077)"), skipping: failed to "StartContainer" for "nginx-ingress-controller" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=nginx-ingress-controller pod=nginx-ingress-controller-65b9795548-4ddm7_ingress-nginx(eb691c72-4fab-11e8-bc40-005056b1f077)"
May 15 07:57:17 web-backend kubelet[2556]: I0515 07:57:17.958471 2556 kuberuntime_manager.go:514] Container {Name:nginx-ingress-controller Image:quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.14.0 Command:[] Args:[/nginx-ingress-controller --default-backend-service=$(POD_NAMESPACE)/default-http-backend --configmap=$(POD_NAMESPACE)/nginx-configuration --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services --udp-services-configmap=$(POD_NAMESPACE)/udp-services --annotations-prefix=nginx.ingress.kubernetes.io] WorkingDir: Ports:[{Name:http HostPort:0 ContainerPort:80 Protocol:TCP HostIP:} {Name:https HostPort:0 ContainerPort:443 Protocol:TCP HostIP:}] EnvFrom:[] Env:[{Name:POD_NAME Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.name,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {Name:POD_NAMESPACE Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:nginx-ingress-serviceaccount-token-x87kw ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:10254,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:10,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:10254,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:0,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
May 15 07:57:17 web-backend kubelet[2556]: I0515 07:57:17.958587 2556 kuberuntime_manager.go:758] checking backoff for container "nginx-ingress-controller" in pod "nginx-ingress-controller-65b9795548-br445_ingress-nginx(0438c326-4d0e-11e8-ad63-005056b1f077)"
May 15 07:57:17 web-backend kubelet[2556]: I0515 07:57:17.958737 2556 kuberuntime_manager.go:768] Back-off 5m0s restarting failed container=nginx-ingress-controller pod=nginx-ingress-controller-65b9795548-br445_ingress-nginx(0438c326-4d0e-11e8-ad63-005056b1f077)
May 15 07:57:17 web-backend kubelet[2556]: E0515 07:57:17.958767 2556 pod_workers.go:186] Error syncing pod 0438c326-4d0e-11e8-ad63-005056b1f077 ("nginx-ingress-controller-65b9795548-br445_ingress-nginx(0438c326-4d0e-11e8-ad63-005056b1f077)"), skipping: failed to "StartContainer" for "nginx-ingress-controller" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=nginx-ingress-controller pod=nginx-ingress-controller-65b9795548-br445_ingress-nginx(0438c326-4d0e-11e8-ad63-005056b1f077)"

This time the nginx-ingress-controller pods went into the CrashLoopBackOff state. Here is detailed information about them from kubectl describe pod:

Name:           nginx-ingress-controller-65b9795548-4ddm7
Namespace:      ingress-nginx
Node:           web-backend/10.202.91.129
Start Time:     Fri, 04 May 2018 08:00:44 -0700
Labels:         app=ingress-nginx
                pod-template-hash=2165351104
Annotations:    prometheus.io/port=10254
                prometheus.io/scrape=true
Status:         Running
IP:             192.168.255.17
Controlled By:  ReplicaSet/nginx-ingress-controller-65b9795548
Containers:
  nginx-ingress-controller:
    Container ID:  docker://ad9447979d549757449cf26de19ba2485e39abbd32b9dd69cc9a18d27a48002d
    Image:         quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.14.0
    Image ID:      docker-pullable://quay.io/kubernetes-ingress-controller/nginx-ingress-controller@sha256:4091d87c1f81fdd1036ddc96e2da725b1aeb37f26bb8bdd97e16a6ea4d2e1b14
    Ports:         80/TCP, 443/TCP
    Args:
      /nginx-ingress-controller
      --default-backend-service=$(POD_NAMESPACE)/default-http-backend
      --configmap=$(POD_NAMESPACE)/nginx-configuration
      --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
      --udp-services-configmap=$(POD_NAMESPACE)/udp-services
      --annotations-prefix=nginx.ingress.kubernetes.io
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 15 May 2018 07:55:27 -0700
      Finished:     Tue, 15 May 2018 07:56:19 -0700
    Ready:          False
    Restart Count:  54
    Liveness:       http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:10254/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       nginx-ingress-controller-65b9795548-4ddm7 (v1:metadata.name)
      POD_NAMESPACE:  ingress-nginx (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-serviceaccount-token-x87kw (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  nginx-ingress-serviceaccount-token-x87kw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nginx-ingress-serviceaccount-token-x87kw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

Name:           nginx-ingress-controller-65b9795548-8nznl
Namespace:      ingress-nginx
Node:           web-backend/10.202.91.129
Start Time:     Fri, 04 May 2018 08:00:44 -0700
Labels:         app=ingress-nginx
                pod-template-hash=2165351104
Annotations:    prometheus.io/port=10254
                prometheus.io/scrape=true
Status:         Running
IP:             192.168.255.32
Controlled By:  ReplicaSet/nginx-ingress-controller-65b9795548
Containers:
  nginx-ingress-controller:
    Container ID:  docker://57c588250fe42266d9c31494f8dd9c12b970f29f040d2975ee121279ed6af470
    Image:         quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.14.0
    Image ID:      docker-pullable://quay.io/kubernetes-ingress-controller/nginx-ingress-controller@sha256:4091d87c1f81fdd1036ddc96e2da725b1aeb37f26bb8bdd97e16a6ea4d2e1b14
    Ports:         80/TCP, 443/TCP
    Args:
      /nginx-ingress-controller
      --default-backend-service=$(POD_NAMESPACE)/default-http-backend
      --configmap=$(POD_NAMESPACE)/nginx-configuration
      --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
      --udp-services-configmap=$(POD_NAMESPACE)/udp-services
      --annotations-prefix=nginx.ingress.kubernetes.io
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 15 May 2018 07:55:24 -0700
      Finished:     Tue, 15 May 2018 07:56:06 -0700
    Ready:          False
    Restart Count:  63
    Liveness:       http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:10254/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       nginx-ingress-controller-65b9795548-8nznl (v1:metadata.name)
      POD_NAMESPACE:  ingress-nginx (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-serviceaccount-token-x87kw (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  nginx-ingress-serviceaccount-token-x87kw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nginx-ingress-serviceaccount-token-x87kw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

Name:           nginx-ingress-controller-65b9795548-br445
Namespace:      ingress-nginx
Node:           web-backend/10.202.91.129
Start Time:     Tue, 01 May 2018 00:05:26 -0700
Labels:         app=ingress-nginx
                pod-template-hash=2165351104
Annotations:    prometheus.io/port=10254
                prometheus.io/scrape=true
Status:         Running
IP:             192.168.255.4
Controlled By:  ReplicaSet/nginx-ingress-controller-65b9795548
Containers:
  nginx-ingress-controller:
    Container ID:  docker://bd1db5de818e982b213b0ae2fe4e6208a0a4ec7bdc2d0a3c483867d63bd9fd76
    Image:         quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.14.0
    Image ID:      docker-pullable://quay.io/kubernetes-ingress-controller/nginx-ingress-controller@sha256:4091d87c1f81fdd1036ddc96e2da725b1aeb37f26bb8bdd97e16a6ea4d2e1b14
    Ports:         80/TCP, 443/TCP
    Args:
      /nginx-ingress-controller
      --default-backend-service=$(POD_NAMESPACE)/default-http-backend
      --configmap=$(POD_NAMESPACE)/nginx-configuration
      --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
      --udp-services-configmap=$(POD_NAMESPACE)/udp-services
      --annotations-prefix=nginx.ingress.kubernetes.io
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 15 May 2018 07:55:26 -0700
      Finished:     Tue, 15 May 2018 07:56:22 -0700
    Ready:          False
    Restart Count:  56
    Liveness:       http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:10254/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       nginx-ingress-controller-65b9795548-br445 (v1:metadata.name)
      POD_NAMESPACE:  ingress-nginx (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-serviceaccount-token-x87kw (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  nginx-ingress-serviceaccount-token-x87kw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nginx-ingress-serviceaccount-token-x87kw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

My kubelet version: Kubernetes v1.9.3
Docker: Docker version 1.12.6, build 3e8e77d/1.12.6
OS: CentOS Linux release 7.2.1511 (Core)

After a reboot (it takes about 2 minutes to bring everything back up), everything works without any outages for the next ~3 days.

I have noticed that only the nginx-ingress-controller is failing, but that effectively means the whole cluster goes down: no service is reachable from a web browser.

I have also noticed a weird error message in the logs:

May 15 05:28:15 web-backend kubelet[2556]: E0515 05:28:15.672576 2556 kubelet_node_status.go:383] Error updating node status, will retry: error getting node "web-backend": Get https://10.202.91.129:6443/api/v1/nodes/web-backend: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Since this is a single-node cluster, it looks like an issue with the apiserver, but I have no idea how to address it.

I have scaled the nginx-ingress-controller up to 3 replicas, but it doesn't help. Probably only one replica goes down at a time, but all traffic is directed to it anyway.

Here is what I found in the pod logs after the next crash:

kubectl --namespace=ingress-nginx logs nginx-ingress-controller-65b9795548-4ddm7 --since 15m
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:    0.14.0
  Build:      git-734361d
  Repository: https://github.com/kubernetes/ingress-nginx
-------------------------------------------------------------------------------
W0518 12:31:30.646157       7 client_config.go:533] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0518 12:31:30.674673       7 main.go:181] Creating API client for https://10.96.0.1:443
I0518 12:31:34.857433       7 main.go:225] Running in Kubernetes Cluster version v1.9 (v1.9.3) - git (clean) commit d2835416544f298c919e2ead3be3d0864b52323b - platform linux/amd64
I0518 12:31:35.650003       7 main.go:84] validated ingress-nginx/default-http-backend as the default backend
I0518 12:31:37.486330       7 stat_collector.go:77] starting new nginx stats collector for Ingress controller running in namespace  (class nginx)
I0518 12:31:37.490876       7 stat_collector.go:78] collector extracting information from port 18080
I0518 12:31:37.994480       7 nginx.go:278] starting Ingress controller
I0518 12:31:38.289328       7 event.go:218] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"ingress-nginx", Name:"nginx-configuration", UID:"c56ae692-373e-11e8-937e-005056b1f077", APIVersion:"v1", ResourceVersion:"6055322", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap ingress-nginx/nginx-configuration
I0518 12:31:38.547714       7 event.go:218] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"ingress-nginx", Name:"tcp-services", UID:"ca14ac3e-373e-11e8-937e-005056b1f077", APIVersion:"v1", ResourceVersion:"4116575", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap ingress-nginx/tcp-services
I0518 12:31:38.547931       7 event.go:218] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"ingress-nginx", Name:"udp-services", UID:"cd74ece7-373e-11e8-937e-005056b1f077", APIVersion:"v1", ResourceVersion:"4116581", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap ingress-nginx/udp-services
I0518 12:31:39.518536       7 event.go:218] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"web", Name:"web-incubator", UID:"02a43f7c-37d2-11e8-937e-005056b1f077", APIVersion:"extensions", ResourceVersion:"7517344", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress web/web-incubator
I0518 12:31:39.537898       7 event.go:218] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"web", Name:"web-production", UID:"64f13208-3765-11e8-937e-005056b1f077", APIVersion:"extensions", ResourceVersion:"7517346", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress web/web-production
I0518 12:31:39.612898       7 nginx.go:299] starting NGINX process...
I0518 12:31:39.857564       7 leaderelection.go:175] attempting to acquire leader lease  ingress-nginx/ingress-controller-leader-nginx...
I0518 12:31:39.976074       7 status.go:196] new leader elected: nginx-ingress-controller-65b9795548-8nznl
I0518 12:31:40.018482       7 controller.go:168] backend reload required
I0518 12:31:40.028976       7 stat_collector.go:34] changing prometheus collector from  to default
I0518 12:31:48.350483       7 controller.go:177] ingress backend successfully reloaded...
I0518 12:31:55.369911       7 main.go:150] Received SIGTERM, shutting down
I0518 12:31:55.697116       7 nginx.go:362] shutting down controller queues
I0518 12:31:56.366677       7 nginx.go:370] stopping NGINX process...

Any ideas what may be causing the problems?

Djent

2 Answers

Kubernetes is a self-healing, highly available, load-balanced environment. Even with a single node you still use the cluster machinery to get the best performance, but in that case all software components run on one machine: your application images and the Kubernetes processes themselves. The machine needs enough resources for all of them.

What if your application (a Docker container, for example) runs out of resources? Kubernetes will kill it. Exhausting memory triggers the OOM (out-of-memory) killer, which kills the container.

In the log you've provided I found the following entry:

I0518 12:31:55.369911 7 main.go:150] Received SIGTERM, shutting down

So Kubernetes decided to kill the Ingress controller due to exhausted resources. Try increasing the memory of the machine (or VM) and observe whether it helps.
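
Before adding RAM, it may be worth checking whether the node really is short on memory. The following is a rough sketch, not a definitive diagnostic: count_oom_events and memory_report are hypothetical helper names, and the patterns simply match the phrases the Linux kernel usually logs when the OOM killer runs.

```shell
# Count kernel log lines on stdin that mention the OOM killer.
# grep -c exits non-zero when nothing matches, so "|| true" keeps the
# function usable in scripts that run with "set -e".
count_oom_events() {
    grep -c -i -E 'out of memory|oom-killer' || true
}

# Print a short memory summary for one node.
# Argument: node name as known to Kubernetes (web-backend in the question).
memory_report() {
    free -m
    echo "OOM killer events in kernel log: $(dmesg | count_oom_events)"
    # Shows the MemoryPressure condition and the allocated resources.
    kubectl describe node "$1"
}
```

Running memory_report web-backend on the node right after a crash shows free memory, whether the kernel had to kill processes, and whether Kubernetes reports MemoryPressure.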

Another option is to add another node, extending the cluster from a single node to multiple nodes with features like HA and load balancing enabled, to avoid issues like this.

d0bry
  • That actually might be a thing in my case. Seems like the RAM is short on the VM. I'll try to extend it. Thank you for the hint. – Djent May 21 '18 at 05:39

This is caused by the limit of allowed open files for the nobody user.

Both Prometheus and nginx-ingress run as the nobody user. Since Prometheus keeps a lot of file handles open, there isn't enough room left for Nginx to function properly.

Add a file in /etc/security/limits.d/ that contains:

nobody soft nofile 4096

While you're at it, you may also want a file in /etc/sysctl.d/ that increases some other limits:

# Increase number of watches (Kubernetes/Docker/IMAP)
# default 8192
fs.inotify.max_user_watches = 16384
# was 128
fs.inotify.max_user_instances = 1024

This should make sure Nginx keeps running.


Finally, when your system receives many incoming connections, it should also be tuned for that:

# Full speed on idle connection
net.ipv4.tcp_slow_start_after_idle = 0

# Handle more concurrent connections, resize listen() backlog
net.core.somaxconn = 1024

# Increase connection tracking, or face dropping packets (seen in dmesg).
# Happened on servers that receive many connections from statsd/collectd
net.nf_conntrack_max = 65536

# Increase the number of outstanding SYN requests allowed.
net.ipv4.tcp_max_syn_backlog = 2048
net.ipv4.tcp_syncookies = 1

# Buffers: min default max (16MB buffer at 50ms RTT = 320MB/s max rate)
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# Lower TIME_WAIT / FIN timeout
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_tw_buckets = 131072

See http://cdn.oreillystatic.com/en/assets/1/event/94/Tuning%20TCP%20For%20The%20Web%20Presentation.pdf for the details on the network values.
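
To load the files in /etc/sysctl.d/ without a reboot and confirm the values took effect, something along these lines should work. It assumes the procps-ng sysctl shipped with CentOS 7, whose --system flag reloads all system configuration files; reload_and_show is a hypothetical helper name.

```shell
# Reload every sysctl configuration file, then print the current value
# of each key given as an argument.
reload_and_show() {
    sysctl --system >/dev/null
    for key in "$@"; do
        sysctl "$key"
    done
}
```

For example, reload_and_show fs.inotify.max_user_watches net.core.somaxconn (run as root) prints the two values as the kernel now sees them.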

vdboor
  • Actually, at the moment when we had this issue, we didn't have prometheus setup. Extending RAM caused the issue to be gone. – Djent Nov 21 '18 at 10:29
  • @Djent good to know! In your case that would mean setting the resource limits would also avoid that issue. – vdboor Nov 21 '18 at 11:30
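
A sketch of what setting such resources could look like. It assumes the controller runs as a Deployment named nginx-ingress-controller (inferred from the ReplicaSet name in the question); set_ingress_memory is a hypothetical helper, and the memory sizes in the usage note are placeholders that need tuning for the actual workload.

```shell
# Set a memory request and limit on the ingress controller Deployment.
# Arguments: memory request, memory limit (e.g. 256Mi 512Mi).
set_ingress_memory() {
    request=$1
    limit=$2
    kubectl --namespace=ingress-nginx set resources deployment nginx-ingress-controller \
        --requests="memory=$request" --limits="memory=$limit"
}
```

For instance, set_ingress_memory 256Mi 512Mi. With a memory request in place the pods no longer fall into the BestEffort QoS class shown in the kubectl describe output above, and BestEffort pods are the first candidates to be killed when the node runs low on memory.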