0

I just moved my cluster from AWS EC2 to Google Compute, and looking at the logs it seems there are constant network issues.

This happens with 2 specific nodes, several times a day.

It starts with the error:

master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:

Checking the logs it doesn't seem like they are restarting (running with docker), just disconnecting and reconnecting.

The networking tab in VM instance details is not that helpful.

Eran H.
  • 101
  • 1

1 Answers1

0

Just in case anyone runs into this, we eventually solved this.

(1) Apparently, in Google Compute connections get disconnected after 10 minutes, which is pretty low (the default for ubuntu is 2 hours keep alive pings). The source for this is here. They even recommend in this link which values to actually use, which are lower than our initial try.

(2) Another problem we had, is that docker requires it's own definition of sysctl, so changing ubuntu's config didn't actually do anything.

We didn't have one disconnect in 5 days.

Eran H.
  • 101
  • 1