I have a VPC with two Compute Engine VM instances in it. One of them, `vpn-server`, is acting as a VPN for a cluster of on-premises computers. The other, `test-instance`, is configured with an instance tag `route-through-vpn` that routes traffic to `vpn-server` if it's going to `10.10.0.0/19`.
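For reference, the route is set up roughly like this (a sketch; the route name, network, zone, and priority shown here are placeholders, not our actual values):

```sh
# Static route sending 10.10.0.0/19 traffic from tagged instances to the
# VPN instance (name, network, zone, and priority are placeholders):
gcloud compute routes create route-to-onprem \
  --network=my-vpc \
  --destination-range=10.10.0.0/19 \
  --next-hop-instance=vpn-server \
  --next-hop-instance-zone=us-central1-a \
  --tags=route-through-vpn \
  --priority=1000
```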
There is also an App Engine instance that has the `route-through-vpn` instance tag. The webapp running in it can directly connect to our on-premises cluster.
This setup has worked just fine for over a year. Then yesterday, a small number of IP addresses suddenly stopped working.
By "stopped working" I mean this:
- It is still possible to SSH into the non-working IP addresses if you're logged into the `vpn-server`.
- But traffic originating from `test-instance` cannot reach these IPs.
One of the failing IPs is `10.10.0.8`. One IP that still works is `10.10.0.47`. As far as I can tell, all addresses correctly match the address range `10.10.0.0/19`.
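As a sanity check on the range math: `10.10.0.0/19` spans `10.10.0.0` through `10.10.31.255`, and a quick check with Python's standard library confirms that both the failing and the working address are inside it:

```sh
# 10.10.0.0/19 covers 10.10.0.0 - 10.10.31.255; both addresses fall inside:
python3 -c "import ipaddress; n = ipaddress.ip_network('10.10.0.0/19'); print(all(ipaddress.ip_address(a) in n for a in ['10.10.0.8', '10.10.0.47']))"
# prints: True
```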
To debug, I logged into `vpn-server` and `test-instance` and tried sending ICMP packets from `test-instance` to various IP addresses in the cluster. I also ran `tcpdump` on `vpn-server` so I could see the traffic as it passed through.

For the IP addresses that are still working, I saw the ICMP packets in the output of `tcpdump`, as expected. But for the IP addresses that are no longer working, I saw nothing in `tcpdump`, which indicates that Google Cloud's routing layer is not even sending the traffic to my `vpn-server`.
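Concretely, the test looked roughly like this (the interface choice and packet counts are just what I happened to use):

```sh
# On vpn-server: watch for ICMP arriving from the VPC ("-i any" avoids
# having to guess the right interface name):
sudo tcpdump -n -i any icmp

# On test-instance: ping one working and one failing on-prem address:
ping -c 3 10.10.0.47   # echo requests show up in tcpdump on vpn-server
ping -c 3 10.10.0.8    # nothing appears in tcpdump at all
```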
To test further, I shut down one of the on-premises machines whose traffic was still being routed properly, and tried pinging it. The ICMP echo request packets appeared in the output of `tcpdump` with no replies, exactly as expected.
Google Cloud routes don't have a whole lot of options, and I haven't found any diagnostics that would help me investigate further, so at this point it comes down to somebody just happening to know why this would happen.
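For completeness, here's what I've already checked from the `gcloud` side, and nothing looks out of place (the route name in the `describe` call is a placeholder):

```sh
# List any routes covering the on-prem range:
gcloud compute routes list --filter="destRange=10.10.0.0/19"

# Inspect the route itself (route name is a placeholder):
gcloud compute routes describe route-to-onprem

# Confirm test-instance still carries the tag the route matches on:
gcloud compute instances describe test-instance --format="get(tags.items)"
```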
Has anybody solved a problem like this or have any idea what might be the cause?