
We recently had a Google App Engine application fail to properly fail over during scheduled maintenance of our database server (hosted in Aiven).

During scheduled maintenance, the DB server fails over to a replacement server by updating the DNS record. This is supposed to be instant, but we found that our Node app running in GAE was crashing with connection failures for several minutes.

The connection error is treated as a hard error, so the Node app exits and is immediately replaced with a new process by running npm start again. That process also fails because it cannot connect either, and the cycle repeats until GAE decides the instance is a lame duck and replaces it.
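The crash loop described above could be softened by retrying the connection inside the process rather than exiting on the first failure. The following is a minimal sketch, not the app's actual code: `connectWithRetry`, its options, and the `connect` callback are hypothetical names, and it assumes the database driver resolves the hostname again on each new connection attempt.

```javascript
// Hypothetical helper: retry an async connect function with exponential
// backoff instead of exiting the process on the first connection error.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function connectWithRetry(connect, { retries = 5, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      // Call connect() afresh on every attempt so a new connection
      // (and, assuming the driver does so, a new hostname lookup) is made.
      return await connect();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // Wait 500 ms, 1 s, 2 s, ... (with the defaults) before retrying.
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  // All attempts failed: surface the last error to the caller.
  throw lastError;
}
```

With the defaults this keeps trying for roughly 15 seconds before giving up, which matches the TTL mentioned later in the thread; the numbers are illustrative only.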

By the time the instances were replaced, the connection issue appeared to be resolved, but I'm unclear why it took so long to resolve.

My suspicion is that the old database hostname may have been cached and so it was stuck trying to connect to the old IP.
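One way to test this suspicion from inside the app is to compare the address the operating system resolver hands out with the result of a direct DNS query. This is a minimal sketch using Node's built-in `dns` module: `dns.lookup` goes through the OS resolver, while `dns.resolve4` sends a query to the configured DNS server. `compareResolution` and the `stale` flag are hypothetical names, and a mismatch is only a hint, since a record with several addresses can legitimately return different answers.

```javascript
const dns = require('dns').promises;

// Hypothetical diagnostic: compare OS-level resolution with a direct DNS query.
// `resolver` is injectable so the function can be exercised without a network.
async function compareResolution(hostname, resolver = dns) {
  // Direct query to the configured DNS server, including the remaining TTL.
  const fresh = await resolver.resolve4(hostname, { ttl: true });
  // Resolution through the operating system (getaddrinfo).
  const viaOs = await resolver.lookup(hostname, { all: true, family: 4 });

  const freshAddresses = fresh.map((record) => record.address).sort();
  const osAddresses = viaOs.map((record) => record.address).sort();

  return {
    fresh, // [{ address, ttl }, ...]
    osAddresses,
    // True if the OS returned an address the DNS server no longer lists.
    stale: osAddresses.some((address) => !freshAddresses.includes(address)),
  };
}

// Usage (hostname is a placeholder):
// compareResolution('db.example.com').then(console.log);
```

Logging this result when a connection error occurs would show whether the instance was still being handed the old IP during the failover.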

As a workaround, I'm wondering if it's possible to flush the DNS cache on a Google App Engine instance (from within the instance)?

I've looked for documentation on how App Engine resolves DNS, but end up at pages about setting up custom or internal DNS.

To summarise: our Node app running on GAE connects to an externally hosted database and identifies the host by DNS lookup.

So the cache I need to flush is the one that caches lookups of a public DNS record that is not hosted on Google but is requested by a GAE app.

i.e.:

Node GAE App -> { Public Internet } -> Database
ChrisJ
  • `We recently had a Google App Engine application fail to properly fail over during scheduled maintenance of our database server` - Giving us details about your HA/Failover configuration would help us to understand how it applies to your problem. – joeqwerty Jun 15 '20 at 03:57
  • Thanks, I've added some more detail – ChrisJ Jun 15 '20 at 09:39
  • I guess this is what you're looking for (flush DNS records): https://developers.google.com/speed/public-dns/cache If not, please give us a more detailed explanation of the exact goal you are trying to achieve and your current setup. It's a bit confusing right now. – Waelmas Jun 15 '20 at 12:51
  • No, that's for flushing a record of a domain that is being hosted on Google servers. My problem is that an app on Google servers is using an outdated DNS record for a database that is hosted outside of Google and outside of our control. – ChrisJ Jun 15 '20 at 23:42
  • I think the best way to solve your issue will be opening a ticket with official GCP support. – Jaroslav Jun 18 '20 at 14:23

1 Answer

I think the solution should be a small architecture change.

Because GAE is a PaaS, you have very little access to the underlying instance for running admin operations.

Knowing this, these are the architecture changes I propose:

  1. Use a Virtual IP for the running database server and change between the servers on failover. The DNS will be mapped to the Virtual IP only.

OR

  1. Verify the TTL of the DNS record and reduce it to the shortest time you can afford to wait for the DNS change to propagate, for example 60 seconds. The trade-off is that the machines will query the DNS servers more often.

OR

  1. Put a load balancer in front of the database servers and change the load balancer mapping on failover. The DNS will map to the load balancer.
Eduardo
  • The database is hosted on Aiven which is managing this for us. The TTL on the DNS record is 15 seconds – ChrisJ Jun 18 '20 at 11:00
  • So, this might be a replication problem. Check the [SOA](https://en.wikipedia.org/wiki/SOA_record) record of your domain. Check the REFRESH field and make sure the SERIAL changes after a DNS record change. If GAE is querying a stale DNS server, it might take some time to get the new information. On engine restart, GAE might query a DNS server that has already replicated the change, although REFRESH is [barely in use](https://serverfault.com/questions/69183/recommended-dns-soa-record-ttl-default) today. I would also check whether the application holds the connection in a global or a local variable. – Eduardo Jun 19 '20 at 12:43
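On the last point in the comment above, a connection held in a global variable stays bound to whatever IP it was opened against. A minimal sketch of one alternative follows, assuming a driver whose connect call returns a promise: the holder keeps one client but discards it after a failure, so the next caller opens a new connection. `createClientHolder` and `connect` are hypothetical names and not part of any driver's API.

```javascript
// Hypothetical holder: share one client across requests, but never cache a
// failed connection attempt, so the next caller triggers a fresh connect().
function createClientHolder(connect) {
  let clientPromise = null;

  return {
    get() {
      if (!clientPromise) {
        clientPromise = connect().catch((err) => {
          clientPromise = null; // forget the failed attempt
          throw err;
        });
      }
      return clientPromise;
    },
    // Call this from the client's error handler to force a reconnect later.
    reset() {
      clientPromise = null;
    },
  };
}
```

Whether a fresh connection also picks up the new IP still depends on how the driver and the platform resolve the hostname, so this only removes the application's own stale reference.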