20% of requests to our backend Django application (deployed on AWS using ECS and Postgres RDS) are returning 500 errors. The ECS logs show several related errors:

psycopg2.OperationalError: could not translate host name "abc.efg.us-east-1.rds.amazonaws.com" to address
OSError: [Errno 16] Device or resource busy
<built-in function getaddrinfo>) failed with OSError

We use gunicorn and gevent to serve our app:

gunicorn -t 1000 -k gevent -w 4 -b 0.0.0.0:8000 backend.wsgi
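One way to check whether the failures really are intermittent hostname lookups, rather than database errors, is to resolve the same name repeatedly and count how often it fails. The following is only a minimal sketch: the function name `count_resolution_failures`, the default port 5432, and the `resolve` parameter are assumptions made for illustration, not part of the application. To exercise gevent's resolver rather than the standard library one, it would need to run in a process where gevent's monkey patching is already applied.

```python
import socket
import time


def count_resolution_failures(host, port=5432, attempts=50, delay=0.0, resolve=None):
    """Resolve `host` repeatedly and return (failures, attempts).

    `resolve` is a hypothetical hook: it defaults to socket.getaddrinfo,
    looked up at call time so a monkey-patched version is picked up.
    """
    if resolve is None:
        resolve = socket.getaddrinfo
    failures = 0
    for _ in range(attempts):
        try:
            resolve(host, port)
        except OSError:
            # socket.gaierror is a subclass of OSError, so resolver errors
            # and errno-style failures are both counted here.
            failures += 1
        if delay:
            time.sleep(delay)
    return failures, attempts
```

Calling `count_resolution_failures("abc.efg.us-east-1.rds.amazonaws.com")` from inside the running container and comparing the failure ratio with the observed 500 rate would show whether the two line up.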

Zev
  • You are not saying exactly which nameservers you are using to resolve names. In many cases things improve a lot if you install a local caching resolver on the same box, something as simple as `unbound`, to get more stability and performance when resolving DNS queries, especially if the same names are looked up over and over... – Patrick Mevzek Oct 13 '21 at 18:55
  • We use Route53 to route traffic to a CloudFront distribution so it is awsdns. It should almost be the same ones so a caching resolver makes sense. – Zev Oct 13 '21 at 20:53
  • I am specifically talking about a **recursive** nameserver installed as close as possible (ideally on the same box) to the applications doing DNS calls. From experience, this improves things. Where and what the authoritative nameservers are is irrelevant (until you can prove that the problem is really between recursive and authoritative and not between application and recursive) – Patrick Mevzek Oct 13 '21 at 20:54
  • I guess the answer to your original question would be AmazonProvidedDNS. Thanks for the suggestion. I'll have to dig more into this area and understand it more before modifying anything but I like the sound of the solution you suggested. – Zev Oct 13 '21 at 21:49

1 Answer

With the gevent worker, getaddrinfo calls are handled by gevent's DNS resolver, which is documented here: https://www.gevent.org/dns.html

That documentation says gevent offers four resolvers. For the default one, the native thread-based hostname resolver, it notes that "there have been some reports of long delays, slow performance or even hangs, particularly in long-lived programs that make many, many DNS requests," and it recommends switching resolvers if that happens to you.

We switched the application to the ares resolver by setting the GEVENT_RESOLVER environment variable, and we have not been able to reproduce the issue since:

GEVENT_RESOLVER=ares gunicorn -t 1000 -k gevent -w 4 -b 0.0.0.0:8000 backend.wsgi
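To confirm that the environment variable actually took effect inside a worker, the resolver attached to gevent's hub can be inspected. This is a rough sketch rather than an official recipe: the helper name `active_resolver_name` is made up for illustration, and it assumes that `gevent.get_hub().resolver` returns the resolver instance in use.

```python
def active_resolver_name():
    """Return the dotted class name of gevent's active resolver,
    or None if gevent is not installed in this environment."""
    try:
        import gevent
    except ImportError:
        return None
    # The hub creates its resolver lazily, based on gevent's configuration
    # (which GEVENT_RESOLVER feeds into).
    resolver = gevent.get_hub().resolver
    cls = type(resolver)
    return f"{cls.__module__}.{cls.__qualname__}"
```

Logging this value once at application startup should show a module path containing `ares` when the setting above is in effect, and the thread-based resolver's module otherwise.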

Zev