66

I haven't changed anything related to the DNS entry for serverfault.com, but some users were reporting today that the serverfault.com DNS fails to resolve for them.

I ran a justping query and I can sort of confirm this -- serverfault.com dns appears to be failing to resolve in a handful of countries, for no particular reason that I can discern. (also confirmed via What's My DNS which does some worldwide pings in a similar fashion, so it's confirmed as an issue by two different sources.)

  • Why would this be happening, if I haven't touched the DNS for serverfault.com ?

  • our registrar is (gag) GoDaddy, and I use default DNS settings for the most part without incident. Am I doing something wrong? Have the gods of DNS forsaken me?

  • is there anything I can do to fix this? Any way to goose the DNS along, or force the DNS to propagate correctly worldwide?

Update: as of Monday at 3:30 am PST, everything looks correct. JustPing reports the site is reachable from all locations. Thank you for the many very informative responses; I learned a lot and will refer to this Q the next time this happens.

Jeff Atwood
  • Jeff, to put your mind at ease - it's definitely not you. It _may_ be GoDaddy, but it's more likely Global Crossing, specifically the router on 204.245.39.50 – Alnitak Jul 19 '09 at 19:30

7 Answers

91

This is not directly a DNS problem, it's a network routing problem between some parts of the internet and the DNS servers for serverfault.com. Since the nameservers can't be reached the domain stops resolving.

As far as I can tell the routing problem is on the (Global Crossing?) router with IP address 204.245.39.50.

As shown by @radius, packets to ns52 (as used by stackoverflow.com) pass from here to 208.109.115.121 and from there work correctly. However packets to ns22 go instead to 208.109.115.201.

Since those two addresses are both in the same /24 and the corresponding BGP announcement is also for a /24 this shouldn't happen.
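A minimal sketch of that check, assuming "those two addresses" means the two name server addresses reported in the traceroutes elsewhere in this thread (208.109.255.11 for ns22 and 208.109.255.26 for ns52) and that the announced prefix is 208.109.255.0/24 as quoted in the comments. It only shows that a single /24 route covers both destinations, so a router holding just that route has no reason to treat them differently.

```python
import ipaddress

# Prefix announced in BGP, as quoted in the comments of this thread.
announced = ipaddress.ip_network("208.109.255.0/24")

# Name server addresses taken from the traceroutes posted in this thread.
name_servers = {
    "ns22.domaincontrol.com": ipaddress.ip_address("208.109.255.11"),
    "ns52.domaincontrol.com": ipaddress.ip_address("208.109.255.26"),
}


def covered_by(prefix, addresses):
    """Return {name: bool} telling whether each address lies inside prefix."""
    return {name: addr in prefix for name, addr in addresses.items()}


result = covered_by(announced, name_servers)
for name, inside in result.items():
    print(f"{name}: {'inside' if inside else 'outside'} {announced}")
```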

I've done traceroutes via my network which ultimately uses MFN Above.net instead of Global Crossing to get to GoDaddy and there's no sign of any routing trickery below the /24 level - both name servers have identical traceroutes from here.

The only times I've ever seen something like this it was broken Cisco Express Forwarding (CEF). This is a hardware level cache used to accelerate packet routing. Unfortunately just occasionally it gets out of sync with the real routing table, and tries to forward packets via the wrong interface. CEF entries can go down to the /32 level even if the underlying routing table entry is for a /24. It's tricky to find these sorts of problems, but once identified they're normally easy to fix.
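To illustrate why a stray /32 entry would produce exactly this symptom, here is a toy longest-prefix-match lookup. It is not how CEF is implemented; the stale /32 entry and the `lookup` helper are hypothetical, and only the addresses and next hops come from the traceroutes in this thread.

```python
import ipaddress


def lookup(table, destination):
    """Longest-prefix match: next hop of the most specific matching entry."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in table.items() if dest in net]
    if not matches:
        return None
    # The entry with the longest prefix length wins.
    _, hop = max(matches, key=lambda item: item[0].prefixlen)
    return hop


# What the real routing table is assumed to hold: one /24 route.
routing_table = {
    ipaddress.ip_network("208.109.255.0/24"): "208.109.115.121",
}

# Hypothetical out-of-sync forwarding cache: the same /24 plus a stale /32.
forwarding_cache = dict(routing_table)
forwarding_cache[ipaddress.ip_network("208.109.255.11/32")] = "208.109.115.201"

for dest in ("208.109.255.11", "208.109.255.26"):
    print(dest,
          "routing table ->", lookup(routing_table, dest),
          "| forwarding cache ->", lookup(forwarding_cache, dest))
```

With the stale entry present, only 208.109.255.11 is sent to the wrong next hop, while its neighbours in the same /24 keep working.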

I've e-mailed GC and also tried to speak to them, but they won't create a ticket for non-customers. If any of you are a customer of GC, please try and report this...

UPDATE at 10:38 UTC As Jeff has noted the problem has now cleared. Traceroutes to both servers mentioned above now go via the 208.109.115.121 next hop.

Alnitak
  • 9
    i wish i could upvote you more. i'm afraid that in the world of outsourcing people can contact the level-1 helldesk of godaddy, which will not understand much of the problem description and even less of the possible explanations... – pQd Jul 20 '09 at 06:55
18

your dns servers for serverfault.com [ ns21.domaincontrol.com, ns22.domaincontrol.com ] are unreachable, and have been for the last ~20h, at least from a couple of major isps in sweden [ telia, tele2, bredband2 ].

at the same time 'neighbor' dns servers for stackoverflow.com & superuser.com [ ns51.domaincontrol.com, ns52.domaincontrol.com ] are reachable.

sample traceroute to ns52.domaincontrol.com:

 1. xxxxxxxxxxx
 2. 83.233.28.193           
 3. 83.233.79.81            
 4. 213.200.72.5            
 5. 64.208.110.129          
 6. 204.245.39.50           
 7. 208.109.115.121         
 8. 208.109.115.162         
 9. 208.109.113.62          
10. 208.109.255.26          

and to ns21.domaincontrol.com

 1. xxxxxxxxxxxx
 2. 83.233.28.193      
 3. 83.233.79.81       
 4. 213.200.72.5       
 5. 64.208.110.129     
 6. 204.245.39.50      
 7. 208.109.115.201    
 8. ???

maybe screwed up filtering / someone triggered some unwanted ddos protection and blacklisted some parts of internet. probably you should contact your dns service provider - go daddy.

you can verify if the problem is [partially] solved by:

  1. checking if godaddy has reacted and changed name servers - eg look up serverfault.com at http://www.squish.net/dnscheck/ using record type: ANY
  2. check if provided name servers respond to ping [not very scientific since name servers can work fine and still block icmp, but in this case it seems that icmp is allowed to other servers ] from telia via looking glass.
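Since ping only tells you about icmp, the second check can also be done by sending a real DNS query straight to each name server. The sketch below is one way to do that with the Python standard library only; `build_query` and `probe` are hypothetical helper names, and it assumes plain UDP on port 53 with no retries.

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (qtype 1 = A record, class IN)."""
    # Header: id, flags (0 = standard query, recursion not requested),
    # one question, no answer/authority/additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def probe(server_ip, name, timeout=3.0):
    """Send one UDP query to server_ip; return the RCODE, or None if no reply."""
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable networks
            return None
    if len(reply) < 12:  # shorter than a DNS header, ignore it
        return None
    flags = struct.unpack("!H", reply[2:4])[0]
    return flags & 0x000F  # low four bits of the flag word are the RCODE
```

For example, `probe("208.109.255.11", "serverfault.com")` would return `None` from an affected network and `0` (NOERROR) from a working one, using the ns22 address shown in the traceroutes in this thread.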

edit: traceroutes from working places

poland

 1. xxxxxxxxxxxxxxx
 2. 153.19.40.254               
 3. ???
 4. 153.19.254.236              
 5. 212.191.224.205             
 6. 213.248.83.129              
 7. 80.91.254.171               
 8. 80.91.249.105               
    80.91.251.230
    80.91.254.93
    80.91.251.52
 9. 213.248.89.182              
10. 204.245.39.50               
11. 208.109.115.121             
12. 208.109.115.162             
13. 208.109.113.62              
14. 208.109.255.26              

germany

 1. xxxxxxxxxxxx
 2. 89.149.218.181       
 3. 89.149.218.2         
 4. 134.222.105.249      
 5. 134.222.231.205      
 6. 134.222.227.146      
 7. 80.81.194.26         
 8. 64.125.24.6          
 9. 64.125.31.249        
10. 64.125.27.165        
11. 64.125.26.178        
12. 64.125.26.242        
13. 209.249.175.170      
14. 208.109.113.58       
15. 208.109.255.26       

edit: all works fine now indeed.

pQd
  • yes, it's definitely an external problem, apparently localised to europe. – Alnitak Jul 19 '09 at 11:39
  • It doesn't appear to be all of europe. Eircom broadband lines (for example) resolve serverfault.com fine. – Cian Jul 19 '09 at 11:42
  • @Alnitak: it is not affecting the whole of europe - that's for sure. i can reach those name servers from bredbandsbolaget in sweden, multiple isps in poland and germany. – pQd Jul 19 '09 at 11:55
  • While Eircom had some serious trouble for their customers the past two weeks, with poisoned DNS: http://www.siliconrepublic.com/news/article/13448/cio/eircom-reveals-cache-poisoning-attack-by-hacker-led-to-outages – Arjan Jul 19 '09 at 11:59
  • Eircom may have had DNS issues lately, however they haven't had any issues getting to serverfault. Also, even when they were having problems, you could resolve things from !eircom DNS servers – Cian Jul 19 '09 at 12:42
  • 2
    last time I saw a problem like this it was a CEF table corruption on a Cisco router. Some hosts were reachable and others weren't, even though they were in the same /24 subnet. That it's only certain ISPs affected only suggests that those ISPs have some common supplier. From a working connection it's not easy to find out why. – Alnitak Jul 19 '09 at 19:01
16

My suggestions: as explained by Alnitak, the problem is not DNS but routing (probably BGP). The fact that nothing was changed in the DNS setup is normal, since the problem was not in the DNS.

serverfault.com today has a very poor DNS setup, certainly insufficient for an important site like this:

  • only two name servers
  • all the eggs in the same basket (both are in the same AS)

We've just seen the result: a routing glitch (something which is quite common on the Internet) is sufficient to make serverfault.com disappear for some users (depending on their operators, not on their countries).

I suggest adding more name servers, located in other ASes, for resilience against failures. You can either rent them from private companies or ask serverfault users to offer secondary DNS hosting (maybe only if the user has > 1000 rep :-)

bortzmeyer
  • 1
    zoneedit.com provides free DNS hosting. I have used it for years and never had any problem with it. – radius Jul 24 '09 at 05:13
3

I can confirm that NS21.DOMAINCONTROL.COM and NS22.DOMAINCONTROL.COM are also unreachable from the ISP Free.fr in France.
Like pQd's traceroute, mine also ends after 208.109.115.201 for both ns21 and ns22.

traceroute to NS22.DOMAINCONTROL.COM (208.109.255.11), 64 hops max, 40 byte packets
 1  x.x.x.x (x.x.x.x)  2.526 ms  0.799 ms  0.798 ms
 2  78.224.126.254 (78.224.126.254)  6.313 ms  6.063 ms  6.589 ms
 3  213.228.5.254 (213.228.5.254)  6.099 ms  6.776 ms *
 4  212.27.50.170 (212.27.50.170)  6.943 ms  6.866 ms  6.842 ms
 5  212.27.50.190 (212.27.50.190)  8.308 ms  6.641 ms  6.866 ms
 6  212.27.38.226 (212.27.38.226)  68.660 ms  185.527 ms  14.123 ms
 7  204.245.39.50 (204.245.39.50)  48.544 ms  19.391 ms  19.753 ms
 8  208.109.115.201 (208.109.115.201)  19.315 ms  19.668 ms  34.110 ms
 9  * * *
10  * * *
11  * * *
12  * * *

But ns52.domaincontrol.com (208.109.255.26) does work, and it is in the same subnet as ns22.domaincontrol.com (208.109.255.11).

traceroute to ns52.domaincontrol.com (208.109.255.26), 64 hops max, 40 byte packets
 1  x.x.x.x (x.x.x.x)  1.229 ms  0.816 ms  0.808 ms
 2  78.224.126.254 (78.224.126.254)  12.127 ms  5.623 ms  6.068 ms
 3  * * *
 4  212.27.50.170 (212.27.50.170)  13.824 ms  6.683 ms  6.828 ms
 5  212.27.50.190 (212.27.50.190)  6.962 ms *  7.085 ms
 6  212.27.38.226 (212.27.38.226)  35.379 ms  7.105 ms  7.830 ms
 7  204.245.39.50 (204.245.39.50)  19.896 ms  19.426 ms  19.355 ms
 8  208.109.115.121 (208.109.115.121)  37.931 ms  19.665 ms  19.814 ms
 9  208.109.115.162 (208.109.115.162)  19.663 ms  19.395 ms  29.670 ms
10  208.109.113.62 (208.109.113.62)  19.398 ms  19.220 ms  19.158 ms
11  * * *
12  * * *
13  * * *

As you can see, this time after 204.245.39.50 we go to 208.109.115.121 instead of 208.109.115.201. And pQd has the same traceroute. From a working place I did not cross this 204.245.39.50 router (Global Crossing).
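The comparison can be done mechanically when you have many traces to go through. This is a small sketch with a hypothetical helper, `first_divergence`, fed with hops 4 to 8 of the two traceroutes above (hop 3 is left out because it timed out in one of them).

```python
def first_divergence(path_a, path_b):
    """Return (position, hop_a, hop_b) for the first differing hop, else None."""
    for position, (hop_a, hop_b) in enumerate(zip(path_a, path_b), start=1):
        if hop_a != hop_b:
            return position, hop_a, hop_b
    return None


# Hops 4-8 of the failing trace to ns22 and the working trace to ns52.
to_ns22 = ["212.27.50.170", "212.27.50.190", "212.27.38.226",
           "204.245.39.50", "208.109.115.201"]
to_ns52 = ["212.27.50.170", "212.27.50.190", "212.27.38.226",
           "204.245.39.50", "208.109.115.121"]

hit = first_divergence(to_ns22, to_ns52)
last_common = None
if hit:
    position, hop_a, hop_b = hit
    # position is 1-based, so the last shared hop sits at list index position - 2.
    last_common = to_ns22[position - 2] if position > 1 else None
    print(f"paths split after {last_common}: {hop_a} vs {hop_b}")
```

The last hop the two paths share is the router making the differing forwarding decision, which is where to look first.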

More traceroutes from working and non-working places would help, but it's highly probable that Global Crossing has a bogus routing entry for 208.109.255.11/32 and 216.69.185.11/32, since 208.109.255.10, 208.109.255.12, 216.69.185.10 and 216.69.185.12 are working well.

Why it has a bogus routing entry is hard to know. Probably 208.109.115.201 (Go Daddy) is advertising a non-working route for 208.109.255.11/32 and 216.69.185.11/32.

EDIT: You can telnet route-server.eu.gblx.net to connect to the Global Crossing route server and do traceroute from within Global Crossing network

EDIT: It seems that the same problem already occurred with other name servers a few days ago, see: http://www.newtondynamics.com/forum/viewtopic.php?f=9&t=5277&start=0

radius
  • i doubt you can advertise [ via bgp ] anything smaller than /24 or even /23. i'd rather bet on filtering than a routing glitch. – pQd Jul 19 '09 at 12:08
  • Right, but 204.245.39.50 could be a dedicated router between Go Daddy and Global Crossing. It may accept any route from Go Daddy, but the upstream routers inside Global Crossing will only route the /24 (in the BGP tables 208.109.255.0 is advertised as a /24). Go Daddy could also advertise all hosts as /32 and the Global Crossing router could aggregate them into a /24 for BGP redistribution – radius Jul 19 '09 at 12:10
  • (But I agree that would be a bit ugly) – radius Jul 19 '09 at 12:19
  • 1
    I'd bet on CEF table corruption... – Alnitak Jul 19 '09 at 19:22
2

What would be handy would be to see a detailed resolution trace from the locations that are failing... see what layer of the resolution path it's failing on. I'm not familiar with the service you're using, but perhaps it's an option somewhere.

Failing that, it's most likely that the problems are "lower down" in the tree, as failures at the root or TLDs would affect more domains (you'd hope). To increase resilience, you can delegate to a second DNS service to ensure better redundancy in resolution if there are problems with domaincontrol's network(s).

womble
2

I'm surprised you don't host your own DNS. The advantage of doing it that way is if the DNS is reachable, so is (hopefully) your site.

Paul Tomblin
  • 1
    well.. it's nice not to put all the eggs in one basket. probably there's more to it than just web hosting - maybe mail services? dns is quite nice from a resiliency perspective. probably best is to put primary dns at provider #1 and 2ndary dns server(s) at other provider(s). as long as any of them is reachable - the end user will be able to resolve. – pQd Jul 19 '09 at 14:22
  • 1
    I self-host but list the ISP's DNS servers as primaries, even though they really are secondaries. Yes, this is very naughty, and I fully expect to hear howls of complaints...but the upshot of it is, we get the full control of self-hosted DNS with the redundancy of Qwest DNS servers. The TTL for records are high enough that if we can't figure out how to fix a problem in 3 days, then there are bigger issues than just a broken DNS setup. Oh, and @Paul, +1 for pointing out self-hosting as The Original Option in a time of "outsource everything because we can". – Avery Payne Jul 20 '09 at 03:03
1

From UPC at least, I get this response when trying to get your A record from your authoritative server (ns21.domaincontrol.com).

; <<>> DiG 9.5.1-P2 <<>> @ns21.domaincontrol.com serverfault.com
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 38663
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;serverfault.com.       IN  A

;; Query time: 23 msec
;; SERVER: 216.69.185.11#53(216.69.185.11)
;; WHEN: Sun Jul 19 12:09:40 2009
;; MSG SIZE  rcvd: 33

When I try the same thing from a machine on a different network (OVH), I get an answer

; <<>> DiG 9.4.2-P2 <<>> @216.69.185.11 serverfault.com
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33998
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 0

;; QUESTION SECTION:
;serverfault.com.               IN      A

;; ANSWER SECTION:
serverfault.com.        3600    IN      A       69.59.196.212

;; AUTHORITY SECTION:
serverfault.com.        3600    IN      NS      ns21.domaincontrol.com.
serverfault.com.        3600    IN      NS      ns22.domaincontrol.com.

;; Query time: 83 msec
;; SERVER: 216.69.185.11#53(216.69.185.11)
;; WHEN: Sun Jul 19 12:11:05 2009
;; MSG SIZE  rcvd: 101

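I get similar behaviour for a couple of other domains, so I assume that UPC (at least) is silently redirecting DNS queries to their own caching nameserver, and spoofing the replies. The failing reply has the ra flag set and no aa flag, which is what a recursive resolver sends rather than an authoritative server. If your DNS had misbehaved briefly, this could explain it, as UPC's nameservers may be caching the failed response (the output above shows SERVFAIL).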
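The difference between the two replies is visible in the header flags alone. As a rough sketch, the following decodes the 16-bit flag word of a DNS header; the two example values are reconstructed by hand from the flags and status that dig printed above, not captured from the wire, and `decode_flags` is a hypothetical helper.

```python
# Bit positions of the header flags, per the DNS message format (RFC 1035).
FLAG_BITS = {"qr": 0x8000, "aa": 0x0400, "tc": 0x0200, "rd": 0x0100, "ra": 0x0080}
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def decode_flags(word):
    """Return (list of set flag names, rcode name) for a DNS header flag word."""
    flags = [name for name, bit in FLAG_BITS.items() if word & bit]
    rcode = word & 0x000F
    return flags, RCODES.get(rcode, str(rcode))


def looks_authoritative(word):
    """True if the aa (authoritative answer) bit is set."""
    return bool(word & FLAG_BITS["aa"])


upc_reply = 0x8182  # reconstructed: qr rd ra, status SERVFAIL
ovh_reply = 0x8400  # reconstructed: qr aa, status NOERROR

for label, word in (("UPC", upc_reply), ("OVH", ovh_reply)):
    flags, rcode = decode_flags(word)
    print(label, " ".join(flags), rcode,
          "authoritative" if looks_authoritative(word) else "not authoritative")
```

A reply that claims to come from ns21 but carries ra without aa is a hint that something in between answered on its behalf.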

Cian