Event Demonstrates the Importance of “Active” Failover, Monitoring From Multiple Locations, Testing of Failover Configurations, Low DNS TTLs and Implementing Redundancy at All Layers of the Network to Ensure Uptime
At Dyn Inc., we’re always paying close attention when DNS shows up in the news. It frankly doesn’t happen very often, no matter how important we think it is and how much we try to promote that importance. Whether it was Authorize.net’s datacenter fire bringing their disaster recovery strategy to light, Register.com’s DNS service experiencing issues and taking down thousands of sites, OpenDNS staking claim to handling 1% of the Internet’s recursive DNS traffic, Twitter and Baidu getting redirected by “Iranian Cyber Criminals”, Google launching a recursive DNS service and the corresponding Namebench analysis tool, Gap.com suffering a major outage, or Amazon, Newegg, Netflix and Salesforce being hard to reach through Neustar UltraDNS from the west coast the night before Christmas Eve… we don’t miss a beat. We can’t afford to. We strongly believe that everyone should be paying close attention, from providers to companies to individuals, large and small. DNS IS IMPORTANT!
Our entire organization is closely connected in ICANN and IETF circles. Our CTO, Tom Daly, is heavily involved in NANOG and is often described as one of only a few dozen people in the world who understand the inner workings of the Internet, and DNS in particular, so well. His comments on yesterday’s Wikipedia DNS failover struggles are below. One disclaimer we’d like to mention: Wikipedia’s sister company, Wikia Inc. (the for-profit venture), is a customer of our Dynect Platform. We have close ties to the Wikimedia family, and Artur Bergman of Wikia is a good friend of Dyn Inc.
Anyway, on to Tom Daly’s quote:
“This event goes to show the importance of having a robust DNS solution that provides rapidly reconfigurable, low-latency DNS resolution. In this case, having a highly redundant, globally distributed DNS service would have enabled the use of low DNS TTLs, which allow for quick DNS cache population (some providers allow as low as a 30 second TTL). Moreover, using a system that does aggressive validation checking of DNS zone files could have prevented the suspected invalid zone from being loaded into DNS, which caused NXDOMAIN responses to be given for the zone. These NXDOMAIN responses can be cached per local policy, taking many hours to expire in some cases, and possibly could have extended the event.”
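To illustrate why those NXDOMAIN responses can linger so long: under RFC 2308, a resolver caches a negative answer for the lesser of the zone’s SOA record TTL and the SOA MINIMUM field. A minimal sketch of that rule (the values below are hypothetical, not Wikipedia’s actual zone settings):

```python
def negative_cache_ttl(soa_ttl: int, soa_minimum: int) -> int:
    """Per RFC 2308, the time an NXDOMAIN answer may be cached is the
    minimum of the SOA record's own TTL and its MINIMUM (last) field.
    Both values are in seconds."""
    return min(soa_ttl, soa_minimum)

# Hypothetical zone: SOA TTL of 1 hour, SOA MINIMUM of 24 hours.
# An NXDOMAIN for a name in this zone could sit in resolver caches
# for a full hour, even after the zone itself is fixed.
print(negative_cache_ttl(3600, 86400))  # → 3600
```

This is why a bad zone load hurts twice: not only do queries fail while the zone is broken, but resolvers around the world keep serving the cached NXDOMAIN until its negative TTL expires, which can stretch the visible outage well past the actual fix.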
Tom Daly also posted a blog about the event on CircleID. Fascinating insight as usual. http://bit.ly/cKEAx3