Disaster recovery is tricky business. On one hand, you’re never sufficiently prepared for any and all disasters that can bring down your IT infrastructure, and you live in a constant state of contingency planning. On the other hand, every dollar and every hour you invest in preparing your organization for the worst is one less dollar and one less hour invested in everything else your organization needs to thrive.
At Dyn Inc., we’re trying our best to help our clients find a perfect balance in the middle for their needs. One way we do this is to perform a post-mortem analysis of disasters that have actually occurred, and hypothesize the “What if this circumstance happened to you?” scenario. Here, we’ll aggregate some of the public, albeit largely unverified, information on the Authorize.Net outage, and analyze some of the disaster recovery strategies they got right, and some that could use improvement.
On July 2, 2009 at 11:15pm PST, a fire in Fisher Plaza, Seattle, Washington caused a massive power outage for the primary datacenter hosting the servers for Authorize.Net (as well as many others, including Microsoft’s Bing Travel). From my firsthand experience, the https://secure.authorize.net interface used by DynDNS.com became available again around 10:30am PST on July 3, 2009, and was being serviced by a datacenter in San Jose, California. Accounting for time zone differences, that’s a little more than 11 hours of downtime.
For many organizations, if their one and only datacenter goes offline, there isn’t much they can do but wait. Authorize.Net, however, had a fully redundant backup facility in San Jose, California. The question then remains: If a fully redundant facility was in place, why 11 hours of downtime, instead of 11 seconds?
The answer, I believe we’ll find, is in how traffic and application services are managed between two or more geographically distinct datacenters.
Load Balancing and Failover
Load balancing and failover are pretty well understood concepts within the confines of a LAN within a single datacenter. As shown below, multiple servers are connected to multiple load balancers, which are in turn connected to the Internet (switching, routing, and multiple upstream transit connections are over-simplified for the moment, but not to be ignored in actual implementation). Should one of the servers suffer a failure, the load balancers are capable of detecting the failure, and removing it from the pool of servers that are responding to incoming requests. Should one of the load balancers fail, the redundant load balancer is capable of detecting the failure, and taking over. In either case, failover occurs on a scale of seconds (in many cases, milliseconds).
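To make the server-failure case concrete, here is a minimal sketch of what a load balancer does with its pool. The class name, the server names, and the `is_healthy` probe are all hypothetical stand-ins; a real load balancer would probe with something like a TCP connect or an HTTP request.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ServerPool:
    """Round-robin pool that skips servers failing a health check."""

    servers: List[str]
    is_healthy: Callable[[str], bool]  # hypothetical probe supplied by the caller
    _next: int = field(default=0, repr=False)

    def healthy_servers(self) -> List[str]:
        # Re-evaluate health on every call so a failed server drops out at once.
        return [s for s in self.servers if self.is_healthy(s)]

    def pick(self) -> str:
        """Return the next healthy server, rotating through the pool."""
        candidates = self.healthy_servers()
        if not candidates:
            raise RuntimeError("no healthy servers in pool")
        server = candidates[self._next % len(candidates)]
        self._next += 1
        return server


# Illustration only: pretend "app2" has just failed its health check.
down = {"app2"}
pool = ServerPool(["app1", "app2", "app3"], lambda s: s not in down)
picks = [pool.pick() for _ in range(4)]
```

In this sketch, requests keep flowing to the remaining servers without anyone touching DNS or routing, which is why failover inside a single datacenter can be so fast.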
This strategy is successfully deployed by thousands of network operators, including Dyn Inc., to mitigate localized failures within a datacenter. Can we take the same concepts and apply them to failures consuming an entire datacenter? That becomes a little more complicated.
The simplest way to provide load balancing and failover services between datacenters is to use DNS. Each datacenter is provided with its own address space, and propagates its own routes independently to the global Internet. When an end-user requests the IP address for a host such as secure.authorize.net (much like our application servers for DynDNS.com do when they process credit card transactions), the DNS servers can respond with the IP addresses corresponding to either Datacenter 1 or Datacenter 2.
For a load balancing scenario under nominal conditions, the DNS servers can respond with the IP addresses for Datacenter 1 50% of the time, and the IP addresses for Datacenter 2 50% of the time, thereby sending some traffic to both facilities, and balancing load between them (much like the load balancing equipment did in the previous diagram). Under a failover scenario, should Datacenter 1 go down, only the IP addresses for Datacenter 2 are returned as answers by the DNS servers, sending all traffic to the functioning Datacenter 2. For our Dynect Platform customers, we call this global server load balancing and active failover capability our Dynect Traffic Management service.
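The decision a DNS server makes in both scenarios can be sketched in a few lines. This is only an illustration under stated assumptions, not how any particular DNS platform is implemented: the function name, the datacenter labels, and the addresses (taken from the reserved documentation ranges) are hypothetical, and the health flags are assumed to come from external monitoring.

```python
import random
from typing import Dict, List, Optional


def dns_answer(
    datacenters: Dict[str, List[str]],
    healthy: Dict[str, bool],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick the A records to return for one query.

    Healthy datacenters share queries evenly; an unhealthy datacenter is
    never returned as long as at least one healthy one remains.
    """
    rng = rng or random.Random()
    live = [name for name in datacenters if healthy.get(name, False)]
    if not live:
        # Nothing is known to be healthy: fall back to all sites rather
        # than returning an empty answer.
        live = list(datacenters)
    return datacenters[rng.choice(live)]


# Hypothetical addresses from the documentation ranges.
sites = {
    "datacenter1": ["192.0.2.10", "192.0.2.11"],
    "datacenter2": ["198.51.100.10", "198.51.100.11"],
}

# Nominal conditions: roughly half of the queries go to each facility.
nominal = dns_answer(sites, {"datacenter1": True, "datacenter2": True})

# Failover: Datacenter 1 is down, so every answer points at Datacenter 2.
failover = dns_answer(sites, {"datacenter1": False, "datacenter2": True})
```

Note that this only helps if clients actually ask again soon after the failure, which depends on the TTL attached to each answer.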
Why Did This Not Work for Authorize.Net?
The authoritative DNS servers for secure.authorize.net are as follows:
NS-A.PNAP.NET 188.8.131.52
NS-B.PNAP.NET 184.108.40.206
NS-C.PNAP.NET 220.127.116.11
NS-D.PNAP.NET 18.104.22.168
Requesting the IP address for secure.authorize.net from one of the authoritative DNS servers shows the following:
;; QUESTION SECTION:
;secure.authorize.net. IN A

;; ANSWER SECTION:
secure.authorize.net. 86400 IN A 22.214.171.124
secure.authorize.net. 86400 IN A 126.96.36.199
The answer section provides two IP addresses for the client to use in a traditional DNS round robin fashion (22.214.171.124 and 126.96.36.199), as well as a time to live, or TTL, value of 86,400 seconds. This TTL value instructs any client, as well as any recursive DNS server performing a lookup on behalf of a client, to cache and re-use this answer for 86,400 seconds (24 hours).
Had the administrators at Authorize.Net updated DNS records to point to their facility in San Jose, California, it would have taken up to 24 additional hours for the clients and recursive DNS servers around the world to expire the cached answers pointing to Seattle, and request new IP addresses for secure.authorize.net that would point to the functioning San Jose facility.
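The arithmetic behind that delay is simple enough to sketch. The function name and the idea of tracking a cached answer's age are illustrative assumptions; the only figure taken from the lookup above is the 86,400 second TTL.

```python
def stale_seconds(ttl: int, age_at_change: int) -> int:
    """Seconds after a DNS record change that one resolver keeps serving
    the old answer, given how long ago it cached that answer."""
    if not 0 <= age_at_change <= ttl:
        raise ValueError("age_at_change must be between 0 and ttl")
    return ttl - age_at_change


AUTHORIZE_NET_TTL = 86_400  # seconds, from the answer section above

# A resolver that cached the answer just before the change waits a full TTL.
worst_case_hours = stale_seconds(AUTHORIZE_NET_TTL, 0) / 3600

# A resolver that cached it 20 hours earlier recovers after 4 more hours.
typical_hours = stale_seconds(AUTHORIZE_NET_TTL, 20 * 3600) / 3600

# With a hypothetical 30 second TTL, no resolver is stale for more than 30 s.
short_ttl_worst_case = stale_seconds(30, 0)
```

Because resolvers cache at different moments, recovery would trickle in over the whole TTL window rather than arrive all at once.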
In this scenario, when it is not feasible to change the addresses that secure.authorize.net resolves to, the only practical option is to alter the advertised routes for the network containing 22.214.171.124 and 126.96.36.199. Quite literally, from a network perspective, this is moving the IP addresses from one datacenter to the other.
There are a few reasons why this is difficult to do, and I suspect Authorize.Net had to overcome each of them. I must disclose, however, that the following is speculation on my part:
First, moving the network addresses from one datacenter to the other involves coordination with the colocation facility and upstream transit providers. This can be problematic for two reasons:
1. Many ISPs filter the IP address routing announcements made by their customers to prevent malicious activities such as prefix hijacking. These filters are often manually maintained, so it can take time for ISPs to reconfigure them. This could have slowed down the process of moving IP addresses from the primary datacenter to the backup datacenter.
2. Reaching colocation and ISP support personnel during a disaster in the middle of the night, especially leading up to a holiday weekend, can be difficult.
Second, in this scenario, the failover facility in San Jose went from receiving none of the traffic trying to reach secure.authorize.net to all of the traffic in a relatively short period of time. Needless to say, this is a highly stressful condition for the network equipment and servers in San Jose that may have induced additional failures, each requiring their own resolution.
Third, this scenario requires the backup datacenter to be fully up and running, with a suitable production configuration to take over responsibility from the primary datacenter. This includes not only network configuration and server resources, but also application configuration and customer data. Judging by the Authorize.Net Twitter feed, services slowly came back online one by one, suggesting that the operations team was forced to manually bring each service online and restore from offline backups. As we all know, this is a very time-consuming process.
Customer Relations in the Face of Disaster
Knowing full well that primary communications with customers were impacted, the Authorize.Net folks leveraged Twitter to communicate with customers quickly and effectively. Around 8am EST on July 3, 2009, the Authorize.Net Twitter account at http://twitter.com/AuthorizeNet came online. The folks running this Twitter feed kept thousands of eager system administrators and online store front owners up to date on what was happening, and gave calming reassurance that yes, there was a problem, and yes, their team was doing everything possible to rectify it. This was, in fact, how our operations team learned about what was going on, and learned an expected time for services to be restored. Whoever set this up and managed to keep it up to date during the disaster as well as the entire holiday weekend certainly deserves commendation.
As introduced, the purpose of this article is to document some lessons learned for the benefit of our own operations, the operations of our customers, as well as for other administrators preparing their disaster recovery plans.
Lesson One: DNS TTL
Your domain name servers are powerful tools for diverting traffic from disaster-affected datacenters. They are, however, virtually useless in this regard when the affected records are returned with a large TTL, causing resolvers to cache answers for extended periods. Of course, the lower the TTL, the more load you will be placing on your DNS servers (i.e., since resolvers are no longer caching the answers as long, they will need to query the authoritative servers more frequently).
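A rough upper bound shows the shape of this tradeoff. The sketch assumes each caching resolver re-queries a busy name at most once per TTL; the function name and the figure of 10,000 resolvers are hypothetical and chosen only for illustration.

```python
SECONDS_PER_DAY = 86_400


def max_queries_per_day(ttl_seconds: int, resolvers: int) -> int:
    """Upper bound on daily queries for one record, assuming every
    resolver refreshes its cache as soon as the TTL expires."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    refreshes = -(-SECONDS_PER_DAY // ttl_seconds)  # ceiling division
    return resolvers * refreshes


# Hypothetical population of 10,000 caching resolvers.
with_24h_ttl = max_queries_per_day(86_400, 10_000)  # one refresh each per day
with_30s_ttl = max_queries_per_day(30, 10_000)      # 2,880 refreshes each per day
```

Under these assumptions, moving from a 24 hour TTL to a 30 second TTL multiplies the worst-case query volume by 2,880, which is the price paid for failover on the order of seconds.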
If you’re currently using DNS provided by your ISP, stop! They likely won’t tolerate you setting the TTL to something low enough to warrant rapid failover (preferably, on the order of seconds). If you’re currently running your own name servers in your datacenters alongside application servers, stop! In the event your datacenter goes offline, and your name servers were hosted there, this powerful failure mitigation tool will be lost to you. Of course, we specialize in managed external DNS, but we’re not the only ones. If your ISP or in-house DNS infrastructure won’t meet your failover needs, there are plenty of providers to help you, and it is less expensive than you may think.
Lesson Two: Know Your Failover Strategy, and Make Certain It Will Work
There are two schools of thought when it comes to active failover between two facilities. One school advocates one hot facility (i.e., a primary datacenter) and one cold facility (i.e., a secondary datacenter). During nominal operations, the primary datacenter receives 100% of the traffic, and the secondary receives 0%; during failover, 100% of the traffic is directed to the secondary facility. The second school advocates two load balanced facilities each operating at less than 50% of capacity; should one fail, the second facility will (theoretically) be able to handle the load redirected from the first facility.
Unfortunately, the worst possible time to test your strategy is during an actual disaster. If you have one hot and one cold facility, you must make certain your cold facility can be reliably brought online to handle the traffic redirected from the hot facility. This means being diligent in rolling patches out to “that other site,” replacing failed drives in RAID arrays promptly even though the systems are for the most part idle, and monitoring secondary systems as rigorously as your primary systems. If you have two facilities actively sharing nominal load, you must make certain that one facility would have the resources to stand on its own. This means understanding what your bottleneck is (i.e., what does less than 50% of capacity actually mean? Is it transit? Is it CPU? Is it storage? Is it cooling? A combination?), closely monitoring it, and expanding capacity when utilization crosses 50% (preferably before it reaches that point).
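For the two-facility case, the check can be sketched as follows. This assumes both facilities are sized identically and that utilization is tracked per resource as a fraction of one facility's capacity; the function names, resource names, and figures are all hypothetical.

```python
from typing import Dict, Tuple


def failover_bottleneck(
    site_a: Dict[str, float], site_b: Dict[str, float]
) -> Tuple[str, float]:
    """Return the resource that would be most heavily loaded, and its
    utilization, if one facility had to carry both facilities' load."""
    if site_a.keys() != site_b.keys():
        raise ValueError("both sites must report the same resources")
    combined = {name: site_a[name] + site_b[name] for name in site_a}
    worst = max(combined, key=combined.get)
    return worst, combined[worst]


def can_stand_alone(
    site_a: Dict[str, float], site_b: Dict[str, float], limit: float = 1.0
) -> bool:
    """True if a single facility could absorb the combined load."""
    _, utilization = failover_bottleneck(site_a, site_b)
    return utilization <= limit


# Hypothetical utilization, as a fraction of one facility's capacity.
seattle = {"transit": 0.30, "cpu": 0.45, "storage": 0.40}
san_jose = {"transit": 0.35, "cpu": 0.60, "storage": 0.35}

resource, load = failover_bottleneck(seattle, san_jose)
ok = can_stand_alone(seattle, san_jose)
```

In this made-up example transit and storage look comfortable, but CPU would exceed a single facility's capacity after failover, so CPU is the resource to watch and expand first.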
Lesson Three: When Disaster Strikes, Give WWW More Than a Browser Error
When Authorize.Net went down, the best source of information became their Twitter feed. As useful as this was for those of us on Twitter, including @DynInc, @tomdyninc, and @cvonwallenstein, surely there were many more folks without knowledge of Twitter who were left in the dark. The solution? Redirect your main site URL to an offsite facility showing a status page for you to disseminate facts, before your confused and frustrated customers make up rumors and fiction! According to Tom Daly, CTO of Dynamic Network Services Inc., “Showing real status information on a failover home page instead of a ‘Page cannot be displayed’ browser error retains the confidence and trust you have built with your customer, even in the face of disaster. Just showing your logo and a description of the problem is enough to let customers know that you care.”
Of course, this is only going to be possible for you if you have a low TTL value for the www record on your name servers, and you will need an easy to use method of configuring your name servers for 1) the failure page when disaster first strikes, and 2) the normal web servers when service is once again restored.
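The two states can be captured in one small function. This is a sketch only: the function name, the 30 second TTL, and the addresses (from the reserved documentation ranges) are hypothetical, and how the resulting records get pushed to your name servers depends entirely on your DNS provider.

```python
from typing import List


def www_records(
    primary_up: bool, web_ips: List[str], status_ip: str, ttl: int = 30
) -> List[str]:
    """Build zone-file style A records for www: the normal web servers
    when the primary site is up, the offsite status page when it is not."""
    targets = web_ips if primary_up else [status_ip]
    return [f"www {ttl} IN A {ip}" for ip in targets]


WEB_SERVERS = ["192.0.2.10", "192.0.2.11"]  # hypothetical production hosts
STATUS_PAGE = "203.0.113.5"                 # hypothetical offsite status host

during_disaster = www_records(False, WEB_SERVERS, STATUS_PAGE)
after_recovery = www_records(True, WEB_SERVERS, STATUS_PAGE)
```

Keeping both states behind one switch means the same simple action handles the move to the failure page and the move back to normal service.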
Our status page can always be found at http://status.dyn-inc.com, and it is hosted in a separate facility from our other web servers.
Disaster can rarely be predicted, but with some forethought and planning, cost-effective tools are available to automatically fail over to a backup facility in the event of a disaster, and if nothing else, to facilitate honest and frequent communication with affected users. We will continue to be customers of Authorize.Net, and we hope they continue to invest in a reliable IT infrastructure. We know they will emerge from this disaster stronger than before, and we’re more than happy to help them along the way.