Last week’s data center outage at Delta dominated the IT news cycle and provided a “ripped from the headlines” example of A) IT resiliency gone awry and B) the link between IT resilience, business continuity, and business impact. (While we don’t yet know the full fiscal impact of this outage on Delta’s business, a similar outage due to a server failure at Southwest Airlines is now reported to have cost $54 million.)
While we don’t have a full view into what went wrong, Tony Baer’s tweet summarizing an excellent ZDNet piece gives us a good picture.
The outage also gave rise to a host of punditry on how the cloud might have saved Delta from its no good, horrible, very bad week. For example, writing in Forbes, Kalev Leetaru extols the benefits of public clouds from Google and Amazon for providing redundancy. He writes:
“In this day and age it is extremely surprising that not a week goes by without news of another major corporate data center outage causing a critical disruption in operations. There is simply no technical excuse not to have geographic redundancy with automated failover. Most major cloud providers like Google and Amazon offer high levels of out-of-the-box redundancy. For example, Google’s standard disk storage writes all data 3.3 times automatically and disk snapshots are globally available, while virtual machines automatically migrate on hardware failure and new machines can be spun up from snapshots into any Google data center worldwide. Numerous offerings by both Google and Amazon and their peers allow applications to automatically span data centers, automatically transferring traffic in case of an outage or other issue.”
While Leetaru makes some very relevant points, his summary misses a much larger and more important one: single-cloud redundancy is not enough to prevent outages when utilizing public cloud assets. That point was underscored by Google Compute Engine’s outage late last week, which proved that even the most robust systems can go down, and take you down with them.
Simply moving assets to cloud infrastructure won’t ensure availability. First, as noted above, relying on a single cloud service provider (or a single CDN for content-centric applications) simply shifts risk from one single-entity point of failure to another: if the cloud or CDN goes out, so does your application. Second, cloud applications introduce new challenges for visibility and control of internet infrastructure (data centers and cloud included), because you no longer directly control how users connect to your applications (or, more accurately, you have less control than you do from a corporate data center).
The simple fact is that many organizations still rely on legacy applications running in corporate data centers. This is almost certainly the case with some of Delta’s core business processes, like scheduling and airport operations. Many organizations are migrating these applications to cloud or hybrid cloud environments, and will be bridging from their on-premises solutions to the cloud for years to come.
The cloud offers many benefits, including strategies to de-risk applications from outages. But simply moving to the cloud without strategies to monitor and manage traffic across multiple clouds, CDNs and data centers won’t solve the problem.
More and more, we’re seeing customers leverage their DNS infrastructure (including advanced DNS management that can detect issues and bottlenecks and proactively steer traffic around them) to support mission-critical hybrid- and multi-cloud environments.
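To make the idea concrete, here is a minimal sketch of the kind of decision an intelligent DNS layer makes on every query: check the health of each endpoint (cloud region, CDN, or corporate data center) and answer with the best one still standing. The endpoint names, health flags, and latency figures below are all illustrative assumptions, not any vendor’s actual API.

```python
# Hypothetical sketch of DNS-based traffic steering across a
# multi-cloud / hybrid pool. All names and numbers are illustrative.

from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str          # e.g. a cloud region, CDN, or corporate DC
    healthy: bool      # result of an active health check
    latency_ms: float  # latency observed by monitoring probes

def steer(endpoints):
    """Return the lowest-latency healthy endpoint, or None if all are down."""
    candidates = [e for e in endpoints if e.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.latency_ms)

pool = [
    Endpoint("cloud-a-us-east", healthy=True, latency_ms=42.0),
    Endpoint("cloud-b-us-central", healthy=False, latency_ms=35.0),  # simulated outage
    Endpoint("corporate-dc", healthy=True, latency_ms=68.0),
]

best = steer(pool)
print(best.name)  # the DNS layer would answer queries with this endpoint
```

Note that when the otherwise-fastest endpoint fails its health check, traffic is steered to the next-best healthy one automatically; only when every endpoint in the pool is down does the application actually go dark, which is the whole argument for a pool spanning more than one provider.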
Enterprises wishing to mitigate risk and get the most out of their internet performance must employ a multi-cloud and/or multi-CDN approach to their hybrid environments, combined with appropriate monitoring and control. Otherwise they face the very real risk of outages and degraded service, with the significant fiscal and customer-goodwill costs those bring.