
Detecting and Mitigating a Cloud Outage: Internet Performance Management in Action


Solution Sales Engineer

Given Dyn’s global visibility into real-time internet conditions, we are reminded all too often that, from frequent and rising DDoS attacks to cloud service provider outages, the internet is a volatile place. We also know that even the best providers experience issues. Case in point: on Friday, August 5, 2016, the Dyn Internet Performance Management (IPM) platform detected an outage of Google Compute Engine (GCE) starting at 3:54 AM ET and ending at 5:35 AM ET, a span of 1 hour and 41 minutes (we also saw intermittent internet instability in the hours leading up to the event). Google confirmed the outage (screenshot below).

Google Cloud Status Dashboard Confirms August 5th Outage
Google cloud status dashboard

If you were a user of the Dyn Internet Intelligence module, a dashboard alert on your NOC display would have told you that the GCE outage had occurred (see below). The red bar at the top of the dashboard shows availability during the outage, while the detail below it shows 100% packet loss during the incident. Dyn users would also have been aware of the incident nearly an hour before the first post on the Google Cloud Status Dashboard: the incident timestamps Google reports are consistent with Dyn’s view (in ET), but the Google Dashboard post appeared 50 minutes after the incident began.

Dyn Internet Intelligence Dashboard
Dyn dashboard
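
The dashboard reading above reduces to two simple calculations: packet loss per measurement interval, and the span of time during which loss stayed at 100%. Here is a minimal Python sketch of both. The ProbeSample structure, the outage_windows helper, and the packet counts are hypothetical illustrations, not Dyn's data model or detection logic; only the start and end times come from the incident described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class ProbeSample:
    """One measurement interval (hypothetical structure, not Dyn's schema)."""
    timestamp: datetime
    sent: int
    received: int

    @property
    def loss_pct(self) -> float:
        if self.sent <= 0:
            raise ValueError("sent must be positive")
        return 100.0 * (self.sent - self.received) / self.sent


def outage_windows(
    samples: List[ProbeSample], threshold_pct: float = 100.0
) -> List[Tuple[datetime, Optional[datetime]]]:
    """Return (start, end) pairs for runs of samples at or above the threshold.

    start is the first lossy sample of a run; end is the first later sample
    below the threshold, or None if the run never recovers.
    """
    windows: List[Tuple[datetime, Optional[datetime]]] = []
    start: Optional[datetime] = None
    for sample in sorted(samples, key=lambda s: s.timestamp):
        lossy = sample.loss_pct >= threshold_pct
        if lossy and start is None:
            start = sample.timestamp
        elif not lossy and start is not None:
            windows.append((start, sample.timestamp))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


# Incident times from the post (ET); packet counts are made up for illustration.
incident_start = datetime(2016, 8, 5, 3, 54)
incident_end = datetime(2016, 8, 5, 5, 35)
samples = [
    ProbeSample(incident_start - timedelta(minutes=1), sent=10, received=10),
    ProbeSample(incident_start, sent=10, received=0),
    ProbeSample(incident_end - timedelta(minutes=1), sent=10, received=0),
    ProbeSample(incident_end, sent=10, received=10),
]
windows = outage_windows(samples)
duration = windows[0][1] - windows[0][0]
print(windows[0], duration)
```

With these inputs the single detected window runs from 3:54 to 5:35, a duration of 1 hour 41 minutes.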

In fact, Dyn saw similar behavior across all GCE POPs worldwide at the same time. We’ve included snapshots from the Dyn platform showing availability at four different GCE locations during the incident. In other words, the incident was not isolated to a single POP.

Region data
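
To make the "not isolated to a single POP" conclusion concrete, here is a rough Python sketch that compares availability across locations at one point in time. The location labels, the availability values, and the 1% cutoff are all assumptions chosen for illustration; they are not taken from the Dyn platform.

```python
from typing import Dict


def classify_scope(
    availability_by_location: Dict[str, float], down_below_pct: float = 1.0
) -> str:
    """Classify an incident by how many locations are effectively unreachable.

    A location counts as down when its availability percentage falls below
    down_below_pct (an assumed cutoff, not a Dyn setting).
    """
    if not availability_by_location:
        raise ValueError("need at least one location")
    down = [
        loc for loc, pct in availability_by_location.items() if pct < down_below_pct
    ]
    if not down:
        return "healthy"
    if len(down) == len(availability_by_location):
        return "all locations down"
    return "partial outage"


# Hypothetical snapshot of four locations during the incident.
snapshot = {
    "location-a": 0.0,
    "location-b": 0.0,
    "location-c": 0.0,
    "location-d": 0.0,
}
print(classify_scope(snapshot))
```

When every monitored location reports as down at the same moment, the problem most likely sits with the provider rather than with a single POP or network path.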

This kind of information is clearly valuable for root-cause analysis of the application performance degradation resulting from this outage (and for explaining the support tickets piling up). As we noted in a recent post, knowing about the outage is a step in the right direction, but “monitoring without the ability to take action is likely to cause some serious stress.” That’s because visibility alone doesn’t solve the problem of not being able to serve your customers.

However, if you are in a multi-cloud or hybrid environment and can serve from an instance outside of GCE, Dyn’s DNS-based traffic steering would allow you to fail over to a known responding endpoint, either by moving traffic manually or by using Dynamic Steering to shift traffic automatically. Imagine receiving the Google alert but also receiving a notification from Dyn that traffic had been re-routed to a healthy endpoint and users were still being served. This kind of mitigation not only provides business continuity for online services that depend on assets hosted on GCE, but also gives your operations team the space and time to diagnose an issue without a constant barrage of the aforementioned support tickets, and to switch users back to the primary cloud service provider on its own schedule (following testing, etc.). In this case, GCE was back online in under two hours, but imagine if the outage had lasted longer or stretched into peak business hours.
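
The failover decision described above can be sketched in a few lines of Python. This is a generic, health-check-driven illustration, not Dyn's Dynamic Steering implementation or API: the Endpoint type, the tcp_probe check, and the example.com hostnames are all hypothetical, and the step of publishing the chosen endpoint in DNS is provider-specific and deliberately left out.

```python
import socket
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Endpoint:
    """A place traffic can be sent (hypothetical structure)."""
    name: str
    host: str
    port: int
    priority: int  # lower number = more preferred


def tcp_probe(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Treat an endpoint as healthy if a TCP connection can be opened."""
    try:
        with socket.create_connection((endpoint.host, endpoint.port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_endpoint(
    endpoints: List[Endpoint], probe: Callable[[Endpoint], bool] = tcp_probe
) -> Optional[Endpoint]:
    """Return the most preferred endpoint that passes the probe, or None."""
    for endpoint in sorted(endpoints, key=lambda e: e.priority):
        if probe(endpoint):
            return endpoint
    return None


# Hypothetical setup: a primary on GCE and a backup hosted elsewhere.
primary = Endpoint("gce-primary", "primary.example.com", 443, priority=1)
backup = Endpoint("other-cloud-backup", "backup.example.com", 443, priority=2)


def simulated_gce_outage(endpoint: Endpoint) -> bool:
    """Stand-in probe that reports the primary as unreachable."""
    return endpoint.name != "gce-primary"


active = choose_endpoint([primary, backup], probe=simulated_gce_outage)
print(active.name if active else "no healthy endpoint")
```

Because selection is by priority, running the same function again once the primary passes its health check returns traffic to it; an operations team that prefers to fail back manually after testing could instead pin the backup until someone approves the switch.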

This is the power of combining real-time visibility with internet traffic control using the Dyn Internet Performance Management platform.

Scott Taylor is Solution Sales Engineer at Dyn, a cloud-based Internet performance company that helps companies monitor, control, and optimize online infrastructure for an exceptional end-user experience. Follow Scott on Twitter: @TaylorMHTNH and @Dyn.