Maybe it speaks to a risk-averse nature, but I’ve always been interested in failure and in learning from the mistakes of others – obviously so I don’t have to learn such lessons firsthand. This is particularly important when you engage in activities where bad decisions can kill you. But generally, as any book on mountaineering mishaps demonstrates, it takes a series of errors in the “correct order” and at the wrong times to cause you serious harm.
In high-risk activities under adverse conditions, it’s not hard to make poor decisions that you would never contemplate from the comfort of your favorite living room chair. But while there is little risk to life and limb on the Internet, its very connectedness means that the blunders of pretty much anyone can impact you. What is important in this environment is the half-life and the reach of the mistakes. Those that are local and die out quickly have little chance of resulting in global mayhem. Others compound with all the other endless screw-ups regularly going on and eventually become a giant avalanche careening downhill, collecting mass and bearing down on the sleeping village below. This is one of those stories. It might be true or it might not. Your opinion depends on how much imagination you think we have!
Our tale starts innocently enough with a global upscale hotel chain, call them Hotel-A. (No, not that one.) Hotel-A decides it is time to get serious about updating the software on their desktops. For a large organization, these updates really should be downloaded once to a local server and pushed or pulled from there. But that takes time and a little expertise; the easier option is to burn your Internet bandwidth by updating all machines directly from the Content Delivery Network (CDN) serving this up. The CDN is hosted by Provider-C, a global Internet transit provider.
Mistake #1: Ignore scaling issues and take the easy way out.
To compound this error, Hotel-A elects to take the defaults, updating all PCs at exactly the same time.
Mistake #2: Ignore scheduling issues. Who cares what happens at 3am anyway?
Hotel-A staff now have a ticking time bomb in place, set to go off on a particular day of the month. Every one of their PCs (probably thousands) will start to suck down a large and identical set of updates at that time. This will go on for many hours and effectively saturate all of their Internet links. (Like this.) Sure enough, Hotel-A’s self-inflicted denial of service proceeds on schedule and their network operations staff identifies it as a DoS, failing only to prepend “self” to the problem report. Network operations in turn calls their transit provider, Provider-A, another global carrier.
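To see why the links saturate for hours, a little arithmetic helps. The story gives no actual figures, so every number below is hypothetical – but the shape of the problem holds for any plausible values:

```python
# Back-of-envelope estimate of the self-inflicted congestion.
# All numbers are made up; the story names no actual fleet size,
# update size, or link capacity.

def saturation_hours(num_pcs, update_size_gb, link_capacity_gbps):
    """Hours needed to pull an identical update to every PC through a
    shared Internet link, assuming the link runs fully utilized."""
    total_gigabits = num_pcs * update_size_gb * 8  # data to move, in Gb
    seconds = total_gigabits / link_capacity_gbps
    return seconds / 3600

# e.g. 2,000 desktops each pulling a 1 GB update over a 1 Gbps link
hours = saturation_hours(2000, 1.0, 1.0)
print(f"{hours:.1f} hours of a fully saturated link")  # ~4.4 hours
```

With a local update server, that same 1 GB crosses the Internet link once instead of 2,000 times – which is exactly the scaling point of Mistake #1.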
Mistake #3: Blame the Boogeymen on the Internet before looking in your own backyard.
We’ve all been in similar situations. Provider-A greatly values and implicitly trusts Hotel-A. They are a good customer and pay serious money for services. The time to act is now!
Mistake #4: Accept the reported problem at face value, rather than investigate yourself.
Now, how hard would it have been to notice that the source of the “attack” was a major CDN? Provider-A decides to blackhole all traffic from the source network. But they do more than that – they blackhole all traffic from this network to all of their customers, not just Hotel-A.
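The gap between a measured response and what Provider-A actually did can be sketched in router configuration. This is a hypothetical IOS-style fragment, using documentation address space (RFC 5737) since the story names no real prefixes:

```
! Hypothetical IOS-style sketch; 198.51.100.0/24 stands in for the CDN's prefix.
!
! A response scoped to the one complaining customer: drop the CDN's
! traffic only on the interface facing Hotel-A.
access-list 100 deny   ip 198.51.100.0 0.0.0.255 any
access-list 100 permit ip any any
interface GigabitEthernet0/1
 ip access-group 100 out
!
! What the story describes instead: a blanket null route, installed
! network-wide, discarding the CDN's traffic toward every customer.
ip route 198.51.100.0 255.255.255.0 Null0
```

The first option would have contained the damage to Hotel-A while the "attack" was investigated; the second turned one customer's complaint into everyone's outage.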
Mistake #5: Overreact in a time of crisis.
Next, Provider-A compounds the problem by announcing the blackholed network as their very own. That is, they start originating a network that in fact belongs to Provider-C.
Mistake #6: Carelessly inject your IGP routes into BGP.
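How does a null route escape into the global routing table? A hedged, hypothetical IOS-style sketch of the blunder, using a documentation ASN (RFC 5398) and documentation address space, since the story names neither:

```
! Hypothetical sketch of Mistake #6. The null route is a static/IGP
! route for a prefix Provider-A does not own.
ip route 198.51.100.0 255.255.255.0 Null0

router bgp 64500
 ! Redistributing statics with no filter announces the blackholed
 ! prefix to the world as if Provider-A originated it:
 redistribute static
 ! Safer: gate redistribution on a list of networks you actually own,
 ! e.g.  redistribute static route-map OWN-PREFIXES-ONLY
```

One unfiltered `redistribute static` is all it takes to turn an internal traffic-drop into a global hijack of someone else's prefix.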
The folks at Provider-C now start to get reports about the inaccessibility of the CDN they host. These folks are smart and run a tight ship. They carefully check out the complaints from various points on their worldwide network, but to no avail. Everything looks great from inside their network and from various external points as well. Provider-A ultimately figures out what they have done and withdraws the route, leaving Provider-C scratching their collective heads about exactly what happened, as their traffic suddenly returned to normal. They give Renesys a call to see if we have noticed “anything odd” about the problem network. One quick look shows Provider-A and Provider-C both originating the same network for a short period of time one nice summer morning. Given the reach of both carriers, the Renesys global peering set was roughly evenly divided between the two. So half the world had access to the CDN and half did not. The CDN and Provider-C were both collateral damage from a series of mistakes made by others.
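That "one quick look" amounts to spotting a prefix with two simultaneous origin ASes across a set of peering sessions – a multiple-origin-AS (MOAS) conflict. A toy sketch of the idea, with entirely made-up prefixes and documentation ASNs since the story names no real ones:

```python
from collections import defaultdict

# Toy multiple-origin-AS (MOAS) detection over BGP table snapshots.
# Each entry is (peer, prefix, as_path); all values are hypothetical.
routes = [
    ("peer1", "198.51.100.0/24", [64496, 64497, 64511]),  # origin 64511 ("Provider-C")
    ("peer2", "198.51.100.0/24", [64498, 64500]),         # origin 64500 ("Provider-A")
    ("peer3", "203.0.113.0/24",  [64499, 64511]),         # single origin, no conflict
]

def moas_conflicts(routes):
    """Map each prefix to its set of origin ASes (the last AS on the
    path), keeping only prefixes announced by more than one origin."""
    origins = defaultdict(set)
    for _peer, prefix, as_path in routes:
        origins[prefix].add(as_path[-1])
    return {p: o for p, o in origins.items() if len(o) > 1}

print(moas_conflicts(routes))  # {'198.51.100.0/24': {64500, 64511}}
```

Which origin a given network reaches depends on which announcement its BGP neighbors prefer – which is why roughly half the world could reach the CDN and half could not.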
You might think that such a sequence of events is highly unlikely and so probably didn’t happen. Or you might think that we aren’t clever enough to have actually made up such a story. Regardless, I happen to believe that cascading failures are common, although underreported. With billions of mistake-prone humans connected to the global Internet, how could this not be the case?
In conclusion, there are really two important morals to take away from this story. First, it is fairly trivial to economically harm even the best run networks. Second, you can’t effectively monitor your network from within your network. While it’s true that the Internet’s core protocols date from a happier day and are now seriously broken and in need of replacement, your only real alternative to waiting for nirvana might be to have someone else watch your (high-value) back and to keep a list of those NOC phone numbers handy.