Time flies. Although it was over 18 months ago, it seems just like yesterday that a small Czech provider, SuproNet, caused global Internet mayhem by making a perfectly valid (but extremely long) routing announcement. Since Internet routing is trust-based, within seconds every router in the world saw this announcement and tried to pass it on. Unfortunately, due to the size of this single message, quite a few routers choked – resulting in widespread Internet instability. Today, some 18 months later, we were treated to a somewhat different version of the exact same story.
First, let’s review the Czech incident from February 2009. There were many positives to take away.
- It was precipitated by an honest mistake.
- It was an extremely unlikely event, as many stars had to be in exact alignment.
- Most of the Internet’s core survived.
- The response from operators was fast and efficient, with the damage largely contained within an hour.
The complete technical details can be found here.
Deja vu all over again
Fast forward to today: Friday, 27 August 2010. What do you think would happen if another large and unusual routing announcement were made on the Internet? Do you think all the router vendors have perfected their code in the past 18 months? Do you think the entire planet has upgraded to this new, improved and perfect code base? Do you think it makes sense to use the Internet as your testbed? I doubt you answered “yes” to any of these questions.
We’ll begin to describe what happened today with a snippet from a private mailing list. We’ll purposely leave out the technical details so that we don’t inadvertently contribute to the building of a Cybernuke.
On Friday 27 August, from 08:41 to 09:08 UTC, the RIPE NCC Routing Information Service (RIS) announced a route with an experimental BGP attribute. During this announcement, some Internet Service Providers reported problems with their networking infrastructure.
Immediately after discovering this, we stopped the announcement and started investigating the problem. Our investigation has shown that the problem was likely to have been caused by certain router types incorrectly modifying the experimental attribute and then further announcing the malformed route to their peers. The announcements sent out by the RIS were correct and complied to all standards.
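For readers wondering what "passing on" an attribute a router does not understand is supposed to look like, here is a minimal Python sketch of the generic rule in the BGP specification (RFC 4271): an unrecognized optional transitive attribute is forwarded with its value untouched and its Partial flag set. This is an illustration only, not any vendor's code and not the announcement used in the experiment. The `PathAttribute` class, the `KNOWN_TYPES` set, the function names and the type code 200 in the usage note are all assumptions made for the example.

```python
import struct
from dataclasses import dataclass
from typing import Optional, Tuple

# Attribute flag bits as defined in RFC 4271, section 4.3.
OPTIONAL = 0x80
TRANSITIVE = 0x40
PARTIAL = 0x20
EXTENDED_LENGTH = 0x10

# Hypothetical set of type codes this toy speaker understands
# (ORIGIN, AS_PATH, NEXT_HOP, MULTI_EXIT_DISC, LOCAL_PREF).
KNOWN_TYPES = {1, 2, 3, 4, 5}


@dataclass(frozen=True)
class PathAttribute:
    flags: int
    type_code: int
    value: bytes


def decode(buf: bytes, offset: int = 0) -> Tuple[PathAttribute, int]:
    """Decode one path attribute; return it and the offset of the next one."""
    if len(buf) - offset < 3:
        raise ValueError("truncated attribute header")
    flags, type_code = buf[offset], buf[offset + 1]
    if flags & EXTENDED_LENGTH:
        # Two-octet length field.
        if len(buf) - offset < 4:
            raise ValueError("truncated extended-length header")
        (length,) = struct.unpack_from("!H", buf, offset + 2)
        start = offset + 4
    else:
        # One-octet length field.
        length = buf[offset + 2]
        start = offset + 3
    end = start + length
    if end > len(buf):
        raise ValueError("attribute length runs past end of buffer")
    return PathAttribute(flags, type_code, buf[start:end]), end


def encode(attr: PathAttribute) -> bytes:
    """Re-encode an attribute, keeping the length field consistent with the value."""
    if len(attr.value) > 0xFFFF:
        raise ValueError("attribute value too long")
    flags = attr.flags
    if len(attr.value) > 255:
        flags |= EXTENDED_LENGTH  # a one-octet length cannot hold this value
    fmt = "!BBH" if flags & EXTENDED_LENGTH else "!BBB"
    return struct.pack(fmt, flags, attr.type_code, len(attr.value)) + attr.value


def propagate(attr: PathAttribute) -> Optional[PathAttribute]:
    """Decide what to pass to peers for one received attribute."""
    if attr.type_code in KNOWN_TYPES:
        return attr  # real per-attribute processing is omitted in this sketch
    if not attr.flags & OPTIONAL:
        raise ValueError("unrecognized well-known attribute")
    if not attr.flags & TRANSITIVE:
        return None  # unrecognized optional non-transitive: quietly dropped
    # Unrecognized optional transitive: same value bytes, Partial bit set.
    return PathAttribute(attr.flags | PARTIAL, attr.type_code, attr.value)
```

As a usage example, `propagate(decode(encode(PathAttribute(OPTIONAL | TRANSITIVE, 200, bytes(300))))[0])` returns an attribute whose `value` is byte-for-byte identical to the input and whose flags now include `PARTIAL`. The point of the sketch is how little a forwarding router is meant to touch: one flag bit, and nothing in the payload.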
While standards compliance is nice, it is foolhardy to assume that all BGP implementations are perfectly compliant, especially given recent history. Over 3,500 prefixes (announced blocks of IP addresses) became unstable at the exact moment this “experiment” started. Not surprisingly, they were located all over the world: 832 in the US, 336 in Russia, 277 in Argentina, 256 in Romania and so forth. We saw over 60 countries impacted by a “correct” announcement that “complied with all standards”. The following graph shows the timeline of the event, followed by a map of the impacted countries by prefix count. Notice that it takes a while for the Internet to stabilize after RIPE claims to have withdrawn the announcement at 09:08 UTC.
On the positive side, the incident was very brief, the damage was limited to under 2% of the Internet and the responsible parties quickly fessed up, aborting their “experiment”. On the negative side, the Internet remains a very fragile place, even if that fragility is highly localized and varies from place to place. Standards aren’t followed, code isn’t tested and people make mistakes. That’s life with any complex system and, while we can certainly do a better job, we will continue to see these types of events no matter what safeguards we might take. What puzzles me is how anyone thought it might be a good idea to tempt fate in this way. The end result was completely predictable.