
House of Cards


Time flies. Although it was over 18 months ago, it seems just like yesterday that a small Czech provider, SuproNet, caused global Internet mayhem by making a perfectly valid (but extremely long) routing announcement. Since Internet routing is trust-based, within seconds every router in the world saw this announcement and tried to pass it on. Unfortunately, due to the size of this single message, quite a few routers choked – resulting in widespread Internet instability. Today, over a year later, we were treated to a somewhat different version of the exact same story.

First, let’s review the Czech incident from February 2009. There were many positives to take away.

  • It was precipitated by an honest mistake.
  • It was an extremely unlikely event, as many stars had to be in exact alignment.
  • Most of the Internet’s core survived.
  • The response from operators was fast and efficient, with the damage largely contained within an hour.

The complete technical details can be found here.

Deja vu all over again

Fast forward to today: Friday, 27 August 2010. What do you think would happen if another large and unusual routing announcement was made on the Internet? Do you think all the router vendors have perfected their code in the past 18 months? Do you think the entire planet has upgraded to this new, improved and perfect code base? Do you think it makes sense to use the Internet as your testbed? I doubt you answered “yes” to any of these questions.

We’ll begin to describe what happened today with a snippet from a private mailing list. We’ll purposely leave out the technical details so that we don’t inadvertently contribute to the building of a Cybernuke.

On Friday 27 August, from 08:41 to 09:08 UTC, the RIPE NCC Routing Information Service (RIS) announced a route with an experimental BGP attribute. During this announcement, some Internet Service Providers reported problems with their networking infrastructure.

Immediately after discovering this, we stopped the announcement and started investigating the problem. Our investigation has shown that the problem was likely to have been caused by certain router types incorrectly modifying the experimental attribute and then further announcing the malformed route to their peers. The announcements sent out by the RIS were correct and complied to all standards.
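As a rough illustration of the failure mode the note describes, here is a toy Python model. It is not the actual bug, whose details are deliberately left out of this post. The attribute type code, the truncation behaviour and every class and function name below are hypothetical. The sketch only shows the general pattern: a compliant router forwards an unrecognized attribute unchanged, a faulty one corrupts it in transit, and the next router, on seeing a malformed attribute, drops the session.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PathAttribute:
    """Simplified stand-in for a BGP path attribute (not the real wire format)."""
    type_code: int        # hypothetical code for the experimental attribute
    declared_length: int  # length field carried alongside the payload
    value: bytes          # the payload itself


def is_well_formed(attr: PathAttribute) -> bool:
    # Minimal sanity check: the declared length must match the payload.
    return attr.declared_length == len(attr.value)


def compliant_forward(attr: PathAttribute) -> PathAttribute:
    # An unrecognized attribute is passed along untouched.
    return attr


def faulty_forward(attr: PathAttribute) -> PathAttribute:
    # Hypothetical bug: the payload is truncated but the length field is not updated.
    return replace(attr, value=attr.value[:-1])


def receive(attr: PathAttribute) -> str:
    # A receiver that detects a malformed attribute tears the session down.
    return "session kept" if is_well_formed(attr) else "session reset"


EXPERIMENTAL_TYPE = 255  # arbitrary value, for illustration only
original = PathAttribute(EXPERIMENTAL_TYPE, declared_length=4, value=b"\x00\x01\x02\x03")

via_compliant = receive(compliant_forward(original))
via_faulty = receive(faulty_forward(original))

print("origin -> compliant router -> peer:", via_compliant)
print("origin -> faulty router    -> peer:", via_faulty)
```

In this toy, the originator's message is well-formed on both paths. The damage only appears one hop downstream of the faulty router, which is why the instability can surface far from the source of the announcement.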

While standards compliance is nice, it is foolhardy to assume that all BGP implementations are perfectly compliant, especially given recent history. Over 3,500 prefixes (announced blocks of IP addresses) became unstable at the exact moment this “experiment” started. Not surprisingly, they were located all over the world: 832 in the US, 336 in Russia, 277 in Argentina, 256 in Romania and so forth. We saw over 60 countries impacted by a “correct” announcement that “complied with all standards”. The following graph shows the timeline of the event, followed by a map of the impacted countries by prefix count. Notice that it takes a bit for the Internet to stabilize after RIPE claims to have withdrawn the announcement at 09:08 UTC.


Impacted Countries by Prefix Count



On the positive side, the incident was very brief, the damage was limited to under 2% of the Internet and the responsible parties quickly fessed up, aborting their “experiment”. On the negative side, the Internet remains a very fragile place, even if that fragility is highly localized and varies from place to place. Standards aren’t followed, code isn’t tested and people make mistakes. That’s life with any complex system and, while we can certainly do a better job, we will continue to see these types of events no matter what safeguards we might take. What puzzles me is how anyone thought it might be a good idea to test fate in this way. The end result was completely predictable.
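For readers who want to put the figures quoted in this post in proportion, a few lines of Python will do. The inputs are only the numbers given above. The text says "over 3,500" prefixes, so treating 3,500 as the exact total makes the country shares slight overestimates. The last step assumes the "under 2%" figure is measured in prefixes of the global routing table, which the post does not state explicitly.

```python
# Figures quoted in this post; 3,500 is a lower bound used here as the total.
unstable_prefixes = 3500
by_country = {"US": 832, "Russia": 336, "Argentina": 277, "Romania": 256}

# Share of the unstable prefixes attributed to each named country.
for country, count in by_country.items():
    print(f"{country}: {count / unstable_prefixes:.1%}")

top_four = sum(by_country.values())
print(f"Top four combined: {top_four} ({top_four / unstable_prefixes:.1%})")

# "Under 2%" holds only if the global table is larger than this many prefixes
# (assumption: the percentage is measured in prefixes).
share_limit_percent = 2
min_table_size = unstable_prefixes * 100 // share_limit_percent
print(f"Implied minimum table size: {min_table_size:,} prefixes")
```

The four named countries account for a little under half of the unstable prefixes, with the rest spread across the remaining countries.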


  • Martijn Bakker

    I agree where you say that the Internet is not a playground — as I wrote in my previous comment. Again, this is not my primary worry, and the blogpost is covering that topic well enough.
    Saying the internet is “too fragile” for these “games” though, is, in my eyes, ridiculous. First off, I don’t think anyone at Renesys would like it if we compared the sincere and honest work you do with “games”. The same goes for what RIPE NCC’s RIS does. These people are no less than anyone, and their work isn’t either. Working for one and having worked for multiple other RIPE members, I can say their work, as far as it touched me, has never been less than professional.
    Saying the internet is “too fragile” is exactly what my point is about. If this is true, this must change. If we look at the direct consequences, we only see 1 major vendor (who I shall not name) whose BGP implementation is causing all the damage, as a result of a _completely valid_ BGP packet. So let’s start there. I bet that if you were able to look at this vendor’s BGP implementation, this isn’t the most horrible bug in there.
    Blaming RIS or anyone other than this particular vendor for what happened is like blaming the person operating a light switch for causing an electrical fire because of faulty wiring. It was never the intention of RIS to even have a fire drill, let alone trigger a fire. Again, the attribute in question was completely valid.
    If you really believe that the small meltdown is solely the fault of RIS, maybe you should think about taking a holiday.
    The words in this and my previous comment are mine, and mine alone.
    Editor’s note: If you suspected that the manufacturers of vaccines were exaggerating their marketing claims, and human society was “too fragile as a result,” would you go around subway stations infecting people to prove your point? And then blame Big Pharma for the resulting chaos? There are ways to carry out these experiments without burning down the global village, and RIPE RIS has a long history as a sponsor of responsible Internet research. I’m still waiting for RIPE RIS and Duke to explain whether this experiment was in line with their institutional research policies.

  • Martijn Bakker

    Let me see if I get that note:
    Big Pharma = $vendor
    “you” = RIPE NCC’s RIS
    So what you are claiming is that RIS purposefully and willingly poisoned the internet to prove a point? I think I see where your thinking went astray there.
    We already figured these kinds of experiments shouldn’t take place on the big bad (and apparently extremely fragile) internets, and I agreed (see 1st comment, 1st sentence). And I think RIPE RIS and Duke will wholeheartedly agree with you on that one too after this incident. It was surely an eye-opener. More lab-testing should be done (if resources permit) before stuff like this is attempted publicly.
    But, as I said before and again, the experiment they conducted was not the main cause. It just acted as a trigger for the main cause, in this case the malformation of a well-formatted attribute by a router produced by BigVendor and running software produced by BigVendor. This, in turn, led to disconnection of BGP sessions (as a safety measure) on the routers to which this malformed attribute was forwarded.
    It could have been anyone, and the same routers would have made the same error. The vendor is the only consistent element in this story. Why all this hate towards RIS? I have a hard time believing Renesys is, in some way (namely pushing the idea that RIS is purposefully harming the internets), trying to make a black sheep out of RIS.
    Again, all comments are my own.
    Editor’s note: “RIPE NCC is going to be stricter about the way it runs such experiments and will give Internet operators advance warning in the future.” That’s a win for everyone.

Whois: Earl Zmijewski

Earl leads a peerless team of data scientists who are committed to analyzing Dyn’s vast Internet Performance data resources and applying their expertise to continually improve upon Dyn’s products and services.