A couple of months ago, we discussed how a small Czech provider ended up causing global Internet mayhem by tickling a Cisco bug via a rather ridiculous routing announcement. While it’s easy to fault the instigator of this meltdown, ultimate responsibility belongs with the vendors of poorly tested code. If we’ve learned anything in decades of software engineering, it is that you can’t assume anything about user input. If you don’t check that input for validity, you are not just being careless, you are creating a time bomb that will eventually go off. Another such bomb went off on Sunday, 3 May 2009, taking out a large swath of the Internet. We recount the sorry tale here.
Sundays are usually pretty quiet from an engineering perspective on the Internet, so we were surprised by a sudden and prolonged influx of routing updates at our collection sites. Thankfully, we have some very nifty event correlation software, written by Renesys über-hacker Alin Popescu, to which I turned to see what was going on. Alin’s code sorts through all the routing activity on the Internet in real time and correlates events by the attributes they have in common. We can then say things like “this spike in outages came from networks in Kyrgyzstan as transited via Golden Telecom in Russia”, potentially providing some insight into the root cause of the problem. It’s really very slick. However, the only problem with pinpointing Sunday’s event was that it occurred all over the place. We saw sharp spikes in outages and instabilities in Indonesia, Bulgaria, Germany, Romania, Brazil, Russia, the United States and many other places. Upwards of a couple thousand prefixes (networks) were impacted during a period of about 6 hours. Given the wide reach of the event, we immediately thought of the Czech situation and wondered if this was another live Internet QA test. Turns out we were right.
To understand the problem, let’s first provide some background. As readers of this blog will know, every organization responsible for Internet routing has two things under their control: block(s) of IP addresses (prefixes) and an identifying number, or ASN. Like IP addresses themselves, ASNs are also running out, at least the old style ones. The original Internet designers allocated only 2 bytes for storing ASNs, which means at most 65,536 of them. Since then, ASNs have been largely allocated consecutively starting at 1, with Phosworks Drift & Services getting the most recent allocation of ASN 49232. Reserved (private) ASNs are at the high end of the range, starting at 64512. So this means that at present we only have about 15 thousand 2-byte ASNs remaining unassigned.
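For the curious, the "about 15 thousand" figure is simple arithmetic on the numbers above. The sketch below assumes strictly consecutive allocation up to ASN 49232, which is only approximately true, so treat the result as a rough estimate; the variable names are ours.

```python
# Rough count of unassigned 2-byte ASNs, assuming strictly consecutive
# allocation (an approximation; real allocations are only "largely" consecutive).
TOTAL_2BYTE_ASNS = 2 ** 16      # 2 bytes of storage give 65,536 values
LATEST_ALLOCATED = 49232        # most recent allocation mentioned above
FIRST_PRIVATE = 64512           # start of the reserved (private) range

# Everything strictly between the latest allocation and the private range.
remaining = FIRST_PRIVATE - LATEST_ALLOCATED - 1

print(TOTAL_2BYTE_ASNS)   # 65536
print(remaining)          # 15279
```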
If there weren’t an alternative, the 2-byte limitation for ASNs would mean that before too much longer, there would be no new members of the Internet routing club: no new ISPs and no companies with multiple providers and direct control of their routing. Fortunately, there is an alternative available today, and that is 4-byte ASNs. RIPE has been assigning them by default since the start of 2009. But unlike IPv6, the proposed alternative to dwindling IPv4 addresses, 4-byte ASNs can seamlessly and safely co-exist with 2-byte ASNs. No complete overhaul of the Internet is required.
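The co-existence works, as we understand RFC 4893, because a router that speaks 4-byte ASNs substitutes the reserved 2-byte placeholder AS_TRANS (23456) for any ASN that doesn't fit in 2 bytes when talking to an older peer, and carries the real path alongside in a separate AS4_PATH attribute. The following is a minimal sketch of just that substitution; the function name is hypothetical and this is not a BGP implementation.

```python
# Minimal sketch of the AS_TRANS substitution described in RFC 4893.
# The function name is ours; real routers also send an AS4_PATH attribute
# carrying the unmodified path, which is not modeled here.
AS_TRANS = 23456          # reserved 2-byte placeholder ASN
MAX_2BYTE_ASN = 0xFFFF    # largest value that fits in 2 bytes

def path_for_2byte_peer(as_path):
    """Return the AS path as an old, 2-byte-only BGP speaker would receive it."""
    return [asn if asn <= MAX_2BYTE_ASN else AS_TRANS for asn in as_path]

print(path_for_2byte_peer([24863, 327686, 37095]))
# [24863, 23456, 37095]
```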
Returning to our story, if 4-byte ASNs are available and interoperate with 2-byte ASNs, there shouldn’t be any problems with announcing them any way we want today. Yeah, right. The African Network Operators’ Group (AfNOG) will soon be meeting in Cairo, Egypt, and plans to be multi-homed. They acquired the 4-byte ASN 327686, which equals 5 × 2^16 + 6 and hence is also written as 5.6, and planned to use it as stated here:
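The dotted form is just the high and low 16 bits of the number written side by side. Here is a small sketch of the conversion in both directions; the function names are ours, and we assume the convention (as in RFC 5396's "asdot" notation) that ASNs fitting in 2 bytes stay in plain decimal.

```python
# Convert between the plain decimal and dotted (high.low) forms of an ASN.
# Function names are illustrative only.

def asn_to_dotted(asn: int) -> str:
    """Write a 4-byte ASN as high.low; 2-byte ASNs stay in plain decimal."""
    if not 0 <= asn <= 0xFFFFFFFF:
        raise ValueError(f"ASN out of 4-byte range: {asn}")
    high, low = divmod(asn, 2 ** 16)
    return str(low) if high == 0 else f"{high}.{low}"

def dotted_to_asn(text: str) -> int:
    """Parse either plain decimal or high.low back into an integer ASN."""
    if "." not in text:
        return int(text)
    high, low = (int(part) for part in text.split("."))
    if not (0 <= high <= 0xFFFF and 0 <= low <= 0xFFFF):
        raise ValueError(f"Each half must fit in 16 bits: {text}")
    return high * 2 ** 16 + low

print(asn_to_dotted(327686))   # 5.6
print(dotted_to_asn("5.6"))    # 327686
```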
One of the aims of this setup is to demonstrate that 32-bit ASNs do work and people should not steer away from them, especially since the pool of 16-bit ASNs is shrinking fast. A showcase network goes a long way in this regard.
Now, AfNOG does have an IPv4 prefix, 220.127.116.11/20, and a traditional 2-byte ASN, namely 37095, at their disposal. On 30 April at 17:21:49 UTC, we started seeing the 2-byte ASN announcing this prefix, but with ridiculously long AS paths like
- 24863 37095 37095 37095 37095 37095 37095 37095 37095 37095 37095 37095
AS 24863 is LINKdotNET, a local Egyptian provider. So far, so good. Then on 3 May at 12:00:30 UTC, the above paths get replaced with 4-byte AS prepends, namely,
- 24863 327686 327686 327686 327686 327686 327686 327686 327686 327686 327686 37095
and the global impact is felt immediately, as shown in the following graph.
So what happened? Well, not everyone can afford a Cisco or Juniper router. Some folks make do with cheap Linux boxes running the free routing software called Quagga. And you guessed it! Older versions of Quagga have a buffer overflow problem, and they dropped the associated BGP sessions like a brick when they saw this announcement. The latest Quagga release, 0.99.11, came out on 2 October 2008. The prior version had the following gem of a comment.
/* ASN takes 5 chars at least, plus seperator, see below.
 * If there is one differing segment type, we need an additional
 * 2 chars for segment delimiters, and the final '\0'.
 * Hopefully this is large enough to avoid hitting the realloc
 * code below for most common sequences.
 *
 * With 32bit ASNs, this range will increase, but only worth changing
 * once there are significant numbers of ASN >= 100000
 */
Or until someone prepends a 4-byte ASN a few times. Talk about lame. So this latest experiment in real-time Internet QA identified everyone running old versions of Quagga, thanks to the time bomb left behind by one very lazy programmer. The latest version of Quagga appears to handle this condition a little more carefully, but I can’t help but wonder about any software version that starts with a zero.
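To see why ten prepends were enough, compare the sizing rule described in that comment with what the two AS paths above actually need as strings. This is only our paraphrase of the comment in Python, with made-up names. It is not Quagga's code, and it shows the undersized estimate rather than the fault that followed from it.

```python
# Sizing rule paraphrased from the quoted comment; names are ours, not Quagga's.
CHARS_PER_ASN = 5 + 1   # five digits plus one separator
EXTRA_CHARS = 2 + 1     # segment delimiters plus the terminating NUL

def estimated_size(path: str) -> int:
    """Buffer size the comment's rule would reserve for this AS path."""
    return len(path.split()) * CHARS_PER_ASN + EXTRA_CHARS

def needed_size(path: str) -> int:
    """Bytes actually needed: digits, single-space separators, and the NUL."""
    return len(" ".join(path.split())) + 1

# The two AS paths quoted earlier in this post.
two_byte_path = "24863 " + " ".join(["37095"] * 11)
four_byte_path = "24863 " + " ".join(["327686"] * 10) + " 37095"

for label, path in (("2-byte prepends", two_byte_path),
                    ("4-byte prepends", four_byte_path)):
    print(label, "estimate:", estimated_size(path), "needed:", needed_size(path))
```

Both paths contain 12 ASNs, so the estimate is 75 bytes either way. The 2-byte path needs 72 and fits; the 4-byte path needs 82 and does not.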
Looking for the positive, it was nice to see some diversity on the Internet with respect to routing. Imagine if this had been a Cisco bug. Still, we found it surprising to see the number and variety of organizations running Quagga in a production environment. The following graph gives the per-country distribution of prefixes for every country with at least 10 outages due to this bug.
The lessons are obvious. If you don’t check user input and don’t allocate sufficient memory, bad things will ultimately happen. Every possible permutation that you can think of, and perhaps some that you can’t, will eventually be tried by someone, either accidentally or on purpose. So far, we’re mainly seeing the former, which is good news. Ultimately though, bugs need to be found in the lab, not in the production Internet.