This weekend, John Markoff wrote an interesting piece for the New York Times entitled Do We Need a New Internet? While his emphasis was largely on security, or rather the lack thereof, the central point Markoff makes is that the Internet may be so hopelessly broken that it could be better to start over, rather than continue to apply band-aids. As if to emphasize this point, SuproNet, a local Czech provider, single-handedly caused a global Internet meltdown for upwards of an hour today. SuproNet accomplished this feat by sending out a rather unusual routing update, one which a lot of routers did not handle very well. The result was Internet bedlam.
Routing on the Internet is strictly a cooperative affair. Neighboring routers tell each other what they know and that information ultimately propagates globally. Eventually everyone figures out how to reach everyone else. And what routers know are prefixes, i.e., blocks of IP addresses that are routed in the same way. Since there is often more than one way to reach any given prefix, routing announcements include various attributes so that everyone can decide on their preferred path to each prefix. One such attribute is the Autonomous System (AS) path, i.e., the list of organizations that have to be traversed to reach the prefix. For example, I’m typing this blog from my home Verizon DSL connection. If I wanted to reach an IP address in Qatar served by Qtel, Verizon (AS 701) might hand that traffic off to Tata/Teleglobe (AS 6453), which in turn hands it off to Qtel (AS 8781). The prefix in question and its associated AS path are depicted graphically below.
AS path length is one important factor in route selection, with shorter paths favored over longer ones. Suppose that Qtel wanted to use Tata/Teleglobe as a backup provider for this prefix, only to be used when other alternatives failed. They could effect this by making the announced path artificially long. Instead of
- 701 6453 8781
we might see paths like
- 701 6453 8781 8781 8781
In this example, Qtel would have prepended its own AS to the path several times so that this particular route to this particular prefix would tend not to be selected by others.
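The effect of prepending on route selection can be sketched in a few lines of Python. This is a deliberate simplification that compares candidate routes on AS path length alone, whereas real routers weigh other attributes as well. The alternative path through AS 64500 (a number reserved for documentation) is invented for illustration, and `prepend` and `best_path` are made-up helper names, not part of any router software.

```python
def prepend(path, times):
    """Return a copy of path with the origin AS (last entry) repeated `times` more times."""
    origin = path[-1]
    return path + [origin] * times


def best_path(candidates):
    """Pick the candidate with the fewest ASes; ties go to the first one listed."""
    return min(candidates, key=len)


# Path from the example: Verizon (701) -> Tata/Teleglobe (6453) -> Qtel (8781).
via_tata = [701, 6453, 8781]
# Hypothetical alternative through a made-up provider (AS 64500).
via_other = [701, 64500, 8781]

# Both paths have length 3, so the tie goes to the first one listed.
print(best_path([via_tata, via_other]))         # [701, 6453, 8781]

# Qtel prepends its own AS twice on the route announced via Tata.
via_tata_backup = prepend(via_tata, 2)
print(via_tata_backup)                          # [701, 6453, 8781, 8781, 8781]

# The longer path now loses to the alternative.
print(best_path([via_tata_backup, via_other]))  # [701, 64500, 8781]
```

Two extra copies of the origin AS are enough to flip the decision here, which is the point made next: a little prepending goes a long way.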
Now the average path length on the Internet is only around 4. That is, we are all fairly close to one another. So if I make any path seem just a little bit longer, by one or two ASes, it generally will not get selected and will accomplish the objective of being the path of last resort. Nothing stops you from prepending your own AS a dozen or even a hundred times, but it is not going to accomplish anything and will only pointlessly consume everyone else’s router memory. It’s also an indication that you don’t know what you are doing. Which brings us to a central problem: you don’t need a driver’s license on the Information Superhighway.
Bedlam on the Internet
Now suppose you just got your Internet learner’s permit yesterday and you really don’t want your backup provider being used unless your main provider is down. You could prepend your AS a few times in the route announcements you make to your backup provider and that would do the trick, but to make really sure you go for a few hundred instead. In a perfect Internet, that wouldn’t matter, but we don’t have one of those. What we think happened next is the Internet equivalent of a massive buffer overflow. While most of the core routers run by major ISPs fared just fine, processing the ridiculous path and sending it on, others choked. Perhaps they weren’t as well maintained or were running buggy software. These routers viewed the update as malformed and so tore down their session with whoever sent them the update. In other words, two routers that were happily exchanging traffic with each other just moments before suddenly stopped all communication. Traffic was lost, alternative paths were explored, and maybe the formerly cooperating routers recovered and re-established contact. Multiply this by thousands of routers around the world and you can begin to appreciate the ensuing pandemonium. At Renesys, we experienced an almost 100-fold increase in the rate of routing updates from our worldwide array of sensors.
SuproNet (AS 47868) normally announces a single prefix, 220.127.116.11/21, to a single provider, CD-Telematika (AS 25512). On February 16th at 16:23:30 UTC, we saw this same prefix via a different provider, Sloane Park Property Trust (AS 29113), but with an AS path exceeding 255 ASNs. Such messages continued for almost exactly one hour or until 17:23:00 UTC. We observed Level 3 (AS 3356), Tiscali (AS 3257) and TeliaSonera (AS 1299) propagating most of these routes globally, with a total of 230 unique ASes ultimately sending us the problematic announcements.
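One reason a path of more than 255 ASNs is an awkward case for router software is the way BGP encodes paths on the wire. In the AS_PATH attribute defined by RFC 4271, each path segment carries a one-octet count of the ASes it contains, so a single segment holds at most 255 of them and a longer path has to be split across several segments. The Python sketch below illustrates that layout using classic two-octet AS numbers. It shows the encoding only and is not a reconstruction of what any affected router actually did. The function name `encode_as_path` and the 300-entry path are assumptions made for the example.

```python
import struct

AS_SEQUENCE = 2             # segment type for an ordered list of ASes (RFC 4271)
MAX_ASES_PER_SEGMENT = 255  # the per-segment count field is a single octet


def encode_as_path(asns):
    """Encode 2-octet AS numbers as one or more AS_SEQUENCE segments."""
    out = bytearray()
    for start in range(0, len(asns), MAX_ASES_PER_SEGMENT):
        chunk = asns[start:start + MAX_ASES_PER_SEGMENT]
        # Segment header: one octet of type, one octet of AS count.
        out += struct.pack("!BB", AS_SEQUENCE, len(chunk))
        # Each AS number as an unsigned 16-bit integer, network byte order.
        out += struct.pack("!%dH" % len(chunk), *chunk)
    return bytes(out)


short_path = [701, 6453, 8781]
# Hypothetical over-long path: one upstream AS followed by 299 copies of the origin.
long_path = [29113] + [47868] * 299

print(len(encode_as_path(short_path)))  # 8: 2 header octets + 3 ASes * 2 octets
encoded = encode_as_path(long_path)
print(len(encoded))                     # 604: two segments of 255 and 45 ASes
print(encoded[1], encoded[513])         # 255 45: the two segment counts
```

Software that assumes the whole path fits in one segment has no trouble with the short path but must handle the long one differently, which is the kind of rarely exercised code path where bugs tend to hide.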
This single Czech provider announcing a single prefix caused a huge increase in the global rate of updates, peaking at 107,780 updates per second. This peak occurred at 16:30:54 UTC, less than 8 minutes after the first announcement.
At Renesys, we call a prefix impacted in a given hour if it either suffers an outage or has a non-trivial amount of instability. In the hour before this event, there were 1,215 impacted prefixes globally out of a total of 271,175. During the event, that number surged to 12,920 or 4.8% of all prefixes on earth. One announcement from one provider and we have a 10-fold increase in planetary routing instability for an hour. North America suffered the most, increasing from 0.35% to 4.76%, while South America suffered the least, increasing from 0.52% to 1.75%.
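The headline figures follow directly from the counts and timestamps quoted above, as this short Python check shows. It uses only numbers from this post; the variable names are illustrative.

```python
from datetime import timedelta

total_prefixes = 271_175
impacted_before = 1_215
impacted_during = 12_920

# Share of all prefixes impacted in each hour, as a percentage.
share_before = 100 * impacted_before / total_prefixes
share_during = 100 * impacted_during / total_prefixes
# How many times more prefixes were impacted during the event.
increase = impacted_during / impacted_before

print(f"{share_before:.2f}% before, {share_during:.2f}% during, {increase:.1f}x")
# 0.45% before, 4.76% during, 10.6x

# Time of day (UTC) of the first announcement and of the peak update rate.
first_seen = timedelta(hours=16, minutes=23, seconds=30)
peak = timedelta(hours=16, minutes=30, seconds=54)
print(peak - first_seen)  # 0:07:24
```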
Mapping the Damage
It’s always the middle of the night somewhere, a time when ISPs perform their maintenance. And bad weather and backhoes roam the earth. So routing instability on the Internet is always present, as networks come and go. The following map shows instability levels by country in the hour before this event starting at 15:00 UTC and computed as a percentage of all prefixes geo-locating to the country.
Global Instability by Country – Before
The next map shows instability levels by country during the hour of the event, starting at 16:00 UTC. Can you spot the outdated routers?
Global Instability by Country – During
We were heartened to see that most of the Internet’s core survived a single odd announcement, but this does speak to a lot of outdated equipment or software at the edge. And if you manage to get all of the edge routers to reset, you aren’t going to have many people to talk to no matter what the core is doing. While it might be tempting to bash SuproNet, can anyone really defend a system where a failure in what is probably one of the weaker links can cause the entire system to unravel? Maybe we really do need a new Internet and for more reasons than better security. The next one needs to come with an operating permit too.