Since the major outage that Con Edison Communications caused for Panix and others this past weekend, a number of people have been asking what the root cause was. That is to say, what were the circumstances underlying Con Edison Communications's error in announcing the networks that they announced? In the intervening days, we have learned something about what happened, and there is room to reflect on what it all means for the future stability of the Internet.
Several people, including current and former Panix admins, have indicated that Panix was a former customer of Con Edison, and that this might explain why Con Ed had Panix routes in their RADB as-27506-transit object. Checking Renesys's records of routing data going back to Jan 1, 2002, I see no evidence of Con Edison Communications (AS27506) and Panix (AS2033) being adjacent to each other in any announcement from any of our peers at any time since then. So I can't really verify that Panix was ever a Con Ed Comm customer. Can anyone clear this up? So far, it's not making sense, or at least it's not adding up to a full picture.
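For the curious, the adjacency check described above can be sketched roughly as follows. The AS paths here are hypothetical stand-ins for real routing data (e.g. parsed table dumps from BGP peers); this is an illustration of the idea, not Renesys's actual tooling.

```python
# Sketch: scan observed AS paths for adjacency between two ASes.
# Two ASes that were ever provider/customer (or peers) should appear
# next to each other in at least one observed AS path.

def ases_adjacent(as_paths, a, b):
    """Return True if ASes a and b ever appear side by side
    in any observed AS path (i.e. were BGP neighbors)."""
    for path in as_paths:
        for left, right in zip(path, path[1:]):
            if {left, right} == {a, b}:
                return True
    return False

# Hypothetical observed paths (lists of AS numbers, origin last).
paths = [
    [701, 2914, 27506],   # a path through Verio (AS2914) to Con Ed Comm
    [701, 3356, 2033],    # a path to Panix (AS2033) via another provider
]

print(ases_adjacent(paths, 27506, 2033))  # False: never seen adjacent
print(ases_adjacent(paths, 2914, 27506))  # True: Verio transits Con Ed
```

In practice you would run this over years of table snapshots from many peers, which is essentially the check described above.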
The supposition was that all of the other affected ASes that are not currently customers of Con Ed Comm are former customers. Some appear to be (Walrus Internet (AS7169), Advanced Digital Internet (AS23011), and NYFIX (AS20282) for sure were customers of Con Edison in the recent past), but others do not. So the former-customer theory doesn't appear to be the full explanation.
But this isn't really the "root cause" that Steven Bellovin was asking for above on the NANOG mailing list. This is really a proximate cause. The root cause or "ultimate cause" is that filtering is imperfect and frequently out of date. This case is particularly interesting and painful because Verio (NTT America, AS2914) is known for building good filters automatically. In fact, they are somewhat notorious for being rigid (and effective) in their automated filtering. So it is particularly distressing that they were implicated in this event in any way at all. In this case, Verio built their filters out of dated, incorrect information. Con Edison added the Panix routes to the RADB route registry as routes that they were allowed to announce. Verio believed the registry and built their filters accordingly.
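The failure mode is easy to see if you sketch how registry-driven filter building works. The registry contents below are simplified and partly hypothetical (real RPSL expansion is recursive and richer; 192.0.2.0/24 is a documentation prefix standing in for Con Ed's own space), but they show how one stale route object, registered under Con Ed's as-set, flows straight into a permissive filter:

```python
# Sketch of IRR-based prefix-filter generation, in the style a provider
# like Verio automates. Registry contents are illustrative/hypothetical.

REGISTRY = {
    # as-set -> member ASes (simplified; real expansion is recursive)
    "AS-27506-TRANSIT": [27506, 2033],
    # origin AS -> registered route objects
    27506: ["192.0.2.0/24"],      # stand-in for Con Ed's own prefixes
    2033: ["166.84.0.0/16"],      # Panix space, registered under Con Ed's set
}

def build_prefix_filter(as_set):
    """Expand an as-set into the prefixes the customer is permitted
    to announce, trusting the (possibly stale) registry entirely."""
    prefixes = []
    for asn in REGISTRY[as_set]:
        prefixes.extend(REGISTRY.get(asn, []))
    return prefixes

allowed = build_prefix_filter("AS-27506-TRANSIT")
print(allowed)  # ['192.0.2.0/24', '166.84.0.0/16']
```

The automation is doing exactly what it should; the input is what's wrong. Once 166.84.0.0/16 sits in the expanded filter, an announcement of Panix's space from AS27506 sails through, which is the heart of this incident.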
Normally in cases of leaks like this, the propagation is via some provider or peer who doesn't filter at all. In this case, one of the vectors was one of the most responsible filterers on the net. Sigh.
So in terms of engineering good solutions, the space is pretty crowded. One camp is of the "total solution" variety that involves new hardware (probably), new protocols (definitely), and a public-key approach where either originations (in the case of soBGP) or entire announcements (in the case of sBGP) are signed and verified. This is obviously a very good and fairly complete approach to the problem, but it's also obviously seeing precious little adoption. The soBGP and sBGP IETF drafts all appear to have expired, which is disconcerting. Both have been around for several years and neither can point to a single large-scale adoption. And in the meantime we have nothing.
Another set of approaches has been to look at alternate methods of building filters, taking into account more information about the history of routing announcements and dampening or refusing to accept novel, questionable announcements for some fixed, short amount of time. There's interesting work from Josh Karlin, along with Stephanie Forrest (who taught me an intro programming class back at UNM) and Jennifer Rexford, that suggests a way to build filters that penalize novel, suspicious routing announcements for some period of time while not impeding good, normal routes. This was also part of the work that Tom Scholl, Jim Deleskie and I presented at the last NANOG. All of these strategies have the disadvantage of being partial solutions, the advantage of being implementable easily and in stages without a network forklift or a protocol upgrade, but the further disadvantage of being nowhere near fully baked. It's unclear what it would take to finish the cooking process, but I'm excited to see what arises.
Clearly more people need to keep searching for good solutions to this set of problems. Extra credit for solutions that can be implemented by individual autonomous systems without hardware upgrades or major protocol changes, but that may not be possible.
And in the meantime, routing on the Internet is more or less wide open, and minor-scale disasters happen on a regular basis. There's a talk at NANOG 36 showing that route hijacking is dramatically more common than most people think. Good luck out there.
Update 2006-01-26 20:13 EST: Updated link to Josh Karlin’s paper so that it works now. Also related discussion rages on the NANOG mailing list so I strongly recommend bopping over there for those who are interested.