I’m writing this blog entry from the campground at Vermont’s beautiful Quechee Gorge, where I took the kids after work. Yes, Renesys is located smack in the middle of some of the nicest hiking, camping, and climbing on earth. No, you shouldn’t move here; Northern New England has enough out-of-staters already, thanks. Unless, that is, you are an unusually talented web developer, have worked as a peering coordinator, or know the Internet transit industry inside-out, in which case you should send me your CV, posthaste. thanks, –jim
Here We Go Again.
Imagine an innocent BGP message, sent from a random small network service provider’s border router somewhere in the world. It contains a payload that is unusual, but strictly speaking, conformant to protocol. Most of the routers in the world, when faced with such a message, pass it along. But a few have a bug that makes them drop sessions abruptly and reopen them, flooding their neighbors with full-table session resets every time they hear the offending message. The miracle of global BGP ensures that every vulnerable router on earth gets a peek at the offending message in under 30 seconds. The global routing infrastructure rings like a bell, as BGP update rates spike by orders of magnitude in the blink of an eye. Links congest. Small routing hardware falls over and dies. It takes hours for things to return to normal.
Last time, it was the African Network Operators’ Group, who had the temerity to actually use those fancy new 4-byte autonomous system numbers in a heavily prepended production context, causing certain Quagga routers to reset sessions.
The time before that, it was a small Czech ISP, who tickled a missed input range check in the configuration screen for the relatively obscure MikroTik routers, which led them to send out updates containing massively bloated AS paths, which caused certain Cisco routers to reset sessions. Cisco promptly issued a security advisory and patched the problem.
And This Week ….
At 17:07:26 UTC on Monday, CNCI (AS9354), a small network service provider in Nagoya, Japan, advertised a handful of BGP updates containing an empty AS4_PATH attribute. Yes, a BGP update whose path attribute contained just three bytes: one byte for flags, one byte containing the AS4_PATH type code, and a length byte equal to zero.
If you were writing this code, would you do something sensible if the AS_PATH contained zero autonomous systems en route to the prefix in question? It’s not terribly meaningful, but it’s just a range check. It doesn’t seem like an outrageous corner case to anticipate and handle correctly. Especially if you’re writing the operating system for a big carrier-class router, where this logic sits on one of the most important code paths in a very expensive piece of mission-critical hardware.
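To make the corner case concrete, here is a minimal sketch in Python of what a defensive parser for this attribute might look like. It is not Cisco's code or anyone else's; the names parse_attribute, parse_as4_path, and MalformedAttribute are made up for illustration. The wire layout (a flags byte, a type code byte, a one- or two-byte length, then segments of four-byte AS numbers) follows the BGP specifications, and 17 is the type code assigned to AS4_PATH. Treating a zero-length attribute as "no path information" rather than as an error is just one reasonable policy, assumed here.

```python
import struct

AS4_PATH = 17             # attribute type code assigned to AS4_PATH
FLAG_EXTENDED_LEN = 0x10  # when set, the attribute length is two bytes, not one


class MalformedAttribute(ValueError):
    """Hypothetical error type: the attribute cannot be parsed safely."""


def parse_attribute(buf, offset=0):
    """Return (type_code, value_bytes, next_offset) for one path attribute."""
    if len(buf) - offset < 3:
        raise MalformedAttribute("truncated attribute header")
    flags, type_code = buf[offset], buf[offset + 1]
    if flags & FLAG_EXTENDED_LEN:
        if len(buf) - offset < 4:
            raise MalformedAttribute("truncated extended length")
        (length,) = struct.unpack_from("!H", buf, offset + 2)
        start = offset + 4
    else:
        length = buf[offset + 2]
        start = offset + 3
    end = start + length
    if end > len(buf):
        raise MalformedAttribute("length runs past end of buffer")
    return type_code, buf[start:end], end


def parse_as4_path(value):
    """Return [(segment_type, [asn, ...]), ...]; an empty value yields []."""
    segments = []
    pos = 0
    while pos < len(value):
        if len(value) - pos < 2:
            raise MalformedAttribute("truncated segment header")
        seg_type, count = value[pos], value[pos + 1]
        pos += 2
        # The range checks: a segment must hold at least one AS number,
        # and all of its four-byte AS numbers must fit in the value.
        if count == 0 or pos + 4 * count > len(value):
            raise MalformedAttribute("bad segment length")
        asns = list(struct.unpack_from("!%dI" % count, value, pos))
        segments.append((seg_type, asns))
        pos += 4 * count
    return segments


# The three-byte attribute described above: flags, type code, length zero.
empty = bytes([0xC0, AS4_PATH, 0x00])
type_code, value, _ = parse_attribute(empty)
print(type_code, value, parse_as4_path(value))  # 17 b'' []
```

The point of the sketch is that the empty case falls out of the loop condition for free: with zero bytes of value there are no segments to read, so the parser returns an empty list instead of walking off the end of a buffer. What the router then does with an empty path is a policy decision, but tearing down the session should not be the default answer.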
This one seems to have bitten Cisco’s IOS XR, a relatively newborn “from scratch” rewrite of the venerable IOS, destined to run on big iron, like the CRS-1 or the 12000 series. If you were trying to cause a meltdown of the global Internet, this would be the kind of platform you’d target. Cisco promptly released a security advisory and patched the problem. (We refrained from blogging about this for a couple days in order to give everyone time to get fixed, but I expect that the people who run the kinds of big iron that use IOS XR are probably well-cared-for by Cisco’s support team and didn’t have to wait long.)
Musing at METRICON
Coincidentally, I was at USENIX last week, and sat in on a great workshop on security metrics, where Sandy Clark gave a somewhat controversial presentation on the interaction between software quality and the timing of exploit appearances. (Sandy is a member of Matt Blaze’s secure systems research group at UPenn.)
In a nutshell (and Sandy, please correct me if I mangle your argument), one of the strongest predictors of a significantly long time to the emergence of the first zero-day exploit for a new version of software is the degree to which the release represents a substantial rewrite of the code. Doing a rewrite seems to start a “honeymoon period,” during which time the system in question is safer from exploitation than it has been in a long time. In fact, the magnitude of the protective effect is so significant that you might ask yourself whether a dollar spent in pursuit of higher quality code is actually better spent rewriting the code periodically, to whatever quality standard you can achieve. Maybe using robots. Seriously.
I had a chance to chat with Matt and Sandy briefly afterward, and we mused about the implications for the security of the OSes that Internet routers run — Cisco IOS, Juniper’s JunOS, Quagga, etc. To be fair, the threat model is completely different. You typically can’t speak BGP to someone’s router, for example, outside the context of a preapproved trust relationship. But once you’re talking, you’re talking (through them) to every other router on the planet. The stakes are a lot higher, but the value to an attacker is less clear-cut — crashing the Internet doesn’t give you control of tens of millions of valuable computing platforms, the way a zero-day exploit in Internet Explorer does.
So, here we have a serious vulnerability (the first I know of) in a substantially rewritten version of a critical operating system, one which has been out there in the market for at least a few years without making major news in a bad way.
The honeymoon is over, in other words.
How Long Before This Happens Again?
The good news is that the Cisco TAC is at the top of its game, getting in touch with affected customers and rolling upgrades. The bad news is that all of us (not just Cisco) are evidently locked in this completely reactive mode of finding and fixing problems in the world’s critical communications infrastructure. Again and again, we have to wait to discover problems in the real world that should have been caught through automated techniques during development, or during testing in the lab.
The global mesh of BGP-speaking routers that we call the Internet has inherent vulnerabilities that stem from the software quality and policy weaknesses of its weakest participants, and the amplification potential of its best-connected participants. Running sloppy software at the edge of the routing mesh (in enterprises, say) is unlikely to give anyone the ability to propagate large amounts of instability or partition the Internet. But closer to the core, I think we have a serious problem to contemplate.
Remember, if you can get just one provider to listen to you, and not filter your announcements, you can get your message into the ear of just about every BGP-speaking router on the planet within about thirty seconds. And if some subpopulation of those routers can be reset, they act as amplifiers for your instability. Power law outage-size distributions are not a myth — they are a logical consequence of the structure of the Internet, the importance of a few key participants in carrying global traffic, and their reliance for interconnection on technologies that are clearly still in the shaking-out-the-obvious-bugs mode.
The honeymoon is over, folks. We’re staring into the gorge, and it’s a long way down.