I just spent a very pleasant three days attending NANOG 45 in the Dominican Republic. The whole thing was a whirlwind of peering, technical presentations, and catching up with the people who keep the North American parts of the internet backbone alive. What can I say? The DR is overflowing with friendly people, great food, warm breezes (82°F in Santo Domingo, versus 0°F at my house in New Hampshire), and very decent Presidente beer. Very conducive to thinking the big thoughts. The trick is to write them down …
On Monday, I shared about thirty slides describing some of the research I’ve been doing over the last few months, trying to come up with a straightforward, objective way of rating network service providers based on the behavior (or misbehavior) of their on-net customers. The presentation is here if you’d like to read it.
I’ll give you the gist of it, though. The vast majority of the internet’s “control traffic” (the BGP protocol messages that keep every router informed about how to reach the 250,000 or so individual networks on earth) is actually generated by, or on behalf of, a really small number of networks. How small? On any given day, perhaps 1% of the world’s networks generate, directly or indirectly, 50% to 70% of the BGP update traffic exchanged by routers all over the world. If you’re a glass-half-full kind of network engineer, that means that 99% of the world’s networks are good citizens: exemplary stability, quietly hanging out in the routing table, very rarely changed, going about their business and contributing very little to the load that the world’s routers have to process.
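To make that concentration concrete, here is a minimal Python sketch of the arithmetic. It is not the analysis code behind the slides; the function name `top_share` and the shape of the input (a mapping from a network identifier to the number of BGP updates attributed to it in a day) are assumptions made for illustration.

```python
def top_share(update_counts, fraction=0.01):
    """Share of all updates produced by the noisiest `fraction` of networks.

    update_counts: hypothetical mapping of network id -> updates seen in a day.
    """
    if not update_counts:
        raise ValueError("no networks supplied")
    total = sum(update_counts.values())
    if total == 0:
        return 0.0
    # Always look at at least one network, even for tiny samples.
    k = max(1, int(len(update_counts) * fraction))
    noisiest = sorted(update_counts.values(), reverse=True)[:k]
    return sum(noisiest) / total


# Made-up sample: 99 quiet networks and one that flaps all day.
sample = {f"AS{i}": 1 for i in range(99)}
sample["AS_noisy"] = 99
share = top_share(sample)  # 1 network out of 100 is the "worst 1%"
```

With real data the input would come from a day's worth of updates collected at a route monitor, and the interesting question is how far `top_share` sits above the 1% you would expect if everyone were equally noisy.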
I’m not one of those glass-half-full guys! I wanted to find out who the worst-behaved 1% were, and more to the point, who was making money by selling them transit without keeping an eye on the mess they were generating.
Pretty straightforward, I thought: there’s a tragedy-of-the-commons thing going on here, and people who provide transit to unstable networks are propagating their “table pollution” beyond their borders, around the world, making us all pay more to keep the Internet alive by their carelessness.
Perhaps, I suggested innocently, customers would seek out transit providers with better stability report cards, and offer them more money, because stability of the neighborhood could be a source of positive differentiation in an ever-more-cutthroat, low-margin industry. Perhaps, the audience responded realistically, transit providers could instead charge unstable downstream customers more, to penalize them for their tendency to flap their routes incessantly (the internet engineering equivalent of blasting your stereo at a red light with the top down). We were in basic agreement, though: someone should be pricing and paying for this pollution. (Overcoming the industry’s history of abject failure at billing for anything more subtle than commodity megabits is a discussion best left for another day.)
As usual, when you put a big idea in front of a bunch of intelligent people, it got more and more interesting. By the time I sat down, I had a few new observations.
First of all, the technical voices in the operator community understand these issues quite well. They know firsthand about CPU loads, traffic trends, and how hard it is to keep your network one step ahead of resource constraints as the Internet continues to grow like a weed. Since every large provider serves a broad range of customers, pretty much everyone has at least a few dozen downstream autonomous systems that exhibit serious instability and don’t clean it up, or don’t even know that they’re making a mess. But because the offending parties may be two or three signed contracts downstream (customers of customers of customers…), the sales and business development groups at big providers have zero incentive to take an interest. In this economy, if it doesn’t turn up big new revenue, or melt the network outright, it’s off the radar.
The second observation was that instability is really only one of three big routing “externalities” that people inflict on the planet’s infrastructure without significant penalty, the others being deaggregation (littering the routing table with redundant more-specific fragments, rather than advertising a few larger tidy networks) and failure to maintain clean, up-to-date, accurate entries in the routing registries. Being unstable chews up other people’s CPU cycles all over the world. Wanton deaggregation chews up other people’s scarce router memory resources all over the world. Failing to advertise your routing intentions, and stick to them, makes it nearly impossible (barring some future cryptographic miracle in global coordination) to decide whether a given routing advertisement is legitimate or suspicious, when you hear it on the other side of the world.
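Deaggregation is the easiest of the three to illustrate. The sketch below uses Python's standard `ipaddress` module to collapse a set of announced prefixes into the fewest covering blocks and compares the two counts. The function name `deaggregation_factor` and the sample prefixes are made up for illustration, and the approach is deliberately naive: it ignores origin AS, AS path, and any legitimate traffic-engineering reason for announcing more-specifics.

```python
import ipaddress


def deaggregation_factor(prefixes):
    """Ratio of announced prefixes to the minimal set covering the same space.

    prefixes: iterable of IPv4 prefix strings, e.g. "10.0.0.0/24".
    A factor of 1.0 means nothing could be aggregated further.
    """
    announced = {ipaddress.ip_network(p) for p in prefixes}
    if not announced:
        raise ValueError("no prefixes supplied")
    # collapse_addresses merges adjacent blocks and drops covered subnets.
    collapsed = list(ipaddress.collapse_addresses(announced))
    return len(announced) / len(collapsed)


# Four adjacent /24s that could have been announced as a single /22.
fragments = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
factor = deaggregation_factor(fragments)
```

Every unit above 1.0 is a routing table slot that other people's routers are holding for no extra reachability.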
But these are all regarded as venial sins, not even worthy of confession. Kind of like recharging your car’s air conditioner with CFCs, or burning all your kitchen garbage out behind your house in an oil drum.
The final observation was one that caused me to moderate my thirst for backbone street justice, at least a little bit. Many of the providers to whom I gave poor stability scores had something in common: they serve what I call the “adventurous parts of the internet,” regions of the planet that are underserved by internet transit infrastructure because of their political instability, their poverty, their unfortunate accidents of geography or choice of neighbors, or all three.
What I found was that unstable networks tend, more often than not, to be in places like Mexico, or Armenia, or Iran, or Vietnam, or Georgia, or Egypt, or Cambodia. It’s not that all providers in these places are unstable (far from it), but substantial populations are served by providers who just can’t seem to keep their routes under control. Often these are the old national incumbent telephone companies, trying to bring people into the 21st century, hobbled by a regulatory environment that doesn’t encourage efficient solutions.
By constructing a metric that ranks providers according to the aggregate stability of their customer base, wasn’t I just penalizing carriers who chose to serve the adventurous parts of the internet? Isn’t there a “positive externality” being created by these providers, in that they bring the internet within reach of everyone on earth?
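For readers who want to see what "aggregate stability of the customer base" could mean mechanically, here is one simple way to sketch it in Python. This is not the scoring method from the presentation. The names `customer_cone`, `instability_score`, and `rank_providers`, the input mappings, and the choice of "mean daily updates per downstream network" as the score are all assumptions made for illustration.

```python
def customer_cone(provider, customers_of):
    """All networks reachable by following customer links downstream.

    customers_of: hypothetical mapping of network -> list of direct customers.
    Includes customers of customers, but never the provider itself.
    """
    cone = set()
    stack = list(customers_of.get(provider, ()))
    while stack:
        asn = stack.pop()
        if asn != provider and asn not in cone:
            cone.add(asn)
            stack.extend(customers_of.get(asn, ()))
    return cone


def instability_score(provider, customers_of, update_counts):
    """Mean updates per downstream network (assumed metric; lower is calmer)."""
    cone = customer_cone(provider, customers_of)
    if not cone:
        return 0.0
    return sum(update_counts.get(asn, 0) for asn in cone) / len(cone)


def rank_providers(providers, customers_of, update_counts):
    """Providers ordered from most stable customer base to least stable."""
    return sorted(
        providers,
        key=lambda p: instability_score(p, customers_of, update_counts),
    )
```

A score built this way has exactly the weakness raised above: a provider whose cone includes one struggling incumbent in a hard-to-serve region looks worse than one that only sells to quiet enterprise networks, whatever the reason for the noise.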
I took a question from an audience member who suggested, partly tongue-in-cheek, that the instability we had documented was a sign that fewer people should be “speaking BGP” at the edge of the internet. That is, that these unstable edge providers should simply pick the single best provider they can afford, put all their eggs in that basket, and delegate the difficult challenges of maintaining a stable Internet presence to that one provider (be it Sprint, or Tata Teleglobe, or Verizon Business).
To be sure, that’s a conclusion you could draw from the data. But I have the feeling that in the long run, the benefits of keeping the adventurous parts of the internet well-connected via diverse, redundant transit will outweigh the extra noise they generate. I hope that we can use instability scoring as a tool to identify providers who need a little extra help, a few words of advice from the global networking community, maybe even an invitation to participate in global community events like NANOG, so that they, too, can find and fix their internet pollution problems.