As promised, here’s part two of the story about Cogent’s rough week. When last we left our intrepid optical network, it was depeering wee little British autonomous systems in an effort to gussy itself up for future suitors (we guessed; there were several other interesting guesses as well. More on that shortly). Well, things went downhill from there.
On Wednesday, April 25, at about 19:25 UTC (15:25 EDT / 12:25 PDT), Cogent had a fairly serious backbone issue. It was reported on NANOG. It was a moderately large event at the time, with a total impact on most of Cogent’s network for about 45 minutes, and at least some part of the network affected for almost three hours. The problem was attributed to a router software bug. Cogent had another problem later in the week, on Friday, that appears to have impacted only customers in Boston.
Part of my interest in these events is personal: Renesys (AS34135) is single homed to Cogent at a development site in Boston. These two outages both happened to hit during the middle of user testing for a new application we’re working on (more on that in the coming weeks). So that was pretty embarrassing and frustrating. We’re shopping around for other providers at 1 Summer now, but (as usual) providers are unclear on whether they can offer service in the building and what they might charge to do so. So we’re waiting. Additionally, two of Renesys’s three other service providers in New Hampshire, Worldpath (AS3770) and SEGNet (AS11524), use Cogent as one of their upstreams as well. So we were impacted by the problems. But being a customer of, or a provider to, someone who has a network problem isn’t enough to raise my interest (we have a lot of customers who run networks, strangely enough).
My main interest in the Cogent outage is that it was large enough to be felt across the Internet and gives me an opportunity to look at some of the ways to understand and analyze such events after the fact. So let’s take a look at what happened, not just from the RFO (Reason For Outage) issued by Cogent, but rather what the whole Internet thought of the event.
There are lots of people storing BGP updates and offering some routing analysis tools. Renesys does that, too. What all of these tools have in common is that they ultimately show the user BGP update messages for some set of network prefixes and, with some graphical tools or not, leave the user to interpret the resulting event.
BGP update messages are great for looking at the behavior of a specific set of prefixes. They can show the routing status and dynamics for those prefixes in a particular time window. But only when you know the time window and the prefixes. Sometimes, that is the form of the question: “What happened to my network in Boston on Wednesday afternoon?”
But there is another common form of question: “Is something going on between Cogent and Sprint?” or “Did Cogent’s whole network just melt down?” Both of those questions could, in theory, be answered by looking at prefix updates. But it would take a whole lot of updates. As I’ve mentioned here before, there’s another way to look at this: the edge between the two ASes.
Graph Theory Warning: I swear that I really know basically no graph theory—exactly the amount needed to get a Master’s in Computer Science and no more (probably a little less; I’m good at faking it). But this is about graph theory.
Consider all of the Autonomous Systems (ASes) on the Internet to be nodes and consider the edges between those nodes to be directed, with the direction governed by the direction of traffic flowing (or the inverse: routes being announced—it doesn’t really matter which direction you pick, just standardize on one). If you build a graph like this and then store every update as a series of changes to the edges (prefixes moving on and off edges), lots of things become easier. Now you can ask questions about the edge between AS174 (Cogent) and AS1239 (Sprint). By simply looking at a time series of changes to that one edge, you can understand much more about what’s going on.
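As a minimal Python sketch of that idea (not Renesys’s actual tooling; all names here are hypothetical), you can derive directed AS-AS edges from a BGP AS_PATH and keep a per-edge set of the prefixes currently riding on each edge:

```python
from collections import defaultdict

def edges_from_as_path(as_path):
    """Yield the directed AS-AS edges implied by a BGP AS_PATH.

    The path is read in announcement order (nearest AS first, origin
    last). The direction convention is arbitrary -- the point is just
    to pick one and standardize on it.
    """
    # Collapse AS-path prepending (consecutive duplicates) before pairing.
    deduped = [as_path[0]]
    for asn in as_path[1:]:
        if asn != deduped[-1]:
            deduped.append(asn)
    return list(zip(deduped, deduped[1:]))

# Hypothetical edge index: edge -> set of prefixes currently seen on it.
prefixes_on_edge = defaultdict(set)

def apply_announcement(prefix, as_path):
    """Record that `prefix` is now routed across every edge in the path."""
    for edge in edges_from_as_path(as_path):
        prefixes_on_edge[edge].add(prefix)
```

A real system would also process withdrawals and track state per BGP peer, but even this toy version shows why the edge view is convenient: asking about the Cogent–Sprint edge becomes a single dictionary lookup on `(174, 1239)` instead of a scan over millions of prefix updates.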
Similarly, for this event, we were able to look at the changes to all of the edges touching Cogent (AS174) over the few hours surrounding the Wednesday event. There are two ways to look at the edge data. One is simply to count prefixes moving on and off an edge. During the hour of 19:00 – 20:00 UTC on Wednesday, April 25, that included edge changes like this:
| Affected Cogent Peer/Provider | AS_AS Edge | Num. Lost Prefixes |
| --- | --- | --- |
| Time Warner Telecom | 174_4323 | 2232 |
Note that edges are directional, which is why some ASes are listed twice: we see the connection between that AS and Cogent in both directions. This only includes edges that lost more than a thousand prefixes in this hour. Overall, we saw 2662 directed edges between Cogent and some other AS lose prefixes in that hour, compared to 100 two hours earlier.
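The prefix-counting approach above can be sketched in a few lines, assuming you already have per-edge prefix sets snapshotted at the start and end of the hour (again, hypothetical names, not Renesys’s production code):

```python
def lost_prefixes_per_edge(before, after):
    """Count prefixes that left each edge between two snapshots.

    `before` and `after` map a directed edge (e.g. the tuple
    ("174", "4323")) to the set of prefixes seen on it at snapshot
    time. Returns edge -> number of prefixes lost; edges with no
    losses are omitted.
    """
    losses = {}
    for edge, prefixes in before.items():
        lost = len(prefixes - after.get(edge, set()))
        if lost:
            losses[edge] = lost
    return losses
```

From there, flagging an event like this one is just a filter: how many edges lost prefixes at all in the hour, and which edges lost more than a thousand.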
It is clear that this event was a big deal. Thousands of prefixes moved off of connections to Cogent. In many cases, this represented a complete shutting down (or at least zero-prefixing) of the connection to Cogent by these networks. The event lasted roughly as long as Cogent said it did, and then those prefixes mostly returned to where they belong. Aexonius is a smallish British network that buys from Cogent. I can’t figure out why they were so much more affected than a number of other Cogent customers, but it seems clear that they were.
There’s another way (other than counting prefixes) to look at this stuff: PPT, Prefix Peer Time. This is something we discussed at length in the presentation on the Taiwan quakes that we gave at APRICOT in Bali in February. The basic idea is that you keep track of the total time a prefix is seen on an edge by one Renesys peer, and you add it all up. To do this, you have to process every update from every peer and keep track of the full routing table from every peer at a reasonably fine time granularity (1s is best).
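The PPT bookkeeping can be sketched as an interval accumulator, assuming you can observe when a given peer starts and stops seeing a prefix on an edge (this is an illustrative sketch under that assumption, not the described production system):

```python
from collections import defaultdict

class PPTAccumulator:
    """Accumulate Prefix Peer Time per edge.

    For each (edge, peer, prefix) triple, PPT adds up the seconds
    during which that peer saw that prefix on that edge, then rolls
    the total up per edge.
    """
    def __init__(self):
        self.active = {}                # (edge, peer, prefix) -> time first seen
        self.ppt = defaultdict(float)   # edge -> accumulated seconds

    def on(self, edge, peer, prefix, t):
        """Peer `peer` started seeing `prefix` on `edge` at time `t`."""
        self.active.setdefault((edge, peer, prefix), t)

    def off(self, edge, peer, prefix, t):
        """Peer `peer` stopped seeing `prefix` on `edge` at time `t`."""
        t_on = self.active.pop((edge, peer, prefix), None)
        if t_on is not None:
            self.ppt[edge] += t - t_on
```

The expensive part, as the text says, is feeding this: you need per-peer routing tables reconstructed from every update at fine time granularity before you can emit the on/off transitions at all.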
The advantage of PPT is that it rolls big events like this up into a single score. The disadvantage is just as clear: by doing that, you lose visibility. A big spike in PPT can be due either to a large number of peers selecting a small number of new prefixes on an edge for a moderate amount of time, or to a single large leak selected by only a few peers. You can’t tell the difference. Still, it’s an incredibly useful measure, and one that we build into several of our data products already. We’re working on ways to expose it to customers who aren’t Graph Theory geeks.
As an operations guy, there’s one other huge disadvantage to this way of doing things: system resources. In order to build the edge index and keep track of prefixes moving on and off edges, we consume a significant amount of memory. In order to track the history and be able to mine it after the fact, it takes a lot of disk. As Renesys’s count of peers exceeds 180 now, this problem is only getting worse. But I think that it’s all worth it in the end. I’m interested in hearing from peers, customers and readers about your interest in metrics like this and what you would do with them if you had them. What problems might this view of the Internet help you solve? What visibility could this give you that you don’t have already?
Back to the peering angle from last week: as far as I can tell, there isn’t one. There seems to be no connection between the depeerings and the outages. Cogent does seem to be in the process of cleaning up its peerings, and maybe not just in Europe. They’re taking a look at peerings that they have with smaller networks and trying to make sure that the peering relationship makes sense. It will be interesting to watch whether this is just a brief cleanup phase or part of a more prolonged process of peer reduction.