We started this blog thread last week, when we only had two broken cables to consider, but since that time there have been reports of several more failures and they seem to keep coming in. As far as this thread is concerned, the first two parts (here and here) focused on the countries and local providers most impacted on the day of the first two cable failures. We then looked at the providers of some of the harder-hit countries and how they were able to restore connectivity (or not) during the subsequent 48 hours. And along the way, we felt obliged to counter some nonsense circulating on the Internet claiming that Iran had been cut off. It’s been a busy week and we’ve barely scratched the surface. But plowing ahead, we will take an extended look at two local providers, Bharti in India and DCI in Iran, and how they weathered the storm. One week later, how are these two local providers gaining access to the global Internet? What has changed? We will use these examples to provide a glimpse into what can be discovered by collecting enough public routing data from enough carefully selected places, combining it with geo-location information and then doing an enormous amount of processing.
I’m going to start with a word of caution: this will be the most technical of our discussions so far. However, it will not be difficult to follow if we take things one step at a time. The first simple observation is that an organization can be both a customer and a provider, depending on one’s point of view. For example, Bharti is both a provider to numerous companies in India and a customer of Sprint. So whether someone says they are a provider or a customer depends on the direction in which the money is flowing, toward them (provider) or away from them (customer). To introduce some terminology, let’s consider customer C with providers P1 and P2. From our data, Renesys will observe these two business relationships, as well as the networks (IP prefixes for you routing experts) that are routed across them. We can infer a lot by watching the routing along these links and how it changes over time. For example, if P1 is having a problem, we might see C suddenly shift some of their networks to P2. We will observe this as a decrease in the number of networks on the C-to-P1 link and a corresponding increase on the C-to-P2 link. On the Internet, traffic equates to money, so P1 just lost some cash flow, while P2 gained some.
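For the routing experts, the C, P1 and P2 example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not a description of how Renesys processes its data: the (prefix, customer, provider) tuples, the function names and the prefixes themselves are all made up for the example.

```python
def prefixes_per_link(routes):
    """Count distinct prefixes seen on each (customer, provider) link.

    `routes` is an iterable of (prefix, customer, provider) tuples, a
    hypothetical simplification of what can be inferred from routing data.
    """
    links = {}
    for prefix, customer, provider in routes:
        links.setdefault((customer, provider), set()).add(prefix)
    return {link: len(prefixes) for link, prefixes in links.items()}


def link_changes(before, after):
    """Return the change in prefix count per link between two snapshots."""
    b = prefixes_per_link(before)
    a = prefixes_per_link(after)
    return {link: a.get(link, 0) - b.get(link, 0) for link in set(a) | set(b)}


# Made-up snapshots: C moves two of its networks from P1 to P2.
before = [
    ("10.0.0.0/24", "C", "P1"),
    ("10.0.1.0/24", "C", "P1"),
    ("10.0.2.0/24", "C", "P1"),
    ("10.0.3.0/24", "C", "P2"),
]
after = [
    ("10.0.0.0/24", "C", "P1"),
    ("10.0.1.0/24", "C", "P2"),
    ("10.0.2.0/24", "C", "P2"),
    ("10.0.3.0/24", "C", "P2"),
]

changes = link_changes(before, after)
print(changes[("C", "P1")], changes[("C", "P2")])
```

In this toy case the C-to-P1 link loses two networks and the C-to-P2 link gains two, which is the kind of shift described above.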
With this background, let’s take a closer look at Bharti and the major carriers who connect them to the rest of the global Internet. Before the cable cuts, Bharti was receiving service from several carriers including British Telecom (BT), Deutsche Telekom (DTAG), Cable & Wireless (C&W) and Sprint. We observed these four particular carriers until the cable breaks, and then each of these simply went away. Only Sprint eventually recovered to some degree on 2 February, but ended up carrying far fewer networks. It is not surprising that certain carriers went completely off-line, but why did Sprint come back after two days? No cables were repaired during that time and no new ones were suddenly brought into service.
Sprint has a strong global network and has considerable capacity heading from Asia to the west coast of the US. If their outage could have been corrected by a configuration change, you would think that would not have taken two days. Are they selling service in India on routers without capacity in both directions? Were they preferring their more expensive MPLS service over IP and had no available bandwidth for IP? If so, what happened after two days to restore IP service? Looking at Sprint’s network maps (see page 9), they claim to have capacity on SEA-ME-WE 3, which was not impacted by the outages. What exactly was Sprint doing for those two days in India?
As for the providers who gained new traffic, AT&T, SingTel and Level 3 initially picked up new networks from Bharti. However, all of them subsequently fell, perhaps due to another cable cut, with only Level 3 managing to preserve some of their gains. This answers one of our questions from an earlier blog about exactly how Level 3 managed to gain business in India. It was due almost entirely to Bharti, a very large local provider.
Now, let’s consider DCI in Iran. DCI is the only provider in Iran with connections to the outside world. Most of their traffic flowed via TTNet, SingTel or Flag before the breaks, and not surprisingly, Flag lost many of the networks it carried earlier.
But this graph might seem to contradict our previous blog, where we said that the networks in Iran that lost connectivity had quickly recovered. The graph shows a drop in networks carried by Flag, but no corresponding rise in networks for TTNet and/or SingTel. This is explained by the fact that networks can be carried by more than one provider. For example, I might reach a network in Iran via Flag, but you might reach that very same network via TTNet. This is why Iran was able to recover so quickly. DCI could use any one of their three primary providers and, in fact, were using more than one of them for many of their networks. When Flag failed, traffic could easily move to one of the surviving providers. So although total bandwidth into the country was reduced, there was little in the way of a long-term outage for many networks.
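A small Python sketch may make this less mysterious. The prefixes and the sets assigned to each provider are invented for illustration and are not DCI's actual routing; the point is only that per-provider counts can fall for one carrier without rising for the others, while the set of reachable networks stays the same.

```python
def counts(carried):
    """Number of networks each provider carries."""
    return {provider: len(nets) for provider, nets in carried.items()}


def reachable(carried):
    """Networks reachable through at least one provider."""
    result = set()
    for nets in carried.values():
        result |= nets
    return result


# Made-up example: most networks are announced via more than one provider.
before = {
    "Flag": {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"},
    "TTNet": {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"},
    "SingTel": {"198.51.100.0/24", "203.0.113.0/24"},
}
# After the cut, Flag carries only one of the three networks.
after = dict(before, Flag={"203.0.113.0/24"})

print(counts(before), counts(after))
print(reachable(before) == reachable(after))
```

Here Flag's count drops from three to one, TTNet and SingTel stay flat, and every network is still reachable through a surviving provider.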
From this discussion, we can see that graphing the number of networks over time does not always tell the whole story. Here SingTel and TTNet both could have picked up a lot of new traffic because of the failure of Flag, but not necessarily any new networks. How can we observe such situations in the routing data? Well, when Iran had three main providers, the rest of the world would pick these three in some proportion based on various routing attributes, which are beyond the scope of this discussion. However, when Flag went away, there were only two primary Iranian providers left standing. Renesys’ worldwide assortment of routing peers (i.e., data collection points) would have been forced to pick one of the two survivors for traffic into Iran. We can capture this with a metric we call PPT, peer-prefix-time. Basically, for each of the Renesys peers worldwide, we count up the total amount of time this peer routes a network (prefix) in a particular way. Thus, for each network in Iran and each peer, we’ll know how long it was routed via TTNet or Flag or SingTel. Adding up these times for all networks in Iran tells us how popular these providers are for gaining access to Iran on any given day. We show this in the graph below.
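To make the idea of peer-prefix-time concrete, here is a minimal Python sketch of the summation described above. It assumes the routing data has already been reduced to (peer, prefix, provider, hours) records for a single day; that record format, the function names, the peer names and the numbers are all hypothetical, and the real computation is certainly more involved.

```python
from collections import defaultdict


def peer_prefix_time(observations):
    """Sum, per provider, the time each (peer, prefix) pair was routed via it.

    `observations` is an iterable of (peer, prefix, provider, hours) tuples,
    an assumed, simplified record format for one day of routing data.
    """
    totals = defaultdict(float)
    for peer, prefix, provider, hours in observations:
        totals[provider] += hours
    return dict(totals)


def shares(totals):
    """Express each provider's total as a fraction of all peer-prefix-time."""
    grand = sum(totals.values())
    return {p: t / grand for p, t in totals.items()} if grand else {}


# Made-up day: two peers, two prefixes, two surviving providers.
day = [
    ("peer1", "192.0.2.0/24", "TTNet", 24),
    ("peer1", "198.51.100.0/24", "TTNet", 12),
    ("peer1", "198.51.100.0/24", "SingTel", 12),
    ("peer2", "192.0.2.0/24", "SingTel", 24),
    ("peer2", "198.51.100.0/24", "TTNet", 24),
]

totals = peer_prefix_time(day)
print(totals, shares(totals))
```

With these invented numbers TTNet accumulates 60 peer-prefix-hours against 36 for SingTel, so TTNet is the more popular path even though both providers carry both networks.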
Immediately after the cable cuts, TTNet was preferred for access to Iran by a significant majority of the world over SingTel. But then as time went on, the two providers achieved rough parity. This could have been because of routing decisions made by DCI to balance traffic between their remaining providers. This example is to show that simply counting up routed networks, while useful, only gets you so far. You also need to know how the rest of the world chooses between the available options and in what proportion. Flag, which still routed networks to Iran after the cuts, was selected by almost no one.
I want to thank you for getting to this point in my blogs and for all the thoughtful comments I have been receiving both publicly and privately. I wish I had the time to answer all the questions, but I guess if I did we wouldn’t have much of a business. Renesys makes its money by selling such Internet Intelligence to its customers. So with this blog, I am going to close out our discussion of cable breaks for now, except that I’ll soon follow up with some non-technical concluding remarks and lessons learned.