
Latency Measurements @ Scale


Phil Stanhope, Fellow at Dyn


In my last post I talked about Telemetry @ Internet scale. But how do we collect nearly 6 billion latency measurements per day? First, let’s talk through what 6B latency measurements a day actually means.

Let’s start with the obvious: that’s over 250M per hour which is over 4M per minute or nearly 70,000 per second.
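
If you want to check that arithmetic yourself, it’s a few lines of Python. (The 6B figure is the round number above, not an exact count.)

```python
# Back-of-envelope breakdown of the daily measurement rate.
MEASUREMENTS_PER_DAY = 6_000_000_000

per_hour = MEASUREMENTS_PER_DAY / 24    # ~250M per hour
per_minute = per_hour / 60              # ~4.2M per minute
per_second = per_minute / 60            # ~69,400 per second

print(f"{per_hour:,.0f} per hour, {per_minute:,.0f} per minute, "
      f"{per_second:,.0f} per second")
```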

Okay, big numbers. But they’re tiny compared to the number of packets that flow through the routers at an Internet exchange point. On the other hand, if those numbers represented pages delivered by a website as measured by services such as Amazon’s Alexa, we’d be talking about one of the top 10 sites on the Internet.

What type of measurements are we talking about here when we say 6B per day? We’re talking about trace routes.

At Dyn we measure the performance of moving data along the networks that connect servers and clients together. We measure, on average, over 1.5M different IPs on a daily basis for a total of 400M+ trace routes per day. We measure that performance from a network of 200+ collectors distributed across both the core and the far edges of the Internet. It’s hard to visualize the complexity of the Internet.


A visual representation of the Internet. Source.

Another way of rendering the complexity makes the Internet look more like a picture of the globe at night (based on Facebook connection data).

And then there’s the physical view that shows how the major backbone cables and fibers bind the Internet together.

It’s pretty hard to visualize and internalize the complexity of the Internet, but it’s relatively easy to perform measurements of its performance and structure. The Internet is a chaotic distributed system, intentionally designed to be redundant. Yet there is significant variance in how well that redundancy goal is met when you look at it on a country-by-country basis.

Despite the complexity, there is significant stability (over time) and predictability. One of the goals with the gigantic data sets we’ve built up over more than a decade is to help understand what is normal and, more importantly, what is not.

Related Reading:

The Vast World of Fraudulent Routing
Routing Leak Briefly Takes Down Google
Why Far-Flung Parts of the Internet Broke Today

In the last post I mentioned that we couldn’t afford to haul all the logs back, and that we’ve been working on a variety of telemetry processing approaches for over a decade. What some have been calling lambda architectures process ever-smaller batches alongside real-time stream processing. This approach attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive, accurate views of archived data, while simultaneously using real-time stream processing to provide views of online data.
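
Stripped of all the machinery, the lambda idea looks something like the sketch below: a comprehensive batch view periodically recomputed from the archive, a fast real-time view fed by the stream, and queries answered by merging the two. The names and merge rule here are illustrative only, not our actual pipeline.

```python
# A minimal sketch of the lambda-architecture idea: batch view + speed view,
# merged at query time. Purely illustrative, not a production pipeline.
from collections import defaultdict

batch_view = defaultdict(list)      # rebuilt periodically from the full archive
realtime_view = defaultdict(list)   # updated as measurements stream in

def record_stream_measurement(prefix: str, rtt_ms: float) -> None:
    """Speed layer: fold a new latency sample into the real-time view."""
    realtime_view[prefix].append(rtt_ms)

def rebuild_batch_view(archived_measurements) -> None:
    """Batch layer: recompute the comprehensive view from all archived data."""
    batch_view.clear()
    for prefix, rtt_ms in archived_measurements:
        batch_view[prefix].append(rtt_ms)

def query_latency(prefix: str):
    """Serving layer: answer queries by merging both views."""
    samples = batch_view.get(prefix, []) + realtime_view.get(prefix, [])
    return sum(samples) / len(samples) if samples else None
```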

For analysts looking at user behavior in social networks, behavioral trackers, etc., the number of points in the graphs is huge: in the billions. And when segmentation and user profiles are constructed across that many users, it becomes tens of billions. A simple example: as I sit on the couch across from my wife watching Mad Men on a Sunday evening, which computer behind our NAT’d address is whose? Of course, the obvious answer is the use of cookies. That’s a simple way to differentiate us.

For the types of latency measurements we’re talking about, however, the task is actually a bit simpler. Low-level trace routes don’t let us use an HTTP browser construct like a cookie, and we wouldn’t need one even if they did. We care about measuring latency for individual IPs, for the network segments they belong to, for the routers they sit behind, for the connections between those routers, etc.

In one of his blog posts, Jim has written about the number of routes that comprise the Internet and what might happen as we pass the 512K limit. 512K is a lot smaller than 1B Facebook users, or nearly 4B IPv4 addresses, or even the hard-to-imagine 42 undecillion (4.2 x 10^37) usable IPv6 addresses.
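
For a rough sense of those orders of magnitude (the 42 undecillion figure lines up with the 2000::/3 global unicast block, i.e. 2^125 addresses):

```python
# The scale gap in round numbers: routing table entries vs. address space.
routes_512k = 512 * 1024          # ~524,288 routing table entries
ipv4_total = 2 ** 32              # ~4.3 billion IPv4 addresses
ipv6_global_unicast = 2 ** 125    # 2000::/3, roughly 4.2 x 10^37 addresses

print(f"{routes_512k:,} routes vs. {ipv4_total:,} IPv4 addresses")
print(f"IPv6 global unicast space: {ipv6_global_unicast:.2e}")
```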

This makes the job of latency measurement on the current Internet and the resulting analyses a little bit easier.

So how are we doing this? Trace routes. Lots of them. A trace route is a computer network diagnostic tool for displaying the route (path) and measuring transit delays of packets across an Internet Protocol (IP) network.
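
If you’ve never poked at trace route output programmatically, a minimal sketch looks something like this. It shells out to a Unix-style traceroute (our collectors use their own probing tools, but the shape of the result is the same idea): one (hop, address, RTT) row per router along the path.

```python
# Minimal sketch: run a trace route and extract per-hop latency.
# Assumes a Unix-like `traceroute` binary is on the PATH.
import re
import subprocess

HOP_LINE = re.compile(r"^\s*(\d+)\s+(\S+)\s+([\d.]+)\s+ms")

def trace(target: str, max_hops: int = 30):
    out = subprocess.run(
        ["traceroute", "-n", "-q", "1", "-m", str(max_hops), target],
        capture_output=True, text=True, check=True,
    ).stdout
    hops = []
    for line in out.splitlines():
        m = HOP_LINE.match(line)
        if m:  # unreachable hops ("* * *") are simply skipped
            hops.append((int(m.group(1)), m.group(2), float(m.group(3))))
    return hops  # e.g. [(1, "192.0.2.1", 0.42), (2, "198.51.100.9", 3.8), ...]
```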

Thankfully, a trace route is nothing like the complexity of a typical web page. The average web page contains over 1,600 KB of data across 112 objects. I’ll start to touch on what we’re doing for web-specific latency measurements using RUM techniques in my next post.

If we had to move 1,600 KB for every trace route, we’d have huge bandwidth bills, and all of our peers and friends on the Internet who help us gather this critical performance data wouldn’t be very happy with us.

These 400M+ traces with nearly 6B measurements end up being a rather modest ~100 GB per day of data on disk. Not a big deal when you have many petabytes of storage to work with.
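
A quick back-of-the-envelope on what that implies per measurement (reading the figure as gigabytes on disk):

```python
# Rough per-measurement and per-trace cost implied by the numbers above.
traces_per_day = 400_000_000
measurements_per_day = 6_000_000_000
bytes_per_day = 100 * 10**9   # ~100 GB/day on disk (assumed)

print(f"~{bytes_per_day / measurements_per_day:.0f} bytes per measurement")
print(f"~{bytes_per_day / traces_per_day:.0f} bytes per trace")
print(f"~{bytes_per_day * 365 / 10**12:.1f} TB per year")
```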

Each hop in a trace route is a factor in determining how well organized and optimized the flow of traffic between two points is. Of course, the Internet is designed to allow multiple paths to be taken for a variety of reasons, including policy, peering choices, etc.

What we’re not measuring here is the cost of actually getting packets to flow to and from a web user and a web server. We’re also not measuring how well a particular web server handles requests and responses.

Key to this approach is pre-shaping the data for the additional cooking we’ll eventually do: what the data scientists on the team call feature analysis.

Because we’re focusing on the network perspective, we’re organizing our datasets in ways that are network focused. No surprise there: we organize based on prefix. In the user tracking space, they’d call this user profiling or building a social reputation service. When the number of discrete attributes is large and the number of potential entities is enormous (more than 3B Internet users), such a reputation service is going to consume petabytes of raw storage.

We typically focus on network prefixes, and that dramatically changes the size of the data space we build our prefix reputation data stores around. There are still big numbers of entities in our systems, in the hundreds of millions, but that’s an order of magnitude or more smaller than some of the typical uses in the social networking space.
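
Here’s a minimal sketch of that prefix-oriented rollup: collapse per-IP latency samples into per-prefix summaries. In practice the prefixes come from routing data; the fixed /24 here is purely illustrative.

```python
# Illustrative rollup of per-IP latency samples into per-prefix summaries.
import ipaddress
import statistics
from collections import defaultdict

def rollup_by_prefix(samples, prefix_len: int = 24):
    """samples: iterable of (ip_string, rtt_ms) pairs."""
    by_prefix = defaultdict(list)
    for ip, rtt_ms in samples:
        prefix = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        by_prefix[str(prefix)].append(rtt_ms)
    return {
        prefix: {"count": len(rtts), "median_ms": statistics.median(rtts)}
        for prefix, rtts in by_prefix.items()
    }

# rollup_by_prefix([("198.51.100.7", 12.3), ("198.51.100.9", 14.1)])
# -> {"198.51.100.0/24": {"count": 2, "median_ms": 13.2}}
```

Summaries like these, rather than every raw sample, are the kind of thing a prefix-oriented store can afford to keep around over long periods.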

The great thing about this is that it starts to make the approaches a lot more tractable. We can afford to save all the cooked data, which lets us do other types of analyses of prefix and route behavior over time. This is fun stuff to do and talk about.

Since we’ve been doing this for so long, we started to ask ourselves a year or so ago: what if we could do this for actual services that live above the paths and routes themselves? What if we could leverage our arsenal of tools and analytical techniques and apply them to what’s happening at the layer most people think of as the Internet? Yes, we know the difference, but the real action is in the HTTP space: CDNs, RESTful APIs, web servers, user beacons, ads, etc.

We’ll delve deeper into that in the next post: RUM – Real User Measurements @ Internet Scale.

Phil Stanhope is a Fellow at Dyn, working with the Office of the CTO since 2013. Phil's focus varies across engineering, infrastructure, architecture, analytics, operations and emerging technology strategy and planning. Phil is a known thought leader in the industry, having served on numerous advisory boards and technology adoption programs. Connect with Phil on LinkedIn, or follow @Dyn on Twitter.