
Telemetry @ Scale


Phil Stanhope, Fellow at Dyn


At Dyn we’re in the business of making sure that the Internet functions smoothly. A few years back there was a campaign trying to make DNS seem sexy. I may personally feel that it is, but the reality is that most people don’t even know what DNS is or how it impacts them.

Part of seamlessly providing DNS is operating global networks of data centers and using the most advanced techniques available to manage the networks and announcements that bind our data centers together. When we get that right, we don’t go down. Ever. The result is that we process tens of billions of requests per day. We analyze, account for, and correlate each and every request, because a portion of our business is ensuring that our customers’ websites are available, and providing that service is how we make money.

We’re easily processing 100B data points per day — just on our enterprise DNS traffic. Those data points help us understand how to defend and operate our network. They are shared in the aggregate with our customers to help them ensure that their sites are always available. And, of course, we charge for the service provided. A virtuous circle.

I’ve attended many conferences around big data and all of the advances driven by taking the log files produced by websites and throwing them into Hadoop and similar batch analytical stores for subsequent analysis. At a recent Strata conference there was a show-of-hands question about how much data attendees had in their Hadoop cluster(s). 10TB was a big number for most folks. At Dyn, we shed 10TB in a day. Easily. So we long ago started to do other things that are more in line with the ideas of the Lambda Architecture. This approach attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data.
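To make that concrete, here’s a minimal sketch of the Lambda idea in Python: queries are answered by merging a comprehensive-but-stale batch view with a fresh-but-partial real-time view. The names and the toy counting workload are hypothetical illustrations, not our actual implementation.

```python
from collections import defaultdict

batch_view = defaultdict(int)     # rebuilt periodically from the master dataset
realtime_view = defaultdict(int)  # updated per event since the last batch run

def record_event(zone: str) -> None:
    """Speed layer: fold each incoming event into the real-time view."""
    realtime_view[zone] += 1

def rebuild_batch_view(master_dataset: list[str]) -> None:
    """Batch layer: recompute the full view, then reset the speed layer."""
    batch_view.clear()
    for zone in master_dataset:
        batch_view[zone] += 1
    realtime_view.clear()

def query_count(zone: str) -> int:
    """Serving layer: a query sees batch + real-time, merged on read."""
    return batch_view[zone] + realtime_view[zone]
```

The key property is that the batch layer can be slow and thorough while the speed layer stays small and fast; neither alone has to be perfect.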

There is no single approach to building a Lambda architecture-based system. For me, it’s a conceptual architecture; the devil is in the details of the implementation. I’ve led a couple of teams here at Dyn over the past few years where my starting premise is simple: we don’t ever want to process log files, nor retain them. Part of it is economically driven. We’re still lean and mean and growing fast. We can’t afford to spend all our time building out petabytes of HDFS storage and running ever more jobs against an ever larger dataset. Hadoop will always be part of our core processing infrastructure, but it’s simply not enough.

One of the problems with traditional batch processing backed by relational systems, at the scale we operate at, is that we can’t afford to wait for certain answers. Thus introducing real-time stream processing is a no-brainer. But what should those streams contain? What are the critical telemetry points that we must know about in real time? Which ones can we do partial processing (cooking) on and allow to flow further downstream for subsequent analysis, aggregation, and reporting? The sketch below shows one way that split can look.
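Here’s a hedged sketch of that split: react immediately to the telemetry that demands it, and “cook” everything else down into a smaller record for downstream aggregation. The field names and thresholds are invented for illustration.

```python
def cook(raw: dict) -> dict:
    """Partial processing: keep only what downstream analysis needs."""
    return {
        "ts": raw["ts"],
        "pop": raw["pop"],                # which data center answered
        "qtype": raw["qtype"],            # DNS query type
        "latency_ms": raw["latency_ms"],
    }

def process(stream, alert, downstream):
    for raw in stream:
        # Critical telemetry: act on it in real time.
        if raw.get("rcode") == "SERVFAIL" or raw["latency_ms"] > 500:
            alert(raw)
        # Everything else flows on, partially cooked, for later
        # aggregation, analysis, and reporting.
        downstream(cook(raw))
```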


As a DNS operator, we’re always under attack. That shouldn’t be a surprise to folks reading this. We’re a prime target because if you could get us to comprehensively fail, the Internet as my wife, kids, and friends know it would cease to exist: the websites they depend on to buy things, run things, and share things wouldn’t have dial-tone. Of course, we don’t go down. (Completely, that is. Everything breaks at scale, but thankfully not everything breaks at once.)

What can we do about the endless pressure to know more about what’s going on, sooner? We can’t wait for logs from machines operating across the backbone of the Internet to dribble back to our core processing centers, pass through a variety of scrubbing operations, and only then adjust our announcements because of something that happened many minutes or hours ago.

At Dyn, we’re not just about DNS. We also do transactional email, and my biggest personal interest is in Internet Performance, writ large. Writ really large. A year ago, I was lucky to work on the team that helped bring to Dyn the folks from Renesys — the folks who created the systems for tracking the underlying protocols that make the Internet what it is, starting with BGP. If you’ve been reading this blog for a while, you know that core to that mission is monitoring BGP — the protocol that binds the Internet’s core routers together and forms the basis upon which layer upon layer of other protocols, like DNS, do what they do.

What we don’t often talk about, however, is the number of measurements that we make on a daily basis. Here I’m talking about performance (latency) measurements. The round number currently is 5 billion. Right, that’s 5B with a capital B. So we’ve got tremendous growth pressure on that front as well. We can’t treat each latency measurement like a log line in a web server and process it with the sledgehammer approaches typically talked about in the big data space.
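For scale: 5 billion measurements per day averages out to roughly 58K per second (5e9 / 86,400 ≈ 57,870), before accounting for peaks. One hypothetical way to survive that without retaining raw lines is to aggregate in flight, for example a fixed-size latency histogram per (path, minute) bucket, which yields percentiles with constant memory per bucket. A sketch, with invented names:

```python
BUCKET_BOUNDS_MS = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]

def bucket_index(latency_ms: float) -> int:
    """Map a latency to its histogram bucket."""
    for i, bound in enumerate(BUCKET_BOUNDS_MS):
        if latency_ms <= bound:
            return i
    return len(BUCKET_BOUNDS_MS)  # overflow bucket

histograms: dict[tuple, list[int]] = {}

def observe(path: str, minute: int, latency_ms: float) -> None:
    """Constant memory per (path, minute): we keep counts, never raw events."""
    key = (path, minute)
    hist = histograms.setdefault(key, [0] * (len(BUCKET_BOUNDS_MS) + 1))
    hist[bucket_index(latency_ms)] += 1
```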

The Internet Performance teams at Dyn have been honing these hybrid real-time processing techniques for over a decade. Since the acquisition of Renesys, we’ve combined forces and pushed forward into new areas over the past twelve months. Those techniques include real-time protocol analysis and real-time streaming APIs that will form the basis of our next set of alerting and measurement products.

We’ve been working hard to retrofit aspects of these approaches into our global DNS network with the goal of shortening the time from action to insight. DNS operates at Layer 7, as does BGP. Layer 7 is a crowded space. It’s where the Internet meets the web and the Internet of Things. Over the past year we’ve been expanding our appetite. We’re hungry. Data fills our bellies. Performance data. The next frontier is the web as most of our customers and their users and customers know it. What if we had data about the performance of every major network path on the Internet? Cloud providers? CDNs? API endpoints? Ad networks?

Over the coming weeks and months I’ll be taking us on an ever deeper tour through the data that we ingest, process, and make actionable for ourselves and our customers. I like to joke at Dyn that I’m an equal opportunity protocol abuser. Give me some time and I’ll show you how we’ve built infrastructure that can process 100K latency measurements per second, provide real-time views into what’s going on anywhere on the Internet, and allow you to consume a tiny slice of that telemetry stream to mash up into your (our customers’) systems to better understand who, what, where, when, and how Internet performance is impacting your business and users.
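What might “consuming a tiny slice” look like? A hypothetical sketch: subscribe to the firehose and keep only the measurements that touch your own assets. The stream source, field names, and helper functions here are assumptions for illustration, not a published Dyn API.

```python
MY_ASSETS = {"cdn.example.com", "api.example.com"}

def my_slice(firehose):
    """Yield only the measurements that involve assets we care about."""
    for m in firehose:
        if m["target"] in MY_ASSETS:
            yield m

# Feed the slice into your own dashboards or alerting, e.g.:
# for m in my_slice(connect_to_stream()):  # connect_to_stream() is hypothetical
#     update_dashboard(m["target"], m["latency_ms"])
```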

The other contributors on this blog will dive into what the data means and where we are identifying problems (and opportunities). I’ve got opinions on what the data means, but I’m not a data scientist. I’ll be focusing on how we do it, along with the who, when, and where.

Next up: RUM. Real-user measurements. RUM as a tool in the web performance space is well established. There are conferences and webperf groups that almost exclusively focus on helping you understand your page performance. We’re about to change what you can do with RUM techniques and approaches.

Phil Stanhope is a Fellow at Dyn, working with the Office of the CTO since 2013. Phil's focus varies across engineering, infrastructure, architecture, analytics, operations and emerging technology strategy and planning. Phil is a known thought leader in the industry, having served on numerous advisory boards and technology adoption programs. Connect with Phil on LinkedIn, or follow @Dyn on Twitter.
