At Dyn, we provide our customers with tools for understanding the performance of the Internet, monitoring their Internet assets and planning their delivery strategy. A critical step in this process is determining where all of the relevant pieces are located, from the servers themselves to their users and everything in between. If your services are provisioned via a cloud provider, you might not even know where they are physically located.
But the Internet is in no literal sense a “cloud”. Understanding the whereabouts of your assets, the available service options and their expected performance is absolutely key to devising a coherent and successful Internet strategy, and then continuing to deliver “the goods” in a rapidly changing environment. In this blog, we’ll look at the problem of placing virtual (Internet) resources on a physical map and why readily available commercial solutions do not meet our needs. As we’ll see, this is a surprisingly difficult undertaking.
Every Internet conversation or transaction involves a pair of IP addresses (either IPv4 or IPv6). And those IP addresses are assigned to pieces of equipment that are located somewhere on planet Earth. IP geolocation refers to the process of determining the actual physical location of an IP address of interest, which can then be used to display content in an appropriate language, serve up ads of local interest or guard against fraud. Despite the numerous applications for accurate geolocation and the fact that this problem has been studied for years, the accuracy of the data provided by commercial suppliers can vary wildly and sometimes ridiculously.
Don’t believe me? Try this site: http://www.iplocation.net/, which allows you to compare a number of different geolocation services (although by no means all of them) for a specific IP address. As I write this, I’m sitting in Hanover, New Hampshire, USA. The providers queried by this site place me in Hanover (one provider got it right!), Lebanon, New Hampshire (10km from my actual location), Malden, Massachusetts (200km), Hartford, Connecticut (245km) or just the US with no city at all.
Before you say, “Who ever heard of Hanover, NH anyway?”, we have studied this problem a bit more comprehensively over the years. The concept of “accuracy” really depends on the level of specificity required. If you require only planetary-level accuracy, then all providers will be 100% correct, given that all IP addresses currently in use reside on Earth. If you want an exact latitude and longitude down to 5 decimal places, well then, probably every provider will be 100% incorrect. Generally, we find fairly close agreement at the country level between the major providers, but things start to break down quickly after that. The following table shows how closely four different providers agree at a city level for all routed IP addresses. We also included registration data, even though that is known to be highly inaccurate (it is often out of date or reflects the location of a corporate headquarters and not where the IP addresses are actually used). Notice that the closest agreement between any two sources is about 50%. If your application requires accurate city-level geolocation, you aren’t going to get it here.
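To make "agreement at a city level" concrete, here is a minimal sketch of how a pairwise comparison between sources could be computed. It is not the code behind our numbers: the provider names, the IP addresses (taken from documentation ranges) and the city answers are all made up for illustration, and a real comparison would also need to normalize city names before comparing them.

```python
from itertools import combinations

def pairwise_agreement(answers):
    """Compare every pair of sources on the IPs they both cover.

    answers maps a source name to {ip: city}. Returns a dict mapping each
    (source_a, source_b) pair to the fraction of shared IPs placed in the
    same city, or None when the two sources share no IPs.
    """
    result = {}
    for a, b in combinations(sorted(answers), 2):
        shared = answers[a].keys() & answers[b].keys()
        if not shared:
            result[(a, b)] = None
            continue
        same = sum(1 for ip in shared if answers[a][ip] == answers[b][ip])
        result[(a, b)] = same / len(shared)
    return result

# Hypothetical sample data: none of these names or answers come from a real provider.
sample = {
    "provider_a": {"192.0.2.1": "Hanover", "192.0.2.2": "Boston", "198.51.100.7": "Paris"},
    "provider_b": {"192.0.2.1": "Lebanon", "192.0.2.2": "Boston", "198.51.100.7": "Paris"},
    "registry": {"192.0.2.1": "Hartford", "192.0.2.2": "Boston"},
}

if __name__ == "__main__":
    for (a, b), share in pairwise_agreement(sample).items():
        label = "no shared IPs" if share is None else f"{share:.0%}"
        print(f"{a} vs {b}: {label}")
```

Note that agreement is not the same thing as accuracy: two sources can agree with each other and both be wrong, which is why ground truth matters so much.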
So why is this so hard? With all of the economic incentives, you’d think this would be a solved problem. The issue is the lack of ground truth on the Internet. There are hints, such as registration data (human entered and error prone), service providers in use (very rough, e.g., Russian providers tend to transit IPs in Russia) and DNS names (with perhaps embedded geographic clues). However, the one definitive measure is latency. If the round-trip time from you to your favorite content is only 1 millisecond (ms), then you know it has to be hosted somewhere “nearby”. But the speed of light is pretty fast: light covers roughly 200km per millisecond in fiber, so a 1ms round trip still allows the content to be up to 100km away! So for us to use latencies in our geolocation algorithms, we need a lot of locations from which to measure and a lot of measurements. We make use of all the aforementioned hints (and many more), along with billions of daily latency measurements from hundreds of locations to inform our machine-learning-based geolocation algorithms. This lets us more accurately geolocate the things our customers care about, namely, the actual plumbing of the Internet. Plus, the Internet is a very dynamic place: things move around all the time. So IP geolocation needs constant revalidation.
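The latency arithmetic above is easy to check for yourself. The sketch below assumes that light in fiber travels at about two-thirds of its speed in a vacuum (a common rule of thumb, not a measured value) and ignores queuing, routing detours and processing delays, so the result is only an upper bound on distance. The function and constant names are our own for this example.

```python
# Speed of light in a vacuum, in kilometers per millisecond.
C_VACUUM_KM_PER_MS = 299.792458

# Assumption: signals in fiber propagate at roughly 2/3 of c.
FIBER_VELOCITY_FACTOR = 2.0 / 3.0

def max_distance_km(rtt_ms: float) -> float:
    """Upper bound on the one-way distance implied by a round-trip time."""
    if rtt_ms < 0:
        raise ValueError("rtt_ms must be non-negative")
    one_way_ms = rtt_ms / 2.0  # the signal has to go there and come back
    return one_way_ms * C_VACUUM_KM_PER_MS * FIBER_VELOCITY_FACTOR

if __name__ == "__main__":
    for rtt in (1, 5, 20):
        print(f"RTT {rtt} ms: target is within about {max_distance_km(rtt):.0f} km")
```

Each measurement only tells you that the target lies somewhere inside a circle of that radius around the vantage point, which is why so many vantage points and measurements are needed to narrow things down.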
Does this really matter? You bet! Here is just one example. While no one might be surfing the web from recursive DNS servers (and hence seeing ads), we do get billions of queries per day from them and hence knowing where they are in the world allows us to provide better answers on behalf of our customers: answers that are geographically closer (and hence faster) for the customers of our customers. Our Traffic Director product is just one of the many that make use of our advanced geolocation techniques. Here, we regularly monitor and update the geolocation of the popular recursives we observe so that our customers can make accurate geolocation steering decisions.
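As a toy illustration of why the recursive's location matters, consider the simplest possible steering rule: answer with whichever endpoint is geographically closest to where we believe the recursive is. This is a sketch for intuition only, not how Traffic Director works; the endpoint names and all coordinates are approximate values chosen for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest_endpoint(resolver_location, endpoints):
    """Return the name of the endpoint closest to the resolver's believed location."""
    return min(endpoints, key=lambda name: haversine_km(*resolver_location, *endpoints[name]))

# Hypothetical customer endpoints (approximate coordinates).
endpoints = {
    "us-east": (39.04, -77.49),  # near Ashburn, Virginia
    "eu-west": (53.35, -6.26),   # near Dublin, Ireland
}

actual_location = (43.70, -72.29)  # Hanover, New Hampshire
wrong_location = (48.86, 2.35)     # the same recursive misplaced in Paris

if __name__ == "__main__":
    print("correct geolocation ->", nearest_endpoint(actual_location, endpoints))
    print("wrong geolocation   ->", nearest_endpoint(wrong_location, endpoints))
```

With the wrong location, every user behind that recursive gets sent across the Atlantic for no good reason.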
The following video illustrates the potential impact of accurate geolocation for our DNS services. Here we look at some of the queries we answered for our customers over a short time interval via recursives that have been misgeolocated by one or more commercial services. The size of a circle represents the number of queries from recursives in that presumed location. All of the initial locations are incorrect and their degree of inaccuracy is illustrated by how red they are. As the video progresses, each red circle moves to one or more correct (green) locations on the map. The greener the circle in its final resting place, the more it moved and hence the more inaccurate the initial geolocation. One thing stands out immediately: commercial geolocation services contain a lot of egregious errors, ones that have the potential to misinform (and hence misdirect the provided answers of) millions of DNS queries. This was not an acceptable level of accuracy for us, and was one motivation for our geolocation work over the past 10 years.
So suppose you could have taken full advantage of all of these corrections over this time period. How much latency could you have saved? 148 years, as shown at the top of the video! Of course, this type of savings is unrealistic for any one customer (the graphic includes data for many of them) and the choices any one customer has vary, but the point remains. Accurate geolocation really matters and can have a huge impact on performance, not to mention coherent planning and provisioning.
Ok, so there are lots of geolocation errors when it comes to DNS and you’ve got that covered. Anything else wrong? Yes there is, lots of it. Cloud providers, CDNs, servers of any sort, routers, switches, all of the glue that holds the Internet together are frequently misgeolocated. Why? Because no one is surfing the Internet from a router in France to complain about getting results in Polish instead of French. And without those complaints and in the absence of the sophisticated analytics we use at Dyn, those mistakes are unlikely to be corrected.
As one final example, let’s consider Microsoft. A naive geolocation algorithm might put much of their IP space at their headquarters in Redmond, Washington, but that would be a mistake as Microsoft has data centers around the world. So let’s consider the portion of MS space that is amenable to our machine-learning geolocation algorithm discussed earlier. For these IP addresses, how does Dyn compare to two popular commercial sources?
The problem is not limited to Microsoft. We often find Internet assets of large corporations geolocated predominantly to corporate headquarters. Presumably that is because it’s where that company’s IP space is registered. That’s not a bad guess if you have nothing else to go on, but with direct measurement, you can often do much better.
Nothing I’ve written here should be taken as being dismissive of commercial geolocation providers. They provide a valuable service, one that is “good enough” for many applications. But when it comes to geolocating the actual infrastructure of the Internet, we’ve found that we need to apply literally millions of corrections to get an acceptable level of accuracy for our purposes: monitoring, controlling and optimizing our customers’ online infrastructure for an exceptional end-user experience. And it is this geolocation that forms a foundational layer of many of our products, such as Traffic Director (TD), IP Transit Intelligence (IPTI) and Dyn Internet Intelligence (DII). Accurate infrastructure geolocation is a key differentiator for Dyn and one important capability that came over from the Renesys acquisition.