IP addresses have no built-in association with a corresponding physical location, and the resources that they are associated with can change over time — not only can an IP address move between different physical or virtual devices, but it can also move between geographies.
As such, over the last two decades, the practice known as “IP Geolocation” has emerged as a means of associating IP addresses with physical locations, such as continent, country, state/province (where relevant), and city, as well as extended information including (but not limited to) ZIP/postal code, area code, time zone, connection type, etc.
However, the accuracy of IP geolocation information can vary wildly across service providers. In this post, we look at how IP geolocation information is used, why accuracy varies across providers, what impact inaccurate information can have, how Oracle Dyn improves the accuracy of IP geolocation data, and why that improved accuracy is so important.
Note that, unless otherwise stated, “IP address[es]” and “IP geolocation” refer to IPv4 throughout this post. The topics covered are largely relevant to IPv6 as well, but geolocation for IPv6 address space has its own unique challenges that are not covered here.
What Is IP Geolocation Used For?
For those doing business on the Internet (like e-commerce sites and OTT video providers), and for those whose business is the Internet (like Oracle Dyn), IP geolocation information is used for a number of different purposes.
On the Marketing side, this information can be used to help customize messaging and advertisements, and to provide clues for language localization. And because metrics are important to every marketer, IP geolocation information enables analytics tools to provide more insight into where users and customers are coming from and how that influences their online behavior, allowing marketers to take more targeted action.
Figure 1: Customized discount based on geolocation-related information
IP geolocation information has value on a number of fronts for retailers. For those shoppers that still need to visit a brick & mortar location, IP geolocation information can help an e-commerce site suggest “local” stores. To improve the security of online shopping, the information can be used to help prevent fraudulent transactions. IP geolocation information can be used by retailers to customize offers and pricing promotions based on where the user is, or estimate the tax and shipping costs that will likely be incurred.
Although media consumers generally aren’t fans of the restrictions, some media content may not be available globally, due to licensing agreements. Media services can use IP geolocation information to restrict access as appropriate, based on the user’s location. Conversely, media providers can use the information to customize a user’s experience, such as suggesting local radio stations to stream, local sports highlights to watch, or local news stories to read.
Figure 2: Blocking access to media content based on IP geolocation information
From an infrastructure perspective, enterprises can use IP geolocation information to learn where site/application users are located to make more informed decisions about where to deploy server infrastructure, or what service provider to select in order to optimize end user experiences. Once infrastructure is deployed, the information can drive traffic management decisions based on end user location.
IP geolocation information also has geopolitical applications. Academic researchers have examined traceroute paths to understand which ones cross international boundaries, classifying them as expected (based on routing configurations) or suspicious/malicious, and to identify cases in which traffic detours through so-called surveillance states. In addition, national initiatives intended to keep sensitive data within a country’s borders make it important to understand where servers and storage systems reside, as well as which countries the paths to those systems pass through.
Impact Of Inaccurate Data
Beyond the problem of default locations landing on an actual residence (discussed later in this post), inaccurate IP geolocation data can create problems for both consumer-facing and infrastructure-related use cases.
- Inability to access content: Incorrect IP geolocation information may incorrectly deny a subscriber access to geofenced content, such as the ability to watch a streamed movie otherwise available to viewers in their actual, physical location.
- Poor user experience: Incorrect assumptions made based on faulty IP geolocation information could lead to things like poor customization of offerings, false positives on fraud detection, or incorrect default currency or language choices. Such issues could lead to a user having to do extra work to correct or address them, potentially driving users to competing sites that offer a more streamlined experience.
- Poorly targeted marketing efforts: Offering locally relevant content and showing locally relevant advertisements are often key components of a targeted marketing effort. However, if these offers are customized for the wrong geography, then the money spent on those marketing efforts is wasted and the opportunity for conversion is lost.
- Higher latency: Some organizations may use end user IP geolocation information to make decisions about where to host server infrastructure or what service providers (CDN, cloud, DNS, etc.) to use. However, incorrect information could lead to sub-optimal decisions, resulting in poor performance.
- Erroneous attribution: For services that scan the Internet, or track Internet ‘events’, correctly attributing the location of discoveries is critical, both for credibility of the service as well as any potential action taken as a result of the findings. Incorrect IP geolocation information can result in misleading observations or assumptions about the state of Internet infrastructure in a given place.
How Is IP Geolocation Data Compiled?
Building IP geolocation databases can be done programmatically, but the process should also incorporate input and tuning from (human) experts. While each commercial IP geolocation solution provider claims to have its own proprietary methods for gathering or improving the accuracy of data, there are several common techniques that can be used to compile and update IP geolocation databases, including:
- Mining registry data: Arguably the foundational data set for any IP geolocation provider. When IP address space or autonomous systems are assigned or allocated to an organization, the regional and/or local Internet registries (RIR/LIR) will have associated registration records including the address of the registrant.
- Interpreting airport/city codes: Nearly every Internet infrastructure device has an associated hostname, with naming conventions (such as including an airport code or city name) often providing clues as to where the device is located.
Figure 3: City codes in hops 9-15 and 17-19 indicate the cities where those routers are located
- Browser Geolocation API: The World Wide Web Consortium (W3C) has developed a specification defining an API to provide scripted access to geographical location information associated with the user’s device. Web sites or applications using this API ask the user for permission to share their location, enabling the association of the user’s IP address with the detected location.
- Correlations: In some cases, IP geolocation solution providers work with customers and partners to correlate user-supplied physical addresses (such as billing or shipping addresses) with IP addresses. This information gives a strong (though not definitive) clue as to where the user is, and where the associated IP address may be as well.
- User contributed corrections: IP geolocation solution providers will often set up demonstration sites, showing the user the geolocation information associated with the user’s IP address. User contributed corrections to erroneous information can be incorporated into the provider’s analysis and validation processes, with updates ultimately making their way into the database.
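As a rough illustration of the airport/city code technique listed above, the sketch below pulls location hints out of a router hostname. The lookup table, the `location_hints` function, and the sample hostname are all hypothetical and exist only for this example; a production system would need a far larger table and provider-specific rules.

```python
from __future__ import annotations

import re

# Hypothetical, deliberately tiny lookup table. A real table would hold
# thousands of codes plus provider-specific overrides.
AIRPORT_CODES = {
    "lhr": ("London", "GB"),
    "fra": ("Frankfurt", "DE"),
    "nrt": ("Tokyo", "JP"),
    "ord": ("Chicago", "US"),
}


def location_hints(hostname: str) -> list[tuple[str, str]]:
    """Return (city, country) hints for airport codes found in a hostname."""
    hints = []
    # Split into DNS labels, then into alphabetic runs, so "lhr01" yields "lhr".
    for label in hostname.lower().split("."):
        for token in re.findall(r"[a-z]+", label):
            if token in AIRPORT_CODES:
                hints.append(AIRPORT_CODES[token])
    return hints


print(location_hints("ae-1.r20.lhr01.example.net"))
```

A match here is only a hint. Short tokens can collide with strings that have nothing to do with geography, so results of this kind are best treated as one input among several rather than as a final answer.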
Why Is IP Geolocation Hard?
IP geolocation has been described as “part art, part science”. While the techniques described above can help providers build their databases, ensuring that the resulting data is as accurate as possible is a non-trivial challenge. A number of factors contribute to making it hard to arrive at accurate results:
- Lack of ground truth: Unfortunately, there is no definitive, comprehensive, publicly available, and most importantly, trustable data set that ties IP addresses to physical locations.
- Unvalidated registry data: Some IP geolocation solution providers take the data that they ingest from registries at face value – assuming that the entries are correct, and not validating the information contained within them. However, the information may be wrong, whether intentionally or accidentally. In addition, sometimes an IP address block registration is associated with a certain location (such as the registrant’s headquarters), but the associated IP addresses are used in other locations around the world.
- Inconsistent naming conventions: In large metropolitan areas, airport codes are often named after the airport itself instead of the city or area it serves – using these codes provides some, but generally not enough, specificity when associated with IP addresses. In addition, there’s little consistency across providers in how cities are referenced within hostnames. Interpretation of city identifiers needs to be reviewed on a provider-by-provider basis before incorporating the derived information into an IP geolocation database.
- Challenges with latency measurements: Highly accurate geolocation via measurement tools like ping, traceroute, and wget requires significant proximity – that is, a widely distributed set of agents from which to conduct these measurements. Additionally, satellite Internet provider IP addresses are extremely challenging to accurately geolocate because the satellites cover such extensive geographic areas.
- Anycast: When an IP address prefix is simultaneously announced from multiple locations, it is said to be “anycast”. CDNs, DDoS mitigation services, and DNS providers commonly use this technique so that routing delivers traffic to the announcing location that is closest to the end user in network terms. However, because the prefix appears to be in multiple locations at once (depending on the vantage point), it becomes very hard to geolocate accurately.
- VPNs, proxies, and relays: There are a number of reasons that end users may choose to pass traffic through a VPN, proxy server, or relay network, including security and anonymity, as well as attempting to circumvent geographic restrictions on access to media content. In these cases, the ‘true’ IP address of the end user is generally unavailable for geolocation, since the user’s connection appears to come from the IP address of the VPN/proxy/relay endpoint.
- Mobile: Ground truth for IP addresses associated with mobile connections is arguably easy to obtain through the device’s GPS. However, because the associated device’s location may change frequently over a relatively short period of time, it is hard to definitively geolocate the associated IP address to a specific city. In addition, many carriers use centralized gateways through which subscriber traffic reaches the public Internet, resulting in the end user geolocating to one of the cities where the gateways are hosted.
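To make the latency and anycast points above more concrete, here is a minimal sketch of the reasoning involved. It assumes that signals in fiber travel at roughly two thirds of the speed of light (about 200 km per millisecond), so a round-trip time puts an upper bound on how far a target can be from a probe. If two probes are farther apart than their two bounds combined, no single location can explain both measurements, which suggests an anycast address. The function names and sample measurements are made up for illustration, and this is not a description of Oracle Dyn's methods.

```python
import math

# Assumption: light in optical fiber covers roughly 200 km per millisecond.
FIBER_KM_PER_MS = 200.0
EARTH_RADIUS_KM = 6371.0


def max_distance_km(rtt_ms):
    """Upper bound on probe-to-target distance implied by a round-trip time."""
    return (rtt_ms / 2.0) * FIBER_KM_PER_MS


def great_circle_km(a, b):
    """Haversine distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def looks_anycast(measurements):
    """measurements: list of ((lat, lon), rtt_ms) pairs for one target IP.

    Returns True when some pair of probes is farther apart than the sum of
    their distance bounds, so no single location satisfies both.
    """
    for i, (p1, rtt1) in enumerate(measurements):
        for p2, rtt2 in measurements[i + 1:]:
            if great_circle_km(p1, p2) > max_distance_km(rtt1) + max_distance_km(rtt2):
                return True
    return False


new_york, london = (40.71, -74.01), (51.51, -0.13)
print(looks_anycast([(new_york, 4.0), (london, 4.0)]))   # True
print(looks_anycast([(new_york, 4.0), (london, 70.0)]))  # False
```

The bound is loose in practice, since real paths are rarely straight and queuing adds delay. That is why a widely distributed set of probes matters: the closer a probe is to the target, the tighter the constraint it provides.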
Issues With Existing IP Geolocation Solutions
The providers of commercial IP geolocation solutions presumably endeavor to offer their customers the most accurate data sets that they can. However, claimed accuracy rates vary as the geographic location and accuracy radius get more specific (country → state/province → city), as do the accuracy guarantees (SLAs). In addition, these commercial solutions do not include geolocation information for every IP address, so it is important to understand how they handle the assignment of a “default” location for unknown IP addresses, and just how large the pool of unknown IP addresses is.
Accuracy Claims & Testing
Commercial IP geolocation providers will often tout their accuracy levels in marketing documentation, ostensibly as a point of differentiation against the competition. While specific accuracy claims vary, most are above 99% at a country level, 90% or above at a state level, and above 85% at a city level. However, for our use cases, Oracle Dyn’s testing is often at variance with such claims and to such a degree that we have devoted significant effort to correcting errors.
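One simple way to test such claims is to compare a provider's answers against a sample of addresses whose locations are independently known. The sketch below assumes both data sets are plain dictionaries keyed by IP address; the field names, the `accuracy_by_level` function, and the sample records (which use documentation address ranges) are hypothetical and only meant to show the shape of the calculation.

```python
def accuracy_by_level(ground_truth, provider, levels=("country", "state", "city")):
    """Share of ground-truth IPs the provider answers correctly at each level.

    Both arguments map an IP string to a dict such as
    {"country": "US", "state": "NH", "city": "Manchester"}.
    A level only counts as correct when every coarser level also matches.
    IPs missing from the provider data count as wrong at every level.
    """
    correct = {level: 0 for level in levels}
    for ip, truth in ground_truth.items():
        answer = provider.get(ip, {})
        for level in levels:
            if answer.get(level) != truth.get(level):
                break  # finer levels cannot be right if this one is wrong
            correct[level] += 1
    total = len(ground_truth)
    return {level: (correct[level] / total if total else 0.0) for level in levels}


truth = {
    "192.0.2.1": {"country": "US", "state": "NH", "city": "Manchester"},
    "192.0.2.2": {"country": "US", "state": "CA", "city": "San Jose"},
    "198.51.100.7": {"country": "DE", "state": "HE", "city": "Frankfurt"},
    "203.0.113.9": {"country": "JP", "state": "13", "city": "Tokyo"},
}
answers = {
    "192.0.2.1": {"country": "US", "state": "NH", "city": "Manchester"},
    "192.0.2.2": {"country": "US", "state": "CA", "city": "San Francisco"},
    "198.51.100.7": {"country": "NL", "state": "NH", "city": "Amsterdam"},
}
print(accuracy_by_level(truth, answers))
```

With this sample, the result is 50% at the country and state levels and 25% at the city level. The quality of such a test depends entirely on how representative and trustworthy the ground-truth sample is, which is itself a hard problem.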
Coverage & Default Locations
While the accuracy of the information provided by an IP geolocation solution is important, it is also important to look at just how much of the IP address space those accuracy claims apply to. While there are nominal differences in accuracy claims, there appears to be little meaningful difference in the claims made by the providers about the breadth of coverage of their offerings, as the leading commercial providers claim coverage of over 99.99% of routable IP addresses.
Many solution providers include latitude/longitude coordinates as part of a response, but these coordinates will generally map to a default location within the relevant geography, often its geographic center. As a rule, these coordinates should not be relied upon to identify the exact location of the user associated with the IP address. However, that rule isn’t universally understood by consumers of IP geolocation information, leading to significant issues when the coordinates align with the location of an actual residence or business. As an example, when one leading IP geolocation vendor knew that an IP address was located somewhere in the United States but had no associated state or city information, it returned the coordinates of a location near the geographic center of the country. This default applied to over 600 million IP addresses, and the coordinates happened to fall in the front yard of a private home in Potwin, Kansas, ultimately leading to harassment of the residents.
Figure 4: Default lat/long information may have unintended consequences
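Consumers of IP geolocation data can guard against this kind of misreading by checking how precise a record claims to be before treating its coordinates as a point on a map. The sketch below assumes a response that carries an accuracy radius alongside the coordinates. The field names, the 50 km threshold, and the sample records are hypothetical and do not reflect any particular provider's schema.

```python
from typing import Optional, Tuple

# Assumed threshold: anything coarser than 50 km is treated as "area only".
MAX_USABLE_RADIUS_KM = 50.0


def usable_coordinates(record: dict) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) only when the record is precise enough to plot.

    `record` uses hypothetical field names: "latitude", "longitude",
    "accuracy_radius_km", and "city".
    """
    radius = record.get("accuracy_radius_km")
    if record.get("city") is None or radius is None or radius > MAX_USABLE_RADIUS_KM:
        return None  # country- or state-level fallback: not a real point
    return (record["latitude"], record["longitude"])


country_default = {"country": "US", "city": None,
                   "latitude": 38.0, "longitude": -97.0,
                   "accuracy_radius_km": 1000}
city_level = {"country": "US", "city": "Manchester",
              "latitude": 42.99, "longitude": -71.46,
              "accuracy_radius_km": 20}

print(usable_coordinates(country_default))  # None
print(usable_coordinates(city_level))       # (42.99, -71.46)
```

Returning nothing for a coarse record forces the calling code to handle the "country only" case explicitly instead of silently plotting a default location.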
How Oracle Dyn Uses IP Geolocation
By steering users to the geographically closest infrastructure, an organization can improve user experiences, mitigate risks and boost customer satisfaction. Oracle Dyn’s Traffic Director solution can use IP geolocation information to enable customers to make intelligent traffic management decisions based on geography. Using IP geolocation, Traffic Director customers can group geographic regions (at a country or state/province level) into logical segments and specify how DNS requests from each segment are answered.
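The general idea of grouping regions into segments and answering DNS queries per segment can be sketched as follows. This is not Traffic Director's configuration format or API; the segment names, the override table, the `answer_for` function, and the addresses (taken from documentation ranges) are all assumptions made for illustration.

```python
# Hypothetical segments: region codes grouped under a name, each tied to
# the answer a DNS query from that segment should receive.
SEGMENTS = {
    "north-america": {"regions": {"US", "CA", "MX"}, "answer": "198.51.100.10"},
    "europe": {"regions": {"GB", "DE", "FR"}, "answer": "203.0.113.20"},
}
# State/province overrides are keyed as "COUNTRY-SUBDIVISION".
OVERRIDES = {"US-CA": "198.51.100.30"}
DEFAULT_ANSWER = "192.0.2.1"


def answer_for(country, subdivision=None):
    """Pick a DNS answer from the geolocated country and optional state/province."""
    if subdivision is not None:
        override = OVERRIDES.get(f"{country}-{subdivision}")
        if override:
            return override  # most specific match wins
    for segment in SEGMENTS.values():
        if country in segment["regions"]:
            return segment["answer"]
    return DEFAULT_ANSWER  # unknown or unlisted location


print(answer_for("DE"))        # 203.0.113.20
print(answer_for("US", "CA"))  # 198.51.100.30
print(answer_for("BR"))        # 192.0.2.1
```

Because every decision starts from the geolocated country or state, an incorrect geolocation sends the user to the wrong segment's infrastructure, which is why the accuracy of the underlying data matters so much for traffic steering.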
Over the last 15+ years, the Oracle Dyn Internet Intelligence team has been monitoring and measuring Internet connectivity and performance, collecting and archiving massive amounts of BGP (routing) and traceroute data. Coupled with IP geolocation information, this data powers a number of tools that provide a geographic perspective on network interconnectivity, routing, transit, infrastructure assets, performance and latency changes, and other network events. We are also bringing this insight to the Oracle Cloud Infrastructure console to provide predictive and real-time performance analysis for Oracle customers wishing to understand and optimize their service delivery strategy for key global markets.
Figure 5: IP geolocation information plays a critical role for Internet Intelligence research
Additionally, the insight provided by these tools informs research done by the Internet Intelligence team, with findings published on Twitter under the @InternetIntel handle, on this blog, and presented at numerous industry conferences. While the network providers involved in these events provide some clue as to where they are occurring, IP geolocation data allows us to truly understand their scope and impact.
Oracle Dyn Enhancements
As outlined in the “How Oracle Dyn Uses IP Geolocation” section above, we make extensive use of IP geolocation information across our traffic steering service, our Internet data solutions, and the research that informs regular blog entries and Twitter posts about observed ‘events’ on the Internet. As such, it is critical that we use the most accurate IP geolocation data possible, but we found numerous errors and inaccuracies in third-party commercial data sets, especially relating to “infrastructure” IP address space (used by cloud & hosting providers, backbone routers, etc.). This led us to develop a set of “enhancements” to these commercial data sets – corrections to observed and measured inaccuracies and errors, as well as the identification of anycasted IP address space. Maintaining these enhancements is an ongoing effort, with updates being made on a daily basis.
Many of the advanced techniques that Oracle Dyn has implemented leverage our deep understanding of Internet routing, built on an archive of BGP records that goes back over 15 years, as well as our vast corpus of historical and current traceroute data, which provides insight around network interconnection and latency and how both change over time. Unfortunately, due to their proprietary nature, a detailed technical discussion of these techniques is outside the scope of this post.
However, several additional, better-known methods are also employed, including:
- Billing address correlation: Customers of Oracle Dyn’s remote access dynamic DNS service provide a billing address upon purchase. Since the service involves knowing the customer’s IP address, we can correlate it with the customer’s billing address, assuming no evidence to the contrary.
- DNS name natural language processing: As discussed above, hostnames will often include geographic clues, such as an airport code or city name/abbreviation. However, the lack of specific industry conventions has led us to develop provider-centric rules that are used as a guide for processing. Results are confirmed with latency measurements.
- Router de-aliasing: Confirming that multiple unique IP addresses correspond to interfaces on the same router is a challenging problem, but there are several published techniques for doing so. Using these techniques, we can assemble sets of IP addresses that are believed to represent the same physical piece of equipment, which would all share a common geolocation.
- Shared infrastructure: The routing and traceroute data we collect provides us with a unique view of failure and restoration patterns across the Internet, revealing shared infrastructure. By analyzing this shared infrastructure, we can identify points of common geolocation.
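The bookkeeping behind router de-aliasing can be illustrated with a small sketch. It assumes that some alias resolution technique has already produced pairs of addresses judged to belong to the same router. The code merges those pairs into sets and then copies a known location to every address in a set. The function names and sample data are hypothetical, and the alias detection itself, which is the hard part, is not shown.

```python
def alias_sets(alias_pairs):
    """Group IPs into sets from pairwise 'same router' findings (union-find)."""
    parent = {}

    def find(ip):
        parent.setdefault(ip, ip)
        while parent[ip] != ip:
            parent[ip] = parent[parent[ip]]  # path halving keeps trees shallow
            ip = parent[ip]
        return ip

    for a, b in alias_pairs:
        parent[find(a)] = find(b)  # merge the two sets

    groups = {}
    for ip in list(parent):
        groups.setdefault(find(ip), set()).add(ip)
    return list(groups.values())


def propagate_locations(groups, known):
    """Copy a known location to every IP in a set, if the set agrees on one."""
    result = dict(known)
    for group in groups:
        locations = {known[ip] for ip in group if ip in known}
        if len(locations) == 1:  # skip sets with no location or a conflict
            location = locations.pop()
            for ip in group:
                result[ip] = location
    return result


pairs = [("192.0.2.1", "192.0.2.65"), ("192.0.2.65", "192.0.2.129"),
         ("198.51.100.1", "198.51.100.5")]
groups = alias_sets(pairs)
print(propagate_locations(groups, {"192.0.2.65": "Frankfurt, DE"}))
```

Sets whose members carry conflicting locations are left untouched here, since a conflict more likely signals a bad alias inference or a bad geolocation record than a router in two places.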
The enhancements made by Oracle Dyn to the commercial IP geolocation data sets are extremely important for several reasons:
- They allow us to provide more accurate geographic traffic steering.
- The enhanced data provides more accurate insight into the location of infrastructure components, as well as routed IP address prefixes, translating into more reliable insights from our Internet Intelligence tools, which benefits the customers of those tools, as well as the research based on these tools done by our Internet Intelligence team.
Summary
IP geolocation is a means of associating IP addresses with physical locations. There are a number of providers that offer IP geolocation tools and services, which generally include (at a minimum) country, state/province, and city-level data, with additional metadata often available in premium offerings. However, accuracy of commercial IP geolocation data varies widely, and inaccurate data can potentially cause significant problems.
While there are a number of common techniques frequently used to compile IP geolocation data sets, there are also several factors that make it challenging to ensure accuracy of the data. Because Oracle Dyn’s traffic steering solution, as well as our Internet Intelligence tools and research, rely on IP geolocation data, we use advanced techniques to enhance commercial data sets, improving the accuracy of such data and enabling a much higher degree of confidence in the results.