Data-driven decision-making relies on a contextual understanding of how data is gathered and the type of analysis used to arrive at an outcome. The popularity of data-driven decision-making has increased the number of companies using statistics to support a preference or vendor selection. The Internet Performance Management (IPM) market hasn’t been spared, but, unlike other markets where institutions have codified a standard for qualification and quantification, such as the FDA’s nutrition labels, the Insurance Institute for Highway Safety, or the Coffee Quality Institute, the IPM market is still in the Wild West stage.
The labels “DNS Performance” and “query speed” are used by market participants, but while details of data collection methodology are included, a number of key details are overlooked. To compare this to the physical world: when staging a race we have a clock, which measures standard units of time as athletes/horses/cars, etc. traverse a set physical space. The goal is to control as many variables as possible to increase accuracy and repeatability. Once controls are defined and measurements instrumented, we stage contests to determine “the best.”
When it comes to studying the internet, we lack the ability to control a number of variables. Packets leave a network interface and may traverse any number of physical media, from fiber to copper to microwaves. As they travel, there might be sources of delay, such as serialization delay, queuing delay, etc., which aren’t controlled by the packet issuer. All of a sudden, after 350 meters of the 400-meter dash, the track becomes crushed stone, or in the 200-meter freestyle, the middle 50 meters of the pool is Jell-O. This is very much like the relationship between networks and the packets that need to traverse these networks. This is the internet, a network of other people’s networks.
The state of the network is the condition of the competition field where it is only safe to assume there is a lack of uniformity and guaranteed variability. What about the protocol itself? The DNS, from a protocol perspective, is equally fraught with complexity. The DNS operates on a non-uniform code base, is globally distributed, and has a number of different caching semantics which vary by implementation. This is the type of description that leads the reader to the same question as the author, “What is it I want to measure?”
There are a number of DNS performance comparison platforms publishing rankings and metrics, both free offerings, such as DNSPerf and SolveDNS, and commercial platforms, such as CloudHarmony and Catchpoint. Each of these platforms measures DNS performance in its own way, which produces different results. When you are consulting a third-party metric site or purchasing a commercial monitoring service, what insight are you looking to gain? If one were to crawl and scrape these sites and compare the rankings, things would only get more confusing for the buyer. They don’t measure a uniform set of companies and, even when their rankings overlap, the published response times differ. Some of the differences are multiples of the response time reported by the other service.
(Names of providers are removed as the focus is less the who and more the what)
|Name|Ranking from #1|Ranking from #2|Query Time Diff (ms)|
Both of these systems seem to agree that providers 1, 2, and 3 are all relatively good (top 5), but, for a champion of data-driven decisions, is that enough? Or does the discrepancy plant the seeds of doubt that make one question what is going on? What is being measured?
DNS performance sites/platforms publish metrics and measures, but it is up to the decision maker to perform due diligence and decide whether a metric captures performance that is relevant to them. Some measurement platforms are widely distributed in eyeball networks and have options for using recursive resolvers; others directly query authoritative DNS servers. Some use libraries such as dnspython or Net_DNS2 for PHP; others run the dig command or proprietary tools. Understanding where the measurements are run from, how the data is collected, and how the measurement platform itself is monitored is important. For example, there are time periods where response time increases for all of the DNS providers in a region. It is reasonable to assume that the root cause lies outside of the protocol and the DNS operators. It could be an issue with a hosting provider, a network upstream from the point of measurement, or a variety of other issues. The point is that the measurement platform should clarify why this behavior is manifesting in its data.
For each nameserver record, the same method is used, for example: dig -4 +norecurse +time=2 +tries=1 $Domain @<DNS Provider NS>, and timing details are collected from the same places. However, to assert the contextual value of an observation, there need to be controls in place to monitor for exogenous network influence or provider deployment. In other words, the data collection lacks network awareness.
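As a rough illustration of the kind of probe described above, here is a minimal Python sketch that wraps the quoted dig invocation and extracts the timing. It assumes dig is installed and that its output carries the usual `;; Query time: N msec` line; the function names `build_dig_command`, `parse_query_time_ms`, and `measure` are hypothetical and are not taken from any of the platforms discussed.

```python
import re
import subprocess
from typing import List, Optional

# dig reports timing on a line such as ";; Query time: 23 msec".
QUERY_TIME_RE = re.compile(r"^;; Query time: (\d+) msec", re.MULTILINE)


def build_dig_command(domain: str, nameserver: str) -> List[str]:
    """Build the dig invocation quoted in the text for one nameserver."""
    return ["dig", "-4", "+norecurse", "+time=2", "+tries=1",
            domain, f"@{nameserver}"]


def parse_query_time_ms(dig_output: str) -> Optional[int]:
    """Return the query time in ms, or None if dig printed no timing line."""
    match = QUERY_TIME_RE.search(dig_output)
    return int(match.group(1)) if match else None


def measure(domain: str, nameserver: str) -> Optional[int]:
    """Run dig once against one nameserver; None means no answer was timed."""
    try:
        result = subprocess.run(build_dig_command(domain, nameserver),
                                capture_output=True, text=True, timeout=5)
    except subprocess.TimeoutExpired:
        return None
    return parse_query_time_ms(result.stdout)
```

Note what this sketch does not capture: it records a number per nameserver but nothing about the path the packets took, which is exactly the missing network awareness.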
In a number of these platforms, we see a TCP/HTTP(S) mindset applied to the DNS. To combat distributed denial-of-service (DDoS) attacks or regional transit provider outages, most authoritative DNS providers leverage anycast. Anycast is a 5th level hedge wizard IP spell that facilitates the same IP being advertised from and available in many locations. For stateful protocols, like TCP, anycast is a three-dimensional chess game to manage due to network variability; for UDP, however, anycast simplifies things. The short conversation of a DNS request and the lack of state, paired with anycast, help a requestor find the closest or lowest cost transit responder. This means that the query may be answered by different nameservers over time.
This anycast “query goes to the closest endpoint” trick works wonders; an authoritative strategy leveraged by Dyn involves nameserver provider pairing. If you look at your current DNS configuration, you might notice (depending on your provider) 2, 4, or 8 NS records. The recursive layer doesn’t blindly select which of these nameserver records to use; it attempts to make an intelligent decision, so, in reality, the fastest responses should be the focus, or maybe, to be safe, the second fastest. By design, there are two high-performance NS records, one middle of the road, and one hedge. The thought is that 99 out of 100 times the more performant records will be chosen, but in the event of a network attack or provider issue, the 1 time in 100 is a safety net. This is where context is king: if the measurement platform observes a change in path alongside the change in response time, it can attribute the change to the network. Similarly, if a response time changes dramatically and there are no other variables in flux, it is easier to attribute the change to the service provider. These are additional elements of the importance of network awareness.
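The attribution logic described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular platform works: the `Observation` type, the representation of a path as a tuple of hop labels, and the 20 ms threshold are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Observation:
    response_ms: float
    # Hypothetical path representation, e.g. AS numbers or traceroute hops.
    path: Tuple[str, ...]


def attribute_change(before: Observation, after: Observation,
                     threshold_ms: float = 20.0) -> str:
    """Label a response-time change using the path as context."""
    delta = abs(after.response_ms - before.response_ms)
    if delta < threshold_ms:
        return "no significant change"
    if before.path != after.path:
        # Response time moved together with the route: suspect the network.
        return "network path change"
    # Same path, different timing: easier to attribute to the provider.
    return "likely provider-side"
```

Without the path recorded next to each timing, the last two cases are indistinguishable, and every slowdown ends up looking like the provider’s fault.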
A number of platforms average response times from each nameserver together; this means the nuance of individual nameserver performance is lumped into a single summary statistic. In some cases the platform maintains a min and max response time, but these aren’t broken out by provider and by nameserver. This reduction in granularity removes the ability to trend or understand the nuances of individual nameserver performance. When paired with network awareness, granular performance data helps clarify not only the DNS provider’s performance but also the implications for your current end users. If the source of measurement is an autonomous system in which you have a large user population and you’re measuring from the resolver which your users are going to query, you have a customer-centric measurement.
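To make the loss of granularity concrete, here is a small Python sketch comparing a pooled average with per-nameserver summaries. The response times are invented for illustration and simply mirror the “two fast, one middle, one hedge” layout described earlier; the function names are hypothetical.

```python
from statistics import mean

# Invented response times in milliseconds, keyed by nameserver.
samples = {
    "ns1": [12, 14, 13, 15],      # high performance
    "ns2": [11, 13, 12, 12],      # high performance
    "ns3": [45, 50, 48, 47],      # middle of the road
    "ns4": [120, 130, 125, 125],  # hedge
}


def pooled_mean(samples):
    """Average every observation together, as many platforms do."""
    return mean(v for values in samples.values() for v in values)


def per_nameserver_summary(samples):
    """Keep min, mean, and max separately for each nameserver."""
    return {ns: {"min": min(v), "mean": mean(v), "max": max(v)}
            for ns, v in samples.items()}


print(pooled_mean(samples))
for ns, stats in per_nameserver_summary(samples).items():
    print(ns, stats)
```

With these made-up numbers the pooled mean is 49.5 ms, a figure that describes none of the four nameservers and badly overstates what a recursive resolver favoring ns1 or ns2 would actually see.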
Aside from the free measurements/summary stats, there are Enterprise platforms and, as a wise DevOps Thought Leader once told me, “There is always more money in the Enterprise Banana Stand.” These providers operate with impunity because of the “appeal to wealth” or “quality heuristic,” and their engineered user experience exploits a neuro-linguistic hack. < waves hand > “These are the Internet measurements you are looking for.” These platforms implement measurement frameworks with different bounds than the free platforms. One of the interesting bounds to look at is the timeout threshold. The free platforms cite a timeout of 1 or 2 seconds, whereas some Enterprise platforms range from 4.5 to 10 seconds.
When free and enterprise services describe timeouts, there is a blend of vague, quantitative, and categorical terms. FAQs use terms such as “most DNS recursives”, “typical recursive DNS”, etc. The use of these abstractions is a signal that there are conflicting opinions about recursive resolver behavior. If there is one thing I’ve learned trying to climb on the shoulders of giants, it’s that the DNS recursive layer is best imagined as the post-apocalyptic landscape of Mad Max. Aside from the varying software versions, their age, and the degree to which they are RFC conforming, there is a suite of other issues. Non-standardized or optional negative caching semantics, changing views/implementations of round trip time banding, and recursive “pre-caching” can hinder the predictability of nameserver selection and caching behavior.
Let’s return to the notion that “the modern network is reliable.” Questions about network stability and planning for instability have led to @aphyr’s Jepsen test, which clarifies what happens to data stores when there is a split brain scenario. It turns out that the DNS has this same issue: when a transit provider connection drops, or flaps, or shows degraded performance, how does the recursive tier deal with this and how does this impact the end user’s experience? Do any of these measurements help the data-driven decision maker understand these scenarios?
Since it isn’t explicitly spelt out, another way to think about this is: how does the measurement platform deal with network instability, specifically monitoring for packet loss and the event of missing responses? Some platforms take the approach of: if no result is returned, throw the test away, or if the time exceeds a timeout threshold and no response has been seen, something has gone awry and no result is recorded. One of the Enterprise providers will send three packets per 1,500 ms and, in a sense, it attributes packet loss to the authoritative DNS provider’s inability to answer. Lack of response is recorded as a 1,500 ms response time. When you extract the data and look at the summary statistics, these 1,500 ms responses form a cluster of anomalies. To customers of authoritative providers, these anomalies come across as service degradation.
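The effect of recording a missing response as a 1,500 ms response is easy to see in a short Python sketch. The observations below are invented, `None` stands in for a missing response, and the `summarize` helper is hypothetical; it assumes at least one response was received.

```python
from statistics import mean, median

TIMEOUT_MS = 1500  # the ceiling recorded for a missing response

# Invented observations in ms; None marks a query that got no response.
observations = [20, 22, 21, 19, None, 23, 20, 21, None, 24]


def summarize(observations, count_loss_as_timeout):
    """Summarize timings, either folding loss in as 1,500 ms or not."""
    answered = [o for o in observations if o is not None]
    lost = len(observations) - len(answered)
    if count_loss_as_timeout:
        values = answered + [TIMEOUT_MS] * lost
    else:
        values = answered
    return {
        "mean_ms": mean(values),
        "median_ms": median(values),
        "loss_rate": lost / len(observations),
    }


print(summarize(observations, count_loss_as_timeout=True))
print(summarize(observations, count_loss_as_timeout=False))
```

With two lost packets out of ten, folding the loss in as 1,500 ms moves the mean from 21.25 ms to 317 ms while the median barely shifts. Reporting loss as its own rate, next to the timings of answered queries, keeps packet loss on the path from masquerading as a slow provider.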
The goal here isn’t to edge lord over DNS measurements; the goal is to try and provide some transparency into the complexity of taking Internet scale measurements. In the end it boils down to frequency, granularity of measurement, network awareness, and audience. The frequency of your measurement determines the number of events you might find. The granularity of your measurement determines the level of specificity at which you can make a conclusion. The measurement platform’s network awareness determines your ability to attribute observations to a specific condition, and your audience shapes what you are measuring and from where. This framework is how I would suggest a data-driven DNS decision maker assess these metrics and approach using data available on the internet.