Dyn prides itself on being fast, but how do we measure ourselves? How do we compare to everyone else? With all the vagaries of DNS measurement due to caching effects, congestion, and routing irregularities, is it even possible to devise a useful, believable metric, one that anyone could validate for themselves? Dyn Research decided to tackle this challenge, and this post explains our approach. We encourage our readers to suggest improvements and to try this methodology out for themselves.
Over the years Dyn has built a high-performing authoritative DNS network using strategic placement of sites and carefully engineered anycast to provide low-latency performance to recursive name servers all over the world. We use our Internet performance monitoring network of over 200 global “vantage points” to monitor DNS performance and our comprehensive view of Internet routing from over 700 BGP peering sessions to make necessary routing adjustments. Synthetic DNS monitoring and routing analysis are important tools for understanding performance. But since the ultimate goal is delivering a good user experience, it’s important to measure performance from the user’s perspective. (We have written about the importance of user-centric DNS performance testing in the past.)
User perception of DNS performance depends on how fast a user’s recursive name server can answer a DNS query generated by the user’s application. Measuring recursive server performance is therefore a good indicator of a user’s experience of DNS performance. And overall recursive server performance is highly dependent on the latency between the recursive server and the various authoritative servers that it queries. Many of the records at the top of the DNS name space (i.e., from the root zone and TLD zones) will be cached by a busy recursive server, so the performance of the authoritative servers for those zones doesn’t affect most queries the server handles. The factor with the largest influence on a domain’s performance is the latency between the recursive server and that zone’s authoritative servers.
DNS performance can be improved by using authoritative servers that perform well from the perspective of a wide variety of recursive servers. Many zones are hosted by third-party authoritative DNS providers. There are several Internet sites that measure the performance of large DNS providers on a regular basis, but all the ones we are aware of use synthetic transactions, i.e., they send queries from monitoring agents directly to a provider’s authoritative servers. The results are highly dependent on the location of the monitoring agents, which don’t always reflect real-world usage. We have therefore decided to investigate DNS performance from a real-user perspective, and this post describes our preliminary results.
Our performance methodology uses recursive servers to measure performance of authoritative servers. There are many “open” recursive servers on the Internet that accept recursive queries from any source. Our research uses a carefully selected set of nearly 2,100 open recursive servers for testing. To help ensure a server is being actively used by presumed real users, it must have an A (IPv4 address) resource record set (RRset) for google.com in its cache to be included (the presumption being that a significant portion of real users will be visiting a Google property). We endeavored to select servers in all kinds of networks and with good geographic dispersion. The servers are widely distributed in 45 countries as shown below:
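The cache-presence check used to select resolvers can be sketched with a query that has the RD (recursion desired) bit cleared, asking the server to answer only from its cache. Below is a stdlib-only sketch under that assumption; the fixed query ID and the helper names (`build_query`, `is_cached`) are illustrative, not part of our production tooling.

```python
import socket
import struct

def build_query(qname, qtype=1, rd=True):
    """Build a minimal DNS query packet (qtype 1 = A record).

    Clearing the RD bit asks the server to answer only from its cache,
    which is how we can test whether google.com is already cached
    (a proxy for the resolver serving real users)."""
    header = struct.pack(">HHHHHH",
                         0x1234,                    # query ID (fixed for this sketch)
                         0x0100 if rd else 0x0000,  # flags: RD bit only
                         1, 0, 0, 0)                # QDCOUNT=1, other counts 0
    question = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
    ) + b"\x00"
    question += struct.pack(">HH", qtype, 1)        # QTYPE, QCLASS=IN
    return header + question

def answer_count(response):
    """ANCOUNT field from a DNS response header (bytes 6-7)."""
    return struct.unpack(">H", response[6:8])[0]

def is_cached(server_ip, qname="google.com", timeout=2.0):
    """Return True if the open resolver answers the query from cache."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(build_query(qname, rd=False), (server_ip, 53))
        response, _ = sock.recvfrom(512)
        return answer_count(response) > 0
    except socket.timeout:
        return False
    finally:
        sock.close()
```

A resolver that returns a non-empty answer section to such a non-recursive query presumably has the record cached and is therefore serving real users.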
We are measuring the performance of Dyn’s authoritative servers as well as the performance of seven other large providers of authoritative DNS service.
A recursive server queries the authoritative servers of multiple zones as part of the normal iterative resolution process. We wanted to isolate and track the portion of the resolution process attributable to querying the providers in the list above. For each provider, we chose two zones known to be served solely by that provider. To measure the portion of the resolution time at a particular recursive server attributable to that provider, we send three successive queries for each domain to the recursive server from our measurement location. For example, if the domain were example.com, we would send:
- A query for example.com, type SOA.
- Another query for example.com, type SOA.
- A query for <random-string>.example.com, type A.
The first query attempts to ensure that the SOA record for the monitored zone is cached by the recursive server. The time for this query is not recorded. The second query is identical to the first and therefore the response is presumed to be cached at the recursive server, so the time to resolve this query measures the round-trip time from our measurement origin to the particular recursive server. (The SOA record should be cached if there is a single, well-behaved recursive server at the destination IP. We recognize that multiple servers behind unsophisticated per-packet load balancing or a non-standard server implementation could interfere with this caching assumption.) The domain name in the third query, with its random label, is guaranteed not to exist and therefore will not be in the recursive server’s cache. To resolve this name, the recursive server must query the authoritative server for the zone, which returns a Name Error or NXDOMAIN response (the DNS response indicating a domain name doesn’t exist). The time to resolve the third query represents the combined round-trip time from our measurement source to the recursive server and the round trip from the recursive server to the authoritative server. Since we know the time to the recursive server from the second query, simple subtraction yields the round-trip time from the recursive server to the provider’s authoritative server. This recursive-to-authoritative time value is the focus of our analysis.
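The three-query subtraction above can be sketched as follows. This is a simplified illustration, not our production code: it assumes an injected `resolve(name, rtype)` callable that sends one query to the recursive server under test and returns when the response arrives.

```python
import time
import uuid

def measure_provider_rtt(resolve, zone):
    """Estimate the recursive-to-authoritative RTT for `zone`,
    a domain known to be hosted solely by the provider being measured."""
    # Query 1: prime the recursive server's cache; timing discarded.
    resolve(zone, "SOA")

    # Query 2: should be answered from cache, so its elapsed time
    # approximates the RTT from us to the recursive server itself.
    t0 = time.monotonic()
    resolve(zone, "SOA")
    rtt_to_recursive = time.monotonic() - t0

    # Query 3: a random label guarantees a cache miss, forcing the
    # recursive server to query the provider's authoritative server
    # (which answers NXDOMAIN).
    random_name = "%s.%s" % (uuid.uuid4().hex, zone)
    t0 = time.monotonic()
    resolve(random_name, "A")
    rtt_total = time.monotonic() - t0

    # Subtraction isolates the recursive-to-authoritative leg.
    return rtt_total - rtt_to_recursive
```

In practice each elapsed time would come from the DNS transaction itself (send to response), but the arithmetic is the same.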
Twice per hour, we query each recursive server in the list of nearly 2,100. For each server, we query for a list of domains known to be hosted by the previously mentioned list of providers. For each domain, we use the three-query method described above. Thus our list of domains represents every provider we want to measure; we query each open recursive server twice per hour for every domain in the list, and each query yields a round-trip measurement from the recursive server to a particular provider’s authoritative server.
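One monitoring pass can be sketched as a simple loop over resolvers and provider-hosted zones; the function and parameter names here are illustrative, and `measure` stands in for the three-query method.

```python
def run_once(resolvers, zones_by_provider, measure):
    """One twice-hourly monitoring pass (a simplified sketch).

    `resolvers` is the list of open recursive server IPs;
    `zones_by_provider` maps each provider to the zones it solely hosts;
    `measure(resolver, zone)` returns a recursive-to-authoritative RTT
    or None on timeout/failure."""
    rows = []
    for resolver in resolvers:
        for provider, zones in zones_by_provider.items():
            for zone in zones:
                rtt = measure(resolver, zone)
                if rtt is not None:  # drop failed measurements
                    rows.append((resolver, provider, rtt))
    return rows
```

The resulting rows, tagged by resolver and provider, are what the pairwise analysis below consumes.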
We analyzed this data set using a technique we developed and have used successfully for comparing CDN performance data obtained from millions of performance-testing “beacons” running in web browsers as part of our real-user monitoring (RUM) service. The technique involves pairwise comparison of the measurement targets. In this case we are measuring the performance of eight DNS providers. After a recursive server is queried for all the domains in our list during one of the twice-per-hour monitoring runs, we consider all measurements corresponding to every pair of providers. For each pair of providers, we compare every measurement from the first provider in the pair with every measurement from the second provider and note how many times the first provider is better than the second and how many times the second provider is better than the first. For example, assume that for the pair (Dyn, Provider #7) after all the domains are queried during a particular interval at a given recursive, there are N measurements for Dyn-hosted zones and M measurements for Provider #7-hosted zones. We compare every Dyn measurement with every Provider #7 measurement, resulting in NxM comparisons, noting for each comparison whether Dyn or Provider #7 had the better performance. These pairwise comparisons for every provider to every other provider can be rolled up according to any time period desired.
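The N×M pairwise comparison for a single monitoring run at one resolver can be sketched like this (a minimal illustration; ties are simply not counted as wins for either side):

```python
from collections import defaultdict
from itertools import product

def pairwise_wins(measurements):
    """Pairwise provider comparison for one run at one recursive server.

    `measurements` maps provider -> list of recursive-to-authoritative
    RTTs observed in that run.  For each ordered pair (a, b) we compare
    every one of a's N measurements against every one of b's M
    measurements (N*M comparisons) and count how often a was faster."""
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    providers = list(measurements)
    for a, b in product(providers, providers):
        if a == b:
            continue
        for ra, rb in product(measurements[a], measurements[b]):
            comparisons[(a, b)] += 1
            if ra < rb:
                wins[(a, b)] += 1
    # Win percentage of the first provider over the second in each pair.
    return {pair: 100.0 * wins[pair] / comparisons[pair]
            for pair in comparisons}
```

Summing the win and comparison counts across resolvers and runs before dividing rolls these percentages up over any time period desired.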
We analyzed for different lengths of time at various starting points and found the results to be fairly consistent, so we’ve opted to show the results from a week of data here. The chart below represents the pairwise comparisons among the eight providers for the time period of November 24-30:
Each square represents the percentage of time that the provider on the Y-axis (labeled in the left column) is better than the provider on the X-axis (labeled across the top). For example, the upper-rightmost square indicates that, across all our measurements, Dyn’s performance scores better than Provider #7’s 64% of the time. Each square’s coloring indicates the winning percentage of the Y-axis provider over the X-axis provider according to the legend at the bottom; we show the specific percentage values only for the top row to avoid cluttering the diagram.
We intend to keep this measurement running and continue refining our analysis. We welcome feedback and suggestions on this research in the comments.