What is Personal Data?
We all use the internet every day for a variety of tasks: shopping, banking, and looking at cats. What we don’t see is all the information we leave behind. Every website you visit, form you fill out, and email you receive is dripping with information about you, your location, and your interests. Advertisers work diligently to collect this information to target their wares, and nefarious bodies harvest it to commit fraud.
In an effort to contain this endless digital paper trail, governments, industry consortiums, and legal bodies have begun to produce standards that define Personal Data, sometimes expressed as Personally Identifiable Information (PII) and Sensitive Personal Information (SPI), and regulate its handling. The European Union has long championed privacy as an individual right, so the forthcoming General Data Protection Regulation (GDPR) is a good place to look for a common definition:
GDPR Art.4(1) – “Personal data” means any information relating to an identified or identifiable natural person (“data subject”); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that person.
It’s also worth knowing which kinds of data are sensitive to exposure. From a data perspective, once someone has been identified, it doesn’t necessarily matter where they buy their shoes, but it does matter if the data reveals something personally sensitive about them, such as their health history. If this information were then used for discrimination, it would be a serious issue. The GDPR defines sensitive personal data as follows:
GDPR Rec.10, 34, 35, 51; Art.9(1) – “Sensitive Personal Data” are personal data, revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership; data concerning health or sex life and sexual orientation; genetic data or biometric data. Data relating to criminal offences and convictions are addressed separately (as criminal law lies outside the EU’s legislative competence).
What this really comes down to in the DNS layer is a concern that one could connect an identifier of a person, through DNS lookups, to the IP address of a device at a specific location. This is the basic concern of security professionals when it comes to network data systems. In short, can someone link the identity of a user with what they might know about that user? What might be great for routing turns out to be terrible from a compliance perspective.
Personal Data in a DNS Query
To determine how much personal data is in the typical DNS query, it is worth revisiting the chain of requests in the lookup. For the vast majority of the world’s users, workstations and devices route through a shared recursive resolver, usually operated by the local internet service provider (ISP) or by their place of work. While it is possible to operate your own recursive resolver, this is by far the exception and is seriously considered only by IT professionals, usually as a hobby.
When the user initiates a DNS lookup, the local environment first checks whether the entry is held locally, either in cache from a recent query or in the hard-coded hosts file. Failing that, the user asks the designated recursive DNS server. The recursive then checks its own cache and, failing that, interacts with the authoritative DNS servers to fetch the answer. There could be multiple chains of queries as the recursive performs this function, which is how it gained its name. Broadly, the query flows from the user’s device to the recursive resolver, and from the recursive resolver to the authoritative DNS servers.
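The lookup order described above can be sketched as a small simulation. This is only an illustration: the class names, the hosts-file and zone contents, and the example addresses are all hypothetical, and the authoritative side is reduced to a dictionary lookup rather than real network queries.

```python
from typing import Dict, Optional, Tuple

# Hypothetical data standing in for a real hosts file and authoritative zone.
HOSTS_FILE: Dict[str, str] = {"localhost": "127.0.0.1"}
AUTHORITATIVE_ZONE: Dict[str, str] = {"www.example.com": "192.0.2.10"}


class RecursiveResolver:
    """Toy recursive resolver: a shared cache in front of the authoritative data."""

    def __init__(self) -> None:
        self.cache: Dict[str, str] = {}
        self.authoritative_queries = 0

    def resolve(self, name: str) -> Optional[str]:
        if name in self.cache:
            return self.cache[name]
        # Cache miss: go to the authoritative side (a dict lookup in this sketch).
        self.authoritative_queries += 1
        answer = AUTHORITATIVE_ZONE.get(name)
        if answer is not None:
            self.cache[name] = answer
        return answer


class UserDevice:
    """Toy client: local cache first, then hosts file, then the recursive."""

    def __init__(self, recursive: RecursiveResolver) -> None:
        self.local_cache: Dict[str, str] = {}
        self.recursive = recursive

    def lookup(self, name: str) -> Tuple[Optional[str], str]:
        """Return the answer and which layer supplied it."""
        if name in self.local_cache:
            return self.local_cache[name], "local cache"
        if name in HOSTS_FILE:
            return HOSTS_FILE[name], "hosts file"
        answer = self.recursive.resolve(name)
        if answer is not None:
            self.local_cache[name] = answer
        return answer, "recursive resolver"


if __name__ == "__main__":
    shared = RecursiveResolver()
    alice, bob = UserDevice(shared), UserDevice(shared)
    print(alice.lookup("www.example.com"))
    print(bob.lookup("www.example.com"))
    print("queries that reached the authoritative:", shared.authoritative_queries)
```

In this sketch two users ask for the same name, but only one query reaches the authoritative side, and that query comes from the recursive rather than from either user.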
In other words, for the majority of queries, the user is obscured by the recursive resolver. This has a couple of implications. One is that the recursive smooths the traffic of DNS requests by providing a caching layer between the billions of users and the hundreds to thousands of authoritative DNS servers handling the majority of the world’s traffic. Another is that users never directly interact with an authoritative DNS provider such as Dyn’s Managed DNS network.
The recursive has full view of the user and must be careful in handling information about that user, which can be sensitive. Knowing which websites someone is browsing can identify them, and it also reveals their history, which could lead to conclusions touching on the categories of sensitive personal data. Thus, when an organization is looking at outsourcing its recursive DNS, it should ask about the provider’s policies on data privacy and personal data.
For the authoritative DNS, the only information it receives is that shared by the recursive on the user’s behalf. No direct interaction with the user means less possibility for exposure of that user’s personal data. This is ideal, as it allows organizations to be more flexible about whom they engage to operate their authoritative DNS. While there are still plenty of areas to evaluate for security, such as application security and DDoS protection, personal data and privacy concerns can usually be crossed off the list.
Traditional vs. EDNS0-client-subnet Queries
Conventionally, the recursive which performs the DNS query for the user would be a local ISP. This means it would be close to the user both geographically and in network topology. Authoritative DNS providers have exploited this relationship to allow geographic targeting based on the recursive’s location, so you can hand out smarter DNS responses. East Coast USA can get one answer, West Coast another, Europe yet another. This is a great way to direct traffic early in the connection.
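As a rough illustration of targeting by recursive location, the sketch below picks an answer based on which network the recursive's source IP falls into. The resolver networks, region names, and answer hostnames are all hypothetical; a real provider would rely on a geolocation database rather than a hand-written table.

```python
import ipaddress

# Hypothetical table: networks where recursives live, and the region each implies.
RESOLVER_REGIONS = [
    (ipaddress.ip_network("198.51.100.0/24"), "us-east"),
    (ipaddress.ip_network("203.0.113.0/24"), "us-west"),
    (ipaddress.ip_network("192.0.2.0/24"), "europe"),
]

# Hypothetical per-region answers for a single hostname.
REGION_ANSWERS = {
    "us-east": "east.example.com",
    "us-west": "west.example.com",
    "europe": "eu.example.com",
}
DEFAULT_ANSWER = "global.example.com"


def answer_for(resolver_ip: str) -> str:
    """Choose a response from the recursive's IP address, not the user's."""
    addr = ipaddress.ip_address(resolver_ip)
    for network, region in RESOLVER_REGIONS:
        if addr in network:
            return REGION_ANSWERS[region]
    return DEFAULT_ANSWER


if __name__ == "__main__":
    print(answer_for("198.51.100.53"))  # recursive in the "us-east" network
    print(answer_for("192.0.2.53"))     # recursive in the "europe" network
```

Note that the only input is the recursive's address, so the answer is only as accurate as the assumption that the recursive sits near its users.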
In recent years, this has been offset by a trend for users to bypass the local ISP and use large global recursive operators such as Google DNS (8.8.8.8), which offer a potential workaround for government censorship or simply better performance. When this started gaining adoption, authoritative providers were hit with a problem: with fewer recursive locations, the granularity of geolocation was drastically reduced. For example, while there are about a dozen countries in Southeast Asia, from the vantage point of the authoritative DNS the traffic might be coming from just a single location in Singapore.
To counter this, the industry developed and implemented EDNS0-Client-Subnet, which extends a standard DNS packet so that the recursive can pass along additional information about the user for the purposes of intelligent responses. At the time of writing, it is used on about 20% of all global queries, and that share is slowly increasing, predominantly through increased usage of Google DNS, the largest single contributor.
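To make the "additional information" concrete, here is a minimal sketch of how the client-subnet option could be encoded, following the option layout in RFC 7871 (option code 8, then address family, source prefix length, scope prefix length, and the truncated address). The function name and example address are ours, and this builds only the option bytes, not a complete DNS message.

```python
import ipaddress
import struct

ECS_OPTION_CODE = 8  # EDNS option code for Client Subnet (RFC 7871)


def build_ecs_option(client_ip: str, prefix_len: int = 24) -> bytes:
    """Encode a Client Subnet option carrying only the client's network prefix."""
    network = ipaddress.ip_network(f"{client_ip}/{prefix_len}", strict=False)
    family = 1 if network.version == 4 else 2  # 1 = IPv4, 2 = IPv6
    # Send only as many address octets as the prefix needs; host bits are zeroed.
    n_octets = (prefix_len + 7) // 8
    address = network.network_address.packed[:n_octets]
    # Scope prefix length is 0 in a query; the authoritative sets it in replies.
    payload = struct.pack("!HBB", family, prefix_len, 0) + address
    return struct.pack("!HH", ECS_OPTION_CODE, len(payload)) + payload


if __name__ == "__main__":
    print(build_ecs_option("203.0.113.77").hex())
```

With a /24 prefix, the last octet of the user's address never appears in the option at all.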
Now, this is a blog about personal data, so the real question is how much identifying information is contained in the shiny new packet. As it turns out, this was something both the developers and the recursive operators were aware of. While you can technically put whatever you would like in a packet, the industry seems to have standardized on passing only the /24 of the user’s IP address. This means the user’s IP address is one of a group of 256 IP addresses. While this is great for understanding roughly where a user is from a routing perspective (Indonesia vs. Thailand), it is not specific enough for most interpretations of personal data. Typically, tying a device to a single IP address, or a /32 in prefix form, is the accepted level of precision for data to count as personally identifiable.
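The /24 versus /32 arithmetic is easy to check with Python's standard `ipaddress` module. The helper name and the addresses below are illustrative only (they come from a range reserved for documentation).

```python
import ipaddress


def client_subnet(ip: str, prefix_len: int = 24) -> ipaddress.IPv4Network:
    """Reduce an IPv4 address to the network prefix a recursive might pass along."""
    return ipaddress.IPv4Network(f"{ip}/{prefix_len}", strict=False)


if __name__ == "__main__":
    first_user = client_subnet("203.0.113.77")
    second_user = client_subnet("203.0.113.200")
    # Two different users in the same /24 are indistinguishable to the authoritative.
    print(first_user, second_user, first_user == second_user)
    # A /24 covers 2 ** (32 - 24) addresses; a /32 covers exactly one.
    print(first_user.num_addresses, client_subnet("203.0.113.77", 32).num_addresses)
```

The second line of output shows the gap between the two levels of precision: a /24 leaves the user as one of 256 candidates, while a /32 points at a single address.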
Privacy and Security Matters
The fact that you have read this far means you take privacy, security, and DNS seriously, and so do we. Any vendor you engage with should be prepared to explain exactly how they interact with individual users, whether in the DNS query as covered above, in your administration portal for auditing, or in any optional add-on services such as HTTP redirects, which do indeed interact directly with a user. We at Dyn look forward to the opportunity to explain how we address those concerns and deliver a best-of-breed DNS service.