An emerging ideology in the DevOps community is the notion that we should treat servers like cattle, not pets.
Instead of giving servers cute names and babying them when they get sick, we should coldly and swiftly dispatch them and spin up new servers to take their place. If servers and services are kept in “farms,” there should never be a moment when no service is available to an end user.
This approach is a continuation of high availability concepts, with the commoditization of the compute instance a natural conclusion of auto-scaling and self-healing high availability systems. If the system can grow and shrink, the individual instance loses much of its unique importance. A cow is born, only to die.
But servers aren’t the only thing that can break. Everything, with enough cycles, will have its day. This got me thinking about the frequency of certain infrastructure failures, as well as their relative impacts, and the role of DNS in high availability.
Servers fail often
An individual server will die fairly regularly, whether on a daily, weekly, or monthly schedule, depending on your application and the size of your organization. But the effects on users and the business as a whole are minimal, because an individual server’s availability is typically hidden from the user by a local load balancer or similar mechanism. There might be a single virtual IP for a data center with hundreds of servers behind it. This is a great way to reduce users’ exposure to the most common type of failure.
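The idea of hiding individual server failures behind a single virtual IP can be sketched as follows. The backend names and health states here are hypothetical; a real load balancer would maintain this pool from live health checks.

```python
import itertools

# Hypothetical backend pool behind a single virtual IP; names and
# health states are made up for illustration.
BACKENDS = {
    "web-001": True,   # healthy
    "web-002": False,  # failed its last health check
    "web-003": True,   # healthy
}

def healthy_backends(pool):
    """Return only the servers currently passing health checks."""
    return [name for name, healthy in sorted(pool.items()) if healthy]

def make_picker(pool):
    """Round-robin over healthy servers; callers only ever see the VIP."""
    cycle = itertools.cycle(healthy_backends(pool))
    return lambda: next(cycle)

pick = make_picker(BACKENDS)
# Requests rotate across healthy servers; web-002's death is invisible.
print([pick() for _ in range(4)])  # ['web-001', 'web-003', 'web-001', 'web-003']
```

From the user's perspective nothing changed when web-002 died: the virtual IP still answers, and traffic simply rotates over the remaining healthy servers.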
Data centers fail, too
Sometimes load balancing for server farm availability isn’t enough. All the resources could be down, the load balancer itself could have an issue, or the data center may not be reachable externally. Servers die. Fiber gets cut. Things happen. In these cases, it is common to use DNS-based traffic steering to seamlessly shift traffic to another available front-end load balancer or data center.
Traffic steering in DNS helps you get around those problems before the user enters your stack. If a load balancer that’s the entry point to an entire data center were to die, you would remove it from DNS entirely — preferably automatically. Then you could either fix it or kill it and spin up a new one, changing DNS to match.
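The automatic-removal logic can be sketched like this. The zone data, IP addresses, and health results are illustrative, not a real provider API; the one design choice worth noting is failing open when every endpoint looks down, since returning an empty DNS answer helps no one.

```python
# Sketch of automated DNS failover; records and health data are hypothetical.
ZONE = {
    "www.example.com": ["203.0.113.10", "203.0.113.20"],  # two data centers
}

def steer(records, health):
    """Answer only with endpoints that pass health checks. If everything
    appears down, fail open and return all records rather than an empty
    answer (the monitoring itself may be what's broken)."""
    alive = [ip for ip in records if health.get(ip, False)]
    return alive or list(records)

# DC1's load balancer just died; monitoring marks it unhealthy.
health = {"203.0.113.10": False, "203.0.113.20": True}
print(steer(ZONE["www.example.com"], health))  # ['203.0.113.20']
```

Once the dead load balancer is repaired or replaced, its address (old or new) goes back into the answer set and traffic shifts back automatically.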
DNS vendors fail sometimes
Even the biggest DNS networks have their occasional bad day. Virtually every DNS vendor in operation today, no matter its size, has had moments of difficulty, and the consequences can be significant. A multi-vendor DNS approach solves this problem: forward-looking organizations run multiple DNS vendors right alongside their multiple data centers, load balancers, power supplies, and so on.
If you’re just using plain DNS (without traffic steering), high availability is easily achieved with primary-secondary DNS configurations, which are widely (but not universally) supported. Any vendor that supports zone transfers (XFRs) will allow you to add it as a secondary, keeping your zones in sync. Once synced, the new vendor can be added to the delegation.
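Before adding the new vendor to the delegation, it is worth confirming that its copy of the zone has actually caught up, which is typically done by comparing SOA serial numbers. A simplified sketch (real serial comparison uses the serial-number arithmetic of RFC 1982; the serial values here are hypothetical):

```python
# Simplified sync check between a primary and a prospective secondary.
# Serials increase with each zone change, so a secondary is safe to
# delegate to once its SOA serial has caught up to the primary's.
def in_sync(primary_serial: int, secondary_serial: int) -> bool:
    return secondary_serial >= primary_serial

print(in_sync(2024060101, 2024060101))  # True  -> safe to add to delegation
print(in_sync(2024060102, 2024060101))  # False -> wait for the next XFR
```

In practice you would fetch both serials with a query against each provider's nameservers and only update the registrar delegation once they match.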
If you are using traffic steering, the process is a little trickier, but still very doable.
There might be man-in-the-middle attacks
DNS Security Extensions (DNSSEC) provide a verification mechanism to address attacks that aim to hijack your DNS. It ensures an attacker cannot act as a man in the middle, altering answers and sending users to a malicious page. The challenge is that adopting DNSSEC is very complicated. The biggest difficulty is doing traffic steering and DNSSEC at the same time, especially across multiple providers.
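The validation idea can be illustrated with a toy sketch. This is not real DNSSEC: actual DNSSEC uses public-key signatures published as DNSKEY and RRSIG records, while this sketch uses an HMAC with a made-up shared key purely to show why a tampered answer fails verification.

```python
import hashlib
import hmac

KEY = b"zone-signing-key"  # hypothetical key material, for illustration only

def sign(answer: str) -> str:
    """Produce a signature over a DNS answer (toy stand-in for an RRSIG)."""
    return hmac.new(KEY, answer.encode(), hashlib.sha256).hexdigest()

def verify(answer: str, signature: str) -> bool:
    """A validating resolver checks the signature against the answer."""
    return hmac.compare_digest(sign(answer), signature)

record = "www.example.com. A 203.0.113.10"
sig = sign(record)
print(verify(record, sig))                              # True: answer intact
print(verify("www.example.com. A 198.51.100.66", sig))  # False: tampered answer
```

A man in the middle can swap the IP address in transit, but without the signing key it cannot produce a signature that validates, so a validating resolver rejects the forged answer.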
DNSSEC across two providers is straightforward for static zones, because the signature records are ordinary records that transfer like any others. DNSSEC with traffic steering is offered by some vendors, which either pre-sign every possible answer or sign responses dynamically on the fly. To my knowledge, at the time of this writing, no domain has ever performed DNSSEC with traffic steering across multiple providers.
Overall, man-in-the-middle attacks against DNS are rare. If one were to occur, however, the effects could be enormous.
There are thousands of things that could go wrong with your infrastructure. It is important to assess your technical risks, how frequently things break, and their impact on the business. By building in multiple layers of control and redundancy, including using DNS for high availability, any single machine, data center, or vendor can break without causing application downtime.