At Dyn Inc, we’ve built some pretty large-scale data processing systems. While we’re not Twitter, CNBC, or 37signals, we have deployed a managed DNS system large enough for those HUGE web properties to depend upon. When I woke up this morning, I spent time thinking about the best lesson working at Dyn Inc has taught me (since 2001):
“Assume every element of the system is unreliable”
Yes, that’s it. Period. Assume everything you’re building your system upon is unreliable. Think about all the time you spend trying to make your system components reliable: Redundant power supplies, RAID for data, dual this, redundant that. Are any of these perfectly bulletproof? Time has taught me that they are absolutely not.
It sounds pessimistic, but it really helps to get one’s brain in a place where proper software engineering can be done – if you cannot rely on the physical infrastructure and computing hardware, you must engineer around the problem at the software layer. This means building your software components to understand how to be “active-active” or “N+1”, just like your network switches or load balancers.
It’s why there are multiple nameservers given to each customer for their delegation, and why there are redundant secondary DNS ingress transfer points into the Dynect Platform, etc. It’s not that the hardware is bulletproof or redundant (though it is); it’s that the system knows how to heal itself across an array of hardware devices. The system knows how to route around a provider outage, deal with a slowing database server, or cope with slim bandwidth to a remote site.
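The “heal itself across an array of devices” idea can be sketched as a pool that quietly pulls a failing node out of rotation and lets it rejoin after a cooldown. This is an illustrative toy, not Dynect internals; the class and method names are invented.

```python
import time

class NodePool:
    """Keeps traffic flowing by routing around nodes marked unhealthy."""

    def __init__(self, nodes, cooldown=30.0):
        self.nodes = list(nodes)
        self.cooldown = cooldown       # seconds a node sits out of rotation
        self.down_until = {}           # node -> timestamp when it may rejoin

    def mark_down(self, node, now=None):
        # Route around the failure instead of paging a human immediately.
        now = time.time() if now is None else now
        self.down_until[node] = now + self.cooldown

    def healthy(self, now=None):
        now = time.time() if now is None else now
        return [n for n in self.nodes if self.down_until.get(n, 0.0) <= now]
```

Traffic only ever goes to `healthy()` nodes, so a single bad box degrades capacity instead of causing an outage.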
More importantly, this assumption makes everything more tolerant of failure, both from a system perspective and from a human perspective. This is good for the team running the systems: now not every alarm from the network results in a page to the on-call team. People know that things fail, the system self-heals, and they can fix it at a time convenient for everyone.
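That alerting policy boils down to one question: has the failure exhausted the system’s redundancy? A hedged sketch of the triage decision, with invented names and thresholds:

```python
def triage(total_nodes, failed_nodes, min_healthy):
    """Return "ticket" when the system can self-heal, "page" otherwise."""
    if total_nodes - failed_nodes >= min_healthy:
        return "ticket"  # self-healed; fix it during business hours
    return "page"        # redundancy exhausted; wake the on-call engineer
```

One dead node in an N+1 pool files a ticket; losing enough nodes to threaten service wakes someone up.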
This kind of thinking can truly save your team’s sanity and attitudes. It’s the difference between a sysadmin running the systems versus the systems running the sysadmin. It’s a great way to think, and something I invite anyone running critical systems to try out.