(This post was originally featured in a condensed version on Wired without images. Enjoy the director’s cut below.)
At Dyn, we’ve grown. A lot. Growth requires constant evolution of every piece of your company, not the least of which is your technical infrastructure. When your business is Internet Infrastructure as a Service (namely managed DNS and email delivery), you’d better get good at it.
We typically upgrade our infrastructure and system architecture for one of three reasons:
- For scale (targeting 10x or 100x current capacity)
- For performance (trying to remove a system bottleneck to process work faster)
- For decoupling (for greater reliability, maintainability, and future scaling efforts)
These efforts have historically been planned as large forklift upgrades. Folks come up with a three-month plan, begin working diligently on the new system, and hope to flip a switch three months later when they’re ready to go live. Often what actually happens is that three months turn into six (things inevitably take longer than expected), and the business is forced through a high-risk “all or nothing” flag day exercise to migrate over to the new system.
Throughout those six months, the business has no flexibility to pursue other efforts or change priorities. The forklift upgrade is an all or nothing proposition, and until the team finishes the effort and migrates over to the new system, no business value is delivered.
We needed to find a better way, so we adopted a Dark Architecture approach.
Dark Architecture is a way of thinking about, and a technical approach to, solving the scale/performance/coupling problems while enabling the business to succeed and maintaining the sanity of your staff. We do this by:
- Prioritizing migration of “flows” through a system rather than components of a system
- Running legacy and dark architectures in parallel
- Sending system inputs to both systems, collecting two outputs, comparing values of outputs, but throwing one away
Let’s walk through a hypothetical example of a legacy approach to forklift upgrading a system.
On project start, we have inputs going to our legacy system and outputs leaving our legacy system, and 100% of the functionality (or “flows” of information) is serviced by the legacy system. Off to the side, we’ve started day 1 of getting a new system built.
Teams work diligently for weeks or months to complete the system and eventually emerge victorious with a 100% functionally equivalent implementation. The time has come to migrate systems on a carefully chosen flag day. Likely, the project has slipped behind schedule and folks were forced to power on to completion to get the system live. Along the way, the organization had little flexibility to shift gears to other priorities without completely shelving the upgrade effort.
There is no delivered customer value until the system is fully deployed and cut over. As flag day arrives, the systems are cut over, 100% of the functionality is now serviced by the new system, and the legacy system is sent off to the farm to live out its days.
There are specifically three things that we want to improve about this approach:
- Deliver value sooner to our customers. (We don’t want to have to wait for the whole new system to be completed and put live before we deliver any value.)
- Reduce the risk of failure on introduction to production. (We want to avoid an “all or nothing” migration plan.)
- Offer flexibility to the business to switch priorities and at least have delivered the most critical value. (If we need to switch gears, we want to know that we’ve solved and delivered the important solutions first.)
Here’s how this hypothetical example would play out at Dyn applying a Dark Architecture approach:
Before we begin touching code and systems, we prioritize the “flows” of data through the system in order of pain, opportunity, business value, or whatever metric makes sense for your business. Rather than speaking in component terms (e.g., swap the reporting database backend from MySQL to Cassandra), we think in flow terms (e.g., a graph of wildcard queries for customer X takes 40 seconds to render, while all other graph types for this customer render perfectly quickly, and this graph type renders perfectly quickly for all other customers). This exercise will force you to hone scope to exactly where the pain is so you can focus on delivering the solution to this pain first and save others for later.
Once we have our priority flow through the system (let’s assume it’s 2% of the overall functionality), we do not begin by building that functionality. Instead, we first build the scaffolding around it that lets our overall system feed the same input to both implementations and collect two outputs (comparing the outputs, logging when they differ, but throwing one away).
Practically speaking, this might be duplicating web service calls (one legacy, one new) or duplicating database interaction calls (one legacy, one new), and then comparing the return values and logging to a file or server or message bus when they’re different.
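To make the scaffolding idea concrete, here is a minimal Python sketch of that kind of wrapper. It is an illustration under assumptions, not Dyn’s actual implementation: `dark_call`, `legacy_query_count`, and `dark_query_count` are hypothetical names, and the two functions stand in for whatever pair of web service or database calls you are duplicating.

```python
import logging
from typing import Any, Callable

log = logging.getLogger("dark_architecture")


def dark_call(
    legacy: Callable[..., Any],
    dark: Callable[..., Any],
    *args: Any,
    **kwargs: Any,
) -> Any:
    """Send the same input to both systems and return only the legacy output."""
    legacy_result = legacy(*args, **kwargs)
    try:
        dark_result = dark(*args, **kwargs)
    except Exception:
        # A failure in the dark system must never affect production output.
        log.exception("dark system raised for input %r %r", args, kwargs)
        return legacy_result
    if dark_result != legacy_result:
        log.warning(
            "output mismatch for input %r %r: legacy=%r dark=%r",
            args, kwargs, legacy_result, dark_result,
        )
    # The dark output is thrown away; callers only ever see the legacy result.
    return legacy_result


# Hypothetical stand-ins for the legacy and new implementations of one flow.
def legacy_query_count(customer: str) -> int:
    return {"acme": 3}.get(customer, 0)


def dark_query_count(customer: str) -> int:
    return {"acme": 3, "globex": 1}.get(customer, 0)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(dark_call(legacy_query_count, dark_query_count, "acme"))
    print(dark_call(legacy_query_count, dark_query_count, "globex"))
```

The sketch calls the two systems one after the other for simplicity. In a real deployment you would probably want the dark call to run asynchronously or with a timeout so it cannot add latency to the production path.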
With the input/output scaffolding in place, we’re now ready to start writing functionality. We’re going to implement that most painful 2% flow of the system, and as soon as it’s ready, we’re going to push it to the Dark Architecture in production. It will receive production input, but we will throw its output away after comparing it to the legacy system’s output.
If they differ, we’ll log it so we can inspect. We’ll also measure the performance improvement our new system delivers and gain operational experience in working with it.
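One way to collect those measurements is sketched below. This is a hedged example rather than a description of our tooling: `FlowStats`, `timed`, and `observe` are hypothetical names, and a real system would more likely ship these numbers to a metrics service than keep them in memory.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class FlowStats:
    """Running totals for one flow served by both systems."""
    calls: int = 0
    mismatches: int = 0
    legacy_seconds: float = 0.0
    dark_seconds: float = 0.0
    # (input, legacy output, dark output) for every disagreement.
    samples: List[Tuple[Any, Any, Any]] = field(default_factory=list)

    def match_rate(self) -> float:
        if self.calls == 0:
            return 0.0
        return (self.calls - self.mismatches) / self.calls


def timed(fn: Callable[[Any], Any], arg: Any) -> Tuple[Any, float]:
    """Call fn(arg) and return its result plus the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(arg)
    return result, time.perf_counter() - start


def observe(
    stats: FlowStats,
    legacy: Callable[[Any], Any],
    dark: Callable[[Any], Any],
    arg: Any,
) -> Any:
    """Run both systems, record timings and mismatches, return the legacy output."""
    legacy_result, legacy_elapsed = timed(legacy, arg)
    dark_result, dark_elapsed = timed(dark, arg)
    stats.calls += 1
    stats.legacy_seconds += legacy_elapsed
    stats.dark_seconds += dark_elapsed
    if dark_result != legacy_result:
        stats.mismatches += 1
        stats.samples.append((arg, legacy_result, dark_result))
    return legacy_result
```

Comparing `legacy_seconds` with `dark_seconds` over the same production inputs gives a like-for-like view of the speedup, and `match_rate()` is the kind of number you would watch climb toward 1.0 before trusting the new system.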
Side note: Have you caught on to why we call it Dark Architecture yet? It’s “dark” because it’s receiving production input but not being used for production output until we’ve gained confidence in its implementation and operation. Only after we’re confident the system behaves the way we expect it to do we turn it “on” for producing production outputs.
Once we’re confident this new system works the way we expect it to and delivers the desired performance or scalability improvement, we’ll switch which output gets thrown away, thereby realizing the value of the new system for solving the most painful 2% of the functionality, while still relying on the legacy system to service the remaining 98% of the functionality.
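Switching which output gets thrown away can be as small as a per-flow flag. The following sketch assumes a hypothetical `FlowRouter` class that maps flow names to handler functions; it is meant to show the shape of the idea, not a production design.

```python
from typing import Any, Callable, Dict, Set

Handler = Callable[[Any], Any]


class FlowRouter:
    """Serve each flow from legacy or dark, decided one flow at a time."""

    def __init__(self, legacy: Dict[str, Handler], dark: Dict[str, Handler]) -> None:
        self._legacy = legacy          # every flow has a legacy handler
        self._dark = dark              # only migrated flows have a dark handler
        self._promoted: Set[str] = set()

    def promote(self, flow: str) -> None:
        """Start returning the dark output for this flow."""
        if flow not in self._dark:
            raise KeyError(f"no dark implementation for flow {flow!r}")
        self._promoted.add(flow)

    def demote(self, flow: str) -> None:
        """Fall back to returning the legacy output for this flow."""
        self._promoted.discard(flow)

    def handle(self, flow: str, request: Any) -> Any:
        legacy_handler = self._legacy[flow]
        dark_handler = self._dark.get(flow)
        if dark_handler is None:
            # Flow not migrated yet: the legacy system is the only option.
            return legacy_handler(request)
        # Both systems still receive the input; only one output is kept.
        legacy_result = legacy_handler(request)
        dark_result = dark_handler(request)
        return dark_result if flow in self._promoted else legacy_result


if __name__ == "__main__":
    router = FlowRouter(
        legacy={"wildcard_graph": lambda r: "legacy:" + r,
                "billing": lambda r: "legacy-billing:" + r},
        dark={"wildcard_graph": lambda r: "dark:" + r},
    )
    print(router.handle("wildcard_graph", "customer-x"))
    router.promote("wildcard_graph")
    print(router.handle("wildcard_graph", "customer-x"))
    print(router.handle("billing", "customer-x"))
```

Because both systems keep receiving the input after promotion, `demote` gives you a cheap rollback if the new system misbehaves. Comparison logging and error handling are left out here to keep the sketch short.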
At this point, our teams have successfully delivered to the customers a solution to the most painful 2% of system functionality in a fraction of the time it would take to re-implement 100% of the functionality.
- Morale is high! (What technologist doesn’t love seeing their work put to use?)
- Customers are cheering! (Their pain is solved.)
- The business assumed little risk! (The functionality was de-risked by running in production with one output thrown away.)
Let’s assume there are a few more high priority flows to tackle, representing 20% of the overall system flows. Following a Dark Architecture approach, the business will soon find itself with a choice:
- Continue upgrading flows of functionality until 100% has been migrated, or
- Assess the remaining 80% of functionality against other business priorities.
This is a powerful difference between the legacy approach to upgrading infrastructure and a Dark Architecture approach. The business now has a choice partway through the effort. Some circumstances may warrant completing 100% of the functionality migration. Some may warrant shelving future migration altogether and accepting two systems operating in parallel as a perfectly reasonable solution (not ideal technically or operationally, but it’s business!). Others may warrant slowly migrating the remaining functionality as technical debt while also pursuing more pressing endeavors.
That opportunity for choice is a cornerstone of an agile process, and having it in our toolbox for evolving our systems has been pivotal for achieving our scale.