The way I see it, there are as many ways to do DevOps as there are organizations incorporating DevOps.
Here at Dyn Inc., we began thinking about a more collaborative engagement between our Engineering and Operations teams a couple of years ago. Internally we named this blended group EngOps, and we’ve kept that name even as the world has adopted DevOps.
A successful DevOps environment incorporates tools, processes and culture.
Our EngOps was the start of a new culture at Dyn. We didn’t know we were starting down this path; we just knew that what we had was not working for us. What we had was physically separated Operations and Development staff. The “Wall of Confusion” that Andrew Shafer talks about is even more imposing when it’s a physical wall.
The changes we saw once we had Operations sitting amongst the Engineering team were phenomenal. No more grumbling, and significantly fewer “pageable events”. We had a happy EngOps team because we had a happy Service Delivery Network. One of the basic tenets of DevOps is to measure everything. I wish I had known that then, as we can only point to anecdotal evidence to show the success of that change.
So, we have one third of the puzzle: a changed culture. A culture of co-operation and collaboration is now our foundation in EngOps, and all things build from here.
Processes for us didn’t change all that much, other than that we now have cross-functional teams participating in each stage of our process. This allows us to identify issues and fix them faster.
One successful process we have kept in place is our regular, twice-weekly team code review. These reviews are in addition to having every changeset emailed to every member of the team and running automated unit and functional tests against all changes. The team code reviews are all about peer review of the algorithms and design pattern implementation while the work is going on. We don’t wait for a feature to be fully complete before it gets reviewed.
By watching the evolution of a feature implementation, peers are able to identify early in the process whether an approach might be problematic. And there is no single time-consuming code review that has to happen before something can be released. By reviewing incrementally, we’re confident that the completed piece isn’t going to have any surprises in it. We can push it as soon as the final semicolon is in place.
A critical step in the deployment process is our development testing lab. We are constantly deploying our in-development code against our lab. This lab mirrors our production environment in architecture, down to the switches and load balancers. This allows us to be confident when pushing a new feature to production because we’ve done it before, many, many times. And of course, by using virtual machines, we’re able to snapshot specific points in time and test any truly destructive processes over and over again.
I mentioned automated unit testing. The group review time is also a great opportunity for peer pressure. If tests aren’t included with new features, or regression tests aren’t included with a bug fix, you can be sure your teammates will let you know about it.
Tools. This is the fun part, and where so much of the discussion around this topic focuses. Automation, configuration and monitoring tools are the stuff that makes us drool.
We at Dyn Inc. began working on our own internal automated configuration management tool not long after we formed the EngOps team. (At the time, tools like Puppet and Chef were still in their early stages; we would take a different approach if we were starting now.) Having a reliably reproducible way to get bare iron into a known, consistent configuration state is invaluable when you are talking about a network of hundreds or thousands of servers.
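To make "known, consistent configuration state" a little more concrete, here is a minimal Python sketch of the general idea behind configuration management: compare a server's current state to a desired state and apply only the differences, so that running the tool a second time changes nothing. This is not our internal tool. The state keys and the `diff_state` and `converge` functions are hypothetical names invented for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical model: a server's configuration as simple key/value pairs.
State = Dict[str, str]
Change = Tuple[str, str]


def diff_state(current: State, desired: State) -> List[Change]:
    """Return the (key, value) settings that differ from the desired state."""
    return [
        (key, value)
        for key, value in sorted(desired.items())
        if current.get(key) != value
    ]


def converge(current: State, desired: State) -> Tuple[State, List[Change]]:
    """Apply only the needed changes; return the new state and what changed."""
    changes = diff_state(current, desired)
    new_state = dict(current)  # leave the caller's dict untouched
    for key, value in changes:
        # A real tool would install a package or rewrite a file here.
        new_state[key] = value
    return new_state, changes


if __name__ == "__main__":
    desired = {"ntp": "installed", "sshd_port": "22"}
    state, first_run = converge({}, desired)
    _, second_run = converge(state, desired)
    print(first_run)   # every setting is applied on a fresh machine
    print(second_run)  # nothing left to do on the second pass
```

The property that matters is that the second run is a no-op: the same description can be applied to a fresh machine or to one that has drifted, and both end up in the same place.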
The beauty of automation is not limited to server configuration. We use controller scripts for all code and service deployments for the same reason: reliable repeatability. System administrators have known for decades the importance of scripting repeated routine tasks. Our controller scripts know all about our SVN repository and which services need to be stopped and started during a code push. The controller also knows about our monitoring system, and handles turning the monitors off and back on for the services it’s affecting.
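As a rough illustration of what a controller script does, the Python sketch below builds the ordered list of steps for one code push: silence the monitors, stop the services, update the working copy from SVN, start the services, and turn the monitors back on. It is a sketch under assumptions, not our actual controller. `monitorctl` is a made-up command standing in for a monitoring system's interface, and the service names, revision, and checkout path are placeholders.

```python
import subprocess
from typing import Callable, List, Sequence

Command = List[str]


def plan_deploy(services: Sequence[str], revision: str,
                checkout_dir: str) -> List[Command]:
    """Return the ordered commands for one code push."""
    plan: List[Command] = []
    # 1. Silence monitoring so the planned restart does not page anyone.
    #    `monitorctl` is a hypothetical CLI, not a real tool.
    for name in services:
        plan.append(["monitorctl", "disable", name])
    # 2. Stop the affected services.
    for name in services:
        plan.append(["service", name, "stop"])
    # 3. Bring the working copy to the requested SVN revision.
    plan.append(["svn", "update", "-r", revision, checkout_dir])
    # 4. Start the services again.
    for name in services:
        plan.append(["service", name, "start"])
    # 5. Re-enable monitoring once everything is back up.
    for name in services:
        plan.append(["monitorctl", "enable", name])
    return plan


def run_plan(plan: Sequence[Command],
             runner: Callable[[Command], object] = subprocess.check_call,
             dry_run: bool = False) -> List[Command]:
    """Run each command in order; the default runner raises on failure."""
    executed: List[Command] = []
    for command in plan:
        if not dry_run:
            runner(command)
        executed.append(command)
    return executed


if __name__ == "__main__":
    # Dry run: print the steps without touching any service.
    for step in run_plan(plan_deploy(["api", "worker"], "1234", "/srv/app"),
                         dry_run=True):
        print(" ".join(step))
```

Keeping the plan separate from its execution means the same steps can be printed, reviewed, or rehearsed in the lab before they ever run against production.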