Steer a conversation with system administrators towards network health and how to monitor it, and an interesting dialogue usually follows. At its core it sounds like a pretty basic concept: does the network work? It is a simple question with an incredibly complex answer.
My reply to that question would be to ask, “What is healthy?” What an individual system administrator considers healthy is as subjective as a favorite color. Beyond the basic ability to reach your website, you also have to decide what you care about most: the load time of an individual page, the load on any individual machine in your system, the memory available for consumption, or any of a nearly endless list of other metrics.
When a problem comes up, the next question to ask is, “Where is it occurring?” For example, if a user cannot reach your website, is the problem on your server, at the user’s ISP, or at a switch along the way? It is critical to determine this so that you can take the correct course of action to rectify the issue. It also means you need to think carefully about where you set up your monitoring nodes: a monitor in NYC may have the same problem reaching your website as the user does, leading you to assume your server is down when the real culprit is a failed switch along the route.
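To make the point about monitor placement concrete, here is a minimal sketch of how results from several vantage points might be combined to guess where a failure sits. The function name, the location labels, and the classification rules are illustrative assumptions, not part of any particular monitoring product.

```python
def locate_failure(results):
    """Guess where a failure sits from per-location reachability checks.

    `results` maps a monitor location (hypothetical labels such as "nyc")
    to True if that monitor could reach the website, False otherwise.
    """
    if not results:
        return "unknown"

    failures = sorted(loc for loc, reachable in results.items() if not reachable)

    if not failures:
        return "healthy"
    if len(failures) == len(results):
        # Every vantage point fails: the server (or its own uplink) is suspect.
        return "server or its upstream link"
    # Only some vantage points fail: suspect the path from those locations.
    return "network path near: " + ", ".join(failures)


print(locate_failure({"nyc": False, "london": True, "tokyo": True}))
print(locate_failure({"nyc": False, "london": False, "tokyo": False}))
```

With a single monitor the two failure cases are indistinguishable, which is why spreading monitors across several networks matters.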
The tools available for monitoring are just as diverse as the things you can monitor. Many tools monitor certain issues excellently while covering others only superficially. What they all have in common is the ability to fire off an alert to let a system administrator know that a problem has occurred. That alert can also drive an automated response: when a problem is detected on a machine, that machine should (at least temporarily) be removed from the network load balancing pool until the problem has been located and resolved. That is precisely what the example posted here does.
When run, this Python script sits in the background and waits for emails to be sent to the machine; when one comes in, it reacts. This version is set up specifically to parse the XML returned by the Gomez Web Performance Management System, but it can be tweaked to work just as easily with any monitoring system that can send alerts to an email address (which is pretty much all of them).
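As a rough illustration of the parsing step, the sketch below pulls the IP addresses of SEVERE alerts out of an email body using only the standard library. The XML layout (`alerts`, `alert`, `severity`, `ip`) and the sample message are made up for this example and should not be read as the actual Gomez format; the element names would need to be adapted to whatever your monitoring system sends.

```python
import email
import xml.etree.ElementTree as ET

# Hypothetical alert email; the XML structure is an assumption for illustration.
SAMPLE_EMAIL = """\
From: alerts@monitor.example
To: dns-bot@example.com
Subject: Alert
Content-Type: text/plain

<alerts>
  <alert><severity>SEVERE</severity><ip>192.0.2.10</ip></alert>
  <alert><severity>WARNING</severity><ip>192.0.2.11</ip></alert>
</alerts>
"""


def message_body(msg):
    """Return the text of the first non-multipart part of the message."""
    for part in msg.walk():
        if not part.is_multipart():
            payload = part.get_payload(decode=True) or b""
            return payload.decode("utf-8", errors="replace")
    return ""


def severe_ips(raw_message):
    """Extract the IPs of all alerts whose severity is SEVERE."""
    msg = email.message_from_string(raw_message)
    root = ET.fromstring(message_body(msg).strip())
    return [
        alert.findtext("ip", default="").strip()
        for alert in root.iter("alert")
        if alert.findtext("severity", default="").strip().upper() == "SEVERE"
    ]


print(severe_ips(SAMPLE_EMAIL))
```

How the message reaches the script (a local mail spool, a pipe from the mail server, and so on) is left out here, since that depends on how the machine is set up.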
In this case, the script removes any IP addresses that have alerted as SEVERE from the supplied Dynect user’s Load Balancing A record pool, so that the round robin list never contains a server that has failed your specific monitoring test. To accomplish this, a Python wrapper for handling A records has been written using the httplib2 package. Following the same model, it can easily be extended to automate any other DNS tasks you may wish to perform.
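The pool update itself comes down to a set difference followed by one delete per failed address. The sketch below shows that logic in isolation; `split_pool`, `apply_removals`, and the `delete_record` callback are hypothetical names, and the callback merely stands in for whatever the A record wrapper exposes. No real Dynect API calls are shown.

```python
def split_pool(pool_ips, severe_ips):
    """Split the current pool into addresses to keep and addresses to remove."""
    severe = set(severe_ips)
    keep = [ip for ip in pool_ips if ip not in severe]
    remove = [ip for ip in pool_ips if ip in severe]
    return keep, remove


def apply_removals(remove, delete_record):
    """Call the supplied delete function once per failed address.

    `delete_record` is a placeholder for the DNS wrapper's delete call.
    """
    for ip in remove:
        delete_record(ip)


# Usage with a stand-in callback that only records what would be deleted.
pool = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
keep, remove = split_pool(pool, ["192.0.2.10", "198.51.100.7"])

deleted = []
apply_removals(remove, deleted.append)
print(keep, deleted)
```

Passing the delete operation in as a callback keeps the decision logic testable without touching DNS, and addresses that alert but are not in the pool are simply ignored.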