Monitoring and metrics collection are key to an effective Network Operations Center. No one wants to hear about service issues from their customers. Ideally, automated monitors should detect any problem and alert teams to it before the customer even notices.
When companies start to implement systems and monitoring, they often focus on the simple services that run on servers. Knowing that HTTP or SSH is responding is important in troubleshooting, but it doesn’t reflect the impact on the customer.
An end-to-end transactional test of the system is a far more accurate indicator of service health. A test that can log in, perform a transaction, and get the expected outcome says much more about what the customer is experiencing. Additionally, gathering metrics for these types of events can help detect systemic issues beyond your own control.
For example, if there is a large ISP outage in a given region, it may impact your business. While there may be little that can be done, you will at least have an understanding of why things are off.
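The end-to-end transactional test described above can be sketched as a small runner that executes each step in order, times it, and stops at the first failure. This is a minimal sketch rather than a definitive implementation: the step names and the stand-in callables (`login`, `place_order`, `verify_order`) are hypothetical, and a real check would replace them with calls against your own service.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    error: str = ""


def run_transaction(steps: List[Tuple[str, Callable[[], bool]]]) -> List[StepResult]:
    """Run each step in order, timing it; stop at the first failure."""
    results: List[StepResult] = []
    for name, step in steps:
        start = time.monotonic()
        try:
            ok = bool(step())
            error = "" if ok else "unexpected outcome"
        except Exception as exc:  # any exception counts as a failed step
            ok, error = False, str(exc)
        results.append(StepResult(name, ok, time.monotonic() - start, error))
        if not ok:
            break
    return results


def is_healthy(results: List[StepResult], expected_steps: int) -> bool:
    """Healthy only if every expected step ran and succeeded."""
    return len(results) == expected_steps and all(r.ok for r in results)


# Hypothetical stand-ins for real calls to the service under test.
def login() -> bool:
    return True


def place_order() -> bool:
    return True


def verify_order() -> bool:
    return True


steps = [("login", login), ("place_order", place_order), ("verify_order", verify_order)]
results = run_transaction(steps)
for r in results:
    print(f"{r.name}: ok={r.ok} seconds={r.seconds:.3f} {r.error}")
print("healthy:", is_healthy(results, len(steps)))
```

Recording the per-step timings as metrics, not just the pass or fail result, is what makes it possible to spot slowdowns such as a regional ISP problem even when every step still succeeds.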
As organizations mature, they start implementing trend-based reporting and alerting. Knowing that a server’s free disk space decreased quickly over a short amount of time is more important than knowing the actual space free. Tools like Graphite and Ganglia can facilitate trend-based alerting.
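To illustrate the idea of alerting on a trend rather than a threshold, here is a minimal sketch in plain Python. It does not use the Graphite or Ganglia APIs; it assumes you already have a list of `(unix_seconds, free_bytes)` samples from whatever collector you use. The function names and the four-hour default horizon are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix_seconds, free_bytes)


def slope_per_second(samples: List[Sample]) -> float:
    """Least-squares slope of free space over time, in bytes per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den


def seconds_until_full(samples: List[Sample]) -> Optional[float]:
    """Project when free space reaches zero; None if it is not shrinking."""
    slope = slope_per_second(samples)
    if slope >= 0:
        return None
    latest_free = samples[-1][1]
    return latest_free / -slope


def should_alert(samples: List[Sample], horizon_seconds: float = 4 * 3600) -> bool:
    """Alert when the disk is projected to fill within the horizon."""
    eta = seconds_until_full(samples)
    return eta is not None and eta <= horizon_seconds
```

With this approach, a disk that is 40% free but losing space fast can page someone, while a disk that has sat steadily at 10% free stays quiet.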
As important as monitoring, alerts, and metrics can be, they are worth little without proper documentation for resolving the issue. Having a dedicated team to respond to alerts and perform remediation is critical to minimizing outages, but that team needs clear processes to respond quickly and prescriptively.
When those processes fail, escalations to engineers help diagnose the issue. The result is incorporated into the documented remediation steps and the overall process improves.
In addition to good monitoring and remediation, the third core component of proper operations is communications. Nobody likes an executive breathing down their neck while trying to resolve a customer-impacting issue. Having clear, simple-to-follow processes and communications plans is critical to easing the nerves of the business and the customers when things do go bad, especially for long periods of time.
Severity determination, communications cadence, and pre-scripted text (to the extent that it can be pre-scripted) all help ensure rapid communication. Transparency about the impact and the cause is always welcomed by customers.
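One way to make severity, cadence, and pre-scripted text concrete is to encode them as data that the responding team can apply without debate during an incident. The sketch below is illustrative only: the severity names, the impact thresholds, the update intervals, and the message templates are all hypothetical and would need to be set by your own organization.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    name: str
    update_every: timedelta  # communications cadence
    template: str            # pre-scripted text with placeholders


# Hypothetical policies; intervals and wording are assumptions.
POLICIES = {
    "sev1": SeverityPolicy(
        "sev1", timedelta(minutes=30),
        "We are investigating a major issue affecting {service}. Next update by {next_update}.",
    ),
    "sev2": SeverityPolicy(
        "sev2", timedelta(hours=1),
        "Some customers may see degraded {service}. Next update by {next_update}.",
    ),
    "sev3": SeverityPolicy(
        "sev3", timedelta(hours=4),
        "We are looking into a minor issue with {service}. Next update by {next_update}.",
    ),
}


def classify(percent_customers_impacted: float, core_transaction_failing: bool) -> str:
    """Map impact to a severity level using hypothetical thresholds."""
    if core_transaction_failing or percent_customers_impacted >= 25:
        return "sev1"
    if percent_customers_impacted >= 5:
        return "sev2"
    return "sev3"


def draft_update(severity: str, service: str, now: datetime) -> str:
    """Fill in the pre-scripted text and commit to the next update time."""
    policy = POLICIES[severity]
    next_update = now + policy.update_every
    return policy.template.format(
        service=service, next_update=next_update.strftime("%H:%M UTC")
    )
```

For example, `draft_update(classify(30, False), "checkout", now)` would produce the sev1 message with a next-update time 30 minutes after `now`, assuming `now` is expressed in UTC.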