This post was co-authored by Alex Sergeyev from Dyn Labs.
At Dyn, we obsess about network performance, and the proof is in the tools we have built over the years to monitor how our network is running. With a globally deployed Anycast DNS network, constant internal and external monitoring of our performance is critical to keep providing top-notch services. One of the biggest challenges in monitoring our Anycast network is being at the mercy of third-party providers and their monitoring platforms.
After all, we cannot monitor ourselves and be fully objective with the data.
One of the issues with using external monitoring providers has always been pulling data from the provider within an acceptable interval. We really want to see data within a few seconds of a test run completing so that we can correct any issue as soon as it happens. For a long time, we worked with monitoring providers that could deliver our data no sooner than five or ten minutes after a test, and only on a dashboard they render, which is hardly usable for our operations.
As we’ve mentioned before, one of our favorite monitoring providers is Catchpoint, and thanks to their Data Push API, we’re able to receive a constant stream of feedback from their 50 global monitoring nodes in real time. Every five minutes, a Catchpoint node performs a series of tests against our DNS servers and instantaneously relays the results to the central Catchpoint collector. It also ships a copy of the results to a webserver on our network so we can begin reviewing them immediately.
Now the challenging part: how do we build a dashboard with ACTIONABLE data from 50 data sources and 4 targets (over 200 data points) over an hour’s time? Enter Alex Sergeyev from our Dyn Labs team, some ZeroMQ love and work with WebSockets and D3.js. Alex built us a very slick visualization application that allows us to really see what’s happening in real time.
The design of the perfect dashboard
We decided to create a system to deliver the stream of data from the Catchpoint API down to all connected web browsers in the form of 50+ little charts. We had already been shipping the data to a Graphite server, so we wanted to keep that running too.
We built the system in three parts so that each can scale independently as needed:
- A Catchpoint API receiver that parses the data and sends simple parsed lines to a ZeroMQ PUB socket.
- A WebSockets server that receives data from a ZeroMQ SUB socket and also keeps the state of the last probes per monitoring node in a memory-mapped file (for some flavor of persistence across restarts).
- A Graphite “shovel” which also receives data from the same ZeroMQ SUB socket and sends results to a Graphite server.
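To make the shovel step more concrete, here is a minimal Python sketch (our services were written in Perl, so this is illustrative only). The space-separated line format `node target response_ms timestamp`, the `catchpoint` metric prefix, and the example names are all assumptions, not the actual format we publish over ZeroMQ. The output follows Graphite's plaintext protocol, which expects `metric.path value timestamp` lines.

```python
from typing import NamedTuple


class Probe(NamedTuple):
    node: str           # monitoring node name (hypothetical field)
    target: str         # DNS server under test (hypothetical field)
    response_ms: float  # measured response time in milliseconds
    timestamp: int      # Unix epoch seconds


def parse_line(line: str) -> Probe:
    """Parse one hypothetical 'node target response_ms timestamp' line."""
    node, target, response_ms, timestamp = line.split()
    return Probe(node, target, float(response_ms), int(timestamp))


def to_graphite(probe: Probe, prefix: str = "catchpoint") -> str:
    """Format a probe as one Graphite plaintext-protocol line."""
    # Dots separate path components in Graphite, so replace them in names.
    node = probe.node.replace(".", "_")
    target = probe.target.replace(".", "_")
    return f"{prefix}.{node}.{target}.response_ms {probe.response_ms} {probe.timestamp}\n"


line = "nyc-01 ns1.example.net 23.5 1370000000"
graphite_line = to_graphite(parse_line(line))
```

In a real shovel, each line read from the SUB socket would go through a conversion like this and then be written to the Graphite server's plaintext listener.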
We used Perl for both of the web services parts because it lets us code simple applications very fast. Both web services were implemented as simple AnyEvent servers. The design above gives us sub-100 ms latency between receiving data from monitoring nodes and sending it out to connected WebSockets clients.
After building out the backend, most of the development time was spent creating a visual representation of the data we got. We wanted to be flexible with our view and decided to use D3.js to help us with that. It’s an amazing tool for making truly data-driven visualizations. We started with simple line/area charts to emulate the visualizations we are used to looking at in Graphite and RRD-based tools. We even wrote a SmokePing emulation library for D3.js, used it for our dashboard and plan to release it via our GitHub page in the future.
Here are a few examples of the charts it generates (dashboard size):
Taking these basic charts and the smarts of our UX genius Lara Swanson, we transformed a boring set of 50 charts on a page into a display that looks neat and pretty on either a laptop screen or the Internet TVs around the office. We decided to group charts by monitoring region and sort each section (credits to D3.js for making it simple) by mean response time plus twice the standard deviation. This dynamically “promotes” trouble spots to the front of each grouping and makes the data actionable!
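The ordering rule can be sketched in a few lines of Python. This is an illustration, not our D3.js code: the node names, regions and sample values are made up, the use of the population standard deviation is an assumption, and so is the choice to put the highest score first.

```python
from collections import defaultdict
from statistics import mean, pstdev


def trouble_score(samples):
    """Mean response time plus twice the standard deviation."""
    return mean(samples) + 2 * pstdev(samples)


def order_charts(nodes):
    """Group nodes by region; within each region, highest score first.

    `nodes` maps node name -> (region, list of response times in ms).
    """
    by_region = defaultdict(list)
    for name, (region, samples) in nodes.items():
        by_region[region].append((trouble_score(samples), name))
    return {
        region: [name for _, name in sorted(scored, reverse=True)]
        for region, scored in by_region.items()
    }


# Hypothetical sample data: three nodes in two regions.
nodes = {
    "nyc": ("NA", [20, 20, 20, 20]),  # steady responses
    "sfo": ("NA", [10, 10, 10, 70]),  # lower typical time, one large spike
    "lon": ("EU", [30, 30, 30, 30]),  # steady responses
}
layout = order_charts(nodes)
```

Here `sfo` is listed ahead of `nyc` in its region even though most of its responses are faster, because the spike raises both its mean and its deviation. That is the behavior we wanted: a node that is usually fast but occasionally slow should not hide at the back of its group.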
Our current dashboard is not quite final but we’re proud of what we quickly built:
- a scalable sub-100 ms data publishing system that kept our existing Graphite integration working and added our new dashboards.
- a simple, dynamic view of all available data (over twenty thousand data points) aggregated to help us know if there are any regional or local issues affecting the quality of our DNS services.
- an easy place for our capacity planning and P&O engineers to check when they want to make improvements to the network.