When things go wrong
It’s a horror story that many in the tech world can sympathize with: your internal systems went down at the least opportune time, and you’re playing catch-up to restore normality. Here at Oracle Dyn we have multiple failover services that can help alert you when your endpoints become unresponsive, because it’s always better to be notified as soon as things go awry instead of by a disgruntled customer.
These services rely on monitors to verify your servers’ uptime. Their functionality is enhanced by the global reach of Oracle Dyn’s Anycast network, allowing us to utilize monitoring agents at various locations all around the world. Our monitors will set you up for success, even when things go wrong.
What is a monitor?
A monitor identifies objects on your network to be watched for potential failure. It does this using probes, which utilize customer-configured rules to determine when something is up or down. If the endpoint you are monitoring does not meet the requirements set forth in the rules, it will no longer be eligible to answer DNS queries, and traffic will fail over to an alternate endpoint. Simultaneously, the very same monitor can notify a contact on your account that downtime was detected. This allows you and your team to be more proactive about your next steps while we handle the failover for you.
While multiple Oracle Dyn services have this capability, Traffic Director and Active Failover are perhaps the most prominent. Each has its own advantages and features, but both employ similar monitors.
Before we delve into troubleshooting, it’s worth briefly covering some tips and recommendations for successful monitoring. First, you will want to ensure you are using a protocol that best suits your needs. The protocol of choice should be closely related to how others will be accessing your systems. For example, if you employ a secure HTTP connection, you may prefer the HTTPS protocol for your monitor. You may also use other protocols, including HTTP, SMTP, TCP, and PING.
The more customized your monitor is, the better. When our probes know exactly what to expect when they reach your endpoints, it allows them to be much more specific with error reporting. For example, including a port, file path on a server, host headers, or expected data can help you narrow down exactly where in the connection the issue occurred. If you prefer a simpler solution where our agents will only check for a “200 OK” response, that is an option as well.
Insane in the mainframe
Okay, you’ve set up your monitors and have received your first “down” message. Now what? While our world-class support team will always be ready to assist in these scenarios, our monitoring agents can tell you a lot right off the bat. We keep logs of what our probes are seeing, when they saw it, and where in the world they tried to reach you from.
Here are some common error messages you might see and some initial troubleshooting steps to take:
Timeout in connect phase – This error message, while simple, can be an indicator of a variety of issues. First and foremost, you’ll want to ensure you’re able to connect to the endpoint yourself. If you’re using a simple configuration to monitor a hostname, BIND’s “dig” command is a great place to start. You’ll want to see that your hostname is resolving to the IP address as expected. In the example below I’ve added “+short” to simplify the output, but that can be removed to show greater detail. Adding “+trace” in its stead will ensure your query doesn’t use cached data.
Command: dig dnslover.com +short
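If you want to script that comparison, here is a minimal sketch. It reads the output of "dig +short" on standard input and succeeds only when the address you expect appears among the answers. The function name resolves_to and the address 192.0.2.10 (a documentation-only address) are placeholders of ours, not part of any Oracle Dyn tooling.

```shell
# Hypothetical helper: succeed if the expected IP appears, as a whole line,
# in the `dig +short` output supplied on standard input.
resolves_to() {
    # -F: fixed string, -x: match the entire line, -q: no output, status only.
    grep -Fxq -- "$1"
}

# Example usage (needs network access, so it is left commented out):
#   dig dnslover.com +short | resolves_to 192.0.2.10 && echo "resolves as expected"
```

Matching the whole line keeps 192.0.2.10 from being mistaken for 192.0.2.100.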
Your service’s logs will also tell you where in the world our monitors are trying to reach you from. If you have any mechanisms for testing digs or traceroutes from that region, doing so will give you more specific insight. Additionally, if you have Traffic Director you are also able to set preferences for those monitor locations, allowing you to probe your endpoints from agents more relevant to the users of the endpoint or from a location more geographically adjacent.
As transient DNS issues can occur outside of your local network or Oracle Dyn, you may also wish to increase the number of retries utilized on the monitor. For each retry on your monitor our agents will reattempt to reach your endpoint in the event the first probe fails. This can help rule out fluke failovers that might have only occurred for a very small window of time.
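To see why retries filter out short blips, consider this rough sketch of retry logic. It is only an illustration and not how our monitoring agents are implemented; probe_with_retries is a hypothetical name, and the probe is whatever command you pass to it.

```shell
# Hypothetical sketch: run a probe command once, then retry up to $1 more
# times. Report "down" only if every attempt fails.
probe_with_retries() {
    retries="$1"
    shift
    attempt=0
    while [ "$attempt" -le "$retries" ]; do
        # "$@" is the probe command; its own output is discarded.
        if "$@" >/dev/null 2>&1; then
            echo "up"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "down"
    return 1
}

# Example usage (needs network access, so it is left commented out):
#   probe_with_retries 2 curl --head --silent --fail --max-time 5 "http://dnslover.com/"
```

With two retries, a single failed probe no longer marks the endpoint as down; all three attempts must fail.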
Connection refused/reset by peer – Most commonly when this error pops up it is due to our monitoring agents not being whitelisted by your network, server, or firewall. The solution is straightforward, as you will want to add our monitoring IP addresses to your whitelists to ensure that our probes are able to reach you.
Timeout in header phase – If you are using HTTP or HTTPS monitoring you may find it beneficial to add a custom port, file path, or host header to make your probe results more precise. When error messages related to the header arise, the cURL command is the perfect place to begin your investigation. This can be best illustrated with an example.
Let’s say you are looking to monitor the 184.108.40.206 IP address on port 80, and within the “directory” path we expect to find the “dnslover.com” host header. To start, we will check the response code returned when checking the IP, port, path, and host header. If you have not configured any specific data to look for, our probe will only check for a “200 OK” response at the specified location.
Here “--silent” hides cURL’s progress output, and “| head -n 1” trims the response to its first line, the status line that carries the response code. Removing those parameters will instead show you the full cURL output.
Command: curl --head "http://220.127.116.11:80/directory/" --header "Host: dnslover.com" --silent | head -n 1
Response: HTTP/1.1 200 OK
Should your response supply a redirect code (301, 302, etc.), it is recommended that you instead monitor the host and/or path referenced in the “Location:” header of the cURL response. In this example, you will instead want to monitor the host header of “differentsite.com” and the path of “anotherpath.”
Command: curl --head "http://18.104.22.168:80/directory/" --header "Host: dnslover.com" --location
Response: HTTP/1.1 301 Moved Permanently
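If you would rather not read the headers by eye, a small helper can pull out the redirect target. This is a sketch under our own assumptions: redirect_target is a hypothetical name, and the URL in the usage note is built from the example names “differentsite.com” and “anotherpath”, so its exact shape is assumed.

```shell
# Hypothetical helper: read `curl --head` output on standard input and print
# the value of the first Location header, if there is one.
redirect_target() {
    # HTTP header lines end in a carriage return, and header names are
    # case-insensitive, so strip the former and lowercase the latter.
    tr -d '\r' | awk 'tolower($1) == "location:" { print $2; exit }'
}

# Example usage (needs network access, so it is left commented out):
#   curl --head --silent "http://dnslover.com/directory/" | redirect_target
```

For a response carrying “Location: http://differentsite.com/anotherpath/”, the helper prints that URL; for a response with no redirect it prints nothing.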
Expected not found – When this message appears, our probes were unable to find the monitor’s expected data within the HTML response from your server. The match is case-sensitive, so you will first want to ensure your server returns exactly the data specified on your monitor.
If you have configured your monitor to check for specific expected data at your endpoint, we will need to alter our cURL a bit. Below you will notice that while much of the output is still silenced, we will now look to obtain your expected data using the “grep” command. Due to the nature of grep it is not necessary to request the full string of data, only a uniquely identifiable piece of what you need. This example continues from the above and references the expected data of “I eat DNS for breakfast!”
Command: curl "http://22.214.171.124:80/directory/" --header "Host: dnslover.com" --silent | grep "break"
Response: <I eat DNS for breakfast!>
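Because the comparison is case-sensitive, it can help to test for the full expected string as a literal rather than a short fragment. The sketch below does that; body_contains is a hypothetical helper name of ours, and it only approximates how a probe might compare the data.

```shell
# Hypothetical helper: succeed if the response body on standard input contains
# the expected string exactly (fixed string, case-sensitive).
body_contains() {
    # -F stops grep from treating characters such as "." or "!" as a pattern.
    grep -Fq -- "$1"
}

# Example usage (needs network access, so it is left commented out):
#   curl "http://dnslover.com/directory/" --silent | body_contains "I eat DNS for breakfast!" \
#       && echo "expected data found"
```

A body containing “i eat dns for breakfast!” would fail this check, which matches the “Expected not found” behavior described above.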
Finally, if you receive no response for any of the specific data above you may remove various elements of the cURL to try to narrow down the specific source of the issue.
SSL handshake failed/Can’t connect – Specific to HTTPS, any cURLs used to troubleshoot this monitor result will require a few adjustments. As you might have guessed we will need to change “http://” to “https://”, but the port and parameters will also require an update.
In this example, the common HTTPS port of 443 was specified, but your network may employ a different one. The “--insecure” parameter was also specified so that our cURL connects without verifying the server’s certificate. If this cURL succeeds where the same request without “--insecure” fails, the certificate is the most likely cause of the SSL handshake failure. If any SSL errors are detected they will be provided.
Command: curl --head --insecure "https://126.96.36.199:443/directory/" --header "Host: dnslover.com" --silent | head -n 1
Response: HTTP/1.1 200 OK
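One way to separate certificate trouble from plain connectivity trouble is to run the request twice, once with certificate checks and once with “--insecure”, and compare cURL’s exit codes. The sketch below only interprets those two codes. The name diagnose_tls and the wording of its messages are ours, and the verdicts are a rough guide and not a definitive diagnosis.

```shell
# Hypothetical helper: interpret the exit codes of two curl runs against the
# same HTTPS URL. $1 is from the normal run, $2 from the --insecure run.
diagnose_tls() {
    strict_rc="$1"
    insecure_rc="$2"
    if [ "$strict_rc" -eq 0 ]; then
        echo "ok: request succeeds with certificate verification"
    elif [ "$insecure_rc" -eq 0 ]; then
        echo "certificate problem: request succeeds only when verification is skipped"
    else
        echo "connection or handshake problem: request fails even without verification"
    fi
}

# Example usage (needs network access, so it is left commented out):
#   curl --head --silent "https://dnslover.com/directory/" >/dev/null; strict=$?
#   curl --head --silent --insecure "https://dnslover.com/directory/" >/dev/null; loose=$?
#   diagnose_tls "$strict" "$loose"
```

For reference, cURL documents exit code 60 for a certificate that could not be verified and exit code 7 for a failure to connect.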
Invalid code – When addressing these SMTP errors, you will need to employ different techniques. SMTP probes will check for a “220” response from your server, signifying a successful connection. Telnet is the preferred troubleshooting method for your SMTP or TCP monitor. To verify if your endpoint is accessible you will want to telnet against the port number and the applicable IP address or hostname. For SMTP, this will commonly be 25 or 587, but your network may use another port.
Command: telnet dnslover.com 587
Response: Trying 188.8.131.52...
Connected to dnslover.com
Escape character is '^]'.
220 dnslover.com Mail Server ESMTP
Once connectivity is verified you may attempt to perform any specific actions you wish to test. If you receive another code, there may be trouble with the mail server being monitored.
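If you save the telnet session to a file, a short check can confirm that the server greeted you with a “220” code. This is a minimal sketch; smtp_banner_ok and the file name transcript.txt are hypothetical names of ours.

```shell
# Hypothetical helper: succeed if any line on standard input starts with the
# SMTP "220" service-ready code. Multi-line greetings use "220-" on all but
# the last line, so both a space and a hyphen are accepted after the code.
smtp_banner_ok() {
    grep -Eq '^220[ -]'
}

# Example usage with a saved session:
#   smtp_banner_ok < transcript.txt && echo "mail server greeted us with 220"
```

Lines such as “Trying …” and “Connected to …” are ignored because they do not begin with the code itself.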
Need to delve deeper into some troubleshooting or have a different problem? Our help site has guides to help explain and address the different probe results you might come across. Interested in a monitoring or failover service and not sure what’s right for you? Learn more about what Oracle Dyn has to offer on our website.