Over the last week, I’ve been talking about internet issues and how to ensure service delivery more effectively. Today, I’m bringing the whole story together in one place.
A couple of weeks ago, my colleague Mikel Steadman wrote a post visiting seven top-of-mind business challenges CIOs face every day. Normally, I like to write about the good things on the internet – how to make configurations easier, performance better, run faster, jump higher, that sort of thing. However, one of the most common questions we receive in the field is this: how can we, as technology leaders, ensure service delivery when things like maintenance, security updates, DDoS attacks, or internet outages keep us in a constant state of firefighting? This week we’ll be exploring the bad stuff out there: all the garbage hammering against our collective walls, and some of the things we can all do about it.
Maintenance and Zero Day Events
The rate of known exploits seems to be growing at an ever faster pace. Keeping up with every zero day seems to be a full-time job in and of itself. Which begs the question – why do it in the first place? Not that you shouldn’t have up-to-date systems, but why are you doing it yourself? It’s 2016, and we live in an age where outsourcing things like DNS and traffic management isn’t just easy, it’s becoming the norm even for traditional enterprises.
Here at Dyn we use a version of BIND which has been molded to support all our fancy traffic management services. Yes, I said we use BIND. Proudly too. With millions of users running the package, the odds that a zero-day is found by the good guys before the baddies find it are stacked in our favor. We patch rapidly, and have a service contract with ISC for immediate warnings of vulnerabilities. That is one less thing your team will ever have to think about again. Even if you want to maintain control of DNS management, there are options like setting yourself up as a hidden master, which lets you keep your current way of managing systems while taking advantage of us as your edge.
Meanwhile, on your hardware load balancers, even something as simple as continental routing will take your team the better part of a sprint to set up. Just in time for a hardware refresh and relearning the systems. Why does it have to be this hard? Dyn has capabilities like Traffic Director, which can act as a global load balancer pushing traffic to any of your online assets. Need to load balance your FRA and HKG data centers? No problem. Looking to target your CDNs by market? Easy. All the while, with a cloud solution, Dyn will roll out changes automatically. No more refreshes, and your team can get back to that new greenfield deployment.
There was a time when DDoS attacks lived in obscurity, known only to us tech folks. Now they’re on the nightly news and I talk about them at art shows. What the heck happened?! Well, as more resources of our everyday life move online, there are more targets than ever. Meanwhile, performing a DNS-based amplification DDoS attack costs no more than it used to, and has arguably gotten easier now that there are more connected devices to take advantage of in a botnet. This would be harder if internet service providers conformed to the ingress filtering practices outlined in BCP38, which block traffic with spoofed source addresses, but many haven’t.
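To make the BCP38 idea concrete, here is a minimal Python sketch of the source-address check an ISP edge would apply to traffic arriving from a customer. The prefixes and the `permit_source` helper are made up for illustration (the addresses come from documentation ranges), and real ingress filtering lives in router ACLs or uRPF rather than in Python.

```python
import ipaddress

# Hypothetical prefixes an ISP has assigned to one customer-facing port.
CUSTOMER_PREFIXES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8:1234::/48"),
]


def permit_source(source_ip, allowed=CUSTOMER_PREFIXES):
    """Return True only if the packet's source address belongs to the customer.

    Anything else is treated as spoofed and would be dropped at the edge.
    """
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in allowed)


# A packet claiming to come from the customer's own space is forwarded.
print(permit_source("203.0.113.7"))
# A packet forged with a victim's address (the basis of amplification) is not.
print(permit_source("198.51.100.9"))
```

If every access network applied a check like this, an attacker could not forge the victim's address as the source of a DNS query, and the amplified response would have nowhere useful to go.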
Look, you can try to ride out a DDoS if you really want to, but again – it’s 2016. There is an easier way. Some of the most common DDoS attacks are on the DNS, because it’s easy. So outsource your DNS already. This means that your DNS vendor (or vendors) will be the distributed edge that takes the attack for you. On the chance that you are the target, your vendor will have the staff, training, hardware, and connectivity to easily thwart a DDoS. It’s worth looking at vendors’ track records and asking what their strategy is, as you will be entrusting them with your domain. Us? We’ve been delivering industry-leading availability since our network launched over a decade ago.
So what if it isn’t a DNS attack? Unless you’re running a global anycast network like we are, it can be hard to isolate an attack and move it around to where you have the resources to handle it. Because Dyn operates at the DNS layer, we are able to geo-target traffic before it enters the pipe for your DC. This is a unique place to route from, as it means traffic can be moved around the world to give your major sites a break, or even sent to a different provider.
Just how do DDoS scrubbing services work anyway? Most of them work by becoming the new upstream in BGP for the IP prefixes that you want scrubbed. If you decide to activate one on the fly in the moment of need, as most do, the scrubber removes the bad traffic before passing the rest back to your origin for normal production operation. That’s all great, but how well do they take control of your BGP? As it turns out, these services can do a haphazard job of taking control of the prefix, which causes two problems. First, if there is still a route to your prefix around the scrubbing service because the new route didn’t fully propagate, the attackers may find it and push their attack around your scrubbing service as well. Now you’re paying a lot of money for a link which isn’t doing its job. The second scenario is the one in our diagram above. As you can see, the scrubbing service botched the propagation, which caused periods of time in which there were no routes to the destination. You activated a scrubbing service and that caused an outage!
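Both failure modes can be spotted by looking at the AS paths toward your prefix from several vantage points. The Python sketch below is only an illustration: the ASNs are from the documentation range, the vantage point names and paths are invented, and `classify_paths` is a hypothetical helper rather than any monitoring product's API.

```python
SCRUBBER_ASN = 64500  # hypothetical scrubbing provider

# Invented AS paths toward your prefix, one per vantage point.
# An empty path means that vantage point currently has no route at all.
observed_paths = {
    "vantage-a": [64496, 64500, 64510],  # goes through the scrubber
    "vantage-b": [64497, 64510],         # reaches you around the scrubber
    "vantage-c": [],                     # no route: an outage from here
}


def classify_paths(paths, scrubber_asn):
    """Sort vantage points by how they currently reach the prefix."""
    report = {"scrubbed": [], "bypassing": [], "unreachable": []}
    for vantage, as_path in paths.items():
        if not as_path:
            report["unreachable"].append(vantage)
        elif scrubber_asn in as_path:
            report["scrubbed"].append(vantage)
        else:
            report["bypassing"].append(vantage)
    return report


report = classify_paths(observed_paths, SCRUBBER_ASN)
print(report)
```

Anything listed under "bypassing" is a door left open for the attacker, and anything under "unreachable" is the self-inflicted outage described above.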
It’s not my place to pick favorites, and I’ve found there is no such thing as a perfect provider. Your best shot is to actively monitor your vendors – on all levels, including us – to make sure you’re getting what you paid for. If you’re not, and they won’t fix it or offer a reasonable explanation for it, it’s time to find a new one.
The BGP Blues
BGP is a beautiful, simple protocol, but it’s a miracle this whole thing we call the internet works in the first place. It is based ultimately on gossipy routers which freely share information in a trust-based system. There is no central authority, so internet operators have to go on what their peers tell them. Unfortunately, sometimes that information is wrong, or at least not what we intended. In the worst case, a network can pass itself off as your AS and hijack your traffic. This could have disastrous security impacts, as traffic could be subjected to a man-in-the-middle scenario, or even terminated at the hijacker, where they might mimic your destination. Think about that: everything matched. Right domain, even DNSSEC, but the IP you were using was stolen. In less malicious scenarios, you can find your traffic gets “leaked” to networks that shouldn’t have a direct route to you. This can cause misdirection that hurts performance, and it has its own security implications, with traffic now freely passing through unfriendly waters.
What do you do about this? The first thing, as with anything, is to monitor it closely and be alerted as soon as something appears. Ok, then what? If you were hijacked by someone announcing a more specific route, you can match or raise them by announcing equally or more specific routes of your own. Otherwise you might want to swap out the prefix altogether for something not under attack. Then have a conversation with your upstream provider. Were they the one who leaked the route? Could they use their own leverage in the space to sanction the bad actor? And this isn’t just you: this can and does happen to entire countries and major brands alike.
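The reason "raising" works is that routers forward on the longest matching prefix. Here is a small Python sketch of that rule; the prefixes are documentation addresses, the origin labels are invented, and `best_route` is a hypothetical helper that ignores every other BGP tie-breaker.

```python
import ipaddress


def best_route(destination, routes):
    """Return (prefix, origin) for the longest prefix covering destination."""
    address = ipaddress.ip_address(destination)
    matching = []
    for prefix, origin in routes:
        network = ipaddress.ip_network(prefix)
        if address in network:
            matching.append((network, origin))
    if not matching:
        return None
    network, origin = max(matching, key=lambda item: item[0].prefixlen)
    return str(network), origin


# Your normal announcement, plus a hijacker announcing a more specific route.
hijacked = [
    ("203.0.113.0/24", "legitimate"),
    ("203.0.113.0/25", "hijacker"),
]
# You "raise" by announcing even more specific routes over the same space.
raised = hijacked + [
    ("203.0.113.0/26", "legitimate"),
    ("203.0.113.64/26", "legitimate"),
]

print(best_route("203.0.113.10", hijacked))  # the /25 wins: traffic is stolen
print(best_route("203.0.113.10", raised))    # the /26 wins: traffic comes home
```

This is a toy model: in practice, how specific an announcement can be before other networks filter it limits how far you can raise, which is one reason swapping prefixes is sometimes the better move.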
Cable cuts, reachability, availability – oh my!
All that was just what could happen on your network. The INTER-net is a network of networks, so your upstreams and peers have their own upstreams or peers, and your client destination ISP does as well. By a concoction of brilliant engineering and pixie dust, we’re able to get from one destination to another, creating the building blocks for the information age. Voila!
What happens when one of those connections doesn’t work, though? Or works poorly? The average connection across the internet takes 3-4 AS hops and maybe a dozen IP hops. Because these are ultimately all independent transactions, you can find traffic being pulled far off course, called hairpinning, which can cause major impacts to performance. This might happen because a business decision led to a peering change at an intermediary, or even because of a problem in the physical realm like a cable cut.
Well, who cares if so-and-so changed such-and-such route to east wherever, you ask? Your customers. That’s who. Your network stack might be humming along just fine, fully available and with great performance when tested from your APM and NPM solutions – but if users are unable to reach your service, it’s all for naught. Reachability is just as important as availability in a commoditized internet world. Your users might not understand how the technology works – though ultimately does it really matter? If they can get to Facebook but not your website, the blame will immediately fall on you. It doesn’t matter where the technical fault really lies.
This calls for a move to a full multi-vendor, fault-tolerant environment where anything may fail and the network adapts as a self-healing ecosystem. This could mean routing between different links if you have independent destination IPs for them, between different data centers, or even between different cloud providers. But to do that you need a management plane which is agnostic to each provider, and which allows you to be nimble enough to adapt to any condition. Dynamic Steering, which uses RUM technology to route users to the best resource in real time, is the tool we have been looking for. This is baked into our DNS platform for a seamless experience.
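To give a feel for what RUM-driven steering involves, here is a rough Python sketch that picks an endpoint per user region from real-user latency samples. It is not how Dynamic Steering is implemented: the sample data, the endpoint names, the median-latency rule, and the `best_endpoint` helper are all assumptions made for illustration.

```python
from statistics import median

# Invented RUM samples: (user_region, endpoint, latency_ms).
samples = [
    ("EU", "fra-dc", 38), ("EU", "fra-dc", 45), ("EU", "fra-dc", 41),
    ("EU", "hkg-dc", 210), ("EU", "hkg-dc", 190),
    ("EU", "cloud-x", 30), ("EU", "cloud-x", 33),
    ("AS", "hkg-dc", 35), ("AS", "hkg-dc", 40),
    ("AS", "fra-dc", 180), ("AS", "fra-dc", 200),
]


def best_endpoint(region, samples, healthy, min_samples=2):
    """Pick the healthy endpoint with the lowest median latency for a region."""
    by_endpoint = {}
    for sample_region, endpoint, latency_ms in samples:
        if sample_region == region and endpoint in healthy:
            by_endpoint.setdefault(endpoint, []).append(latency_ms)
    # Ignore endpoints with too few measurements to trust.
    candidates = {
        endpoint: median(latencies)
        for endpoint, latencies in by_endpoint.items()
        if len(latencies) >= min_samples
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


all_up = {"fra-dc", "hkg-dc", "cloud-x"}
print(best_endpoint("EU", samples, all_up))
# If one provider fails, the same rule steers users to the next best option.
print(best_endpoint("EU", samples, all_up - {"cloud-x"}))
```

The point of the sketch is the provider-agnostic part: the decision only looks at what real users measured, so a link, a data center, or a whole cloud provider can drop out and the answer simply changes.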
Waiting for Godot?
For every problem out on the internet, there are a number of possible solutions. It’s really not a lack of choice, but a lack of time and focus. It is an unfortunate truth that organizations often wait until they have a major incident before changing their internet performance strategy. That is a real shame, as this has never been easier, and so many of the internet issues that affect business performance are easily preventable or easily manageable. Dyn makes monitoring your internet posture and asset management strategy a breeze, letting your team get back to conquering the world – so what’s your excuse?