At Oracle Dyn, we are always working to improve our operations and tools. We were able to dramatically simplify processes and thus reduce our reaction time during malicious traffic incidents by leveraging an Internet standard called FlowSpec.
Network and systems engineers design and build networks that can withstand traffic volumes well above what is considered “normal”, from service provider links all the way down to servers. However, the Internet is constantly growing, while upgrades and changes to physical assets take time. In the meantime, operations teams need efficient tools that let them divert and filter traffic quickly to protect services. For example, one may need to filter packets that are saturating a link, or stop a flow of traffic that is overloading one or more servers. In those situations, network engineers have traditionally used access control lists (ACLs) configured on routers.
Previously, during an attack, an engineer would have to: a) determine which sites were affected; b) log into each router’s command line interface; c) type in many lines of packet filtering rules on each router, without typos or inconsistencies; d) wait for the attack to stop; and e) reverse all of the previous steps, without missing any routers or leaving unnecessary configuration behind.
But what about configuration automation? It is true that we could leverage other tools used for change management, but these work by generating configuration files from templates, running tests, approving and merging changes, and finally deploying them in bulk. This process works very well for permanent changes, but not for emergencies such as an attack, where speed of deployment is critical.
The famous Border Gateway Protocol (BGP) is most commonly used to exchange routing information between networks operated by different entities, and also to carry that information within a network. However, because the protocol itself is agnostic of the content it transports, its use can be extended to carry not only routes, but other information as well.
FlowSpec is a standard defined in RFC 5575. This document describes a mechanism by which network operators can use BGP to distribute packet filtering rules that either discard the matching traffic or reduce its rate. RFC 5575 was published in 2009, and FlowSpec is now well supported by major router vendors.
Because Oracle Dyn operates a distributed network of 20 data centers on 5 continents, the manual procedure described earlier is less than ideal. We wanted to leverage FlowSpec to define packet filtering rules in one place and have them propagated network-wide (or to a subset of the network) as fast as possible. So we started a project called “Flux”.
Flux is a RESTful API that interacts with an open source BGP routing daemon called ExaBGP. We were familiar with ExaBGP because we use it to announce anycast routes from multiple servers and locations. It is particularly flexible because one can hook any piece of software to ExaBGP to tell it what to announce to its BGP peers.
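As a rough illustration of that hook: ExaBGP can launch a helper program and read commands from the helper's standard output. The sketch below assumes ExaBGP's text API and its `announce flow route` / `withdraw flow route` commands; the function names and the choice of match fields are hypothetical and are not taken from Flux. The exact grammar should be checked against the ExaBGP version in use.

```python
import sys


def flow_command(verb, destination, protocol=None, destination_port=None, rate_limit=None):
    """Build one ExaBGP text-API command for a FlowSpec route (hypothetical helper).

    verb is "announce" or "withdraw". rate_limit is in bytes per second;
    None means the matching traffic is discarded.
    """
    if verb not in ("announce", "withdraw"):
        raise ValueError("verb must be 'announce' or 'withdraw'")
    match = [f"destination {destination};"]
    if protocol is not None:
        match.append(f"protocol {protocol};")
    if destination_port is not None:
        match.append(f"destination-port ={destination_port};")
    action = "discard;" if rate_limit is None else f"rate-limit {rate_limit};"
    return f"{verb} flow route {{ match {{ {' '.join(match)} }} then {{ {action} }} }}"


def send(command, stream=sys.stdout):
    """Write a command where ExaBGP can read it, flushing so it is not buffered."""
    stream.write(command + "\n")
    stream.flush()
```

A helper started by ExaBGP could then call `send(flow_command("announce", "192.0.2.10/32", protocol="udp", destination_port=53))` to start filtering, and send the same rule with `"withdraw"` to remove it.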
Flux is written in Python. It has only three REST resources that can be created or deleted via HTTP/HTTPS:
- Flows: These are the packet filtering specifications. A flow has several fields that tell the routers what packets to select: for example, source and destination IP addresses, port numbers, protocol (TCP/UDP/ICMP), etc. A flow resource also includes an action which can be to discard (default) or rate-limit the traffic.
- Sites: These are mappings between site names and router names. The purpose is to streamline the operation because it is easier to remember short site codes than multiple router names.
- Refresh: This resource allows the operator to resend existing flow routes to one or more routers in the event that the BGP session between ExaBGP and the router(s) has been lost.
Flux also uses a key/value store for persistence of flows and sites.
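To make the relationship between the three resources concrete, here is a minimal sketch. It is not Flux's actual code: the class, the key layout, and the idea that each flow carries a list of target sites are all assumptions, and a plain dictionary stands in for the key/value store.

```python
import json
import uuid


class FluxStore:
    """Toy in-memory stand-in for a key/value store of sites and flows."""

    def __init__(self):
        self._kv = {}

    def create_site(self, name, routers):
        # A site maps a short code to the routers located there.
        self._kv[f"site/{name}"] = json.dumps(sorted(routers))

    def delete_site(self, name):
        self._kv.pop(f"site/{name}", None)

    def routers_for(self, site):
        return json.loads(self._kv.get(f"site/{site}", "[]"))

    def create_flow(self, match, sites, action="discard"):
        # A flow is a set of match fields, an action, and the sites it applies to.
        flow_id = uuid.uuid4().hex
        self._kv[f"flow/{flow_id}"] = json.dumps(
            {"match": match, "action": action, "sites": sites}
        )
        return flow_id

    def delete_flow(self, flow_id):
        self._kv.pop(f"flow/{flow_id}", None)

    def flows_for_router(self, router):
        """Return the stored flows a refresh would resend to one router."""
        result = []
        for key, raw in self._kv.items():
            if not key.startswith("flow/"):
                continue
            flow = json.loads(raw)
            targets = {r for s in flow["sites"] for r in self.routers_for(s)}
            if router in targets:
                result.append(flow)
        return result
```

Because the flows are persisted, a refresh for a router whose BGP session was lost only needs to look up the flows that target it and announce them again.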
The beauty of an API is that it can be used in different ways. One option is to build a web user interface. The other, more interesting option is to call the API from another program.
ChatOps is a fun trend among network operators. Normally, operations teams interact using a chat service (Slack, HipChat, etc.). Some of these services allow users to develop programs (called “chat bots”) that they can call from within a channel. These bots can interact with one or more services and applications, for example opening or closing tickets, scheduling on-call duties, or showing metrics graphs. The great advantage of ChatOps is that a member of the team can use these services while everyone else on the team can see who did what, when, and with what outcome.
Because our NOC was already using chat bots, it was easy for them to use the new Flux API. Today, stopping or rate-limiting malicious traffic is a matter of typing a few instructions in the chat, and voilà! No more logging into routers, making mistakes, or leaving configurations behind. Best of all, what may previously have taken 10-20 minutes can now be done in seconds.
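As an illustration of how little an operator needs to type, the sketch below parses a chat line into a request description that a bot could forward to an API like Flux. The command syntax (`flux block dst=... proto=... dport=... sites=...`), the field names, and the shape of the returned dictionary are all invented for this example; the post does not describe the real bot commands.

```python
import ipaddress
import shlex

# Hypothetical chat verbs mapped to filtering actions.
VERBS = {"block": "discard", "limit": "rate-limit", "unblock": None}


def parse_chat_command(text):
    """Turn a chat line into a request description for a flow-filtering API."""
    tokens = shlex.split(text)
    if len(tokens) < 2 or tokens[0] != "flux" or tokens[1] not in VERBS:
        raise ValueError("usage: flux block|limit|unblock key=value ...")
    fields = {}
    for token in tokens[2:]:
        key, sep, value = token.partition("=")
        if not sep or not value:
            raise ValueError(f"expected key=value, got {token!r}")
        fields[key] = value
    if "dst" not in fields:
        raise ValueError("dst=<prefix> is required")

    # Normalise the prefix; ip_network raises ValueError on malformed input.
    match = {"destination": str(ipaddress.ip_network(fields["dst"], strict=False))}
    if "proto" in fields:
        match["protocol"] = fields["proto"].lower()
    if "dport" in fields:
        match["destination_port"] = int(fields["dport"])

    verb = tokens[1]
    request = {
        "method": "DELETE" if verb == "unblock" else "POST",
        "match": match,
        "sites": fields["sites"].split(",") if "sites" in fields else [],
    }
    if verb != "unblock":
        request["action"] = VERBS[verb]
    if verb == "limit":
        if "rate" not in fields:
            raise ValueError("rate=<bytes per second> is required for limit")
        request["rate"] = int(fields["rate"])
    return request
```

Validating the input in the bot means a typo produces an error message in the channel instead of a malformed rule on a router.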
APIs like this can also be tied to more sophisticated detection mechanisms. For example, it would be possible to leverage Netflow data and analytics to stop an attack immediately without human intervention.
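A very simple version of that idea can be sketched as follows. This is not something the post describes as implemented: the record fields (`dst`, `proto`, `bytes`), the fixed threshold, and the shape of the suggested rule are assumptions, the code ignores flow sampling rates, and it handles IPv4 host addresses only.

```python
from collections import Counter


def detect_targets(records, interval_seconds, threshold_bps):
    """Suggest discard rules for destinations receiving more than threshold_bps.

    records: iterable of dicts with "dst", "proto" and "bytes" keys, all
    observed within one interval of interval_seconds.
    """
    bytes_per_target = Counter()
    for rec in records:
        bytes_per_target[(rec["dst"], rec["proto"])] += rec["bytes"]

    rules = []
    for (dst, proto), total in sorted(bytes_per_target.items()):
        bps = total * 8 / interval_seconds
        if bps > threshold_bps:
            rules.append(
                {
                    "destination": f"{dst}/32",  # host route, IPv4 assumed
                    "protocol": proto,
                    "action": "discard",
                    "observed_bps": bps,
                }
            )
    return rules
```

Each suggested rule could then be submitted to the API automatically, or posted to the chat for a human to approve first. A production detector would need baselines per service rather than one fixed threshold.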
If you would like to see more, check out our recent presentation at the NANOG 71 conference in San José, CA.