Detecting Network Abuse with Automation

Network automation can be used for many things. The other day Jon Hudson posted a question on Twitter: “Can anyone point me to a Troubleshooting example in an automated network?” This was a timely question, as I had just done that exact thing in the DevNet Sandbox network to help deal with a bit of network abuse that we were experiencing.

If you’re not familiar, DevNet Sandbox is a free cloud service open to anyone interested in exploring Cisco APIs or building applications on one of the variety of platforms available. This is an amazing resource that Cisco offers to the community, but it can also be a tempting target for users with less positive motives. While we have designed and built the architecture to limit the potential for abuse, we also want to keep it as open as possible to give our users the freedom to be creative. With this in mind, it should come as no surprise that we occasionally find “misbehaving users” playing in our Sandbox; it certainly doesn’t surprise the Sandbox engineering team.

By now we all recognize that computer systems and networks will always be targets for abuse, so let’s look at an example of using network automation as part of the day-to-day operations of a production cloud service!

Keeping the Sandbox Clean with Network Automation

Hank furiously coding away…

A few weeks ago, we had a new case of abuse pop up within the environment. The original notification actually came from Cisco InfoSec, which manages monitoring at the extreme network edge for Cisco; the Sandbox sits behind them. They noticed excessive outbound SSH traffic from our environment that looked like a network scan in progress from public IPs assigned to DevNet Sandbox. A case like this is very serious, so it was escalated to our network engineering team to research and resolve as soon as possible.

The Initial Research

At any point in time we can have hundreds of sandboxes active within the environment, so the first step was to determine which of the pods was originating the traffic. With hundreds of potential points in the network to check, I needed a place where all the pod traffic came together but was still distinguishable by the pod it came from. In our environment there is a core router that provides the first level of NAT, which was perfect for this.

A quick execution of “show ip nat translations” on the router provided a flood of data that contained the output I needed, but unfortunately also listed a ton of legitimate traffic.

What I needed was a way to sort, organize, and find patterns and groups in the data. There are tons of tools and ways to do this, with and without automation and programmability. For a single point-in-time case like this, my go-to tool for manipulating data is… Excel. So I captured all the output of the command to a text file and used the import tools in Excel to break it down into the key details of the Inside/Outside and Global/Local addresses, along with the ports in play.

With that breakdown done, it was pretty easy to sort the data and find the specific “Inside local” address that had thousands of outbound SSH connections active at a time. That address represented the “Outside” interface on the ASA firewall for the specific pod and sandbox that was causing the problem. And while that was a huge initial step in identifying the source of the abuse, I still needed to narrow it down to the specific virtual machine and user causing it.

Next I logged into the ASA firewall I had identified and obtained the user who had reserved this sandbox. This is simply done by checking the username of the VPN account that was set up when the reservation was enabled. While in there, I shut down the active VPN connection and reset the account password to prevent them from reconnecting. A quick note to a fellow Sandbox team member, and that user was disabled and banned from future reservations.

After that I needed to identify which type of sandbox lab, and which element of the lab, was being used for the attack. For this I once again went back to checking NAT traffic, though in this case it was really PAT traffic. All outbound traffic from the pod used the same outside IP address, so a check of “show xlate” on the ASA helped me quickly see that only one of the lab resources was active at all, and exactly which internal IP address it used. Because I knew which sandbox pod was in use, I also knew the underlying VLAN that was in play. These two pieces of data were all I needed to find the exact virtual machine and shut it down, ending the active attack.

The final steps were to fully document and notify all the interested parties about what was found and the actions we took to end the attack.

Bring on the Automation!

So you might be wondering where the automation comes in. That’s a great question. We immediately knew that if this happened once, it would probably happen again, and we wanted to catch it before Cisco InfoSec called again (that is NOT a call you want to get). Unfortunately, this attack happened right at the beginning of a very busy events period. This meant that I needed an option that could be implemented quickly, with low risk and a low initial investment of time.

Ideally we’d want to update our security policies to prevent the abuse from ever occurring in the first place, but that was going to take more time than we had at the moment (though we have subsequently done so, of course). Instead I set out to implement a bit of code that would monitor for the symptom of the abuse and notify our team should it start up again. This new code had the following simple goals:

Run regularly to quickly search for the trigger condition

Gather all needed data on the event

Send the details out to the full engineering team

The trigger I wanted to watch for was excessive outbound SSH connections coming from our environment. I widened the trigger out to excessive outbound connections in general, knowing that abuse could target more than just SSH. As in my manual troubleshooting, the easiest place to see this trigger and react was going to be the core router where the original NAT translations were found.

Step 1: Getting the NATs

My go-to language for automation is Python, and while we’re working to upgrade the platforms in our infrastructure, much of it still only provides a CLI-based interface. This meant I turned to Netmiko to help with the network connection aspects of the code. Even for simple scripts like this, I build functions to do the key work needed. So my initial function to connect to the device and get the NAT translations looked like this.

This simple bit of code would connect to my device, run the “show ip nat translations” command, and then create a Python list with each line as a new row. A great start, but I still needed a way to break down the individual lines into the relevant bits of data: specifically the Inside/Outside and Global/Local IPs (4 IPs in total), along with the service ports in play. This sounds complicated, but since the table in the output is pretty straightforward, all it took was a series of Python split() calls. If you aren’t familiar with split(), it takes a string input and creates a list by “splitting” the string at certain characters. Whitespace is the default separator, great for word separation, but you can split on any character. With a line of output that looks like this:
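The addresses below are RFC 5737 documentation examples standing in for the real ones, but a NAT table row follows this shape, and a couple of split() calls pull it apart (the parse function is an illustrative sketch):

```python
def parse_nat_line(line):
    """Split one row of 'show ip nat translations' into the protocol,
    the four addresses, and the inside-local IP and port."""
    # First split on whitespace: the protocol plus four "ip:port" columns
    protocol, inside_global, inside_local, outside_local, outside_global = line.split()
    # Then split the column we care about on ":" to separate IP from port
    ip, port = inside_local.split(":")
    return {
        "protocol": protocol,
        "inside_global": inside_global,
        "inside_local_ip": ip,
        "inside_local_port": port,
        "outside_local": outside_local,
        "outside_global": outside_global,
    }

# Example row (addresses are documentation examples, not real Sandbox IPs)
row = "tcp 192.0.2.10:53311   10.10.20.50:53311  203.0.113.5:22     203.0.113.5:22"
print(parse_nat_line(row)["inside_local_ip"])  # 10.10.20.50
```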

It took a couple of runs of trial and error to get the split logic exactly as I wanted it, but in the end it was perfect. I now had a list returned of all the active NATs on the device and could key off of IPs and ports.

Step 2: Detecting Abuse

The next step was to build some logic that would be able to look at the data about NATs and determine if any pods were engaging in abusive behavior. This was going to require a few steps.

Count the number of connections each Inside Local IP address currently had active

Create a list of the devices whose counts were high enough to indicate abuse

The code for this was some fairly straightforward Python counts, loops, and sets. Here are the key bits.
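A sketch of those key bits, including get_prefix(): the connection threshold, the NetBox field names, and the environment variable names here are assumptions for illustration, not the production values.

```python
import os
from collections import Counter

import requests

# Assumed threshold: more simultaneous translations than this from a
# single inside-local IP is treated as suspect
CONNECTION_THRESHOLD = 100

def get_prefix(ip):
    """Query NetBox (our IPAM) for the prefix containing this IP and
    return key details about it, such as the VLAN."""
    response = requests.get(
        f"{os.environ['NETBOX_URL']}/api/ipam/prefixes/",
        params={"contains": ip},
        headers={"Authorization": f"Token {os.environ['NETBOX_TOKEN']}"},
    )
    response.raise_for_status()
    results = response.json()["results"]
    if not results:
        return None
    prefix = results[0]
    return {
        "prefix": prefix["prefix"],
        "vlan": prefix.get("vlan"),
        "description": prefix.get("description"),
    }

def find_abusers(nat_entries, lookup=get_prefix):
    """Count active translations per inside-local IP and return the
    suspects as (ip, count, prefix_details) tuples."""
    counts = Counter(entry["inside_local_ip"] for entry in nat_entries)
    abusers = []
    for ip, count in counts.items():
        if count > CONNECTION_THRESHOLD:
            abusers.append((ip, count, lookup(ip)))
    return abusers
```

The "contains" filter on /api/ipam/prefixes/ is a standard NetBox query; exactly how the Pod is recorded against each prefix will vary by deployment.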

One bit worth mentioning in that code block is the function “get_prefix(ip)”. While the NAT entry tells me the IP address, it doesn’t easily provide which VLAN and Pod the IP in question belongs to. We leverage NetBox as our IPAM server in Sandbox, so this function makes a quick REST API call to NetBox and returns key details about the prefix, such as the VLAN and Pod that the IP belongs to.

Step 3: Notifying the Team of the Issue

In DevNet Sandbox, we aggressively practice ChatOps with Webex Teams, so it only made sense to send the details of any potential abuse to our team there. Luckily, the Webex Teams Python SDK made this super easy. This simple function sends a message off to the indicated room.

Within this code you’ll see that I’m pulling in the TEAMS_TOKEN and TEAMS_ROOMID from os.environ – the environment variables. This is one great way to avoid putting “secrets” directly within your code: simply set them as runtime environment variables and read them in with your code. It makes changing the destination room pretty straightforward as well.
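Supplying those variables at launch time looks something like this (the values and the script name are placeholders):

```shell
export TEAMS_TOKEN="<webex-teams-access-token>"
export TEAMS_ROOMID="<alert-room-id>"
python nat_monitor.py
```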

Step 4: Setting up to Run Regularly

Once I had the full Python script written and tested, I just needed it to run at some regular cadence. I decided every 20 minutes was a good rate to catch any behavior before it got too out of hand. For this, I turned to the tried-and-true “cron” utility built into Linux and set up this job on the server.
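The crontab entry looked something like this (the interpreter and script paths are illustrative):

```shell
# Run the NAT monitor every 20 minutes
*/20 * * * * /usr/bin/python3 /opt/sandbox/nat_monitor.py
```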

Step 5: Reacting to Alerts

With this new code running, we started getting notifications every 20 minutes of any sandbox environment where there were a high number of outbound connections. The message included details on which service ports were actually being used, so we had a bit more context as well. Whenever the team saw an instance that looked like abuse, we would manually remediate by shutting down the services and disabling the account that was in question.

You may be wondering…”why not just automate those steps too?”…and that’s a great question. The main reason was that gathering and processing information from the network is pretty safe and low risk (notice I didn’t say NO risk). But executing the commands to kill VPNs, reset passwords, and shut down VMs… well, those are the types of actions we tend to want to test a bit more before we automate them. Thankfully the abuse wasn’t happening so often that the manual response was too burdensome.

Long-term Solution

While the automation solution we developed worked great for keeping an eye out for misbehaving sandboxes, it clearly wasn’t a good long-term answer. It was, however, a great short-term solution that let us get through a very hectic period of time while we explored our security policy and found an opportunity to tighten it up and prevent the abuse from happening at all. I’m happy to say that we have done that, and instances of this type of abuse have stopped. We know it won’t be the last time we need to tackle a challenge like this, and we’ll be ready when we do.

If you’d like to dive deeper into network automation, here are some great suggestions.