Using Lambda and Datadog to Monitor VNS3 IPsec Tunnels

At Trek10, we use the IPsec VPN functionality of Cohesive Networks’ VNS3 controller to connect our customers’ AWS networks to everything from third-party payment providers to financial market feeds to the classic connection back to the corporate network. For all of our customers, any downtime for an IPsec tunnel results in significant business impact. This is why we built a Lambda function in Python that uses the VNS3 API to check all IPsec tunnels for outages and then posts a custom Datadog metric to notify our 24/7 CloudOps team of any IPsec tunnel issues. In this blog post, we’ll provide you with all of the steps and code to implement this monitor.

Configuring the Lambda Function

AWS Lambda is a serverless compute service from AWS that executes code in response to events or time-based triggers. For this use case, we use Lambda to execute a function each minute to confirm the status of IPsec tunnels. The high-level process flow of the function is as follows:

The Lambda function executes each minute based on the CloudWatch Events Schedule trigger (or, as we like to call it, Lambda Cron). We run the Lambda function from within the VPC instead of over the internet so that we can lock down the admin port (8000) of the VPN controller with a security group rule.

The Lambda function loops through each IPsec tunnel asking for the status: connected or disconnected. The screenshot below of the VNS3 GUI shows the status column, which is what the Lambda function queries using the VNS3 API.

Using the Datadog Metrics API (api.Metric.send), the script posts a 1 if the status of the tunnel is connected and a 0 if the tunnel is disconnected.
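The flow above can be sketched roughly as follows. This is a minimal illustration, not the code from our repository: the shape of the tunnel data (a dict of tunnel IDs, each with a boolean "connected" field) and the "tunnel" tag name are assumptions made for the example. Only the vpn.tunnel.status metric name, the vpn_environment tag, and the 1/0 values come from the description above.

```python
def status_to_point(connected):
    """Map a tunnel's connected flag to the 1/0 value posted to Datadog."""
    return 1 if connected else 0


def build_metrics(tunnels, vpn_env):
    """Build one keyword-argument dict per tunnel for api.Metric.send.

    `tunnels` maps a tunnel ID to a dict with a boolean "connected" field.
    That shape is assumed for this sketch; check the VNS3 API response
    you actually receive and adapt the lookup accordingly.
    """
    metrics = []
    for tunnel_id, info in sorted(tunnels.items()):
        metrics.append({
            "metric": "vpn.tunnel.status",
            "points": status_to_point(info.get("connected", False)),
            "tags": [
                "vpn_environment:{}".format(vpn_env),
                "tunnel:{}".format(tunnel_id),  # hypothetical tag name
            ],
        })
    return metrics


def post_metrics(metrics, send):
    """Post each metric with `send`, for example datadog.api.Metric.send."""
    for kwargs in metrics:
        send(**kwargs)
```

Inside the Lambda handler, you would call datadog.initialize with your API and app keys and then pass api.Metric.send as the send argument. Keeping the sender injectable lets you exercise the 1/0 mapping locally without touching Datadog.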

In November 2016, Lambda added support for environment variables. We use four different environment variables to configure the function across our customers’ VNS3 VPN implementations. The variables are DD_API_KEY, DD_APP_KEY, VPN_, VPN (optional), and VPNENV. More details on these environment variables can be found in the README.
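As an illustration of reading this configuration, the sketch below loads the Datadog keys and VPNENV and fails fast when one is missing. It is a hypothetical helper, not the function's actual code, and it covers only the variables named in full above; the VPN-specific variables are described in the README and are left out here. Treating all three as required is an assumption.

```python
import os

# Variables this sketch treats as required (an assumption for the example).
REQUIRED_VARS = ("DD_API_KEY", "DD_APP_KEY", "VPNENV")


def load_config(environ=os.environ):
    """Return the required settings, raising if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in REQUIRED_VARS}
```

Failing at startup with the names of the missing variables tends to be easier to debug than a later authentication error from Datadog.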

Reaching the Datadog API requires internet access, so your Lambda function must be configured in subnets with internet access, either directly through the IGW (in a public subnet) or through a NAT Gateway (in a private subnet).

To securely allow the Lambda function to access the VNS3 VPN controller, you should allow traffic over port 8000 from a source of the security group you associated with your Lambda function. Once you have configured your security group, you should receive an “OK” upon testing your Lambda function. Now that we have the appropriate metric posted to Datadog, we next need to configure the monitor.
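If you prefer to script the security group rule described above rather than add it in the console, a sketch along these lines builds the arguments for the EC2 authorize_security_group_ingress call in boto3. The function name and the security group IDs you pass in are placeholders for illustration.

```python
def vns3_admin_ingress(controller_sg_id, lambda_sg_id, port=8000):
    """Build keyword arguments for EC2 authorize_security_group_ingress.

    Allows TCP traffic on the VNS3 admin port, but only from members of
    the Lambda function's security group (no CIDR ranges are opened).
    """
    return {
        "GroupId": controller_sg_id,
        "IpPermissions": [{
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "UserIdGroupPairs": [{"GroupId": lambda_sg_id}],
        }],
    }
```

With boto3 available, you would pass the result to boto3.client("ec2").authorize_security_group_ingress as keyword arguments, substituting your own group IDs.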

Configuring the Datadog Monitor

As mentioned previously, the script uses api.Metric.send to post the custom metric to Datadog (a 1 for connected tunnels and a 0 for disconnected tunnels). That call is where most of the Datadog logic in the script resides. Follow the steps below to configure your Datadog monitor:

In the Datadog console, create a new monitor based on the vpn.tunnel.status metric. The vpn.tunnel.status metric is the custom metric being imported into Datadog (a 1 or a 0). We want the min by metric, which means that Datadog will take the minimum value across each of the tunnel’s metrics. For example, if the controller has 30 tunnels configured, each minute has 30 different data points of 1 or 0. If any of those 30 tunnels posts a 0, Datadog uses that value when evaluating the alert conditions.

The metric is over the vpn_environment:production tag. The Lambda function posts the custom metric to Datadog and creates the tag key of vpn_environment. The value of this tag is equal to the VPNENV environment variable in the Lambda function config. If you are using multiple Lambda functions to monitor different VNS3s, then you would change this tag to the appropriate value based on which controller you are monitoring.

If you would like to receive one generic alert if any/all tunnels go down, select Simple Alert. If you would like to receive an alert for each individual tunnel (with the value pulled into the alert), you can choose Multi Alert. The downside to Multi Alert is that if the connection to your IPsec peer(s) drops, you will receive a noisy, separate alert for each tunnel that goes down vs. one generic alert.

To ensure that an alert is triggered if any tunnel goes down even once (a 0 metric), the alert should trigger if any data point is below 1 over the 5-minute evaluation window.
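The settings above correspond roughly to a Datadog monitor query like the one built below. This is a sketch of how the pieces map onto the query string, not an export of our monitor. The helper names are hypothetical, and the group-by tag used for Multi Alert is an assumption that depends on which per-tunnel tag your function posts.

```python
def monitor_query(env, window="last_5m", group_by=None):
    """Build a metric-alert query for vpn.tunnel.status.

    min(<window>) alerts when any data point in the window is below the
    threshold; the inner min takes the lowest value across all tunnels.
    Passing `group_by` yields a Multi Alert (one alert per tag value).
    """
    by = " by {{{}}}".format(group_by) if group_by else ""
    return (
        "min({w}):min:vpn.tunnel.status{{vpn_environment:{e}}}{b} < 1"
        .format(w=window, e=env, b=by)
    )


def should_alert(points):
    """Mimic the alert condition locally: any point below 1 triggers."""
    return bool(points) and min(points) < 1
```

For example, monitor_query("production") gives the Simple Alert form, and should_alert applied to 29 ones and a single zero returns True, matching the 30-tunnel example above. The query string could also be passed to api.Monitor.create in the Datadog Python library if you would rather create the monitor from code than from the console.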

Below is a screenshot of what a monitor might look like in the Datadog console:

Conclusion

Datadog also supports webhooks, which we leverage to generate support tickets for our 24/7 CloudOps team. Using Lambda is a great way to monitor infrastructure, and we use it religiously across Trek10 for many of our systems. We suggest you give it a shot as well!

You can find the code, along with the README, which explains the implementation in more detail, on our GitHub page.