AWS Lambda functions currently have a five-minute execution time limit. That is not a big problem for many functions, but it becomes one when you’re executing a task that has some inherent latency. I created a function that stops all instances, creates snapshots of all attached EBS volumes and starts those instances back up. This was easily feasible in my personal environment, but in larger environments the time it takes to stop all instances and back up all those volumes, without hitting a CreateSnapshot limit, can easily exceed five minutes.

The solution is two-fold.

First, insert an increasing or variable sleep timer between snapshot calls. I had to do this to stay under the CreateSnapshot limit.
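As a rough illustration of the sleep timer, here is a minimal Python sketch. The helper names (snapshot_delays, snapshot_volumes) and the timing values are my own assumptions for the example, not the values from the original function. The ec2 argument is expected to be a boto3 EC2 client, and create_snapshot is the only AWS call used.

```python
import time


def snapshot_delays(count, base=1.0, step=0.5, cap=10.0):
    """Return `count` pauses in seconds that grow by `step` and stop at `cap`.

    The numbers are illustrative assumptions; tune them to your own limits.
    """
    return [min(base + step * i, cap) for i in range(count)]


def snapshot_volumes(ec2, volume_ids, sleep=time.sleep):
    """Snapshot each volume, pausing a little longer after every call.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    `sleep` is injectable so the pacing can be checked without waiting.
    """
    snapshot_ids = []
    for volume_id, delay in zip(volume_ids, snapshot_delays(len(volume_ids))):
        response = ec2.create_snapshot(
            VolumeId=volume_id,
            Description="Scheduled backup of " + volume_id,
        )
        snapshot_ids.append(response["SnapshotId"])
        sleep(delay)  # spread the calls out to stay under the request limit
    return snapshot_ids
```

A randomized delay would fit the "variable" option equally well; the linear increase is just the simplest thing to reason about.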

Second, in order to shut down all the instances properly, create snapshots of their volumes and start the instances back up, I had to split the work into three separate functions and chain them together through the magic of CloudWatch and SNS.

Here’s how it works:

Each function writes its logs to CloudWatch. When you find those logs, you’ll usually see a line starting with “END RequestId” when an invocation has completed. You can create a metric filter on that log group that looks for “END RequestId.” Once that filter is created, you can create an alarm on its metric. The alarm will trigger when the filter records a match and, if configured to do so, it can send a notification to an SNS topic of your choice.
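The filter and alarm can be created in the console, but the same setup can be sketched with boto3. This is only an outline under assumptions: chain_on_completion, the "LambdaChain" namespace and the metric naming are hypothetical choices of mine, while put_metric_filter and put_metric_alarm are the real CloudWatch Logs and CloudWatch client calls. The logs and cloudwatch arguments are expected to be boto3.client("logs") and boto3.client("cloudwatch").

```python
def chain_on_completion(logs, cloudwatch, function_name, topic_arn,
                        namespace="LambdaChain"):
    """Create a metric filter on END RequestId and an alarm that notifies SNS.

    `namespace` and the derived metric name are illustrative assumptions.
    """
    log_group = "/aws/lambda/" + function_name  # default Lambda log group
    metric_name = function_name + "-completed"

    # Count one data point every time an invocation logs its END line.
    logs.put_metric_filter(
        logGroupName=log_group,
        filterName=metric_name,
        filterPattern='"END RequestId"',
        metricTransformations=[{
            "metricName": metric_name,
            "metricNamespace": namespace,
            "metricValue": "1",
        }],
    )

    # Alarm as soon as at least one completion is seen in a one-minute period.
    cloudwatch.put_metric_alarm(
        AlarmName=metric_name,
        Namespace=namespace,
        MetricName=metric_name,
        Statistic="Sum",
        Period=60,
        EvaluationPeriods=1,
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        TreatMissingData="notBreaching",  # lets the alarm settle back to OK
        AlarmActions=[topic_arn],
    )
    return metric_name
```

Calling it once per function in the chain, with the ARN of the topic that triggers the next function, wires up one link.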

The SNS topic can be tied to a Lambda function and acts as the trigger that gets the next function started. Tie the CloudWatch alarm for the function that shuts down instances to the SNS topic that is tied to your backup function. Then go through the same process for the backup function: create a CloudWatch metric filter on its log group with an alarm, and have that alarm notify a second SNS topic.

The second SNS topic should be tied to a Lambda function that will start your instances back up again.
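For the last link in the chain, the function on the receiving end might look something like the sketch below. It assumes the SNS message body is the JSON that CloudWatch alarms publish (with a NewStateValue field), and it simply starts every instance currently in the stopped state, which is a simplification: a real function would probably track only the instances the first function stopped. The names alarm_fired and start_stopped_instances are hypothetical, and pagination of describe_instances is ignored.

```python
import json


def alarm_fired(event):
    """Return True if any SNS record carries an alarm that moved to ALARM."""
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") == "ALARM":
            return True
    return False


def start_stopped_instances(ec2):
    """Start all stopped instances (assumption: all of them belong to the job)."""
    response = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["stopped"]}]
    )
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        ec2.start_instances(InstanceIds=instance_ids)
    return instance_ids


def handler(event, context):
    """Lambda entry point, invoked by the second SNS topic."""
    import boto3  # imported here so the helpers above work without it

    if not alarm_fired(event):
        return []  # ignore OK or INSUFFICIENT_DATA notifications
    return start_stopped_instances(boto3.client("ec2"))
```

Checking NewStateValue keeps the function from running again when the alarm drops back to OK, in case the topic is also subscribed to those notifications.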