Lessons learned from implementing lambda based microservices in the real world

Just about two years ago, we had the opportunity to build a large scale real-time fleet management system for a client.

After some investigation, we decided that the time was right, and the project a good fit, to delve into serverless computing, specifically a serverless microservices architecture based around AWS Lambda and CloudFormation.

As always when starting with a new platform, we had to pull together many bits and pieces of knowledge from various sources as we learned its intricacies. This article describes our current approach as well as the more notable and unexpected things we learned along the way.

Development Environment and Tools

Lambda Node.js 8.10 runtime.

TypeScript for all code, transpiled during deployment.

PostgreSQL and MongoDB databases for data storage.

Docker for local instances of databases and other dependencies to test against.

Jest for tests.

CircleCI for CI and deployment.

We are very happy with these choices and are continuing with them in the future.

General Principles

Testing

When developing a microservices based system, automated testing is even more important than usual, as good test suites allow services to be developed and deployed reliably in isolation.

We’ve followed a three-tiered approach:

Unit test coverage of business logic,

Integration tests running against local instances of databases,

End to end tests running against staging deployments.

To facilitate this, we keep all business logic in classes and modules of its own, apart from our handler functions, which are implemented as very standardised entry points. This allows the vast majority of development to happen locally on a developer’s machine in a TDD way, with the end to end tests run against actual deployments verifying that infrastructure details are correct.
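As a minimal sketch of this separation (the names UserRecord, UserStore, UserService and makeUserHandler are hypothetical and chosen only for illustration), the business logic can be a plain class that knows nothing about Lambda, with the handler reduced to a thin entry point:

```typescript
// Hypothetical domain type, for illustration only.
interface UserRecord {
  id: string;
  email: string;
}

// Hypothetical storage abstraction; tests can supply an in-memory fake.
interface UserStore {
  findById(id: string): Promise<UserRecord | undefined>;
}

// Business logic: no Lambda types, so it can be unit tested locally.
class UserService {
  constructor(private readonly store: UserStore) {}

  async getUser(id: string): Promise<UserRecord> {
    if (!id) {
      throw new Error("id is required");
    }
    const user = await this.store.findById(id);
    if (!user) {
      throw new Error(`user ${id} not found`);
    }
    return user;
  }
}

// Standardised entry point: it only delegates to the business logic.
function makeUserHandler(service: UserService) {
  return async (event: { id: string }): Promise<UserRecord> =>
    service.getUser(event.id);
}
```

Unit tests can then construct UserService with a fake store, while the end to end tests exercise the deployed handler.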

Architecture

Breaking up services

We use SAM with CloudFormation to manage each service as its own stack (although for newer projects we have used the Serverless Framework). Each stack deploys its own lambdas, queues, API gateway and other infrastructure as required.

This approach has worked well, as it provides a clear separation of resources and allows clean management of different services. Dependencies are also well modelled as will be detailed below.

Different deployments

We deal with different deployments (production, staging, etc.) in one account by namespacing stack names depending on the environment being deployed to. So for example, a service called “users” will be deployed as “production-users” or “staging-users” by the deployment scripts.

Values exported by the stacks will be similarly namespaced, so if the users stack exports a variable called “validate-lambda-arn” the full export name would become “production-users-validate-lambda-arn”.
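The naming convention is simple enough to express as two small helpers. This is only a sketch of the scheme described above; the function names and the DeployEnv type are made up for the example.

```typescript
// The environments named in the article; others could be added.
type DeployEnv = "production" | "staging";

// A service called "users" becomes "production-users" or "staging-users".
function stackName(env: DeployEnv, service: string): string {
  return `${env}-${service}`;
}

// Exported values are prefixed with the namespaced stack name.
function exportName(env: DeployEnv, service: string, name: string): string {
  return `${stackName(env, service)}-${name}`;
}

console.log(exportName("production", "users", "validate-lambda-arn"));
// prints "production-users-validate-lambda-arn"
```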

Additionally, we deploy staging or other test environments to a different AWS account. This is because it is currently impossible to safely isolate different lambda microservices in the same account due to global account limits — one very important example being concurrent executions.

To explain a bit further, if there was a runaway execution of a particular function in a staging deployment, it could cause a denial of service to production lambdas due to using up the unreserved concurrent function limit for the account.

Communication between services

Synchronous communication between services is facilitated as follows. Each stack exports the ARNs of lambdas that are callable by other services. Other stacks can then import the ARNs in their own templates, and pass these values into the specific lambdas that need to call them. A benefit of this approach is that service dependencies are explicitly modelled in CloudFormation: if service A depends on service B, it will import a variable exported by service B. Thus, service A will not deploy successfully if the services it explicitly depends on do not exist. Similarly, the deletion of service B will not be allowed while service A exists and depends on it. I’ve included an example of a template with dependencies like this later on.

Of course, in such an architecture synchronous communication should be avoided whenever possible. Asynchronous communication is implemented using an event bus. The event bus is itself implemented as a simple microservice which exports the ARNs of two lambdas — one which allows services to register to be notified of specific events, and another which they can call to post events.

Registration is implemented using Cloudformation custom resources. This allows a service to declaratively specify in its template which events it would like to listen to, and when deployed, the registration function of the event bus is invoked by Cloudformation as part of the stack deployment process.

Similarly, if the service is deleted, the deletion of the Cloudformation stack will invoke the event bus registration function with a deletion event, and the listener will be removed.
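To make the registration flow more concrete, here is a rough sketch of how the registration function might apply a custom resource request to a listener registry. RequestType, ResourceProperties and OldResourceProperties are fields CloudFormation sends to custom resources; the property names EventType and TargetArn, the ListenerRegistry class and the in-memory storage are assumptions for illustration. A real handler would persist the listeners and must also report success or failure back to the ResponseURL in the request.

```typescript
// Hypothetical properties a service would declare in its template.
interface ListenerProperties {
  EventType: string;
  TargetArn: string;
}

// The subset of the CloudFormation custom resource request used here.
interface RegistrationRequest {
  RequestType: "Create" | "Update" | "Delete";
  ResourceProperties: ListenerProperties;
  OldResourceProperties?: ListenerProperties;
}

// In-memory stand-in for wherever the event bus stores its listeners.
class ListenerRegistry {
  private readonly listeners = new Map<string, Set<string>>();

  add(eventType: string, targetArn: string): void {
    const targets = this.listeners.get(eventType) || new Set<string>();
    targets.add(targetArn);
    this.listeners.set(eventType, targets);
  }

  remove(eventType: string, targetArn: string): void {
    const targets = this.listeners.get(eventType);
    if (targets) {
      targets.delete(targetArn);
    }
  }

  targetsFor(eventType: string): string[] {
    return Array.from(this.listeners.get(eventType) || []);
  }
}

// Create adds a listener, Delete removes it, Update swaps old for new.
function applyRegistration(registry: ListenerRegistry, request: RegistrationRequest): void {
  const current = request.ResourceProperties;
  const old = request.OldResourceProperties;
  if (request.RequestType === "Delete") {
    registry.remove(current.EventType, current.TargetArn);
    return;
  }
  if (request.RequestType === "Update" && old) {
    registry.remove(old.EventType, old.TargetArn);
  }
  registry.add(current.EventType, current.TargetArn);
}
```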

The event bus can directly invoke functions, or place events onto an SQS queue specified by the receiving service. The decision of which approach to take depends on whether or not the service cares if it misses events. Because ordering is not guaranteed with this approach, all events contain timestamps and/or incrementing ids where applicable, so that receiving services can check against previously received events if ordering matters.
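A receiving service that cares about ordering could use the incrementing id to discard stale or duplicate events. The sketch below assumes a hypothetical event shape (BusEvent with entityId and sequence fields) and keeps the last seen sequence in memory purely for illustration; since Lambda containers come and go, a real service would keep this in its database.

```typescript
// Hypothetical event shape; field names are assumptions for this example.
interface BusEvent {
  type: string;
  entityId: string; // the thing the event is about
  sequence: number; // incrementing id assigned by the publisher
  timestamp: string; // ISO 8601 time of the event
}

class LatestEventTracker {
  private readonly lastSeen = new Map<string, number>();

  // Returns true and records the event if it is newer than anything seen
  // so far for the same type and entity; returns false if stale or duplicate.
  accept(event: BusEvent): boolean {
    const key = `${event.type}:${event.entityId}`;
    const previous = this.lastSeen.get(key);
    if (previous !== undefined && event.sequence <= previous) {
      return false;
    }
    this.lastSeen.set(key, event.sequence);
    return true;
  }
}
```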

Authentication

There are a few different approaches one could take here. As currently all our interfacing with external systems happens through API Gateway, we authenticate using authorizer Lambdas.

Services that are responsible for a particular type of authentication (for example, external clients, or internal users) export authorizer lambdas that can then be referenced by other services.

Thus, authentication is implemented in the service where it makes sense, but can still be performed as needed at the entry point for any given service. The authorizer lambdas, acting as authenticators, then add information to the call from which services can make their own more finely grained authorization decisions.
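As an illustration, a token-based authorizer can be sketched as a function that turns a token into an IAM policy plus a context object. The event and response shapes follow the API Gateway Lambda authorizer format; the TokenVerifier type, the CallerIdentity fields and the "anonymous" principal are hypothetical stand-ins for whatever the authenticating service actually uses.

```typescript
// Shape of a token authorizer event from API Gateway.
interface TokenAuthorizerEvent {
  type: "TOKEN";
  authorizationToken: string;
  methodArn: string;
}

// Hypothetical identity produced by the authenticating service.
interface CallerIdentity {
  userId: string;
  role: string;
}

interface AuthorizerResult {
  principalId: string;
  policyDocument: {
    Version: "2012-10-17";
    Statement: Array<{
      Action: "execute-api:Invoke";
      Effect: "Allow" | "Deny";
      Resource: string;
    }>;
  };
  // Context values must be strings, numbers or booleans.
  context?: Record<string, string | number | boolean>;
}

// Hypothetical verifier: returns an identity, or undefined if the token is bad.
type TokenVerifier = (token: string) => CallerIdentity | undefined;

function policyFor(effect: "Allow" | "Deny", resource: string): AuthorizerResult["policyDocument"] {
  return {
    Version: "2012-10-17",
    Statement: [{ Action: "execute-api:Invoke", Effect: effect, Resource: resource }],
  };
}

function authorize(event: TokenAuthorizerEvent, verify: TokenVerifier): AuthorizerResult {
  const identity = verify(event.authorizationToken);
  if (!identity) {
    return { principalId: "anonymous", policyDocument: policyFor("Deny", event.methodArn) };
  }
  return {
    principalId: identity.userId,
    policyDocument: policyFor("Allow", event.methodArn),
    // Passed through to the called service for finer grained decisions.
    context: { userId: identity.userId, role: identity.role },
  };
}
```

With a Lambda proxy integration, the called service should then find these context values under requestContext.authorizer in its incoming event.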

Monitoring

The metrics of interest are quite different in this kind of architecture — we’ve found the following to be useful in a CloudWatch dashboard:

Total concurrent executions — a great way to instantly see if something is going awry, or if the dreaded Lambda invocation loop has reared its head.

Lambda errors — we graph errors for every single Lambda in the system as a whole on one graph — this allows a very quick visual overview of any errors that may be occurring.

SQS message ages — a good way to see if some part of the system is not keeping up.

API gateway invocations — help to see how the internal state of the system correlates to the external demands being placed on it.

Total invocations per service — for specific services we are interested in how busy they are. Lambda invocations is a great way to track this.

Alarms can be set on any of the above, depending on your needs and what would constitute concerning behaviour.

Gotchas

As with any platform, the devil is often in the details. And some of the details are very devilish. These are some of the more unexpected and tricky issues we’ve dealt with.

RDS Access

For lambdas to be able to connect to an RDS instance without routing through the external internet, quite a bit of configuration is required. They need to be in a VPC subnet that has access to the RDS server in question and, if required, to the internet. We’ve followed the approach below:

Create a VPC with two subnets. One has a NAT and will host the lambdas. The other will host the RDS instance and may have an internet gateway depending on whether or not you want your RDS instance to be publicly accessible.

Create a security group for the lambdas and one for the RDS instance.

Allow access from the lambda security group to the RDS one.

Make sure your routing is correct for each subnet, sending 0.0.0.0/0 to the NAT or the internet gateway respectively.

Specify the VPC subnet and security group for the lambdas in question in your Cloudformation stacks.

If you require high availability, create additional subnets in each availability zone.

Having done all of this, you may notice that your lambdas are now inexplicably timing out. This is due to the way Lambda deals with the Node.js event loop — as long as there are events waiting, the Lambda’s execution will not complete, even if the callback is called. To deal with this, always close persistent connections (such as to an RDS database) when execution of the Lambda is complete.
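One way to make sure connections are always closed is a small wrapper that closes them in a finally block, so cleanup happens whether the work succeeds or throws. This is a sketch rather than our exact code: ClosableConnection and withConnection are illustrative names, and the only assumption about the database client is that it exposes an end() method returning a promise.

```typescript
// Minimal view of a database client: anything with a promise-returning end().
interface ClosableConnection {
  end(): Promise<void>;
}

// Opens a connection, runs the work, and always closes the connection
// afterwards so that no open socket keeps the event loop busy.
async function withConnection<C extends ClosableConnection, R>(
  connect: () => Promise<C>,
  work: (connection: C) => Promise<R>
): Promise<R> {
  const connection = await connect();
  try {
    return await work(connection);
  } finally {
    await connection.end();
  }
}
```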

There are two further considerations when placing lambdas in a VPC — cold start times and account ENI limits.

Cold start times are not a show stopper, but something to be aware of. When a lambda in a VPC cold starts, there is an additional time penalty of about 2 seconds in our experience.

ENI limits are something that can potentially ambush you. When lambdas are in a VPC, they use ENIs in your account for each concurrent execution. As your ENI limit is by default a lot lower than your lambda concurrency limit (300 vs 1000 at the time of writing), if the majority of your lambdas are in VPCs, you are likely to run out of ENIs before your concurrency limit. This can be very confusing and unexpected if you’re not aware of it.

MongoDB Access

Connections to MongoDB take a long time to initialise, so the approach taken with RDS of always closing connections is not feasible. Fortunately, there is a different solution. Setting context.callbackWaitsForEmptyEventLoop to false will cause your lambdas to complete as soon as the callback is called, regardless of the status of the event loop.

I don’t recommend doing this unless it is required, as it is a good way to introduce subtle bugs in repeated invocations, as events from previous invocations can then complete in the next one that uses the same container. However, for this use case, I believe it is justified.

Using this option, connections to your MongoDB database can persist between invocations of a lambda function. However, the connection may still have died under certain circumstances. So, to complete the functionality:

Check if there is an existing connection.

Check if it is still good.

If not, just create a new connection.
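These three steps can be sketched as a small connection provider kept outside the handler, combined with a callback-style handler that sets context.callbackWaitsForEmptyEventLoop to false. The ReusableConnection interface and its isAlive() check are assumptions standing in for whatever health check your MongoDB client offers, and LambdaContextLike is a cut-down stand-in for the real Lambda context type.

```typescript
// Hypothetical view of a database client with some kind of health check.
interface ReusableConnection {
  isAlive(): Promise<boolean>;
}

// Returns a function that reuses the cached connection while it is healthy
// and transparently creates a new one otherwise.
function makeConnectionProvider<C extends ReusableConnection>(
  connect: () => Promise<C>
): () => Promise<C> {
  let cached: C | undefined;
  return async () => {
    if (cached !== undefined && (await cached.isAlive())) {
      return cached;
    }
    cached = await connect();
    return cached;
  };
}

// Only the part of the Lambda context this sketch needs.
interface LambdaContextLike {
  callbackWaitsForEmptyEventLoop: boolean;
}

type LambdaCallback<R> = (error: Error | null, result?: R) => void;

function makeCachedHandler<C extends ReusableConnection, R>(
  getConnection: () => Promise<C>,
  work: (connection: C) => Promise<R>
) {
  return (_event: unknown, context: LambdaContextLike, callback: LambdaCallback<R>): void => {
    // Complete as soon as the callback fires, leaving the connection open.
    context.callbackWaitsForEmptyEventLoop = false;
    getConnection()
      .then(work)
      .then(
        (result) => callback(null, result),
        (error) => callback(error)
      );
  };
}
```

The provider is created once at module load, so the cached connection survives for as long as the container does.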

Memory leaks

For lambdas that are frequently invoked, their container may persist for a very long time. This means that the same node environment is being used over and over, and even tiny memory leaks will eventually cause them to be terminated or to time out inexplicably.

Because you will often be using lambdas with the lowest memory limit of 128MB, and often using different patterns than you would in a normal Node.js environment (such as the connection closing above), this can show up only after thousands or millions of invocations and can be a difficult problem to diagnose — especially if you’re not looking for it.

One thing to keep an eye out for if you are seeing strange timeouts is whether the lambda is reported as using its maximum memory when terminated — this is often a clue that the reason it timed out was actually due to using all available memory.

One specific instance we saw was that the PostgreSQL library we were using kept references to closed connections, which after enough invocations eventually used up all memory available.

Lambda returns greater than 6MB

Lambda functions can’t return more than 6MB. If you are passing significant amounts of data, you will run into this, probably at the least convenient time.

The solution we adopted is to detect in the standard middleware we include in the handler whether or not the return is greater than 6MB. If so, the payload is uploaded to an S3 bucket instead of returned directly, and the actual response is changed to a standard one indicating that this has happened, and including a signed URL to retrieve the actual payload.

We use standard wrappers around calls to lambdas to detect this response and transparently retrieve the actual payload from S3 so that callers don’t have to be concerned about this detail.
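The two halves of this mechanism might look roughly like the sketch below. The marker field offloadedToS3, the function names and the injected upload and download functions are all assumptions for illustration; in practice upload would write to S3 and return a signed URL, and download would fetch that URL. The limit is a parameter so that some headroom below 6MB can be left.

```typescript
// Limit for synchronous Lambda responses; callers may pass a lower value.
const MAX_RESPONSE_BYTES = 6 * 1024 * 1024;

// Hypothetical standard response indicating the payload was moved to S3.
interface OffloadedResponse {
  offloadedToS3: true;
  url: string;
}

function isOffloaded(value: unknown): value is OffloadedResponse {
  return (
    typeof value === "object" &&
    value !== null &&
    (value as { offloadedToS3?: unknown }).offloadedToS3 === true
  );
}

// Handler side: return the result directly if it fits, otherwise upload it
// and return a pointer. `upload` is assumed to return a signed URL.
async function wrapResponse<T>(
  result: T,
  upload: (body: string) => Promise<string>,
  limit: number = MAX_RESPONSE_BYTES
): Promise<T | OffloadedResponse> {
  const body = JSON.stringify(result);
  if (Buffer.byteLength(body, "utf8") <= limit) {
    return result;
  }
  return { offloadedToS3: true, url: await upload(body) };
}

// Caller side: transparently fetch the real payload when it was offloaded.
async function unwrapResponse<T>(
  response: T | OffloadedResponse,
  download: (url: string) => Promise<string>
): Promise<T> {
  if (isOffloaded(response)) {
    return JSON.parse(await download(response.url)) as T;
  }
  return response as T;
}
```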

Intermittent HTTP / networking errors

We first noticed this with a lambda that was invoked repeatedly at high frequency to add items to an SQS queue. Occasionally — approximately once every 5000 invocations or so — it would time out.

After some investigation, we narrowed this timeout down to waiting for items to be added to the queue. We eventually resolved it by initialising the SQS connection with a timeout of 1 second and a connect timeout of 1 second, and increasing the timeout of the overall lambda slightly, as it seems that a second attempt in the same invocation will succeed.
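In our case the timeouts were configured on the SQS client itself (in the AWS SDK for JavaScript v2 these are, to our knowledge, the httpOptions.timeout and httpOptions.connectTimeout settings) and the retry came from the SDK. The same idea can be sketched generically, without any SDK, as a helper that gives each attempt a short deadline and tries again. The names and default values here are illustrative only.

```typescript
// Rejects if the operation has not settled within timeoutMs.
// Note: the underlying operation is not cancelled, only abandoned.
function withTimeout<T>(operation: () => Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
    operation().then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}

// Tries the operation up to `attempts` times, each with its own deadline,
// and rethrows the last error if every attempt fails.
async function retryWithTimeout<T>(
  operation: () => Promise<T>,
  timeoutMs: number = 1000,
  attempts: number = 2
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await withTimeout(operation, timeoutMs);
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

Bear in mind that the overall lambda timeout has to cover all attempts, and that retrying a send which actually succeeded late can produce a duplicate message.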

We’re not sure at this point if it is unique to HTTP or if all networking in lambdas occasionally exhibits this behaviour. Database connections don’t seem to be affected but it’s possible the libraries are retrying behind the scenes.

Example Template

They say a gist is worth a thousand words, so here’s an example of a CloudFormation template showing lambda ARNs imported from the event bus service, a custom resource being used to add an event listener, lambdas being assigned to VPC subnets and security groups, and a lambda ARN being exported to be called by other services.

Open Questions / Areas For Improvement

As with any software project, the list is endless. Next up for us currently:

Experimenting with SNS for the event bus. Latency is a concern here as the current lambda only implementation generally has about a 200ms latency.

Incorporating AWS IoT for communication with some user devices instead of HTTPS through API gateway.

Looking into canary deployments with this kind of architecture.

That’s It!

If you’ve made it this far, thanks for reading! Many of the points mentioned here could be expanded into articles of their own, so let me know in the comments if you would like to know more about anything specific, you’ve spotted any mistakes, or you have some insights we’ve missed.