Orchestrating Serverless - a step in the right direction

This blog post looks at our experience of using AWS Step Functions to help us migrate customers into a new OVO energy platform. It is not another ‘getting started’ tutorial (there are plenty of those available, including the AWS tutorials); instead it tries to answer the questions I always want answered when starting out with a new piece of tech, like “why this approach?” and “what are the trade-offs?”. Don’t worry, there is some code to look at as well.

One of the benefits of working at OVO in tech is the level of autonomy each team is given, to choose what they consider to be the most appropriate technologies, in an open and supportive culture.

What follows is drawn from our experiences over the last 6 months, going from a blank sheet of paper to a successful migration system.

Rationale

The design decisions made when creating a new software solution are often the most important, as they set the direction of travel, so they are discussed here to provide context for choosing Step Functions.

Our use case was to migrate customers in a managed, controlled and quality-assured way: starting small, learning by doing and scaling up, and running over an extended period of time, which means managing change in the source and sink systems.

The work has these characteristics:

a batch system, with migrations to be run in a timely fashion,

similarities to an Extract, Transform, Load problem,

a workflow problem, orchestrating calls to many different services,

has to deal with an eventually consistent sink system (the new energy platform is built around an async event driven model),

an intermittent workload, migrations are lumpy, because of timing constraints in our eligibility requirements.

We considered the following approaches, listed in order from ‘off the shelf’ to bespoke:

ETL tooling. Discounted because this is not a traditional data store to data store migration: we are migrating into a bespoke system, not a vendor package.

Step Functions (more detail below)

Workflow frameworks. Discounted because there is still a need for considerable bespoke code and frameworks can rapidly become a headwind to work against.

Bespoke services. Although full control is attractive (to devs), there can be too much boilerplate code needed to manage the non-functional requirements around resilience and scalability.

Why we decided on Step Functions:

Easy to get started and to experiment using the online IDE for fast feedback.

Models a workflow problem, a series of steps to be run.

We were keen to leverage the advantages of serverless, particularly with a good match to our intermittent workload and need to scale out easily.

Manages state with input and output for each step

Provides support for unreliable networked services with error handling and retries.

Steps are implemented in bespoke code, giving good flexibility and focus on the business logic, using a range of supported languages (in our case mix of Node.js and Python)

Better control than a distributed set of event driven services where the system behaviour can become complex and difficult to reason about.

Support for longer running workloads (the Step Function limit is 1 year!), to overcome the 15 min limit of AWS Lambda.

What does it look like

A Step Function is an AWS resource defined by JSON configuration written in the Amazon States Language. This consists of states and the transitions between them. States include:

tasks, to do some work: either a Lambda, an activity (implemented on EC2 or elsewhere) or a service integration. We worked with Lambdas only.

choice, to branch based on input

wait, to delay before moving to the next state

parallel, to run more than one execution path

succeed and fail, to end an execution as a success or a failure
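To make this concrete, here is a minimal sketch of a definition using a task, a choice, a wait and the two terminal states. The state names, field values and Lambda ARN are illustrative, not taken from our migration:

```json
{
  "Comment": "Minimal illustration of the states language",
  "StartAt": "Extract",
  "States": {
    "Extract": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:extract",
      "Next": "IsEligible"
    },
    "IsEligible": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.eligible", "BooleanEquals": true, "Next": "WaitForPlatform" }
      ],
      "Default": "MigrationFailed"
    },
    "WaitForPlatform": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "MigrationSucceeded"
    },
    "MigrationSucceeded": { "Type": "Succeed" },
    "MigrationFailed": { "Type": "Fail", "Error": "NotEligible" }
  }
}
```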

The migration Step Function to run a single migration of one customer is shown in figure 1. It has been reduced from the original 17 states for the sake of space.

Error handling

Most of the effort goes into specifying how errors should be handled, in our case:

Retry is used to catch temporary errors. It is worth thinking about the different types of errors a Lambda task can raise and wrapping those that can be retried into common error names. Note that the retry delay, count and backoff rate can all be specified.

Catch is used where errors that cannot be recovered from are encountered; here we branch to separate tasks to enable a rollback of the migration.

Note that all tasks have error handling except the last. If no error handling is specified then the Step Function ends at that task. It is cleaner to ensure that all successes and failures arrive at the last step, where the code decides, based on its input, what counts as a failure and throws an error to mark the Step Function as failed.
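As a sketch, a task combining both mechanisms might look like the following. The error name RetryableError, the state names and the ARN are illustrative:

```json
"Transform": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:transform",
  "Retry": [
    {
      "ErrorEquals": ["RetryableError"],
      "IntervalSeconds": 10,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.transform.stepError",
      "Next": "Rollback"
    }
  ],
  "Next": "Load"
}
```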

Sync vs async tasks

The tasks either call REST endpoints synchronously or send a Kafka message to the new energy platform. A key task is to check whether the customer balance is as predicted. As the new platform is eventually consistent, we use a polling approach: in the CheckBalances task the Lambda raises a custom error if the balance does not match, and a Retry is used to loop around after a delay.

Replay

Another consideration is how to recover from or replay partially failed Step Functions. Our solution is to have an automated back-out defined in the states. The intention is that if we hit an error we can rerun the Step Function at a later time without manual intervention.

Interfaces

The input is provided at execution time and in its simplest form is a single account number. In figure 2 each step is instructed to store its output against a separate key, so for example the Extract task stores output against key $.extract. If an error occurs then the output is stored against key $.extract.stepError. This ensures that the output is captured in all cases, which is useful for auditing and logging, and provides a record of all the resources created.

We found it helpful to adopt a clear convention to isolate the output of each step so that as the Step Function evolves new tasks do not interfere with the output of existing tasks.
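As an illustration of the convention (all values invented), the accumulated execution state after a failed load might look like:

```json
{
  "accountNumber": "A-12345",
  "extract": {
    "balance": 100,
    "resourcesCreated": ["s3://migration-bucket/extract/A-12345.json"]
  },
  "load": {
    "stepError": {
      "Error": "LoadFailed",
      "Cause": "timeout calling the new platform"
    }
  }
}
```

Because each task writes only under its own key, adding or reordering tasks never clobbers another task's record.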

Triggering the Step Function

We wanted to trigger our migration Step Function in an automated but controlled manner, with the ability to stop migrations after a certain number of failures and to set the concurrency at run time. What better way than another Step Function!

Figure 4. Trigger Step Function Definition

Here a Choice and Wait state are used to react to output from the task and decide to continue or not.

The trigger Step Function receives input determining how many concurrent migrations are allowed, along with the number of failures/rollbacks tolerated before stopping. It maintains the migration running state in DynamoDB and reads an SQS queue to trigger migrations via the code in figure 5.
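The decision the trigger makes can be sketched as a small pure function (parameter names assumed, not from our production code). In the real system the counts come from DynamoDB and the Choice state branches on the returned value:

```python
def trigger_decision(running, failures, max_concurrent, max_failures):
    """Decide the trigger Step Function's next move.

    running        - migrations currently in flight (tracked in DynamoDB)
    failures       - failed or rolled-back migrations so far
    max_concurrent - concurrency limit supplied as execution input
    max_failures   - stop-the-line threshold supplied as execution input
    """
    if failures >= max_failures:
        return "STOP"             # Choice routes to a Fail state
    if running >= max_concurrent:
        return "WAIT"             # Choice routes to a Wait state, then loops
    return "START_MIGRATION"      # task reads SQS and starts an execution
```

Keeping this as plain code (rather than encoding the thresholds in the states language) keeps the business rules in the task, where they can be unit tested.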

Other considerations

The following, listed in no particular order, are some other considerations when using Step Functions.

How to start

It is easy to experiment with Step Functions using the web console to create and edit the configuration. I strongly suggest creating one or more simple Lambdas, again using the web console, to hook up to a Step Function to experiment with the states language and to understand how input and output are propagated.
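For example, a do-nothing Lambda like the following (names invented) is enough to observe how state flows between tasks; wire two copies into consecutive task states and inspect the execution history:

```python
def handler(event, context=None):
    # Echo the input back, stamping it so you can see which task ran.
    # With the default output handling, the next state receives this
    # entire dict as its input.
    return {**event, "touchedBy": "experiment-lambda"}
```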

Continuous Deployment

Defining a Step Function in the web console is helpful for exploratory development, but eventually you will want to move this into source control and be able to deploy using a CD pipeline.

This requires an AWS CloudFormation template, with the Step Function definition stored as a JSON string. As this is hard to read and maintain, an intermediate transpiling step is needed. We use a custom Python script leveraging the troposphere library; alternative transpiling tools exist as well.
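We use troposphere, but the core idea can be shown with the standard library alone: keep the definition as a readable Python dict and serialise it into the template's DefinitionString. Resource names and ARNs here are illustrative:

```python
import json

# A trimmed, illustrative state machine definition kept as a Python dict,
# which is far easier to read and diff than an escaped JSON string.
definition = {
    "StartAt": "Extract",
    "States": {
        "Extract": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:extract",
            "End": True,
        }
    },
}


def build_template(definition, role_arn):
    """Render a CloudFormation template embedding the definition as a string."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "MigrationStateMachine": {
                "Type": "AWS::StepFunctions::StateMachine",
                "Properties": {
                    "DefinitionString": json.dumps(definition),
                    "RoleArn": role_arn,
                },
            }
        },
    }
```

The script then writes `json.dumps(build_template(...))` to a file for the CD pipeline to deploy.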

Testing

We have not found a satisfactory way to automate testing of the logic in the Step Function itself, and so rely on manual testing when changes are made. This means deploying a branch version to AWS and triggering it manually via the web console. On the whole the logic is focussed on different failure modes, and these are difficult to cover with automated tests.

Reliance on manual testing is one of the reasons why you should guard against letting business logic leak into the Step Function configuration. Keep business logic in the tasks where it can be more effectively tested.

Task size

To take advantage of the error handling and retry logic it is important that tasks are, where possible, designed to be idempotent. In our case this means not combining multiple workflow steps into one task. However, there are always trade-offs, including efficiency and code duplication. Think about how to share code between Lambdas (without getting into the “coupling through libraries is bad” debate), to encourage small, focused tasks.
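A sketch of what we mean by an idempotent task (hypothetical names, with an in-memory dict standing in for the platform): a retried invocation detects work already done instead of repeating it:

```python
_created = {}  # stand-in for a platform lookup or DynamoDB table


def create_customer(event, context=None):
    account = event["accountNumber"]
    if account in _created:
        # A previous attempt already succeeded; safe to return the same result.
        return {"customerId": _created[account], "alreadyExisted": True}
    customer_id = f"cust-{account}"  # stand-in for the real create call
    _created[account] = customer_id
    return {"customerId": customer_id, "alreadyExisted": False}
```

With this check-before-create shape, a Retry rule can re-run the task freely: the second attempt converges on the same state rather than creating a duplicate customer.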

Parallel execution

We tried to optimise the workflow by running some state transitions in parallel branches, but we encountered complexity trying to record the output from all the tasks. Once a task fails in one of the parallel branches, the Step Function jumps to the next state and abandons the other, still-running branches.

Output and logging

The input and output of each task are recorded in the Step Function execution history. However the retention is not configurable: the history is kept for 90 days before being automatically deleted.

If this information needs to be persisted for the longer term, for example to satisfy audit requirements, then it must be stored elsewhere. In our case there is a task at the end of the Step Function to store state to S3 (see Figure 1).

In addition, the output cannot be deleted without deleting the Step Function itself. In our case care needed to be taken with personally identifiable data in the Step Function state, as this may need to be removed on request, e.g. for GDPR. Our solution is to pass the more sensitive data between states via S3 buckets, with a lifecycle rule to remove the data after a short period.
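As an illustration, an S3 lifecycle configuration of the kind we mean (the prefix and retention period are invented) looks like:

```json
{
  "Rules": [
    {
      "ID": "ExpireSensitiveMigrationState",
      "Filter": { "Prefix": "migration-state/" },
      "Status": "Enabled",
      "Expiration": { "Days": 7 }
    }
  ]
}
```

The Step Function state then carries only the S3 key, never the sensitive payload itself.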

Wrap up

Not mentioned so far is the compute cost. Because of the sparse workload, the serverless solution has cost just over 2p per migrated customer, averaged over a 6 month period.

It is worth mentioning the Step Function service integrations that have recently been announced, including new tasks to interact with SQS or DynamoDB without writing Lambda code.

Overall we have been happy with the utility of Step Functions. It has worked well for our use case and is another tool enabling the serverless approach.