This is a guest post by Richard Freeman, Ph.D., a solutions architect and data scientist at JustGiving. JustGiving in their own words: “We are one of the world’s largest social platforms for giving that’s helped 27.7 million users in 196 countries raise $4.1 billion for over 27,000 good causes.”

After the data is stored in DynamoDB, further systems can process the data as a stream; we persist the data to S3 via Amazon Kinesis Firehose using another Lambda function. This gives us a truly serverless environment, where all the infrastructure, including the integration, connectors, security, and scalability, is managed by AWS, and allows us to focus only on the stream transformation logic rather than on code deployment, systems integration, or platform maintenance.

This post shows you how to process a stream of Amazon Kinesis events in one AWS account, and persist it into a DynamoDB table and Amazon S3 bucket in another account, using only fully managed AWS services and without running any servers or clusters. For this walkthrough, I assume that you are familiar with DynamoDB, Amazon Kinesis, Lambda, IAM, and Python.

Overview

Amazon Kinesis Streams is a fully managed service that can continuously capture and store terabytes of data per hour from hundreds of thousands of sources, such as website clickstreams, financial transactions, social media feeds, web logs, sensors, and location-tracking events.

Working with a production environment and a proof of concept (POC) environment that are in different AWS accounts is useful if you have web analytics traffic written to Amazon Kinesis in production and data scientists need near real-time and historical access to the data in another non-operational, non-critical, and more open AWS environment for experiments. In addition, you may want to persist the Amazon Kinesis records in DynamoDB without duplicates.

The following diagram shows how the two environments are structured.

The production environment contains multiple web servers running data producers that write web events to Amazon Kinesis, which causes a Lambda function to be invoked with a batch of records. The Lambda function (1) assumes a role in the POC account, and continuously writes those events to a DynamoDB table without duplicates.

After the data is persisted in DynamoDB, further processing can be triggered via DynamoDB Streams and invoke a Lambda function (2) that could be used to perform streaming analytics on the events or to persist the data to S3 (as described in this post). Note that the Lambda function (1) can also be combined with what the function (2) does, but we wanted to show how to process both Amazon Kinesis stream and DynamoDB Streams data sources. Also, having the second function (2) allows data scientists to project, transform, and enrich the stream for their requirements while preserving the original raw data stream.

Setting up the POC environment

The POC environment is where we run data science experiments, and is the target environment for the production Amazon Kinesis data. For this example, assume the POC account number is 999999999999.

Create a target table in the POC account

This DynamoDB table is where the Amazon Kinesis events will be written. To create the table, here is an example using the AWS console:

Under Tables, select the table prod-raven-lambda-webanalytics-crossaccount and configure the following:

Choose Capacity.

Under Provisioned Capacity:

For Read capacity units, enter 20.

For Write capacity units, enter 20.

Choose Save.

Notes:

Optionally, depending on your records, you can also create a global secondary index if you want to search a subset of the records by user ID or email address; this is very useful for QA in development and staging environments.

Provisioned throughput capacity should be set based on the number of events expected to be read and written per second. If the reads or writes exceed this capacity for a period of time, throttling occurs and writes must be retried. There are many strategies to deal with this, including setting the capacity dynamically based on the currently consumed read or write throughput, or using exponential backoff retry logic.

DynamoDB is a schemaless database, so no other fields or columns need to be specified up front. Each item in the table can have a different number of elements.
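The console steps above can also be sketched in Boto 3. The key names below (seqhash_ as the hash key and seqnum_ as the range key, both derived from the Kinesis sequence number as discussed later in this post) are hypothetical placeholders, not the exact attribute names used at JustGiving:

```python
# Hypothetical table spec mirroring the console steps above.
# Key names and region are illustrative assumptions.
TABLE_SPEC = {
    "TableName": "prod-raven-lambda-webanalytics-crossaccount",
    "KeySchema": [
        {"AttributeName": "seqhash_", "KeyType": "HASH"},
        {"AttributeName": "seqnum_", "KeyType": "RANGE"},
    ],
    "AttributeDefinitions": [
        {"AttributeName": "seqhash_", "AttributeType": "S"},
        {"AttributeName": "seqnum_", "AttributeType": "S"},
    ],
    "ProvisionedThroughput": {
        "ReadCapacityUnits": 20,
        "WriteCapacityUnits": 20,
    },
}

def create_poc_table():
    """Create the target table in the POC account (requires AWS credentials)."""
    import boto3  # imported here so the spec can be inspected without the SDK
    return boto3.client("dynamodb", region_name="eu-west-1").create_table(**TABLE_SPEC)
```

Because DynamoDB is schemaless, only the key attributes appear in the definition; all other fields arrive with the items themselves.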

Create an IAM policy in the POC account

First, create an IAM role that can be used by the Lambda functions to write to the DynamoDB table.

Select the policy created earlier, prod-raven-lambda-webanalytics-crossaccount.

Choose Review, Create Role.

This role will be created with an Amazon Resource Name (ARN), for example arn:aws:iam::999999999999:role/DynamoDB-ProdAnalyticsWrite-role. You will now see that the policy is attached and the trusted account ID is 111111111111.
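As a rough code equivalent of these console steps, the role and its trust relationship could be created with Boto 3. The policy documents below are an illustrative sketch of the minimum needed (a trust policy for the production account, and write access to the target table), not the exact policies from the post:

```python
import json

POC_ACCOUNT = "999999999999"
PROD_ACCOUNT = "111111111111"

# Trust policy: lets principals in the production account assume this role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::%s:root" % PROD_ACCOUNT},
        "Action": "sts:AssumeRole",
    }],
}

# Permissions policy: write access to the target table only.
WRITE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["dynamodb:PutItem", "dynamodb:BatchWriteItem"],
        "Resource": "arn:aws:dynamodb:eu-west-1:%s:table/"
                    "prod-raven-lambda-webanalytics-crossaccount" % POC_ACCOUNT,
    }],
}

def create_write_role():
    """Create the cross-account role in the POC account (requires IAM rights)."""
    import boto3
    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName="DynamoDB-ProdAnalyticsWrite-role",
        AssumeRolePolicyDocument=json.dumps(TRUST_POLICY),
    )
    iam.put_role_policy(
        RoleName="DynamoDB-ProdAnalyticsWrite-role",
        PolicyName="prod-raven-lambda-webanalytics-crossaccount",
        PolicyDocument=json.dumps(WRITE_POLICY),
    )
    return role["Role"]["Arn"]
```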

Setting up the production environment

In the production environment, you use a Lambda function to read, parse, and transform the Amazon Kinesis records and write them to DynamoDB. Access to Amazon Kinesis and DynamoDB is fully managed through IAM policies and roles. At this point, you need to sign out of the POC account and sign in as a user with IAM administration rights in the production account 111111111111.

Create a Lambda policy

Create a policy that allows a production user to assume the role of a user in the POC account.

In the production environment, open the IAM console and choose Policies.

Choose Create Policy, Create Your Own Policy.

For Policy Name, enter prod-raven-lambda-assumerole.

For Description, enter “This policy allows the Lambda function to execute, write to CloudWatch and assume a role in POC” or similar text.

For Batch Size, enter 300 (this depends on the frequency that events are added to Amazon Kinesis).

For Starting position, choose Trim horizon.

4. Configure the function:

For Name, enter ProdKinesisToPocDynamo.

For Runtime, choose Python 2.7.

For Edit code inline, add the Lambda function source code (supplied in the next section).

For Handler, choose lambda_function.lambda_handler.

For Role, choose the role that you just created, LambdaKinesisDynamo.

For Memory, choose 128 (this depends on the frequency that events are added to Amazon Kinesis).

For Timeout, choose 2 min 0 sec (this depends on the frequency that events are added to Amazon Kinesis).

5. Review the configuration:

For Enable event source, choose Enable later (this can be enabled later via the Event sources tab).

6. Choose Create function.

Create the Lambda function code

At the time of this post, Lambda functions support Python, Node.js, and Java. For this task, you implement the function in Python with the following basic logic:

Assume a role that allows the Lambda function to write to the DynamoDB table. Manually configure the DynamoDB client to use the newly-assumed cross account role.

Convert Amazon Kinesis events to the DynamoDB format.

Write the record to the DynamoDB table.

Assume a cross account role for access to the POC account DynamoDB table

This code snippet uses the AWS Security Token Service (AWS STS), which enables the creation of temporary IAM credentials for an IAM role. The assume_role() function returns a set of temporary credentials (consisting of an access key ID, a secret access key, and a security token) that can be used to access the DynamoDB table via the IAM role arn:aws:iam::999999999999:role/DynamoDB-ProdAnalyticsWrite-role.
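A minimal sketch of this assume-role step follows; the session name and region are placeholder assumptions:

```python
POC_ROLE_ARN = "arn:aws:iam::999999999999:role/DynamoDB-ProdAnalyticsWrite-role"

def poc_dynamodb_client():
    """Return a DynamoDB client backed by temporary credentials for the POC role."""
    import boto3
    # assume_role() returns temporary credentials for the cross-account role.
    creds = boto3.client("sts").assume_role(
        RoleArn=POC_ROLE_ARN,
        RoleSessionName="prod-to-poc-writer",  # hypothetical session name
    )["Credentials"]
    # Manually configure the client with the assumed-role credentials.
    return boto3.client(
        "dynamodb",
        region_name="eu-west-1",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
```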

You can now use any DynamoDB methods in the AWS SDK for Python (Boto 3) to write to DynamoDB tables in the POC environment, from the production environment.

To test the code, I recommend that you create a test event. The simplest way to do this is to print the event from Amazon Kinesis in the Lambda function, copy it from CloudWatch Logs, and convert it into valid JSON. You can then use it when you select the Lambda function and choose Actions, Configure Test Event. After you are happy with the records in DynamoDB, you can choose Event Source, Disable to disable the mapping between the Lambda function and Amazon Kinesis.

Here’s the Python code to write an Amazon Kinesis record to the DynamoDB table:
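The following is a minimal sketch of such a handler. The attribute names (seqhash_, seqnum_), the payload-to-attribute mapping, and the session name are illustrative assumptions; only the overall shape (decode base64, derive idempotent keys from the sequence number, put_item into the cross-account table) follows the post:

```python
import base64
import json

def kinesis_record_to_item(record):
    """Convert one Amazon Kinesis record into a DynamoDB item with idempotent keys."""
    seq = record["kinesis"]["sequenceNumber"]
    payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
    item = {
        "seqhash_": {"S": seq[-3:-1]},  # last-two-minus-one digits: better spread
        "seqnum_": {"S": seq},          # full sequence number as the range key
    }
    # Store each payload field as a string attribute (hypothetical mapping).
    for key, value in payload.items():
        item[key] = {"S": str(value)}
    return item

def lambda_handler(event, context):
    import boto3
    # Assume the cross-account role and build a DynamoDB client from its credentials.
    creds = boto3.client("sts").assume_role(
        RoleArn="arn:aws:iam::999999999999:role/DynamoDB-ProdAnalyticsWrite-role",
        RoleSessionName="prod-to-poc-writer",  # hypothetical session name
    )["Credentials"]
    dynamodb = boto3.client(
        "dynamodb", region_name="eu-west-1",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    for record in event["Records"]:
        dynamodb.put_item(
            TableName="prod-raven-lambda-webanalytics-crossaccount",
            Item=kinesis_record_to_item(record),
        )
    return "processed {0} records".format(len(event["Records"]))
```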

The first part of the source snippet iterates over the batch of events that were sent to the Lambda function as part of the Amazon Kinesis event source, and decodes each record from base64. The fields used for the DynamoDB hash and range keys are then extracted directly from the Amazon Kinesis record. This avoids storing duplicate records, makes the processing idempotent, and allows rapid retrieval without the need for a full table scan (more on this later). Each record is then converted into the DynamoDB format, and finally written to the target DynamoDB table with the put_item() function.

As the data is now persisted in the POC environment, data scientists and analysts are free to experiment on the latest production data without having access to production or worrying about impacting the live website.

Records in the POC DynamoDB table

Now I’ll give technical details on how you can ensure that the DynamoDB records are idempotent, and different options for querying the data. The pattern I describe can be used more widely, as it works with any Amazon Kinesis records. The idea is to use the Amazon Kinesis sequence number, which is unique per stream, as the range key, and a subset of it as a composite hash key. In our experiments, we found that taking the last-two-minus-one digits (for example, the 80 in a sequence number ending in 802) gave a better distribution than the last two (02).

This schema ensures that you do not have duplicate Amazon Kinesis records, without the need to maintain any state or position in the client, or change the records in any way, which is well suited for a Lambda function. For example, if you deleted the Amazon Kinesis event source of the Lambda function and added a new one an hour later with a trim horizon, the existing DynamoDB rows would simply be overwritten, as the records would have identical hash and range keys. Avoiding duplicate records is very useful for any incremental loads into other systems, reduces the deduplication work needed downstream, and allows us to readily replay Amazon Kinesis records.

An example of a populated DynamoDB table could look like the following graphic.

Notice that the composite key of hash and range is unique, and that the other fields captured can be useful in understanding the user clickstream. For example, if you run a query on the useremail_ attribute and order by nanotimestamp_, you obtain the clickstream of an individual user. However, this means that DynamoDB has to read the whole table in what is known as a table scan, because you can only query on hash and range keys.

One solution is to use a global secondary index, as discussed earlier; however, writes to the main table might be throttled if the index writes are throttled (AWS is working on an asynchronous update process). This can happen if the hash/range keys are not well distributed. If you need to use the index, one solution is to use a scalable pattern with global secondary indexes.

Another way is to use DynamoDB Streams as an event source to trigger another Lambda function that writes to a table specifically used for querying by user email, e.g., the hash could be useremail_ and the range could be nanotimestamp_. This also means that you are no longer limited to five indexes per main table and the throttling issues mentioned earlier do not affect the main table.

Using DynamoDB Streams is also more analytics-centric as it gives you dynamic flexibility on the primary hash and range, allowing data scientists to experiment with different configurations. In addition, you might use a Lambda function to write the records to one table for the current week and another one for the following week; this allows you to keep the most recent hot data in a smaller table, and older or colder data in separate tables that could be removed or archived on a regular basis.
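As a small sketch of that hot/cold rotation idea, the second Lambda function could derive the target table name from the record timestamp; the naming pattern here is a hypothetical choice:

```python
import datetime

def weekly_table_name(ts, prefix="poc-raven-webanalytics"):
    """Route a record to a per-ISO-week table so hot data stays in a small table."""
    year, week, _ = ts.isocalendar()
    return "{0}-{1}-W{2:02d}".format(prefix, year, week)
```

Older weekly tables can then be archived or dropped on a schedule, which is far cheaper than deleting individual items.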

Persisting records to S3 via Amazon Kinesis Firehose

Now that you have the data in DynamoDB, you can enable DynamoDB Streams to obtain a stream of events which can be fully sourced and processed in the POC. For example, you could perform time series analysis on this data or persist the data to S3 or Amazon Redshift for batch processing.

To write to S3, use a Lambda function that reads the records from DynamoDB Streams and writes them to Amazon Kinesis Firehose. Other similar approaches exist but they rely on having a running server with custom deployed code. Firehose is a fully managed service for delivering real-time streaming data to destinations such as S3.

First, set up an IAM policy and associate it with an IAM role so that the Lambda function can write to Firehose.

First, the Lambda function iterates over all records and values, and flattens the JSON data structure from the DynamoDB format to plain JSON records; for example, ‘event’: {‘S’: ‘page view’} becomes ‘event’: ‘page view’. Each record is then encoded as JSON and sent to Firehose.
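A minimal sketch of such a function follows; the delivery stream name and region are placeholder assumptions:

```python
import json

def flatten(dynamo_item):
    """Collapse DynamoDB's typed format: {'event': {'S': 'page view'}} -> {'event': 'page view'}."""
    return {key: list(typed.values())[0] for key, typed in dynamo_item.items()}

def lambda_handler(event, context):
    import boto3
    firehose = boto3.client("firehose", region_name="eu-west-1")
    for record in event["Records"]:
        # Only forward new or changed items from the DynamoDB stream.
        if record["eventName"] in ("INSERT", "MODIFY"):
            flat = flatten(record["dynamodb"]["NewImage"])
            firehose.put_record(
                DeliveryStreamName="poc-raven-firehose",  # hypothetical name
                Record={"Data": json.dumps(flat) + "\n"},
            )
```

Appending a newline to each record keeps the S3 objects line-delimited, which simplifies later batch processing.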

For testing, you can log events from your Amazon Kinesis stream by printing them as JSON to standard out; Lambda automatically delivers this information into CloudWatch Logs. This information can then be used as the test event in the Lambda console.

Configuring the Amazon Kinesis Firehose delivery stream

For S3 bucket, choose Create new S3 bucket or Use an existing S3 bucket.

For S3 prefix, enter prod-kinesis-data.

Choose Next to change the following Configuration settings.

For Buffer size, enter 5.

For Buffer interval, enter 300.

For Data compression, choose UNCOMPRESSED.

For Data Encryption, choose No Encryption.

For Error Logging, choose Enable.

For IAM role, choose Firehose Delivery IAM Role. This opens a new window: Amazon Kinesis Firehose is requesting permission to use resources in your account.

For IAM Role, choose Create a new IAM role.

For Role Name, choose firehose_delivery_role.

After the policy document is automatically generated, choose Allow.

Choose Next.

Review your settings and choose Create Delivery Stream.

After the Firehose delivery stream is created, select the stream and choose Monitoring to see activity. The following graphic shows some example metrics.

After the DynamoDB Streams event source for the Lambda function is enabled, new objects with production Amazon Kinesis records passing through the POC DynamoDB table are automatically created in the POC S3 bucket using Firehose. Firehose can also be used to import data to Amazon Redshift or Amazon Elasticsearch Service.

Benefits of serverless processing

Using a Lambda function for stream processing enables the following benefits:

Flexibility – Analytics code can be changed on the fly.

High availability – Runs in multiple Availability Zones in a region.

Zero maintenance – Upgrades and maintenance of the underlying instances and services are handled by AWS.

Security – No use of keys or passwords. IAM roles can be used to integrate with Amazon Kinesis Streams, Amazon Kinesis Firehose, and DynamoDB.

Serverless compute – There are no EC2 instances or clusters to set up and maintain.

Automatic scaling – The number of concurrent Lambda function invocations changes depending on the number of writes to Amazon Kinesis. Note that the upper concurrency limit is the number of shards in the Amazon Kinesis stream and DynamoDB stream.

Low cost – You pay only for the execution time of the Lambda function.

Ease of use – Easy selection of language for functions, and very simple functions. In these examples, the functions just iterate over and parse JSON records.

Summary

In this post, I described how to persist events from an Amazon Kinesis stream into DynamoDB in another AWS account, without needing to build and host your own server infrastructure or using any keys. I also showed how to get the most out of DynamoDB, and proposed a schema that eliminates duplicate Amazon Kinesis records and reduces the need for further de-duplication downstream.

In a few lines of Python code, this pattern has allowed the data engineers at JustGiving the flexibility to project, transform, and enrich the Amazon Kinesis stream events as they arrive. Our data scientists and analysts now have full access to the production data in near-real time but in a different AWS account, allowing them to quickly undertake streaming analysis, project a subset of the data, and run experiments.

I also showed how to persist that data stream to S3 using Amazon Kinesis Firehose. This allows us to quickly access the historical clickstream dataset, without needing a permanent server or cluster running.

If you have questions or suggestions, please comment below. You may also be interested in seeing other material that digs deeper into the capabilities of Lambda and DynamoDB.

The full code samples are available in the JustGiving GitHub repository: https://github.com/JustGiving/ Note that the code is Copyright (c) 2015-2016 Giving.com Ltd, trading as JustGiving, or its affiliates, all rights reserved, and is licensed under the Apache License, Version 2.0.