Create an ETL solution using AWS Step Functions, Lambda and Glue

AWS Glue is a fully managed Extract, Transform and Load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can look up further details for AWS Glue in the official documentation.

AWS Glue Crawlers

A crawler can crawl multiple data stores in a single run. After completion, the crawler creates or updates one or more tables in your Data Catalog. Extract, Transform and Load (ETL) jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.

For a deep dive into AWS Glue crawlers, please go through the official docs.

Creating an Activity-based Step Function with Lambda, Crawler and Glue

In my previous article, I created a role that can access Step Functions, Lambda, S3, and CloudWatch Logs. Edit the role so that it can also access Glue. To do that:

Go to IAM console -> Roles and select the created role

Click on Attach policies and add AWSGlueConsoleFullAccess

The role now has access to Lambda, S3, Step Functions, Glue and CloudWatch Logs.
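If you prefer to script this step, a minimal boto3 sketch could look like the following; the role name step-functions-etl-role is a placeholder for the role created in the previous article.

import boto3

iam = boto3.client("iam")

# Attach the managed Glue policy to the existing role.
# "step-functions-etl-role" is a placeholder role name.
iam.attach_role_policy(
    RoleName="step-functions-etl-role",
    PolicyArn="arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess",
)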

We shall build an ETL processor that converts data from csv to parquet and stores the result in S3. For high-volume data storage, the columnar parquet format is preferable to CSV, but in most cases the incoming data arrives in csv/txt format. For this use case, incoming data is dumped into a particular location in S3 (s3://file-transfer-integration/raw/data/) in csv/txt format.
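To give the crawler something to discover, you can drop a sample file into the raw location; a quick boto3 sketch, where sample.csv is just an example file name:

import boto3

s3 = boto3.client("s3")

# Upload a sample csv into the raw landing location the crawler will scan.
# "sample.csv" is only an example file name.
s3.upload_file("sample.csv", "file-transfer-integration", "raw/data/sample.csv")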

Now create and run a crawler to register the data as a table in the AWS Glue Data Catalog.

Create the crawler

Go to AWS Glue console -> Crawlers

Click on Add crawler and give the crawler a name

Specify the crawler source type as Data stores, which is the default

Specify the path from which the crawler should load data into the Glue Data Catalog, i.e., s3://file-transfer-integration/raw/data/

Select the Choose an existing IAM role option and pick AWSGlueServiceRoleDefault

Leave Schedule as Run on demand, which is the default

In the Output options, configure the database where the crawler creates/updates the table

Review and finish
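For reference, the same crawler can also be created programmatically; a minimal boto3 sketch, where the crawler name raw-refined-crawler and the database raw_db are assumptions standing in for the values you chose above:

import boto3

glue = boto3.client("glue")

# Equivalent of the console walkthrough above; names are placeholders.
glue.create_crawler(
    Name="raw-refined-crawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="raw_db",
    Targets={"S3Targets": [{"Path": "s3://file-transfer-integration/raw/data/"}]},
)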

Run the crawler

Go to AWS Glue console -> Crawlers

Select the created crawler and choose the Run crawler option

Go to the Athena console and select the database that you provided in the crawler configuration; you will see that the table has been created.
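You can also sanity-check the new table from code; a boto3 sketch, where the database raw_db, the table data and the query result location are all placeholders:

import boto3

athena = boto3.client("athena")

# Quick sanity query against the crawled table; all names are placeholders.
athena.start_query_execution(
    QueryString="SELECT * FROM data LIMIT 10",
    QueryExecutionContext={"Database": "raw_db"},
    ResultConfiguration={"OutputLocation": "s3://file-transfer-integration/athena-results/"},
)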

Create Lambda functions to run the crawler and check its status

In case you are just starting out with Lambda functions, I have explained how to create one from scratch in my previous article.

Create a Lambda function named invoke-<crawler-name>, i.e., invoke-raw-refined-crawler, with the role that we created earlier, and increase the Lambda execution time (Timeout) to 5 minutes.

Place the following code in the invoke-raw-refined-crawler Lambda, which will run the crawler from the Lambda function.
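A minimal sketch of such a handler, assuming the crawler is named raw-refined-crawler (the name is inferred from the Lambda's name, not stated here):

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Start the crawler; the default name below is an assumption.
    crawler_name = event.get("crawler_name", "raw-refined-crawler")
    glue.start_crawler(Name=crawler_name)
    return {"crawler_name": crawler_name, "status": "STARTED"}

A companion status-check Lambda can poll glue.get_crawler(Name=crawler_name)["Crawler"]["State"] until it returns READY, so the Step Function knows when the crawler has finished.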

Create a Glue job

The purpose of the Glue job is to take care of the ETL process and convert the incoming csv/txt data to parquet format. If required, transformations such as filter conditions, data type casting and more can also be applied in this ETL job.
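To make the conversion concrete, here is a minimal sketch of such a Glue (PySpark) script; the catalog database raw_db, table data and refined output path are assumptions, not values from this walkthrough:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the csv/txt data through the Data Catalog table created by the crawler.
# "raw_db" and "data" are placeholder database/table names.
source = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="data"
)

# Optional transformations (filters, type casts, ...) would go here.

# Write the data back to S3 as parquet; the refined path is a placeholder.
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://file-transfer-integration/refined/data/"},
    format="parquet",
)

job.commit()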

Go to AWS Glue console -> select Jobs under ETL and click on Add job

Enter the name of the job, i.e., file-type-conversion, and choose the IAM Role AWSGlueServiceRoleDefault.