Populating the Data Catalog Using AWS CloudFormation Templates

AWS CloudFormation is a service that can create many AWS resources. AWS Glue provides
API operations to
create objects in the AWS Glue Data Catalog. However, it might be more convenient
to define and create
AWS Glue objects and other related AWS resource objects in an AWS CloudFormation template
file. Then you can
automate the process of creating the objects.

AWS CloudFormation provides a simplified syntax—either JSON (JavaScript Object Notation)
or YAML
(YAML Ain't Markup Language)—to express the creation of AWS resources. You can use
AWS CloudFormation templates to define Data Catalog objects such as databases, tables,
partitions, crawlers,
classifiers, and connections. You can also define ETL objects such as jobs, triggers,
and
development endpoints. You create a template that describes all the AWS resources
you want,
and AWS CloudFormation takes care of provisioning and configuring those resources
for you.

If you plan to use AWS CloudFormation templates that are compatible with AWS Glue,
as an administrator, you
must grant access to AWS CloudFormation and to the AWS services and actions on which
it depends. To grant
permissions to create AWS CloudFormation resources, attach the following policy to
the IAM users that work
with AWS CloudFormation:
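
The exact policy depends on your security requirements. One possible sketch, expressed as a CloudFormation resource so that it can sit alongside the other templates here, is a managed policy that allows all AWS CloudFormation actions. The resource name, parameter name, and user name below are hypothetical, and the broad `cloudformation:*` grant is an assumption that you would likely narrow in practice.

```yaml
---
AWSTemplateFormatVersion: '2010-09-09'
# Sketch only: grants AWS CloudFormation actions to one existing IAM user.
Parameters:
  # Hypothetical name of an existing IAM user who works with AWS CloudFormation
  CFNUserName:
    Type: String
    Default: my-cloudformation-user
Resources:
  CFNAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: Allow the user to work with AWS CloudFormation stacks
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            # Assumption: all CloudFormation actions on all stacks
            Action:
              - cloudformation:*
            Resource: '*'
      Users:
        - !Ref CFNUserName
```

Because this stack creates an IAM resource, AWS CloudFormation asks you to acknowledge that when you create the stack.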

The following table contains the actions that an AWS CloudFormation template can perform
on your behalf.
It includes links to information about the AWS resource types and their property types
that
you can add to an AWS CloudFormation template.

To get started, use the following sample templates and customize them with your own
metadata. Then use the AWS CloudFormation console to create an AWS CloudFormation
stack to add objects to AWS Glue and any
associated services. Many fields in an AWS Glue object are optional. These templates
illustrate the
fields that are required or are necessary for a working and functional AWS Glue object.

An AWS CloudFormation template can be in either JSON or YAML format. In these examples,
YAML is used for
easier readability. The examples contain comments (#) to describe the values that
are defined in the templates.

AWS CloudFormation templates can include a Parameters section. This section can be changed
in the sample text or when the YAML file is submitted to the AWS CloudFormation console
to create a stack.
The Resources section of the template contains the definitions of AWS Glue and related
objects. In the AWS CloudFormation syntax definitions, some properties have their own,
more detailed property syntax. Not every property is required to create an AWS Glue
object; these samples show example values for the common ones.

Sample AWS CloudFormation Template for an AWS Glue Database

An AWS Glue database in the Data Catalog contains metadata tables. The database consists
of very few
properties and can be created in the Data Catalog with an AWS CloudFormation template.
The following sample
template is provided to get you started and to illustrate the use of AWS CloudFormation
stacks with AWS Glue.
The only resource created by the sample template is a database named
cfn-mysampledatabase. You can change the name by editing the text of the sample or
by changing the value on the AWS CloudFormation console when you submit the YAML.

The following shows example values for common properties to create an AWS Glue database.
For
more information about the AWS CloudFormation database template for AWS Glue, see
AWS::Glue::Database.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CloudFormation template in YAML to demonstrate creating a database named cfn-mysampledatabase
# The metadata created in the Data Catalog points to the flights public S3 bucket
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  CFNDatabaseName:
    Type: String
    Default: cfn-mysampledatabase
# Resources section defines metadata for the Data Catalog
Resources:
  # Create an AWS Glue database
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      # The database is created in the Data Catalog for your account
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        # The name of the database is defined in the Parameters section above
        Name: !Ref CFNDatabaseName
        Description: Database to hold tables for flights data
        LocationUri: s3://crawler-public-us-east-1/flight/2016/csv/
        #Parameters: Leave AWS database parameters blank

Sample AWS CloudFormation Template for an AWS Glue Database, Table, and Partition

An AWS Glue table contains the metadata that defines the structure and location of
data that you want to process with your ETL scripts. Within a table, you can define
partitions to parallelize the processing of your data. A partition is a chunk of data
that you define with a key. For example, using month as a key, all the data for January
is contained in the same partition. In AWS Glue, databases can contain tables, and
tables can contain partitions.

The following sample shows how to populate a database, a table, and partitions using
an AWS CloudFormation template. The base data is in CSV format, delimited by commas (,).
Because a database must exist before it can contain a table, and a table must exist
before partitions can be created, the template uses the DependsOn attribute to define
the order in which these objects are created.

The values in this sample define a table that contains flight data from a publicly
available Amazon S3 bucket. For illustration, only a few columns of the data and one
partitioning
key are defined. Four partitions are also defined in the Data Catalog. Some fields
to describe the
storage of the base data are also shown in the StorageDescriptor fields.
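
A condensed sketch of such a template follows. It shows the database, the table, and only the first of the four partitions; the others would repeat the same pattern with a different value and location. The resource names, parameter defaults, column names, the partition key name `mon`, and the `mon=1/` folder layout are assumptions chosen for illustration.

```yaml
---
AWSTemplateFormatVersion: '2010-09-09'
# Sketch: database, table, and one partition for the public flights data
Parameters:
  CFNDatabaseName:
    Type: String
    Default: cfn-database-flights-1
  CFNTableName1:
    Type: String
    Default: cfn-manual-table-flights-1
Resources:
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName
        Description: Database to hold tables for flights data
  # The table cannot be created until the database exists
  CFNTableFlights:
    DependsOn: CFNDatabaseFlights
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableInput:
        Name: !Ref CFNTableName1
        Description: Define the first few columns of the flights table
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: csv
        # One partitioning key (assumed name)
        PartitionKeys:
          - Name: mon
            Type: bigint
        StorageDescriptor:
          # Only a few columns are defined (assumed names)
          Columns:
            - Name: year
              Type: bigint
            - Name: quarter
              Type: bigint
            - Name: month
              Type: bigint
            - Name: day_of_month
              Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  # The partition cannot be created until the table exists
  CFNPartitionMon1:
    DependsOn: CFNTableFlights
    Type: AWS::Glue::Partition
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableName: !Ref CFNTableName1
      PartitionInput:
        # Partition values are strings, one per partitioning key
        Values:
          - "1"
        StorageDescriptor:
          Columns:
            - Name: mon
              Type: bigint
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          # Assumed folder layout for the January partition
          Location: s3://crawler-public-us-east-1/flight/2016/csv/mon=1/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
```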

Sample AWS CloudFormation Template for an AWS Glue Grok Classifier

An AWS Glue classifier determines the schema of your data. One type of custom classifier
uses a grok pattern to match your data. If the pattern matches, then the custom classifier
is
used to create your table's schema and set the classification to the value set in
the classifier definition.

This sample creates a classifier that creates a schema with one column named message and sets the classification to greedy.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a classifier
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  # The name of the classifier to be created
  CFNClassifierName:
    Type: String
    Default: cfn-classifier-grok-one-column-1
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  # Create classifier that uses grok pattern to put all data in one column and classifies it as "greedy".
  CFNClassifierFlights:
    Type: AWS::Glue::Classifier
    Properties:
      GrokClassifier:
        #Grok classifier that puts all data in one column
        Name: !Ref CFNClassifierName
        Classification: greedy
        GrokPattern: "%{GREEDYDATA:message}"
        #CustomPatterns: none

Sample AWS CloudFormation Template for an AWS Glue JSON Classifier

An AWS Glue classifier determines the schema of your data. One type of custom classifier
uses a JsonPath string that identifies the JSON data for the classifier to classify.
AWS Glue supports a subset of the operators for JsonPath, as described in Writing
JsonPath Custom Classifiers.

If the pattern matches, then the custom classifier is used to create your table's
schema.

This sample creates a classifier that creates a schema from each record in the
Records3 array of a JSON object.
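
A minimal sketch of such a template might look like the following. The resource name and the default classifier name are assumptions, and the JsonPath expression `$.Records3[*]` is one plausible way to select each element of the Records3 array.

```yaml
---
AWSTemplateFormatVersion: '2010-09-09'
# Sketch: JSON classifier that treats each element of the Records3 array as a record
Parameters:
  # The name of the classifier to be created (assumed default)
  CFNClassifierName:
    Type: String
    Default: cfn-classifier-json-one-column-1
Resources:
  CFNClassifierFlights:
    Type: AWS::Glue::Classifier
    Properties:
      JsonClassifier:
        Name: !Ref CFNClassifierName
        # Assumed JsonPath: every element of the top-level Records3 array
        JsonPath: $.Records3[*]
```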

Sample AWS CloudFormation Template for an AWS Glue XML Classifier

An AWS Glue classifier determines the schema of your data. One type of custom classifier
specifies an XML tag to designate the element that contains each record in an XML
document
that is being parsed. If the pattern matches, then the custom classifier is used to
create
your table's schema and set the classification to the value set in the classifier
definition.

This sample creates a classifier that creates a schema from each record enclosed in
the Record tag and sets the classification to XML.
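
As a rough sketch, the classifier could be declared as follows. The resource name and default classifier name are assumptions. The row tag is written here as a bare element name taken from the description above; whether the value should include angle brackets is worth confirming against the AWS::Glue::Classifier reference before use.

```yaml
---
AWSTemplateFormatVersion: '2010-09-09'
# Sketch: XML classifier that treats each Record element as one row
Parameters:
  # The name of the classifier to be created (assumed default)
  CFNClassifierName:
    Type: String
    Default: cfn-classifier-xml-one-column-1
Resources:
  CFNClassifierFlights:
    Type: AWS::Glue::Classifier
    Properties:
      XMLClassifier:
        Name: !Ref CFNClassifierName
        Classification: XML
        # Assumed tag name; the element that contains each record
        RowTag: Record
```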

Sample AWS CloudFormation Template for an AWS Glue Crawler for Amazon S3

An AWS Glue crawler creates metadata tables in your Data Catalog that correspond to
your data. You can then use these table definitions as sources and targets in your
ETL jobs.

This sample creates a crawler, the required IAM role, and an AWS Glue database in the
Data Catalog. When this crawler is run, it assumes the IAM role and creates a table in
the database for the public flights data. The table is created with the prefix
"cfn_sample_1_". The IAM role created by this template grants global permissions; you
might want to create a custom role with narrower permissions. No custom classifiers are
defined by this crawler. AWS Glue built-in classifiers are used by default.

When you submit this sample to the AWS CloudFormation console, you must confirm that
you want to create the IAM role.
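
The following sketch illustrates how the three resources could fit together. The resource names, parameter defaults other than the table prefix, and the schema change settings are assumptions. The inline policy deliberately mirrors the global permissions described above and should be narrowed for real use.

```yaml
---
AWSTemplateFormatVersion: '2010-09-09'
# Sketch: IAM role, database, and crawler for the public flights data in Amazon S3
Parameters:
  CFNCrawlerName:
    Type: String
    Default: cfn-crawler-flights-1
  CFNDatabaseName:
    Type: String
    Default: cfn-database-flights-1
  CFNTablePrefixName:
    Type: String
    Default: cfn_sample_1_
Resources:
  # Role that the crawler assumes; AWS Glue must be allowed to assume it
  CFNRoleFlights:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - glue.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      Policies:
        # Global permissions, for illustration only
        - PolicyName: root
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action: '*'
                Resource: '*'
  # Database that receives the tables the crawler creates
  CFNDatabaseFlights:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName
        Description: Database to hold tables for flights data
  CFNCrawlerFlights:
    Type: AWS::Glue::Crawler
    Properties:
      Name: !Ref CFNCrawlerName
      Role: !GetAtt CFNRoleFlights.Arn
      # Ref on the database resource returns its name and orders creation
      DatabaseName: !Ref CFNDatabaseFlights
      Targets:
        S3Targets:
          - Path: s3://crawler-public-us-east-1/flight/2016/csv
      TablePrefix: !Ref CFNTablePrefixName
      # Assumed behavior when the crawler detects schema changes
      SchemaChangePolicy:
        UpdateBehavior: UPDATE_IN_DATABASE
        DeleteBehavior: LOG
```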

Sample AWS CloudFormation Template for an AWS Glue Connection

An AWS Glue connection in the Data Catalog contains the JDBC and network information
that is
required to connect to a JDBC database. This information is used when you connect
to a JDBC
database to crawl or run ETL jobs.

This sample creates a connection to an Amazon RDS MySQL database named devdb. When
this connection is used, an IAM role, database credentials, and network connection
values
must also be supplied. See the details of necessary fields in the template.
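
A sketch of such a connection is shown below. All parameter defaults, the Availability Zone, and the security group and subnet IDs are placeholders that you would replace with values from your own VPC; the JDBC URL assumes a MySQL endpoint on port 3306 with the database devdb.

```yaml
---
AWSTemplateFormatVersion: '2010-09-09'
# Sketch: JDBC connection to an Amazon RDS MySQL database named devdb
Parameters:
  CFNConnectionName:
    Type: String
    Default: cfn-connection-mysql-flights-1
  # Placeholder endpoint; replace with your own RDS host name
  CFNJDBCString:
    Type: String
    Default: "jdbc:mysql://<your-rds-endpoint>:3306/devdb"
  CFNJDBCUser:
    Type: String
    Default: master
  # NoEcho keeps the password out of console and API output
  CFNJDBCPassword:
    Type: String
    NoEcho: true
Resources:
  CFNConnectionMySQL:
    Type: AWS::Glue::Connection
    Properties:
      CatalogId: !Ref AWS::AccountId
      ConnectionInput:
        Name: !Ref CFNConnectionName
        Description: Connect to MySQL database
        ConnectionType: JDBC
        # Network values are placeholders for your own VPC settings
        PhysicalConnectionRequirements:
          AvailabilityZone: us-east-1d
          SecurityGroupIdList:
            - sg-00000000
          SubnetId: subnet-00000000
        ConnectionProperties:
          JDBC_CONNECTION_URL: !Ref CFNJDBCString
          USERNAME: !Ref CFNJDBCUser
          PASSWORD: !Ref CFNJDBCPassword
```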

Sample AWS CloudFormation Template for an AWS Glue Crawler for JDBC

An AWS Glue crawler creates metadata tables in your Data Catalog that correspond to
your data. You
can then use these table definitions as sources and targets in your ETL jobs.

This sample creates a crawler, the required IAM role, and an AWS Glue database in the
Data Catalog. When this crawler is run, it assumes the IAM role and creates a table in
the database for the public flights data that has been stored in a MySQL database. The
table is created with the prefix "cfn_jdbc_1_". The IAM role created by this template
grants global permissions; you might want to create a custom role with narrower
permissions. No custom classifiers can be defined for JDBC data. AWS Glue built-in
classifiers are used by default.

When you submit this sample to the AWS CloudFormation console, you must confirm that
you want to create the IAM role.
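
The crawler resource itself might be sketched as follows. To keep the example short, the IAM role and database resources that the sample also creates are omitted; this sketch instead takes an existing role ARN and database name as parameters. All parameter names and defaults other than the table prefix are assumptions, including the include path `devdb/%`.

```yaml
---
AWSTemplateFormatVersion: '2010-09-09'
# Sketch: crawler that reads a MySQL database through an existing JDBC connection
Parameters:
  CFNCrawlerName:
    Type: String
    Default: cfn-crawler-jdbc-flights-1
  # Existing Data Catalog database that receives the tables
  CFNDatabaseName:
    Type: String
    Default: cfn-database-jdbc-flights-1
  CFNTablePrefixName:
    Type: String
    Default: cfn_jdbc_1_
  # Existing AWS Glue connection to the JDBC data store
  CFNConnectionName:
    Type: String
    Default: cfn-connection-mysql-flights-1
  # Assumed include path: all tables in the devdb database
  CFNJDBCPath:
    Type: String
    Default: devdb/%
  # Placeholder ARN of an existing role that AWS Glue can assume
  CFNIAMRoleArn:
    Type: String
    Default: arn:aws:iam::123456789012:role/AWSGlueServiceRoleGA
Resources:
  CFNCrawlerJDBC:
    Type: AWS::Glue::Crawler
    Properties:
      Name: !Ref CFNCrawlerName
      Role: !Ref CFNIAMRoleArn
      DatabaseName: !Ref CFNDatabaseName
      Targets:
        JdbcTargets:
          - ConnectionName: !Ref CFNConnectionName
            Path: !Ref CFNJDBCPath
      TablePrefix: !Ref CFNTablePrefixName
```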

Sample AWS CloudFormation Template for an AWS Glue Job for Amazon S3 to Amazon S3

An AWS Glue job in the Data Catalog contains the parameter values that are required
to run a script in AWS Glue.

This sample creates a job that reads flight data from an Amazon S3 bucket in CSV
format and writes it to an Amazon S3 Parquet file. The script that is run by this
job must already exist. You can generate an ETL script for your environment with the
AWS Glue console. When this job is run, an IAM role with the correct permissions must
also be supplied.

Common parameter values are shown in the template. For example, AllocatedCapacity (DPUs) defaults to 5.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a job using the public flights S3 table in a public bucket
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  # The name of the job to be created
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-2
  # The name of the IAM role that the job assumes. It must have access to data, script, temporary directory
  CFNIAMRoleName:
    Type: String
    Default: AWSGlueServiceRoleGA
  # The S3 path where the script for this job is located
  CFNScriptLocation:
    Type: String
    Default: s3://aws-glue-scripts-123456789012-us-east-1/myid/sal-job-test2
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  # Create job to run script which accesses flightscsv table and writes to S3 file as parquet.
  # The script already exists and is called by this job
  CFNJobFlights:
    Type: AWS::Glue::Job
    Properties:
      Role: !Ref CFNIAMRoleName
      #DefaultArguments: JSON object
      # If script written in Scala, then set DefaultArguments={'--job-language': 'scala', '--class': 'your scala class'}
      #Connections: No connection needed for S3 to S3 job
      #  ConnectionsList
      #MaxRetries: Double
      Description: Job created with CloudFormation
      #LogUri: String
      Command:
        Name: glueetl
        ScriptLocation: !Ref CFNScriptLocation
        # for access to directories use proper IAM role with permission to buckets and folders that begin with "aws-glue-"
        # script uses temp directory from job definition if required (temp directory not used for S3 to S3)
        # script defines target for output as s3://aws-glue-target/sal
      AllocatedCapacity: 5
      ExecutionProperty:
        MaxConcurrentRuns: 1
      Name: !Ref CFNJobName

Sample AWS CloudFormation Template for an AWS Glue Job for JDBC to Amazon S3

An AWS Glue job in the Data Catalog contains the parameter values that are required
to run a script in AWS Glue.

This sample creates a job that reads flight data from a MySQL JDBC database as defined
by
the connection named cfn-connection-mysql-flights-1 and writes it to an Amazon S3
Parquet file. The script that is run by this job must already exist. You can generate
an ETL
script for your environment with the AWS Glue console. When this job is run, an IAM
role with
the correct permissions must also be supplied.

Common parameter values are shown in the template. For example, AllocatedCapacity (DPUs) defaults to 5.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a job using a MySQL JDBC DB with the flights data to an S3 file
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  # The name of the job to be created
  CFNJobName:
    Type: String
    Default: cfn-job-JDBC-to-S3-1
  # The name of the IAM role that the job assumes. It must have access to data, script, temporary directory
  CFNIAMRoleName:
    Type: String
    Default: AWSGlueServiceRoleGA
  # The S3 path where the script for this job is located
  CFNScriptLocation:
    Type: String
    Default: s3://aws-glue-scripts-123456789012-us-east-1/myid/sal-job-dec4a
  # The name of the connection used for JDBC data source
  CFNConnectionName:
    Type: String
    Default: cfn-connection-mysql-flights-1
#
#
# Resources section defines metadata for the Data Catalog
Resources:
  # Create job to run script which accesses JDBC flights table via a connection and writes to S3 file as parquet.
  # The script already exists and is called by this job
  CFNJobFlights:
    Type: AWS::Glue::Job
    Properties:
      Role: !Ref CFNIAMRoleName
      #DefaultArguments: JSON object
      # For example, if required by script, set temporary directory as DefaultArguments={'--TempDir': 's3://aws-glue-temporary-xyc/sal'}
      Connections:
        Connections:
          - !Ref CFNConnectionName
      #MaxRetries: Double
      Description: Job created with CloudFormation using existing script
      #LogUri: String
      Command:
        Name: glueetl
        ScriptLocation: !Ref CFNScriptLocation
        # for access to directories use proper IAM role with permission to buckets and folders that begin with "aws-glue-"
        # if required, script defines temp directory as argument TempDir and used in script like redshift_tmp_dir = args["TempDir"]
        # script defines target for output as s3://aws-glue-target/sal
      AllocatedCapacity: 5
      ExecutionProperty:
        MaxConcurrentRuns: 1
      Name: !Ref CFNJobName

Sample AWS CloudFormation Template for an AWS Glue On-Demand Trigger

An AWS Glue trigger in the Data Catalog contains the parameter values that are required
to start a
job run when the trigger fires. An on-demand trigger fires when you enable it.

This sample creates an on-demand trigger that starts one job named cfn-job-S3-to-S3-1.
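
A minimal sketch of such a trigger follows. The resource name and the default trigger name are assumptions; the job named in the parameter must already exist.

```yaml
---
AWSTemplateFormatVersion: '2010-09-09'
# Sketch: on-demand trigger that starts one existing job
Parameters:
  # The existing job to be started by this trigger
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-1
  # The name of the trigger to be created (assumed default)
  CFNTriggerName:
    Type: String
    Default: cfn-trigger-ondemand-flights-1
Resources:
  CFNTriggerSample:
    Type: AWS::Glue::Trigger
    Properties:
      Name: !Ref CFNTriggerName
      Description: Trigger created with CloudFormation
      # On-demand triggers take no schedule and no predicate
      Type: ON_DEMAND
      Actions:
        - JobName: !Ref CFNJobName
```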

Sample AWS CloudFormation Template for an AWS Glue Conditional Trigger

An AWS Glue trigger in the Data Catalog contains the parameter values that are required
to start a job run when the trigger fires. A conditional trigger fires when it is
enabled and its conditions are met, such as a job completing successfully.

This sample creates a conditional trigger that starts one job named
cfn-job-S3-to-S3-1. This job starts when the job named cfn-job-S3-to-S3-2
completes successfully.

---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a conditional trigger for a job, which starts when another job completes
#
# Parameters section contains names that are substituted in the Resources section
# These parameters are the names of the resources created in the Data Catalog
Parameters:
  # The existing job to be started by this trigger
  CFNJobName:
    Type: String
    Default: cfn-job-S3-to-S3-1
  # The existing job whose completion causes the trigger to fire
  CFNJobName2:
    Type: String
    Default: cfn-job-S3-to-S3-2
  # The name of the trigger to be created
  CFNTriggerName:
    Type: String
    Default: cfn-trigger-conditional-1
#
Resources:
  # Create trigger to run an existing job (CFNJobName) when another job completes (CFNJobName2).
  CFNTriggerSample:
    Type: AWS::Glue::Trigger
    Properties:
      Name:
        Ref: CFNTriggerName
      Description: Trigger created with CloudFormation
      Type: CONDITIONAL
      Actions:
        - JobName: !Ref CFNJobName
          # Arguments: JSON object
      #Schedule: none
      Predicate:
        #Value for Logical is required if more than 1 job listed in Conditions
        Logical: AND
        Conditions:
          - LogicalOperator: EQUALS
            JobName: !Ref CFNJobName2
            State: SUCCEEDED

Sample AWS CloudFormation Template for an AWS Glue Development Endpoint

An AWS Glue development endpoint is an environment that you can use to develop and
test your AWS Glue scripts.

This sample creates a development endpoint with the minimal network parameter values
required to successfully create it. For more information about the parameters that
you need to
set up a development endpoint, see Setting Up Your Environment for Development
Endpoints.

You provide an existing IAM role ARN (Amazon Resource Name) to create the development
endpoint. Supply a valid RSA public key and keep the corresponding private key available
if
you plan to create a notebook server on the development endpoint.
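
A sketch of a development endpoint template under these requirements might look like this. The parameter names, the role ARN, the node count, and the security group and subnet IDs are placeholders or assumptions; the public key has no default and must be supplied when you create the stack.

```yaml
---
AWSTemplateFormatVersion: '2010-09-09'
# Sketch: development endpoint with minimal network settings
Parameters:
  CFNEndpointName:
    Type: String
    Default: cfn-devendpoint-1
  # Placeholder ARN of an existing IAM role
  CFNIAMRoleArn:
    Type: String
    Default: arn:aws:iam::123456789012:role/AWSGlueServiceRoleGA
  # RSA public key; keep the matching private key for notebook access
  CFNPublicKey:
    Type: String
Resources:
  CFNDevEndpoint:
    Type: AWS::Glue::DevEndpoint
    Properties:
      EndpointName: !Ref CFNEndpointName
      RoleArn: !Ref CFNIAMRoleArn
      PublicKey: !Ref CFNPublicKey
      # Assumed capacity for the endpoint
      NumberOfNodes: 5
      # Network values are placeholders for your own VPC settings
      SecurityGroupIds:
        - sg-00000000
      SubnetId: subnet-00000000
```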

Note

You manage any notebook server that you create and associate with a development
endpoint. Therefore, if you delete the development endpoint, you must also delete the
AWS CloudFormation stack on the AWS CloudFormation console to delete the notebook
server.