Data Cataloging

The earliest challenges that inhibited building a data lake were
keeping track of all of the raw assets as they were loaded into the
data lake, and then tracking all of the new data assets and versions
that were created by data transformation, data processing, and
analytics. Thus, an essential component of an Amazon S3-based data
lake is the data catalog. The data catalog provides a queryable
interface of all assets stored in the data lake’s S3 buckets. The
data catalog is designed to provide a single source of truth about
the contents of the data lake.

There are two general forms of a data catalog: a comprehensive data
catalog that contains information about all assets that have been
ingested into the S3 data lake, and a Hive Metastore Catalog
(HCatalog) that contains information about data assets that have
been transformed into formats and table definitions that are usable
by analytics tools like Amazon Athena, Amazon Redshift, Amazon
Redshift Spectrum, and Amazon EMR. The two catalogs are not mutually
exclusive and both may exist. The comprehensive data catalog can be
used to search for all assets in the data lake, and the HCatalog can
be used to discover and query data assets in the data lake.

Comprehensive Data Catalog

The comprehensive data catalog can be created by using standard
AWS services like AWS Lambda, Amazon DynamoDB, and Amazon
Elasticsearch Service (Amazon ES). At a high level, Lambda
triggers populate DynamoDB tables with object names and metadata
when those objects are put into Amazon S3. Amazon ES is then used
to search for specific assets, related metadata, and data
classifications. The following figure shows a high-level
architectural overview of this solution.
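
As a rough illustration of the Lambda-to-DynamoDB step described above, the
following Python sketch shows what such a trigger might look like. The table
name (data-lake-catalog), the item attributes, and the optional table
parameter are assumptions made for this example and are not prescribed by
this guide. The event fields follow the standard Amazon S3 event notification
structure. Indexing the items into Amazon ES is not shown.

```python
import urllib.parse


def build_catalog_item(record):
    """Turn one S3 event record into a catalog item.

    The attribute names below are a hypothetical schema chosen for
    illustration; a real catalog would define its own.
    """
    s3 = record["s3"]
    obj = s3["object"]
    return {
        # Object keys arrive URL-encoded in S3 event notifications.
        "object_key": urllib.parse.unquote_plus(obj["key"]),
        "bucket": s3["bucket"]["name"],
        "size_bytes": obj.get("size", 0),
        "etag": obj.get("eTag", ""),
        "event_time": record.get("eventTime", ""),
    }


def handler(event, context, table=None):
    """Lambda entry point: write one catalog item per S3 object in the event.

    `table` is an optional hook (an assumption of this sketch) so the
    function can be exercised without AWS credentials.
    """
    if table is None:
        # Imported lazily so the module loads without boto3 installed.
        import boto3

        # "data-lake-catalog" is a placeholder table name.
        table = boto3.resource("dynamodb").Table("data-lake-catalog")

    items = [build_catalog_item(r) for r in event.get("Records", [])]
    for item in items:
        table.put_item(Item=item)
    return {"cataloged": len(items)}
```

Keeping the item construction in its own function makes the metadata schema
easy to test and extend, for example with data classification tags, without
touching the DynamoDB call.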

HCatalog with AWS Glue

AWS Glue can be used to create a Hive-compatible Metastore Catalog
of data stored in an Amazon S3-based data lake. To use AWS Glue to
build your data catalog, register your data sources with AWS Glue
in the AWS Management Console. AWS Glue will then crawl your S3
buckets for data sources and construct a data catalog using
pre-built classifiers for many popular source formats and data
types, including JSON, CSV, Parquet, and more. You may also add
your own classifiers or choose classifiers from the AWS Glue
community to add to your crawls to recognize and catalog other
data formats. The AWS Glue-generated catalog can be used by Amazon
Athena, Amazon Redshift, Amazon Redshift Spectrum, and Amazon EMR,
as well as third-party analytics tools that use a standard Hive
Metastore Catalog. The following figure shows a sample screenshot
of the AWS Glue data catalog interface.

Figure: Sample AWS Glue data catalog interface
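
The text above describes registering data sources in the AWS Management
Console. As an alternative, the same crawl can be set up programmatically.
The following Python sketch assumes the boto3 Glue client and uses its
create_crawler and start_crawler calls. The helper function names and the
crawler name, role ARN, database name, and S3 path shown in the comments are
placeholders for this example, not values taken from this guide.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build the arguments for a Glue crawler over one S3 location."""
    return {
        "Name": name,
        "Role": role_arn,          # IAM role that AWS Glue assumes to read S3
        "DatabaseName": database,  # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def register_and_crawl(glue, request):
    """Create the crawler, then start a crawl with the given Glue client."""
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
    return request["Name"]


# Example usage (requires boto3 and AWS credentials; all values are
# placeholders):
#
#   import boto3
#   glue = boto3.client("glue")
#   request = crawler_request(
#       "data-lake-crawler",
#       "arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
#       "data_lake_catalog",
#       "s3://example-data-lake/processed/",
#   )
#   register_and_crawl(glue, request)
```

When the crawl finishes, the tables it creates in the named database should
be visible to services that read the AWS Glue catalog, such as Amazon Athena.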
