Many Amazon Web Services (AWS) customers require a data storage and analytics solution that offers more agility and flexibility than traditional data management systems. A data lake is a new and increasingly popular way to store and analyze data because it allows companies to manage multiple data types from a wide variety of sources, and store this data, structured and unstructured, in a centralized repository.

The AWS Cloud provides many of the building blocks required to help customers implement a secure, flexible, and cost-effective data lake. These include AWS managed services that help ingest, store, find, process, and analyze both structured and unstructured data. To support our customers as they build data lakes, AWS offers the data lake solution, which is an automated reference implementation that deploys a highly available, cost-effective data lake architecture on the AWS Cloud along with a user-friendly console for searching and requesting datasets.

This webpage provides high-level best practices and guidance for building data lakes on AWS and introduces the data lake solution.

Many companies leverage a data lake to complement, rather than replace, existing data warehouses. A data lake can serve as a source for both structured and unstructured data, which can be easily converted into a well-defined schema for ingestion into a data warehouse, or analyzed ad hoc to quickly explore unknown datasets and discover new insights. With this in mind, consider the following best practices when building a data lake solution:

Configure your data lake to be flexible and scalable so that you can collect and store all types of data as your company grows. Include design components that support data encryption, search, analysis, and querying.

AWS offers a data lake solution that automatically configures the core AWS services necessary to easily tag, search, share, transform, analyze, and govern specific subsets of data across a company or with other external users. The solution deploys a console that users can access to search and browse available datasets for their business needs. The diagram below presents the data lake architecture you can deploy in minutes using the solution's implementation guide and accompanying AWS CloudFormation template.

The solution leverages the security, durability, and scalability of Amazon S3 to manage a persistent catalog of organizational datasets, and Amazon DynamoDB to manage corresponding metadata.
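This S3-plus-DynamoDB pattern can be illustrated with a short Python sketch. The attribute names below are assumptions for illustration, not the solution's actual table schema:

```python
from datetime import datetime, timezone

def build_catalog_item(package_id, name, bucket, key, tags):
    """Build a DynamoDB-style metadata item describing a dataset in S3.

    Attribute names here are illustrative only; the solution's real
    schema may differ.
    """
    return {
        "package_id": package_id,  # assumed partition key
        "name": name,
        "s3_location": f"s3://{bucket}/{key}",  # the dataset itself stays in S3
        "tags": dict(tags),  # business-relevant tags for search
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```

In a real deployment, an item like this would be written to DynamoDB (for example with `Table.put_item`), while the dataset bytes remain in Amazon S3.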

Once a dataset is cataloged, its attributes and descriptive tags are available for search. Users can search and browse available datasets in the solution console and create a list of the data they require access to.
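The tag-based search can be sketched as a simple filter over catalog entries. This is a minimal illustration of the idea, not the solution's query implementation:

```python
def search_catalog(items, **required_tags):
    """Return catalog items whose tags match every required key/value pair.

    `items` is assumed to be a list of dicts, each with a "tags" dict;
    this is an illustrative in-memory filter, not the solution's backend.
    """
    return [
        item for item in items
        if all(item.get("tags", {}).get(k) == v for k, v in required_tags.items())
    ]
```

For example, `search_catalog(catalog, dept="finance")` returns only the datasets tagged with that department.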

The solution keeps track of the datasets a user selects and generates a manifest file with secure access links to the desired content when the user checks out.
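Conceptually, a checkout manifest is a list of dataset names paired with temporary, secure links. The sketch below assumes the manifest shape and a `presign` callable; in a real deployment that callable would wrap something like boto3's `generate_presigned_url` for S3 objects:

```python
import json

def build_manifest(datasets, presign, expires_in=3600):
    """Assemble a checkout manifest of secure, time-limited links.

    `datasets` is assumed to be a list of {"name", "bucket", "key"} dicts,
    and `presign(bucket, key, expires_in)` returns a temporary URL for one
    object. Both are placeholders for this illustration.
    """
    entries = [
        {"name": d["name"], "url": presign(d["bucket"], d["key"], expires_in)}
        for d in datasets
    ]
    return json.dumps({"entries": entries}, indent=2)
```

The user downloads this manifest at checkout and retrieves each dataset through its time-limited link.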

The solution manages a persistent catalog of organizational datasets in Amazon S3 and business-relevant tags associated with each dataset. It allows companies to create simple governance policies that require specific tags when datasets are stored in the data lake solution.
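A tag-governance check of this kind can be sketched in a few lines. The required tag keys below are an example policy, not the solution's defaults:

```python
REQUIRED_TAGS = {"owner", "data-classification"}  # example policy, not the solution's defaults

def missing_governance_tags(tags, required=REQUIRED_TAGS):
    """Return the set of required tag keys absent from a dataset's tags.

    An empty result means the dataset satisfies the (example) policy
    and may be stored in the data lake.
    """
    return required - set(tags)
```

A registration workflow would reject any dataset for which this check returns a non-empty set.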

Q: What type of datasets does the data lake solution support?

You can register existing or new datasets of any file type or size because the solution leverages the flexibility of Amazon S3.

Q: How do I upload my data to the data lake?

You can upload data files from the data lake solution console, or directly to an Amazon S3 bucket and then register them in the data lake.
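The direct-to-S3 path can be sketched with boto3 (the AWS SDK for Python). The bucket, key, and `register` callback below are placeholders for your deployment; the registration step itself depends on the solution's API:

```python
def s3_uri(bucket, key):
    """Build the s3:// URI recorded for an uploaded object."""
    return f"s3://{bucket}/{key}"

def upload_and_register(path, bucket, key, register):
    """Upload a local file to Amazon S3, then pass its location to a
    registration callback (for example, one that writes a catalog entry).

    `bucket`, `key`, and `register` are placeholders for this sketch.
    """
    import boto3  # AWS SDK for Python; assumed available in the deployment

    boto3.client("s3").upload_file(path, bucket, key)
    return register(s3_uri(bucket, key))
```

Uploading through the solution console performs the equivalent of both steps for you.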

Q: Can I use the data lake if I have existing data in Amazon S3?

Yes. You can register datasets with descriptive tags of your choice that point to existing objects in Amazon S3.

Logs, alarms, error rates, and other metrics are stored in Amazon CloudWatch and are available in near real time.
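You can retrieve these metrics programmatically with boto3's CloudWatch client. The namespace and metric name below are placeholders; substitute the ones emitted by your deployment:

```python
from datetime import datetime, timedelta, timezone

def recent_window(minutes=15):
    """Return (start, end) UTC timestamps covering the last `minutes` minutes,
    suitable for CloudWatch's StartTime/EndTime parameters."""
    end = datetime.now(timezone.utc)
    return end - timedelta(minutes=minutes), end

def fetch_metric_sum(namespace, metric_name, minutes=15):
    """Fetch a metric's recent datapoints from Amazon CloudWatch.

    `namespace` and `metric_name` depend on your deployment; this sketch
    simply sums one-minute datapoints over the recent window.
    """
    import boto3  # AWS SDK for Python; assumed available in the deployment

    start, end = recent_window(minutes)
    return boto3.client("cloudwatch").get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=["Sum"],
    )["Datapoints"]
```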

Q: How do I add and manage users in the data lake solution?

After the data lake solution is deployed, you can invite users to self-register to start using the data lake. You can continue to manage users, groups, and permissions to the data lake in the Administration section of the solution console.

Q: Does the data lake integrate with my enterprise Active Directory?

The data lake does not integrate with Active Directory at this time.

Q: How is data transmitted to the data lake?

You have several options to add data to the data lake solution: use the data lake console or data lake CLI to upload files, or link to existing content in Amazon S3.

Q: Can I deploy the data lake solution in any AWS Region?

You can deploy the solution’s AWS CloudFormation template only in AWS Regions where Amazon Cognito is available. However, once deployed, you can invite users from around the globe to access the solution.

In addition to service-availability requirements, we recommend you deploy the data lake in the AWS Region where your data is stored for better performance and user interactivity.