Data Lake Products

Harness the Value of Exploding Data Volumes

Data Lakes have emerged in recent years as organizations look to economically harness and derive value from exploding data volumes. New data sources such as web, mobile, and connected devices, along with new forms of analytics such as text, graph, and pathing, have necessitated a new Data Lake design pattern to augment traditional design patterns such as the Data Warehouse.

Companies are beginning to realize value from Data Lakes in the areas of:

New Insights from Data of Unknown or Under-Appreciated Value

New Forms of Analytics

Corporate Memory Retention

Data Integration Optimization

Yet confusion about the definition of a data lake abounds in the absence of a large body of well-understood best practices. Drawing on many sources as well as on-site experience with leading data-driven customers, we define a data lake as a collection of long-term data containers that capture, refine, and explore any form of raw data at scale, enabled by low-cost technologies, from which multiple downstream facilities may draw.

Data Lake Design Pattern

A design pattern is an architecture and a set of corresponding requirements that have evolved to the point where there is agreement on best practices for implementation. How you implement it varies from workload to workload and from organization to organization. While technologies are critical to the outcome, a successful data lake needs a plan. A Data Lake design pattern is that plan.

The data lake definition does not prescribe a technology, only requirements. Data Lakes are often treated as synonymous with Hadoop, which is an excellent choice for many Data Lake workloads, but a Data Lake can be built on multiple technologies such as Hadoop, NoSQL, S3, RDBMS, or combinations thereof.

"Data lakes can be based on HDFS, but are not limited to that environment; for example, object stores such as Amazon Simple Storage Service (S3)/Microsoft Azure or NoSQL DBMSs like HBase or Cassandra can also be environments for data lakes." — Gartner, 2015

Data Lake Architecture

As the trusted advisor to the world’s leading data-driven organizations, Teradata can help with the design, implementation, and support of your Data Lake initiative. By applying critical capabilities and design principles based on best practices, Teradata helps your organization avoid the typical pitfalls and realize maximum value.

Inspired by over 150 data lake implementations by Think Big, Kylo is an open source, enterprise-ready, data lake management software platform that simplifies pipeline development and common data management tasks, resulting in faster time to value, greater user adoption and developer productivity. With Kylo, no coding is required, and its intuitive user interface for self-service data ingest and wrangling helps accelerate the development process. Kylo also leverages reusable templates to increase productivity. Kylo was built using the latest open source capabilities such as Apache® Hadoop®, Apache Spark™ and Apache NiFi™. Kylo is a Teradata sponsored, open-source project that is offered under the Apache 2.0 license.

Presto is an open source SQL-on-Hadoop query engine designed for running interactive analytic queries against data sources of all sizes. Through a single query, Presto allows you to access data where it lives, including in Apache Hive™, Apache Cassandra™, relational databases or even proprietary data stores. Presto was created by Facebook for the analytics needs of extremely large data-driven organizations.
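As a rough illustration of the "single query" idea, the sketch below builds one SQL statement that joins tables from two different data sources. It relies on Presto's catalog.schema.table naming for tables. The catalog names (hive, cassandra), the schemas, the tables, and the helper functions are all hypothetical and chosen only for illustration; the catalogs actually available depend on how a given Presto deployment is configured.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a Presto-style fully qualified table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def federated_join(left: str, right: str, key: str) -> str:
    """Build one SQL statement that joins two tables on a shared key column."""
    return (
        f"SELECT l.{key}, count(*) AS events\n"
        f"FROM {left} AS l\n"
        f"JOIN {right} AS r ON l.{key} = r.{key}\n"
        f"GROUP BY l.{key}"
    )


# Hypothetical catalogs, schemas, and tables, for illustration only.
clicks = qualified("hive", "web", "clicks")
users = qualified("cassandra", "crm", "users")
query = federated_join(clicks, users, "user_id")
print(query)
```

The resulting text could then be submitted through whichever Presto client an organization already uses; the sketch deliberately stops at building the statement and does not connect to a cluster.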

Aster Analytics provides easy-to-use, multi-genre advanced analytics at scale, enabling business analysts and data scientists to quickly discover insights in their Hadoop data lake. Aster delivers over 100 pre-built parallel analytic functions that run natively on Hadoop to analyze data directly on HDFS. Aster Analytics is also YARN-integrated to support multiple instances of Aster, from sandboxes to production use cases, in the same Hadoop cluster.

Teradata IntelliBase is a compact, fully-integrated environment for data lakes and other designs that require low-cost data storage. The versatile platform enables a mixture of Teradata and Hadoop nodes to meet your workload requirements—all installed into a single cabinet to preserve valuable data center floor space. Teradata IntelliBase with Hadoop is delivered ready-to-run and fully supported by Teradata.

Download Open Source Presto

Presto is an open source distributed SQL query engine designed for running interactive analytic queries against data of all sizes. With a single query, access data where it lives, including in Hadoop, Apache Cassandra™, MySQL, PostgreSQL, or even proprietary data stores.

Solution Showcase: Teradata’s Compelling Open Source Strategy

Open source software provides many opportunities for the tech industry, particularly around innovation and community building. In this paper written by Nik Rouda, senior analyst with Enterprise Strategy Group (ESG), learn how Teradata leverages open source technologies to support a commercial software strategy that benefits both the company and its customers.