Pentaho, Hadoop, and Data Lakes

Earlier this week, at Hadoop World in New York, Pentaho announced availability of our first Hadoop release.

As part of the initial research into the Hadoop arena, I talked to many companies that use Hadoop. Several common attributes and themes emerged from these meetings:

80-90% of companies are dealing with structured or semi-structured data (not unstructured).

The source of the data is typically a single application or system.

The data is typically sub-transactional or non-transactional.

There are some known questions to ask of the data.

There are many unknown questions that will arise in the future.

There are multiple user communities that have questions of the data.

The data is of a scale or daily volume such that it won’t fit technically and/or economically into an RDBMS.

In the past, the standard way to handle reporting and analysis of this data was to identify the most interesting attributes and aggregate them into a data mart. There are several problems with this approach:

Only a subset of the attributes is examined, so only pre-determined questions can be answered.

The data is aggregated, so visibility into the lowest levels is lost.
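To make these two problems concrete, here is a minimal sketch (the event data and attribute names are hypothetical, not from any real Pentaho deployment): raw sub-transactional events carry several attributes, but the mart keeps only the pre-selected one, aggregated.

```python
from collections import defaultdict

# Hypothetical raw sub-transactional events (e.g., web-log records).
raw_events = [
    {"page": "/home", "browser": "firefox", "ms": 120},
    {"page": "/home", "browser": "chrome",  "ms": 95},
    {"page": "/buy",  "browser": "firefox", "ms": 310},
    {"page": "/buy",  "browser": "chrome",  "ms": 280},
]

# Build the data mart: only the pre-selected attribute ("page")
# survives, and the detail rows are rolled up into aggregates.
mart = defaultdict(lambda: {"hits": 0, "total_ms": 0})
for e in raw_events:
    mart[e["page"]]["hits"] += 1
    mart[e["page"]]["total_ms"] += e["ms"]

# The known, pre-determined question ("hits per page") is answerable:
print(mart["/home"]["hits"])  # 2

# But a question that arises later ("latency by browser") cannot be
# answered from the mart: the "browser" attribute was never loaded,
# and the individual events behind each aggregate are gone.
```

Answering the new question requires going back to the raw events, which is exactly what the data lake described below keeps available.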

Based on the requirements above and the problems of the traditional solutions, we have created a concept called the Data Lake to describe an optimal solution.

If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.