Waterline Data Catalog

Waterline Data Fingerprinting™ works by analyzing and profiling the data values in each data set. Waterline then uses that information to create a “fingerprint” for each column of data, and applies machine learning to automatically match those fingerprints to glossary terms, tag the data and populate the data catalog. Users can then refine the matched terms, and tag any data that remains unmatched, through crowdsourcing.
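To make the idea of a column fingerprint concrete, here is a minimal sketch in Python. It is not Waterline's proprietary method: it swaps the machine learning step for simple regular-expression rules, and the glossary terms, patterns, function names and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical glossary: each term maps to a pattern its values should match.
# A real catalog would learn these signatures rather than hard-code them.
GLOSSARY_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def fingerprint(values):
    """Profile one column: value count, distinct ratio, and match rate per term."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    total = len(non_empty)
    if total == 0:
        return {"count": 0, "distinct_ratio": 0.0, "match_rates": {}}
    match_rates = {
        term: sum(1 for v in non_empty if pattern.match(v)) / total
        for term, pattern in GLOSSARY_PATTERNS.items()
    }
    return {
        "count": total,
        "distinct_ratio": len(set(non_empty)) / total,
        "match_rates": match_rates,
    }


def suggest_tag(values, threshold=0.8):
    """Return the best-matching glossary term, or None if nothing clears the threshold."""
    rates = fingerprint(values)["match_rates"]
    if not rates:
        return None
    term, rate = max(rates.items(), key=lambda item: item[1])
    return term if rate >= threshold else None
```

In this sketch, raising the threshold trades fewer false positive tags for more columns left unmatched, which is the balance a steward would then resolve by hand.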

Overcoming technical challenges

Waterline fingerprints data while overcoming two difficult technical challenges. The first is the problem of generating too many false positive matches. This has been addressed by tuning our proprietary matching algorithms using years of experience with real customer data. The second is dealing with the massive amount of data that modern enterprises need to inventory. Waterline is the only data catalog designed from the ground up to run directly on Hadoop and Spark. The result is that we are designed to scale directly with our customers’ infrastructure for today and tomorrow.

Automated data fingerprinting at scale provides a huge leap over approaches that use only crowdsourcing and allows Waterline customers to get value out of the data catalog in a matter of hours or days.

Curation, ratings and reviews

We recognize that users don’t want to turn over the entire process to a machine, nor should they. Waterline combines automation at scale with a curation process that allows data stewards to accept or reject the automated tags, or add their own tags to the data catalog. As stewards curate the data, machine learning algorithms use their decisions to improve the automated discovery process.

Combining automation & tribal knowledge

Waterline allows end users to provide ratings and reviews of data sets. Users can comment on aspects of the data that can’t be easily determined algorithmically, preserving and leveraging tribal knowledge. Users might comment that a certain data set is good for HR but not for finance. Or they might note that a specific data set is being used by data scientists as a sandbox and shouldn’t be used for other purposes. Combining automation with data democratization provides the fastest and best approach to building and maintaining a data catalog.

Search

The Waterline catalog allows business analysts to spend less time trying to find the data they need and more time doing actual analytics. Waterline search provides a wide variety of facets that users can apply to narrow down their searches. Customers can also define their own facets by creating custom properties to tag data, and then use those properties to filter later searches. Search results also show crowdsourced ratings and reviews, giving users yet another way to judge the quality of the data they are viewing.

Industry standard search technology

Waterline uses the Apache Solr engine to power the search of the catalog. Solr delivers scalability, high availability (HA) and disaster recovery (DR). Waterline also exposes a REST API for search to make it easy to integrate search queries and results directly into third-party applications such as data wrangling and business intelligence tools.
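As an illustration of what a faceted search request looks like at the Solr level, the sketch below builds a query URL for Solr's standard select handler using its documented q, fq, rows, wt, facet and facet.field parameters. It does not show Waterline's own REST API, whose endpoints are not described here. The host, the core name "catalog" and the field names are assumptions.

```python
from urllib.parse import urlencode


def build_solr_search_url(base_url, core, text, filters=None, facet_fields=None, rows=10):
    """Build a URL for Solr's select handler with optional filters and facets.

    `filters` maps a field name to an exact value (sent as fq filter queries);
    `facet_fields` lists the fields Solr should return facet counts for.
    """
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    for field, value in (filters or {}).items():
        # Escape embedded quotes so the value stays one quoted phrase.
        escaped = str(value).replace('"', '\\"')
        params.append(("fq", f'{field}:"{escaped}"'))
    if facet_fields:
        params.append(("facet", "true"))
        for field in facet_fields:
            params.append(("facet.field", field))
    return f"{base_url.rstrip('/')}/solr/{core}/select?{urlencode(params)}"


# Hypothetical usage: search for "customer", restricted to data tagged "email",
# and ask for facet counts on the (assumed) "tag" and "rating" fields.
example_url = build_solr_search_url(
    "http://localhost:8983",
    "catalog",
    "customer",
    filters={"tag": "email"},
    facet_fields=["tag", "rating"],
)
```

A custom property defined by a customer would appear in such a request as just another field name in the filters or facet list.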

Data Lineage

A data catalog isn’t complete without data lineage, which is critical both for establishing user trust in data and for complying with regulations. Lineage lets users know where data comes from and where it is ultimately consumed. A better understanding of data provenance gives users more confidence and trust in downstream data and helps answer acceptable-use questions.

Imported & derived lineage

Waterline Smart Data Catalog uses both imported and derived lineage. Imported lineage comes directly from the metadata available from Apache Atlas, Cloudera Navigator or ETL tools (via our REST API). To fill in the gaps, Waterline also derives lineage by analyzing the data values within a Hive table, SQL table or file in HDFS and looking for similar signatures that point to potential lineage candidates. Using proprietary algorithms, the data catalog identifies the best candidates and uses time stamps to determine which tables are upstream and which are downstream in the data flow.
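The derived lineage idea can be sketched in a few lines of Python. This is only an illustration, not Waterline's proprietary algorithm: it uses Jaccard similarity over sets of column values as a stand-in for "similar signatures", and the ColumnSnapshot type, the 0.7 threshold and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSnapshot:
    """Hypothetical record of one column's sampled values and creation time."""
    table: str
    column: str
    created_at: float  # for example, epoch seconds
    values: frozenset


def jaccard(a, b):
    """Share of values the two sets have in common (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def lineage_candidates(columns, min_similarity=0.7):
    """Return (upstream_table, downstream_table, score) for similar column pairs.

    The older column of each pair is treated as upstream, based on its time stamp.
    """
    candidates = []
    for i, left in enumerate(columns):
        for right in columns[i + 1:]:
            if left.table == right.table:
                continue
            score = jaccard(left.values, right.values)
            if score < min_similarity:
                continue
            upstream, downstream = sorted((left, right), key=lambda c: c.created_at)
            candidates.append((upstream.table, downstream.table, round(score, 3)))
    # Strongest candidates first.
    return sorted(candidates, key=lambda c: c[2], reverse=True)
```

Comparing every pair of columns is quadratic, so a sketch like this only works for small inputs; at catalog scale the comparison would need to run on compact signatures rather than raw value sets.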

Access Control

One consequence of the big data era is that there is simply too much data coming into organizations to keep track of manually. New data sources are introduced with increasing velocity. Often this new data lands in a quarantine zone to be reviewed and organized before it can be used effectively. Typically, to release data from quarantine, a data steward must manually review it, classify it and then configure appropriate access controls based on user roles and domains. This process can take so long, if it is completed at all, that supposedly new data goes stale before it is ever put to use.

Tag-based access control

Waterline significantly accelerates this process with tag-based access control. This approach automatically tags new data sets as they arrive, so data stewards only have to review the suggestions. In addition, through integration via REST APIs, the tags are propagated directly into platform access control mechanisms (Apache Ranger or Cloudera Sentry), where access can be managed immediately based on existing roles and domains. Data clears quarantine much more quickly and can be put to use right away. Instead of being paralyzed by the flood of data, enterprises can use Waterline to harness big data, stay agile and gain critical business insights.
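The logic behind tag-based access control can be shown with a small Python sketch. In a real deployment the policies would live in Apache Ranger or Cloudera Sentry; this example calls neither and only models the decision. The tag names, role names and the "most restrictive tag wins" rule are assumptions.

```python
# Hypothetical policy table: each tag maps to the roles allowed to read data
# carrying that tag.
TAG_POLICIES = {
    "pii": {"data_steward", "compliance"},
    "finance": {"finance_analyst", "data_steward"},
}


def allowed_roles(tags, default_roles=frozenset()):
    """Roles allowed to read a data set, given its tags.

    When several tags have policies, only roles allowed by every one of them
    qualify. Data with no policy-bearing tag falls back to `default_roles`,
    which is empty here so untagged data stays quarantined.
    """
    role_sets = [TAG_POLICIES[tag] for tag in tags if tag in TAG_POLICIES]
    if not role_sets:
        return set(default_roles)
    return set.intersection(*role_sets)


def can_read(user_roles, tags):
    """True if the user holds at least one role allowed for these tags."""
    return bool(set(user_roles) & allowed_roles(tags))
```

Because access follows the tags, a newly arrived data set becomes readable by the right roles as soon as a steward accepts its suggested tags, with no per-data-set configuration.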

Easy Integration: Open architecture and ecosystem

To be useful, a data catalog needs to be integrated into the rest of your ecosystem so your organization can take action more effectively and leverage the full value of your data assets.