Auditing and Data Governance Overview

Note: This page contains references to CDH 5 components or features that have been removed from CDH 6. These references are only applicable if you
are managing a CDH 5 cluster with Cloudera Manager 6. For more information, see Deprecated Items.

Organizations of all kinds want to understand where data in their clusters is coming from and how it is used. Cloudera Navigator Data Management component is a fully integrated data
management and security tool for the Hadoop platform that has been designed to meet compliance, data governance, and auditing needs of global enterprises. Without Cloudera Navigator, Hadoop clusters
rely primarily on log files for auditing. However, log files are not an enterprise-class real-time auditing or monitoring solution. For example, log files can be corrupted by a system crash during a
write commit.

Cloudera Navigator Data Management

Cloudera Navigator captures a complete and immutable record of all system activity. An audit trail can be used to determine the particulars—the who, what, where, and when—of a data
breach or attempted breach. Auditing can be used to not only identify a rogue administrator who deleted user data, for example, but can also be used to recover data from a backup. Enterprises that
must prove they are in compliance with HIPAA (Health Insurance Portability and Accountability Act), PCI (Payment Card Industry Data Security Standard), or other regulations associated with sensitive
or personally identifiable data (PII) are required to produce auditing records when asked by government or other officials, such as banking regulators.

Auditing also serves to provide a historical record of data and context for data forensics. Data stewards and curators can use auditing reports to determine consumption and use patterns
across various data sets by different user communities, for optimizing data access.

Cloudera Navigator also lets administrators generate reports that list the HDFS access permissions granted to groups. Cloudera Navigator tracks access permissions and actual accesses
to all entities in HDFS, Hive, HBase, Impala, Sentry, and Solr, and the Cloudera Navigator Metadata Server itself to help answer questions such as - who has access to which entities, which entities
were accessed by a user, when was an entity accessed and by whom, what entities were accessed using a service, and which device was used to access. Cloudera Navigator auditing supports tracking
access to:

Data collected from these services also provides visibility into usage patterns for users, ability to see point-in-time permissions and how they have changed (leveraging Sentry), and review and
verify HDFS permissions. Cloudera Navigator also provides out-of-the-box integration with leading enterprise metadata, lineage, and SIEM applications.

Metadata Management

Cloudera Navigator features complete metadata storage and supports data discovery. It consolidates technical metadata for all cluster data and enables automatic tagging of data based on
the external sources entering the cluster. The consolidated metadata store is searchable through the Cloudera Navigator console, a web-based unified interface.

In addition, Cloudera Navigator supports user-defined metadata that can be applied to files, tables, and individual columns, to identify data assets for business context. The result is
that data stewards can devise appropriate classification schemes for specific business purposes and data is more easily discovered and located by users.

Furthermore, policies can be used to automatically classify and applying metadata to cluster data based on arrival, scheduled interval, or other trigger.

Lineage

Cloudera Navigator lineage is a visualization tool for tracing data and its transformations from upstream to downstream through the cluster. Lineage can show the transforms that produced
upstream data sources and the effect the data has on downstream artifacts, to the column level. Cloudera Navigator tracks lineage of HDFS files, datasets, and directories, Hive tables and columns,
MapReduce and YARN jobs, Hive queries, Impala queries, Pig scripts, Oozie workflows, Spark jobs, and Sqoop jobs.

Integration within the Enterprise

Monitoring and reporting are only part a subset of enterprise auditing infrastructure capabilities. Enterprise policies may mandate that audit data route be able to integrate with
existing enterprise SIEM (security information and event management) applications and other tools. Toward that end, Cloudera Navigator can deliver or export audit data through several mechanisms:

Using syslog as a mediator between raw-event stream generated by Hadoop cluster and the SIEM tools.

If this documentation includes code, including but not limited to, code examples, Cloudera makes this available to you under the terms of the Apache License, Version 2.0, including any required
notices. A copy of the Apache License Version 2.0 can be found here.