Dataguise customers span a broad range of industries — financial services, insurance, healthcare, government, technology, and retail — and include some of the world’s largest, industry-leading companies. We work with organizations that embrace the tremendous potential of big data and are committed to being responsible data stewards.

What business goals does Dataguise help enterprises achieve?

We help our customers achieve two increasingly critical business goals:

How does Dataguise work within Hadoop distribution security that is already in place?

Most customers in regulated markets already use existing Hadoop security capabilities: authentication and access control (Kerberos), file system and network isolation and segmentation (ACLs and network firewalls), file or volume encryption, and activity monitoring (logging, auditing, and data lineage). Dataguise fits into and enriches these existing systems in a simple, non-blocking manner by operating at the sensitive data element level (e.g., locking IDs or names).

Why is sensitive data detection so critical?

Businesses are rapidly adding new data sources to Hadoop analytics, including logging data, clickstream data, customer feedback, and sentiment data. As a result, the new data going into Hadoop is increasingly “gray”: it is harder to retain or maintain its structure, harder to cleanse, and harder to determine the location and amount of sensitive data within it. Sensitive data detection helps organizations understand and protect this new, co-mingled, raw, and noisy data inside Hadoop.

How does the Dataguise detection process work?

The Dataguise detection process begins by defining a security policy. Organizations select which sensitive elements they need to detect. The rest of the process is automated. Through agents for data ingest (Flume, Sqoop, FTP) as well as agents for at-rest data (HDFS, Hive, Pig), Dataguise analyzes all data, then filters and counts sensitive data elements in text, CSV, log, Avro, and SequenceFile formats, as well as common unstructured data formats (Word, Excel, PowerPoint, SMS, Email).

What data types can Dataguise encrypt?

Our encryption engine runs as an automated process (“agent”) for data loaders (FTP, Flume, Sqoop). We also support native field and row encryption inside an HDFS encryption agent. More generically, we have a JAR for invoking encryption and decryption and have built decryption UDFs for Pig, Hive, and MapReduce.

What distributions does Dataguise support?

Dataguise is certified on all three major Hadoop distributions: Cloudera, Hortonworks, and MapR. In addition, downloadable Sandbox trials are available on both the Hortonworks and MapR partner websites. We also have production Hadoop customers using DgSecure for Hadoop with Apache Hadoop, Amazon Elastic MapReduce, and Pivotal HD.

What is DgSecure detection?

DgSecure detection helps organizations identify, locate, and classify sensitive data by cataloging and summarizing it. Dataguise customers can determine which sensitive data types exist and where they reside in enterprise data sources including relational databases, file systems, and big data Hadoop platforms. Using purpose-built lightweight software agents that run directly against such data repositories or within data pipelining tools (FTP, Flume, Sqoop), DgSecure detection can provide a comprehensive inventory of sensitive data across the enterprise landscape. The discovery results are presented via intuitive dashboards and reports that detail sensitive data by location and type.

Which repositories and platforms can DgSecure scan for sensitive data?

DgSecure detection can scan for sensitive data elements across a wide variety of common enterprise data sources and automatically detect and handle common file formats across data stores, including:

Assist with forensic investigations or data breach preparedness at an enterprise by cataloging the potential universe of sensitive information vulnerabilities and by correlating access patterns to sensitive information with the extent and timing of the breach events.

Help improve compliance with local, national, and international privacy regulations in a practical, repeatable, and efficient manner.

Provide data categorization to help build suitable policies to secure private and sensitive data. This includes helping build the foundation to determine what data “needs” to be protected and how, and to whom it needs to be accessible in a de-protected form.

What types of sensitive data can DgSecure detection find?

Dataguise customers can search through structured, semi-structured, or unstructured content to find a variety of sensitive data elements, including credit cards, Social Security Numbers, names, addresses, medical IDs, ABA bank routing numbers, and financial codes. In addition to pre-defined templates for these sensitive data types, our customers can also extend and build their own custom sensitive data elements through a sophisticated regex builder.
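As a rough illustration of regex-driven detection with user-defined elements, the sketch below counts matches per sensitive data type in a block of text. It is not DgSecure's implementation or API: the pattern table, the function names (`luhn_valid`, `count_sensitive`), and the use of a Luhn checksum to filter out card-like numbers that are not valid card numbers are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical pattern table for illustration only; these are not
# DgSecure's pre-defined templates.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return len(digits) >= 13 and total % 10 == 0


def count_sensitive(text: str, patterns=PATTERNS) -> Counter:
    """Count matches per sensitive data type found in `text`."""
    counts = Counter()
    for name, pattern in patterns.items():
        for match in pattern.finditer(text):
            # Assumption: a checksum is used to reject digit runs that
            # only look like card numbers.
            if name == "credit_card" and not luhn_valid(match.group()):
                continue
            counts[name] += 1
    return counts
```

A custom element can be added by extending the table, for example `dict(PATTERNS, employee_id=re.compile(r"\bEMP-\d{6}\b"))`, where the `EMP-` identifier format is invented for the example.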

What is unique about DgSecure detection?

DgSecure detection is a highly scalable, resilient, fault-tolerant, and customizable enterprise-class service for identifying and summarizing sensitive data at the element level.

DgSecure detection capabilities:

Handles high volumes of disparate, constantly moving, and changing data.

Supports a fluid or flexible information governance model that has a mix of highly “invested” (curated) data as well as raw, unexplored (gray) data including IoT (Internet of Things) data, clickstreams, feeds, and logs.

Handles a variety of data stores such as traditional relational databases and enterprise data warehouses as well as non-relational big data sources (Hadoop) and file repositories (SharePoint and file shares).

Selecting options such as sample size and black/whitelist of sensitive items to use for scanning.

Choosing whether to re-execute the detection at pre-set intervals incrementally.

Once the scan is complete, DgSecure detection presents the user with a consolidated view of all the sensitive data elements of interest and their locations. Users can then drill down into specific areas and set up appropriate filtered views of their data. From the same interface, users can apply appropriate data protection to the discovered elements using DgSecure’s masking or encryption services.

Yes, DgSecure uses structural cues (e.g., schemas, indices, and file structures) and contextual cues wherever possible to speed up discovery. For example, it can determine the file type of each file in the scan path automatically and infer the structure of the files by using heuristics or a user-provided set of schemas and/or structure definitions to speed up matching during the scan. In addition, Dataguise is continually enhancing our DgSecure detection capability to leverage the full context of data. Examples include recognizing sensitive data when only partial information is available, using ontologies to disambiguate data, building Bayesian models to update the certainty of sensitive data found after processing more data, and processing unstructured text data using NLP integrated into the DgSecure detection engine.
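The idea of updating certainty as more data is processed can be sketched with a plain Bayes' rule update. This is a generic illustration, not DgSecure's model: the function name `update_certainty` and the two likelihood values (how often a value matches the pattern when the field really is sensitive, and how often it matches by chance) are assumed numbers chosen for the example.

```python
def update_certainty(prior: float, matched: bool,
                     p_match_if_sensitive: float = 0.95,
                     p_match_if_not: float = 0.05) -> float:
    """Return the posterior probability that a field holds sensitive data.

    `prior` is the current belief; `matched` says whether the newly
    processed value matched the sensitive-data pattern. Both likelihood
    parameters are illustrative assumptions.
    """
    if matched:
        like_sensitive = p_match_if_sensitive
        like_not = p_match_if_not
    else:
        like_sensitive = 1.0 - p_match_if_sensitive
        like_not = 1.0 - p_match_if_not
    evidence = like_sensitive * prior + like_not * (1.0 - prior)
    return like_sensitive * prior / evidence


def certainty_after(observations, prior: float = 0.5) -> float:
    """Fold a sequence of match/no-match observations into one belief."""
    belief = prior
    for matched in observations:
        belief = update_certainty(belief, matched)
    return belief
```

Starting from an uninformed prior of 0.5, each matching value raises the belief and each non-matching value lowers it, so the estimate sharpens as more rows are scanned.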

Does DgSecure detection support a distributed deployment and can it be scaled across distributed architectures?

Yes, DgSecure detection can be deployed in distributed environments. It is designed to leverage resources optimally in a multi-node or multi-host distributed deployment at scale. For example, DgSecure detection for Hadoop leverages distributed computing through an agent-based architecture that runs all discovery tasks natively as Java MapReduce jobs on the Hadoop cluster. DgSecure detection for databases also uses an agent-based architecture to run as a multi-threaded service across database instances within an enterprise.

How does this solution scale out? What are the performance impacts of data-centric discovery?

While discovery can be a resource-intensive operation, even in a distributed architecture, DgSecure detection can be tuned to fit within an enterprise’s infrastructure constraints. For example, the Hadoop HDFS agent can be throttled by limiting it to a certain number of map tasks. Our experience with large, global production deployments at customer sites shows that DgSecure detection scales with a low performance overhead of 5-10%. Dataguise is continually working to reduce this overhead and improve discovery performance.

How does DgSecure detection handle false positives and false negatives?

DgSecure detection uses three different techniques to minimize false positives/negatives: