What is data profiling and how does it make big data easier?

Data profiling, the act of monitoring and cleansing data, is an important tool organizations can use to make better data decisions. Learn how it helps with data problems – big and small.

In our increasingly connected world, the amount of data – and the sources of this data – continue to rise. Data profiling is an often-visual assessment that uses a toolbox of business rules and analytical algorithms to discover, understand and potentially expose inconsistencies in your data. This knowledge is then used to improve existing data sets and systems as an important part of monitoring and improving the health of these newer, bigger data sets.

The need for data profiling is only going to grow. Companies must interact with increasingly diverse and intimidatingly large sets of data from different sources like blogs, social media and emerging big data technologies like Hadoop. In the industrial world, the Internet of Things introduces a multitude of devices generating data, while organizations can access data from biometrics and human-generated sources like email and electronic medical records.

The amount of data is only one side of the equation. Data that isn’t formatted right, standardized or correctly integrated with the rest of the database can cause delays and problems that lead to missed opportunities, confused customers and bad decisions.

Data profiling helps you to get ahead of these issues. By ensuring that you run a diagnosis and examine the data that you have, you can proactively create a plan to fix many of your data problems before they can affect your organization.

Learn how a data management platform solves big data challenges

Most organizations have spent the last decade acquiring data integration tools to profile, manage, govern and utilize data. This white paper describes the evolution of data integration tools and the benefits that can be achieved with a comprehensive data management platform.

Why do you need data profiling?

Data profiling helps you discover, understand and organize your data. It should be an essential part of how your organization handles its data for several reasons.

First, data profiling helps cover the basics with your data, verifying that the information in your tables matches the descriptions. Then it can help you better understand your data by revealing the relationships that span different databases, source applications or tables.

Beyond just uncovering hidden nuggets of information buried within your own data, data profiling helps you ensure that your data is up to standard statistical measures, as well as business rules specific to your company. For example, a state column might use a combination of both two-letter codes and the fully spelled out (sometimes incorrectly) name of the state. Profiling would uncover this inconsistency and inform the creation of a standardization rule that could make them all consistent, two-letter codes.

What are the different kinds of data profiling?

Many of the data profiling techniques or processes used today fall into three major categories: structure discovery, content discovery and relationship discovery.

Structure discovery, also known as structure analysis, validates that the data that you have is consistent and formatted correctly. There are several different processes that you can use for this, such as pattern matching. For example, if you have a data set of phone numbers, pattern matching helps you find the valid sets of formats within the data set. Pattern matching also helps you understand whether a field is text- or number-based along with other format-specific information.

Structure discovery also examines simple basic statistics in the data. By using statistics like the minimum and maximum values, means, medians, modes and standard deviations, you can gain insight into the validity of the data.

Content discovery is the process of looking more closely into the individual elements of the database. This can help you find areas that contain null values or values that are incorrect or ambiguous.

Many data management tasks start with an accounting for all the inconsistent and ambiguous entries in your data sets. The standardization process in content discovery plays a major role in fixing these little problems. For example, finding and correcting your data to fit street addresses into the correct format is an essential part of this step. The potential problems that could arise from non-standard data, like being unable to reach customers via mail because the data set includes incorrectly formatted addresses, are costly and can be addressed early in the data management process.

Finally, relationship discovery involves discovering what data is in use and trying to gain a better understanding of the connections between the data sets. This process starts with metadata analysis to determine key relationships between the data and narrows down the connections between specific fields, particularly where the data overlaps. This process can help cut down on some of the problems that arise when data sets are not aligned.

Data profiling can be used to troubleshoot problems within even the biggest data sets.

What are some data profiling techniques?

There are four general methods by which data profiling tools accomplish their tasks: column profiling, cross-column profiling, cross-table profiling and data rule validation.

Column profiling scans through a table and counts the number of times each value shows up within each column. This method can be useful to find frequency distribution and patterns within a column of data.

Cross-column profiling is made up of two processes: key analysis and dependency analysis. Key analysis examines collections of attribute values by scouting for a possible primary key. Dependency analysis is a more complex process that determines whether there are relationships or structures embedded in a data set. Both techniques help analyze dependencies among data attributes within the same table.

Cross-table profiling uses foreign key analysis, which is the identification of orphaned records and determination of semantic and syntactic differences, to examine the relationships of column sets in different tables. This can help cut down on redundancy but also identify data value sets that could be mapped together.

Finally, data rule validation uses data profiling in a proactive manner to verify that data instances and data sets conform with predefined rules. This process can be achieved either through batch validation or an ongoing validation service.

What are some examples of data profiling in use today?

Data profiling can be used to troubleshoot problems within even the biggest data sets by first examining metadata. For example, by using SAS® metadata and profiling tools with Hadoop, you can troubleshoot and fix problems within the data to find the types of data that can best contribute to new business ideas.

In SAS Data Loader for Hadoop, you can profile Hadoop data sets using a visual interface and store the results in a report. The solution provides data quality metrics, descriptive measures, metadata measures and other charts to help you understand your data.

These tools can have real-world impacts. For example, the Texas Parks and Wildlife Department used SAS Data Management and its data profiling features to improve the customer experience. The data profiling tools helped identify incorrect spelling and address standardization and geocoding attributes within data sets. This information was used to help improve the quality of customer data, creating a better experience for Texans using the millions of acres of park lands and waterways available to them.

Data profiling is a critical component of implementing a data strategy, and informs the creation of data quality rules that can be used to monitor and cleanse your data. Organizations can make better decisions with data they can trust, and profiling is an essential first step on this journey.