Create Reports from Unstructured Data

In this Oceans of Data series article, I will share a tip on creating reports from unstructured data. Unstructured data refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured data makes up by far the majority of data in our glorious world. Email, invoices, inventory documents, government forms, saved report files: the list could go on and on.

Recently, while reviewing Datawatch Monarch data prep, I stumbled on an Equifax case study where they automated data extraction from unstructured documents to save time and money. Automating that previously tedious and “un-fun” work also improved data quality. Here is a link to the Equifax story that inspired me to test unstructured data extraction.

Monarch Auto Define

Although there are a variety of ways to extract unstructured data from files, one tried-and-true, fast and simple approach is to use Datawatch Monarch. Years ago I used this tool when building Department of Defense digital contract reporting projects. At that time, the process to define data regions and extract unstructured data required a bit of field mapping experimentation. With the latest version of Monarch Auto Define, that process is intelligently automated today.

Monarch’s Auto Define feature is essentially the easy button for what used to be a challenging task.

If you are copying and pasting values from unstructured documents into Excel for reporting, here is a better approach. You can automatically extract data from an Adobe Acrobat PDF file or another type of file for reporting in a few clicks.

Extracting Unstructured File Data

To get started, download a free trial of Datawatch Monarch. After installing Monarch, look in the file directory for the Invoices example file located at C:\Users\Public\Documents\Datawatch Monarch\Reports\Classic.pdf.

Launch Monarch and click the Data Prep Studio icon at the top.

Exit the Tutorial pop-up and choose Open Data. Then select PDF Report. Navigate to C:\Users\Public\Documents\Datawatch Monarch\Reports (Windows may display this folder as Public Documents), select Classic.pdf, and then click Open. You are then brought to the Report Discovery window. Scroll through the window and notice that there are many invoices in that single file, not just one.

Essentially, you can use this approach for one-off or bulk data processing from unstructured documents. If you have many unstructured files to process, you can automate or schedule these steps. For more information on that option, check out Datawatch Server Automator.
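To make the "many invoices in one file" idea concrete, here is a minimal Python sketch of splitting one report into per-invoice chunks. It is not how Monarch works internally. The sample text, the `INVOICE NO:` marker, and the `split_invoices` helper are all hypothetical and only stand in for text you might get out of a PDF.

```python
import re

# Hypothetical text standing in for the output of a PDF-to-text step.
# This is NOT the contents of Classic.pdf.
SAMPLE_REPORT = """\
INVOICE NO: 1001
Customer: Acme Music
Total: 120.50
INVOICE NO: 1002
Customer: Blue Note Records
Total: 75.00
INVOICE NO: 1003
Customer: Coda Audio
Total: 310.25
"""

# Assumed marker that begins each invoice in the sample text.
INVOICE_START = re.compile(r"^INVOICE NO:", re.MULTILINE)


def split_invoices(report_text):
    """Split one report into per-invoice chunks at each start marker."""
    starts = [m.start() for m in INVOICE_START.finditer(report_text)]
    # Each chunk ends where the next one starts; the last ends at end of text.
    ends = starts[1:] + [len(report_text)]
    return [report_text[s:e].strip() for s, e in zip(starts, ends)]


invoices = split_invoices(SAMPLE_REPORT)
print(len(invoices), "invoices found")
```

The same function works whether the file holds one invoice or hundreds, which is the point of treating single-document and bulk processing the same way.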

Now click the Auto Define button on the toolbar and watch how Monarch automagically, intelligently finds, defines, maps and extracts the unstructured data in the document for you. Alternatively, you can define each column individually by double-clicking on a field in the top window. Data Prep Studio will create a new column for each field that you define, and populate that column with similar field values.
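If you are curious what "define a field, get a column" looks like in code, here is a rough Python sketch using regular expressions. This is my own illustration, not Monarch's algorithm. The sample text, the field labels, and the `extract_rows` helper are hypothetical, and a real report would need patterns written for its own layout.

```python
import re

# Hypothetical report text; NOT the contents of Classic.pdf.
SAMPLE = """\
INVOICE NO: 1001
Customer: Acme Music
Total: 120.50
INVOICE NO: 1002
Customer: Blue Note Records
Total: 75.00
"""

# One pattern per field. Each pattern plays the role of a column definition.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"^INVOICE NO:\s*(\d+)\s*$"),
    "customer": re.compile(r"^Customer:\s*(.+?)\s*$"),
    "total": re.compile(r"^Total:\s*([\d.]+)\s*$"),
}


def extract_rows(text):
    """Turn labeled lines into one dict (row) per invoice."""
    rows = []
    current = None
    for line in text.splitlines():
        for name, pattern in FIELD_PATTERNS.items():
            match = pattern.match(line)
            if not match:
                continue
            if name == "invoice_no":
                # An invoice number line starts a new row.
                current = {}
                rows.append(current)
            if current is not None:
                current[name] = match.group(1)
            break
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

Every line that matches a pattern lands in the same column, which mirrors the idea of a defined field being populated with similar values across the whole report.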

Select Open in Data Prep Studio to complete this step and then click Preview Data. Here you can optionally make changes, blend other data sources, and so on. For my test, I merely wanted to export the data for reporting.

To export data, click the Load Selected Tables button and then Export Data.

Now pick your desired data export destination and you are good to go. You can analyze the automatically extracted data to your heart's content in your BI tool of choice. Here is a peek at my exports to Excel and Tableau TDE.
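For readers who end up with extracted rows in a script rather than in Monarch, here is a small Python sketch of writing them out as CSV, a format that Excel and most BI tools can open. The rows and the `rows_to_csv` helper are hypothetical examples, not output from Monarch or from Classic.pdf.

```python
import csv
import io

# Hypothetical extracted rows; NOT real data from Classic.pdf.
ROWS = [
    {"invoice_no": "1001", "customer": "Acme Music", "total": "120.50"},
    {"invoice_no": "1002", "customer": "Blue Note Records, Inc.", "total": "75.00"},
]


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(ROWS, ["invoice_no", "customer", "total"])
print(csv_text)
```

The csv module quotes values that contain commas, so a customer name like the second one above survives the round trip intact. To save a file instead of a string, the same writer can be pointed at a file opened with `newline=""`.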

Last but not least, I built a lovely Tableau report from this previously unstructured, unusable, dark data in a matter of two minutes.

Jen Underwood is a Senior Director at DataRobot and founder of Impact Analytix, LLC. She has a unique blend of product management and “hands-on” experience in data warehousing, reporting, visualization, and advanced analytics. In addition to keeping a constant pulse on industry trends, she enjoys digging into oceans of data to solve complex problems with machine learning.
Over the past 20 years, Jen has held worldwide product management roles at Microsoft and served as a technical lead for system implementation firms. She has experience launching new products and turning around failed projects. Most recently she provided advisory, strategy, educational content development, and marketing services to 100+ technology vendors through her own firm. She has been mentioned by KD Nuggets, Information Management and Forbes for her work. She also has written for InformationWeek, O’Reilly Media, and numerous other tech industry publications.
Jen has a Bachelor of Business Administration – Marketing, Cum Laude from the University of Wisconsin, Milwaukee and a post-graduate certificate in Computer Science – Data Mining from the University of California, San Diego. She was also honored to be a former IBM Analytics Insider, Tableau Zen Master, and Top 10 Women Influencer.