As promised, I wanted to share some of the options I have been evaluating for self-service ETL, data quality, and cleansing. In this post I will briefly highlight the Microsoft Data Explorer Excel add-in, Data Wrangler, LavaStorm Analytics, Alteryx, and Microsoft Data Quality Services. There are actually quite a few of these tools out there! A great list of more options is available on KDnuggets, my favorite data mining site. Data prep is probably the most important aspect of building a great predictive or analytic data model.

I can’t talk about self-service ETL without mentioning Microsoft Access, Excel queries, and VBA. Back in the day, we used Access linked tables to various data sources, coded merge queries, Excel queries, and VBA subprocedures, and jumped through all sorts of hoops. It took forever to build what takes a couple of minutes today. In Excel we were quite limited in how much data we could analyze, so we had to use Access, FileMaker Pro, FoxPro, or something similar (oh, I don’t miss those days). When I see Access being used like this today, I cringe, since there are so much better, far more efficient ways to accomplish the same, or in most cases even more powerful, analytics with free or low-cost tools. It would be interesting to know whether the younger Millennial generation is using Access, how they are using it, and if it really has become old school. I do know that VBA’s future is “sketchy.” No one is talking about the future of VBA in the cloud-based Office 365. There is no VBA support in Office 365 or SharePoint-published solutions. You can open an Excel document containing VBA in the web browser, but you won’t be able to run any of the VBA code there; today you can only run it in the desktop version of Excel 2013. It is no secret that Microsoft is moving toward cloud-based “services and devices” rather than on-premises applications on your desktop. To be future-proof, start rethinking and replacing your VBA analytic solutions with the “Napa” Office Apps HTML/JavaScript framework or another web-based analytic solution built on modern, industry-standard, supported technologies. If you are going to the PASS BA Conference (aka the Excel Conference), someone please ask about VBA futures and about VBA-to-Napa code migration utilities or plans, if any, for the cloud world Microsoft is pushing customers toward.
If anyone hears an answer to this open question, which will undoubtedly impact millions of Excel spreadsheets with embedded VBA, please let me know and I will post the update to this blog post.

Lately there has been a lot of good Microsoft Excel MVP buzz about the Data Explorer Preview, an add-in for Excel 2010 and 2013. Data Explorer seems to be a Microsoft lightweight version of an Oracle Endeca-like offering. Data Explorer pricing and 2013 version support are not yet known; the preview is free. After Rob Collie’s Excel 2013 Power Pivot rant, I wanted to be clear that Data Explorer pricing and version details are totally unknown to me right now. (Digression: also, Microsoft is not evil – wow, Rob’s blog post caused a huge uproar. Microsoft is not a charity – they just happen to do a lot of charitable things and build low-cost, less feature-rich, for-the-masses tools. Microsoft is a great company that makes a positive difference in the world. Yes, I am biased.) Data Explorer is absolutely worth checking out if you want easy, self-service ETL functionality within Excel. It has a simple, non-technical wizard interface for loading, combining, and transforming both structured and unstructured data into Excel, and the learning curve is minimal. The Data Explorer team did a great job of covering common data loading, data prep, and tweaking tasks for the most popular data and web data sources. They even have a social media connector and a big data connector. Notably, instead of using SQL to query and tweak data, they introduced a new language called “M” with Data Explorer. “M” seems pretty powerful; you can read up on it in the M docs. Additional resources for ramping up on Data Explorer include an introductory tutorial and a more advanced one on combining data sources.
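To make the “load, combine, and transform” idea concrete, here is a hypothetical sketch in plain Python (not “M” syntax, and not generated by Data Explorer) of the kind of step the wizard builds for you: join two small sources on a key and derive a cleaned numeric column.

```python
# Hypothetical illustration of a self-service ETL step: join two
# sources on a key and normalize a messy text field into a number.
# The sample data below is invented for this sketch.

sales = [
    {"region_id": 1, "amount": "1,200"},
    {"region_id": 2, "amount": "950"},
]
regions = {1: "East", 2: "West"}

def combine_and_clean(rows, lookup):
    """Join rows to a lookup table and normalize the amount field."""
    out = []
    for row in rows:
        out.append({
            "region": lookup[row["region_id"]],
            # strip thousands separators before converting to a number
            "amount": float(row["amount"].replace(",", "")),
        })
    return out

result = combine_and_clean(sales, regions)
print(result)
```

In Data Explorer you would click through the wizard to express the same join and clean-up, and the tool records the equivalent “M” expression behind the scenes.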

The next tool I wanted to cover is the free Data Wrangler. This tool, currently in an alpha phase, was totally new to me in 2013. Data Wrangler allows interactive transformation and pivoting of messy, real-world data into the cleanly formatted data tables most analysis tools need. The Data Wrangler video is a must-watch to understand what it can do to load, clean, and prep data. Data Wrangler can also export the visually created ETL transformations as scripts for use in other tools. Script export can be useful for handling large data sets: first transform a sample of your data in the Data Wrangler visual interface, then run the resulting script on the full data set in SSIS, Informatica, LavaStorm, Alteryx, or another enterprise-scale ETL tool. Data Wrangler currently supports output scripts in two languages: Python (for data crunching on the back end) and JavaScript (for data transforms in the web browser, or using node.js). Data Wrangler has been tested with sources containing millions of rows using this exported-scripts technique.
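The sample-then-full-dataset workflow described above can be sketched as follows. This is a hypothetical Python stand-in for an exported Wrangler script, with invented sample data; the point is that the exact same transform validated interactively on a small slice is then reused over the whole source.

```python
# Hypothetical stand-in for a Data Wrangler exported script: a single
# transform function, developed against a small sample, then run over
# the full data set. All data below is invented for this sketch.

def clean_record(line):
    """Split a raw semicolon-delimited line and trim whitespace."""
    name, value = line.split(";")
    return {"name": name.strip(), "value": int(value.strip())}

sample = ["alice ; 10", "bob ; 20"]          # small slice for interactive work
full_data = sample + ["carol ; 30"] * 1000   # stand-in for millions of rows

# Validate the transform on the sample first...
assert [clean_record(r)["value"] for r in sample] == [10, 20]

# ...then run the same "exported script" over everything.
cleaned = [clean_record(r) for r in full_data]
print(len(cleaned))
```

In practice the exported Python or JavaScript would be dropped into whatever batch pipeline already processes the full source.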

Another tool in this self-service ETL space is LavaStorm Analytics. I really liked what this group offers, but I confess that here we are closer to a power-user or enterprise ETL developer experience. Still, the LavaStorm Analytics user interface is much simpler than the enterprise ETL tools I have used before. The introductory YouTube video quickly gave me enough understanding of its capabilities to begin building ETL solutions the same day. I liked the richness of the LavaStorm Analytics Engine functionality, the ETL component reusability, the pre-packaged analytic libraries for incorporating predictive analytic models and statistical data analysis, and the out-of-the-box R task/R integration. You do need to hand-script the R code in the R task right now (bummer), but I am hopeful that will improve in the future. The ETL component reusability is a big plus, since analytic routines can be complex and are often shared across numerous analytic projects. I am not doing this great self-service ETL tool justice in a one-paragraph overview. Download it, watch the free online video training, and play with it yourself. They have a free desktop version, several levels of paid desktop versions, and server versions starting at $4,500. They also offer related products for specific analytic use cases like fraud, spend, and optimization.

Alteryx seems to be the “king” of the self-service ETL tools I reviewed, and it also had the highest pricing at roughly $45,000 per seat and $15,000 for the personal edition. Alteryx was a highlighted vendor in the 2013 Gartner BI Magic Quadrant report. Alteryx is extremely feature rich and comes with embedded, prepackaged industry data sources – MOSAIC profiles, TomTom geospatial data, Dun & Bradstreet, Experian, and other geographic, demographic, and business data included with the purchase. So the high price is really a bundled price for both the industry analytic data sources and the self-service ETL tool. Alteryx has bulk data loaders for Salesforce, Amazon S3, SharePoint, Teradata, Oracle, MongoDB, Hadoop Hive, and many other structured and unstructured data sources. They offer a read-write, hosted analytics application wrapper framework on their website. They also have nice R integration that did not require any R coding, plus an R code task option if you want or need R code within your self-service ETL flows. I was not able to test the R functions myself because of a version conflict with my local open source R installation. Alteryx’s predictive product manager was kind enough to walk me through the R features in depth, and shared that they are working on a vNext iteration that will support side-by-side installs of various R flavors. What I felt were true strengths of Alteryx are its deep location/geospatial analytics and the prepackaged industry data sources for mash-ups. Common but powerful analytic tasks, such as computing the geocodes to plot 15-minute drive times from a specific store location on a map, are very hard to geocode and then visualize with other tools; in Alteryx it is easy. They also just added a Tableau TDE destination that lets Alteryx ETL output be sent straight to Tableau for deep visual, interactive analytics – see the image above for the type of analytic solution this awesome combination empowers.
Alteryx seems to be used by larger enterprise analytic groups – it is not for the occasional one-off ETL project, or even for pure ETL. If you just need ETL, there are much cheaper ETL tools like LavaStorm Analytics, SSIS, or Pentaho.

Last but not least, if you just want data cleansing and your company already owns SQL Server 2012, don’t forget about SQL Server Data Quality Services (DQS) and the Data Quality Services add-in for Excel. I am going to copy the description directly from the official overview: “SQL Server Data Quality Services (DQS) is a knowledge-driven data quality product. DQS enables you to build a knowledge base and use it to perform a variety of critical data quality tasks, including correction, enrichment, standardization, and de-duplication of your data. DQS enables you to perform data cleansing by using cloud-based reference data services provided by reference data providers. You can also perform data quality processes by using the DQS Cleansing component in Integration Services and Master Data Services (MDS) data quality functionality, both of which are based on DQS.” Now, this option is not really self-service ETL, but I felt the need to mention it here since many self-service ETL projects really are basic data cleansing projects where DQS fits very well. There are tons of free videos and resources to learn how to get started with DQS.
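The correction, standardization, and de-duplication tasks in that quote can be sketched in a few lines of plain Python. This is only a hypothetical illustration of the concept – in DQS you build the knowledge base through its tooling rather than writing code, and the mappings below are invented.

```python
# Hypothetical sketch of knowledge-base-driven cleansing: standardize
# values using a small "knowledge base" of corrections, then drop the
# duplicates that emerge once records are standardized.

knowledge_base = {"WI": "Wisconsin", "Wisc.": "Wisconsin", "CA": "California"}

records = [
    {"customer": "Acme", "state": "WI"},
    {"customer": "Acme", "state": "Wisc."},   # duplicate after cleansing
    {"customer": "Globex", "state": "CA"},
]

def cleanse(rows, kb):
    """Standardize values via the knowledge base, then de-duplicate."""
    seen, out = set(), []
    for row in rows:
        fixed = {**row, "state": kb.get(row["state"], row["state"])}
        key = tuple(sorted(fixed.items()))
        if key not in seen:
            seen.add(key)
            out.append(fixed)
    return out

clean = cleanse(records, knowledge_base)
print(clean)
```

The payoff of the knowledge-base approach is that the same corrections are reused across every project that touches the data, which is exactly why DQS maps so well onto the “self-service ETL projects that are really cleansing projects” case.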

Jen Underwood is a Senior Director at DataRobot and founder of Impact Analytix, LLC. She has a unique blend of product management and “hands-on” experience in data warehousing, reporting, visualization, and advanced analytics. In addition to keeping a constant pulse on industry trends, she enjoys digging into oceans of data to solve complex problems with machine learning.
Over the past 20 years, Jen has held worldwide product management roles at Microsoft and served as a technical lead for system implementation firms. She has experience launching new products and turning around failed projects. Most recently she provided advisory, strategy, educational content development, and marketing services to 100+ technology vendors through her own firm. She has been mentioned by KD Nuggets, Information Management and Forbes for her work. She also has written for InformationWeek, O’Reilly Media, and numerous other tech industry publications.
Jen has a Bachelor of Business Administration – Marketing, Cum Laude from the University of Wisconsin, Milwaukee and a post-graduate certificate in Computer Science – Data Mining from the University of California, San Diego. She was also honored to be a former IBM Analytics Insider, Tableau Zen Master, and Top 10 Women Influencer.