DataRobot Automated Machine Learning

DataRobot is the world’s most advanced automated machine learning platform. It empowers data analysts and data scientists to rapidly surface key insights and hidden data patterns, and to make better predictions faster. With unmatched ease of use – no complicated math or scripting required – DataRobot automates the training and evaluation of numerous predictive models in parallel, delivering more accurate predictions and easier model deployment at scale. In this article, I will briefly introduce the DataRobot automated machine learning solution and walk through a tour of it.

Introducing DataRobot

I have been following DataRobot for several years now. To be completely transparent with you, DataRobot is one of the most impressive solutions that I have reviewed in a long time. The DataRobot automated machine learning platform expedites predictive model building, training, evaluation, and deployment. Using drag-and-drop, point-and-click guided menu options, users of all data science experience levels can build predictive models simply and quickly with automated machine learning.

Machine learning life-cycle steps that used to take me weeks or months of effort can now be completed in hours.

What makes DataRobot truly unique is the baked-in model blueprints and best practices that were designed by some of the world’s leading data scientists. DataRobot’s built-in optimizations and safeguards allow analytical talent of all skill levels – from business analysts to highly experienced data scientists – to safely apply machine learning models to properly prepared data.

Domain knowledge and best practices designed by the world’s leading data scientists have been uniquely baked in.

DataRobot’s extensive library of algorithms is also quite impressive. The platform supports popular advanced machine learning techniques and open source tools such as Apache Spark, H2O, Scala, Python, R, and TensorFlow. DataRobot streamlines model development by performing a parallel heuristic search for the best model or ensemble of models based on the characteristics of the data and the prediction target. By cost-effectively evaluating a near-infinite number of combinations of data transformations, features, algorithms, and tuning parameters in parallel across a large cluster of servers, DataRobot delivers the best predictive model in the shortest amount of time.

Take a Tour

To get started with DataRobot, you will log in and load a prepared dataset. To learn how to properly prepare data for DataRobot, please refer to this article, webinar, and complimentary white paper on that topic. Although DataRobot has some data cleansing, preparation, and transformation capabilities, a dedicated data wrangling tool is usually recommended for advanced data preparation.

Loading and Profiling Data

DataRobot currently supports uploading csv, tsv, dsv, xls, xlsx, sas7bdat, bz2, gz, zip, tar, and tgz file types, and reading data from a variety of enterprise databases via JDBC database connectivity. Directly loading data from production databases for model building allows you to quickly train and retrain models. It also eliminates the need to export data to a file for ingestion into DataRobot.

After you load your data, DataRobot performs exploratory data analysis, detecting the data types and showing, for each feature, the number of unique and missing values along with the mean, median, standard deviation, minimum, and maximum. This information is helpful for getting a sense of the dataset’s shape and distribution.
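DataRobot computes this profile automatically on upload, but the same style of summary can be sketched in a few lines of pandas. The toy dataset below is hypothetical, purely to illustrate the statistics the platform reports:

```python
import pandas as pd

# Hypothetical toy dataset standing in for an uploaded file
df = pd.DataFrame({
    "Age": [25, 34, 41, None, 29, 52],
    "Commute Distance": [3.2, 11.5, 7.8, 4.1, None, 9.9],
})

# Per-feature profile similar to what DataRobot displays after ingestion
profile = pd.DataFrame({
    "unique": df.nunique(),     # distinct non-missing values
    "missing": df.isna().sum(), # count of missing values
    "mean": df.mean(),
    "median": df.median(),
    "std": df.std(),
    "min": df.min(),
    "max": df.max(),
})
print(profile)
```

Scanning a table like this before modeling is a quick way to spot columns with heavy missingness or suspicious ranges.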

Selecting a Prediction Target

Next, you will select a prediction target (what you are trying to predict) from the uploaded dataset and click the big “Start” button to begin training models in Autopilot mode. Note: if you have dates in your dataset, you might also see time-aware modeling settings.

If you want to customize the model building process, you can modify a variety of advanced parameters, optimization metrics, feature lists, transformations, partitioning and sampling options with the Show Advanced Options link. For more control over which models DataRobot runs, there are manual and quick-run options.

Once the modeling process begins in DataRobot, the platform further analyzes the data to create an Importance column. This Importance grading provides a quick cue to better understand the most influential variables for your chosen prediction target.

On this screen, visual plots reveal relationships between each feature and the target variable. There are also options to drill down on variables to view distributions, add features, and apply basic transformations.

Reviewing Automated Modeling Results

DataRobot’s autopilot searches through hundreds or thousands of possible combinations of algorithms, pre-processing steps, features, transformations, and tuning parameters. It then uses supervised learning algorithms to analyze the data and identify predictive relationships. Autopilot is ideal for smart data exploration, finding key influencing variables and patterns. After it completes, you will be shown a Leaderboard of top-ranking predictive models you can explore further.

To examine the ranked predictive models, you click on a model name and are shown a variety of options to Understand, Describe, Evaluate, and Predict. Popular exploratory capabilities here include the Feature Impact rankings, Model X-Ray, Prediction Explanations, and Word Cloud. All of these help you understand what drives a model’s predictions.

Feature Impact measures how much each feature contributes to the overall accuracy and predictions of the model (e.g., values in the Age and Commute Distance columns have a significant effect on whether an individual will purchase a bike). Feature Impact highlights which columns you should explore further. This information alone can be valuable in guiding an organization to focus on what matters most.
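A conceptually similar measure to Feature Impact is permutation importance: shuffle one feature at a time and see how much model accuracy drops. The sketch below uses scikit-learn on synthetic data as an illustration of the idea, not DataRobot’s actual algorithm:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a bike-purchase dataset: only 2 of 5 features matter
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure the resulting drop in accuracy
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: impact {imp:.4f}")
```

Features whose shuffling barely hurts accuracy contribute little to the model’s predictions, which is exactly the intuition behind an impact ranking.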

The Model X-Ray chart displays more detail on a per-feature basis – a feature’s effect on the overall prediction – depicting how a model “understands” the relationship between each variable and the target. It surfaces the specific values within each column that are likely large factors in determining whether someone will purchase a bike. This information is also useful for understanding where the model makes errors and for tuning inputs.

Diving deeper, DataRobot’s Insights tab provides more graphical representations of your model. There are tree-based variable rankings, variable effects to illustrate the magnitude and direction of a feature’s effect on a model’s predictions, hotspots, anomaly detection, text mining charts, and a word cloud of keyword relevancy.

The Word Cloud tab provides a graphic of the most relevant words and short phrases in a word-cloud format. The tab is only available for models trained with data that contains unstructured text. Here is an example from a different, healthcare-related dataset.

Model Blueprint

In the Describe tab, you can view the end-to-end model blueprint containing details of the specific tasks and algorithms DataRobot uses to run the model. The blueprint links out to detailed model documentation that facilitates knowledge sharing and training. You can also review the size of the model and how long it ran.

Model Evaluation

After you build a set of models, you can then evaluate and select which one is best to use for prediction. You can refer to the model Leaderboard to view a ranked list of models with summary performance information, charts, graphs, and functions. To estimate possible model performance, the Evaluate options include the industry-standard Lift Chart, ROC Curve, Accuracy over Time, Confusion Matrix, and Advanced Tuning. There are also options for measuring models by Learning Curves, Speed versus Accuracy, and Comparisons. The interactive charts used to evaluate models are very detailed, but they don’t require a steep learning curve to understand what they convey. Business analysts and citizen data scientists will be able to easily figure out which model should perform best for a given use case.
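To make the lift chart concrete: it sorts records by predicted score, bins them (typically into deciles), and compares the average actual outcome per bin. A toy sketch of that computation, not DataRobot’s chart code:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
actual = rng.integers(0, 2, size=1000)               # toy binary outcomes
predicted = actual * 0.5 + rng.random(1000) * 0.5    # scores correlated with outcome

# Decile bins by predicted score; for a good model the average actual
# outcome rises steadily from the lowest bin to the highest
bins = pd.qcut(predicted, 10, labels=False, duplicates="drop")
lift = pd.DataFrame({"actual": actual, "bin": bins}).groupby("bin")["actual"].mean()
print(lift)
```

A model with no predictive power would show a flat line across bins; the steeper the rise, the more useful the model is for targeting the top-scored records.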

DataRobot’s model evaluation and validation features help assess model accuracy. Several industry-standard methods are available for validating models, including, but not limited to, Train-Validation-Holdout and k-fold cross-validation.
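For readers who have not worked with these schemes outside an automated platform, here is what the two look like in a generic scikit-learn sketch (synthetic data; DataRobot’s own partitioning is configured in its Advanced Options, not via this code):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=600, random_state=42)

# Train-Validation-Holdout: two successive splits; the holdout is
# touched only once, for the final accuracy estimate
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", model.score(X_valid, y_valid))

# 5-fold cross-validation on the non-holdout portion: each record is
# used for validation exactly once across the five folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X_rest, y_rest, cv=5)
print("CV mean accuracy:", scores.mean())
```

Cross-validation gives a more stable accuracy estimate on smaller datasets, at the cost of training the model once per fold.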

Model Predictions

You can immediately put your DataRobot model findings to work with Predict options. Here you can upload a new dataset to DataRobot to be scored and downloaded. You also have an option to download all the DataRobot charts if you want to create a presentation or report of your findings.

Actionable DataRobot output can be used for exploration, making decisions, creating presentations, or integrating predictions.

Every model built in DataRobot is immediately ready for deployment. DataRobot API options allow you to integrate predictions into apps, reports, or business processes. There are also options to export scoring code for applications where API scoring is not an option.

Regulatory Compliance

DataRobot can automatically generate model documentation – a detailed report containing an overview of the model development process, with full insight into the model assumptions, limitations, performance and validation detail. This feature is ideal for organizations in highly regulated industries that have compliance teams that need to review all aspects of a model before it can be put into production. Of course, having this degree of transparency into a model has clear benefits for organizations in any industry.

Custom Models with Jupyter

Although DataRobot builds hundreds of predictive models “out of the box” using a vast set of diverse, best-in-class algorithms, there may be times when you want to test your own custom Python or R models in DataRobot. To use custom models with DataRobot, Jupyter Notebook integration is available.

User-built models get added into the Leaderboard rankings so you can see how they compare to other DataRobot-built models.

For More Information

In this week’s Solution Review, I have barely scratched the surface of DataRobot’s capabilities. There is so much more for you to explore. If you would like to learn more about automated machine learning, please review the following recommended resources or contact a DataRobot expert.

Jen Underwood is a Senior Director at DataRobot and founder of Impact Analytix, LLC. She has a unique blend of product management and “hands-on” experience in data warehousing, reporting, visualization, and advanced analytics. In addition to keeping a constant pulse on industry trends, she enjoys digging into oceans of data to solve complex problems with machine learning.
Over the past 20 years, Jen has held worldwide product management roles at Microsoft and served as a technical lead for system implementation firms. She has experience launching new products and turning around failed projects. Most recently she provided advisory, strategy, educational content development, and marketing services to 100+ technology vendors through her own firm. She has been mentioned by KDnuggets, Information Management, and Forbes for her work. She also has written for InformationWeek, O’Reilly Media, and numerous other tech industry publications.
Jen has a Bachelor of Business Administration – Marketing, Cum Laude from the University of Wisconsin, Milwaukee and a post-graduate certificate in Computer Science – Data Mining from the University of California, San Diego. She was also honored to be a former IBM Analytics Insider, Tableau Zen Master, and Top 10 Women Influencer.