According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyable data science task. DataExplorer is one of the many resources that address this, and its sole mission is to minimize the 80% of time this work is said to take and to make it enjoyable. As a result, one fundamental design principle is to be extremely user-friendly. Most of the time, one function call is all you need.

Data manipulation is powered by data.table, so tasks involving big datasets usually complete in a few seconds. In addition, the package is flexible about input data classes, so you should be able to throw in any data.frame-like object. However, certain functions require a data.table object as input because they rely on its update-by-reference feature, which I will cover in a later part of the post.

Enough said. Let's look at some code, shall we?

Take the BostonHousing dataset from the mlbench library:

library(mlbench)
data("BostonHousing", package = "mlbench")

Initial Visualization

Without knowing anything about the data, my first 3 tasks are almost always:

library(DataExplorer)
plot_missing(BostonHousing) ## Are there missing values, and what is the missing data profile?
plot_bar(BostonHousing) ## What does the categorical frequency for each discrete variable look like?
plot_histogram(BostonHousing) ## What is the distribution of each continuous variable?

While there are not many interesting insights from plot_missing and plot_bar, below is the output from plot_histogram.

Upon scrutiny, the variable rad looks discrete, and I want to group crim, zn, indus and b into bins as well. Let's do so:
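The snippet for this step is not reproduced here, so what follows is only a minimal sketch of one way to do it. It assumes base R's cut() with two equal-width bins per variable; the bin count and the `_d` suffix on the new column names are illustrative choices, not something the text specifies.

```r
library(mlbench)
library(DataExplorer)
## Load the data only if it is not already in the session
if (!exists("BostonHousing")) data("BostonHousing", package = "mlbench")

## Treat `rad` as a discrete variable
BostonHousing$rad <- as.factor(BostonHousing$rad)

## Bin each continuous variable into 2 equal-width intervals (assumed bin count)
## and store the result in a new `<name>_d` column (hypothetical naming)
for (col in c("crim", "zn", "indus", "b")) {
  BostonHousing[[paste0(col, "_d")]] <- cut(BostonHousing[[col]], breaks = 2)
}

## Re-plot the bar charts, now including `rad` and the binned variables
plot_bar(BostonHousing)
```

Keeping the original columns and adding binned copies means the continuous versions remain available for later plots.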

At this point, we have a much better understanding of the data distribution. Now assume we are interested in medv (median value of owner-occupied homes in USD 1000's), and would like to build a model to predict it. Let's plot it against all other variables:
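The plotting calls are not shown here; a plausible sketch, assuming a recent DataExplorer version in which plot_boxplot and plot_scatterplot take the target variable through a `by` argument, might look like this. The exact calls and options used for the original figures are not known.

```r
library(mlbench)
library(DataExplorer)
## Load the data only if it is not already in the session
if (!exists("BostonHousing")) data("BostonHousing", package = "mlbench")

## Boxplots of every continuous variable, grouped by ranges of `medv`
plot_boxplot(BostonHousing, by = "medv")

## Scatterplots of `medv` against every other variable
plot_scatterplot(BostonHousing, by = "medv")

## Correlation heatmap across all variables
plot_correlation(BostonHousing)
```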

And this is how you slice & dice your data, and analyze correlation with merely 3 lines of code.

Feature Engineering

Feature engineering is a crucial step in building better models. DataExplorer provides a couple of functions to ease the process. All of them require a data.table as the input object, because data.table is lightning fast and can update data by reference. However, if you don't feel like coding in data.table syntax, you may adopt the following process:
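The process itself is not shown here, so the following is a hedged sketch of the usual round trip: convert to a data.table, apply DataExplorer's functions, then convert back to a data.frame. The choice of group_category and drop_columns, the columns rad and b, the 0.2 threshold, and the names housing and housing_dt are all assumptions made for illustration.

```r
library(mlbench)
library(data.table)
library(DataExplorer)
## Load the data only if it is not already in the session
if (!exists("BostonHousing")) data("BostonHousing", package = "mlbench")

## Work on a copy and store `rad` as character so categories can be relabelled
housing <- BostonHousing
housing$rad <- as.character(housing$rad)

## Step 1: convert to data.table (as.data.table makes a copy)
housing_dt <- as.data.table(housing)

## Step 2: apply DataExplorer functions, which modify housing_dt by reference
group_category(housing_dt, feature = "rad", threshold = 0.2, update = TRUE)
drop_columns(housing_dt, "b")

## Step 3: convert back to a plain data.frame for the rest of the workflow
housing <- as.data.frame(housing_dt)
```

Because the functions update housing_dt in place, there is no need to assign their results back to a variable.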

Comments

The vignette aircraft examples are a bit misleading, as the data needs a bit more cleanup, I think. Airbus is in the list with two different strings, McDonnell Douglas with at least three, and Canada with two. If those were first lumped together into one each, before lumping the long tail together into an "other" bin, this could make a big difference in further modeling, as Airbus would jump to the largest group by far, not the third, with about half of the Airbus data being lumped into "other". #oops

I like the package, but why the inconsistent ggplot theming: defaults for boxplots, but odd semi-transparent bars with not really pretty black outlines for the barplots and histograms? Sticking to ggplot standards would have been nicer imho.

plot_str just gives:
Error: C stack usage 7970280 is too close to the limit