This guide is geared towards data analysts who have limited or no experience with R. It aims to get the reader started with R through concrete examples with Snowplow data - so that by the end of the guide, the reader can pull Snowplow data into R, play around with it and plot it. That should leave the reader comfortable enough to use other resources to build out her knowledge of R, and give her enough knowledge to start using R to perform productive analytics on Snowplow data.

The previous example a was simple enough start. It didn’t enable us to do anything that could not easily be achieved using any other visualization tool.

In our second example, we’ll pull a more complicated data set from Snowplow, and try out some visualizations in GGPlot2 (boxplot and jittered scattergrams) that I think are awesome at conveying distribution information, and are not well supported by other tools.

For this example, we’re going to look at visit data by a number of different dimensions, and produce different plots to see how different visit characteristics vary with those different dimensions. This is classically something that is well supported by a BI / OLAP tool like Tableau. The example shows that R has those same capabilities, and in addition, a broader set of visualizations to enable us to unpick relationships between metrics and dimensions.

First, we need to design a SQL query to pull the cut of data we want. In this case, we want visit level data. We want to capture the following metrics for each visit:

We need to add a line about whether or not one of our services pages was visited as part of the customer journey. We can identify all the visits where the visitor did pop by one of our services pages by executing the following query:

SELECT
domain_userid,
domain_sessionidx,
1 AS 'visited_services_pages'
FROM "public"."events"
WHERE page_urlpath LIKE '/services/%'
GROUP BY domain_userid, domain_sessionidx, visited_services_pages

And then join the results of the above query to our first query to generate the additional dimension:

Lastly we add the data about who are referer is, and what the URL of the landing page is. Both of these data points can be read from the first line of data for the visit, so we add them by doing an additional join with the output of the previous query, this time on domain_userid, domain_sessionidxandmin(collector_tstamp):

This is a convenient moment to take a quick aside from our example and introduce factors in R.

In data analysis, we can distinguish three types of variable:

Type of variable - Description - ExampleContinuous - A variable that can take any value with a range - e.g. Height of a person, session lengthOrdinal - A discrete (not continuous) variable, but one in which there is an order - e.g. Sizes of T-shirt (small / medium / large)Nominal - A discrete (not continuous) variable, where there is no inherent order between values - e.g. Landing pages

Continuous variables in R are typically numeric. Ordinal and nominal variables are factors. We can tell R which of the columns in our data frame are factors. This will help R

Process our data more efficiently

Plot our data more intelligently (because R will understand the type of variable in each column)

domain_userid, for example, is a factor: it can take one of a finite (but large) number of values of cookies set. It is an unordered factor, because there is no hierarchy between values. We can tell R it is factor by entering:

visits$domain_userid <- as.factor(visits$domain_userid)

If we inspect the structure of that column, we’ll see it is now a factor:

That is kind of interesting: it shows that we have visitors who engage very deeply with only one page on our site (e.g. the visitor at the top left of the scatter) and we have visitors who engage lightly across a broad range of pages (bottom right) and deeply with a lot of pages (top right). However, most of the points are in the bottom left, and they are too closely spaced to see how closely packed they are.

We can get a clearer view if we plot a jitter plot: this “jitters” points left and right about where they would otherwise be plotted, making it easier to identify areas with a high concentration of points:

Note: we have changed the plot type from “point” to “jitter” to jitter each point. The “alpha” argument is responsible for point transparency: we have set it to 5% (i.e. 20 points in one place will be required to create a black spot). The scale_y_continuous and scale_x_continuous arguments scale the x and y axes so that we zoom in on the point of the graph where most of our visits lie. R has kindly reminded us that some of our points lie outside this range, and so have been been removed.

The graph shows us that the majority of visitors do not engage deeply. It actually appears that engagement is likely to be deeper if a visitor visits fewer pages. That is interesting: we can explore if using a boxplot:

The boxplot shows that our earlier intuition was wrong: engagement depth increases with breadth. We had thought looking at the jittered plot that a higher proportion of people who only engaged with one page engaged more deeply. Now we can see that this was false: it looked to be the case, because the absolute number of people engaging deeply who only look at one page is higher than the actual number who engage deeply with a broader set. The boxplot controls for the difference in absolute numbers: we are instead looking at the distribution within each breadth level.

One thing to notice is that when producing the boxplot, we plot factor(distinct_pages_visited) on the x-axis rather than simply distinct_pages_visited. That is because whereas a scatter or jitter plot can be produced by plotting a continuous variable on the x-axis, a boxplot requires a discrete variable. Applying the factor function converts our continuous variable into a discrete variable very simply: in general, R makes converting one type of variable into another very simple.

The boxplot above was interesting: it showed very clearly a relationship between breadth and depth of engagement by visit. Let’s use the same plot, but this time to compare breadth of engagement by landing page:

Ouch - this graph is very messy: breadth of engagement does appear to vary by landing page, but there are too many (100+) landing pages to make the analysis straight forward. What we want is to plot the results just for the top 10 landing pages, by number of visits.

To do this, let’s start by identifying the top 10 landing pages by number of visits. This is a straightforward query to write in SQL: luckily a package is available for R called sqldf that enables you to to execute SQL statements against data frames in R directly, and returns the results as a new data frame:

We’ve now created a data frame with two columns: landing_page (with the landing page name) and cnt (with the count of visits). There are 10 lines of data in the frame, which corresopnd to the top 10 landing pages by number of visits.

We can now use this to create a subset of the visit data so we only include visits where the landing page is in our top 10 list:

We can tidy up the x-axis by aligning the labels on the axis vertically, by adding + theme(axis.text.x = element_text(angle=90, vjust=0.4)) to our command. We can also hide the outliers on our plot, to make it clear, by adding outlier.shape=NA:

The results are very interesting: visitors who land on the homepage and analytics cookbook are much more likely to engage more broadly with content on the site, rather than visitors who land on the specific blog posts (e.g. on writing Hive UDFs or dealing with Hadoop small files problem.)

Let’s see if there is a similar pattern in the depth of engagement by landing page. (This is likely: remember we showed in our first boxplot that visitors who engage more broadly are more likely to also engage more deeply.)

The impact of landing page on depth of engagement is not nearly as pronounced as that between landing page and depth of engagement. That is likely to be because of both:

The intent of people landing on specific blog posts is likely to be much more focused on learning specific bits of information (that may not be Snowplow specific), rather than learn about Snowplow in general. (Which someone landing on the hoempage or cookbook homepage are likely to be interested in.)

The design of the homepage and analytics cookbook homepage encourage users to explore beyond the page much more forcefully than the links that surround individual blog posts.

Let’s now have a look and see if breadth of engagement changes depending on the referer type. To do this, we simply swap out landing_page as our x-axis dimension (first argument in qplot) and swap in refr_medium. Also note that we have switched back to using our original data frame visits, rather than the subset we created for the landing page analysis:

The chart is interesting - it suggests that even though e.g. direct visitors engage less deeply, overall, they are more likely to visit our services pages (and hence convert) than visitors from search. The relationship between level of engagement and conversion isn’t straightforward - we can explore it by comparing our engagement scatter plots (breadth vs depth) for visitors that did visit the services page, with those that do not:

We used the facet argument above to ask R to plot us two version of the scatter plot, one for visits where the services pages were visted, and one for visits where they were not, to see if the overall pattern of engagement for the two groups is different. The results are interesting: they suggest that in both cases, there is a strong relationship between breadth and depth of engagement. (These are highly correlated.) However, visitors that visited the services pages are more likely to engage both more deeply and more broadly, overall, than those who do not, as evidenced from the lower fraction of points at the bottom right of the graph. (Further visualization work is required to specify this more precisely.)

This tutorial barely scratches the surface of what is possible with R. The purpose of it is to give enough familiarity with basic R functions around fetching data, managing data via data frames, and plotting data, that the reader will be able to get up and running with R on top of Snowplow data.

There is a wealth of material on R available. The following are some recommended resources:

R in Action. An excellent and thorough overview of R, with lots of straightforward-to-follow examples

Hacley Wickham’s excellent ggplot2 book. Written by the author of ggplot2, this covers the actual graphics grammar implemented by this excellent library