Excited to announce that i am waiting in a LINE for the WOMEN’s restroom at a tech conference! Thanks @RLadiesNYC, @RLadiesGlobal, and @nyhackr for the opportunity, wouldn’t be here without your support

Emily Robinson Shows There is More to the Tidyverse than Hadley

Emily Robinson, otherwise known as ERob, gave an excellent talk showing how the Tidyverse is so much more than just Hadley and that there are many people inspired by him to contribute in the Tidy way.

Sean Taylor Forecasted the Future with Prophet

Sean Taylor, former New Yorker and unrepentant Eagles fan, demonstrate his powerful R and Python, package Prophet, for forecasting time series data. Facebook open sourced his work so we could all benefit.

Following her sold-out meetup appearance in March, Jennifer continued to push the boundaries of causal inference.

I Made the Authors of Caret and scitkit-learn Show That R and Python Can Get Along

While both Andreas and Max gave great individual talks, I made them pose for this peace-making photo.

David Robinson Got the Upper Hand in a Sibling Twitter Duel

Given only about 30 minutes notice, David put together an entire slideshow on how to livetweet and how to compete with your sibling.

In the End Emily Robinson Beat Her Brother For Best Tweeting

Despite David’s headstart Emily was the best tweeter (as calculated by Max Kuhn and Mara Averick) so she won the WASD Code mechanical keyboard with MX Cherry Clear switches.

Silent Auction of Data Paintings

Thomas Levine made paintings of famous datasets that we auctioned off with the proceeds supporting the R Foundation and the Free Software Foundation. The Robinson family very graciously chipped in and bought the painting of the Pizza Poll data for me! I’m still floored by this and in love with the painting.

In addition to the traditional Pi Symbol atop the cake we added Albert Einstein since today is also his birthday. It seems fitting that we lost one of the world’s other greatest physicists, Stephen Hawking on the same math holiday.

Pi Cake 2018

The crew has grown quite large from the five of us who celebrated our first pie day almost a decade ago.

In my last post I discussed using coefplot on glmnet models and in particular discussed a brand new function, coefpath, that uses dygraphs to make an interactive visualization of the coefficient path.

Another new capability for version 1.2.5 of coefplot is the ability to show coefficient plots from xgboost models. Beyond fitting boosted trees and boosted forests, xgboost can also fit a boosted Elastic Net. This makes it a nice alternative to glmnet even though it might not have some of the same user niceties.

The data are about New York City land value and have many columns. A sample of the data follows. There’s an odd bug where you have to click on one of the column names for the data to display the actual data.

While glmnet automatically standardizes the input data, xgboost does not, so we calculate that manually. We use preprocess from caret to compute the mean and standard deviation of each numeric column then use these later.

preProc <- preProcess(manTrain, method=c('center', 'scale'))

Just like with glmnet, we need to convert our tbl into an X (predictor) matrix and a Y (response) vector. Since we don’t have to worry about multicolinearity with xgboost we do not want to drop the baselines of factors. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.

In order to build the matrix and vector we need a formula. This could be built programmatically, but we can just build it ourselves. The response is TotalValue.

There are two functions we can use to fit xgboost models, the eponymous xgboost and xgb.train. When using xgb.train we first store our X and Y matrices in a special xgb.DMatrix object. This is not a necessary step, but makes things a bit cleaner.

The best fit was arrived at after 41 rounds. We can see how the model did on the train and validate sets using dygraphs.

dygraphs::dygraph(mod1$evaluation_log)

We can now plot the coefficients using coefplot. Since xgboost does not save column names, we specify it with feature_names=colnames(manX). Unlike with glmnet models, there is only one penalty so we do not need to specify a specific penalty to plot.

coefplot(mod1, feature_names=colnames(manX), sort='magnitude')

This is another nice addition to coefplot utilizing the power of xgboost.

I’m a big fan of the Elastic Net for variable selection and shrinkage and have given numeroustalks about it and its implementation, glmnet. In fact, I will even have a DataCamp course about glmnet coming out soon.

As a side note, I used to pronounce it g-l-m-net but after having lunch with one of its creators, Trevor Hastie, I learn it is pronounced glimnet.

coefplot has long supported glmnet via a standard coefficient plot but I recently added some functionality, so let’s take a look. As we go through this, please pardon the htmlwidgets in iframes.

First, we load packages. I am now fond of using the following syntax for loading the packages we will be using.

In order to use glmnet we need to convert our tbl into an X (predictor) matrix and a Y (response) vector. Since we don’t have to worry about multicolinearity with glmnet we do not want to drop the baselines of factors. We also take advantage of sparse matrices since that reduces memory usage and compute, even though this dataset is not that large.

In order to build the matrix ad vector we need a formula. This could be built programmatically, but we can just build it ourselves. The response is TotalValue.

Notice the - 1 means do not include an intercept since glmnet will do that for us.

manX <- useful::build.x(valueFormula, data=manTrain,
# do not drop the baselines of factors
contrasts=FALSE,
# use a sparse matrix
sparse=TRUE)
manY <- useful::build.y(valueFormula, data=manTrain)

We are now ready to fit a model.

mod1 <- glmnet(x=manX, y=manY, family='gaussian')

We can view a coefficient plot for a given value of lambda like this.

coefplot(mod1, lambda=330500, sort='magnitude')

A common plot that is built into the glmnet package it the coefficient path.

plot(mod1, xvar='lambda', label=TRUE)

This plot shows the path the coefficients take as lambda increases. They greater lambda is, the more the coefficients get shrunk toward zero. The problem is, it is hard to disambiguate the lines and the labels are not informative.

Fortunately, coefplot has a new function in Version 1.2.5 called coefpath for making this into an interactive plot using dygraphs.

coefpath(mod1)

While still busy this function provides so much more functionality. We can hover over lines, zoom in then pan around.

These functions also work with any value for alpha and for cross-validated models fit with cv.glmnet.

The biggest event for me this year was completely outside of work and had nothing to do with statistics or R: I got married. We technically met at the Open Stats meetup and I did build our wedding website with RMarkdown, so R was still involved. We just returned from our around-the-world honeymoon so I thought the best way to track our travels would be with maps and globes using leaflet and threejs.

Before we get to any code, the following packages were used in making this post.

This was an extensive trip that, in addition to traditional vacation activities, included a few visits to clients and speaking and a fewconferences and meetups. In all, we visited, London, Singapore, Hong Kong, Auckland, Queenstown, Bora Bora, Tahiti, Moorea, San Jose and San Francisco, with a connection or two in between.

The airport/ferry codes for our trip were the following.

Origin

Destination

Airline

JFK

LGW

Norwegian Air

LHR

SIN

Singapore Airlines

SIN

HKG

Singapore Airlines

HKG

AKL

Cathay Pacific

AKL

ZQN

Air New Zealand

ZQN

AKL

Air New Zealand

AKL

PPT

Air New Zealand

PPT

BOB

Air Tahiti

BOB

MOZ

Air Tahiti

MOZ

PPT

Terevau

PPT

LAX

Air Tahiti Nui

LAX

SJC

Alaska Airlines

SFO

JFK

JetBlue

Converting these to latitude and longitude is easy thanks to Open Flights.

We then manually reorder the airports so that edges can be drawn nicely between them. This is akin to creating an edgelist of airport-pairs. This is not the most robust way of creating this list, but suffices for our purposes.

Now we can provide that data to threejs. We first specify an image to overlay on the globe. Then we specify the latitude and longitude of visited airports. After that, we provide the origin and destination latitudes and longitudes of our flights. The rest of the arguments are cosmetic.

Beyond the epic proportions of our travel, this honeymoon was outstanding from the sheer length, to the vastly different places we visited, to the food we ate and the sights we saw, to the activities we participated in, to the great people along the way. And, of course, it’s amazing to spend a month traveling with your favorite person.

For a while I’ve been trying to find the best slideshow format to create using RMarkdown. My go-to format is ioslides because it has a presenter mode that works in self-contained mode, meaning the entire presentation is in one file. The drawback is that fullscreen images is not a built in feature.

Another popular format is reveal.js. This allows for fullscreen images, but presenter mode is not possible while keeping the file self-contained.

A very promising format is from the xaringan package which is based on remarkjs. This allows for beautiful, fullscreen images and it has an excellent presenter mode. This can all be done with a self-contained file. The problem is, due to the nature of the format, having all the images URI encoded for self-contained mode causes the file to open very slowly. I gave a talk about web scraping at the MIT Sports Analytics Conference which takes over five minutes to open.

Given all this, I decided to hack my way to a solution using ioslides. After some prep it works quite well.

The trick is to associate the image, in CSS, with the ID of the <slide>. Even though we assign an ID to the <article> this does not help us know the ID of the <slide>. To get around this we use a bit of jQuery. We write some code that take a name attribute from every <article> tag and assigns it as an ID for the corresponding <slide> tags.