Northwestern University

One of the benefits of being at Insight has been a reasonably large library stocked with great material for learning data science. If you’re looking to brush up on your skills or break into the industry, I recommend checking out the following:

I actually read through Winston’s cookbook before Insight, but it has been an invaluable resource. Why write 20 lines of matplotlib or R base graphics when you can produce a better graph in 5 lines of ggplot2?

Most technical interviews with companies will ask
you to whiteboard some type of recursive function
in your favorite programming language. Although Python
seems to be the dominant language in data science, recursion
can be a powerful tool in R.

What is recursion?

Recursive functions call themselves. That is, they
break the problem down into its smallest possible
components, and the function calls itself within the
original call on each of those smaller components.
Afterward, the results are put together to solve the
original problem. Let’s take a look at a more concrete example.
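A classic illustration is the factorial. A minimal sketch in R (the function name is ours, chosen to avoid masking base R’s factorial()):

```r
# Compute n! recursively
factorial_r <- function(n) {
  if (n <= 1) {              # base case: the smallest component
    return(1)
  }
  n * factorial_r(n - 1)     # the function calls itself on a smaller problem
}

factorial_r(5)  # 120
```

Each call shrinks the problem by one until the base case is reached, and the pending multiplications are then combined on the way back up.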

New versions of R are pushed frequently to fix bugs and address performance
concerns. However, in order to avoid conflicts between R and packages that were
compiled for older versions of R, every upgrade defines a new system and user library
location in which to install packages (e.g., /Library/Frameworks/R.framework/Versions/3.1/).
So how does one avoid installing each package
manually?

I wrote the following code for my lab to automate the re-installation of an
R system library after version upgrades. It reads the old package names into R as a list
and recompiles each package for the new version of R, when available.
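The idea can be sketched in a few lines of base R (the old library path below is an example; point it at your previous R version’s library directory):

```r
# Path to the previous R version's package library (adjust as needed)
old_lib <- "/Library/Frameworks/R.framework/Versions/3.1/Resources/library"

# Package names installed under the old version
old_pkgs <- rownames(installed.packages(lib.loc = old_lib))

# Packages already present in the new version's library
new_pkgs <- rownames(installed.packages())

# Recompile every old package not yet installed, when available on CRAN
install.packages(setdiff(old_pkgs, new_pkgs))
```

Any package that has been removed from CRAN or lacks a build for the new R version will simply fail with a warning, so the rest of the library still installs.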

Overview

When I began using R, like most researchers I kept all my data in some combination
of R’s native data.frame
format or a CSV file that my analysis would continually read.
However, as I began to analyze big datasets at the SAPA Project
and at Insight,
I realized that there is a lot of value in keeping your data in a MySQL database
and streaming it into R when necessary.
This post will briefly outline a few advantages of using a database to store data and run through a
basic example of using R to transfer data to MySQL.
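The basic round trip looks something like the following sketch using the RMySQL package (the host, user, password, and database names are placeholders; substitute your own credentials):

```r
library(RMySQL)

# Open a connection to the database (credentials are examples)
con <- dbConnect(MySQL(), host = "localhost", user = "analyst",
                 password = "secret", dbname = "research")

# Transfer a data.frame to MySQL...
dbWriteTable(con, "iris_data", iris, overwrite = TRUE)

# ...and stream only the rows you need back into R
first_rows <- dbGetQuery(con, "SELECT * FROM iris_data LIMIT 10")

dbDisconnect(con)
```

Because the query runs on the server, you can pull in just the subset of a large table that the current analysis needs instead of reading an entire CSV into memory.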

Overview of reproducible research

Reproducible research is a phrase that describes an academic paper or manuscript that contains the code and data in addition to what is usually published: the researcher’s interpretation. This way, the experimental design and method of analysis are easily replicated by unaffiliated labs and critiqued by reviewers, since the full analysis used to produce the results is submitted along with the final paper. One way of producing reproducible research is to use R code directly inside your LaTeX document. To facilitate the combination of statistical code and manuscript writing, two R packages in particular have arisen: Sweave and knitr. knitr is an R package designed as a replacement for Sweave, but both packages combine your R analysis with your LaTeX manuscript (i.e., knitr = R + LaTeX).

One advantage of knitr is that the researcher can easily create ANOVA and demographic tables directly from the data without messing around in Excel. However, as we’ll see, both knitr and Sweave can run into problems when formatting your table values to 2 decimal points. In this post, I’ll detail my proposed fix, which can be applied to your entire manuscript by editing the beginning of your knitr preamble.
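One common approach is to redefine knitr’s inline output hook in a setup chunk near the top of the document. A sketch (the hook mechanism is knitr’s; the 2-decimal rounding rule here is ours):

```r
library(knitr)

# Round every inline numeric result to 2 decimals document-wide
knit_hooks$set(inline = function(x) {
  if (is.numeric(x)) {
    x <- formatC(x, format = "f", digits = 2)  # e.g., 3.14159 -> "3.14"
  }
  paste(as.character(x), collapse = ", ")
})
```

Because the hook applies to every \Sexpr{} in the manuscript, you fix the formatting once in the preamble rather than wrapping each statistic in round() by hand.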

What are tetra- and polychoric correlations?

Polychoric correlations estimate the correlation between two theorized normal distributions given two ordinal variables. In psychological research, much of our data fits this definition. For example, many survey studies used with introductory psychology pools use Likert scale items. Responses to these items typically range from 1 (Strongly disagree) to 6 (Strongly agree). However, we don’t really think that a person’s relationship to the item is actually polytomous. Instead, the polytomous response is an imperfect approximation of a continuous underlying trait.

Similarly, tetrachoric correlations are a special case of polychoric correlations in which the variable of interest is dichotomous. The participant may have gotten the item either correct (i.e., 1) or incorrect (i.e., 0), but the underlying knowledge that led to the item’s response is probably continuously distributed.

When you have polytomous rating scales but want to disattenuate the correlations to more accurately estimate the correlation between the latent continuous variables, one way of doing so is to use a tetrachoric or polychoric correlation coefficient.
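The attenuation is easy to see with simulated data. A small illustration using the psych package (the latent correlation of .5 and the sample size are arbitrary choices for the demo):

```r
library(psych)
library(MASS)

set.seed(42)
# Two latent normal variables correlated at .5
latent <- mvrnorm(1000, mu = c(0, 0),
                  Sigma = matrix(c(1, .5, .5, 1), 2))

# Artificially dichotomize them, as a pass/fail item would
observed <- ifelse(latent > 0, 1, 0)

cor(observed)[1, 2]        # Pearson r on the 0/1 items: attenuated
tetrachoric(observed)$rho  # tetrachoric estimate: closer to the latent .5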

The problem

At the SAPA Project, the majority of our data is polytomous. We ask you the degree to which you like to go to lively parties to estimate your score on latent extraversion. Presently, we use mixed.cor(), which calls a combination of the tetrachoric() and polychoric() functions in the psych package (Revelle, W., 2013).

However, each time we build a new dataset from the website’s SQL server, it takes hours. And that’s if everything goes well. If there’s an error in the code or a bug in a new function, it may take hours to hit the error, wasting your day.

A bit of profiling revealed that much of the time spent building the SAPA dataset went to estimating the tetrachoric and polychoric correlation coefficients. When you do this for 250,000+ participants on 10,000+ variables, it takes a long time. So Bill and I thought about how we could speed them up and felt others might benefit from our optimization.

A serious speedup to tetrachoric and polychoric was initiated with the help of Bill Revelle. The increase in speed is roughly 1 - (nc - 1)^2 / nc^2, where nc is the number of categories. Thus, for tetrachorics, where nc = 2, this is a 75% reduction, whereas for polychorics of six item responses it is just a 30% reduction.

Item Response Theory can be used to evaluate the effectiveness of
exams given to students. One feature distinguishing it from other paradigms is that it does not assume that every question
is equally difficult (or that an item’s difficulty is whatever the test writer intended it to be). In this way, it is an empirical investigation
into the effectiveness of a given exam and can help the researcher 1) eliminate bad or problematic items and 2) judge whether the test was too difficult or the students simply didn’t study.

In the following tutorial, we’ll use R (R Core Team, 2013) along with the psych package (Revelle, W., 2013) to look at a hypothetical exam.
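The basic workflow can be sketched with simulated pass/fail responses; with a real exam you would read in your scored item matrix instead (the simulation parameters below are arbitrary):

```r
library(psych)

set.seed(17)
# Simulate 500 students answering 9 dichotomous items
exam <- sim.irt(nvar = 9, n = 500)$items

# Estimate item difficulty and discrimination via factor analysis of
# the tetrachoric correlations
fit <- irt.fa(exam)

fit$irt$difficulty        # which items were hardest?
plot(fit, type = "ICC")   # item characteristic curves
```

Items with flat characteristic curves discriminate poorly and are candidates for removal, while the spread of difficulties shows whether the exam was pitched at the right level.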

When you’re writing up reports using statistics from R, it can be tiresome
to constantly copy and paste results from the R Console. To get around this, many of us use Sweave, which allows us to embed R code in LaTeX files.
Sweave is an R tool that weaves R code and its output into a LaTeX document (LaTeX is a document typesetting language). This enables accurate, shareable analyses as well as high-resolution, publication-quality graphs.

Needless to say, the marriage of statistics with documents makes writing up APA-style reports a bit easier, especially with Brian Beitzel’s amazing apa6 class for LaTeX.
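A minimal Sweave document looks like the following sketch (the variables rt and trial are hypothetical; any R objects defined in an earlier chunk would work):

```latex
\documentclass{article}
\begin{document}

% Inline R: the statistic is computed at compile time, never pasted by hand
The mean reaction time was \Sexpr{round(mean(rt), 2)} ms.

% A code chunk that produces a figure; echo=FALSE hides the code itself
<<scatter, fig=TRUE, echo=FALSE>>=
plot(rt ~ trial)
@

\end{document}
```

Running Sweave() on the .Rnw file executes each chunk and emits a .tex file ready for pdflatex, so the numbers and figures in the PDF always match the analysis.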

This guide is intended to facilitate the installation of up-to-date R packages
for users new to either R or Linux. Unlike Windows binaries or Mac packages,
Linux software is often distributed as source-code and then compiled by package
maintainers. The use of package managers has many advantages that I won’t
discuss here (see Wikipedia).
More importantly, the difference can be initially intimidating.
However, once the user gets used to using package managers such as
apt or
yum to install software,
I’m confident they’ll appreciate their ease of use.

These instructions are organized by system type.

Debian-based Distributions

Ubuntu

Full installation instructions for Ubuntu can be found
here. Luckily, CRAN mirrors have
compiled binaries of R which can be installed using the apt-get package manager.
To accomplish this, we’ll first add the CRAN
repo for Ubuntu packages to
/etc/apt/sources.list. If you prefer to manually edit the sources.list file,
you can do so by issuing the following in the terminal:
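A sketch of the commands involved (the mirror URL and Ubuntu release name are examples; substitute your preferred CRAN mirror and your release’s codename):

```shell
# Append the CRAN repo for your Ubuntu release to sources.list
sudo sh -c 'echo "deb https://cloud.r-project.org/bin/linux/ubuntu trusty/" >> /etc/apt/sources.list'

# Refresh the package index and install R plus headers for compiling packages
sudo apt-get update
sudo apt-get install r-base r-base-dev
```

The r-base-dev package pulls in the compilers and headers needed to build CRAN packages from source later on.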

I use SSH regularly to login remotely to servers for experiments and
data analysis. For instance, Northwestern’s Social Sciences Computing
Cluster is available with an
SSH remote login and using X11
forwarding, I can access RStudio and run analyses
that require more memory than my office iMac has. However, logging into the
SSCC over SSH isn’t as quick as launching a program in Spotlight.

While browsing a friend’s .bashrc on GitHub, I realized I could
use a simple Bash function to speed things up. Copy and paste the following
into Terminal:
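A sketch of such a function (the username, hostname, and remote command are placeholders; substitute your own login and whatever program you want to launch):

```shell
# Launch a program on the remote cluster with X11 forwarding,
# so its window displays on the local machine
Rsscc() {
  ssh -X netid@sscc.example.edu rstudio
}
```

Append the function to your ~/.bash_profile so it is still defined after you restart Terminal.app.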

After you restart Terminal.app, you can launch RStudio remotely by typing
Rsscc, or whatever you renamed my function to. In principle, you could also
create a simple menu for choosing among multiple servers or programs using a bit of
read and
case.