Sebastian Sauer Stats Blog

Latest Posts

Slidify is a cool tool to render HTML5 slide decks, see here, here or here for examples.

Features include:

reproducibility. You write your slide deck as you would write any other text, similar to LaTeX/Beamer. But you write in Markdown, which is easier and less clumsy. And since you write plain text, you are free to use git.

modern look. Just a website, nothing more. But with cool, modern features.

techie appeal. Well, probably techie-minded persons will love it more than non-techies do…
Check this tutorial out as a starter.

Basically, slidify produces a website:

While it is quite straightforward to write up your slidify presentations, some customizations are a bit more of a hassle.

Let’s assume you would like to add a logo to your presentation. How can you do that?

Logo page

There is an easy solution: In your slidify main file (probably something like index.Rmd) you will find the YAML header right at the start of the file:
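A typical slidify YAML header might look like the sketch below. The field names follow the io2012 framework; the title, author, and logo.png are placeholders, and the exact set of supported fields depends on the framework you choose:

```yaml
---
title      : My Presentation
subtitle   : Now with a logo
author     : Jane Doe
framework  : io2012        # one of the HTML5 frameworks slidify supports
highlighter: highlight.js  # syntax highlighting for code chunks
hitheme    : tomorrow
mode       : selfcontained
logo       : logo.png      # assumed to live in the deck's assets/img folder
---
```

Adding a `logo` entry (and placing the image file where the framework expects it) is enough to get the logo onto the title page.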

That’s it! Happy slidifying!

A quite common task in data analysis is to change a dataset from wide to long format.

For example, this is a dataset in wide format:

It is called wide because, well, it is wide – several columns side by side.

For example, assume we have measured a number of predictors (here: predictor_1, predictor_2, predictor_3) and an outcome measure (here: outcome). In this case, each variable is dichotomous (either yes or no).

For plotting, e.g. using ggplot2, we have to convert this format to long format. Long datasets look like this:

So, not really surprisingly, the dataset is now longer than in wide format. Hence the name…

It was converted in a certain way. Look at the two figures (and the color scheme). The 4 columns are now 3: all of the predictors are now paired in two columns: key and value. In the key column we find the names of the former columns as entries; in the value column we find the entries of those columns (check the colors).

We can easily see that the variable outcome was not paired. Before and after the conversion, it happily resides in its own, whole column. We often want one non-paired column, as we need it, e.g., for faceting.

Let’s visualize this transformation. First, we look at the transformation of the values:

Second, let’s look at the transformation of the keys (column headers) and the outcome variable, which will not be paired (it remains in its own, complete, tidy column):
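The transformation described above can be sketched with tidyr’s gather(). The data are made up for illustration; the column names follow the text:

```r
library(tidyr)

# made-up wide data matching the text:
# three dichotomous predictors plus one outcome column
df_wide <- data.frame(
  predictor_1 = c("yes", "no",  "yes"),
  predictor_2 = c("no",  "yes", "no"),
  predictor_3 = c("yes", "yes", "no"),
  outcome     = c("no",  "yes", "yes"),
  stringsAsFactors = FALSE
)

# pair the predictor columns into key/value;
# outcome is excluded and keeps its own column
df_long <- gather(df_wide, key = "key", value = "value", -outcome)

nrow(df_long)  # 9 rows: 3 rows x 3 predictor columns
```

Every predictor column name now appears as an entry of key, with its former entries stacked in value, while outcome was simply repeated alongside.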

Recently, I analyzed some data of a study where the efficacy of online psychotherapy was investigated. The investigator had assessed whether or not a participant suffered from some comorbidities (such as depression, anxiety, eating disorder…).

I wanted to know whether each of these (10 or so) comorbidities was associated with the outcome (treatment success, yes vs. no).

Of course, an easy solution would be to “half-manually” check the association, e.g. using table() in R. But I wanted a more reproducible, more automated solution (OK, I confess, I just wanted it to be cooler, probably…).

As a starter, consider the old question of the Titanic disaster: Did class correlate with survival? Let’s see:

data(titanic, package = "COUNT")  # package needs to be installed
attach(titanic)                   # save some typing
table(class, survived)            # tabulate the two variables

Note that we are talking about nominal variables with two levels, i.e., dichotomous variables. In other words, we are looking at frequencies.

So my idea was:

take a column of a predictor (say, depression)

take the column of the outcome variable (treatment success)

Cross-tabulate the association.

Repeat those steps for each of the comorbidities

Make sure it looks (sufficiently) cool…
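The steps above can be sketched as follows. The data and column names are invented for illustration; base table() does the tabulation and lapply() handles the repetition:

```r
# invented example data: one row per participant,
# dichotomous comorbidity columns plus the outcome
set.seed(42)
df <- data.frame(
  depression = sample(c("yes", "no"), 20, replace = TRUE),
  anxiety    = sample(c("yes", "no"), 20, replace = TRUE),
  success    = sample(c("yes", "no"), 20, replace = TRUE)
)

comorbidities <- c("depression", "anxiety")

# cross-tabulate each comorbidity with the outcome
tabs <- lapply(comorbidities, function(x) table(df[[x]], df$success))
names(tabs) <- comorbidities

tabs$depression  # one 2x2 table per comorbidity
```

Whether this qualifies as sufficiently cool is left to the reader; the dplyr version below is arguably cooler.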

Have I already told you that I like the R package dplyr? Check it out. Let’s see how to do the job with dplyr. Let’s use the Wage dataset from the package ISLR.

This dataset from ISLR shows the wages of some male professionals in the Atlantic US area; along with the wage there are also some “risk factors” for the wage, such as health status, education, and race. Let’s see if there is an association.

First, load the package and the data:

# don't forget to install the packages upfront
library(dplyr)

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
## filter, lag
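With the package loaded, a first cross-tabulation in dplyr style might look like this (a sketch assuming the Wage data from ISLR with its health and education columns):

```r
library(dplyr)
library(ISLR)

data(Wage)

# count the combinations of health status and education level
Wage %>%
  count(health, education)
```

count() is dplyr’s shorthand for group_by() followed by summarise(n = n()), so the result is the same frequency table that table(health, education) would give, just in a tidy data frame.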

The z-transformation is a ubiquitous operation in data analysis. It is often quite practical.

Example: Assume Dr. Zack scored 42 points on a test (say, an IQ test). The average score in the relevant population is 40, and the SD is 1, let’s say. So Zack’s score is 2 points above average. 2 points equals 2 SDs in this example. We can thus safely infer that Zack is about 2 SDs above average (leaving measurement precision and other issues aside).

If the variable (IQ) is normally distributed, then we are allowed to look up the percentile of this z-value in a table. Or ask the computer (in R):

pnorm(2)

## [1] 0.9772499

Here, R tells us that approx. 98% of all individuals have an IQ that is smaller than Dr. Zack’s. Lucky guy! Compare the figure.

Here is an overview of some frequently used quantiles of the normal distribution:
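These quantiles can be computed directly with qnorm(), the inverse of pnorm(); the values below are rounded to two decimals:

```r
# frequently used quantiles of the standard normal distribution
round(qnorm(c(0.90, 0.95, 0.975, 0.99)), 2)
# 1.28 1.64 1.96 2.33
```

The familiar 1.96 is the 97.5% quantile, which is why it marks the bounds of the central 95% of the distribution.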

Now, the next step. We have a bunch of guys, say 10, and test each of them. Now we can calculate the mean and SD of this distribution. To see the differences more clearly, we can z-transform the values.

But why is it that the mean and SD will end up nicely at 0 (mean) and 1 (SD)?

To see this, consider the following. A z-transformation is, in the first step, nothing more than subtracting the mean from each value. If I subtract the mean (say, 42) from each value and then compute the mean, I will not be surprised that the mean is now 42 points lower than it was. But 42 - 42 = 0. So the new mean (after the z-transformation) will be 0.

OK, but what about the SD? Why must it equal 1? A similar reasoning applies. The second step of the z-transformation is dividing each value by the SD of the distribution. The new SD will then be smaller by exactly that factor. So whatever it was before the z-transformation, it will be 1 afterwards, as x / x = 1.
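This reasoning can be checked numerically. The scores below are made up; the same two steps (subtract the mean, divide by the SD) are written out by hand:

```r
set.seed(1)
scores <- rnorm(10, mean = 40, sd = 3)  # made-up test scores for 10 guys

z <- (scores - mean(scores)) / sd(scores)  # manual z-transformation

round(mean(z), 10)  # 0
round(sd(z), 10)    # 1
```

The built-in scale() function does the same thing; the manual version just makes the two steps explicit.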