Models help clarify complex data-sets

Effect size often has to be seen to be understood

When doing confirmatory analysis, we might want to know how strong an
effect is. Data viz is very useful for this task. Let’s compare two
datasets that have similar p-values but very different effect sizes.

Superplots

Andrew Gelman coined the term superplots for plotting different models
on multiple panels of a graph so you can visually compare them.

For instance, say we have several time-series and we want to know if
they deviate from each other significantly. An easy way to compare them
is to fit splines to each time-series and then just plot them next to
each other, with standard errors (SEs). Then we can compare visually for
‘significant’ differences.

Here’s some code to simulate three made-up series. The first two have
the same trend, but different observation errors, the third has a
different trend:
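A minimal sketch of such a simulation follows. The trend shapes, error sizes, seed and series names are all assumptions made for illustration. The spline fits use `geom_smooth(method = "gam")` from ggplot2 (which relies on the mgcv package) as one convenient way to get a smooth with an uncertainty band on each panel.

```r
library(ggplot2)  # geom_smooth(method = "gam") uses mgcv, which ships with R

set.seed(42)  # arbitrary seed so the sketch is reproducible
n <- 50
x <- seq(0, 10, length.out = n)

# Hypothetical trends: series 1 and 2 share one, series 3 has another
trend_a <- sin(x / 2)
trend_b <- 0.15 * x

dat <- data.frame(
  series = rep(c("Series 1", "Series 2", "Series 3"), each = n),
  x = rep(x, times = 3),
  y = c(
    trend_a + rnorm(n, sd = 0.1),  # shared trend, small observation error
    trend_a + rnorm(n, sd = 0.5),  # shared trend, large observation error
    trend_b + rnorm(n, sd = 0.1)   # different trend
  )
)

# One panel per series; the ribbon is the default 95% confidence band
# computed from the standard errors of the spline fit
p <- ggplot(dat, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "gam", formula = y ~ s(x), se = TRUE) +
  facet_wrap(~series) +
  theme_bw()
print(p)
```

With the panels side by side on a shared y-axis, you can judge by eye whether the confidence bands of two series overlap.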

When is ggplot2 appropriate, or when should I use base R?

In general I think ggplot2 is appropriate for
problems of intermediate complexity. Like this:

Base R is great if you just want to plot a barplot quickly, or do an x-y
plot. ggplot2 comes into its own for slightly more complex plots, like
having multiple panels for different groups or colouring lines by a third
factor. But once you move to really complex plots, like overlaying a
subplot on a map, ggplot2 becomes very difficult, if not impossible. At
that point it is better to move back to base R. ggplot2 can also get
very fiddly if you have very specific requirements for your plots, like
particular colours or labels arranged in a certain way.
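To make the contrast concrete, here is a small sketch using the built-in `mtcars` data. The choice of variables and the object names `cars` and `p` are just for illustration. The base R call is a one-line x-y plot, and the ggplot2 call adds panels for one grouping variable and colours for another.

```r
library(ggplot2)

cars <- mtcars                       # work on a copy of the built-in data
cars$cyl_f <- factor(cars$cyl)       # number of cylinders as a factor

# Base R: a quick x-y plot
plot(cars$wt, cars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# ggplot2: one panel per transmission type (am), points coloured by cylinders
p <- ggplot(cars, aes(x = wt, y = mpg, colour = cyl_f)) +
  geom_point() +
  facet_wrap(~am, labeller = label_both) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")
print(p)
```

Getting the same panels and legend in base R would take a loop over groups plus manual calls to `par()` and `legend()`, which is where ggplot2 starts to save time.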

In reality both ggplot2 and base R graphics are worth learning, but I
would start with learning the basics of base R graphics and then move
on to ggplot2 if you want to quickly plot lots of structured data-sets.

Pie graphs vs bar graphs

Mariani et al. plot rates of seafood fraud for several European
countries. While it’s a foundational study that establishes improvement
in the accuracy of food labelling, their graphics could be improved in
several ways.

First, they use perspective pies. This makes it incredibly hard to
compare the two groups (fish that are correctly labelled/mislabelled).
Humans are very bad at comparing angles and pretty bad at comparing
areas. With the perspective you can’t even compare the areas properly.
They do provide the raw numbers, but then, why bother with the pies?
Note that the % pies misrepresent the data slightly because the %
figures are actually odds (mislabels / correct labels), rather
than percentages (mislabels / total samples).
Second, the pies are coloured red/green, which will be hard for red-green
colourblind people to see.
Third, they have coloured land blue on their map, so it appears to be
ocean at first look.
Fourth, the map is not really necessary. There are no spatial patterns
going on that the authors want to draw attention to. I guess having a
map does emphasize that the study is in Europe.
Finally, the size of each pie is scaled to the sample size, but the
scale bar for the sample size shows a sample of only 30, whereas most of
their data are for much larger sample sizes (>200). Do you get the
impression from the pies that the UK has 668 samples, whereas Ireland
only has 187? From this graphic alone, we have no real idea what sample
size was used in each country.

In fact, all the numbers that are difficult to interpret in the figure
are very nicely presented in Table 1.
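The gap between the two calculations mentioned above (mislabels / correct labels versus mislabels / total samples) is easy to check with a little arithmetic. The counts below are hypothetical and are not taken from Mariani et al.

```r
# Hypothetical counts, for illustration only
total       <- 200
mislabelled <- 20
correct     <- total - mislabelled

percent_mislabelled <- 100 * mislabelled / total    # share of all samples
odds_as_percent     <- 100 * mislabelled / correct  # mislabels relative to correct labels

round(c(percent = percent_mislabelled, odds = odds_as_percent), 1)
```

The second figure is always larger than the first, and the gap widens as mislabelling becomes more common.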

Below is a start at improving the presentation. For instance, you could
do a simple bar chart, ordering by rate of mislabelling.
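A sketch of an ordered bar chart in base R might look like this. The country labels and counts are placeholders invented for the example, so swap in the values from Table 1 of the paper to reproduce the real figure.

```r
# Placeholder data for illustration only; NOT the values from Mariani et al.
fraud <- data.frame(
  country     = c("Country A", "Country B", "Country C", "Country D"),
  mislabelled = c(12, 30, 5, 18),
  total       = c(200, 300, 250, 150)
)

# Percent of all samples that were mislabelled
fraud$rate <- 100 * fraud$mislabelled / fraud$total

# Order from highest to lowest rate so the ranking reads left to right
fraud <- fraud[order(fraud$rate, decreasing = TRUE), ]

barplot(fraud$rate, names.arg = fraud$country,
        ylab = "Mislabelled (% of samples)",
        col = "grey40", border = NA, las = 1)
```

A single neutral colour avoids the red/green problem, and bar heights on a common baseline are much easier to compare than pie angles.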

You could add another subfigure to this, showing the rate by different
species too.

The barplot doesn’t communicate the sample size, but then that is
probably not the main point. The sample sizes are probably best reported
in the table.

If we felt the map was essential, then putting bar charts on it would be
more informative. It is not that easy to add bar charts on top of an
existing map in R, so I would recommend creating the bar charts
separately, then adding them on in Illustrator or Powerpoint.

Interpreting rates

The units you use affect how people interpret your graph.
People are bad at interpreting rates; we just can’t get our heads around
accumulation very well. Here is a numerical example. Check out the
figure below and ask yourself:

At what time is the number of people in the shopping centre declining?

Would you say it is at point A, B, C or D?

Let’s plot the total number of people:

Hopefully the answer is obvious now. So the right scales can help make
interpretation much easier.

You could also rephrase the question from ‘when is the total number
decreasing?’ to ‘when is the number of people entering less than the
number of people leaving?’
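The link between the two views can be sketched with made-up numbers. The hourly rates below are hypothetical and the centre is assumed to start empty. The total is the running sum of people entering minus people leaving.

```r
# Hypothetical hourly rates (people per hour), for illustration only
hours    <- 0:12
entering <- c(5, 20, 40, 60, 70, 65, 50, 35, 25, 15, 10, 5, 0)
leaving  <- c(0, 5, 10, 20, 35, 50, 60, 60, 50, 40, 30, 20, 10)

net   <- entering - leaving  # net change in each hour
total <- cumsum(net)         # people in the centre, assuming it starts empty

# Hours in which the total is declining: more people leave than enter
declining <- hours[net < 0]

op <- par(mfrow = c(1, 2))
plot(hours, entering, type = "l", ylim = range(entering, leaving),
     xlab = "Hour", ylab = "People per hour", main = "Rates")
lines(hours, leaving, lty = 2)
legend("topright", legend = c("Entering", "Leaving"), lty = 1:2, bty = "n")
plot(hours, total, type = "l", xlab = "Hour", ylab = "People in centre",
     main = "Total")
par(op)
```

In these made-up data the entering rate peaks earlier than the total does. The total only starts to fall once the leaving curve crosses above the entering curve, which is hard to see on the rates panel and obvious on the totals panel.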

Choosing colour scales

A lot of thought should go into choosing colour scales for graphs. For
instance: will it print OK? Will colour-blind people be able to see
it? Does the scale create artificial visual breaks in the data?
Luckily there is a package to help you make the right decision for a
colour scale: RColorBrewer. Check out colorbrewer.org for
a helpful interactive web interface for choosing colours.
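As a quick sketch of how the package can be used, the snippet below lists the palettes that RColorBrewer itself flags as colour-blind friendly and pulls out one sequential palette. The variable names are just for illustration.

```r
library(RColorBrewer)

# brewer.pal.info is a data frame of palette metadata shipped with the package;
# its 'colorblind' column flags colour-blind friendly palettes
cb_friendly <- rownames(brewer.pal.info)[brewer.pal.info$colorblind]
cb_friendly

# Nine colours from the sequential "Purples" palette, as hex codes
purples <- brewer.pal(9, "Purples")
purples

# Show only the colour-blind friendly palettes in one figure
display.brewer.all(colorblindFriendly = TRUE)
```

The hex codes returned by `brewer.pal()` can be passed straight to the `col` argument of base R plotting functions.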

Using red-green palettes makes graphs hard to read for colour-blind
people. Also, using a diverging palette makes it look like there is
something important about the middle point (yellow). A better palette to
use would be one of the sequential ones, such as “Purples”, shown here.

To make it easier to understand, let’s look at these again as contour
plots. I will use a more appropriate diverging palette than the
red-green one, though.
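A minimal sketch of a filled contour plot with a diverging palette is given here. The surface is a made-up function centred on zero, and “RdBu” is an assumed choice of palette rather than necessarily the one used for the original figures.

```r
library(RColorBrewer)

# Hypothetical surface with values between -1 and 1
x <- seq(-pi, pi, length.out = 60)
y <- seq(-pi, pi, length.out = 60)
z <- outer(x, y, function(x, y) sin(x) * cos(y))

# Reverse "RdBu" so blue is negative and red is positive, and interpolate
# to however many colours filled.contour() asks for
div_pal <- colorRampPalette(rev(brewer.pal(11, "RdBu")))

# Symmetric limits keep the neutral middle colour at zero
zmax <- max(abs(z))
filled.contour(x, y, z, zlim = c(-zmax, zmax), color.palette = div_pal,
               xlab = "x", ylab = "y")
```

A diverging palette makes sense here because zero is a meaningful midpoint. For data without such a midpoint, a sequential palette is the safer choice.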

R the integrator

In my talk I give some examples of how R can integrate everything from
data input, data processing, analysis and visualisation to sharing
results (even as interactive web content). The results for the quiz you
took above are available as a Google spreadsheet. We can use R to read
that data and visualise it, and even share the results as a webpage or
on Twitter.