on data science

The term “data science” appears to have been introduced in its current
sense by W. S. Cleveland in 2001 (Data science: an action plan for
expanding the technical areas of the field of statistics,
International Statistical Review, 2001). David Donoho’s
50 years of data science
is an excellent summary of what “data science” means, especially from
the perspective of how academic disciplines are organized, and how
they change over time.

At first glance (and subsequent glances too) the term is a bit silly,
a tautology. All sciences do data analysis. Even if you’re talking
‘big data’, Pierre Laplace was doing large scale data analysis on
Paris and London census records in 1781 (which we’ll discuss next
week). As the statistician Karl Broman has asked, when a physicist
does a mathematical analysis, does she say she’s doing “number
science”?

Donoho makes a pretty good case that “data science” is a rebranding
exercise, and his piece gives interesting insight into disciplinary
fad and fashion. But since we biologists are still in the midst of
“systems biology”, “integrative biology”, and “quantitative biology”
(and a whole lot of -omes)
we can’t be throwing too many stones in any glass houses of branding
“new” subdisciplines.

Beyond the politics of it, there must be something interesting and
real about “data science”, or it wouldn’t be so high in our collective
consciousness. Donoho makes a case that data science is a rebranding
of statistics by outsiders, and that statistics departments opened the
door to this schism by downplaying and avoiding the difficult, messy
practicalities of real-world data analyses. Donoho repeats a famous
quote that 80% of the effort in data analysis is just being able to
deal with the messiness of data at all: reformatting, exploring,
cleaning, visualizing. Computer scientists, in particular, have taken
up those practical challenges by necessity and opportunity, and
re-invigorated an old area. Even Donoho turns up his nose a bit at
computational technologies that seem to him like mere “coping skills
for dealing with organizational artifacts of large-scale cluster
computing”, but being able to compute effectively on terabytes or
petabytes of data does present serious, expensive, and interesting
practical challenges.

In biology, that adage about 80% of the effort in data analysis might
even be an understatement. Biology is particularly ready for more
emphasis on the key ideas of data science: the practicalities of data
exploration, data cleaning, and data visualization, and the coding
skills to be able to do it. This week, we’re going to focus on some of
those practicalities.

a data paranoia toolkit

The first thing you do with a new data file is look at it. But, you
say, come on, my file is too big to read the whole thing. Here’s a
litany of basic techniques for dealing with big data files, aiming to
get big unmanageable data sets down to little data sets that human,
intuitive biological superpowers can work on:

slice: look at the start and the end

select: look at known cases

sample: look at a random sample

sort: look for outliers

summarize: look for more outliers

validate: the whole file, once you think you understand it.

We’ll focus here on simple line-oriented data files, but the same
ideas apply to more complex formats with multiple lines per record.

look at the start, end

The best place to start is at the beginning. The start of the file
tends to include helpful information like column headers, comments,
and other documentation. If you’re going to parse the file, this might
help – plus your parser might have to know to skip or specially deal
with header material.

The easiest way to look at the start of a text file is on the command line:

% head -10 w03-data.tbl

shows you the first 10 lines of the file. (Obviously, you can change
the number to see however many lines you want.)

Worth checking the tail end of the file too. It might end in a special
kind of line that you’d need to parse. A corrupted file (in a
truncated download) might end abruptly in an obvious way. Some files
don’t end in a newline character, and that might screw up your parser
if you aren’t expecting it. Again, the command line way is:

% tail -20 w03-data.tbl

which shows you the last 20 lines.

Of course, the file might be small enough that you can just look at the
whole thing, for example using more on the command line:

% more w03-data.tbl

select known cases

The superpower of a biologist looking at a fresh new data set: ‘let’s
look at my favorite gene… wait, that’s not right, oh good lord.’

Look at a few cases where you know the right answer. For instance, in
a cell type specific RNA-seq data set, look at a transcript that’s
supposed to be on in that cell type, and look at one that’s supposed
to be off. If you’re going to worry about cross-contamination in the
data, look at highly expressed genes in other nearby cell types (the
most likely cross-contaminants). If you’re going to worry about
sensitivity of detecting rarer transcripts, find a low-expressed
transcript to look at.

For pulling specific lines out of a file, grep is a go-to. This
will print all lines in the file w03-data.tbl that contain the
string coriander:

% grep coriander w03-data.tbl

take a random sample

Many data files are arranged in a nonrandom order, which means that
just looking at the start and the end might not reveal some horrific
systematic problem. Conversely, looking at the head and tail can make
it look like the data file’s worse than it actually is: genome
sequence files, ordered by chromosome coordinate, can look terrible
when they start and end with great swaths of N’s, placeholding for the
difficult-to-sequence telomere repeats.

For small to medium sized files, if you don’t care about efficiency,
there are plenty of ways to randomly sample N lines. One quick
incantation (when you have a modern GNU-ish sort program installed)
is to sort the file randomly (the -R option) and take the first N
lines:

% sort -R w03-data.tbl | head -10

For large files, where efficiency matters, we’re going to introduce
the reservoir sampling algorithm in just a moment.

There’s a
great Feynman story about the power of random sampling.
Feynman was called in to consult on the design of a big nuclear plant,
looking over engineering blueprints that he didn’t understand at all,
surrounded by a crowd of actual engineers. Just trying to start
somewhere and understand something small, he points randomly to a
symbol and says “what happens if that valve gets stuck?”. Not even
sure that he’s pointing to a valve! The engineers look at it, and then
at each other, and say “well, if that valve gets
stuck… uh… you’re absolutely right, sir” and they pick up the
blueprints and walk out, thinking he’s a genius.

sort to find outliers

In a tabular file, you can sort on each column to look for weird
values and outliers (and in other data formats, you can script to
generate tabular summaries that you can study this way too). For
example, this will sort the data table on the values in the third
column (-k3), numerically (-n), and print the first ten:

% sort -n -k3 w03-data.tbl | head

-r reverses the order of the sort, to look at the last ten.
sort has many other useful options too.

Bad values tend to sort to the edges, making them easier to spot.

summarize to find more outliers

You can also find ways to put all the values in a column through some
sort of summarization. Here your go-to is the venerable awk, which
makes it easy to pull single fields out of the lines of tabular files:
awk '{print $3}' myfile, the world’s simplest script, prints the
third field of every line of myfile.

I have a little Perl script of my own called avg that calculates a
whole bunch of statistics (mean, std.dev., range…) on any input that
consists of one number per line, and I’ll do something like:

% awk '{print $3}' w03-data.tbl | avg

You don’t have my avg script, but you can write your own in Python.
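If you want a quick stand-in, here’s a minimal sketch of such a script (this is not my avg script; it just reads one number per line from stdin and prints a few summary statistics):

import statistics
import sys

# read one number per line from stdin, skipping blank lines
values = [float(line) for line in sys.stdin if line.strip()]

print("n     =", len(values))
print("mean  =", statistics.mean(values))
print("stdev =", statistics.stdev(values))
print("min   =", min(values), " max =", max(values))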

If the values in a column are categorical (in a defined vocabulary),
then you can summarize all the values that are present in the column:

% awk '{print $2}' myfile | sort | uniq -c

where the uniq utility collapses identical adjacent lines into one,
and the -c option says to also print the count, how many times each
value was seen.

For something sequential, like DNA or protein sequences, you can whip
up a quick script to count character composition; you may be surprised
by the non-DNA characters that many DNA sequence files contain.
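For example, here’s a quick sketch of such a script (it assumes one sequence per line in a file named on the command line; adapt it for FASTA or whatever format you actually have):

import collections
import sys

# tally every character seen in the file
composition = collections.Counter()
for line in open(sys.argv[1]):
    composition.update(line.rstrip('\n'))

for ch, count in composition.most_common():
    print(ch, count)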

Another trick is to count the length of things in a data record (like
sequence lengths), sort or summarize on that, and look for
unreasonable outliers.

validate

By now you have a good mental picture of what’s supposed to be in the
file. The truly compulsive, which of course includes all careful
scientists among us, might now write a validation script: a script
that looks automatically over the whole file, making sure that each
field is what it’s supposed to be (number, string; length, range;
whatever).
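For example, here’s a sketch of a validation script for a hypothetical tab-delimited file with three fields per data line (a name, a non-negative integer count, and a non-negative TPM value); you’d adjust the checks to your actual format:

import sys

for lineno, line in enumerate(open(sys.argv[1]), start=1):
    if line.startswith('#'):              # skip comment lines
        continue
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 3:
        sys.exit(f"line {lineno}: expected 3 fields, got {len(fields)}")
    name, count, tpm = fields
    if not count.isdigit():
        sys.exit(f"line {lineno}: count '{count}' is not a non-negative integer")
    try:
        if float(tpm) < 0:
            raise ValueError
    except ValueError:
        sys.exit(f"line {lineno}: TPM '{tpm}' is not a non-negative number")

print("all lines validated")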

Of course, you’re not going to do all of these paranoid steps on every
data file you encounter. The point is that you can; it only takes
basic scripting and command line skills to do some very powerful and
systematic things at scale, enabling you (as a biologist) to wade
straight into a large dataset and wrap your mind around it.

reservoir sampling algorithm

Suppose you’re trying to take a random sample of a really large data
set. Suppose the data set is so large that it doesn’t fit in your
computer’s memory. Suppose you don’t even know how large it is; you’re
going to read one element at a time in a very long stream, until you
run out of elements. Is it possible to sample $m$ elements from $n$
total elements, where $n$ is unknown (until the end), such that each
element is sampled uniformly with probability $\frac{m}{n}$?

Algorithm: reservoir sampling

allocate a reservoir to hold up to $m$ elements, with “slots” $1..m$.

put the first $m$ elements in reservoir slots $1..m$.

for each subsequent element $k$, starting with $k = m+1$, up to
the last one $k = n$:

choose a random number $r$ uniformly from $1..k$

if $r \leq m$, swap element $k$ into reservoir slot $r$,
discarding the element that was there. (That is, each element $k$
initially gets put in the reservoir with probability $\frac{m}{k}$).

Our first complete algorithm in the course, I think! One of the things
we want you to learn is how to read a pseudocode algorithm and
implement it in your own Python code. You’ll find that one detail you
always need to keep track of (obsessively) is that pseudocode
algorithms often count with indexes starting from 1, but in Python,
lists are indexed starting from 0: “off-by-one” errors are easy to
make.

Reservoir sampling is efficient in both time and memory. It becomes
the preferred sampling algorithm when files get large. Writing your
own script gives you the ability to ignore non-data parts of the file
(here, comment lines) and to parse the file into data records any way
you need to (not just one line at a time). Knowing how to write a
quick reservoir sampling script gives you the superpower of being able
to randomly downsample any data file, no matter how large the file,
and no matter the format of the data.
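For example, here’s a minimal sketch of a reservoir sampling script in Python (it assumes one record per line and ‘#’ comment lines; the parsing is the part you’d adapt to your own format):

import random
import sys

def reservoir_sample(stream, m):
    """Uniformly sample m records from an iterable of unknown length."""
    reservoir = []
    for k, record in enumerate(stream, start=1):   # count elements from 1, as in the pseudocode
        if k <= m:
            reservoir.append(record)               # the first m elements fill the reservoir
        else:
            r = random.randint(1, k)               # uniform random integer in 1..k, inclusive
            if r <= m:
                reservoir[r - 1] = record          # 1-based slot r -> 0-based list index
    return reservoir

if __name__ == '__main__':
    # sample 10 data lines from the file named on the command line, skipping comments
    datalines = (line for line in open(sys.argv[1]) if not line.startswith('#'))
    for line in reservoir_sample(datalines, 10):
        print(line, end='')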

proof that it works

We prove it in two cases. First consider the case of an element $i
\leq m$, which gets into the original sample automatically. At each
iteration $k = m+1$ to $n$, you have a chance to knock it out of
the reservoir, and the probability that it survives at iteration $k$
is $\frac{k-1}{k}$ (element $k$ overwrites slot $i$ with probability
$\frac{m}{k} \cdot \frac{1}{m} = \frac{1}{k}$). It has to survive
every round, so:

$$
P(\mbox{element } i \mbox{ is sampled}) = \frac{m}{m+1} \cdot \frac{m+1}{m+2} \cdots \frac{n-1}{n} = \frac{m}{n}
$$

which is what we want: the probability that $i$ is sampled is
$\frac{m}{n}$. Obviously the key thing here is how every numerator
neatly cancels its preceding denominator.

Now consider the case of an element $i > m$. The probability that it
gets into the reservoir when its turn comes is $\frac{m}{i}$. Then
it has to survive getting knocked out by any subsequent element
$k = i+1 \ldots n$, so

$$
P(\mbox{element } i \mbox{ is sampled}) = \frac{m}{i} \cdot \frac{i}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{n-1}{n} = \frac{m}{n}
$$

The same cancellation, but from a different initial condition.

Two points to make about this proof:

if you don’t remember anything about the proof, you’ll probably have
to look up the reservoir sampling algorithm every time you implement
it, because it’s easy to get muddled up on off-by-one issues. But if
you even remember the outline of the proof, you’ll remember that the
final element $n$ has to be sampled with probability
$\frac{m}{n}$, and that will remind you that the recursive rule is
to sample each element $k$ with probability $\frac{m}{k}$.

I looked this up on Google to double check that I had the proof
right, and the first sample pages I looked at were
complicated, confusing, and/or wrong.

computational complexity - big O notation

How many steps does the reservoir sampling algorithm take? It does a
couple of operations for each element (sample a random number;
compare it to $m$; maybe swap element $k$ into the reservoir). If
we sample from a data set that’s 10x larger, it will take us 10x more
operations. We say that the algorithm is $O(n)$ time, “order of n in
time”.

How much memory does it take? We only need to hold the reservoir’s $m$
elements in memory, plus one current element $k$. We say the algorithm
is $O(m)$ memory, “order of m in memory”.

We saw that another way to take a random sample is to sort the
elements randomly and take the first $m$ of them. In-memory sorting
algorithms are typically $O(n \log n)$ time and $O(n)$
memory. ($\log n$ terms often arise in algorithms that are doing
some sort of binary divide-and-conquer, like recursively sorting
elements to the left or right of some middle point.) Which algorithm
should we prefer?

The big-O of these algorithms tells us that at some point, for large
enough $n$, reservoir sampling is the preferred algorithm in both
time and memory, because $O(n)$ is better than $O(n \log n)$ time,
and $O(m)$ is better than $O(n)$ memory.

But: Big-O notation is only about relative scaling as the problem
size $n$ increases; it ignores constant factors. If an $O(n)$ algorithm
A happens to require 1000000 operations per element and an
$O(n \log n)$ algorithm B does 1 operation per element, algorithm B is
actually faster in practice until $n$ gets really big.

It can be difficult to predict how long an individual operation takes,
especially on a modern computer where CPU operations are blazingly
fast and random memory accesses are like sending a pigeon to the moon.
Algorithms can be analyzed theoretically to determine their big-O, but
implementations must be analyzed empirically to determine how long
they actually take, on a given computer.

Big-O notation has a
more formal definition
in computer science. The most important thing to know about the
formalism is that big-O reflects worst case performance, an
asymptotic upper bound: “there exists a constant $c$ such that the
time required is $\leq c \cdot f(n)$ for all sufficiently large $n$”. There are also ways
to talk about best case behavior and average case
behavior (you may see $\Omega()$ and $\Theta()$ notations, for lower bounds
and tight bounds). Some algorithms, including reservoir sampling,
always take the same number of steps. A good example where it makes a
difference is a popular sorting algorithm called quicksort, which
almost always takes $O(n \log n)$ time but is worst case
$O(n^2)$. Nonetheless, you will see people (including me) using $O()$
notation for average case behavior sometimes. Technically this
is sloppy and colloquial; they literally mean “this calculation
is going to take on the order of $f(n)$ time”, rather than
meaning to state a formal upper bound.

displaying data in graphs and tables

switching gears…

You’re reading the
Weissgerber et al. paper
this week because it makes important points about data presentation
in graphs and tables. In particular:

Summary statistics are only meaningful when there are enough data to summarize.

An exemplary graph: Figure 1b from
Ding et al. (Nature 536:331, 2016)
on courtship song variation in Drosophila species, showing
raw data points in addition to summary statistics. They
summarize the data by mean and standard deviation, not standard error, because
the first thing you want to know is how much variability you're
seeing in samples, not how much variability is expected in the
estimate of the mean.

It is extremely common to see bar graphs in biology, displaying means
and standard errors. But we’ve often only done a small number of
replicates (maybe n = 3-10), and we could just as easily show
the raw data points: why don’t we?

Plotting mean and standard error implicitly assumes that the data can
be summarized with just those two numbers. You’re basically assuming
that your data were normally distributed, and maybe they weren’t.
Looking at the raw data points shows you the actual distribution -
maybe it was multimodal, skewed, had outliers, or whatever.

The other thing is the business of plotting the standard error
(i.e. the expected error of the mean), not the sample standard
deviation. If we’re visualizing the data, we’re first primarily
interested in the variability of the data: if we do the experiment
again, if we take another independent sample, where is it likely to
be? That’s the sample variance, and the square root of that is the
standard deviation. Maybe once we’re digging into our analysis a
little deeper – we’re convinced that the data are distributed
normally, and we want to parameterize where the mean is likely to be
– then we might start worrying about how good our estimated mean is
likely to be (the standard error) but that’s usually not what
someone’s graph of their samples is aiming to represent, or implying
that it represents.

This business of plotting your actual raw data has become
trendy. There’s even a Twitter hashtag, #barbarplots. But packages for
graphing data are still catching up. (I ran across a member of the
Microsoft Excel development team replying to a community request for
this by saying “huh, why would anyone want to do that?”) The buzzword
to look for is a swarm plot or a beeswarm plot. We’ll use the Python Seaborn
module in this week’s homework because it has the ability to make
swarm plots. You can also do an almost identical thing with a strip
plot by adding random jitter (jitter=True, in Seaborn). And you
can emulate all this easily enough in a scatter plot by plotting
your data points along one axis while adding a small amount of random
jitter in the other.
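For example, here’s a minimal sketch with made-up numbers (the DataFrame and its column names are just for illustration):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# show every raw data point per group, instead of a bar of means +/- SEM
df = pd.DataFrame({'cell_type': ['neuron']*5 + ['glia']*5,
                   'TPM':       [10.5, 12.1, 9.8, 11.0, 10.2,
                                 0.4, 0.3, 0.6, 0.2, 0.5]})
sns.swarmplot(x='cell_type', y='TPM', data=df)
# nearly identical alternative: a strip plot with random jitter
# sns.stripplot(x='cell_type', y='TPM', data=df, jitter=True)
plt.show()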

two other things we all learn in high school and forget

Significant figures: state your numbers to a non-obscene
precision. Just because a computer outputs a floating point real
number as ‘2.745970945707204’ doesn’t mean you need to report it
that way; in biological data we’ve often only got numbers nailed
down to a digit or two. ‘2.7’ is easier to think about anyway, and
remember, Fermi estimation.

Always label your axes: seriously, always label your axes. Few
people will stop you in a talk and ask you “um, what’s on the
x-axis?” because it’s hard to ask that without it sounding like “um,
do you even know how to make graphs?”. But that’s what they’ll be
thinking.

tidy data

A variable is something that can take on different values, either
because we set it as a condition of our experiment (a fixed
variable), or because we’re measuring it (a measured variable).
For example, a fixed variable “gene” (which takes on 20,000 different
values for different gene names), or a fixed variable “sex” (‘M’ or
‘F’), or a measured variable “TPM” (a number for an estimated
expression level). A value is the actual number, string, or
whatever that’s assigned to a variable.

An observation is one measurement or experiment, done under some
certain conditions that we might be varying, resulting in observed
values for one or more things we measured in that same experiment.

In a tidy data set,

each variable forms a column

each observation forms a row

each type of observational unit forms a table

Wickham says it’s “usually easy to figure out what are observations
and what are variables”, but at least for RNA-seq data I’m not so sure
I agree. I don’t think it’s entirely clear whether it’s better to
treat genes as observations or variables. That is, it seems to me that
this is tidy:

gene      mouse   cell_type   sex   TPM
carrot    1       neuron      M     10.5
carrot    2       glia        F     9.8
kohlrabi  1       neuron      M     53.4
kohlrabi  2       glia        F     62.1
cabbage   1       neuron      M     0.4
cabbage   2       glia        F     0.3

but so is this:

mouse   cell_type   sex   carrot   kohlrabi   cabbage
1       neuron      M     10.5     53.4       0.4
2       glia        F     9.8      62.1       0.3

Wickham says:

it is easier to describe functional relationships between variables
(e.g., z is a linear combination of x and y, density is the ratio of
weight to volume) than between rows, and it is easier to make
comparisons between groups of observations (e.g., average of group a
vs. average of group b) than between groups of columns.

which seems to argue most strongly for the second version above, with
genes as columns. An “observation” is an RNA-seq experiment on one
sample. I get measurements for 20,000 genes in that experiment, and
they’re linked. I might want to study how one or more genes covary
with others across samples (either for real, or because of some
artifact like a batch effect). But that’s a really wide table,
20,000 columns, and you’ll have trouble looking at it; many text
editors will even choke on lines that long.

We’ll see in the homework that it’s easier to visualize some example
data with Seaborn if we use the first (long) version instead. We can
transform the second form into the first by
melting the data table, and Pandas gives us a method to automate
that.
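For example, here’s a sketch of melting the wide table above into the long, tidy form with the DataFrame .melt() method:

import pandas as pd

# the wide form: one row per sample, genes as columns
df_wide = pd.DataFrame({'mouse':     [1, 2],
                        'cell_type': ['neuron', 'glia'],
                        'sex':       ['M', 'F'],
                        'carrot':    [10.5, 9.8],
                        'kohlrabi':  [53.4, 62.1],
                        'cabbage':   [0.4, 0.3]})

# melt the gene columns into a 'gene' variable and a 'TPM' value
df_long = df_wide.melt(id_vars=['mouse', 'cell_type', 'sex'],
                       var_name='gene', value_name='TPM')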

Clearly what I don’t want in a tidy RNA-seq data set, though, is a
bunch of fixed-variable values, representing my experimental
conditions, masquerading as column headers, like the ‘wt’ vs ‘mut’ and
‘M’ vs. ‘F’ columns in this week’s homework. That’s #1
in Wickham’s list of the most common problems in messy data: column
headers are values, not variable names.

creating a DataFrame

You can create a DataFrame directly from a dict of lists, where the dict keys are
the column names, and the list for each key is a column. Here I’m also
naming the rows sample1 etc.; Pandas calls the row names an index:
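(A minimal sketch; the column names and values here are made up.)

import pandas as pd

# dict keys become column names; each list becomes a column; index= names the rows
df = pd.DataFrame({'genotype': ['wt', 'wt', 'mut'],
                   'sex':      ['M',  'F',  'M'],
                   'A1CF':     [10.5, 9.8,  53.4]},
                  index=['sample1', 'sample2', 'sample3'])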

reading tabular data from files

By default, read_table() assumes tab-delimited fields. You will also
commonly encounter comma-delimited and whitespace-delimited files. You
specify an alternative single-character field separator with a
sep='<c>' argument, and you can specify whitespace-delimited fields with a
delim_whitespace argument.
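For example (the file names here are just placeholders):

df = pd.read_table('file.csv', sep=',')                # comma-delimited fields
df = pd.read_table('file.txt', delim_whitespace=True)  # whitespace-delimited fields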

Row names: by default, read_table() assumes there are no row names (there are $n$
fields per line, for $n$ column names), but if the first field
is a name for each row (i.e. there are $n+1$ fields on each data line,
one name and $n$ named columns), use index_col=0:

df = pd.read_table('file.tbl', index_col=0) # First field is the row name

To get Pandas to skip comments in the file, use the comment= option to
tell it what the comment character is:
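(For example, assuming ‘#’ marks comments; the file name is a placeholder.)

df = pd.read_table('w03-data.tbl', comment='#')   # ignore everything after a '#' on a line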

indexing rows, columns, cells in a DataFrame

Best way to learn is to play with example DataFrames in a Jupyter
notebook page. I find Pandas to be inconsistent and confusing
(sometimes you access a DataFrame as if it’s column-major, and
sometimes like it’s row-major, depending on what you’re doing) but
it’s powerful once you get used to it.

To extract single columns, or a list of several columns, you can index
into the DataFrame as if it’s in column-major form:

df['caraway'] # Gets 'caraway' column, as a pandas Series
df[['caraway','kohlrabi']] # Gets two columns, as a smaller DataFrame

As a convenience, you can also use slicing operators to get a subset
of rows; now you index into the DataFrame as if it’s in row-major form:
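(For example, against a generic df:)

df[0:2]    # first two rows, as a smaller DataFrame (half-open interval: rows 0 and 1)
df[3:]     # rows from position 3 to the end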

To access data by row/column labels, use the .loc[] method.
.loc takes two arguments, for which rows and columns you want to
access, in that order (now we’re row-major again). These arguments are
the row/column labels: either a single label, or a list. Note
that .loc takes arguments in square brackets, not parentheses! Also
note that when you use slicing syntax with .loc, it takes a closed
interval, in contrast to everywhere else in Python where you’d expect
a half-open interval.

## When rows are labeled:
df.loc['sample1'] # Extracts row as a Series
df.loc[['sample1','sample2']] # Extracts two rows
## When rows are numbered, not labeled:
df.loc[2:4]                   # Extracts rows 2..4 (three rows) when rows are numbered, not labeled. Note, closed interval!
df.loc[2:4, ['genotype', 'sex', 'A1CF']] # Extracts rows 2..4, and these three columns
df.loc[2,'A2M'] # Get a single value: row 2, column 'A2M'

To access data by row/column positions, use the .iloc[]
method. .iloc[] has square brackets around its row/col arguments
just like .loc[] but the arguments are an integer, a list of
integers, or a range like 2:5. Ranges in .iloc[] are half-open
intervals; compare:
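(For example, against a generic df:)

df.iloc[2:4]              # rows at positions 2 and 3; half-open, so position 4 is not included
df.iloc[2:4, 0:3]         # those rows, and the first three columns
df.iloc[[0, 2], [1]]      # rows 0 and 2, second column, by lists of positions
df.iloc[2, 5]             # a single value: row position 2, column position 5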

sorting data tables

The by= option to the DataFrame .sort_values() method tells it which field to sort on (which column label,
usually). The argument to by= is a single column label (a string),
or a list of them. If it’s a list, it sorts first by the first one,
then by the second, and so on.

By default, it sorts rows, by some column’s value. If you choose
axis=1, it sorts columns, by some row’s value.

By default, it sorts in ascending order; specify ascending=False to
get descending order.

df_sorted = df.sort_values(by='coriander', ascending=False)
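A couple more sketches (the column and row labels here are hypothetical):

df_sorted = df.sort_values(by=['cell_type', 'coriander'])  # sort by cell_type, then by coriander within it
df_sorted = df.sort_values(by='sample1', axis=1)           # sort columns by the values in the row labeled 'sample1'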

plotting example

To scatter plot two columns against each other (i.e. each row is a
point on the scatter plot):
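For example, a minimal sketch (‘caraway’ and ‘coriander’ are hypothetical column names in df):

import matplotlib.pyplot as plt

df.plot.scatter(x='caraway', y='coriander')   # each row becomes one (x,y) point
plt.show()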