We’ve moved from using the command line to using R. Much like the command line
allows us to do text and database like operations on large files, R is a
statistical programming language that allows us to do Excel-like operations
on any size dataset. R can do anything from act like a simple interactive
calculator, all the way up to automatically analyzing thousands of files and
outputting the analysis as website of interactive charts and graphs. For
instance, this Near Earth Object tracker
is an R script that pulls data from NASA, and will let you do exploratory analysis
by just pointing and clicking on a website!

R is it’s own coding language, so it will take a little time before you can
build your own NEO tracker, but we’re going to keep coming back to R and
learning new features over the course of the workshop. So, it’s okay if you
don’t feel like you ‘get’ it today.

blast_out is called a dataframe, which is a sort of R-ish version of a
spreadsheet with named columns. View can be used to present it nicely,
and head(blast_out) can be used to look at just the first few rows.

Another useful command is dim which will tell you the DIMENSIONS of this
data frame:

dim(blast_out)

That’s a big data frame! 14,524 rows (and 12 columns!)

Let’s do some data visualization to get a handle on what our blast output looked like. First, let’s look at the evalue:

hist(blast_out$evalue)

This is telling us that MOST of the values in the evalue column are
quite low. What does this mean? How do we figure out what this is?

(You can also try plotting the distribution of -log(blast_out$evalue) - why
is this more informative?)

So these are a lot of low e-values. Is that good or bad? Should we
be happy or concerned?

We can take a look at some more stats – let’s look at the bitscore column:

hist(blast_out$bitscore)

what are we looking for here? (And how would we know?)

(Hint: longer bitscores are better, but even bitscores of ~200 mean a
nucleotide alignment of 200 bp - which is pretty good, no? Here we really
want to rescale the x axis to look at the distribution of bitscores in the
100-300 range.)

Another question - if ‘bitscore’ is a score of the match, and ‘pident’
is the percent identity - is there a relationship between bitscore and
pident?

Well, we can ask this directly with plot:

plot(blast_out$pident, blast_out$bitscore)

why does this plot look the way it does? (This may take a minute to show
up, note!)

The answer is that bitscores are only somewhat related to pident; they
take into account not only the percent identity but the length. You can
get a napkin sketch estimate of this by doing the following:

This is an example of initial exploratory data analysis, in which we poke
around with data to see roughly what it looks like. This is opposed to
other approaches where we might be trying to do statistical analysis to
confirm a hypothesis.

Typically with small replicate sizes (n < 5) it is hard to do confirmatory
data analysis or hypothesis testing, so a lot of NGS work is done for
hypothesis generation and then confirmed via additional experimental
work.