All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes

Talk by Will Trimble of Argonne National Laboratory, on April 23, 2014, at MSU's BEACON Center for the Study of Evolution in Action on visualizing and interpreting the redundancy spectrum of long kmers in high-throughput sequence data.

Transcript of "All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes"

1.
All kmers are not created equal: finding the signal from the noise in large-scale metagenomes.
Will Trimble, metagenomic annotation group, Argonne National Laboratory
BEACON seminar, April 23, 2014, MSU

2.
Apology: I speak biology with an accent
• I spent six years in dark rooms with lasers.
• Now I use computers to analyze high-throughput sequence data.
• I introduce myself as an applied mathematician.
• Finding scoring functions to answer questions with ambiguous data

3.
Apology: I speak biology with an accent
• I spent six years in dark rooms with lasers.
• Now I use computers to analyze high-throughput sequence data.
• I introduce myself as an applied mathematician.
• Finding scoring functions to answer questions with ambiguous data
• Shoveling data from the data-producing machine into the data-consuming furnace.

4.
Outline
• Sequences are different
• How much did my sequencing run give me? kmerspectrumanalyzer!
• How much did I sample? nonpareil-k
• Pretty pictures: thumbnailpolish!

8.
Sequences are different
• Sequencing produces sequences. Sequences are qualitatively different from all other data types.
• Each sequence is an information-rich (possibly corrupted) quotation from the catalog of genetic polymers.

9.
What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAG
GTCATCGATAGCAGGATAATAATACAGTA
Who wrote this line?
“be regarded as unproved until it has been checked against more exact results”
Searching: we know what to do with these puzzles. You go to this website, and type it in…

10.
What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA
Who wrote this line?
“be regarded as unproved until it has been checked against more exact results”
Searching: how long do reads need to be to recognize them?

11.
What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA
Who wrote this line?
“be regarded as unproved until it has been checked against more exact results”
Searching: how long do reads need to be to recognize them? To do what? To place them on a reference genome? This can be turned into a math problem that I will illustrate with a search-engine analogy.

12.
How long do reads need to be?
Information (Shannon, 1949, BSTJ) is a quantitative summary of the uncertainty of a probability distribution – a model of the data.
Profound applicability in pattern matching + modeling.
Logarithmic measurements have units!

H = \sum_i p_i \log_2 \left( \frac{1}{p_i} \right)
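As a quick illustration of the definition above, here is a minimal Python sketch (function name is my own) that computes H in bits from a vector of raw counts:

```python
import math

def shannon_entropy(counts):
    """Shannon entropy H = sum_i p_i * log2(1/p_i), in bits,
    of the empirical distribution given by raw counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

# A uniform distribution over the 4 bases carries 2 bits per symbol...
print(shannon_entropy([1, 1, 1, 1]))  # 2.0
# ...and any skew lowers the entropy below that of the flat distribution.
print(shannon_entropy([70, 10, 10, 10]))
```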

13.
A word on the sign of the entropy
• A popular straw man among mathematicians and CS people is the “random sequence model”: a uniform categorical distribution over all 4^L sequences.
• When we learn something (say, we collect some genomes and expect our new sequences to look like them), we implicitly construct a less flat distribution. Models always have less entropy than the model of ignorance.

14.
How long do phrases need to be?
Exercise: pick a book from your bookshelf. Pick an arbitrary page and an arbitrary line.

for n in 1..10:
    type the first n words into google books, quoted
    break if google identifies your book

15.
How long do phrases need to be?
• Information content of English words: H_word is ca. 12 bits per word.
• Size of Google Books? Big libraries have a few 10^7 books, each with 10^5 indexed words, so a database size of 10^12 words. log2(database size) = log2(10^12) = 39.9 ≈ 40 bits.
• So we expect on average 40 / 12 = 3.3, or about 4 words, to be enough to find a phrase in Google’s index. Try it.

16.
How long do phrases need to be?
Exercise: pick a book from your bookshelf. Pick an arbitrary page and an arbitrary line.

for n in 1..10:
    type the first n words into google books, quoted
    break if google identifies your book

Most often takes 4 words.

17.
How long do phrases need to be?
• Information content of English words: H_word is ca. 12 bits per word.
• Size of Google Books? Big libraries have a few 10^7 books, each with 10^5 indexed words, so a database size of 10^12 words. log2(database size) = log2(10^12) = 39.9 ≈ 40 bits.
• So we expect on average 40 / 12 = 3.3, or about 4 words, to be enough to find a phrase in Google’s index. Try it.
Not all phrases are equally distinctive.
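The back-of-the-envelope arithmetic on this slide is short enough to check directly (the numbers below are the slide's own assumptions, not measurements):

```python
import math

# From the slide: ~12 bits of information per English word,
# and a Google-Books-scale index of ~10**12 indexed words.
bits_per_word = 12
database_size = 10**12

database_bits = math.log2(database_size)          # ~= 39.9 bits
words_needed = math.ceil(database_bits / bits_per_word)
print(round(database_bits, 1), words_needed)      # 39.9 4
```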

18.
How long do reads need to be?
• Maximum information content of base pairs: H_read is 2 bits per base, so 2L bits for a length-L sequence.
• Most long kmers are distinct: for a genome of size G (ca. 10^10 bp), log2(G) = log2(10^10) = 33.2 ≈ 34 bits.
• So we expect that when 2L > 34 bits, we should be able to place any sequence.
• That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome.
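The same arithmetic works for reads, using the slide's assumed pan-genome size of G = 10^10 bp and the 2-bits-per-base ceiling:

```python
import math

# From the slide: at most 2 bits of information per base, and an
# "address space" of roughly G = 10**10 distinct genomic positions.
G = 10**10
genome_bits = math.log2(G)                     # ~= 33.2 bits
min_read_length = math.ceil(genome_bits / 2)   # smallest L with 2L > 33.2
print(round(genome_bits, 1), min_read_length)  # 33.2 17
```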

19.
The data deluge
• There were some technological breakthroughs in the mid-2000s that led to inexpensive collection of 10s of Gbytes of sequence data at once.
• The data has outgrown some favorite algorithms from the 1990s (BLAST).

28.
Redundancy is good
• OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life.
• OMG! I see this sequence 10 million times.
• OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbytes of memory.
• Error correction and diginorm somewhat amusingly strive for opposite ends.

29.
Redundancy is good
• OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life.
• OMG! I see this sequence 10 million times.
• OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbytes of memory.
• Error correction and diginorm somewhat amusingly strive for opposite ends.
Abundance-based inferences are better in the high-abundance part of the data.

34.
How much novelty is in my dataset?
How many sequences do you need to see before you start seeing the same ones over and over again? Initially, everything is novel, but there will come a point at which more than half of your new observations are already in the catalog.

35.
Nonuniqefraction(✏; {r}, {n}) =
X
i
ni · ri
P
j nj · rj
(1 Poisscdf (✏ · ri, 1))
(1 Poisscdf (✏ · ri, 0))
How
much
novelty
is
in
my
dataset?
How
many
sequences
do
you
need
to
see
before
you
start
seeing
the
same
ones
over
and
over
again?
Ini<ally,
everything
is
novel,
but
there
will
come
a
point
at
which
less
than
half
of
your
new
observa<ons
are
already
in
the
catalog.
We
can
calculate
this
eﬃciently
using
the
kmer
spectrum.
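A stdlib-only sketch of this calculation, reading the Poisson-CDF ratio as "probability a kmer of abundance r_i is seen at least twice, given it is seen at all" at subsampling fraction ε. The spectrum representation as (abundance, count) pairs is my own convention for illustration, not kmerspectrumanalyzer's format:

```python
import math

def nonunique_fraction(eps, spectrum):
    """Expected fraction of sampled kmer observations that are non-unique
    at subsampling fraction eps, given a kmer spectrum as a list of
    (abundance r_i, number of distinct kmers n_i) pairs.

    Per abundance class the term is P(X >= 2) / P(X >= 1) with
    X ~ Poisson(eps * r_i), weighted by that class's share n_i * r_i
    of all kmer observations; PoissCDF is expanded by hand."""
    total = sum(n * r for r, n in spectrum)
    acc = 0.0
    for r, n in spectrum:
        lam = eps * r
        p_ge1 = 1.0 - math.exp(-lam)                 # 1 - PoissCDF(lam, 0)
        p_ge2 = 1.0 - math.exp(-lam) * (1.0 + lam)   # 1 - PoissCDF(lam, 1)
        if p_ge1 > 0.0:
            acc += (n * r / total) * (p_ge2 / p_ge1)
    return acc

# At full depth, abundant kmers are almost surely seen twice;
# at a tiny sampling fraction, almost everything looks novel.
toy_spectrum = [(100, 1000), (2, 10**6)]
print(nonunique_fraction(1.0, toy_spectrum))
print(nonunique_fraction(0.001, toy_spectrum))
```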

42.
Generali<es
from
the
kmer
coun<ng
mines
• Many
datasets
have
as
much
as
5-­‐45%
of
the
sequence
yield
in
adapters.
• FEW
DATASETS
have
well-­‐separated
abundance
peaks
(of
the
sort
metavelvet
was
engineered
to
ﬁnd)
• Diverse
datasets
have
a
featureless,
geometric
rela4onship
between
kmer
rank
and
kmer
abundance.
• Shannon
entropy
is
oversensi4ve
to
errors.
Higher-­‐order
Rényi
entropy
is
more
stable.
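The last point can be seen on a toy abundance vector: adding a swarm of singleton error kmers moves Shannon entropy much more than order-2 Rényi entropy, which is dominated by the high-abundance classes. (The toy numbers below are illustrative, not from any dataset in the talk.)

```python
import math

def shannon(counts):
    """Shannon entropy in bits of the empirical distribution of counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def renyi_entropy(counts, alpha):
    """Renyi entropy of order alpha, in bits:
    (1 / (1 - alpha)) * log2(sum_i p_i ** alpha)."""
    total = sum(counts)
    return math.log2(sum((c / total) ** alpha for c in counts)) / (1 - alpha)

# Ten abundant kmer classes, then the same plus 500 singleton "errors".
clean = [1000] * 10
noisy = clean + [1] * 500
print(shannon(noisy) - shannon(clean))                    # jumps noticeably
print(renyi_entropy(noisy, 2) - renyi_entropy(clean, 2))  # barely moves
```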