Introductory slides

BIT 815: Analysis of Deep Sequencing Data
Overview:
• This course will cover methods for analysis of data from Illumina and Roche/454 high-throughput
sequencing, with or without a reference genome sequence, using free and open-source software
tools with an emphasis on the command-line Linux computing environment*
Lecture Topics:
• Types of samples and analyses
• Experimental design and analysis
• Data formats and conversion tools
• Alignment, de-novo assembly, and other analyses
• Computing needs and available resources
• Annotation
• Summarizing and visualizing results
Labs:
Lab sessions meet in a computing lab, and will provide students with hands-on experience in managing
and analyzing datasets from Illumina and Roche/454 instruments, covering the same set of topics as
the lectures. Example datasets will be available from both platforms, for both DNA and RNA samples;
students who have their own datasets may contact the instructor prior to the course to discuss
opportunities for analysis of their data during the lab sessions.
* see http://www.physics.ubc.ca/mbelab/computer/linux-intro/html/ for an overview
Introduction to the course and to each other
- background in biology, computing, and sequencing
- experiments of interest to participants
Course structure
- 3 two-hour blocks per week, one theme per week
* ~ 45 min lecture/discussion
* ~ 70 min lab exercises
- some assigned reading
- participation in classroom discussion is expected
- no exams
Course Objective
- to teach you how to teach yourself
The sequencing rate is growing faster than Moore’s Law
Stein (2010) Genome Biology 11:207
An alternative perspective from an independent source
Doubling time 19.8 months
Doubling time 2 months
Doubling time 7.3 months
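A doubling time is easier to compare once it is converted to a yearly growth factor. The sketch below does that for the three doubling times listed above, assuming steady exponential growth; the function name is illustrative and not part of the course materials.

```python
def annual_fold_increase(doubling_time_months):
    """Fold increase over 12 months, assuming steady exponential growth."""
    return 2 ** (12 / doubling_time_months)


# The three doubling times quoted on the slide, in months.
for months in (19.8, 7.3, 2):
    fold = annual_fold_increase(months)
    print(f"doubling every {months} months -> about {fold:.1f}-fold per year")
```

A 2-month doubling time means six doublings per year, or a 64-fold increase, which is why small differences in doubling time matter so much over a few years.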
Sequence data analysis is changing rapidly
- relatively few methods are completely static
- much of the software is still under active development
- new methods and tools are reported every month
- staying on the learning curve is essential
Why use Linux for sequencing data analysis?
- it is well-suited to the task
* preferred development platform for most tools
* modular design – thousands of independent programs
* however … it’s built for speed, not for comfort
Modular design in Linux – a ‘toolbox’ approach
• Individual components of the Linux operating system are written
as separate programs
• Different programs can have similar functions
• A Linux “distribution” is a collection of programs that work
together as an operating system
• Users have the power to add new programs, or take away existing
programs that are not being used, to optimize system performance
A map of the software components of the kernel
Why is modularity an advantage?
- adding new software is relatively straightforward
- the operating system can be continually upgraded
- adding tools to the toolbox is easy
- your analyses are limited only by your imagination;
the tools to carry them out are probably already in place
There is always more than one way to do it
- some sequence analysis tasks have matured to stability
- most have not, and are still changing
- ‘best practices’ are also changing, and subject to dispute
- staying on the learning curve is essential
Key principles from the Eric Raymond book chapter
• Clarity is better than cleverness. Document everything you
do, because you won’t remember what you did, or why
• Programmer time is more expensive than machine time.
Don’t worry about optimizing things unless it is necessary
• Prototype before polishing – get it working before you
optimize it. It is often easiest to start with something very
simple, then add complexity and capability in steps
• Design for simplicity; add complexity only where you must.
“Make things as simple as possible, but no simpler” –
paraphrased from Albert Einstein
Computational thinking – four general principles
• Decompose a complex problem to simple steps. Linux is
based on simple tools that do one thing well; these tools require
problems to be framed in simple terms.
• Look for patterns. Recognizing similarities among different types of
problems allows re-use of the same tools in new contexts.
• Generalize patterns to create abstract versions. A tool is most
powerful when it can be applied to a variety of problems that all share
common features.
• Combine simple tools into more complex pipelines.
Repetitive tasks are what computers are good at – our job is to build
the algorithms, or sequences of simple steps, that allow the computer
to do those repetitive tasks so we don’t have to.
File Globbing Exercises
• Download FileGlobbing.pdf from the course website, and
also get smallfiles.zip if you are not using a BIT laptop
(http://www4.ncsu.edu/~rosswhet/BIT815/Spring2013/list.html)
• Note the complexity of the explanation – way more
information than you really wanted
• Start with the simplest forms, and work up from there
• Right-click on the bit815 (or smallfiles) folder in
Documents, choose “Command Prompt Here” from the
drop-down menu.
• Use ls *, ls *.fq, ls smallread[12].fq, ls small*, and other
commands to explore what works and what doesn’t
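The same wildcard patterns can be tried out without touching any files. This is a minimal sketch using Python's fnmatch module, which follows shell-style wildcard rules; the file names are hypothetical stand-ins, not the actual contents of smallfiles.zip.

```python
from fnmatch import fnmatchcase

# Hypothetical directory contents, for illustration only.
names = ["smallread1.fq", "smallread2.fq", "smallread3.fq", "small.sam", "notes.txt"]


def glob_match(pattern, names):
    """Return the names that a shell-style wildcard pattern would match."""
    return [name for name in names if fnmatchcase(name, pattern)]


# The patterns from the exercise.
for pattern in ("*", "*.fq", "smallread[12].fq", "small*"):
    print(f"{pattern:18} -> {glob_match(pattern, names)}")
```

The bracket pattern [12] matches exactly one character, either 1 or 2, so smallread3.fq is left out even though it matches *.fq.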
File and Directory Commands
• The Software Carpentry videos introduced several
commands related to directories and files – most if not all
of those will work in the Gnu On Windows or Mac Terminal
command line
• Start with creating a new directory – mkdir sandbox
• Change to the new directory, and create some files there:
cd sandbox
touch file1 file2 file3 file4 file5 file6 file7 file8 file9
• List those files, using the short version or the long one:
ls
ls -la
• Note that all files are empty (0 bytes in column 5)
File and Directory Commands
• Create another directory within the sandbox directory
mkdir bucket
• Create a symbolic link (equivalent to a Windows shortcut)
ln -s ../smallread1.fq mylink
• Do another long directory listing and look at the output –
what information is available there?
ls -la
File and Directory Commands
drwxrwxr-x 2 ross ross 4096 Mar 1 11:05 bucket
-rw-rw-r-- 1 ross ross 0 Mar 1 11:05 file1
-rw-rw-r-- 1 ross ross 0 Mar 1 11:05 file2
-rw-rw-r-- 1 ross ross 0 Mar 1 11:05 file3
-rw-rw-r-- 1 ross ross 0 Mar 1 11:05 file4
-rw-rw-r-- 1 ross ross 0 Mar 1 11:05 file5
-rw-rw-r-- 1 ross ross 0 Mar 1 11:05 file6
-rw-rw-r-- 1 ross ross 0 Mar 1 11:05 file7
-rw-rw-r-- 1 ross ross 0 Mar 1 11:05 file8
-rw-rw-r-- 1 ross ross 0 Mar 1 11:05 file9
lrwxrwxrwx 1 ross ross 16 Mar 1 11:06 mylink -> ../smallread1.fq
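One way to see what information a long listing carries is to split a line into named fields. The sketch below assumes the nine-column layout shown above (other systems may format the date differently), and the field names are my own labels.

```python
def parse_long_listing(line):
    """Split one line of `ls -l` output into named fields."""
    perms, links, owner, group, size, month, day, time, name = line.split(maxsplit=8)
    kinds = {"d": "directory", "-": "regular file", "l": "symbolic link"}
    target = None
    if perms[0] == "l" and " -> " in name:
        # Symbolic links are shown as "name -> target".
        name, target = name.split(" -> ", 1)
    return {
        "type": kinds.get(perms[0], "other"),
        "owner_perms": perms[1:4],
        "group_perms": perms[4:7],
        "other_perms": perms[7:10],
        "links": int(links),
        "owner": owner,
        "group": group,
        "size_bytes": int(size),
        "modified": f"{month} {day} {time}",
        "name": name,
        "target": target,
    }


for line in (
    "drwxrwxr-x 2 ross ross 4096 Mar 1 11:05 bucket",
    "-rw-rw-r-- 1 ross ross 0 Mar 1 11:05 file1",
    "lrwxrwxrwx 1 ross ross 16 Mar 1 11:06 mylink -> ../smallread1.fq",
):
    print(parse_long_listing(line))
```

Note that the size of the symbolic link (16 bytes) is the length of the path it points to, not the size of the target file.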
• Use pwd to make sure you are still in the sandbox
directory
• Try removing everything in the directory with the command rm *
and see what happens
File and Directory Commands
• Remove the bucket directory using rmdir bucket
• Be very careful when using file globbing with the rm
command, because there is no undelete in Linux
• When something is deleted, it is gone forever, so be
careful – always make sure you know what directory you
are in, and which files will be affected by any globbing
characters you use.
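A cautious habit is to list what a pattern matches before deleting anything. The sketch below illustrates the idea in a throwaway temporary directory; the helper names preview and remove_files are hypothetical, and the code only imitates the behavior of plain rm, which removes files but refuses to remove directories.

```python
import tempfile
from pathlib import Path


def preview(directory, pattern):
    """Return the names a glob pattern matches, without changing anything."""
    return sorted(p.name for p in Path(directory).glob(pattern))


def remove_files(directory, pattern):
    """Delete matching files and links; leave directories alone, as plain rm does."""
    removed, skipped = [], []
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_dir() and not path.is_symlink():
            skipped.append(path.name)
        else:
            path.unlink()
            removed.append(path.name)
    return removed, skipped


# Rebuild a small sandbox in a temporary directory that is discarded afterwards.
with tempfile.TemporaryDirectory() as sandbox:
    for i in range(1, 10):
        (Path(sandbox) / f"file{i}").touch()
    (Path(sandbox) / "bucket").mkdir()

    print("pattern * matches:", preview(sandbox, "*"))
    removed, skipped = remove_files(sandbox, "*")
    print("removed:", removed)
    print("left in place:", skipped)
```

On the command line, running ls with the same pattern you intend to give to rm serves the same purpose as preview here.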
Regular Expression Exercises
• Download RegularExpressions.pdf from the course website
(http://www4.ncsu.edu/~rosswhet/BIT815/Spring2013/list.html)
• Read through the document – note that (as with file
globbing) there are different ways to describe the same
pattern
• Open Notepad++ from the Programs menu in Windows,
navigate to the Documents/bit815 folder, and open
smallread1.fq
• Use Ctrl-H to open the search and replace dialog box
Regular Expression Exercise
• Enter the following expression in the Find what: box
^@1:([0-9]{1,3}):([0-9]{2,5}):([0-9]{2,5}):([YN])
• Enter the following expression in the Replace with: box
@InstrID_FlowcellID_lane1_tile\1_xcoord\2_ycoor\3_pass\4
• Click the Find Next button to see if the pattern matches
anything in the file
• If not, make sure you entered it correctly. If so, click Replace
to see what happens
• If it works for one example, click Replace All and scroll
through the file to see the results
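The same find-and-replace can be tried outside Notepad++. This is a sketch using Python's re module with the two expressions copied verbatim from the exercise; the sample record is hypothetical and may not look like the headers in smallread1.fq.

```python
import re

# Expressions copied from the exercise.
PATTERN = r"^@1:([0-9]{1,3}):([0-9]{2,5}):([0-9]{2,5}):([YN])"
REPLACEMENT = r"@InstrID_FlowcellID_lane1_tile\1_xcoord\2_ycoor\3_pass\4"


def rewrite_headers(text):
    """Apply the exercise's replacement to every line that matches the pattern."""
    # re.MULTILINE makes ^ match at the start of each line, not just the text.
    return re.sub(PATTERN, REPLACEMENT, text, flags=re.MULTILINE)


# Hypothetical four-line record, for illustration only.
sample = "@1:12:1234:5678:Y\nACGTACGT\n+\nIIIIHHHG\n"
print(rewrite_headers(sample))
```

Each pair of parentheses in the pattern captures one field (tile, X, Y, pass flag), and \1 through \4 in the replacement put the captured text back in the new header.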
Sequencing technology overview
- Two different systems on campus: Illumina GAIIx, 454
- A similar overall strategy for highly-parallel sequencing
- Different approaches taken at virtually every step
- These different platforms produce data with different
characteristics
- Other platforms are available off-campus, but are not a
focus of the course
Similarities
- DNA molecules are fragmented and ligated to adaptors
- individual DNA molecules are immobilized on a surface
- a series of nucleotide addition reactions is carried out
- the nucleotide added is detected after each addition
- a data file is produced containing the DNA sequences of
many fragments
Sequencing technology overview - 454
DNA fragmentation – usually sonication
Adaptor oligonucleotide addition
Images from www.454.com
Sequencing technology overview - 454
A single molecule immobilized on a bead
PCR amplification in oil-water emulsion
creates ~10 million copies per bead
Sequencing technology overview - 454
DNA-containing beads deposited in wells
of PicoTiterPlate, along with smaller beads
with immobilized enzymes for light
production
“Pyrosequencing” produces light when any
nucleotide is incorporated, so only one of the
four nucleotides is provided during a cycle, and
light output is recorded during each cycle
Sequencing technology overview - 454
A ‘flowgram’ showing light output from each cycle of base addition;
one flowgram is produced for each of the ~1 million wells in a PicoTiterPlate
[Figure label: TACG ‘key’ sequence]
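The relationship between a DNA sequence and its flowgram can be sketched in a few lines. This is an idealized illustration, not the instrument's software: it assumes nucleotides are flowed in the repeating order T, A, C, G, and it reports whole-number signals, whereas real flowgrams are noisy light intensities.

```python
def flowgram(sequence, flow_order="TACG", max_flows=400):
    """Return (nucleotide, bases incorporated) for each flow, idealized."""
    signals = []
    position = 0
    flow = 0
    while position < len(sequence) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        # Every consecutive matching base is incorporated in the same flow,
        # so a homopolymer run gives one proportionally larger signal.
        while position < len(sequence) and sequence[position] == base:
            run += 1
            position += 1
        signals.append((base, run))
        flow += 1
    return signals


print(flowgram("TTACCCG"))
```

Because a run such as CCC is read from the size of a single light signal, errors in the length of homopolymer runs are a characteristic weakness of this chemistry.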
Sequencing technology overview – Illumina
Illumina uses a glass ‘flowcell’, about the size of a microscope slide, with 8 separate ‘lanes’.
The GAIIx instrument focuses the laser and light detection system only on one of the two
surfaces inside the flowcell; the new HiSeq instrument scans both surfaces and therefore
doubles the yield of sequence data per lane. Additional improvements in scanning and
increases in cluster density make the difference closer to 4x or 5x more data from a HiSeq.
Sequencing technology overview – Illumina
Fragment DNA, ligate adaptor oligos
Single-stranded DNA binds to flowcell surface
Sequencing technology overview – Illumina
Surface-bound primers are extended by DNA polymerase across annealed ssDNA molecules,
the DNA is denatured back to single strands, and the free ends of immobilized strands anneal
again to oligos bound on the surface of the flowcell. This ‘bridge PCR’ continues until a cluster of
~ 1000 molecules is produced on the surface of the flowcell, all descended from the single
molecule that bound at that site. After PCR, the free ends of all DNA strands are blocked.
Sequencing technology overview – Illumina
Another perspective of the amplification process, showing the clusters of products
Sequencing technology overview – Illumina
[Figure: Cycle 1 through Cycle 5, with example base calls GCTGA, CTTAG, TAAGT, AGCCG]
Although four different colors are used for the fluorescent nucleotides, only two lasers are
used to excite the fluorescence. The fluorescent labels are grouped in pairs - labels on A and C
are excited by a red laser, and labels on G and T are excited by a green laser. The software
assumes that signal from both lasers will be balanced at each cycle.
This means that distinguishing between the A signal and the C signal is more difficult for the
instrument than A versus G or A versus T. Base substitution errors are the most common type of
sequencing error for Illumina instruments.
Understanding FASTQ format
or “what do all these symbols mean?”
[Figure labels, header fields: instrument ID, flowcell, lane, tile, X, Y, barcode, read #]
[Figure labels, record parts: header lines, sequence, quality scores]
• Quality scores are numbers that represent the probability that
the given base call is an error.
• These probabilities are always less than 1, so the value is given
as minus 10 times the base-10 logarithm of the probability: Q = -10 × log10(p)
• For example, an error probability of 0.001 (1 × 10^-3) is represented
as a quality score of 30.
• The numbers are converted into text characters so they occupy
less space – a single character carries as much information as 2 digits
plus a space between adjacent values
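The conversion between error probability, quality score, and text character can be sketched as follows. The default ASCII offset of 33 is an assumption that fits one widely used encoding; other encodings use a different offset, so check which one your data uses before relying on the decoded values.

```python
import math


def quality_from_error_prob(p):
    """Quality score: minus 10 times the base-10 log of the error probability."""
    return -10 * math.log10(p)


def error_prob_from_quality(q):
    """Inverse conversion: error probability for a given quality score."""
    return 10 ** (-q / 10)


def decode_quality_string(qualities, offset=33):
    """Turn quality characters into numbers (offset=33 is an assumption)."""
    return [ord(ch) - offset for ch in qualities]


print(quality_from_error_prob(0.001))   # the slide's example
print(decode_quality_string("I?5"))     # hypothetical quality characters
```

With an offset of 33, the character I has ASCII code 73 and stands for quality 40, which is an error probability of 1 in 10,000.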
Understanding FASTQ format
Illumina v1.8 header version:
@HWI-EAS209:06:FC706VJ:5:58:5894:21141 1:N:ATCACG
[Figure labels, header fields: instrument/flowcell ID, lane, tile, X, Y, barcode, read #]
[Figure labels, record parts: header lines, sequence, quality scores]
Unfortunately, at least four different ways of converting numbers
to characters have been used, and header line formats have also
changed, so one aspect of data analysis is knowing what you have.
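Knowing what you have starts with pulling the header apart. The sketch below splits the example v1.8 header shown above into named fields. The names run and filtered are my own labels for fields the slide does not name, and the code tolerates headers with or without an extra field before the barcode; treat it as an illustration rather than a complete parser.

```python
def parse_header(header):
    """Split an Illumina 1.8-style FASTQ header line into named fields."""
    location, _, member = header.lstrip("@").partition(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    member_fields = member.split(":")
    return {
        "instrument": instrument,
        "run": run,                       # label assumed, not named on the slide
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read": int(member_fields[0]),
        "filtered": member_fields[1],     # label assumed, Y or N
        "barcode": member_fields[-1],     # last field, whatever precedes it
    }


fields = parse_header("@HWI-EAS209:06:FC706VJ:5:58:5894:21141 1:N:ATCACG")
for key, value in fields.items():
    print(f"{key:10} {value}")
```

A header in an older format has a different number of colon-separated fields, so the unpacking step raises an error instead of silently mislabeling the data.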
Illumina flowcell geometry (GAIIx)
Lanes: 1 2 3 4 5 6 7 8
A flowcell has 8 lanes, which are physically separated.
Each lane is imaged during each cycle of sequencing
in multiple separate images, called ‘tiles’, which are
not physically separated.
Tiles within a GAIIx lane are numbered from 1 to 60
down the length of the lane, then from 61 to 120
back up the other side.
Tile layout within a lane (one column numbered down, the other back up):
  1  120
  2  119
  ...
 59   62
 60   61
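The serpentine numbering means that tiles with very different numbers can be physical neighbors. A minimal sketch of the conversion from tile number to position, assuming 60 tiles per column as described above; the column and row numbering is my own convention, with row 1 at the end of the lane where tile 1 sits.

```python
def gaiix_tile_position(tile, tiles_per_column=60):
    """Return (column, row) for a GAIIx tile number, with row 1 next to tile 1."""
    if not 1 <= tile <= 2 * tiles_per_column:
        raise ValueError(f"tile number out of range: {tile}")
    if tile <= tiles_per_column:
        return 1, tile                        # first column, numbered downward
    return 2, 2 * tiles_per_column + 1 - tile  # second column, numbered back up


for tile in (1, 120, 2, 119, 60, 61):
    print(tile, gaiix_tile_position(tile))
```

Tiles 1 and 120 share a row, as do 60 and 61, so a problem at one end of the lane can show up in tiles from both ends of the number range.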
Illumina flowcell geometry (HiSeq)
Lanes: 1 2 3 4 5 6 7 8
Tiles within a HiSeq lane are numbered using a
different system. The first digit denotes which
surface (1 = lower, 2 = upper), the second denotes a
vertical “swath” (1 = left, 2 = middle, 3 = right), and
the last two digits denote a tile within that swath (01
means closest to the outflow end of the lane; 08 or
16 means closest to the inflow end of the lane).
1101 1201 1301
1102 1202 1302
1107 1207 1307
1108 1208 1308
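A four-digit tile number can be decoded digit by digit. The sketch below uses the surface and swath meanings exactly as stated on this slide; the function name is illustrative.

```python
# Digit meanings as given on the slide.
SURFACES = {"1": "lower", "2": "upper"}
SWATHS = {"1": "left", "2": "middle", "3": "right"}


def decode_hiseq_tile(tile):
    """Split a four-digit tile number into (surface, swath, tile within swath)."""
    text = str(int(tile))
    if len(text) != 4 or text[0] not in SURFACES or text[1] not in SWATHS:
        raise ValueError(f"not a four-digit tile number of the expected form: {tile!r}")
    return SURFACES[text[0]], SWATHS[text[1]], int(text[2:])


for tile in (1101, 1208, 1308):
    print(tile, decode_hiseq_tile(tile))
```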
Command-line Exercises
• Download SAMformatAndCLtools.pdf from the course
website
(http://www4.ncsu.edu/~rosswhet/BIT815/Spring2013/list.html)
• Read the first two pages of the document – don’t worry
about the “bitwise flag” information; that is for future
reference
• Go back to the command-line window that you used for
the file globbing exercise – right-click on the folder that
contains the example FASTQ and SAM files, and select
“Command Prompt Here” from the drop-down menu