As the volume of genetic sequence data increases due to improved sequencing techniques and increased interest, the computational tools available to analyze the data are becoming inadequate. This thesis seeks to improve a few of the computational methods available to access and analyze data in the genetic sequence databases. The first two results are parallel algorithms based on previously known sequential algorithms. The third result is a new approach, based on assumptions that we believe make sense in the biological context of the problem, to approximating an NP-complete problem. The final result is a fundamentally new approach to approximate string matching using the divide-and-conquer paradigm instead of the dynamic programming approach that has been used almost exclusively in the past.
Dynamic programming algorithms to measure the distance between sequences
have been known since at least 1972. Recently there has been interest
in developing parallel algorithms to measure the distance between two sequences.
We have developed an optimal parallel algorithm to find the edit distance, a metric
frequently used to measure distance, between two sequences.
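As a point of reference, the sequential dynamic program that such parallel algorithms build on can be sketched as follows; the function name and unit edit costs are illustrative and not taken from the thesis.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic O(len(a) * len(b)) dynamic program with unit costs
    for substitution, insertion, and deletion."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances between a[:0] and every prefix of b
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (a[i - 1] != b[j - 1])
            curr[j] = min(sub, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return prev[n]

print(edit_distance("ACGT", "AGT"))  # 1 (one deletion)
```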
It is often interesting to find the substrings of length k that appear most
frequently in a given string. We give a simple sequential algorithm to solve this problem and an efficient parallel version of it. The parallel algorithm uses a novel, efficient parallel bucket sort.
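A minimal sketch of the sequential counting step (the parallel version and its bucket sort are the thesis's contribution and are not reproduced here); the names are illustrative.

```python
from collections import Counter

def most_frequent_kmers(s: str, k: int, top: int = 5):
    """Count every length-k substring of s and return the most frequent ones."""
    counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    return counts.most_common(top)

print(most_frequent_kmers("ACGTACGTAC", 4))
# [('ACGT', 2), ('CGTA', 2), ('GTAC', 2), ('TACG', 1)]
```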
When sequencing a large segment of DNA, the original DNA sequence is reconstructed from the results of sequencing fragments (which may or may not contain errors) of many copies of the original DNA. New algorithms are given to solve the problem of reconstructing the original DNA sequence both with and without errors introduced into the fragments. A program based on this algorithm is used to reconstruct the human beta globin region (HUMHBB) when given a set of 300- to 500-mers drawn randomly from the HUMHBB region.
Approximate string matching is used in a biological context to model the steps of evolution. While such evolution may proceed base by base using the change, insert, or delete operators, there is also evidence that whole genes may be moved or inverted. We introduce a new problem, the string-to-string rearrangement problem, which allows movement and inversion of substrings. We give a divide-and-conquer algorithm for finding a rearrangement of one string within another.
Advisors/Committee Members: Cull, Paul (advisor), D'Ambrosio, Bruce (committee member).

There is a rapidly increasing amount of de novo genome assembly using next-generation sequencing (NGS) short reads. However, several big challenges remain to be overcome to make it efficient, accurate, and versatile. Because of the very short read lengths available in the early days of NGS, the first generation of assemblers, though successfully applied to several published genomes, cannot fully exploit the reads generated by newer sequencers. The new reads are not only longer but also exhibit improved profiles and patterns that have made some previously prohibitive genome studies feasible. Taking advantage of them, however, requires new algorithms.
SOAPdenovo2 was developed with a new algorithmic design that: 1) reduces memory consumption in graph construction; 2) resolves more complex repetitive regions in contig assembly; 3) increases coverage and length in scaffolding; 4) improves gap closing; and 5) is optimized for large genomes. Benchmarks on public datasets show that SOAPdenovo2 greatly surpasses its predecessor SOAPdenovo and is competitive with other assemblers in both assembly length and accuracy.
SOAPdenovo2 was developed with versatility as a top priority. Working alone or as part of a pipeline, SOAPdenovo2 demonstrated its power by 1) presenting detailed structural variation (SV) maps of an Asian and an African genome, showing that whole-genome de novo assembly can serve as a new route to a more comprehensive SV map; 2) drafting the highly polymorphic and repetitive oyster genome, showing that complicated oceanic species can be assembled by SOAPdenovo2 together with a hierarchical assembly strategy; and 3) finishing the assembly of a haplotype-resolved diploid genome without using a reference genome. The community has also successfully applied SOAPdenovo2 to the assembly of over a hundred species.
The versatility of SOAPdenovo2 was further exemplified by the development of SOAPdenovo-Trans, an assembler tailored for transcriptome assembly from RNA sequencing data. Benchmarked on known transcripts from well-annotated genomes, SOAPdenovo-Trans outperforms two other software packages in identifying alternative splicing events and differential expression levels.

The development of next-generation sequencing
technology enables us to obtain a vast number of short reads from
metagenomic samples. In metagenomic samples, the reads from
different species are mixed together. Metagenomic binning has therefore been introduced to cluster reads from the same or closely related species, and metagenomic annotation to predict the taxonomic origin of each read. Both metagenomic binning and
annotation are critical steps in downstream analysis. This thesis
discusses the difficulties of these two computational problems and
proposes two algorithmic methods, MetaCluster 5.0 and
MetaAnnotator, as solutions.
There are six major challenges in
metagenomic binning: (1) the lack of reference genomes; (2) uneven
abundance ratios; (3) short read lengths; (4) a large number of
species; (5) the existence of species with extremely low abundance; and (6) the need to recover low-abundance species. To solve these problems,
I propose a two-round binning method, MetaCluster 5.0. The
improvement achieved by MetaCluster 5.0 is based on three major
observations. First, the short q-mer (length-q substring of the
sequence with q = 4, 5) frequency distributions of individual
sufficiently long fragments sampled from the same genome are more
similar than those sampled from different genomes. Second,
sufficiently long w-mers (length-w substring of the sequence with w
≈ 30) are usually unique in each individual genome. Third, the
k-mer (length-k substring of the sequence with k ≈ 16) frequencies from the reads of a species are usually linearly proportional to the species' abundance.
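The first observation can be illustrated with a small sketch that computes normalized 4-mer profiles of two fragments and measures how far apart they are; fragments drawn from the same genome should tend to give smaller distances. The function names and the distance choice are illustrative, not MetaCluster's actual implementation.

```python
from collections import Counter
from itertools import product
from math import sqrt

def qmer_profile(fragment: str, q: int = 4):
    """Normalized frequency vector over all 4**q possible q-mers."""
    counts = Counter(fragment[i:i + q] for i in range(len(fragment) - q + 1))
    total = sum(counts.values())
    return [counts["".join(p)] / total for p in product("ACGT", repeat=q)]

def profile_distance(f1: str, f2: str, q: int = 4) -> float:
    """Euclidean distance between the q-mer profiles of two long fragments."""
    p1, p2 = qmer_profile(f1, q), qmer_profile(f2, q)
    return sqrt(sum((x - y) ** 2 for x, y in zip(p1, p2)))
```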
The metagenomic annotation methods in the literature often suffer from five major drawbacks: (1) many reads cannot be annotated; (2) annotation is less precise for reads and often incorrect for contigs; (3) novel clades with few reference genomes are handled poorly; (4) performance is affected by variable genome sequence similarities between clades; and (5) time complexity is high. In this thesis, a novel tool, MetaAnnotator, is proposed to tackle these problems. MetaAnnotator makes four major contributions. Firstly, instead of annotating reads/contigs independently, a cluster of reads/contigs is annotated as a whole. Secondly, multiple reference databases are integrated. Thirdly, for each individual clade, quadratic discriminant analysis is applied to capture the similarities between reference sequences in the clade. Fourthly, instead of using alignment tools, MetaAnnotator performs annotation using exact k-mer matching, which is more efficient.
Experiments on both simulated and real datasets show that MetaCluster 5.0 and MetaAnnotator outperform existing tools, with higher accuracy as well as lower time and space costs.

The emergence of high-throughput, low-cost next-generation sequencing (NGS) technologies has led to an explosion in genetic information for clinical care. The exploitation of such massive genetic information has the potential to revolutionize disease diagnosis and drug development, but it also reveals an urgent need for efficient and accurate tools to analyze genetic information, in particular, to interpret genetic variants for clinical purposes.
The challenge of NGS data management and analysis lies not only in handling the massive amount of data generated by genetic tests. The diverse sources (databases) of medical knowledge used to annotate genetic variants also complicate automation of variant analysis; for example, the coordinate system and naming convention vary from case to case. Integrating these annotations is an important but enormous task: the resulting databases require substantial storage space, and querying can be very slow without proper indexing and pre-processing. Another issue is that, to help users better understand genetics-related annotations, visualization of the different aspects of variant information needs to be handled carefully. Existing software tools have solved some of these problems, but lack other features.
In this thesis, I present a data management framework for the clinical interpretation of human variations. First, it involves a unified coordinate system in which annotations are categorized according to variants, genes or proteins. Second, the annotation process can be sped up by pre-processing the data on a supercomputer, and the integrated database storage can be reduced via a unified database representation with compressed fields. Based on this framework, a variant interpretation software tool called database.bio was designed and developed. It combines variant annotation, categorization, and visualization in order to provide clinical doctors and bioinformaticians with insight into individual genetic characteristics. Moreover, categorization rules and a filter cascade function are included in database.bio to allow users to focus on a smaller volume of data, and a genome browser and seven specific tools are integrated to provide a better view of variant distributions, the regions near a variant, the impact on the protein domain, and the pathways.

A new generation of non-Sanger-based sequencing technologies, so-called “next-generation” sequencing (NGS), has been changing the landscape of genetics at unprecedented speed. In particular, our capacity to decipher the genotypes underlying phenotypes, such as diseases, has never been greater. However, before fully applying NGS in medical genetics, researchers have to bridge the widening gap between the generation of massively parallel sequencing output and the capacity to analyze the resulting data. In addition, even when an effective NGS analysis yields a list of candidate genes with potential causal variants, pinpointing disease genes from that long list remains a challenge. The issue becomes especially difficult when the molecular basis of the disease is not fully elucidated.
New NGS users are easily bewildered by the plethora of options in mapping, assembly, variant calling and filtering programs, and may have no idea how to compare these tools and choose the “right” ones. To get an overview of the various bioinformatics approaches to mapping and assembly, a series of performance evaluations was conducted using both real and simulated NGS short reads. For NGS variant detection, the two most widely used toolkits, SAMtools and GATK, were assessed. Based on the results of this systematic evaluation, an NGS data-processing and analysis pipeline was constructed. The pipeline proved successful with the identification of a mutation (a frameshift deletion in Hnrnpa1, p.Leu181Valfs*6) related to congenital heart defect (CHD) in procollagen type IIA-deficient mice.
In order to prioritize risk genes for diseases, especially those with limited prior knowledge, a network-based gene prioritization model was constructed. It consists of two parts: network analysis on known disease genes (seed-based network strategy) and network analysis on differential expression (DE-based network strategy). Case studies of various complex diseases/traits demonstrated that the DE-based network strategy can greatly outperform traditional gene expression analysis in predicting disease-causing genes. A series of simulations indicated that the DE-based strategy is especially valuable for diseases with limited prior knowledge, and that the model's performance can be further improved by integrating it with the seed-based network strategy. Moreover, a successful application of the network-based gene prioritization model to a study of host genetics in influenza further demonstrated the model's capacity to identify promising candidates and to mine new risk genes and pathways not biased toward our current knowledge.
In conclusion, an efficient NGS analysis framework, from the steps of quality control and variant detection to those of result analysis and gene prioritization, has been constructed for medical genetics. The novelty of this framework is an encouraging attempt to prioritize risk genes for poorly characterized diseases by network analysis on known disease genes and differential expression data. The…
Advisors/Committee Members: Song, Y, Jin, D.

RNA plays an important role in molecular biology. RNA sequence comparison is an important method for analyzing gene expression. Since aligning RNA reads must handle gaps, mutations, poly-A tails, etc., it is much more difficult than aligning other sequences. In this thesis, we study RNA-Seq alignment tools and the existing gene information databases, and how to improve the accuracy of alignment and predict RNA secondary structure.
The known gene information database contains a large amount of reliable, already-discovered gene information. We also note that most DNA alignment tools are well developed: they run much faster than existing RNA-Seq alignment tools and have higher sensitivity and accuracy. Combining them with the known gene information database, we present a method for aligning RNA-Seq data using DNA alignment tools. That is, we use a DNA alignment tool to perform the alignment and then use the gene information to convert the alignment to genome coordinates.
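The coordinate-conversion step can be sketched as follows, assuming the exon structure of the transcript is known; the data layout and function name are hypothetical illustrations, not the thesis's actual code.

```python
def transcript_to_genome(pos: int, exons):
    """Map a 0-based transcript position to a genome position, given the
    transcript's exons as (genome_start, genome_end) pairs
    (0-based, end-exclusive, ordered along the transcript, forward strand)."""
    offset = pos
    for start, end in exons:
        length = end - start
        if offset < length:
            return start + offset
        offset -= length
    raise ValueError("position lies beyond the end of the transcript")

# Example: a two-exon transcript; transcript position 120 falls in exon 2.
exons = [(1000, 1100), (2000, 2150)]
print(transcript_to_genome(120, exons))  # 2020
```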
Although the gene information database is updated daily, there are still many genes and alternative splicing events that have not yet been discovered. If our RNA alignment tool relied only on the known gene database, many reads coming from unknown genes or alternative splicing events could not be aligned. Thus, we present a combinational method that covers potential alternative splicing junction sites. Combined with the original gene database, the new alignment tool covers most of the alignments reported by other RNA-Seq alignment tools.
Recently, many RNA-Seq alignment tools have been developed. They are more powerful and faster than the previous generation of tools. However, RNA read alignment is much more complicated than other sequence alignment, and the alignments reported by some RNA-Seq alignment tools have low accuracy. We present a simple and efficient filtering method based on the quality scores of the reads; it can filter out most low-accuracy alignments.
Finally, we present an RNA secondary structure prediction method that can predict pseudoknots (a type of RNA secondary structure) with high sensitivity and specificity.

The recent advance of second-generation sequencing
technologies has made it possible to generate a vast amount of
short read sequences from a DNA (cDNA) sample. Current short read
assemblers make use of the de Bruijn graph, in which each vertex is
a k-mer and each edge connecting vertex u and vertex v represents u
and v appearing in a read consecutively, to produce contigs. There
are three major problems for de Bruijn graph assemblers: (1) branch
problem, due to errors and repeats; (2) gap problem, due to low or
uneven sequencing depth; and (3) error problem, due to sequencing
errors. A proper choice of k value is a crucial tradeoff in de
Bruijn graph assemblers: a low k value leads to fewer gaps but more
branches; a high k value leads to fewer branches but more gaps.
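A minimal sketch of the de Bruijn graph construction just described; real assemblers such as IDBA use far more compact representations, so the names and data structures here are purely illustrative.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Vertices are k-mers; an edge u -> v is added whenever u and v
    appear consecutively (overlapping by k - 1 bases) in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            u, v = read[i:i + k], read[i + 1:i + k + 1]
            graph[u].add(v)
    return graph

g = build_de_bruijn(["ACGTACG"], 3)
# ACG -> {CGT}, CGT -> {GTA}, GTA -> {TAC}, TAC -> {ACG}
```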
In this thesis, I first analyze the fundamental genome assembly
problem and then propose an iterative de Bruijn graph assembler
(IDBA), which iterates from low to high k values, to construct a de
Bruijn graph with fewer branches and fewer gaps than any other de
Bruijn graph assembler using a fixed k value. Then, the
second-generation sequencing data from metagenomic, single-cell and
transcriptome samples is investigated. IDBA is then tailored with
special treatments to handle the specific issues for each kind of
data.
For metagenomic sequencing data, a graph partition algorithm is proposed to separate the de Bruijn graph into dense components, which represent similar regions in subspecies of the same species, and multiple sequence alignment is used to produce a consensus for each component. For sequencing data with highly uneven
depth such as single-cell and metagenomic sequencing data, a method
called local assembly is designed to reconstruct missing k-mers in
low-depth regions. Then, based on the observation that short and
relatively low-depth contigs are more likely erroneous, progressive
depth on contigs is used to remove errors in both low-depth and
high-depth regions iteratively. For transcriptome sequencing data,
a variant of the progressive depth method is adopted to decompose
the de Bruijn graph into components corresponding to transcripts
from the same gene, and then the transcripts are found in each component by considering read and paired-end read support.
Extensive experiments on both simulated and real data show that the IDBA assemblers outperform existing assemblers by constructing longer contigs with higher completeness and similar or better accuracy. The running time of the IDBA assemblers is comparable to that of existing algorithms, while their memory cost is usually lower.

Since the introduction of Next Generation Sequencing (NGS), it has quickly become a popular tool for studying monogenic and polygenic disorders. The genetic architecture of monogenic and polygenic disorders differs significantly, posing unique analysis challenges. By studying 4 different disorders in this thesis, I have attempted to demonstrate the strength of NGS and the approaches to analyzing NGS data across very different disorders.
Diffuse oesophageal leiomyomatosis (DOL) is a monogenic disorder displaying X-linked dominant inheritance. In the case study described, I studied 4 members of a family in which the mother and son were affected. Whole exome sequencing analysis coupled with a genotyping array resulted in the identification of a new COL4A5/6 deletion shared by the 2 affected individuals. The difficulty of detecting gonosomal mosaicism in the affected mother revealed a weakness of current analysis methods. A similar deletion was described in previous studies and could destabilize the collagen IV monomer and induce leiomyomatosis.
The three disorders discussed afterwards, choledochal cyst (CDD), mesial temporal lobe epilepsy related to hippocampal sclerosis (MTLE-HS), and hepatocellular carcinoma (HCC), are examples of polygenic disorders. Polygenic disorders often present a complex genetic architecture shaped by a combination of genetic and environmental factors. MTLE-HS and CDD both demonstrated a complex and heterogeneous genetic profile in which a vast number of genes predisposing to the disorder have been discovered. For the MTLE-HS study, 23 trios were studied to investigate inherited rare recessive mutations and de novo mutations. As a result, PKD1 and CEP170B were found to be recurrently mutated across patients, and multiple de novo variants were also found in genes linked to psychiatric disorders. The study of 33 CDD trios suggested a more complex genetic architecture for the disorder. Three different analysis approaches to detecting candidate mutations resulted in the finding of double-hit rare variants, de novo mutations, and an overall enrichment of rare variants compared to the normal population. The 3 approaches identified 31 genes recurrently mutated in the double-hit setting, 21 genes carrying de novo damaging variants, and significant enrichment in the interacting gene pair TRIM28/ZNF382 (p < 3.6x10^-6). To further understand the biology behind the vast number of genes involved, multiple functional databases on pathways, mouse knockout phenotypes, and human disease were queried.
The capabilities of NGS extend beyond simple mutations at the DNA level. As demonstrated by the HCC study, a deleterious and recurrent intronic Hepatitis B virus (HBV) integration was uncovered. Although this interesting event did not fall within the coding region, its effect is only observable after post-transcriptional splicing. By coupling transcriptome sequencing, Sanger sequencing and functional studies, a large number of HCC patients affected by a non-degradable CCNA2 protein encoded by a 177 bp insertion from HBV…

As the volume and complexity of available sequence data continue to grow at an exponential rate, the need for new sequence analysis techniques becomes more urgent, as does the need to test and to extend the existing techniques. These include, among others, techniques for assembling raw sequence data into usable genomic sequences; for using these sequences to investigate the evolutionary history of genes and species; and for examining the mechanisms by which sequences change over evolutionary time scales. This thesis comprises three projects within the field of sequence analysis.
It is shown that organelle genome DNA sequences can be assembled de novo using short Illumina reads from a mixture of samples and deconvoluted bioinformatically, without the added cost of indexing the individual samples. In the course of this work, a novel sequence element is described that probably could not have been detected with traditional sequencing techniques.
The problem of multiple optima of likelihood on phylogenetic trees is examined
using biological data. While the prevalence of multiple optima varies widely
with real data, trees with multiple optima occur less often among the best trees.
Overall, the results provide reassurance that the value of maximum likelihood
as a tree selection criterion is not often compromised by the presence of multiple
local optima on a single tree.
Fundamental mechanisms of mutation are investigated by estimating nucleotide
substitution rate matrices for edges of phylogenetic trees. Several large alignments
are examined, and the results suggest that the situation may be more
complex than we had anticipated. It is likely that genome scale alignments will
have to be used to further elucidate this question.

When challenged by difficult biological samples, the forensic analyst is far more likely to obtain useful data by sequencing the human mitochondrial DNA (mtDNA). Next-generation sequencing (NGS) technologies are currently being evaluated by the Forensic Science Program at Western Carolina University for their ability to reliably detect low-level variants in mixtures of mtDNA.
The sequence profiles for twenty individuals were obtained by
sequencing amplified DNA derived from the mitochondrial
hypervariable (HV) regions using Sanger methods. Two-person
mixtures were then constructed by mixing quantified templates,
simulating heteroplasmy at discrete sites and in defined ratios.
Libraries of unmixed samples, artificial mixtures, and instrument
controls were prepared using Illumina® Nextera® XT and
deep-sequenced on the Illumina® MiSeq™. Analysis of NGS data using
a novel bioinformatics pipeline indicated that minor variants could
be detected at the 5, 2, 1, and 0.5% levels of detection.
Additional experiments examining the occurrence of sequence variation in hair tissue demonstrate that a considerable amount of sequence variation can exist between hairs and other tissues derived from a single donor.
Keywords: Forensic science, Illumina(R) MiSeq(TM), low-level mixtures, Minor variant, Mitochondrial DNA, Next-generation sequencing
Advisors/Committee Members: Mark Wilson (advisor).

De novo genome assembly is a fundamental problem in genomics research. When assembling large genomes, time is often a very important concern, and one might have no choice but to use a more efficient assembler like SOAPdenovo2 instead of a high-quality but prohibitively slow assembler (e.g., SPAdes). Yet SOAPdenovo2 has inherent difficulty taking full advantage of longer reads (say, 150bp to 250bp from Illumina HiSeq and MiSeq). Other assemblers that are based on string graphs (e.g., SGA), though less popular and also very slow, are indeed more favorable for longer reads.
In this thesis, I mainly present a
new contig assembler called BASE, based on a seed-extension
approach. It exploits an efficient indexing of reads to generate
adaptive seeds with high probability of unique appearance in the
genome and high sequencing quality. Guided by these seeds, BASE
constructs extension trees and gradually removes the branches with
a method called reverse validation, which utilizes information
about read coverage and paired-end relationship to obtain consensus
sequences of reads sharing the seeds. These consensus sequences are
further extended to form high quality contigs.
Benchmarks on several bacterial and human datasets demonstrate BASE's performance advantage in speed and assembly quality when longer reads are used. Our first benchmark was based on two datasets of deeply sequenced bacterial genomes (240X) with read lengths of 100bp and 250bp. Especially for 250bp reads, BASE performs much better than SOAPdenovo2 and SGA and is similar to SPAdes in performance. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. We further compared BASE and SOAPdenovo2 using human genome datasets with read lengths of 100bp, 150bp and 250bp. BASE consistently achieves a higher N50 for all datasets, and the improvement becomes more significant when read length reaches 250bp. SOAPdenovo2 uses relatively more memory when the sequencing error rate is high.
BASE is an
efficient assembler for contig construction, with significant
improvement in quality for long NGS reads. It could be easily
extended to support scaffolding in the near
future.

Genome-wide association studies (GWAS) have been successfully applied to several complex diseases, yielding many confirmed associations. Nonetheless, at most they have explained half the genetic variance, and often much less. It is quite apparent that the rich GWAS datasets contain far more information than is typically uncovered using the most common univariate analysis approaches. The focus of the present thesis is on methods to extract the most information from GWAS and on post-GWAS experimental strategies, divided into four broad approaches.
The
first approach involves use of candidate gene studies to explore
epistasis and gene by environment interactions, using samples of
two different disorders, Hirschsprung disease (HSCR) and cognitive
decline. For HSCR, previous studies identified rare and common
variants in two genes, RET and NRG1, to be predisposing to disease,
and further demonstrated a statistical interaction between common
variants in these two genes. In this thesis, joint effects between
common and rare variants both within and across the two genes were
demonstrated by statistical modelling and then supported by
functional interaction. For cognitive decline, SNPs previously
implicated in Alzheimer’s disease were examined for epistasis and
gene-environment interaction in an independent sample of elderly
Chinese. The ACE rs1800764_C heterozygote in combination with
below-college educational level was found to result in greater
cognitive decline. These two studies demonstrate the utility of
post-GWAS candidate gene studies in detecting interaction effects.
The next two approaches were applied to GWAS summary statistics at the SNP level. One of them involves meta-analysis applied to 11
epilepsy GWAS datasets, to increase power and explore whether
findings are population-specific or general across populations. Two
novel susceptibility genes (SCN1a and PCDH7) were identified using
this approach. Furthermore, the previously identified epilepsy risk
variant CAMSAP1L1 was found to only be a risk factor for Chinese
focal epilepsy patients. The other summary statistic approach
involved the development of a revised GWAS pathway analysis
pipeline to search for effective genes or gene-sets. Its
application to two autoimmune diseases revealed that multiple
pathways might be dysfunctional simultaneously and hence contribute
jointly to disease status. In addition, it indicated the pipeline
was powerful for mining moderate/small genetic effects on common
disorders.
The last approach to post-GWAS analysis involves the
use of next-generation sequencing (NGS). To this end, an automated
NGS pipeline for variant calling, filtering and prioritization was
established, specifically designed for gene burden analysis,
recurrent gene sharing and de novo mutation (DNM) identification.
The pipeline was applied to NGS sequencing of 62 candidate genes
and also whole exomes of HSCR patients and their parents. Results
indicated that multiple rare damaging inherited variants in several
genes contribute to HSCR; in addition, loss of…

Rapidly developing sequencing technology has brought scientists the opportunity to look into detailed genotype information in the human genome. Computational programs have played important roles in identifying disease-related genomic variants from huge amounts of sequencing data.
In the past years, a number of computational algorithms have been developed that solve many crucial problems in sequencing data analysis, such as mapping sequencing reads to the genome and identifying SNPs. However, many difficult and important issues still await satisfactory solutions. A key challenge is identifying disease-related mutations against the background of non-pathogenic polymorphisms. Another crucial problem is detecting INDELs, especially long deletions, under the technical limitations of second-generation sequencing technology.
To predict disease-related mutations, we developed a machine learning-based (random forest) prediction tool, EFIN (Evaluation of Functional Impact of Nonsynonymous mutations). We build a multiple sequence alignment (MSA) for a query protein and its homologous sequences. The MSA is then divided into blocks according to the taxonomic information of the sequences. After that, we quantify the conservation in each block using a number of selected features, for example entropy, a concept borrowed from information theory. EFIN was trained on the Swiss-Prot and HumDiv datasets. In a series of fair comparisons, EFIN showed better results than widely used algorithms in terms of AUC (area under the ROC curve), accuracy, specificity and sensitivity. A web-based service is provided to users worldwide at paed.hku.hk/efin.
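One of the conservation features mentioned, entropy, can be sketched for a single column of an MSA block as follows; the block representation and the function name are assumptions for illustration only.

```python
from collections import Counter
from math import log2

def column_entropy(block, col):
    """Shannon entropy of the residue distribution in one column of an
    MSA block (a list of equal-length aligned sequences); gaps are ignored.
    Low entropy indicates a conserved position."""
    counts = Counter(seq[col] for seq in block if seq[col] != "-")
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

block = ["MKV", "MKV", "MRV"]
print(column_entropy(block, 1))  # ~0.918; higher than column 0, which is fully conserved
```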
To solve the second problem, we developed a Linux-based software tool, SPLindel, that detects deletions (especially long deletions) and insertions using second-generation sequencing data. For each sample, SPLindel uses a split-read method to detect candidate INDELs by building alternative references alongside the reference sequences. It then remaps all relevant reads against both the original reference and the alternative allele references. A Bayesian model integrating paired-end information is used to assign the reads to the most likely locations on either the original reference allele or the alternative allele. Finally, we count the number of reads that support the alternative allele (with insertions or deletions relative to the original reference allele) and the original allele, and fit a beta-binomial mixture model. Based on this model, the likelihood for each INDEL is calculated and the genotype is predicted. SPLindel runs at about the same speed as GATK, and much faster than DINDEL. SPLindel obtained very similar results to GATK and DINDEL for INDELs of size 1-15 bp, but is much more effective in detecting larger INDELs.
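As a rough illustration of the kind of likelihood involved, the following sketch evaluates a beta-binomial log-probability for an alternative-allele read count; the parameterization and the two example components are hypothetical and not SPLindel's exact model.

```python
from math import lgamma, exp, comb, log

def log_betabinom(k: int, n: int, a: float, b: float) -> float:
    """Log probability of observing k alternative-allele reads out of n
    under a beta-binomial with shape parameters a and b."""
    log_beta = lambda x, y: lgamma(x) + lgamma(y) - lgamma(x + y)
    return log(comb(n, k)) + log_beta(k + a, n - k + b) - log_beta(a, b)

# E.g. 12 of 30 reads support the alternative allele; compare a heterozygous-like
# component (centred near 0.5) with a noise component (centred near 0).
print(exp(log_betabinom(12, 30, 20, 20)))  # heterozygous-like component
print(exp(log_betabinom(12, 30, 1, 50)))   # error/noise component
```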
Using machine learning methods and statistical modeling, we propose these tools to solve two important problems in sequencing data analysis. This study will help identify novel damaging nsSNPs more accurately and…
Advisors/Committee Members: Lau, YL, Yang, W.

Research on gene-prediction algorithms is a key area of bioinformatics. After a brief introduction to bioinformatics, we present a new and improved gene-prediction algorithm, together with a model that gives an analytical explanation of its meaning. DNA sequences are formed of patches or domains of different nucleotide composition, and the Jensen-Shannon divergence method is often employed to find the boundaries of these compositional domains. By introducing one new parameter α into the Jensen-Shannon divergence, we find numerically the optimal value that gives the best accuracy of border finding. We explain this result mathematically and give an exact expression for this parameter. We then apply this improved Jensen-Shannon divergence to artificial sequences and to some real DNA sequences. The results demonstrate that the parameter is useful for segmentation of genomic sequences into compositionally homogeneous segments.
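For reference, the standard (unmodified) Jensen-Shannon segmentation criterion can be sketched as below: a cut point is chosen to maximize the divergence between the nucleotide compositions of the left and right segments. The thesis's additional parameter α is not reproduced here, since its exact role is specific to that work.

```python
from collections import Counter
from math import log2

def entropy(seq: str) -> float:
    """Shannon entropy of the nucleotide composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def best_cut(seq: str) -> int:
    """Cut point l maximizing the Jensen-Shannon divergence
    D(l) = H(S) - (l/n) H(S[:l]) - ((n-l)/n) H(S[l:])."""
    n = len(seq)
    h_all = entropy(seq)
    scores = {
        l: h_all - (l / n) * entropy(seq[:l]) - ((n - l) / n) * entropy(seq[l:])
        for l in range(1, n)
    }
    return max(scores, key=scores.get)

print(best_cut("AAAAAAAAGCGCGCGC"))  # 8: the boundary between the two domains
```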

AluScan is a pre-sequencing capture method, using inter-Alu PCR in conjunction with next-generation sequencing (NGS), that reduces both cost and the amount of cancerous DNA sample required. As an efficient solution to challenges in current cancer genome studies, AluScan generates an Alu-anchored scan of the human genome to obtain inter-Alu sequences enriched in genic regions. With the established pipeline and the programs developed, AluScan sequences from cancer genomes were employed in analyses of genomic alterations, including single nucleotide variation (SNV), loss of heterozygosity (LOH) and copy number variation (CNV), to reveal the potential underlying pathways and hence improve the treatment of the tumors. In this thesis, two programs developed for processing AluScan sequencing data are introduced: SAMSVM and AluScanCNV. SAMSVM was developed as a tool for misalignment detection and filtration of SAM-format alignments from the Illumina platform using a support vector machine (SVM). Employing the LIBSVM packages, SAMSVM detected misalignments with accuracies ranging from 0.89 to 0.97 and F-scores ranging from 0.77 to 0.94 in benchmarks on simulated data. It also increased the mapping rate and the on-target rate of SNP calling on real data. AluScanCNV was developed as a tool for CNV calling on AluScan data. Employing the Geary-Hinkley transformation (GHT) and circular binary segmentation (CBS), AluScanCNV performs localized and extended CNV calling in practical use. The CNV calling results for liver cancers resembled the results obtained from a whole-genome sequencing (WGS) study. A validation test on an existing dataset also showed high correlation (R = 0.935 for CNV loss calling and R = 0.776 for CNV gain calling) with another CNV calling tool, FREEC. In this thesis, AluScan sequencing data from ten hepatitis B virus (HBV)-positive and five non-viral hepatocellular carcinomas (HCCs) were subjected to comprehensive analysis to reveal genomic differences between viral and non-viral HCCs. In general, non-viral HCCs displayed far fewer SNVs than HBV-positive HCCs, whereas the two types of HCCs showed similar patterns of LOH preferences and contained similar levels of CNVs in comparable genic locations. Mutational signature analysis showed that the two types of HCCs displayed specific signatures of base substitution, suggesting that virus infection could result in specific SNVs. Signature V1 was enriched in C>T mutations at NpC̲pG sites, suggesting that deamination of 5-methylcytosine could be associated with virus infection; the mutually reversible T>A mutations at ApT̲pC and GpT̲pT were found in non-viral HCCs, but only T>A mutations at GpT̲pT were observed in HBV-positive HCCs. In addition, hierarchical clustering on samples and selected functional events (SFEs) showed that non-viral HCCs belonged to the C-class while HBV-positive HCCs belonged to a mixed M-C class in terms of their dominant mutations. Lastly, the cancer genes that were found to contain mutations were shown to…

Computational Molecular Biology, or Bioinformatics, is an emerging area for Electronic and Computer Engineering. Bioinformatics research results are expected to have a great impact on Biology and on medical research, leading to new medicines or treatments for several diseases. The Bioinformatics area consists of several algorithms and datasets, leading to computationally challenging problems. Datasets have grown exponentially in size over the last few years, and the trend continues. The algorithms have several variations, depending on the size and the nature of the datasets, and several algorithms are usually combined to solve bioinformatics problems. The BLAST algorithm is considered the most widely used one in the Bioinformatics community and appears in many Bioinformatics problems, e.g. to find similarity between fragments of genetic data (the query) and an organism (the database), even if there are mutations or data that are not properly decoded (a non-exact-match algorithm). Reconfigurable logic has been used in numerous problems to accelerate the execution time of many applications; FPGAs have previously been used to map exact-matching algorithms or less sophisticated Bioinformatics algorithms rather than BLAST. This dissertation presents a system based on reconfigurable logic that implements the BLAST algorithm, regardless of data size or algorithm variation. The BLAST algorithm has been studied in depth, and the corresponding architectures have been designed and evolved over four different generations. The architectures are original and unique in offering a completely general solution for all BLAST variations. All architectures have been thoroughly simulated post place-and-route, and the results have been confirmed against the results of the most broadly accepted version of the software (NCBI BLAST). In addition, a laboratory prototype system has been built on an off-the-shelf platform, and all major technical implementation problems have been solved, including I/O issues. The TUC BLAST system, which is presented in this work, is one to three orders of magnitude faster than a general-purpose computer running the BLAST algorithm.

Genomes of several eukaryotic model organisms have now been finished, among them yeast (Saccharomyces cerevisiae), worm (Caenorhabditis elegans), fly (Drosophila melanogaster), mustard weed (Arabidopsis thaliana), mouse (Mus musculus) and human (Homo sapiens). Although the sequencing error in finished genomes is estimated to be less than one in 10,000 nucleotides, this does not account for errors that arise from misassembly or from cloning of the sequence. We propose that these assembly errors may be roughly estimated by computing the amount of duplicated sequence found within genomes, and a preliminary estimate of this error is as much as 1% of the genome, i.e., two orders of magnitude larger than the sequencing error. We describe the computation of the duplicated sequence found in three of the finished model genomes - yeast, mustard weed and mouse - and present a selection of the many duplications found which we suspect to be assembly or cloning errors, some of them longer than 10,000 nucleotides. We verify experimentally that some very large duplications in the model worm genome are indeed assembly errors caused by the presence of chimeric clones, that is, contiguous pieces of DNA used for sequencing that in fact derive from two separate locations in the worm genome. We also verify experimentally that there is an error caused by the insertion of a piece of foreign DNA about 300 nucleotides long. The ultimate goal of our research is to algorithmically correct assembly and cloning errors in finished genomes to substantially improve their accuracy.

In genome-wide association studies (GWAS), there are single-nucleotide polymorphism (SNP) pairs that have significant associations with diseases via the combination of their main effects and interactions. This effect is referred to as association allowing for interactions [1]. A fast method has been proposed [2]; it is based on a likelihood ratio test under the assumption that the statistic follows a chi-square distribution. Many SNP pairs with significant associations allowing for interactions have been detected using this method. However, the chi-square test requires the expected value in each cell of the contingency table to be at least 5, and this assumption is violated for some of the identified SNP pairs. In such cases, a likelihood ratio test may no longer be applicable. A permutation test is an ideal approach to double-checking the p-values calculated by a likelihood ratio test because of its nonparametric nature. The p-values of SNP pairs with significant associations with disease are always extremely small, so permutation testing in genome-wide association studies is computationally demanding: a huge number of permutations is needed to achieve a correspondingly high resolution for the p-value. In order to investigate whether the p-values from likelihood ratio tests are reliable, a fast permutation tool able to perform a large number of permutations is desirable. In this thesis, we first present a fast permutation tool based on graphics processing units (GPUs) with highly reliable p-value estimation. We designed a memory layout schema dedicated to concurrent permutation, and used the properties of the different memories in GPUs to optimize the efficiency of the tool. We also propose an algorithm to test multiple SNP pairs in each iteration of permutation, which greatly improves the efficiency of testing the identified SNP pairs. Our tool completed 10^7 permutations for a single SNP pair from the Wellcome Trust Case Control Consortium (WTCCC, [3]) genome data within 1 minute on a single Nvidia Tesla M2090 device, while the same task took 60 minutes on a single Intel Xeon ES-2650 CPU. More importantly, when simultaneously testing 256 SNP pairs for 10^7 permutations, our tool took only 5 minutes, while the CPU program took 10 hours. Secondly, we used this tool to run permutation tests on simulated datasets to determine the eligibility conditions for likelihood ratio tests. We found that the p-values from likelihood ratio tests have a relative error of more than 100% when more than 8 cells in the contingency table have an expected count of less than 5, or when any cell of the contingency table has an expected count of zero. Finally, we permuted the WTCCC datasets. By permuting on a GPU cluster of 40 nodes, we completed 10^12 permutations for all 280 SNP pairs reported with p-values smaller than 10^-12 in the WTCCC datasets within 1 week. We found two pairs whose permutation test p-values were larger than the significance threshold.…
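A plain CPU sketch of the permutation test being accelerated: the case/control labels are shuffled repeatedly, a chi-square-style statistic over the 9 x 2 genotype-combination table for one SNP pair is recomputed each time, and the empirical p-value is the fraction of permuted statistics at least as large as the observed one. The statistic and the names here are simplified illustrations, not a reimplementation of the published method.

```python
import random
from collections import Counter

def chi_square(genotypes, labels):
    """Pearson chi-square statistic for the table of SNP-pair genotype
    combinations (coded 0..8) versus case/control labels (0/1)."""
    n = len(labels)
    cell = Counter(zip(genotypes, labels))
    row, col = Counter(genotypes), Counter(labels)
    stat = 0.0
    for g in row:
        for y in col:
            expected = row[g] * col[y] / n
            stat += (cell[(g, y)] - expected) ** 2 / expected
    return stat

def permutation_p(genotypes, labels, n_perm=10000, seed=0):
    """Empirical p-value from n_perm random relabelings of case/control status."""
    rng = random.Random(seed)
    observed = chi_square(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        hits += chi_square(genotypes, shuffled) >= observed
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids reporting p = 0
```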

Foodborne bacterial pathogens like the Salmonella genera remain of interest to regulatory agencies like the FDA and CDC. As a foodborne pathogen capable of causing serious illness in both human and non-human animals, Salmonella spp. have been listed by the CDC as potential bioterrorism agents. From a forensic perspective, accurate and rapid identification of Salmonella subspecies is essential for successful investigation of foodborne outbreaks or suspected biocrimes. Massively parallel sequencing (MPS) provides investigators with a streamlined, cost-effective method to rapidly sequence the whole bacterial genome. To study the genetic variation of naturally occurring Salmonella spp., environmental samples were collected from areas around freshwater lakes, rivers and ponds in the Piedmont and mountains of western North Carolina. Nineteen Salmonella isolates were sequenced using the Illumina MiSeq, producing high-quality sequence data that were submitted to NCBI in an effort to build a comprehensive database containing whole genome sequences of bacterial pathogens. Distance-based phylogenetic trees were created using the sequence information. This method was shown to be susceptible to the quality of the given sequence data. kSNP, a SNP analysis program for creating phylogenetic trees, was shown to produce trees of similar quality without the influence of sequence quality seen in distance-based trees. Ultimately, the databases generated from MPS data can serve as a repository of phylogenetic information and population data to most effectively answer questions germane to bacterial forensics, such as identifying the source of a foodborne outbreak.
Keywords: Bacterial forensics, GenomeTrackr, Illumina, Massively parallel sequencing, NCBI, Salmonella Enterica
Advisors/Committee Members: Mark Wilson (advisor).

In recent years, the demand for DNA sequencing analysis has grown with the advance of DNA sequencing technologies, exceeding the capacity of high-end computer servers. This thesis presents integrated software solutions for popular DNA sequencing analyses, along with implementations and experiments on real data that demonstrate the strengths of the solutions over conventional ones.
The first software tool presented is BALSA, which integrates the DNA paired-end short-read aligner SOAP3-dp with a newly designed secondary analysis. BALSA finishes a 30x Whole-Genome Analysis (WGA) within 6 hours, whereas the well-known BWA+GATK pipeline takes about 20 hours for the same analysis. BALSA's efficiency is rooted in its fast alignment algorithm and an integrated design that significantly reduces the time spent on file IO. More importantly, experiments show that the variant calling accuracy and sensitivity of BALSA are competitive with existing solutions.
The second tool presented is
BALSA-Amplicon, which is designed for amplicon sequencing analysis.
Unlike WGA, amplicon sequencing data come along with artificial
primers, which will contaminate the analysis. A common fix is to
trim the reads at the beginning, but this also removes useful data
that helps to map the read correctly.
BALSA-Amplicon instead takes advantage of aligning with the primer and only trims it when updating the in-memory alignment information data structure. The sequencing depth of amplicon data can also be several thousand times that of WGA data, so the data structure has been modified to support such high sequencing depth without degrading performance. Experiments show that BALSA-Amplicon takes 20 minutes to call variants from 3 million 275bp amplicon read pairs.
Thirdly, we introduce a short-read aligner, SOAP4, targeted at aligning read pairs with read length greater than or equal to 150bp (the current standard of high-throughput sequencers like the HiSeq 10X). Unlike 100bp reads, such reads generally carry more than 2 mismatches, and SOAP3-dp cannot align them quickly using the BWT index alone. SOAP4 therefore adopts the seed-and-extend strategy. Experiments with real 250bp data show that SOAP4 is 8% faster than SOAP3-dp, with a sensitivity of 95.83% compared to 85.32% for SOAP3-dp. Experiments on simulated data show that SOAP4 gives accuracy competitive with SOAP3-dp.
Lastly, we introduce ELSA, a CPU version of BALSA. ELSA aims to compensate for the absence of a GPU card (which typically contains hundreds to a few thousand cores) by using the multi-core CPUs of a single computing node (typically 2x12 cores). Although the GPU is a popular tool in high-performance computing, it is costly and requires special maintenance, especially for its evolving software environment. ELSA is intended to be a cost-effective solution for secondary analysis.

Ribonucleic Acid (RNA) is an important cellular macromolecule vital to most if not all life on Earth. RNA has many different roles in the cell, most notably as the intermediary molecule that transfers genetic information from DNA to protein in translation. Recently, additional functions of RNA have been elucidated more clearly, such as catalyzing chemical reactions and regulating gene expression. These exciting new findings have shined a scientific spotlight on the field of RNA structure in order to better understand how the once mundane polynucleotide acts in such myriad ways.
An important factor in RNA’s versatile nature is the inherent variation in its chemical structure. The hydroxyl group present on the ribose sugar of a ribonucleic acid makes the corresponding polynucleotide capable of chemical reaction, with itself or with other molecules in the cell. This hyper-reactivity allows RNA to form substantially unique structures, from the hammerhead ribozyme's helical shape from which it takes its name, to the L-shaped conformation common to all transfer RNAs. The problem at hand is thus to study RNA structure and determine if any new patterns can be discovered.
The work presented here centered on a collaborative effort to define a set of conformations common to two-nucleotide long sequences of RNA found in structures from the Protein Data Bank (PDB). This work contributed by clustering RNA di-nucleotides by their torsion angle space using a Fast Fourier averaging technique proven to be effective in clustering nucleotide structure. Each group in the collaboration used different methodologies to analyze the same RNA structural data, and yet found similar results. The collaboration ultimately produced a set of 46 consensus conformations defined by the seven dihedral angles of the sugar-to-sugar unit in a di-nucleotide RNA sequence.
To utilize this new set of RNA di-nucleotide conformations, a software tool was designed and developed to automatically assign the conformation nomenclature to input RNA structure. The program was successfully tested on the pilot study data. A test study was performed on a unique set of RNA structures. The results of this study demonstrated that the consensus conformation set can in fact be used to classify RNA structure.
Advisors/Committee Members: Micallef, David Ian, 1979- (author), Berman, Helen (chair), Levy, Ronald (internal member), Olson, Wilma (internal member).