Algorithms for Biological Sequence Processing

Biological sequence data is commonly treated as strings over finite alphabets, and fundamental questions about searching and extracting information from this string data arise; that abstraction and computational problem is particularly prevalent with regard to DNA and RNA molecules. I am interested in studying sequence problems in theoretical computer science that abstractly model biological processes or analysis. Here is a sample of the type of problems and approaches that I encounter in this work:

Outlier detection in sequencing data. Contamination of the DNA sample and erroneous runs of the sequencing platforms are frequent occurrences that lead to many reads having a large fraction of errors and hence, deviate quite dramatically from the rest of the data. Ideally, these “outlier” strings should be detected and removed from the input prior to assembly. Although problem formulations with outliers have been previously proposed and studied in different contexts – including machine learning, network design problem, and bioinformatics—it has not been considered or proposed in error correction. Present error correction methods in the absence of outliers are required to be highly liberal in the elimination of data in order to be effective and thus, have the effect of removing large number of reads that could have been used for assembly.

In collaboration with Christine Lo and Dr. Daniel Lokshtanov, we present a novel reformulation of the problem of error correction of sequencing data that allows for outliers. Studying the complexity of this problem leads to surprising theoretical insights. Our formulation lends itself to be the first problem that admits a polynomial time approximation scheme (PTAS), is fixed parameter tractable when parameterized by the value of the objective function, but which provably does not admit an efficient polynomial time approximation scheme (EPTAS) under plausible complexity assumptions. To accommodate for this, our hardness of approximation proof needs to combine parameterized reductions and gap preserving reductions in a novel manner.

The closest string problem. The closest string problem is an important sub-problem in motif recognition; therefore it is imperative that practical solutions for it are developed. Independently, both Gramm et al. (Algorithmica, 2003) and Sze et al. (WABI 2004) give an efficient algorithm for solving the restricted version of the consensus string problem when the input contains at most three strings. The former authors state that extending these results to finding an efficient polynomial-time algorithm for solving consensus string on a set of four strings remains open due to the enormous combinatorial complexity [of the ILP-based solution]”. We resolve this open problem for binary strings by giving an exact, linear-time algorithm for four strings. Our article has generated considerable interest in the closest string problem. For example, Amir et al. (SPIRE 2009) extend our results to a variant of the closest string problem where the distance metric is both the Hamming distance and edit distance.