The surge in short-read DNA sequence data generated by next-generation sequencing techniques such as restriction-site associated DNA sequencing (RAD-seq) has led to a need for computationally efficient methods to process these data. Assembling the sequences into loci is a crucial and often challenging step in analyzing these genomic datasets. This thesis addresses two important challenges in analyzing RAD-seq data. The first is distinguishing paralogous sequence variants (PSVs) from true single-nucleotide polymorphisms (SNPs) associated with orthologous loci. The second is that, given the large number of short-read DNA sequences often examined per individual, identifying the optimal parameter settings for de novo assembly can be highly challenging. The proposed enhancements focus on effectively identifying paralogs, accelerating de novo locus formation, and allowing the user to perform a parameter search to identify the optimal parameter settings.

The first enhancement identifies paralogs using a network of short reads connected on the basis of sequence similarity. Applied to de novo RAD-seq data from 150 Atlantic salmon (Salmo salar) samples collected from 15 locations along the southern Newfoundland coast, our method identified 70% of the PSVs detected through alignment to the Atlantic salmon genome.

The second enhancement is a graph data structure that captures the details of each unique DNA sequence across all individuals in the dataset. The proposed method exploits the fact that a given DNA sequence can be present in most of the samples in the dataset, and so eliminates redundant calculation of sequence string mismatches. Using the graph structure, locus formation in an individual can be seen as identifying the connected graph nodes that are present in that individual and satisfy the parameter settings. A run-time comparison against Stacks, the most widely used tool for short-read processing, on green crab (Carcinus maenas) samples shows that the proposed method is at least 4 times faster.
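
The idea of locus formation as a search for connected nodes can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (build_graph, loci_for_individual), the use of Hamming distance as the mismatch measure, the single max_mismatches parameter, and the toy reads are all assumptions made for illustration. Mismatches between unique sequences are computed once for the whole dataset, and each individual's loci are then read off as connected components restricted to the sequences that individual carries.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_graph(reads_by_individual, max_mismatches):
    """Build one shared graph over the unique sequences of all individuals.

    presence maps each unique sequence to the set of individuals carrying it.
    edges links sequences that differ by at most max_mismatches positions.
    Each pair of unique sequences is compared once, however many individuals
    share it. (All-pairs comparison is quadratic; it is used here only to
    keep the sketch short.)
    """
    presence = defaultdict(set)
    for individual, reads in reads_by_individual.items():
        for seq in reads:
            presence[seq].add(individual)

    edges = defaultdict(set)
    for a, b in combinations(sorted(presence), 2):
        if len(a) == len(b) and hamming(a, b) <= max_mismatches:
            edges[a].add(b)
            edges[b].add(a)
    return presence, edges


def loci_for_individual(individual, presence, edges):
    """Return candidate loci: connected components among this individual's nodes."""
    nodes = {seq for seq, inds in presence.items() if individual in inds}
    seen = set()
    loci = []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            current = stack.pop()
            component.add(current)
            for neighbour in edges.get(current, ()):
                if neighbour in nodes and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        loci.append(component)
    return loci


# Hypothetical toy data: two individuals sharing some sequences.
reads = {
    "ind1": ["ACGTAC", "ACGTAT", "GGGTTT"],
    "ind2": ["ACGTAC", "GGGTTT", "GGGTTA"],
}
presence, edges = build_graph(reads, max_mismatches=1)
for name in sorted(reads):
    print(name, loci_for_individual(name, presence, edges))
```

In this sketch the mismatch between ACGTAC and GGGTTT is evaluated once even though both individuals carry both sequences, which is the kind of redundancy the shared graph is meant to remove.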

The graph data structure captures information from the raw RAD-seq data at every level, from individual alleles to whole populations. The thesis examines possible extensions of the graph data structure to other population genetic analyses, such as population structure inference, outlier loci identification, and exploration of allelic abundance.