Mining Complex Gene Expression Across the Tree of Life

Modern high-throughput DNA sequencing technology continues to revolutionize life science research. However, the tens to hundreds of millions of DNA sequence records in tens of thousands of datasets aggregate into petabytes of data. HPC/HTC systems like the Open Science Grid are required to process all this data into useful data structures. OSG-GEM is a Pegasus workflow that processes DNA sequencing text files to produce a Gene Expression Matrix (GEM), which contains quantified gene expression values across tens to thousands of samples. Because of storage and memory constraints on available compute nodes, the workflow splits raw input files into small pieces, processes them in parallel, and merges the intermediate output files. Demonstrating the portability of Pegasus workflows, OSG-GEM is configured to run on both the Open Science Grid and Jetstream. The workflow includes a configuration file that lets the user specify input dataset locations, select software options (TopHat2, HISAT2, STAR, etc.), and customize hardware requests. The output files are formatted for input into downstream biological analysis tools, such as those for differential gene expression analysis and gene coexpression network construction. In addition, the workflow produces a statistical report that users can review to check the quality of their data.
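
The split, process, and merge pattern described above can be illustrated with a minimal Python sketch. This is not the OSG-GEM implementation: the function names (`split_fastq`, `count_chunk`, `merge_counts`, `build_gem`) and the `assign_gene` callback are hypothetical, the input is assumed to be FASTQ text with four lines per record, and plain read counts stand in for the alignment and quantification that tools such as HISAT2 or STAR would perform. Chunks are processed sequentially here; in a real workflow each chunk would be an independent job.

```python
from collections import Counter
from itertools import islice


def split_fastq(lines, records_per_chunk):
    """Yield lists of FASTQ lines, each holding at most
    records_per_chunk records (assumes 4 lines per record)."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, 4 * records_per_chunk))
        if not chunk:
            return
        yield chunk


def count_chunk(chunk, assign_gene):
    """Hypothetical stand-in for alignment plus quantification:
    count the reads in one chunk per gene."""
    counts = Counter()
    for seq in chunk[1::4]:  # the sequence is the 2nd line of each record
        counts[assign_gene(seq)] += 1
    return counts


def merge_counts(partials):
    """Sum the per-chunk counts into one per-sample result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def build_gem(per_sample):
    """Arrange per-sample counts as a matrix: rows are genes,
    columns are samples, missing genes are filled with 0."""
    genes = sorted({g for counts in per_sample.values() for g in counts})
    samples = sorted(per_sample)
    matrix = [[per_sample[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, matrix
```

Splitting on record boundaries is what lets each piece be processed without knowledge of the others, and because the per-chunk results in this sketch are additive, the merge step is a simple sum.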

The Pegasus project is supported by the National Science Foundation under the OAC SI2-SSI program, grant #1664162. Pegasus also receives support from the Department of Energy, the National Institutes of Health, the Defense Advanced Research Projects Agency, and the USC Information Sciences Institute.