Bottom Line:
Their use resulted in execution times an order of magnitude shorter than those of the single-threaded CPU implementation. The GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi 5110P coprocessor, but also requires considerably more programming effort. On the other hand, the new MIC architecture, albeit lacking in performance, reduces the programming effort and makes up for it with a more general architecture suitable for a wider range of problems.

Background: The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphics Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, the MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested.

Results: We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their use resulted in execution times an order of magnitude shorter than those of the single-threaded CPU implementation. The GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi 5110P coprocessor, but also requires considerably more programming effort.

Conclusions: General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but they can prove demanding for the programmer. On the other hand, the new MIC architecture, albeit lacking in performance, reduces the programming effort and makes up for it with a more general architecture suitable for a wider range of problems.

Figure 3: CUDA code snippet. Variables threads and blocks store the thread configuration. Function cudaMemcpy feeds the data into the GPU and retrieves the results afterwards. Each of the preconfigured GPU threads independently executes the computeIGain function and scores the associated SNP pair.

Mentions:
Regardless of the development tool used, the programmer must follow certain rules to obtain maximum performance [17]. The most important one is to partition the algorithm into blocks small enough to simultaneously start a sufficient number of threads to utilize all available resources. For example, consider the code snippet in Figure 3, a simplified version of the code that scores pairs of SNPs. The function computeIGain calculates the information gain of a SNP pair using Equation 1. The details of the calculation are omitted to emphasize the architecture-specific parts of the code. The snippet illustrates the peculiarities of programming for GPUs. The program has to implement the GPU-specific part separately from the CPU code and explicitly transfer data from the host to the GPU. Special functions called kernels (marked with the keyword __global__) must be written to be executed on the GPU. Memory transfer and allocation functions must be called to supply the necessary data to the GPU and collect the results afterwards. Usually, the programmer performs measurements to determine which thread configuration is most suitable for a particular problem size and the appropriate number of threads to launch.
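The pattern described above can be sketched as a minimal CUDA host-plus-kernel skeleton. This is a hypothetical reconstruction in the spirit of Figure 3, not the actual SNPsyn source: the data layout (`snpData`), buffer sizes, and the body of computeIGain are illustrative placeholders, and the real information-gain computation of Equation 1 is elided as a comment.

```cuda
// Hypothetical sketch of the Figure 3 pattern; names and data layout
// are illustrative, not the actual SNPsyn implementation.
__global__ void computeIGain(const char *snpData, float *scores,
                             int numSNPs, int numSamples, int numPairs) {
    // Each preconfigured GPU thread independently scores one SNP pair,
    // selected by its global thread index.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numPairs) return;
    // ... map idx to a pair (i, j), accumulate the joint genotype
    // counts over all samples, and evaluate the information gain of
    // the pair (Equation 1) from those counts ...
    scores[idx] = 0.0f;  // placeholder for the computed information gain
}

void scorePairs(const char *hostSnpData, float *hostScores,
                int numSNPs, int numSamples) {
    int numPairs = numSNPs * (numSNPs - 1) / 2;
    size_t dataBytes  = (size_t)numSNPs * numSamples * sizeof(char);
    size_t scoreBytes = (size_t)numPairs * sizeof(float);

    // Allocate device memory and feed the data into the GPU.
    char *dSnpData;
    float *dScores;
    cudaMalloc(&dSnpData, dataBytes);
    cudaMalloc(&dScores, scoreBytes);
    cudaMemcpy(dSnpData, hostSnpData, dataBytes, cudaMemcpyHostToDevice);

    // Thread configuration: enough blocks of `threads` threads to cover
    // all SNP pairs. The best block size is typically found by
    // measurement for a given problem size.
    int threads = 256;
    int blocks = (numPairs + threads - 1) / threads;
    computeIGain<<<blocks, threads>>>(dSnpData, dScores,
                                      numSNPs, numSamples, numPairs);

    // Retrieve the results afterwards.
    cudaMemcpy(hostScores, dScores, scoreBytes, cudaMemcpyDeviceToHost);
    cudaFree(dSnpData);
    cudaFree(dScores);
}
```

The separation is the point: the kernel is device-only code, the host function owns allocation, transfer, and launch configuration, and nothing is shared implicitly between the two sides.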
