
Physicist, Startup Founder, Blogger, Dad

Monday, December 02, 2013

PLINK 1.90 alpha

WDIST is now PLINK 1.9 alpha. WDIST (= "weighted distance" calculator) was originally written to compute pairwise genomic distances. The mighty Chris Chang then amazingly re-implemented all of PLINK with significant improvements (see below).

PLINK 1.9 even has support for the LASSO (i.e., L1-penalized regression, a particular method used in Compressed Sensing).

This is a comprehensive update to Shaun Purcell's popular PLINK command-line program, developed by Christopher Chang with support from the NIH-NIDDK's Laboratory of Biological Modeling and others. (What's new?) (Credits.)

It isn't finished yet (hence the 'alpha' designation), but it's getting there. We are working with Dr. Purcell to launch a large-scale beta test in the near future. ...

Unprecedented speed

Thanks to heavy use of bitwise operators, sequential memory access patterns, multithreading, and higher-level algorithmic improvements, PLINK 1.9 is much, much faster than PLINK 1.07 and other popular software. Several of the most demanding jobs, including identity-by-state matrix computation, distance-based clustering, LD-based pruning, and association analysis max(T) permutation tests, now complete hundreds or even thousands of times as quickly, and even the most trivial operations tend to be 5-10x faster due to I/O improvements.
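To give a flavor of the bitwise approach, here is a minimal Python sketch (purely illustrative, not PLINK's actual code; the 2-bit encoding and function names are invented for the example): genotypes are packed two bits per site, and pairwise mismatch counts come from XOR plus popcount instead of a per-site loop.

```python
def pack_genotypes(genos):
    """Pack a list of genotypes (0, 1, or 2 minor alleles) into one
    big integer, 2 bits per site. Missing data is ignored here."""
    packed = 0
    for i, g in enumerate(genos):
        packed |= g << (2 * i)
    return packed

def ibs_mismatches(genos_a, genos_b):
    """Return (sites that differ, total allele differences) for two
    genotype vectors, using XOR and popcount on the packed encoding."""
    n = len(genos_a)
    lo_mask = int('01' * n, 2)          # low bit of every 2-bit field
    a, b = pack_genotypes(genos_a), pack_genotypes(genos_b)
    x = a ^ b
    lo = x & lo_mask                    # low bits of each XORed field
    hi = (x >> 1) & lo_mask             # high bits of each XORed field
    # Any set bit in a field means the genotypes differ at that site.
    diff_sites = bin(lo | hi).count('1')
    # Field XOR 01 or 11 -> genotypes differ by 1 allele; 10 -> by 2.
    allele_diff = bin(lo).count('1') + 2 * bin(hi & ~lo).count('1')
    return diff_sites, allele_diff
```

Real implementations work on 64-bit machine words, so one XOR-plus-popcount handles 32 sites at once; that, not Python, is where the speedup comes from.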

We hasten to add that the vast majority of ideas contributing to PLINK 1.9's performance were developed elsewhere; in several cases, we have simply ported little-known but outstanding implementations without significant further revision (even while possibly uglifying them beyond recognition; sorry about that, Roman...). See the credits page for a partial list of people to thank. On a related note, if you are aware of an implementation of a PLINK command which is substantially better than what we currently do, let us know; we'll be happy to switch to that algorithm and credit its authors in our documentation and papers.

Nearly unlimited scale

The main genomic data matrix no longer has to fit in RAM, so bleeding-edge datasets containing tens of thousands of individuals with exome- or whole-genome sequence calls at millions of sites can be processed on ordinary desktops (and this processing will usually complete in a reasonable amount of time). In addition, several key individual x individual and variant x variant matrix computations (including the GRM mentioned below) can be cleanly split across computing clusters (or serially handled in manageable chunks by a single computer).
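The chunking idea can be sketched in a few lines of Python (illustrative only; PLINK itself is written in C): compute a symmetric sample x sample product matrix in row blocks, so that each block fits in memory, or each block can be farmed out to a separate cluster node and the results concatenated.

```python
def grm_chunk(X, row_start, row_end):
    """Compute rows [row_start, row_end) of the sample x sample matrix
    X X^T / n_sites, for a standardized genotype matrix X given as a
    list of sample rows. Only this block needs to be held in memory."""
    n_sites = len(X[0])
    return [[sum(xi * xj for xi, xj in zip(X[i], X[j])) / n_sites
             for j in range(len(X))]
            for i in range(row_start, row_end)]

def grm_serial(X, chunk_rows=2):
    """Assemble the full matrix chunk by chunk. On a cluster, each
    call to grm_chunk would instead run on its own node."""
    out = []
    for start in range(0, len(X), chunk_rows):
        out.extend(grm_chunk(X, start, min(start + chunk_rows, len(X))))
    return out
```

Because the matrix is symmetric, a production version would compute only the lower triangle of each block; the sketch above recomputes both halves for clarity.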

Command-line interface improvements

We've standardized how the command-line parser works, migrated from the original 'everything is a flag' design toward a more organized flags + modifiers approach (while retaining backwards compatibility), and added a thorough command-line help facility.

Additional functions

In 2009, GCTA didn't exist. Today, there is an important and growing ecosystem of tools supporting the use of genetic relationship matrices in mixed model association analysis and other calculations; our contributions are a fast, multithreaded, memory-efficient --make-grm-gz/--make-grm-bin implementation which runs on OS X and Windows as well as Linux, and a closer-to-optimal --rel-cutoff pruner.
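For intuition, here is a toy relatedness-cutoff pruner in Python (hypothetical, not PLINK's algorithm, which does better than this greedy baseline): repeatedly drop the individual involved in the most above-threshold relationships until no remaining pair exceeds the cutoff.

```python
def rel_cutoff(grm, cutoff):
    """grm: square relatedness matrix (list of lists).
    Returns the sorted indices of the individuals kept, pruned so that
    no remaining pair has relatedness above `cutoff`."""
    keep = set(range(len(grm)))
    while True:
        # Count above-cutoff partners for each remaining individual.
        counts = {i: sum(1 for j in keep if j != i and grm[i][j] > cutoff)
                  for i in keep}
        worst = max(counts, key=lambda i: counts[i])
        if counts[worst] == 0:
            return sorted(keep)
        keep.remove(worst)          # greedily drop the worst offender
```

Greedy removal by degree is the obvious heuristic; the underlying problem (keeping a maximum set with no over-threshold pair) is a maximum independent set problem, which is why a "closer-to-optimal" pruner is a meaningful improvement.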

There are other additions here and there, such as cluster-based filters which might make a few population geneticists' lives easier, and a coordinate-descent LASSO. New functions are not a top priority for now (reaching 95%+ backward compatibility, and supporting dosage/phased/triallelic data, are more important...), but we're willing to take time off from just working on the program core if you ask nicely.
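For readers curious what coordinate descent for the LASSO looks like, here is a bare-bones sketch (assumes standardized predictors and no intercept; PLINK's implementation is far more elaborate): cycle through the coefficients, applying a soft-thresholded univariate update to each in turn.

```python
def soft_threshold(z, gamma):
    """The LASSO shrinkage operator: shrink z toward 0 by gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate
    descent. Assumes each column of X has mean 0 and variance 1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)                    # y - X @ beta, kept up to date
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the partial residual
            # (residual with column j's own contribution added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_bj = soft_threshold(rho, lam)
            if new_bj != beta[j]:
                delta = new_bj - beta[j]
                for i in range(n):     # incrementally update residual
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta
```

The incremental residual update is the key trick: each coordinate step costs O(n) rather than recomputing X @ beta from scratch, which is what makes coordinate descent practical at genomic scale.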

"The mighty Chris Chang then amazingly re-implemented all of PLINK with significant improvements (see below)."

Instead of doing all of these complex studies with the aim of understanding the human genome well enough to maybe have practical effects at some possibly distant point in the future, why not just uh donate this stud's sperm to a sperm bank and have him father 533 children...

Interesting and engaging. Every individual human genome is a huge store of information, and we have more than a couple of billion individual genomes "in action". How much computing power would it take to solve this, I guess, two-points-in-n-dimensions problem for all the human pairs? Does the solution give a probability density function for the radius of an (n-1)-dimensional sphere (distance from a weighted centre of a manifold?) in genomic space? Can there be a projection from a multidimensional genomic space into a multidimensional real feature sphere? Was the problem widened from Fermat, then Weber, into an attraction/repulsion one, or is it just weights between two points in n-space determined by the corresponding genomes (preselected loci)? I mean, one can preselect loci for different features, having thus some points for one genome and the same number for the other, and some attraction/repulsion weights, or, maybe, I am talking rubbish.

Yen Shen, although dangerously close to trolling, might have a point. For one, I could do with some soothing words about the moral side of these results. They might be implemented in economics, engineering, and social engineering too. As for the latter, is the human population on Earth capable of compressed sensing? Before starting modelling and trials (actually, I feel like they already have), let's keep in mind: that's the only specimen we have. But again, did those who deliberately led to the Second and to the First WW think about it? Did Galileo and Copernicus? Were they brave because they knew so little, or were they brave because they had some faith against all odds? (I'd vote for the latter.)