If one estimates a user population of ~1000, each saving on the order of $1000 in CPU and work time per year, then over the next few years PLINK 1.9 and its successors will deliver millions of dollars in value to the scientific community.

Background
PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and more scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format.

Findings
To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1–4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0).
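The bit-level parallelism mentioned in the abstract can be illustrated with a small sketch. This is illustrative Python, not PLINK's actual C code; it assumes genotypes packed four to a byte as 2-bit codes (as in PLINK's .bed format) and counts fields equal to a hypothetical code of interest, 0b11, without unpacking:

```python
def count_code3(packed: bytes) -> int:
    """Count 2-bit genotype fields equal to 0b11 in a packed byte string."""
    total = 0
    for byte in packed:
        # A 2-bit field equals 0b11 iff both of its bits are set, so
        # AND each field's high bit (shifted down) with its low bit,
        # then mask to the low bit of each field and popcount.
        both = (byte >> 1) & byte & 0b01010101
        total += bin(both).count("1")
    return total

# Example: byte 0b11_00_11_01 holds the fields 01, 11, 00, 11 -> two matches
assert count_code3(bytes([0b11001101])) == 2
```

The same AND-and-popcount trick, applied 64 bits at a time with hardware popcount instructions, is what lets whole genotype vectors be tallied in a few machine operations per 32 samples.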

Conclusions
The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

Both polygenicity (many small genetic effects) and confounding biases, such as cryptic relatedness and population stratification, can yield an inflated distribution of test statistics in genome-wide association studies (GWAS). However, current methods cannot distinguish between inflation from a true polygenic signal and bias. We have developed an approach, LD Score regression, that quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium (LD). The LD Score regression intercept can be used to estimate a more powerful and accurate correction factor than genomic control. We find strong evidence that polygenicity accounts for the majority of the inflation in test statistics in many GWAS of large sample size.

The basic idea is straightforward, yet the technique yields good evidence for polygenicity.

Variants in LD with a causal variant show an elevation in test statistics in association analysis proportional to their LD (measured by r²) with the causal variant [1–3]. The more genetic variation an index variant tags, the higher the probability that this index variant will tag a causal variant. In contrast, inflation from cryptic relatedness within or between cohorts [4–6] or population stratification purely from genetic drift will not correlate with LD.
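The regression itself can be sketched in a few lines. This is a simplified illustration of the idea, not the authors' ldsc software (which uses weighted regression and block-jackknife standard errors); the simulated data below assume a purely polygenic signal, so the fitted intercept should sit near 1:

```python
import numpy as np

def ld_score_regression(chi2, ld_scores):
    """Ordinary least squares of chi-square statistics on LD Scores.

    Under the model E[chi2_j] = (N*h2/M) * l_j + N*a + 1, the slope
    reflects polygenic signal and the intercept reflects confounding bias.
    Returns (intercept, slope).
    """
    X = np.column_stack([np.ones_like(ld_scores), ld_scores])
    coef, *_ = np.linalg.lstsq(X, chi2, rcond=None)
    return coef[0], coef[1]

# Simulated example: pure polygenicity, no confounding
rng = np.random.default_rng(0)
l = rng.uniform(1, 200, size=5000)                 # per-SNP LD Scores
chi2 = 1.0 + 0.003 * l + rng.normal(0, 0.1, 5000)  # intercept 1, small slope
intercept, slope = ld_score_regression(chi2, l)
```

In real data, an intercept well above 1 would instead point to confounding such as population stratification or cryptic relatedness.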

...

Real data

Finally, we applied LD Score regression to summary statistics from GWAS representing more than 20 different phenotypes [15–32] (Table 1 and Supplementary Fig. 8a–w; metadata about the studies in the analysis are presented in Supplementary Table 8a,b). For all studies, the slope of the LD Score regression was significantly greater than zero and the LD Score regression intercept was substantially less than λGC (mean difference of 0.11), suggesting that polygenicity accounts for a majority of the increase in the mean χ² statistic and confirming that correcting test statistics by dividing by λGC is unnecessarily conservative. As an example, we show the LD Score regression for the most recent schizophrenia GWAS, restricted to ~70,000 European-ancestry individuals (Fig. 2) [32]. The low intercept of 1.07 indicates at most a small contribution of bias and that the mean χ² statistic of 1.613 results mostly from polygenicity.

Monday, February 23, 2015

The recent flourishing of deep neural nets is not primarily due to theoretical advances, but rather the appearance of GPUs and large training data sets.

Chronicle: ... Hinton has always bucked authority, so it might not be surprising that, in the early 1980s, he found a home as a postdoc in California, under the guidance of two psychologists, David E. Rumelhart and James L. McClelland, at the University of California at San Diego. "In California," Hinton says, "they had the view that there could be more than one idea that was interesting." Hinton, in turn, gave them a uniquely computational mind. "We thought Geoff was remarkably insightful," McClelland says. "He would say things that would open vast new worlds."

They held weekly meetings in a snug conference room, coffee percolating at the back, to find a way of training their error correction back through multiple layers. Francis Crick, who co-discovered DNA’s structure, heard about their work and insisted on attending, his tall frame dominating the room even as he sat on a low-slung couch. "I thought of him like the fish in The Cat in the Hat," McClelland says, lecturing them about whether their ideas were biologically plausible.

The group was too hung up on biology, Hinton said. So what if neurons couldn’t send signals backward? They couldn’t slavishly recreate the brain. This was a math problem, he said, what’s known as getting the gradient of a loss function. They realized that their neurons couldn’t be on-off switches. If you picture the calculus of the network like a desert landscape, their neurons were like drops off a sheer cliff; traffic went only one way. If they treated them like a more gentle mesa—a sigmoidal function—then the neurons would still mostly act as a threshold, but information could climb back up.

...

A decade ago, Hinton, LeCun, and Bengio conspired to bring them back. Neural nets had a particular advantage compared with their peers: While they could be trained to recognize new objects—supervised learning, as it’s called—they should also be able to identify patterns on their own, much like a child, if left alone, would figure out the difference between a sphere and a cube before its parent says, "This is a cube." If they could get unsupervised learning to work, the researchers thought, everyone would come back. By 2006, Hinton had a paper out on "deep belief networks," which could run many layers deep and learn rudimentary features on their own, improved by training only near the end. They started calling these artificial neural networks by a new name: "deep learning." The rebrand was on.

Before they won over the world, however, the world came back to them. That same year, a different type of computer chip, the graphics processing unit, became more powerful, and Hinton’s students found it to be perfect for the punishing demands of deep learning. Neural nets got 30 times faster overnight. Google and Facebook began to pile up hoards of data about their users, and it became easier to run programs across a huge web of computers. One of Hinton’s students interned at Google and imported Hinton’s speech recognition into its system. It was an instant success, outperforming voice-recognition algorithms that had been tweaked for decades. Google began moving all its Android phones over to Hinton’s software.

It was a stunning result. These neural nets were little different from what existed in the 1980s. This was simple supervised learning. It didn’t even require Hinton’s 2006 breakthrough. It just turned out that no other algorithm scaled up like these nets. "Retrospectively, it was just a question of the amount of data and the amount of computations," Hinton says. ...

Friday, February 20, 2015

I've been trying to get my kids interested in coding. I found this nice game called Lightbot, in which one writes simple programs that control the discrete movements of a bot. It's very intuitive and in just one morning my kids learned quite a bit about the idea of an algorithm and the notion of a subroutine or loop. Some of the problems (e.g., involving nested loops) are challenging.

Some interesting longitudinal results on female persistence through graduate school in STEM. Post-PhD there could still be a problem, but apparently this varies strongly by discipline. These results suggest that, overall, it is undergraduate representation that will determine the future gender ratio of the STEM professoriate.

For decades, research and public discourse about gender and science have often assumed that women are more likely than men to “leak” from the science pipeline at multiple points after entering college. We used retrospective longitudinal methods to investigate how accurately this “leaky pipeline” metaphor has described the bachelor’s to Ph.D. transition in science, technology, engineering, and mathematics (STEM) fields in the U.S. since the 1970s. Among STEM bachelor’s degree earners in the 1970s and 1980s, women were less likely than men to later earn a STEM Ph.D. However, this gender difference closed in the 1990s. Qualitatively similar trends were found across STEM disciplines. The leaky pipeline metaphor therefore partially explains historical gender differences in the U.S., but no longer describes current gender differences in the bachelor’s to Ph.D. transition in STEM. The results help constrain theories about women’s underrepresentation in STEM. Overall, these results point to the need to understand gender differences at the bachelor’s level and below to understand women’s representation in STEM at the Ph.D. level and above. Consistent with trends at the bachelor’s level, women’s representation at the Ph.D. level has been recently declining for the first time in over 40 years.

... However, as reviewed earlier, the post-Ph.D. academic pipeline leaks more women than men only in some STEM fields such as life science, but surprisingly not the more male-dominated fields of physical science and engineering (Ceci et al., 2014). ...

Conclusion: Overall, these results and supporting literature point to the need to understand gender differences at the bachelor’s level and below to understand women’s representation in STEM at the Ph.D. level and above. Women’s representation in computer science, engineering, and physical science (pSTEM) fields has been decreasing at the bachelor’s level during the past decade. Our analyses indicate that women’s representation at the Ph.D. level is starting to follow suit by declining for the first time in over 40 years (Figure 2). This recent decline may also cause women’s gains at the assistant professor level and beyond to also slow down or reverse in the next few years. Fortunately, however, pathways for entering STEM are considerably diverse at the bachelor’s level and below. For instance, our prior research indicates that undergraduates who join STEM from a non-STEM field can substantially help the U.S. meet needs for more well-trained STEM graduates (Miller et al., under review). Addressing gender differences at the bachelor’s level could have potent effects at the Ph.D. level, especially now that women and men are equally likely to later earn STEM Ph.D.’s after the bachelor’s.

Tuesday, February 17, 2015

This report using CBO (Congressional Budget Office) data claims that income inequality did not widen during the Great Recession (table above compares 2007 to 2011). After taxes and government transfer payments (entitlements, etc.) are taken into account, one finds that low-income groups were cushioned, while high earners saw significant declines in income.

... The CBO on the other hand defines income broadly as resources consumed by households, whether through cash payments or services rendered without payments [2]. Its definition of market income includes employer payments on behalf of workers (Social Security, Medicare, medical insurance, and retirement) and capital gains. On top of market income, the CBO next adds all public cash assistance and in-kind benefits from social insurance and government assistance programs to arrive at “before-tax income.” Finally, the CBO’s last step is to subtract all federal taxes, including personal income taxes, Social Security payments, excise taxes, and corporate income taxes, to arrive at “after-tax income,” or what other government series call disposable income [3]. ...

CONCLUSION: It is now widely held that inequality increased dramatically in the decades prior to 2007. For example, Piketty and Saez’s research shows that 91 percent of economic growth between 1979 and 2007 went to the wealthiest 10 percent. But under the CBO’s more comprehensive definition of income (including employer benefits, Social Security, Medicare, and other government benefits), only 47 percent of the growth in after-tax income went to the richest 10 percent [14].

Consequently, both methodologies reveal a real income inequality problem [15]. But this paper once again shows that the IRS data give a misleading impression of what has happened with income inequality (growing more slowly than reported in the period from 1979 to 2007, and decreasing, not increasing, in the years after 2007). While many on the left were unhappy with the first ITIF paper and my earlier work criticizing Piketty and Saez, it is less clear how they will react to this paper [16]. On the one hand, the paper argues that inequality doesn’t always rise and that it hasn’t since the onset of the Great Recession. On the other hand, it argues for the efficacy of robust income-support and growth policies and ultimately provides a refutation of a critique that Republicans have made of President Obama.

Almost no increase in the US Gini coefficient since 1979 once transfer payments are accounted for.

Is it possible that nameless government employees at the CBO have done a better job on this problem than the acclaimed economists Piketty and Saez? (What kind of serious statistical researcher uses Excel?!?)
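For concreteness, the Gini coefficient discussed above can be computed with the standard sorted-order formula (equivalent to half the relative mean absolute difference of incomes); the incomes below are made-up numbers, purely for illustration:

```python
def gini(incomes):
    """Gini coefficient via the sorted-order identity:
    G = 2 * sum_i(i * x_i) / (n * sum_i x_i) - (n + 1) / n,
    with x sorted ascending and i indexed from 1."""
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n

assert gini([1, 1, 1, 1]) == 0.0   # perfect equality
# One household holding everything gives G = (n - 1) / n = 0.75 for n = 4
```

Computing G on market income versus CBO-style after-tax-and-transfer income is exactly the comparison that drives the claim above.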

Saturday, February 14, 2015

Short summary: top academic departments produce a disproportionate fraction of all faculty. The paper below finds that only 9 to 14% of faculty are placed at institutions more prestigious than their doctorate ... the top 10 units produce 1.6 to 3.0 times more faculty than the second 10, and 2.3 to 5.6 times more than the third 10.

The faculty job market plays a fundamental role in shaping research priorities, educational outcomes, and career trajectories among scientists and institutions. However, a quantitative understanding of faculty hiring as a system is lacking. Using a simple technique to extract the institutional prestige ranking that best explains an observed faculty hiring network—who hires whose graduates as faculty—we present and analyze comprehensive placement data on nearly 19,000 regular faculty in three disparate disciplines. Across disciplines, we find that faculty hiring follows a common and steeply hierarchical structure that reflects profound social inequality. Furthermore, doctoral prestige alone better predicts ultimate placement than a U.S. News & World Report rank, women generally place worse than men, and increased institutional prestige leads to increased faculty production, better faculty placement, and a more influential position within the discipline. These results advance our ability to quantify the influence of prestige in academia and shed new light on the academic system.

From the article:

... Across the sampled disciplines, we find that faculty production (number of faculty placed) is highly skewed, with only 25% of institutions producing 71 to 86% of all tenure-track faculty ...

... Strong inequality holds even among the top faculty producers: the top 10 units produce 1.6 to 3.0 times more faculty than the second 10, and 2.3 to 5.6 times more than the third 10. [ Figures at bottom show top 60 ranked departments according to algorithm defined below ]

... Within faculty hiring networks, each vertex represents an institution, and each directed edge (u,v) represents a faculty member at v who received his or her doctorate from u. A prestige hierarchy is then a ranking π of vertices, where πu = 1 is the highest-ranked vertex. The hierarchy’s strength is given by ρ, the fraction of edges that point downward, that is, πu ≤ πv, maximized over all rankings (14). Equivalently, ρ is the rate at which faculty place no better in the hierarchy than their doctorate. When ρ = 1/2, faculty move up or down the hierarchy at equal rates, regardless of where they originate, whereas ρ = 1 indicates a perfect social hierarchy.

Both the inferred hierarchy π and its strength ρ are of interest. For large networks, there are typically many equally plausible rankings with the maximum ρ (15). To extract a consensus ranking, we sample optimal rankings by repeatedly choosing a random pair of vertices and swapping their ranks, if the resulting ρ is no smaller than for the current ranking. We then combine the sampled rankings with maximal ρ into a single prestige hierarchy by assigning each institution u a score equal to its average rank within the sampled set, and the order of these scores gives the consensus ranking (see the Supplementary Materials). The distribution of ranks within this set for some u provides a natural measure of rank uncertainty.
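The ρ statistic and the swap-sampling step described above can be sketched on a toy network. The three-institution example is invented for illustration; the paper's analysis runs on a network of nearly 19,000 faculty:

```python
import random

def rho(ranking, edges):
    """Fraction of edges pointing down the hierarchy, i.e. pi_u <= pi_v."""
    pi = {v: i for i, v in enumerate(ranking)}
    return sum(pi[u] <= pi[v] for u, v in edges) / len(edges)

def sample_rankings(nodes, edges, steps=2000, seed=0):
    """Swap sampling: accept a random pair swap whenever rho does not decrease."""
    rng = random.Random(seed)
    ranking = list(nodes)
    current = rho(ranking, edges)
    best = current
    for _ in range(steps):
        i, j = rng.sample(range(len(ranking)), 2)
        ranking[i], ranking[j] = ranking[j], ranking[i]
        r = rho(ranking, edges)
        if r >= current:
            current = r
            best = max(best, r)
        else:  # undo the rejected swap
            ranking[i], ranking[j] = ranking[j], ranking[i]
    return best, ranking

# Edge (u, v) means institution v hired a graduate of institution u.
edges = [("A", "B"), ("A", "C"), ("B", "C")]
best, ranking = sample_rankings(["C", "A", "B"], edges)
```

In this toy network the unique optimal ranking is A, B, C with ρ = 1; the real networks instead settle at ρ ≈ 0.86 to 0.91, with many equally good rankings whose average ranks give the consensus hierarchy.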

Across disciplines, we find steep prestige hierarchies, in which only 9 to 14% of faculty are placed at institutions more prestigious than their doctorate (ρ = 0.86 to 0.91). Furthermore, the extracted hierarchies are 19 to 33% stronger than expected from the observed inequality in faculty production rates alone (Monte Carlo, P < 10⁻⁵; see Supplementary Materials), indicating a specific and significant preference for hiring faculty with prestigious doctorates.

Wednesday, February 11, 2015

Highly recommended podcast: Tim Harford (FT) at the LSE. Among the topics covered are Keynes' and Irving Fisher's performance as investors, and Philip Tetlock's IARPA-sponsored Good Judgment Project, meant to evaluate expert prediction of complex events. Project researchers (psychologists) find that "actively open-minded thinkers" (those who are willing to learn from those who disagree with them) perform best. Unfortunately there are no real "super-predictors" -- just some who are better than others, and have better calibration (accurate confidence estimates).

Yuval Noah Harari discusses his new book, Sapiens: A Brief History of Humankind, which explores the ways in which biology and history have defined us and enhanced our understanding of what it means to be human. One hundred thousand years ago, at least six different species of humans inhabited Earth. Yet today there is only one—Homo sapiens. What happened to the others? And what may happen to us?

About 2 million years ago our human ancestors were insignificant animals living in a corner of Africa. Their impact on the world was no greater than that of gorillas, zebras, or chickens. Today humans are spread all over the world, and they are the most important animal around. The very future of life on Earth depends on the ideas and behavior of our species.

This course will explain how we humans have conquered planet Earth, and how we have changed our environment, our societies, and our own bodies and minds. The aim of the course is to give students a brief but complete overview of history, from the Stone Age to the age of capitalism and genetic engineering. The course invites us to question the basic narratives of our world. Its conclusions are enlightening and at times provocative. For example:

· We rule the world because we are the only animal that can believe in things that exist purely in our own imagination, such as gods, states, money and human rights.

· Humans are ecological serial killers – even with stone-age tools, our ancestors wiped out half the planet's large terrestrial mammals well before the advent of agriculture.

· The Agricultural Revolution was history’s biggest fraud – wheat domesticated Sapiens rather than the other way around.

· Money is the most universal and pluralistic system of mutual trust ever devised. Money is the only thing everyone trusts.

· Empire is the most successful political system humans have invented, and our present era of anti-imperial sentiment is probably a short-lived aberration.

· Capitalism is a religion rather than just an economic theory – and it is the most successful religion to date.

· The treatment of animals in modern agriculture may turn out to be the worst crime in history.

· We are far more powerful than our ancestors, but we aren’t much happier.

· Humans will soon disappear. With the help of novel technologies, within a few centuries or even decades, Humans will upgrade themselves into completely different beings, enjoying godlike qualities and abilities. History began when humans invented gods – and will end when humans become gods.

Thousands of genomic segments appear to be present in widely varying copy numbers in different human genomes. We developed ways to use increasingly abundant whole-genome sequence data to identify the copy numbers, alleles and haplotypes present at most large multiallelic CNVs (mCNVs). We analyzed 849 genomes sequenced by the 1000 Genomes Project to identify most large (>5-kb) mCNVs, including 3878 duplications, of which 1356 appear to have 3 or more segregating alleles. We find that mCNVs give rise to most human variation in gene dosage—seven times the combined contribution of deletions and biallelic duplications— and that this variation in gene dosage generates abundant variation in gene expression. We describe ‘runaway duplication haplotypes’ in which genes, including HPR and ORM1, have mutated to high copy number on specific haplotypes. We also describe partially successful initial strategies for analyzing mCNVs via imputation and provide an initial data resource to support such analyses.

General cognitive function is substantially heritable across the human life course from adolescence to old age. We investigated the genetic contribution to variation in this important, health- and well-being-related trait in middle-aged and older adults. We conducted a meta-analysis of genome-wide association studies of 31 cohorts (N=53 949) in which the participants had undertaken multiple, diverse cognitive tests. A general cognitive function phenotype was tested for, and created in each cohort by principal component analysis. We report 13 genome-wide significant single-nucleotide polymorphism (SNP) associations in three genomic regions, 6q16.1, 14q12 and 19q13.32 (best SNP and closest gene, respectively: rs10457441, P=3.93 × 10⁻⁹, MIR2113; rs17522122, P=2.55 × 10⁻⁸, AKAP6; rs10119, P=5.67 × 10⁻⁹, APOE/TOMM40). We report one gene-based significant association with the HMGN1 gene located on chromosome 21 (P=1 × 10⁻⁶). These genes have previously been associated with neuropsychiatric phenotypes. Meta-analysis results are consistent with a polygenic model of inheritance. To estimate SNP-based heritability, the genome-wide complex trait analysis procedure was applied to two large cohorts, the Atherosclerosis Risk in Communities Study (N=6617) and the Health and Retirement Study (N=5976). The proportion of phenotypic variation accounted for by all genotyped common SNPs was 29% (s.e.=5%) and 28% (s.e.=7%), respectively. Using polygenic prediction analysis, ~1.2% of the variance in general cognitive function was predicted in the Generation Scotland cohort (N=5487; P=1.5 × 10⁻¹⁷). In hypothesis-driven tests, there was significant association between general cognitive function and four genes previously associated with Alzheimer’s disease: TOMM40, APOE, ABCG1 and MEF2C.
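The "general cognitive function phenotype ... created in each cohort by principal component analysis" amounts to taking the first principal component of several correlated test scores. A minimal sketch with simulated data (the real cohorts used diverse test batteries; the factor loadings below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.normal(size=1000)                        # latent general factor
# Four noisy tests, each loading 0.7 on the latent factor
tests = np.column_stack([0.7 * g + rng.normal(scale=0.7, size=1000)
                         for _ in range(4)])

# Standardize each test, then take the first principal component via SVD
Z = (tests - tests.mean(axis=0)) / tests.std(axis=0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
pc1 = Z @ Vt[0]   # each person's score on the first PC

# The first PC should track the latent factor closely (sign is arbitrary)
corr = abs(np.corrcoef(pc1, g)[0, 1])
```

Because the first PC pools the shared variance of the tests, it correlates more strongly with the latent factor than any single test does, which is why it serves as the common phenotype across cohorts.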

Monday, February 02, 2015

Scott Alexander writes about ability, effort, and achievement at his blog Slate Star Codex. Like many of his excellent posts, this one has received hundreds of thoughtful comments. (Sequel: Part 2 is up.)

Scott has special insight into this question as a consequence of having a musically talented brother (who quickly surpassed Scott to become a piano prodigy and professional musician) and of having struggled with math despite being very bright. Experiences like these make clear the division between talent and effort, but they're not always easy to share with others.

Slate Star Codex: ... There are frequent political debates in which conservatives (or straw conservatives) argue that financial success is the result of hard work, so poor people are just too lazy to get out of poverty. Then a liberal (or straw liberal) protests that hard work has nothing to do with it, success is determined by accidents of birth like who your parents are and what your skin color is et cetera, so the poor are blameless in their own predicament.

I’m oversimplifying things, but again the compassionate/sympathetic/progressive side of the debate – and the side endorsed by many of the poor themselves – is supposed to be that success is due to accidents of birth, and the less compassionate side is that success depends on hard work and perseverance and grit and willpower.

The obvious pattern is that attributing outcomes to things like genes, biology, and accidents of birth is kind and sympathetic. Attributing them to who works harder and who’s “really trying” can stigmatize people who end up with bad outcomes and is generally viewed as Not A Nice Thing To Do.

And the weird thing, the thing I’ve never understood, is that intellectual achievement is the one domain that breaks this pattern.

Here it’s would-be hard-headed conservatives arguing that intellectual greatness comes from genetics and the accidents of birth and demanding we “accept” this “unpleasant truth”.

And it’s would-be compassionate progressives who are insisting that no, it depends on who works harder, claiming anybody can be brilliant if they really try, warning us not to “stigmatize” the less intelligent as “genetically inferior”.

I can come up with a few explanations for the sudden switch, but none of them are very principled and none of them, to me, seem to break the fundamental symmetry of the situation. ...

... I tried to practice piano as hard as he did. I really tried. But every moment was a struggle. I could keep it up for a while, and then we’d go on vacation, and there’d be no piano easily available, and I would be breathing a sigh of relief at having a ready-made excuse, and he’d be heading off to look for a piano somewhere to practice on. Meanwhile, I am writing this post in short breaks between running around hospital corridors responding to psychiatric emergencies, and there’s probably someone very impressed with that, someone saying “But you had such a great excuse to get out of your writing practice!”

I dunno. But I don’t think of myself as working hard at any of the things I am good at, in the sense of “exerting vast willpower to force myself kicking and screaming to do them”. It’s possible I do work hard, and that an outside observer would accuse me of eliding how hard I work, but it’s not a conscious elision and I don’t feel that way from the inside. ...

Pursuing this topic to the end leads to the difficult question of whether predispositions to hard work, conscientiousness, ambition, etc. are themselves heritable. Of course, the answer is yes, at least partly. Free Will? :-)

[Geoffrey Miller] Yeah I'd say about seventy percent of evolutionary psychology is about mating, attraction, physical attractiveness, mental attractiveness, potential conflicts between men and women, and how those play out. But then other evolutionary psych people study all kinds of other things, like the learning and memory that Wikipedia mentioned. ...

[Geoffrey Miller] Well one thing to note is it's a pretty new field. I was literally at Stanford University when the field got invented by some of the leading people, who kind of had a joint retreat there at a place called The Center for Advanced Study in Behavioral Sciences. 1989, 1990.

And they actually strategized about, "How do we create this new field? What should we call it? How do we launch it? What kind of scientific societies and journals do we establish?"

So the field's only twenty-five years old. It started out pretty strongly though, because the people who went into it were brilliant, really world-class geniuses, and that's one of the things that attracted me to the field when I was a grad student.

Since then, the quality of the research has gotten way better. It's a very progressive field in the sense that we actually build on each other's insights. Other areas of psychology, everybody wants to coin and patent their own little term, their own, almost, trademarked little theory, and try to ignore a lot of what other people do.

We tend to be in more of the tradition of mainstream biology, where you actually respect what other people have done before, and try to build on it. So I think we're really good at doing that.

The other thing to remember, apart from it being a young field, is it's a pretty small field. There's fewer than a thousand people in the world actively doing evolutionary psych research, compared to fifty thousand people doing neuroscience research, or probably a hundred thousand scientists doing cancer research.

So it's not a huge field. There's probably more science journalists trying to cover evolutionary psychology than there are evolutionary psych researchers. ...

[Geoffrey Miller] Well I'll tell you what areas of science really impress me at the moment, in terms of being super high-quality and sophisticated. One is behavior genetics. Twin studies. So I did a sabbatical in Brisbane, Australia with one of the big twin research groups, back in 2007.

And they were just making this shift. They had tracked thirty thousand pairs of twins in Australia for the previous twenty years, and given them literally hundreds of surveys, and measurements, and experiments over the years. And they were just starting to collect DNA from all these twin pairs.

And what you have now is big international networks of people working in behavior genetics, sharing their data, publishing papers with fifty or a hundred scientists on the paper, working together and being able to identify, "Hey, here's where the genes for, like, how sexually promiscuous you are overlap with the genes for this personality trait, or the genes for this physical health trait."

And it's amazingly sophisticated. It's powerful. The datasets are huge. The problem is a lot of that stuff is very politically incorrect, and it makes people uncomfortable. And people are like, "You can't say that propensities for murdering people are genetic. Or, propensities for having a lot of musical creativity are genetic," people don't want to hear that. So there's a big kind of ideological problem there. But honestly that's where some of the best research is being done in the behavioral sciences. ...

[Geoffrey Miller] Well one big thing is I think a lot of the pickup artist guys who quote The Mating Mind book, or refer to evolutionary psychology, get all obsessed with status, and they talk about alpha males, and beta males, and gamma males, and omega males, and whatever. Status, status, status. And that's fine. Status is important, no doubt.

But the idea that you can simply categorize human males into, "Oh, you're an alpha. You're a beta." That works for gorillas. It works for orangutans, where the different statuses are actually associated with different body sizes. Like an alpha orangutan is literally twice as heavy as a beta orangutan, and has huge cheek pads, and the beta doesn't. And they have completely different mating strategies.

But for humans, status is way more complicated. It's fluid, it depends on context. ...