Thursday, February 27, 2014

In social science a correlation of R = 0.4 between two variables is typically considered a strong result. For example, both high school GPA and SAT score predict college performance with R ~ 0.4. Combining the two, one can achieve R ~ 0.5 to 0.6, depending on major. See Table 2 in my paper Data Mining the University.

It's easy to understand why SAT and college GPA are not more strongly correlated: some students work harder than others in school, and effort level is largely independent of SAT score. (For psychometricians, Conscientiousness and Intelligence are largely uncorrelated.) Also, it's typically students in the upper half or quarter of cognitive ability relative to the general population that earn college degrees. If the entire range of students were enrolled in college the SAT-GPA correlation would be higher. Finally, there is, of course, inherent randomness in grading.

I often hear complaints of the type: "R = 0.4 is negligible! It only accounts for 16% of the total variance, leaving 84% unaccounted for!" (The fraction of variance unaccounted for is 1 - R^2.) This kind of remark even finds its way into quantitative genetics and genomics: "But the alleles so far discovered only account for 20% of total heritability! OMG GWAS is a failure!"

This is a misleading complaint. Variance is the average of squared deviations, so it does not even carry the same units as the quantity of interest. Variance is a convenient quantity because it is additive for uncorrelated variables, but it leads to distorted intuition for effect size: SDs are the natural unit, not SD^2!

A less misleading way to think about the correlation R is as follows: given X,Y from a standardized bivariate distribution with correlation R, an increase in X leads to an expected increase in Y: dY = R dX. In other words, students with +1 SD SAT scores have, on average, roughly +0.4 SD college GPAs. Similarly, students with +1 SD college GPAs have, on average, +0.4 SD SAT scores.
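The relation dY = R dX can be checked numerically. The sketch below is only an illustration, not taken from the paper: it assumes X and Y are standardized bivariate normal, builds Y as R times X plus independent noise, and estimates the regression slope in both directions. The function names, sample size, and seed are arbitrary choices made for the example.

```python
import math
import random


def simulate_standardized_pairs(r, n, seed=0):
    """Draw n (x, y) pairs from a standardized bivariate normal with correlation r."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(1.0 - r * r)  # keeps Var(y) = 1
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = r * x + noise_scale * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs


def regression_slope(pairs):
    """Least-squares slope of the second coordinate on the first."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov_ab = sum((a - mean_a) * (b - mean_b) for a, b in pairs) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs) / n
    return cov_ab / var_a


pairs = simulate_standardized_pairs(r=0.4, n=100_000)
slope_y_on_x = regression_slope(pairs)                       # expected change in Y per SD of X
slope_x_on_y = regression_slope([(y, x) for x, y in pairs])  # expected change in X per SD of Y
print(f"slope of Y on X: {slope_y_on_x:.3f}")
print(f"slope of X on Y: {slope_x_on_y:.3f}")
```

Under these assumptions both slopes should land close to R = 0.4, which is the symmetric statement made above about SAT and GPA.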

Alternatively, if we assume that Y is the sum of (standardized) X and a noise term (the sum rescaled so that Y remains standardized), the standard deviation of the noise term is given by sqrt(1 - R^2)/R ~ 1/R for modest correlations. That is, the standard deviation of the noise is about 1/R times that of the signal X. When the correlation is 1/sqrt(2) ~ 0.7 the signal and noise terms have equal SD and variance. ("Half of the variance is accounted for by the predictor X"; see for comparison the figure above with R = 0.8.)

As another example, test-retest correlations of SAT or IQ are pretty high, R ~ 0.9 or more. What fluctuations in score does this imply? In the model above the noise SD = sqrt(1 - 0.81)/0.9 ~ 0.5, so we'd expect the test score of an individual to fluctuate by about half a population SD (i.e., ~7 points for IQ or ~50 points per SAT section). This is similar to what is observed in the SAT data of Oregon students.
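The arithmetic in the model above is easy to reproduce. This is a minimal sketch that follows the formula noise SD = sqrt(1 - R^2)/R as stated; the population SDs of 15 IQ points and 100 points per SAT section are assumptions chosen to match the rough figures quoted, and the function name is illustrative.

```python
import math


def noise_sd_relative_to_signal(r):
    """SD of the noise term, in units of the signal SD, for correlation r."""
    return math.sqrt(1.0 - r * r) / r


IQ_SD_POINTS = 15.0      # assumed population SD for IQ
SAT_SECTION_SD = 100.0   # assumed population SD per SAT section

for r in (0.4, 1.0 / math.sqrt(2.0), 0.9):
    ratio = noise_sd_relative_to_signal(r)
    print(f"R = {r:.2f}: noise SD = {ratio:.2f} x signal SD")

# Test-retest case, R = 0.9
retest_noise = noise_sd_relative_to_signal(0.9)
iq_fluctuation = retest_noise * IQ_SD_POINTS
sat_fluctuation = retest_noise * SAT_SECTION_SD
print(f"IQ fluctuation: about {iq_fluctuation:.1f} points")
print(f"SAT fluctuation: about {sat_fluctuation:.0f} points per section")
```

At R = 0.4 the noise SD is a bit more than twice the signal SD, at R = 1/sqrt(2) the two are equal, and at R = 0.9 the noise is a little under half the signal.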

I worked this out during a boring meeting. It was partially stimulated by this article in the New Yorker about training for the SAT (if you go there, come back and read this to unfog your brain), and activist nonsense like this. Let me know if I made mistakes ... 8-)

Sunday, February 23, 2014

I thought it worthwhile to re-post my 2008 slides on the credit crisis. I wrote these slides just as the crisis was getting started (right after the big defaults), but I still think my analysis was correct and better than post-hoc discussions that are going on to this day.

I believe I called the housing bubble back in 2004 -- see, e.g., here for a specific discussion of bubbles and timescales. The figure above also first appeared on my blog in 2004 or 2005.

(2004) ... The current housing bubble is an even more egregious example. Because real estate is not a very liquid investment - the typical family has to move and perhaps change jobs to adjust to mispricing - the timescale for popping a bubble is probably 5-10 years or more. Further, I am not aware of any instruments that let you short a real estate bubble in an efficient way.

Atlantic: ... Jeremy, for instance, had arrived at Goldman thinking that his specific job—trading commodities derivatives—could make the world a teensy bit better by allowing large companies to hedge their costs, and pass savings along to customers. But one day, his boss pulled him aside and told him that, in effect, he’d been naïve.

“We’re not here to save the world,” the boss said. “We exist to make money.”

The British economist Roger Bootle has written about the difference between “creative” and “distributive” work. Creative work, Bootle says, is work that brings something new into the world that adds to the total available to everyone (a doctor treating patients, an artist making sculptures). Distributive work, on the other hand, only carries the possibility of beating out competitors and winning a bigger share of a fixed-size market. Bootle explains that although many jobs in modern society consist of distributive work, there is something intrinsically happier about a society that skews in favor of the creative.

“There are some people who may derive active delight from the knowledge that their working life is devoted to making sure that someone else loses, but most people do not function that way,” he writes. “They like to have a sense of worth, and that sense usually comes from the belief that they are contributing to society.”

During my interviews with young bankers, I heard a lot of them express this exact sentiment. They wanted to do something, make something, add something to the world, instead of simply serving as well-paid financial intermediaries at giant investment banks. It doesn’t hurt that creative jobs—including, but not limited to, jobs with Silicon Valley tech companies—are now considered sexier and more socially acceptable than Wall Street jobs, which still carry the stigma of the financial crisis. At one point, during the Occupy Wall Street protests, Jeremy told me that he had begun camouflaging his Goldman affiliation in public. ...

Friday, February 21, 2014

Freeman Dyson discusses Darwin's failure to discover Mendelian inheritance. Had Darwin a stronger grasp of statistics (then under development by his cousin Francis Galton), he might have discovered the properties of the basic units of inheritance, so central to his theory of natural selection. See also Bounded cognition.

NYBooks: ... Seven years after Darwin published The Origin of Species, without any satisfactory explanation of hereditary variations, the Austrian monk Gregor Mendel published his paper “Experiments in Plant Hybridization” in the journal of the Brünn Natural History Society. Mendel had solved Darwin’s problem. He proposed that inheritance is carried by discrete units, later known as genes, that do not blend but are carried unchanged from generation to generation. The Mendelian theory of inheritance fits perfectly with Darwin’s theory of natural selection. Mendel had read Darwin’s book, but Darwin never read Mendel’s paper.

The essential insight of Mendel was to see that sexual reproduction is a system for introducing randomness into inheritance. In sweet peas as in humans, each plant is either male or female, and each offspring has one male and one female parent. Inherited characteristics may be specified by one gene or by several genes. Single-gene characteristics are the simplest to calculate, and Mendel chose them to study. For example, he studied the inheritance of pod color, determined by a single gene that has a version specifying green and a version specifying yellow. Each plant has two copies of the gene, one from each parent. There are three kinds of plants, pure green with two green versions of the gene, pure yellow with two yellow versions, and mixed with one green and one yellow. It happens that only one green gene is required to make a pod green, so that the mixed plants look the same as the pure green plants. Mendel describes this state of affairs by saying that green is dominant and yellow is recessive.

Mendel did his classic experiment by observing three generations of plants. The first generation was pure green and pure yellow. He crossed them, pure green with pure yellow, so that the second generation was all mixed. He then crossed the second generation with itself, so that the third generation had all mixed parents. Each third-generation plant had one gene from each parent, with an equal chance that each gene would be green or yellow. On the average, the third generation would be one-quarter pure green, one-quarter pure yellow, and one-half mixed. In outward appearance the third generation would be three-quarters green and one-quarter yellow.

This ratio of 3 between green and yellow in the third generation was the new prediction of Mendel’s theory. Most of his experiments were designed to test this prediction. But Mendel understood very well that the ratio 3 would only hold on the average. Since the offspring chose one gene from each parent and every choice was random, the numbers of green and yellow in the third generation were subject to large statistical fluctuations. To test the theory in a meaningful way, it was essential to understand the statistical fluctuations. Fortunately, Mendel understood statistics.

Mendel understood that to test the ratio 3 with high accuracy he would need huge numbers of plants. It would take about eight thousand plants in the third generation to be reasonably sure that the observed ratio would be between 2.9 and 3.1. He actually used 8,023 plants in the third generation and obtained the ratio 3.01. He also tested other characteristics besides color, and used altogether 17,290 third-generation plants. His experiments required immense patience, continuing for eight years with meticulous attention to detail. Every plant was carefully isolated to prevent any intruding bee from causing an unintended fertilization. A monastery garden was an ideal location for such experiments.

In 1866, the year Mendel’s paper was published, but without any knowledge of Mendel, Darwin did exactly the same experiment. Darwin used snapdragons instead of sweet peas, and tested the inheritance of flower shape instead of pod color. Like Mendel, he bred three generations of plants and observed the ratio of normal-shaped to star-shaped flowers in the third generation. Unlike Mendel, he had no understanding of statistical fluctuations. He used a total of only 125 third-generation plants and obtained a value of 2.4 for the crucial ratio. This value is within the expected statistical uncertainty, either for a true value of 2 or for a true value of 3, with such a small sample of plants. Darwin did not understand that he would need a much larger sample to obtain a meaningful result.

Mendel’s sample was sixty-four times larger than Darwin’s, so that Mendel’s statistical uncertainty was eight times smaller. Darwin failed to repeat his experiment with a larger number of plants, and missed his chance to incorporate Mendel’s laws of heredity into his theory of evolution. He had no inkling that a fundamental discovery was within his grasp if he continued the experiment with larger populations. The basic idea of Mendel was that the laws of inheritance would become simple when inheritance was considered as a random process. This idea never occurred to Darwin. That was why Darwin learned nothing from his snapdragon experiment. It remained a brilliant blunder.
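Dyson's sample-size argument can be checked with a rough calculation. The sketch below is an approximation under stated assumptions, not Mendel's or Dyson's own method: it treats each third-generation plant as independently showing the dominant trait with probability ratio/(1 + ratio), and uses the delta method to turn the binomial standard error into a standard error for the dominant-to-recessive ratio. The function name is illustrative.

```python
import math


def ratio_standard_error(true_ratio, n_plants):
    """Approximate SE of the observed dominant:recessive ratio (delta method)."""
    p = true_ratio / (1.0 + true_ratio)            # probability of the dominant appearance
    se_p = math.sqrt(p * (1.0 - p) / n_plants)     # binomial SE of the observed fraction
    return se_p / (1.0 - p) ** 2                   # d(ratio)/dp = 1 / (1 - p)^2


mendel_se = ratio_standard_error(3.0, 8023)
darwin_se_if_3 = ratio_standard_error(3.0, 125)
darwin_se_if_2 = ratio_standard_error(2.0, 125)

print(f"Mendel, n = 8023, true ratio 3: SE = {mendel_se:.3f}")
print(f"Darwin, n = 125, true ratio 3: SE = {darwin_se_if_3:.2f}")
print(f"Darwin, n = 125, true ratio 2: SE = {darwin_se_if_2:.2f}")
print(f"Uncertainty ratio Darwin/Mendel: {darwin_se_if_3 / mendel_se:.1f}")
```

In this approximation Mendel's uncertainty on the ratio is a bit under 0.1, while Darwin's is several tenths, so an observed 2.4 sits about one standard error from either 2 or 3. The factor of eight is just the square root of the sixty-four-fold difference in sample size.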

New York magazine on what people make in the Big Apple. Doctors, financiers, escorts, drivers, editors, cops, attorneys, etc. If the figures are correct for PR/Communications people, we are overpaying at the university ...

NYmag: ... It was in the spirit of this financial glasnost that we began an exhaustive survey of New York's most important professions. We studied doctors, dog walkers, bankers, baristas, headhunters, advertising honchos, prostitutes -- just about anyone who does anything in this town to make a buck. And surprisingly, most of them -- in the strictest of confidence, of course -- spilled the beans. ...

... P.R. maven Lizzie Grubman presides over a staff of 30 and says she gets 100 résumés a week from aspiring public-relations professionals. That's about all you need to know to understand why P.R. starting salaries seldom get over the $30,000 hurdle. Young P.R. people often find themselves where the action is -- but they're definitely not getting rich. "Public relations is just not a tremendous moneymaker," Grubman says. "You have to own the business to make money." Needless to say, Grubman, who's 29 years old, owns the business.

Before you set up your own shop, however, you'll want to score a few clients in an established firm. Win an account and you'll move up to be an account executive. If you're successful, you'll manage people and be a senior AE making $45,000 to $55,000. After five years, you'd start looking to be made a VP in a smaller agency or an account supervisor in a larger one. In either place, you'd bring home $75,000 to $100,000.

Some P.R. professionals end the infighting and maneuvering that come with the territory at certain agencies by going to corporations that regularly hire media specialists. Someone doing marketing communications at a big company, for a salary in the neighborhood of $60,000, has more control over the strategy. "You may still pick up the phone for the press calls," says Bill Heyman, a P.R. and corporate-communications recruiter, "but you'll also be putting together the press plan." ...

Thursday, February 20, 2014

A detailed inside look at the founders and early days of WhatsApp. The team is only 50 people. Over half of the $19 billion from Facebook will go to the two founders (both Yahoo! alums), and I'd guess at least a billion or two will be split by the rest of the team. The rest will go to investors: Sequoia + angels.

Forbes: ... The two sat at Acton’s kitchen table and started sending messages to each other on WhatsApp, already with the famous double check mark that showed another phone had received a message. Acton realized he was looking at a potentially richer SMS experience – and more effective than the so-called MMS messages for sending photos and other media that often didn’t work. “You had the whole open-ended bounty of the Internet to work with,” he says.

He and Koum worked out of the Red Rock Cafe, a watering hole for startup founders on the corner of California and Bryant in Mountain View; the entire second floor is still full of people with laptops perched on wobbly tables, silently writing code. The two were often up there, Acton scribbling notes and Koum typing. In October Acton got five ex-Yahoo friends to invest $250,000 in seed funding, and as a result was granted cofounder status and a stake. He officially joined on Nov. 1. (The two founders still have a combined stake in excess of 60% — a large number for a tech startup — and Koum is thought to have the larger share because he implemented the original idea nine months before Acton came on board. Early employees are said to have comparatively large equity shares of close to 1%. Koum won’t comment on the matter.)

The pair were getting flooded with emails from iPhone users, excited by the prospect of international free texting and desperate to “WhatsApp” their friends on Nokias and BlackBerries. With Android just a blip on the radar, Koum hired an old friend who lived in LA, Chris Peiffer, to make the BlackBerry version of WhatsApp. “I was skeptical,” Peiffer remembers. “People have SMS, right?” Koum explained that people’s texts were actually metered in different countries. “It stinks,” he told him. “It’s a dead technology like a fax machine left over from the seventies, sitting there as a cash cow for carriers.” Peiffer looked at the eye-popping user growth and joined.

Through their Yahoo network they found a startup subleasing some cubicles on a converted warehouse on Evelyn Ave. The whole other half of the building was occupied by Evernote, who would eventually kick them out to take up the whole building. They wore blankets for warmth and worked off cheap Ikea tables. Even then there was no WhatsApp sign for the office. “Their directions were ‘Find the Evernote building. Go round the back. Find an unmarked door. Knock,’” says Michael Donohue, one of WhatsApp’s first BlackBerry engineers recalling his first interview. ...

Wednesday, February 19, 2014

Caltech has a serious problem with undergraduates cheating on academic work, which Caltech administrators appear to be ignoring. A few years ago, one alumnus considered the problem so bad that he urged other alumni to stop donating. I attended Tech (that’s what we called it) for a year and a half in the 1970s. I didn’t think cheating was a problem then. Now it is.

A recent article in the Times Higher Education Supplement by Phil Baty praised Caltech’s “honor system”, which includes trusting students not to cheat on exams. A Caltech professor of biology named Markus Meister told Baty that “cheats simply cannot prosper in an environment that includes such small-group teaching and close collaboration with colleagues because they would rapidly be exposed.” That strikes me as naive. How convenient for Meister that there is no need to test his theory — it must be true (“cheats simply cannot prosper”).

... There is a small and growing population of students at Caltech [who] are systematically cheating, and the Caltech administration is aware of it but refuses to do anything about it. I suspect the problem began when Caltech started advertising its ‘Honor Code’ to prospective high school students in the '90s, which led to self-selection of students who were willing to bend the rules. ...

The comment below is consistent with my experiences at Caltech in the 80s. We took the honor code very seriously ...

Bill Mitchell Says:
February 19th, 2014 at 3:29 pm
The honor system appeared to work when I was there. Decades later, Caltech remains one of the hardest things I’ve ever done. As the school’s president put it to incoming students at our orientation, “we will challenge not just your mental limits, but your physical limits,” meaning sleep deprivation. The intensity was incredible. Exhausting, but rewired me to think better.

But it does depend upon the student culture. If cheating takes root, an honor system can’t work. I would hate to see them lose a well-earned reputation by not putting a lid on cheating, if it is a growing problem.

Bill Mitchell Says:
February 19th, 2014 at 3:53 pm
Actually I just realized the honor system was likely working in the 1980s, deducible from two facts: most courses there were still being graded on a C curve, and individual student scores in applied math and electrical engineering tests often averaged below 70%, with wide variance.

If cheating were rampant at that time, this seemingly could not have occurred. The bell-curve grading would have driven competition among cheaters, which would either have driven scores up, or driven variance down, or both. That doesn’t seem to have happened.

Sunday, February 16, 2014

This is video of a talk given to cognitive scientists about a year ago (slides). The original video had poor sound quality, but I think I've improved it using Google's video editing tool. I found that the earlier version was OK if you used headphones, but I think they are not necessary for this one.

I don't have any update on the BGI Cognitive Genomics study. We are currently held up by the transition from Illumina to Complete Genomics technology. (Complete Genomics was acquired by BGI last year and they are moving from the Illumina platform to an improved CG platform.) This is highly frustrating but I will just leave it at that.

On one of the last slides I mention Compressed Sensing (L1 penalized convex optimization) as a method for extracting genomic information from GWAS data. The relevant results are in this paper.

Thursday, February 13, 2014

The study below is sensitive to rare variants which are implicated in schizophrenia risk. These rare variants add to the heritability already associated with common variants, estimated to be at least 32%. In related work, mutations affecting schizophrenia risk were shown to depress IQ in individuals who did not present for schizophrenia.

Schizophrenia is a common disease with a complex aetiology, probably involving multiple and heterogeneous genetic factors. Here, by analysing the exome sequences of 2,536 schizophrenia cases and 2,543 controls, we demonstrate a polygenic burden primarily arising from rare (less than 1 in 10,000), disruptive mutations distributed across many genes. Particularly enriched gene sets include the voltage-gated calcium ion channel and the signalling complex formed by the activity-regulated cytoskeleton-associated scaffold protein (ARC) of the postsynaptic density, sets previously implicated by genome-wide association and copy-number variation studies. Similar to reports in autism, targets of the fragile X mental retardation protein (FMRP, product of FMR1) are enriched for case mutations. No individual gene-based test achieves significance after correction for multiple testing and we do not detect any alleles of moderately low frequency (approximately 0.5 to 1 per cent) and moderately large effect. Taken together, these data suggest that population-based exome sequencing can discover risk alleles and complements established gene-mapping paradigms in neuropsychiatric disease.

The increasing concentration of wealthy students at highly selective colleges is widely perceived, but few analyses examine the underlying dynamics of higher education stratification over time. To examine these dynamics, the authors build an analysis data set of four cohorts from 1972 to 2004. They find that low-income students have made substantial gains in their academic course achievements since the 1970s. Nonetheless, wealthier students have made even stronger gains in achievement over the same period, in both courses and test scores, ensuring a competitive advantage in the market for selective college admissions. Thus, even if low-income students were “perfectly matched” to institutions consistent with their academic achievements, the stratification order would remain largely unchanged. The authors consider organizational and policy interventions that may reverse these trends. [ Italics mine ]

Thursday, February 06, 2014

New report on higher education costs and staffing trends based on Delta Cost Project data, including IPEDS input. Also discussed at the Chronicle.

... This report looks at long-term employment changes on college and university campuses during the past two decades and examines fluctuations in faculty staffing patterns, growth in administrative positions, and the effects of the recent recession on long-standing employment trends. It goes beyond other studies (Zaback, 2011; Bennett, 2009) to explore the effects of these staffing changes on total compensation, institutional spending patterns, and ultimately tuitions.

The overarching trends show that between 2000 and 2012, the public and private nonprofit higher education workforce grew by 28 percent, more than 50 percent faster than the previous decade. But the proportion of staff to students at public institutions grew slower in the 2000s than in the 1990s because the recent expansion in new positions largely mirrored rising enrollments as the Millennial Generation entered college. By 2012, public research universities and community colleges employed 16 fewer staff per 1,000 full-time equivalent (FTE) students compared with 2000, while the number of staff per student at public master’s and bachelor’s colleges remained unchanged.

... Growth in administrative jobs was widespread across higher education — but creating new professional positions, rather than executive and managerial positions, is what drove the increase. Professional positions (for example, business analysts, human resources staff, and admissions staff) grew twice as fast as executive and managerial positions at public nonresearch institutions between 2000 and 2012, and outpaced enrollment growth.

Colleges and universities have invested in professional jobs that provide noninstructional student services, not just business support. Across all educational sectors, wage and salary expenditures for student services (per FTE staff) were the fastest growing salary expense in many types of institutions between 2002 and 2012.

Part-time faculty/graduate assistants typically account for at least half of the instructional staff in most higher education sectors. Institutions have continued to hire full-time faculty, but at a pace that either equaled or lagged behind student enrollments; these new hires also were likely to fill non-tenure-track positions.

As the ranks of managerial and professional administrative workers grew, the number of faculty and staff per administrator continued to decline. The average number of faculty and staff per administrator declined by roughly 40 percent in most types of four-year colleges and universities between 1990 and 2012, and now averages 2.5 or fewer faculty and staff per administrator.

Monday, February 03, 2014

Two recent articles in the Times on Preimplantation Genetic Diagnosis, or PGD. The first article focuses on ethical issues and a couple in the US, while the second follows the fertility travails of a Times writer in Israel, where PGD and IVF are covered by national health insurance.

NYTimes: ... Genetic testing of embryos has been around for more than a decade, but use of the procedure has soared in recent years as methods have improved and more disease-causing genes have been discovered. The in vitro fertilization and testing are expensive — typically about $20,000 — but they make it possible for couples to ensure that their children will not inherit a faulty gene and to avoid the difficult choice of whether to abort a pregnancy if testing of a fetus detects a genetic problem.

But the procedure also raises unsettling ethical questions that trouble advocates for the disabled and have left some doctors struggling with what they should tell their patients. When are prospective parents justified in discarding embryos? Is it acceptable, for example, for diseases like GSS, that develop in adulthood? What if a gene only increases the risk of a disease? ...

There is no question that the method’s use is increasing rapidly, though no group collects comprehensive data, said Dr. Joe Leigh Simpson, vice president for research at the March of Dimes and past president of the American Society for Reproductive Medicine.

The center in Chicago that tested the Kalinskys’ embryos for the GSS gene, the Reproductive Genetics Institute, has tested embryos from more than 2,500 couples and sought to identify 425 gene mutations. The institute’s caseload has risen 40 percent in just the past two years, according to its director, Svetlana Rechitsky, an author of the new paper on the Kalinskys’ case.

Ms. Kalinsky’s gene causes a particularly dire disease. Her grandfather, great-aunt, uncle, father and cousins died of it. Sometime between her mid-30s and her mid-50s, Ms. Kalinsky, who is now 30, will begin to stumble like a drunk. Dementia will follow, and possibly blindness or deafness. Five years after the onset of symptoms, she will most likely be dead. ...

Meanwhile, in Israel:

NYTimes: ... We knew what treatment we wanted: full-force I.V.F. using Preimplantation Genetic Diagnosis or Screening (P.G.D./P.G.S.), which genetically screens embryos before implantation. The method can be used to search for specific diseases such as Tay Sachs or Gaucher, but it can also weed out aneuploidy (abnormal number of chromosomes in a cell), the reason for many first-trimester miscarriages.

Some worry about the use of P.G.D. for gender selection (“Will this become a form of eugenics?”). But I wonder if genetic screening will be the future of reproductive medicine, especially since new technologies keep emerging. Then again, reproductive medicine is still in its nascent stages, so who knows what the future will bring? Maybe all women will freeze their eggs at age 25. Maybe no one will ever have fertility trouble again.

Genetic screening may not increase my chances of having a baby, but it would greatly reduce my chances of miscarrying or bearing a baby with genetic defects. After three miscarriages, it seemed like the best option: I couldn’t put myself through another such trauma. Besides, if all our embryos were bad, we would definitively know it was time to move on.

Embryo screening costs at least $5,000 for one to eight embryos in the United States, but in Israel it would be covered by national health care. ...