Quotations

I can calculate the motion of heavenly bodies but not the madness of people

Isaac Newton
After losing a fortune in the South Sea Company bubble of 1720

Trying is the first step towards failure. -- Homer Simpson

At the website Language Log [1], Mark Liberman posted the following quotation from Miranda Robertson, in “Ockham’s broom,” a new series said to have been introduced on October 16, 2009, in the Journal of Biology:

[I]t is probably safe to assume that most readers are familiar with Ockham’s razor – roughly, the principle whereby gratuitous suppositions are shaved from the interpretation of facts …. Ockham's broom is a somewhat more recent conceit, attributable to Sydney Brenner, and embodies the principle whereby inconvenient facts are swept under the carpet in the interests of a clear interpretation of a messy reality. (Or, some – possibly including Sydney Brenner – might say, in order to generate a publishable paper.)

If I was studying today, I would go get a master's in statistics, and maybe do a bunch of accounting courses and then write from that perspective. I think that's the way to survive. The role of the generalist is diminishing. Journalism has to get smarter.

Malcolm Gladwell TIME, October 20, 2009

Forsooths

This forsooth is from the October 2009 RSS Forsooth.

Of course in those days we worked on the assumption that everything was normally distributed and we have seen in the last few months that there is no such thing as a normal distribution.

University of North Dakota researchers found that pilots who ate the fattiest foods such as butter or gravy had the quickest response times in mental tests and made fewer mistakes when flying in tricky cloud conditions.

According to a New Yorker (October 12, 2009) review [3] of Matthew Stewart's The Management Myth: Why the Experts Keep Getting It Wrong, Stewart tells a story about how "his boss taught his twenty-something[-old] trainees ... how to conduct a 'two-handed regression'":

"When a scatter plot failed to show the significant correlation between two variables that we all knew was there, he would place a pair of meaty hands over the offending clouds of data points and thereby reveal the straight line hiding from conventional mathematics." Management consulting isn't a science, Stewart says; it's a party trick.

Failure to disclose

Researchers from the U.S. Army and Thailand failed to disclose that some results of a potential HIV vaccine trial were not statistically significant, although they had this information when they announced the discovery.

"We thought very hard about how to provide the clearest, most honest message," [one researcher] said. "We stand by the fact that this is a vaccine with a modest protective effect." He called the trial results "complex."

The first analysis, a “modified intent to treat” analysis, included “virtually everyone who enrolled in the study, regardless of whether they ended up getting the full course of the vaccine. …. By this measure, the vaccine tested in Thailand reduced by 31% the chance of infection with HIV ….”

New infections occurred in 51 of the 8,197 people who got the vaccine, compared with 74 of the 8,198 volunteers who got placebo shots. Statistical calculations showed there was a 3.9% probability that chance accounted for the difference. In drug and vaccine trials, anything above a 5% probability of a chance result is deemed statistically insignificant.
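The 31% and 3.9% figures can be roughly reproduced from the counts quoted above. The sketch below uses a pooled two-proportion z-test, which is an assumption made for illustration and not necessarily the method the trial statisticians used; the variable names are also illustrative.

```python
from math import erfc, sqrt

# Counts reported for the "modified intent to treat" analysis.
vaccine_infected, vaccine_total = 51, 8197
placebo_infected, placebo_total = 74, 8198

rate_vaccine = vaccine_infected / vaccine_total
rate_placebo = placebo_infected / placebo_total

# Relative reduction in the infection rate (the "efficacy").
efficacy = 1 - rate_vaccine / rate_placebo

# Pooled two-proportion z-test (normal approximation). This choice of test
# is an assumption, not something stated in the article.
pooled = (vaccine_infected + placebo_infected) / (vaccine_total + placebo_total)
std_err = sqrt(pooled * (1 - pooled) * (1 / vaccine_total + 1 / placebo_total))
z = (rate_placebo - rate_vaccine) / std_err
p_two_sided = erfc(z / sqrt(2))  # P(|Z| > z) for a standard normal Z

print(f"efficacy = {efficacy:.1%}, z = {z:.2f}, p = {p_two_sided:.3f}")
```

Under these assumptions the efficacy comes out near 31% and the two-sided p-value near 3.9%, in line with the reported numbers.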

The second analysis, a “per protocol” analysis, included only the “study participants who got the full regimen of vaccine shots at the right time.” Apparently, for this group, in which 86 people were infected, there is a “16% chance the study results were a fluke.” By this measure, the vaccine reduced the chance of infection with HIV by 26%.

The article’s authors comment:

It isn't clear why the vaccine was seemingly ineffective among participants who followed the guidelines to the letter.

Submitted by Margaret Cibes

More on AIDS Vaccine

“Hardly ever believe what you read” is a maxim that will stand you in good stead. Googling “aids vaccine Thailand” will get 248,000 hits, most of which are misleading. In essence, the URLs say that for the first time an effective vaccine against AIDS has been manufactured. But that was last month. Reality has now set in.

The following chart found in the Wall Street Journal of October 9, 2009 paints a different picture. “New infections occurred in 51 of the 8,197 people who got the vaccine, compared with 74 of the 8,198 volunteers who got placebo shots.” Note that the “125” infections represent “51 + 74.”

The announcement on September 24, 2009 indicated that the p-value is 3.9%. A Minitab run shows that, in fact, the p-value is higher (i.e., worse) as indicated by the Fisher exact test. However, the .048 is still under the mystical .05:
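For readers without Minitab, the Fisher exact test can be sketched in plain Python. The function below is a hypothetical helper written for this note, not Minitab's implementation; it assumes the usual two-sided convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb

def fisher_exact_two_sided(a, n1, b, n2):
    """Two-sided Fisher exact p-value for a events out of n1 vs. b events out of n2."""
    total_events = a + b
    denom = comb(n1 + n2, total_events)

    def prob(x):
        # Hypergeometric probability of x events in group 1, with margins fixed.
        return comb(n1, x) * comb(n2, total_events - x) / denom

    observed = prob(a)
    lo = max(0, total_events - n2)
    hi = min(n1, total_events)
    # Add up every table that is no more likely than the observed one
    # (the small factor guards against floating-point ties).
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-7))

# 51 infections among 8,197 vaccinated vs. 74 among 8,198 on placebo.
p_value = fisher_exact_two_sided(51, 8197, 74, 8198)
print(f"Fisher exact two-sided p-value: {p_value:.3f}")
```

The same function accepts any other pair of counts and group sizes, so it can be reused for other two-by-two comparisons.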

In the final column of the chart--“Strictly adheres to trial design”--appears the unreleased “per protocol” version. According to Science Magazine:

The second analysis is called “per protocol” and adheres strictly to how the trial was designed by only including the study participants who got the full regimen of vaccine shots at the right time. Because it excludes study participants who didn't get the full vaccine regimen, it usually provides corroboration to the looser “intent to treat” findings.

The article doesn’t say what the breakdown of the 86 infections is. Nevertheless, it indicates that the p-value of 16% puts a damper on enthusiasm for the vaccine.

“The press conference was not a scholarly, rigorously honest presentation,” said one leading HIV/AIDS investigator, who like others asked that his name not be used. “It doesn’t meet the standards that have been set for other trials, and it doesn’t fully present the borderline results. It’s wrong.”

Discussion

1. “Strictly adheres to trial design” has an efficacy of 26.2% and 86 infections. Show that this leads approximately to 36 and 50 infections, respectively.

2. The articles fail to tell us the number of participants in the “per protocol” situation. However, use the 36 and 50 cited above and show via a statistics package such as Minitab that the Fisher exact test comes up with about 16% for the p-value regardless of whether the sample sizes are the original ones or 4000 each, 5000 each, etc.

3. The “researchers with the U.S. Army who helped run the study, strongly objected to the assertion that they gave the data a positive spin… The debate over the way the results were presented will have no immediate practical impact because even under the most optimistic assessment, the vaccine offered too little protection to be a serious candidate for widespread use.” If this is so, why was there so much positive publicity in September?
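As a hint for question 1, here is a sketch of the arithmetic. It assumes that the vaccine and placebo groups in the “per protocol” analysis are about equal in size (the articles do not give these sizes), so that the ratio of infection counts stands in for the ratio of infection rates. Write <math>v</math> and <math>p</math> for the number of infections in the vaccine and placebo groups; these symbols are introduced here for illustration.

<math>1 - \frac{v}{p} = 0.262, \qquad v + p = 86 \;\Longrightarrow\; p = \frac{86}{1.738} \approx 49.5, \quad v = 86 - p \approx 36.5</math>

Under that assumption the split is close to the 36 and 50 infections quoted in the question.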

Submitted by Paul Alper

Carrying a gun increases risk of getting shot and killed

People who carry guns are far likelier to get shot – and killed – than those who are unarmed, a study of shooting victims in Philadelphia, Pennsylvania, has found. It would be impractical – not to say unethical – to randomly assign volunteers to carry a gun or not and see what happens. So Charles Branas's team at the University of Pennsylvania analyzed 677 shootings over two-and-a-half years to discover whether victims were carrying at the time, and compared them to other Philly residents of similar age, sex and ethnicity. The team also accounted for other potentially confounding differences, such as the socioeconomic status of their neighborhood.

Their article will appear in the American Journal of Public Health. The current version of the article and its most recent abstract are available online. In the abstract we read:

Objectives. We investigated the possible relationship between being shot in an assault and possession of a gun at the time.

Methods. We enrolled 677 case participants that had been shot in an assault and 684 population-based control participants within Philadelphia, PA, from 2003 to 2006. We adjusted odds ratios for confounding variables.

Results. After adjustment, individuals in possession of a gun were 4.46 (P<.05) times more likely to be shot in an assault than those not in possession. Among gun assaults where the victim had at least some chance to resist, this adjusted odds ratio increased to 5.45 (P<.05).

Conclusions. On average, guns did not protect those who possessed them from being shot in an assault. Although successful defensive gun uses occur each year, the probability of success may be low for civilian gun users in urban areas. Such users should reconsider their possession of guns or, at least, understand that regular possession necessitates careful safety countermeasures.

Discussion

Why do you think New Scientist and others discussing this study titled their articles "Carrying a gun increases risk of getting shot and killed" rather than using the title of the original article, "Investigating the Link Between Gun Possession and Gun Assault"?

Of course, this is the kind of article that lends itself to interesting comments. For example:

I am definitely going to have to find the complete article. I want to see how they determined which victims of being shot were included in the study and how they determined which civilians would be included in the study. With out that information, this study doesn't really mean anything.

Follow this advice and see if you think the study really means anything.

Sounds to me like a completely ignorant study and weighted to get the result they want. If you check a place like Philidelphia, of course this is the result you would get, because the people carrying guns are more likely to be involved in crimes or living in crime ridden areas. Check Dallas, or Oklahoma City. You wouldn't get that result at all. And that's because dang near everybody has guns, and we have far fewer shootings.

Identifying financial market cycles - or not

This article focuses on the work of Martin Armstrong, a technical financial analyst, who found that, "on average, there had been a panic every 8.6 years" over the period 1683-1907:

He discerned a recurrence of major turning points in the economy and in world affairs that followed a distinct and unwavering 8.6-year rhythm.

Then he found that the October 1987 crash “took place on the minor halfway point up the first leg of the 8.6-year cycle, at 2.15 years,” noting that "8.6 years was exactly … 3,141 [days], the number pi times a thousand.”

Eventually:

The model … failed, among other things, to foresee its developer’s demise. In September, 1999, Armstrong was charged with defrauding Japanese investors of nearly a billion dollars. …. The upshot, though, is that he has now spent more than nine years in jail – a pi cycle and then some.

The article includes discussions of Fibonacci-based market behavior models and the "reasoning" behind them.

Submitted by Margaret Cibes

Learning by the petabyte

Some statistics textbooks have been criticized for having small "toy" problems that do not reflect the complexity of data analysis out in the real world. What sort of data sets are out in the real world?

Facebook, for example, uses more than 1 petabyte of storage space to manage its users’ 40 billion photos. It was not long ago that the notion of one company having anything close to 40 billion photos would have seemed tough to fathom. Google, meanwhile, churns through 20 times that amount of information every single day just running data analysis jobs. In short order, DNA sequencing systems too will generate many petabytes of information a year.

Even at the best universities, students are not asked to handle data sets this large. And this is a problem.

For the most part, university students have used rather modest computing systems to support their studies. They are learning to collect and manipulate information on personal computers or what are known as clusters, where computer servers are cabled together to form a larger computer. But even these machines fail to churn through enough data to really challenge and train a young mind meant to ponder the mega-scale problems of tomorrow. "If they imprint on these small systems, that becomes their frame of reference and what they’re always thinking about," said Jim Spohrer, a director at I.B.M.'s Almaden Research Center.

Two years ago, I.B.M. and Google set out to change the mindset at universities by giving students broad access to some of the largest computers on the planet. The companies then outfitted the computers with software that Internet companies use to tackle their toughest data analysis jobs. And, rather than building a big computer at each university, the companies created a system that let students and researchers tap into giant computers over the Internet. This year, the National Science Foundation, a federal government agency, issued a vote of confidence for the project by splitting $5 million among 14 universities that want to teach their students how to grapple with big data questions.

Submitted by Steve Simon

Questions

1. What is the size of the largest data set that you have ever analyzed? Did the size of the data set force you to use a different computing system, different software, or a different statistical method?

2. Could a random sample of a few megabytes from a petabyte of data be sufficiently useful to learn on? Note that a megabyte is nine orders of magnitude smaller than a petabyte. Is it possible to have a representative sample with a data set sampled this sparsely?

3. Moore's Law says (more or less) that computing capacity doubles every two years (some sources say 18 months). If Moore's Law applies, calculate how long it will take before we see petabyte-sized hard drives on laptop computers.
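One way to set up the calculation in question 3 is sketched below. The 500 GB starting capacity is a hypothetical figure chosen for illustration, not a number from the article, and the function name is likewise an assumption; readers should substitute the drive size they consider typical.

```python
from math import log2

PETABYTE = 10**15  # bytes, using the decimal convention

def years_until(target_bytes, current_bytes, doubling_years):
    """Years of steady doubling needed to grow from current to target capacity."""
    doublings = log2(target_bytes / current_bytes)
    return doublings * doubling_years

# Hypothetical starting point: a 500 GB laptop drive (an assumption).
current = 500 * 10**9

for doubling_years in (2.0, 1.5):
    years = years_until(PETABYTE, current, doubling_years)
    print(f"doubling every {doubling_years} years: about {years:.1f} years")
```

With this starting point the growth factor is 2,000, or roughly 11 doublings, so the answer depends mostly on which doubling period is assumed.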

The unluckiest fan

The Washington Nationals baseball team posted a dismal won-lost record of 59-103 for the 2009 season. From the link above, you can listen to an interview with season-ticket holder Stephen Krupin, who watched the team lose all 19 games he attended this year. The host speculates that this must be a record for bad luck. In fact, Mr. Krupin reports that his cousin, a PhD economist, calculated the chance that this would happen as 1 in 131,204.

In comments posted on the NPR site, several listeners attempt to reproduce this calculation, but find that the event appears to be more likely than reported. It turns out that their analyses are based on the full season record--which seems natural since that record is featured so prominently in the story. However, it comes out in the interview that Mr. Krupin attended only home games. From the Major League Baseball standings we see that the Nationals were 33–48 at home and 26–55 on the road. The chance that 19 randomly selected home games are all losses is <math>{48 \choose 19}/ {81 \choose 19} </math>, which equals 1 in 131203.8, in agreement with Mr. Krupin's report.
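The calculation above is easy to check, and to compare with the full-season version that the NPR commenters tried. The sketch below treats the 19 attended games as a random sample drawn without replacement; the function name is illustrative.

```python
from math import comb

def prob_all_losses(losses, games, attended):
    """Chance that `attended` games drawn at random without replacement are all losses."""
    return comb(losses, attended) / comb(games, attended)

home = prob_all_losses(48, 81, 19)      # home record: 33 wins, 48 losses
season = prob_all_losses(103, 162, 19)  # full season: 59 wins, 103 losses

print(f"home games only: 1 in {1 / home:,.1f}")
print(f"full season:     1 in {1 / season:,.1f}")
```

The full-season figure is the more likely of the two because the Nationals lost a larger share of their road games than of their home games.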

We had another curious experience trying to get the data to match the calculation. An initial try found a sortable schedule on the Washington Nationals web site. Selecting the home games produces 85 entries: 34 wins, 48 losses and 3 postponements. Baseball fans will recognize that 34 plus 48 gives one too many home games, but how do we account for the extra win? It turns out that the May 5 game against the Astros was suspended by rain in the 11th inning, with the score tied 10-10. The game was completed on July 9, with the Nationals ultimately winning 11-10. This result appears twice in the schedule, once on each date.