November 2011

P stands for pandemic. And this article nicely describes the predicament of policymakers as they grapple with the early stages of a possible outbreak in a new strain of influenza. At best, they are basing their policies on educated guesses, with the emphasis on guessing since data is in short supply.

This situation is akin to the one described in Chapter 2 of Numbers Rule Your World. At this point in the investigation, they have identified only one cluster of cases, in the U.S., that can be traced back to pigs. "We don't want to overplay or underplay... we're trying to get that right," according to an official at WHO. A nice goal but, unfortunately, unrealistic. The reality is that there are winners and losers on either side of this zero-sum decision. Public health advocates have little to lose from false alarms - they have the "better safe than sorry" mentality. Powerful business lobbies have much to lose from false alarms - and their voices are being heard: WHO has been warned not to call this "swine flu", according to the journalist.

It's a good thing for Alex Tabarrok to draw attention to Nicholas Kristof's article on modern-day slavery in Cambodia (link), but what he calls the "arresting statistic" should really be filed under "Sentences to ponder".

Here is the statistic:

By my calculations, at least 10 times as many girls are now trafficked into brothels annually as African slaves were transported to the New World in the peak years of the trans-Atlantic slave trade.

I call this an anti-statistical statistic.

Why?

There is no statistic! In spite of his "calculations", he never told us the final count. In the last paragraph, he implied that the unit is "millions". Given that he sourced himself for this statistic, one would hope to at least see what the number is.

The only number in the "statistic" is at least 10 times. Unless he assumes that his readers know by heart the number of African slaves, we have no way of knowing if that is a big number or a small number.

There is no indication that he controlled for population growth. Recently, it has been remarked how quickly world population has grown. There are more than 4 times as many people now as in the 19th century.
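To see how much the population adjustment could matter, here is a rough sketch in Python. It takes the two figures quoted above at face value (a ratio of 10 and a fourfold increase in population); both are stated only as lower bounds, so the output is illustrative rather than an estimate.

```python
# Both inputs are the bounds quoted in the text, taken at face value.
raw_ratio = 10.0          # "at least 10 times as many" trafficked girls vs. transported slaves
population_growth = 4.0   # "more than 4 times as many people now"

# Dividing by population growth expresses the comparison per capita.
per_capita_ratio = raw_ratio / population_growth

print(f"Raw ratio: {raw_ratio:.0f}x, per-capita ratio: {per_capita_ratio:.1f}x")
```

Under these assumptions, the tenfold gap shrinks to about two and a half times once population is accounted for.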

Comparing "annually" to "peak years of" weakens an otherwise potent statement. I'm sure his point still stands if the African slave trade volume is expressed as an annual average rather than a multi-year aggregate. Why introduce an element that looks suspicious?

I'm not sure the comparison to the African slave trade is relevant. By juxtaposing them, he equates the suffering of one to that of the other.

The other problem with the comparison between Asian slave girls and African slaves is the base population: from this population pyramid, one might guess that roughly 20% of Cambodia's population falls within the age and gender definitions for this slave trafficking, and that would be roughly 2.8 million. The base population for the slave trade involving black people in Africa is a vastly different calculation. The more appropriate statistic is what proportion of these women are part of the slavery.

Alberto Contador is the elite cyclist and 2010 Tour De France winner who failed a test for clenbuterol, and his case is still being litigated with seemingly endless delays. Contador's lawyers have claimed that the clenbuterol came from an imported steak that a buddy brought into France and that Contador ate; the steak allegedly came from a cow that had been illegally injected with the substance. (I previously wrote about this case here.)

If that were true, poor Contador is the unluckiest athlete in the world.

***

The case is supposedly coming to an end although it has dragged on and on. We were told recently that his defense may be boosted by a situation that happened in Mexico this year.

In the Under-17 World Cup tournament, held earlier in Mexico, 109 out of 208 samples collected from players tested positive for clenbuterol! The anti-doping labs have decided not to prosecute these cases because Mexican farmers have been known to use clenbuterol, and all these athletes may have been victims of food "poisoning".

According to the "conventional wisdom" such as this, the Mexican situation should bolster Contador's case. It shows that Contador's explanation is credible.

***

On the contrary, this situation actually makes the anti-doping labs look good.

First, it shows the effectiveness of the test for clenbuterol. The test performs as expected. It detects the substance.

Second, what happens in Mexico does not translate to Europe. If it did, then we should have seen half of the athletes in the Tour De France failing the test but that didn't happen. Not even close. In addition, we know that European farmers are banned from using clenbuterol and in 2008-9, only one out of 83,000 samples tested positive for it. (That's why Contador must be extremely unfortunate if his story were indeed true.)
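The gap between the two settings is easier to appreciate as rates. The sketch below, in Python, uses only the counts quoted above; the variable names are mine, and it makes no assumption about how the European samples were collected.

```python
# Counts as quoted in the text.
mexico_positive, mexico_samples = 109, 208      # Under-17 World Cup in Mexico
europe_positive, europe_samples = 1, 83_000     # European samples, 2008-9

mexico_rate = mexico_positive / mexico_samples
europe_rate = europe_positive / europe_samples

# How many times more common a positive result was in the Mexican samples.
rate_ratio = mexico_rate / europe_rate

print(f"Mexico positive rate: {mexico_rate:.1%}")
print(f"Europe positive rate: {europe_rate:.4%}")
print(f"Mexico rate is roughly {rate_ratio:,.0f} times the European rate")
```

With rates this far apart, an explanation that is plausible for a sample collected in Mexico says very little about a sample collected in France.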

As I concluded in Chapter 4 of Numbers Rule Your World, it's not the job of anti-doping labs to do lie detection - their job is to do chemistry. There is no argument that clenbuterol was found in his sample. The debate is over how the clenbuterol got there; that debate can almost never be settled by a lab test. We need a powerful polygraph but we don't have one.

If Contador's positive result is upheld, there is a small chance that he would have been falsely accused. But in my view, the anti-doping world needs more false positives, not fewer. See my previous post here.

Yahoo! News tells us that it is "remarkable" that a baby born at 11:11 on 11-11-2011 (Veterans' Day) was born to a mother who was a veteran.

But how truly remarkable is this?

***

Here's how to think about this: How many babies are born at 11:11 on 11-11-2011? And what's the likelihood that a baby is born to parents at least one of whom is a veteran?

According to the CDC (link to PDF here), the average number of babies born on an average Friday in the U.S. in 2009 was 12,364. Next, say 12,500 babies were born today, what proportion of these babies would have at least one parent who is a veteran? From the Census 2000 data, we learn that about 13% of civilians over 18 are veterans. Assuming that veterans are equally likely to have children as non-veterans, then about 13% of those 12,500 babies would be born to veterans, or 1625 babies.

It would be great to know the distribution of these births throughout the day but the time-of-day data does not seem to be available. If births are uniformly distributed throughout the 24 hours (or 1440 minutes), then there are 1.13 babies born per minute to veterans. Yahoo! tells us they located the one baby born at 11:11 precisely.
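The back-of-envelope calculation so far can be written out in a few lines of Python. The inputs are the figures used above (12,500 births, a 13% veteran share, births spread evenly over the day); the variable names are mine.

```python
# Inputs taken from the text; the uniform spread over the day is an assumption.
births_per_day = 12_500        # rounded up from the CDC's average Friday
veteran_share = 0.13           # share of civilians over 18 who are veterans
minutes_per_day = 24 * 60      # 1440 minutes

veteran_births_per_day = births_per_day * veteran_share
veteran_births_per_minute = veteran_births_per_day / minutes_per_day

print(f"Births to veterans per day: {veteran_births_per_day:.0f}")
print(f"Births to veterans per minute: {veteran_births_per_minute:.2f}")
```

So under these assumptions, any given minute of the day, 11:11 included, would be expected to produce about one baby with a veteran parent.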

***

That's a start. Some refinements are possible. For example, since only one of the two parents need be a veteran, the probability that a baby would have at least one veteran parent is actually higher than 13%. The probability of neither parent being a veteran is (1-0.13)*(1-0.13) = 76%, so the probability of at least one veteran parent is 24%. This calculation assumes that one parent being a veteran is independent of the other parent being a veteran, which also may require refinement but I'll leave it at that. Using 24% instead of 13% would attribute 3,000 births a day to veteran parents.
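Here is the same refinement as a short Python sketch. It assumes, as the text does, that each parent is a veteran with probability 13% independently of the other; the variable names are mine.

```python
# Assumption from the text: each parent is a veteran independently with probability 0.13.
p_veteran = 0.13
births_per_day = 12_500

# Complement rule: P(at least one) = 1 - P(neither).
p_neither = (1 - p_veteran) ** 2
p_at_least_one = 1 - p_neither

# The text rounds the probability to 24% before multiplying.
births_rounded = births_per_day * round(p_at_least_one, 2)

print(f"P(neither parent is a veteran): {p_neither:.2%}")
print(f"P(at least one veteran parent): {p_at_least_one:.2%}")
print(f"Births per day using the rounded 24%: {births_rounded:.0f}")
```

Without the intermediate rounding, the count comes out slightly above 3,000, which makes no difference to the argument.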

The 13% is an overestimate because over 18 is not the same as child-bearing age. And of course, the assumption of births uniformly distributed throughout the day is almost surely wrong. For example, it is believed that more C-sections happen on Fridays as doctors would rather not operate over the weekend, and one could imagine that these births, if they are not emergencies, are more likely to take place in the middle of the day than say late at night. (See, for instance, this article which concluded "Nonacute C-sections (70% cephalopelvic disproportion) were not performed as frequently at night (12-8 A.M.) as at other times at three of the four hospitals".)

In his anti-science diatribe against medical "experts" on the U.S. Preventive Services Task Force, Forbes falsely characterized the recommendation as being based on cost (even though the Task Force is excluded from looking at costs). He concluded by saying:

What we need in health care is more free enterprise, not Soviet-style controls.

If health care were truly a free enterprise, then everyone should decide for themselves whether or not they want to be screened for prostate cancer, breast cancer, etc. This should also mean that everyone should pay for their own screening tests.

But that's not what Forbes wants. He wants everyone to make their own decisions but the collective to pick up the tab! This means those people who do not want to pay for ineffective tests are forced to subsidize those who do (via insurers). How is that "free enterprise"?

He also ignores the fact that nothing stops anyone from paying for their own screening tests today. (And remember, even the discoverer of the PSA Test has disowned it in a New York Times editorial last year.)

I thought readers might like a peek at how I spent three hours of my life earlier in the week. The original task was to transfer a set of customer account numbers from a SQL SERVER box to a Teradata (a data warehouse) box in order to find all customer activities associated with those accounts before a certain cutoff date.

Within minutes, it became clear that the real task was getting Teradata to recognize a column of dates as dates. The column looks like this: 07/20/2010, 07/25/2010, ...

What's the problem? You say, of course, anyone can see these are dates. Well, Teradata disagrees - and until and unless Teradata is fully convinced I have offered it a column of dates, I couldn't proceed with my original task, which was to compare these dates to the cutoff date.

Teradata thought those were strings of characters, not dates. I tried the simplest solution first: cast(datecolumn as date). Teradata wouldn't budge, complaining that the values in my datecolumn were "invalid dates".

From the manual, I confirmed that a valid way to use the cast function is cast('2010-07-20' as date). So, I needed a way to convert 07/20/2010 into '2010-07-20'.

I fumbled around a bit as I learned that Teradata does not support many classes of solutions I'm familiar with, like regular expressions, an MDY type function (which produces a date given month, day and year inputs), a find-and-substitute function, etc. So I stopped being cute, and reluctantly did it the brute-force way, using substrings and concatenates.
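For readers who want to see what the substring-and-concatenate idea amounts to, here is a minimal sketch in Python. It is not the Teradata SQL I actually ran; the function name mdy_to_iso is made up for illustration, and the sketch assumes every value is exactly ten characters in MM/DD/YYYY form.

```python
from datetime import date


def mdy_to_iso(text: str) -> str:
    """Rearrange 'MM/DD/YYYY' into 'YYYY-MM-DD' using only slicing and concatenation."""
    month = text[0:2]   # characters 1-2
    day = text[3:5]     # characters 4-5
    year = text[6:10]   # characters 7-10
    return year + "-" + month + "-" + day


iso_text = mdy_to_iso("07/20/2010")
# Parsing the result confirms that the rearranged string is a valid calendar date.
parsed = date.fromisoformat(iso_text)

print(iso_text, parsed.year, parsed.month, parsed.day)
```

The rearrangement itself is trivial; as the rest of the story shows, the hard part was getting the database to accept the result.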

First, I tested cast('2010-07-20' as date) and it created a Teradata date 07/20/2010. Yes, that looked exactly like my input data but human eyes deceive: if the database proclaimed it not to be a date, then it was not a date.

Next, I substituted '2010-07-20' with that brute-force substring-concatenate expression. To my surprise, it failed, complaining again of invalid dates. I fished out some samples of these dates. On visual inspection, they looked like dates. Smelled like dates.

Undeterred, I took out the cast-as-date function and applied the substring-concatenate expression to the full column of dates, and it ran without a hitch. Then, I put the cast-as-date function back in, which instantaneously bombed.

Now that the dumb way stumbled, I went back to being cute. Maybe I could trick Teradata by splitting into two steps, first creating a new data table with the substring-concatenate output, which seemed to have worked before, and then running the cast-as-date function.

Maybe not. As soon as I put the substring-concatenate expression together with two lines of code that generate data tables, it choked. The mystery deepened. The same code, when used without the data-table generation code, succeeded in producing Teradata dates, and yet as soon as I wrapped it inside the data-table code, it stalled. The error officially had to do with a missing something between the date variable and the comma sign inside the substring function. It was a so-called "syntax error". Very troubling because the same code ran smoothly when the output was ported to a pop-up window but when the output was to be stored in a data table, the server apparently wanted a different syntax! In any case, I couldn't figure out what Teradata was grumbling about.

***

Teradata and I were not friends at the moment. What to do? Like a spurned lover, I sought out my other good friend, SQL SERVER. What if I converted the column of dates to dates first before the data got transferred into Teradata?

So I did that. After a laborious procedure, the data got moved into Teradata. Alas, the dates still showed up as strings of text. Doubled back to SQL SERVER: there, the dates were dates. This meant the program used to transfer data between the two platforms would not recognize those as dates.

My colleague suggested a Hail Mary. (Yes, now two "data scientists" were working on this glamorous problem.) Make the dates "datetime". Datetime looks like this: 07/20/2010 00:00:00. The time component will be all zeroes since my data contains no information about time. It was a Hail Mary because we had really no rationale why the database would read datetime but not date. You just do random things when you run out of rational ideas.

It worked. It worked. It worked.

SQL Server converted the column of dates to datetime format. Teradata not only reads this correctly but reads this three times, once as a datetime, once as a date and once as a time.

And I skipped over the exasperating data transfer procedure. Importing the data into Teradata requires a special utility. Half the time, the utility will not launch properly, and when that happens, I issue a command-line instruction to reset the utility from the operating system. On this particular day, after the utility opened, it took five tries to establish a network connection. One setting must be switched from default before running the utility. Switching that setting always snips the network connection. So, it was another five tries to restore the connection. And only then could the data transfer get off the ground.

***

Three hours later, it worked. The "it" has morphed from finding customer activities to getting a database to see '07/20/2010' as a date, not text.

The SQL SERVER box has closed up shop for the night. I still haven't found a single customer transaction. The project has a long way to go.

On every project, one runs into situations like this. It's not an outlier. Welcome to the world of data science.

Andrew Gelman has two posts (link 1, link 2) about a pollster Doug Schoen who has been making news about his survey of Occupy Wall Street protestors. Doug claimed his firm interviewed 200 people in Zuccotti Park.

Gelman cites the work of Azi Paybarah, who complained that (1) Doug drew ridiculous conclusions that contradict his data; and (2) he seemed to have altered the wording of a survey question after the fact. Of course, both offenses make a mockery of polling. If you've read Charles Seife's Proofiness (my review here and here), you'd already have lost all respect for political polls like these.

***

We should examine why Doug Schoen even conducted this poll. The headline of the results was:

Our research shows clearly that the movement doesn’t represent unemployed America and is not ideologically diverse. Rather, it comprises an unrepresentative segment of the electorate that believes in radical redistribution of wealth, civil disobedience and, in some instances, violence.

Judging from this, I would say the primary purpose of this poll is to compare protestors in Zuccotti Park to "unemployed America".

In other words, the concept is flawed beyond repair.

Firstly, comparison is only possible if you have A vs. B. There is no indication that Doug's firm reached out to the cross-section of "unemployed America" to assess what they believe in. Unless Doug is unemployed America, I'm not sure how they could come to such a conclusion.

Secondly, Occupy Wall Street can never represent "unemployed America". One doesn't need a poll to know that. New York City is not America. People who live in New York City are definitely richer and more Democratic than people in most parts of America. New York City, as the beneficiary of bailout money, also does not have the unemployment rate of, say, Detroit. So, a more interesting question is whether Occupy Wall Street represents "unemployed New York City".

This overarching issue is an extension of Andrew's complaint that most of the country is in favor of taxing the very rich, which makes Doug's point that these protestors - who also support taxing the very rich - are radicals look silly. Basically, this poll has no reference point, without which one cannot draw any conclusions.

Thirdly - and this is a commentary on the general media coverage rather than Doug specifically - there is this strange idea that if you're not one of the "unemployed America", you have no standing to advocate for "unemployed America". That's like saying a white person cannot possibly support civil rights because he or she is not a colored person.