14 December 2015

I have written this post principally for people who have started following me (formally on Twitter, or in some other way) because of my somewhat peripheral involvement in the PACE trial discussions.

First off, while I try to be reasonably politically correct, I don't always get all the details right. I've tried to be respectful to all involved here. In particular, someone told me that "CFS/ME" is not always an appropriate label to use. I hope anyone who thinks that will allow me a pass on that, from my position of ignorance.

I've learned a lot about CFS/ME over the past few weeks. Some of what I've been told --- but above all, what I've observed --- about how some of the science has been conducted, has disturbed me. The people whose opinions I tend to trust on most issues, who usually put science ahead of their personal political position, seem to be pretty much unanimous that the PACE trial data need to be released so that disinterested parties can examine them.

But I want to make it clear that I have no specific interest in CFS/ME. I don't personally know anyone who suffers from it, and it's not something I've really ever thought about much. I don't especially want to become an advocate for patients, except to the extent that, having had my own health problems in the last couple of years, I wish every sick person a speedy recovery and access to the finest medical treatment they can get. So I'm not sure I can even call myself an "ally"; allies have to take a non-trivial position, and I don't think my position here is much more than trivial. If the PACE trial data emerge tomorrow, I will not personally be reanalysing them. I don't know enough about this kind of study to do so.

What I do care about is the integrity of science. You can see this, I hope, if you Google some of the stuff I've been doing in psychology. Science, imperfect though it is, is about the only rational game in town when it comes to solving the problems facing society, and when scientists put their own interests above those of the wider community, it usually doesn't turn out well.

So, on to the PACE trial... I want to say that I can understand a lot of defensiveness on the part of the PACE researchers. They have heard stories of others being harassed and even receiving death threats. Maybe some of them have experienced this themselves. For the purposes of this post (please bear with me!), I'm going to assume --- because I have no evidence to the contrary, and people generally don't make these accusations lightly --- that the stories of CFS/ME researchers being harassed in the past are true; arguably, for the purposes of this discussion, it doesn't make any difference whether they are true or not. (Of course, in another context, such claims are very important, but let me try to put that aside for now.) Apart from anything else, given the size of the CFS/ME community, it would be unreasonable not to expect some fairly unpleasant people to have also developed the condition. We all know people like that, whatever our and their health status. CFS/ME strikes people from all walks of life, including some saints and some sinners.

Now, with that said, I am unconvinced --- actually, "bewildered" would be a better word --- by the argument that releasing the data would somehow expose the researchers to (further) harassment. Indeed, it seems to me that withholding the data plays directly into the hands of those who claim that the PACE researchers have "something to hide", and they are presumably the most likely to escalate their anger into harassment. I actually don't believe that the researchers have anything to hide, in the sense of feeling guilty because they did something bad in their analyses. I've seen enough cases like this in my working life to know that incompetence --- generally in the form of a misplaced sense of loyalty to a group rather than to the wider truth and public interest --- is always to be preferred as an alternative explanation to malice, first because malice is harder to prove, and second because it almost always turns out that incompetence was behind a screw-up.

About the only reason I can sort of imagine for the argument that releasing the data might lead to harassment of the researchers, is if the alternative were for the question to somehow go away. That's perhaps a reasonable argument with some political issues; for example, there is (I think) a legitimate debate to be had over whether it's helpful to reproduce, say, cartoons that might cause people to get over-excited, when they could just be left to one side. But that's simply not going to happen here. People with a chronic, debilitating condition, and no cure in sight, are not going to suddenly forget tomorrow that they have that condition. So far, none of the replies to people who have asked for the data, and been told it will lead to harassment, have explained the mechanism by which that is supposed to happen.

The researchers' argument also seems to conflate the presence in the CFS/ME activist community of some unpleasant people --- which, again, for the sake of this discussion, I will assume is probably true --- with the idea that "anyone from the CFS/ME activist community who asks about PACE is probably trying to harass us". This is not good logic. It's what leads airline passengers to demand that Muslim passengers be thrown off their plane. It's called the base rate fallacy, and avoiding it is supposed to be what scientists --- particularly, for goodness' sake, scientists involved in epidemiology --- are good at.

A further problem with the arguments that a request for the data --- whether it comes from patients with scientific training, or scientists such as Jim Coyne --- is designed to be "vexatious" or to "lack serious purpose" or that its intent is "polemical" (all terms used by King's in their reply to Coyne), is that such arguments are utterly unfalsifiable. Given the public profile of this matter, essentially anyone who asks for the data is going to have their credentials examined, and unless they meet the unspecified high standards of the researchers, they won't get to see the data. (Yes, Jim Coyne --- who, full disclosure, is my PhD supervisor --- can be a bit shouty at times. But this is not kindergarten. Scientists don't get to withhold data from other scientists just because they don't play nice. Ask any scientist if science is about robust disagreement and you will get a "Yes", but if that idealism isn't maintained when actual robust disagreement takes place, then we might as well conduct the whole process through everything-is-fine press releases.)

Actually, in their reply to Coyne, King's College did seem to give a hint as to who might be allowed to see the data, in their statement "We would expect any replication of data to be carried out by a trained Health Economist", with a nice piece of innuendo carried over from the preceding sentence that this health economist had better have a lot of free time, because the original analysis took a year to complete. This suggests that unless you declare your qualifications as an unemployed health economist, you aren't going to be judged worthy to see the data (and if you come up with conclusions after a week, it might well be suggested that you didn't look hard enough). But the idea that it will take a year, or indeed need specialised training in health economics, to determine whether the Fisher's exact tests from the contingency tables were calculated correctly, or whether the results really show that people got better over the course of the study, is absurd. Apart from anything else, science is about communicating your results in a coherent manner to the rest of the scientific community. If you submit an article and then claim that its principal conclusions cannot be verified except by a few dozen highly trained specialists with a year's effort, that's an admission right there that your article has failed. Of course there will be questions of interpretation, over things like what "getting better" means, but nobody should have to accept the researchers' claims that their interpretation is the right one. There needs to be a debate, so that a consensus, if one is possible, can emerge. (Who knows? Maybe the evidence for CBT is overwhelming. There are plenty of neutral scientists who can reach a fair conclusion about that, but right now, they are being deprived of the opportunity to do so.)

A further point about the failure to share data is that the researchers agreed, when they published in PLoS ONE, to make their data available to anyone who asked for it. This is a condition of publishing in that journal. You can't have the cake of "we're transparent, we published in an open access journal" and then eat that cake too with "but you can't see the data". PLoS ONE must insist that the authors release the data as they agreed to do as a condition of publication, or else retract the article because their conditions of publication have been breached. See Klaas van Dijk's formal request in this regard.

These data are undoubtedly going to come out at some point anyway. The UK's Information Commissioner will see to that, even if PLoS ONE doesn't persuade the authors to release the data. As the risk management specialist Peter Sandman points out, openness and transparency at the earliest possible stage translate into reduced pain and costs further down the line.

I want to end with a small apology. I wrote a post yesterday on an unrelated topic (OK, it was also critical of some poor science, but the relation with the subject of this post was peripheral). Two people submitted comments on that post which drew a link with the PACE trial. After some thought, I decided not to publish those comments, as I wanted to keep discussion on that other post on-topic. I apologise to the authors of those comments that Blogger.com's moderation system did not let me explain the reasons why they were not published. I would happily publish those same comments on this post; indeed, I will publish pretty much any reasonable comments on this post.

13 December 2015

*** Post updated 2015-12-19 20:00 UTC
*** See end of post for a solution that matches the reported percentages and chi-squares.
A few days ago, I blogged about Professor Amy Cuddy's op-ed piece in the New York Times, in which she cited a non-published, non-peer reviewed study about "iPosture" by Bos and Cuddy of how people allegedly deferred more to authority when they used smaller (versus larger) computing devices, because using smaller devices caused them to hunch (sorry, "iHunch") more, and then something something assertiveness something something testosterone and cortisol something. (The authors apparently didn't do anything as radical as to actually measure, or even observe, how much people hunched, if at all; they took it for granted that "smaller device = bigger iHunch", so that the only possible explanation for the behaviours they observed was the one they hypothesized. As I noted in that other post, things are so much easier if you bypass peer review.)

Just for fun, I thought I'd try and reconstruct the contingency tables for "people staying on until the experimenter came and asked them to leave the room" from the Bos and Cuddy article, mainly because I wanted to make my own estimate of the effect size. Bos and Cuddy reported this as "[eta] = .374", but I wanted to experiment with other ways of measuring it.

In their Figure 1, which I have taken the liberty of reproducing below (I believe that this is fair use, according to Harvard's Open Access Policy, which is to be found here), Bos and Cuddy reported (using the dark grey bars) the percentage of participants who left the room to go and collect their pay, before the experimenter returned. Those figures are 50%, 71%, 88%, and 94%. The authors didn't specify how many participants were in each condition, but they had 75 people and 4 conditions (phone, tablet, laptop, desktop), and they stated that they randomised each participant to one condition. So you would expect to find three groups of 19 participants and one of 18.

However, it all gets a bit complicated here. It's not possible to obtain all four of the percentages that were reported (50%, 71%, 88%, and 94%), rounded conventionally, from a whole number of participants out of 18 or 19. Specifically, you can take 9 out of 18 and get 50%, or you can take 17 out of 18 and get 94% (0.9444, rounded down), but you can't get 71% or 88%, with either 18 or 19 as the cell size. So that suggests that the groups must have been of uneven size. I enumerated all the possible combinations of four cell sizes from 13 to 25 which added up to 75 and also allowed for the percentages of participants who left the room, correctly rounded, to be one of the integers we're looking for. Here are those possible combinations, with the total numbers of participants first and the percentage and number of leavers in parentheses:

Well, I guess that's also "randomised" in a sense. But if your sample sizes are uneven like this, and you don't report it, you're not helping people to understand your experiment.
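For anyone who wants to repeat the search over uneven cell sizes, here is a minimal brute-force sketch. It makes some assumptions of its own: "rounded conventionally" is taken to mean rounding halves up, the four cells are taken in the order of the reported percentages, and the names (`TARGETS`, `rounded_pct`, `feasible`, `candidates`) are purely illustrative. The exact procedure used for this post isn't shown, so the number of candidates the sketch prints need not match the count given in the text.

```python
from itertools import product

TARGETS = (50, 71, 88, 94)   # reported percentages of leavers
TOTAL = 75                   # reported number of participants
SIZES = range(13, 26)        # cell sizes considered: 13 to 25

def rounded_pct(k, n):
    """Percentage k/n rounded half up, using integer arithmetic only."""
    return (200 * k + n) // (2 * n)

def leaver_counts(n, target):
    """Leaver counts out of n whose rounded percentage equals target."""
    return [k for k in range(n + 1) if rounded_pct(k, n) == target]

# For each target percentage, map every feasible cell size to its leaver
# count. With n <= 25, adjacent counts differ by at least 4 points, so at
# most one count per cell size can round to a given percentage.
feasible = [
    {n: leaver_counts(n, t)[0] for n in SIZES if leaver_counts(n, t)}
    for t in TARGETS
]

# Keep every combination of four feasible cell sizes that sums to 75.
candidates = [sizes for sizes in product(*feasible) if sum(sizes) == TOTAL]

for sizes in candidates:
    # Each entry is (cell size, number of leavers) for one condition.
    print([(n, feasible[i][n]) for i, n in enumerate(sizes)])
```

Integer arithmetic is used for the rounding because Python's built-in `round` rounds halves to the nearest even number, which would treat cases such as 14 out of 16 (87.5%) differently.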

But maybe they still round their numbers by hand at Harvard for some reason, and sometimes they make mistakes. So let's see if we can get to within one point of those percentages (49% or 51% instead of 50%, 70% or 72% instead of 71%, etc). And it turns out that we can, just, as shown in the figure below, in which yellow cells are accurately-reported percentages, and orange cells are "off by one". We can take 72% for N=18 instead of 71%, and 89% for N=19 instead of 88%. But then, we only have a sample size of 73. So we could allow another error, replacing 94% for N=18 with 95% for N=19, and get up to a sample of 74. Still not right. So, even allowing for three of their four percentages to be misreported, the per-cell sample sizes must have been unequal.

However, if I was going to succeed in my original aim of reconstructing plausible contingency tables, there would be too many combinations to enumerate if I included these "off-by-one" percentages. So I went back to the five possible combinations of numbers that didn't involve a reporting error in the percentages, and computed the chi-square values for the contingency tables implied by those numbers, using the online calculator here. They came out between 10.26 and 12.37, with p values from .016 to .006; this range brackets the numbers reported by Bos and Cuddy (chi-square 11.03, p = .012), but none of them matches those values exactly; the closest is the last set (22, 21, 16, 16) with a chi-square of 11.22 and a p of .011.

So, I'm going to tentatively presume that in fact the sample sizes were all equal (give or take one for not having a number of participants divisible by four), and it's in fact the percentages on the dark grey bars in Bos and Cuddy's Figure 1 that are wrong. For example, if I build this contingency table:

Leavers        9     14     16     18
Stayers        9      5      3      1
% Leavers     50%    74%    84%    95%

then the sample size adds up to 75, the per-condition sample sizes are equal, and the chi-square is 11.086 and the p value is .0113. That was the closest I could get to the values of 11.03 and .012 in the article, although of course I could have missed something. These numbers are close enough, I guess, although I'm not sure if I'd want to get on an aircraft built with this degree of attention to detail; we still have inaccuracies in three of the four percentages as well as the approximate chi-square statistic and p value.
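If you'd rather check these figures without an online calculator, the following is a minimal sketch of a Pearson chi-square test for a 2 x 4 table of counts. The function names are mine, and the p value uses a closed-form expression that is valid only for 3 degrees of freedom (which is what a 2 x 4 table has). The leaver counts for the (22, 21, 16, 16) case are my own reconstruction from the reported percentages, not numbers taken from the article.

```python
import math

def chi_square_2xk(leavers, stayers):
    """Pearson chi-square statistic for a 2 x k table of counts."""
    col_totals = [l + s for l, s in zip(leavers, stayers)]
    total = sum(col_totals)
    stat = 0.0
    for row in (leavers, stayers):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_df3(stat):
    """Upper-tail probability of a chi-square variable with 3 degrees of
    freedom (closed form; not valid for other degrees of freedom)."""
    return (math.erfc(math.sqrt(stat / 2))
            + math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2))

# Equal-sized cells (18, 19, 19, 19), as in the table above.
equal_cells = chi_square_2xk([9, 14, 16, 18], [9, 5, 3, 1])
print(round(equal_cells, 3), round(p_value_df3(equal_cells), 4))

# Uneven cells (22, 21, 16, 16); leaver counts reconstructed (assumption)
# as 11, 15, 14, 15 from the reported 50%, 71%, 88%, 94%.
uneven_cells = chi_square_2xk([11, 15, 14, 15], [11, 6, 2, 1])
print(round(uneven_cells, 2), round(p_value_df3(uneven_cells), 3))
```

The two printed pairs should agree with the values quoted in the text for these tables (11.086 and .0113; 11.22 and .011), up to rounding.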

Normally in circumstances like this, I'd think about leaving a comment on the article on PubPeer. But it seems that, in bypassing the normal academic publishing process, Professor Cuddy has found a brilliant way of avoiding, not just regular peer review, but post-publication peer review as well. In fact, unless the New York Times directs its readers to my blog (or another critical review) for some reason, Bos and Cuddy's study is impregnable by virtue of not existing in the literature.

PS: This tweet, about the NY Times article, makes an excellent point:

We're living in an era dominated by flimsy pop psychology. To see how, replace "phone" with "book" in this article. https://t.co/ACvfgjRCjs

Presumably we should all adopt the wide, expansive pose of the broadsheet newspaper reader. Come to think of it, in much of the English-speaking world at least, broadsheets are typically associated with higher status than tabloids. Psychologists! I've got a study for you...

PPS: The implications of the light grey bars, showing the mean time taken to leave the room by those who didn't stay for the full 10 minutes, are left as an exercise for the reader. In the absence of standard deviations (unless someone wants to reconstruct possible values for those from the ANOVA), perhaps we can't say very much, but it's interesting to try and construct numbers that match those means.

*** Update 2015-12-19 20:00 UTC: An alert reader has pointed out that there is another possible assignment of subjects to the conditions:
16 (50%=8), 24 (71%=17), 17 (88%=15), 18 (94%=17)
This gives the Chi-square of 11.03 and p of .012 reported in the article.
So I guess my only remaining complaint (apart from the fact that the article is being used to sell a book without having undergone peer review) is that the uneven cell sizes per condition were not reported. This is actually a surprisingly common problem, even in the published literature.

Daniel Kahneman's warning of a looming train wreck in social psychology took another step closer towards realisation today with the publication of this opinion piece in the New York Times.

In the article, entitled "Your iPhone Is Ruining Your Posture — and Your Mood", Professor Amy Cuddy of Harvard Business School reports on "preliminary research" (available here) that she performed with her colleague, Maarten Bos. Basically, they gave some students some Apple gadgets to play with, ranging in size from an iPhone up to a full-size desktop computer. The experimenter gave the participants some filler tasks, and then left, telling them that s/he would be back in five minutes to debrief and pay them, but that they could also come and get him/her at the desk outside. S/he then didn't come back after five minutes as announced, but instead waited ten minutes. The main outcome variable was whether the participants came to get their money, and if they did how long they waited before doing so, as a function of the size of the device that they had. This was portrayed as a measure of their assertiveness, or lack thereof.

It turned out that, the smaller the device, the longer they waited, thus showing reduced assertiveness. The authors' conclusion was that this was caused by the fact that, to use a smaller device, participants had to slouch over more. The authors even have a cute name for this: the "iHunch". And — drumroll please, here's the social priming bit — the fact that the participants with smaller devices were hunched over more made them more submissive to authority, which made them more reluctant to go and tell the researcher that they were ready to get paid their $10 participation fee and go home.

It's hard to know where to begin with this. There are other plausible explanations, starting with the fact that a lot of people don't have an iPhone and might well enjoy playing with one compared to their Android phone, whereas a desktop computer is still just a desktop computer, even if it is a Mac. And the effect size was pretty large: the partial eta-squared of the headline result is .177, which should be compared to Cohen's (1988) description of a partial eta-squared of .14 as a "large" effect. Oh, and there were 75 participants in four conditions, making a princely 19 per cell. In other words, all the usual suspect things about priming studies.

But what I find really annoying here is that we've gone straight from "preliminary research" to the New York Times without any of those awkward little academic niceties such as "peer review". The article, in "working paper" form (1,000 words) is here; check out the date (May 2013) and ask yourself why this is suddenly front-page news when, after 30 months, the authors don't seem to have had time to write a proper article and send it to a journal, although one of them did have time to write 845 words for an editorial in the New York Times. But perhaps those 845 words didn't all have to be written from scratch, because — oh my, surprise surprise — Professor Cuddy is "the author of the forthcoming book 'Presence: Bringing Your Boldest Self to Your Biggest Challenges.'" Anyone care to take a guess as to whether this research will appear in that book, and whether its status as an unreviewed working paper will be prominently flagged up?

If this is the future — writing up your study pro forma and getting it into what is arguably the world's leading newspaper, complete with cute message that will appeal to anyone who thinks that everybody else uses their smartphone too much — then maybe we should just bring on the train wreck now.

23 June 2015

Twitter was buzzing, or something, this morning, with the news that Amazon is going to change the commission rates that it charges researchers who use Mechanical Turk (henceforth: MTurk) participants to take surveys, quizzes, personality tests, etc.

(This blog post contains some MTurk jargon. My previous post was way too long because I spent too much time summarising what someone else had written, so if you don't know anything about MTurk concepts, read this.)

The changes to Amazon's rates, effective July 21, 2015, are listed here, but since that page will probably change after July, I took a screenshot:

Here's what this means. Currently, if you hire 100 people to fill in your survey and want to give them $1 each, you pay Amazon $110 for "regular" workers and $130 for "Masters". Under the new pricing scheme, this will be $140 and $145, respectively. That's an increase of 27.3% and 11.5%, respectively. (I'm assuming, first, that the wording about "10 or more assignments" means "10 or more instances of the HIT being executed, not necessarily by the same worker", and second, that any psychological survey will need more than 10 assignments.)

Twitter users were quite upset about this. Someone portrayed this as a "400% increase", which is either a typo, or a miscalculation (Amazon's commission for "regular" workers is going from 10% to 40%, which even expressed as "$10 to $40 on a $100 survey" is actually a 300% increase), or a misunderstanding (the actual increase in cost for the customer is noted in the previous paragraph). People are talking of using this incident as a reason to start a new, improved platform, possibly creating an international participant pool.
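The difference between "the commission went up by 300%" and "the bill went up by 27.3%" is easy to check. This is a minimal sketch with illustrative function names; the 10% and 40% commission rates for "regular" workers are the ones mentioned above, while the 30% and 45% rates for "Masters" are inferred from the $130 and $145 totals rather than taken from Amazon's page.

```python
def total_cost(n_workers, reward, commission_rate):
    """What the researcher pays: worker rewards plus Amazon's commission."""
    return n_workers * reward * (1 + commission_rate)

def pct_increase(old, new):
    """Percentage increase from old to new."""
    return 100 * (new - old) / old

old_regular = total_cost(100, 1.00, 0.10)
new_regular = total_cost(100, 1.00, 0.40)
old_masters = total_cost(100, 1.00, 0.30)   # rate inferred from the $130 total
new_masters = total_cost(100, 1.00, 0.45)   # rate inferred from the $145 total

print(f"Regular: ${old_regular:.0f} -> ${new_regular:.0f}, "
      f"up {pct_increase(old_regular, new_regular):.1f}%")
print(f"Masters: ${old_masters:.0f} -> ${new_masters:.0f}, "
      f"up {pct_increase(old_masters, new_masters):.1f}%")
print(f"Commission on $100 of rewards: $10 -> $40, "
      f"up {pct_increase(10, 40):.0f}%")
```

The same `total_cost` function can be used for any other combination of sample size and reward.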

Frankly, I think there is a lot of heat and not much light being generated here.

First, researchers are going to have to face up to the fact that by using MTurk, they are typically exploiting sub-minimum wage labour. (There are, of course, honourable exceptions, who try to ensure that online survey takers are fairly remunerated.) The lowest wage rate I've personally seen in the literature was a study that paid over 100 workers the princely sum of $0.25 each for a task that took 20 minutes to complete. Either those people are desperately poor, or they are children looking for pocket money, or they are people who just really, really like being involved in research, to an extent that might make some people wonder about selection bias.

I have asked researchers in the past how they felt about this exploitation, and the standard answer has been, "Well, nobody's forcing them to do it". The irony of social psychologists --- who tend not to like it when someone points out that they overwhelmingly self-identify as liberal and this is not necessarily neutral for science --- invoking essentially the same arguments as exploitative corporations for not paying people adequately for their time, is wondrous to behold. (It's not unique to academia, though. I used to work at an international organisation, dedicated to human rights and the rule of law, where some managers who made six-figure tax-free salaries were constantly looking for ways to get interns to do the job of assistants, or have technical specialists agree to work for several months for nothing until funding "maybe" came through for their next contract.)

Second, I have doubts about the validity of the responses from MTurk workers. Some studies have shown that they can perform as well as college students, although maybe it's best to take on the "Master"-level workers, whose price is only going up 11.5%; and I'm not sure that college students ought to be regarded as the best benchmark [PDF] here. But there are technical problems, such as issues with non-independence of data [PDF] --- if you put three related surveys out there, there's a good chance that many of the same people may be answering them --- and the population of MTurk workers is a rather strange and unrepresentative bunch of people; the median participant in your survey has already completed 300 academic tasks, including 20 in the past week. One worker completed 830,000 MTurk HITs in 9 years; if you don't want to work out how many minutes per HIT that represents assuming she worked for 16 hours a day, 365 days a year, here's the answer. Workers are overwhelmingly likely to come from one of just two countries, the USA and India, presumably because those are the countries where you can get paid in real cash money; MTurk workers in other countries just get credit towards an Amazon gift card (which, when I tried to use it, could only be redeemed on the US site, amazon.com, thus incurring shipping and tax charges when buying goods in Europe). Maybe this is better than having your participants being all from just one country, but since you don't know what the mix of countries is (unless you specify that the HIT will only be shown in one country), you can't even make claims about the degree of generalisability of your results.
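For the record, the minutes-per-HIT arithmetic for that one very busy worker can be done in a few lines. The inputs are just the figures quoted above (830,000 HITs, 9 years, 16 hours a day, 365 days a year); the variable names are illustrative, and leap days are ignored.

```python
hits = 830_000
years = 9
hours_per_day = 16

# Total minutes worked, assuming 365 working days per year.
minutes_worked = years * 365 * hours_per_day * 60
minutes_per_hit = minutes_worked / hits

print(f"{minutes_worked:,} minutes worked")
print(f"{minutes_per_hit:.1f} minutes per HIT")
```

That comes to a little under four minutes per HIT, with no days off.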

Third, this increase really does not represent all that much money. If you're only paying $33 to run 120 participants at $0.25, you can probably afford to pay $42. That $9 increase is less than you'll spend on doughnuts at the office mini-party when your paper gets accepted (but it won't go very far towards building, running, and paying the electricity bill for your alternative, post-Amazon solution). And let's face it, if these commission rates had been in place from the start, you'd have paid them; the actual increase is irrelevant, just like it doesn't matter when you pay $20 for shipping on a $2 item from eBay if the alternative is to spend $50. All those people tweeting "Goodbye Amazon" aren't really going to switch to another platform. At bottom, they're just upset because they discovered that a corporation with a monopoly will exploit it, as if they really, really thought that things were going to be different this time (despite everyone knowing that Amazon abuses its warehouse workers and has a history of aggressive tax avoidance). Indeed, the tone of the protests is remarkable for its lack of direct criticism of Amazon, because that would require an admission that researchers have been complicit with its policies, to an extent that I would argue goes far beyond the average book buyer. (Disclosure: I'm a hypocrite who orders books or other goods from Amazon about four times a year. I have some good and more bad justifications for that, but basically, I'm not very political, the points made above notwithstanding.)

Bottom line: MTurk is something that researchers can, and possibly (this is not a blog about morals) "should", be able to do without. Its very existence as a publicly available service seems to be mostly a matter of chance; Amazon doesn't spend much effort on developing it, and it could easily disappear tomorrow. It introduces new and arguably unquantifiable distortions into research in fields that already have enough problems with validity. If this increase in prices led to people abandoning it, that might be a good thing. But my guess is that they won't.

Acknowledgement: Thanks to @thosjleeper for the links to studies of MTurk worker performance.

05 June 2015

(Note: this is more or less my first solo foray into unaided statistical and methodological criticism. Normally I hitch a ride on the coat-tails of my more experienced co-authors, hoping that they will spot and stop my misunderstandings. In this case, I haven't asked anybody to do that for me, so if this post turns out to be utter garbage, I will have only myself to blame. But it probably won't kill me, so according to the German guy with the fancy moustache, it will make me stronger.)

Among all the LaCour kerfuffle last week, this article by Hu et al. in Science seems to have slipped by with relatively little comment on social media. That's a shame, because it seems to be a classic example of how fluffy articles in vanity journals can arguably do more damage to the cause of science than outright fraud.

I first noticed Hu et al.'s article in the BBC app on my tablet. It was the third article in the "World News" section. Not the Science section, or the Health section (for some reason, the BBC's write-up was done by their Health correspondent, although what the study has to do with health is not clear); apparently this was the third most important news story in the world on May 29, 2015.

Hu et al.'s study ostensibly shows that certain kinds of training can be reinforced by having sounds played to you while you sleep. This is the kind of thing the media loves. Who cares if it's true, or even plausible, when you can claim that "The more you sleep, the less sexist and racist you become", something that is not even suggested in the study? (That piece of crap comes from the same newspaper that has probably caused several deaths down the line by scaremongering about the HPV vaccine; see here for an excellent rebuttal.) After all, it's in Science (aka "the prestigious journal, Science"), so it must be true, right? Well, let's see.

Here's what Hu et al. did. First, they had their participants take the Implicit Association Test (IAT). The IAT is, very roughly speaking, a measure of the extent to which you unconsciously endorse stereotypically biased attitudes, e.g. (in this case) that women aren't good at science, or Black people are bad. If you've never taken the IAT, I strongly recommend that you try it (here; it's free and anonymous); you may be shocked by the results, especially if (like almost everybody) you think you're a pretty open-minded, unbigoted kind of person. Hu et al.'s participants took the IAT twice, and their baseline degree of what I'll call for convenience "sexism" (i.e., the association of non-sciencey words with women's faces; the authors used the term "gender bias", which may be better, but I want an "ism") and "racism" (association of negative words with Black faces) was measured.

Next, Hu et al. had their participants undergo training designed to counter these undesirable attitudes. This training is described in the supplementary materials, which are linked to from the article (or you can save a couple of seconds by going directly here). The key point was that each form of the training ("anti-sexism" and "anti-racism") was associated with its own sound that was played to the participants when they did something right. You can find these sounds in the supplementary materials section, or play them directly here and here; my first thought is that they are both rather annoying, having seemingly been taken from a pinball machine, but I don't know if that's likely to have made a difference to the outcomes.

After the training session, the participants retook the IAT (for both sexism and racism), and as expected, performed better. Then, they took a 90-minute nap. While they were asleep, one of the sounds associated with their training was selected at random and played repeatedly to each of them; that is, half the participants had the sound from the "anti-sexism" part of the training played to them, and the other half had the sound from the "anti-racism" aspect played to them. The authors claimed that "Past research indicates" that this process leads to reinforcement of learning (although the only reference they provided is an article from the same lab with the same corresponding author).

Now comes the key part of the article. When the participants woke up from their nap, they took the IAT (again, for both sexism and racism) once more. The authors claimed that people who were "cued" with the sound associated with the anti-sexism training during their nap further improved their performance on the "women and science" version of the test, but not the "negative attitudes towards Black people" version (the "uncued" training); similarly, those who were "cued" with the sound associated with the anti-racism training became even more unconsciously tolerant towards Black people, but not more inclined to associate women with science. In other words, the sound that was played to them was somehow reinforcing the specific message that had been associated with that sound during the training period.

Finally, the authors had the participants return to their lab a week later, and take the IAT for both sexism and racism, one more time. They found that performance had slipped --- that is, people did worse on both forms of the IAT, presumably as the effect of the training wore off --- but that this effect was greater for the "cued" than the "uncued" training topic. In other words, playing the sound of one form of the training during their nap not only had a beneficial effect on people's implicit, unconscious attitudes (reinforcing their training), but this effect also persisted a whole week later.

So, what's the problem? Reactions in the media, and from scientists who were invited to comment, concentrated on the potential to save the world from sexism and racism, with a bit of controversy as to whether it would be ethical to brainwash people in their sleep even if it were for such a good cause. However, that assumes that the study shows what it claims to show, and I'm not at all convinced of that.

Let's start with the size of the study. The authors reported a total of 40 participants; the supplementary materials mention that quite a few others were excluded, mostly because they didn't enter the "right" phase of sleep, or they reported hearing the cueing sound. That's just 20 participants in each condition (cued or uncued), which is less than half the number you need to have 80% power to detect that men weigh more than women. In other words, the authors seem to have found a remarkably faint star with their very small telescope [PDF].
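To put a rough number on "small telescope", here is a minimal sketch of a power calculation for comparing two independent groups, using the normal approximation. The effect size (`ASSUMED_D = 0.5`, a "medium" standardized difference) is purely my assumption for illustration; it does not come from Hu et al.'s article, and the function names are my own.

```python
from math import ceil, sqrt
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution


def power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    noncentrality = d * sqrt(n_per_group / 2)
    # The negligible probability of rejecting in the wrong tail is ignored.
    return _norm.cdf(noncentrality - z_crit)


def n_per_group_needed(d, power=0.80, alpha=0.05):
    """Approximate per-group sample size needed to reach the requested power."""
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    z_power = _norm.inv_cdf(power)
    return ceil(2 * (z_crit + z_power) ** 2 / d ** 2)


ASSUMED_D = 0.5  # hypothetical effect size, NOT taken from the article

print(f"Power with 20 per group: {power_two_groups(ASSUMED_D, 20):.2f}")
print(f"Per-group n for 80% power: {n_per_group_needed(ASSUMED_D)}")
```

Under that assumption, 20 people per group gives you roughly a one-in-three chance of detecting a real effect, and you would need about three times as many participants to get to 80%.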

The sample size problem gets worse when you examine the supplemental material and learn that the study was run with two samples; in the first, 21 participants survived the winnowing process, and then eight months later, 19 more were added. This raises all sorts of questions. First, there's a risk that something (even if it was apparently insignificant: the arrangement of the computers in the IAT test room, the audio equipment used to play the sounds to the participants, the haircut of the lab assistant) changed between the first and second rounds of testing. More importantly, though, we need to know why the researchers apparently chose to double their sample size. Could it be because they had results that were promising, but didn't attain statistical significance? They didn't tell us, but it's interesting to note that in Figures S2 and S3 of the supplemental material, they pointed out that the patterns of results from both samples were similar(*). That doesn't prove anything, but it suggests to me that they thought they had an interesting trend, and decided to see if it would hold with a fresh batch of participants. The problem is, you can't just peek at your data, see if the result is statistically significant, and if not, add a few more participants until it is. That's double-dipping, and it's very bad indeed; at a minimum, your statistical significance needs to be adjusted, because you had more than one try to find a significant result. Of course, we can't prove that the six authors of the article looked at their data; maybe they finished their work in July 2014, packed everything up, got on with their lives until February 2015, tested their new participants, and then opened the envelope with the results from the first sample. Maybe. (Or maybe the reviewers at Science suggested that the authors run some more participants, as a condition for publication. Shame on them, if so; the authors had already peeked at their data, and statistical significance, or its absence, is one of those things that can't be unseen.)
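For anyone who doubts that a single peek matters, here is a small simulation sketch. It generates data with no true effect at all, tests once after 21 observations and again after all 40 (the two batch sizes mentioned above), and counts how often at least one of those looks comes out "significant". Everything else is a simplifying assumption of mine: a one-sample z-test with a known standard deviation of 1, rather than whatever tests Hu et al. actually ran.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate_peeking(n_first=21, n_added=19, n_sims=5000, alpha=0.05, seed=1):
    """Return (false-positive rate testing once at the end,
    false-positive rate when also peeking after the first batch)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        # The null hypothesis is true: every observation has mean 0, SD 1.
        data = [rng.gauss(0.0, 1.0) for _ in range(n_first + n_added)]
        z_first = fmean(data[:n_first]) * sqrt(n_first)
        z_all = fmean(data) * sqrt(len(data))
        sig_first = abs(z_first) > z_crit
        sig_all = abs(z_all) > z_crit
        fixed_hits += sig_all                  # honest: one test, full sample
        peeking_hits += sig_first or sig_all   # stop early if the peek "works"
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate_peeking()
print(f"Test once at n=40:       {fixed_rate:.3f}")
print(f"Peek at n=21, then n=40: {peeking_rate:.3f}")
```

The honest rate should hover around the nominal 5%; the peeking rate can only be the same or higher, because every run that is significant at the end is also counted for the peeker, plus the runs that happened to look good at the halfway point.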

The gee-whiz bit of the article, which the cynic in me suspects was at least partly intended for rapid consumption by naive science journalists, is Figure 1, a reasonably-sized version of which is available here. There are a few problems with the clarity of this Figure from the start; for example, the blue bars in 1B and 1F look like they're describing the same thing, but they're actually slightly different in height, and it turns out (when you read the labels!) that in 1B, the left and right sides represent gender and race bias, not (as in all the other charts) cued and uncued responses. On the other hand, the green bars in 1E and 1F both represent the same thing (i.e., cued/uncued IAT results a week after the training), as do the red bars in 1D and 1E, but not 1B (i.e., pre-nap cued/uncued IAT results).

Apart from that possible labelling confusion, Figure 1B appears otherwise fairly uncontroversial, but it illustrates that the effect (or at least, the immediate effect) of anti-sexism training is, apparently, greater than that of anti-racism training. If that's true, then it would have been interesting to see results split by training type in the subsequent analyses, but the authors didn't report this. There are some charts in the supplemental material showing some rather ambiguous results, but no statistics are given. (A general deficiency of the article is that the authors did not provide a simple table of descriptive statistics; the only standard deviation reported anywhere is that of the age of the participants, and that's in the supplemental material. Tables of descriptives seem to have fallen out of favour in the age of media-driven science, but --- or "because"? --- they often have a lot to tell us about a study.)

Of all the charts, Figure 1D perhaps looks the most convincing. It shows that, after their nap, participants' IAT performance improved further (compared to their post-training but pre-sleep results) for the cued training, but not for the uncued training (e.g., if the sound associated with anti-sexism training had been played during their nap, they got better at being non-sexist but not at being non-racist). However, if you look at the error bars on the two red (pre-nap) columns in Figure 1D, you will see that they don't overlap. This means that, on average, participants who were exposed to the sound associated with anti-sexism were performing significantly worse on the sexism component of the IAT than the racism component, and vice versa. In other words, there was more room for improvement on the cued task versus the uncued task, and that improvement duly took place. This suggests to me that regression to the mean is one possible explanation here. Also, the significant difference (non-overlapping error bars) between the two red bars means that the authors' random assignment of people to the two different cues (having the "anti-sexism" or "anti-racism" training sound played to them) did not work to eliminate potential bias. That's another consequence of the small sample size.
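To show what regression to the mean looks like when nothing at all is going on, here is a toy simulation. It is a sketch under assumptions that are entirely mine (each person has a stable "true" bias score, each measurement adds independent noise of the same size, and there is no training and no cueing); it is not a model of Hu et al.'s data.

```python
import random
from statistics import fmean


def regression_to_mean_demo(n_people=10000, seed=1):
    """Measure everyone twice with noise; return the mean first and second
    scores of the people who scored in the top half the first time."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(0.0, 1.0) for _ in range(n_people)]
    first = [t + rng.gauss(0.0, 1.0) for t in true_scores]
    second = [t + rng.gauss(0.0, 1.0) for t in true_scores]
    cutoff = sorted(first)[n_people // 2]
    high = [i for i in range(n_people) if first[i] > cutoff]
    mean_first = fmean([first[i] for i in high])
    mean_second = fmean([second[i] for i in high])
    return mean_first, mean_second


before, after = regression_to_mean_demo()
print(f"High scorers, first measurement:  {before:.2f}")
print(f"High scorers, second measurement: {after:.2f}")
```

The group that looked worst the first time "improves" on the second measurement with no intervention whatsoever, simply because part of what made them look bad was noise.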

Similar considerations apply to Figure 1E, which purports to show that cued "learning" persisted a week afterwards. Most notable about 1E, however, is what it doesn't show. Remember, 1D shows the IAT results before and after the nap. 1E uses data from a week after the training, but it doesn't compare the IAT results from a week later with the ones from just after the nap; instead, it compares them with the results from just before the nap. Since the authors seem to have omitted to display in graphical form the most direct effect of the elapsed week, I've added it here. (Note: the significance stars are my estimate. I'm pretty sure the one star on the right is correct, as the error bars just fail to overlap; on the left, there should be at least two stars, but I'm going to allow myself a moment of hyperbole and show three. In any case, as you'll see in the discussion of Figure 1F, this is all irrelevant anyway.)

So, this extra panel (Figure 1E½?) could have been written up something like this: "Cueing during sleep did not result in sustained counterbias reduction; indeed, the cued bias increased very substantially between postnap and delayed testing [t(37) = something, P = very small], whereas the increase in the uncued bias during the week after postnap testing was considerably smaller [t(37) = something, P = 0.045 or thereabouts]." However, Hu et al. elected not to report this. I'm sure they had a good reason for that. Lack of space, probably.

Combining 1D and 1E, we get this chart (no significance stars this time). My "regression to the mean" hypothesis seems to find some support here.

Figure 1F shows that Hu et al. have committed a common fallacy in comparing two conditions on the basis of one showing a statistically significant effect and the other not (in fact, they committed this fallacy several times in their article, in their explanation of almost every panel of Figure 1). They claimed that 1F shows that the effect of cued (versus uncued) training persisted after a week, because the improvement in IAT scores over baseline for the cued training (first blue column versus first green column) was statistically significant, whereas the corresponding improvement for the uncued training (second blue column versus second green column) was not. Yet, as Andrew Gelman has pointed out in several blog posts with similar titles over the past few years, the difference between “statistically significant” and “not statistically significant” is not in itself necessarily statistically significant. (He even wrote an article [PDF] on this, with Hal Stern.) The question of interest here is whether the IAT performance for the topics (sexism or racism) of cued and uncued training, which were indistinguishable at baseline (the two blue columns), was different at the end of the study (the two green columns). And, as you can see, the error bars on the two green columns overlap substantially; there is no evidence of a difference between them.
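A worked example may make the fallacy more concrete. The numbers below are invented for illustration (they are not read off Figure 1), and the sketch assumes two independent estimates with normally distributed errors, which is a simplification for a within-participants design like this one.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate, std_error):
    """Two-sided p-value for an estimate, using a normal approximation."""
    z = abs(estimate) / std_error
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical effects and standard errors, chosen only for illustration.
effect_a, se_a = 25.0, 10.0   # "significant" on its own
effect_b, se_b = 10.0, 10.0   # "not significant" on its own

# The comparison that actually matters: the difference between the two.
diff = effect_a - effect_b
se_diff = sqrt(se_a ** 2 + se_b ** 2)  # assumes independent estimates

p_a = two_sided_p(effect_a, se_a)
p_b = two_sided_p(effect_b, se_b)
p_diff = two_sided_p(diff, se_diff)

print(f"Effect A alone: p = {p_a:.3f}")
print(f"Effect B alone: p = {p_b:.3f}")
print(f"A minus B:      p = {p_diff:.3f}")
```

One effect clears the p < .05 bar and the other doesn't, yet the test of the difference between them, which is the claim being made, comes nowhere near significance.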

One other point to end this rather long post. Have a look at Figure 2 and the associated description. Maybe I'm missing something, but it looks to me as if the authors are proudly announcing how they went on a fishing expedition:

They don't tell us how many combinations of parameters they tried to come up with that lone significant result; nor, in the next couple of paragraphs, do they give us any theoretical justification other than handwaving why the product of SWS and REM sleep duration (whose units, the label on the horizontal axis of Figure 2 notwithstanding, are "square minutes", whatever that might mean) --- as opposed to the sum of these two numbers, or their difference, or their ratio, or any one of a dozen other combinations --- should be physiologically relevant. Indeed, selecting the product has the unfortunate effect of making half of the results zero - I count 20 dots that aren't on the vertical axis, for 40 participants. I'm going to guess that if you remove those zeroes (which surely cannot have any physiological meaning), the regression line is going to be a lot flatter than it is at present.
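Why does the number of combinations tried matter? A back-of-the-envelope sketch: if each candidate predictor is tested at the .05 level, the chance that at least one comes up "significant" by luck alone grows quickly with the number of attempts. The calculation below assumes the tests are independent, which they would not be for different combinations of the same two sleep measures, so treat it as a rough guide to the direction of the problem rather than an exact figure. We also don't know how many combinations were actually tried; the counts below are hypothetical.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


# Hypothetical numbers of predictor combinations that might have been tried.
for n_tests in (1, 4, 12):
    rate = familywise_error_rate(n_tests)
    print(f"{n_tests:2d} combinations tried: {rate:.0%} chance of a spurious hit")
```

With a dozen independent tries, you would have close to an even chance of finding something to report even if sleep duration had nothing to do with anything.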

Bottom line: I have difficulty believing that there is anything to see here. We can put off the debate about the ethics of subliminally improving people for a while, or at least rest assured that it's likely to remain an entirely theoretical problem.

(*) Incidentally, each red- or green-coloured column in one of the panes of Figure S3 corresponds to approximately five (5) participants. You can't even detect that men are taller than women with that.

21 May 2015

Another story of apparent scientific fraud has hit the headlines. I'm sure that most people who are reading this post will have seen that story and formed their own opinions on it. It certainly doesn't look good. And the airbrushing of history has already begun, as you can see by comparing the current state of this page on the website of the MidWest Political Science Association with how it looked back in March 2015 (search for "Fett" and look at the next couple of paragraphs). Meanwhile, Michael LaCour hastily replaced his CV (which was dated 2015-02-09) with an older version (dated 2014-09-01) that omitted his impressive-looking list of funding sources (see here for the main difference between the two versions); at this writing (2015-05-22 10:37 UTC), his CV seems to be missing entirely from his site.

This rapidly- (aka "hastily-") written post is in response to some tweets calling for fraudsters to be banned from academia for life. I have a few problems with that.

First, I'm not quite sure what banning someone would mean. Are they to have "Do Not Hire In Any Academic Context" tattooed on their forehead? In six languages? Or should we have a central "Do Not Hire" repository, with DNA samples to prevent false identities (and fingerprints to prevent people impersonating their identical twin)?

Second, most fraudsters don't confess, nor are they subjected to any formal legal process (Diederik Stapel is a notable exception, having both confessed in a book [PDF] and been given a community service penalty, as well as what amounts to a 6-figure fine, by a court in the Netherlands). As far as I can tell, these people tend to deny any involvement, get fired, disappear for a while, and then maybe turn up a few years later teaching mathematics at a private high school or something, once the publicity has died down and they've massaged their CVs sufficiently. Should that be forbidden too? How far do we let our dislike of people who have let us down extend to depriving them of any chance of earning a living in future?

After all, we rehabilitate people who kill other people; indeed, in some cases, we rehabilitate them as academics. And as the case of Frank Abagnale shows, sometimes a fraudster can be very good at detecting fraud in others. Perhaps we should give the few fraudsters who confess a shot at redemption. Sure, we should treat their subsequent discoveries with skepticism, and we probably won't allow them to collect data unsupervised, but by simply casting them out, we miss an opportunity to learn, both about what drove (and enabled) them to do what they did, and how to prevent or mitigate future cases. We study all kinds of unpleasant things, so why impose this blind spot on ourselves?

Let's face it, nobody likes being the victim of wrongdoing. When I came downstairs a couple of years ago to find that my bicycle had been stolen from my yard overnight, the one time that I didn't lock it because it was raining so hard when I arrived home that I didn't want to stay out in the rain a second longer to do it, I was all in favour of the death penalty, or at the very least lifelong imprisonment with no possibility of parole, for bicycle thieves. The inner reactionary in me had come out; I had become the conservative that apparently emerges whenever a liberal gets mugged. Yet, we know from research (that we have to presume wasn't faked --- ha ha, just kidding!) that more severe punishments don't deter crime, and that what really makes a difference [PDF] is the perceived chance of being caught (and/or sentenced). And here, academia does a really, really terrible job.

First, our publishing system is, to a first approximation, completely broken. It rewards style over substance in a systematic way (and Open Access publishing, in and of itself, will not fix this). As outside observers of any given article, we are fundamentally unable to distinguish between reviewers who insist on more rigour because our work needs more rigour, and those who have missed the point completely; anyone who has had an article rejected from a journal that has also recently published some piece of "obvious" garbage will know this feeling (especially if our article was critical of that same garbage, and seems to be being held to a totally different set of standards [PDF]).

Second, we --- society, the media, the general public, but also scientists among ourselves (I include myself in the set of "scientists" here mostly for syntactic convenience) --- lionize "brilliant" scientists when they discover something, even though that something --- if it's a true scientific discovery --- was surely just sitting there waiting to be discovered. (Maybe this confusion between scientists and inventors will get sorted out one day; I think it's a very fundamental problem. Perhaps we would be better off if Einstein hadn't been so photogenic.) And that's assuming that what the scientist has discovered is even, as the saying goes, "a thing", a truth; let's face it, in the social sciences, there are very few truths, only some trends, and very little from which one can make valid predictions about people with any worthwhile degree of reliability. (An otherwise totally irrelevant aside to illustrate this gap: one of the most insanely cool things I know of from "hard" science is that GPS uses both special and general relativity to make corrections to its timing, and those corrections go in opposite directions.) We elevate the people who make these "amazing discoveries" to superstar status. They get to fly business class to conferences and charge substantial fees to deliver a keynote speech in which they present their probably unreplicable findings. They go on national TV and tell us how their massive effect sizes mean that we can change the world for $29.99.

Thus, we have a system that is almost perfectly set up to reward people who tell the world what it wants to hear. Given those circumstances, perhaps the surprising thing is that we don't find out about more fraud. We can't tell with any objectivity how much cheating goes on, but judging by what people are prepared to report about their own and (especially) their colleagues' behaviour, what gets discovered is probably only the tip of a very large and dense iceberg. It turns out that there are an awful lot of very hungry dogs eating a lot of homework.

I'm not going to claim that I have a solution, because I haven't done any research on this (another amusing point about reactions to the LaCour case is how little they have been based on data and how much they have depended on visceral reactions; much of this post also falls into that category, of course). But I have two ideas. First, we should work towards 100% publication of datasets, along with the article, first time, every time. No excuses, and no need to ask the original authors for permission, either to look at the data or to do anything else with them; as the originators of the data, you'll get an acknowledgement in my subsequent article, and that's all. Second, reviewers and editors should exercise extreme caution when presented with large effect sizes for social or personal phenomena that have not already been predicted by Shakespeare or Plato. As far as most social science research is concerned, those guys already have the important things pretty well covered.

(Updated 2015-05-22 to incorporate the details of LaCour's CV updates.)

09 May 2015

The European Commission is giving financial backing to a company that claims its technology can read your emotional state by just having you look into a webcam. There is some sceptical reporting of this story here.

Highlights: "Realeyes is a London based start-up company that tracks people's facial reactions through webcams and smartphones in order to analyse their emotions. ... Realeyes has just received a 3,6 million euro funding from the European Commission to further develop emotion measurement technology. ... The technology is based on six basic emotional states that, according to the research of Dr Paul Ekman, a research psychologist, are universal across cultures, ages and geographic locations. The automated facial coding platform records and then analyses these universal emotions: happiness, surprise, fear, sadness, disgust and confusion. ... [T]his technological development could be a very powerful tool not only for advertising agencies, but as well for improving classroom learning, increasing drivers’ safety, or to be used as a type of lie detector test by the police."

Of course, this is utterly stupid. For one thing, it treats emotions as if they are real tangible things that everyone agrees upon, whereas emotions research is a messy field full of competing theories and models. I don't know what Ekman's research says, or what predictions it makes, but if it really suggests that one can reduce everything about what a person is feeling at any given moment to one of six (or nine, or twelve) choices on a scale, then I don't think I live in that world (and I certainly don't want to). For another, without some form of baseline record of a person's face, it's going to be close to impossible to tell what distortions are being heaped on top of that by emotions. Think of people you know whose "neutral" expression is basically a smile, and others who walk round with a permanent scowl on their faces.

Now, I don't really care much if this kind of thing is sold to gullible "brand-led" companies who are told that it will help them sell more upmarket branded crap to people. If those companies want to waste their marketing and advertising dollars, they're welcome. (After all, many of them are currently spraying those same dollars more or less uselessly in advertising on Twitter and Facebook.) But I do care when public money is involved, or public policy is likely to be influenced.

Actually, it seems to me that the major problem here is not, as some seem to think, the "big brother" implications of technology actually telling purveyors of high-end perfumes or watches, or the authorities, how we're really feeling, although of course that would be intensely problematic in its own right. A far bigger problem is how to deal with all of the false positives, because this stuff just won't work - whatever "work" might even mean in this context. At least if a "traditional" (i.e., post-2011 or so) camera wrongly claims to have located you in a given place at a given time, it's plausible that you might be able to produce an alibi (for example, another facial recognition camera placing you in another city at exactly the same time, ha ha). But when an "Emocam" says that you're looking fearful as you, say, enter the airport terminal, and therefore you must be planning to blow yourself up, there is literally nothing you can do to prove the contrary. Dr. Ekman's "perfect" research, combined with XYZ defence contractor's "infallible" software, has spoken.

The computer says you are disgusted. I am a member of a different ethnic group. Are you disgusted at me? Are you some kind of racist?

Welcome to this job interview. Hmm, the computer says you are confused. We don't want confused people working for us.

So now we're all going to have to learn another new skill: faking our emotions so as to fool the computer. Not because we want to be deceptive, but because it will be messing with our lives on the basis of mistakes that, almost by definition, nobody is capable of correcting. ("Well, Mr. Brown, you may be feeling happy now, but seventeen minutes ago, you were definitely surprised. We've had this computer here for three years now, and I've never seen it make a wrong judgement.") I suspect that this is going to be possible although moderately difficult, which will just give an advantage to the truly determined (such as the kind of people that the police might be hoping to catch with their new "type of lie detector").

In a previous life, but still on this blog, I was a "computer guy". In a blog post from that previous life, I recommended the remarkable book, "Digital Woes: Why We Should Not Depend on Software" by Lauren Ruth Wiener. Everything that is wrong with this "emotion tracking" project is covered in that book, despite its publication date of 1993 and the fact that, as far as I have been able to determine, the word "Internet" doesn't appear anywhere in it. I strongly recommend it to anyone who is concerned about the degree to which not only politicians, but also other decision-makers including those in private-sector organisations, so readily fall prey to the "Shiny infallible machine" narrative of the peddlers of imperfect technology.

01 May 2015

Introductory disclaimer: This blog post is intended to be about the selective interpretation of statistics. Many of the figures under discussion are about reported rates of violence against women, and any criticisms or suggestions regarding research in this field are solely in reference to research methods. Nothing in this commentary is in any way doubting the very real experiences of women facing violence and abuse, nor placing responsibility for the correct reporting of abuse on the women experiencing it. Violence against women and girls (VAWG) is an extremely serious issue, which is exactly why it deserves the most robust research methods in order to bring it to light.

Back in February 2014, I wrote a post in which I noted the seemingly high correlation between “national happiness” ratings for certain countries and per-capita consumption of antidepressants in those countries. Now I’ve found what I think is an even better example of the limitations of ranking countries based on some simplified metric. I’ve asked my friend Clare Elcombe Webber, a commissioner for VAWG services, to help me here. So from this point on, we’re writing in the plural...

A few months ago, this tweet from Joe Hancock (@jahoseph) appeared in Nick’s feed. It shows, for 28 EU countries, the percentage of women who report having been a victim of (sexual or other) violence since the age of 15. Guess which country tops this list? Yep, Denmark. Followed by Finland, Sweden, and the Netherlands. Remember them? The countries that are up there in the top 5 or 10 of almost every happiness survey ever performed? Down near the bottom: miserable old Portugal, ranked #22 out of 23 in happiness in the post linked to above. (The various lists of countries don’t match exactly between this blog post and the one linked to above because there are different membership criteria, with some reports coming from the OECD, EU, or UN. Portugal was kept off the bottom of the happiness list in the post about antidepressants by South Korea.)

This warranted some more investigating, along the lines of Nick’s previous exploration of the link between happiness and antidepressants. The original survey data page is here; click on “EU map” and use the dropdown list to choose the numbers you want. Joe’s tweet is based on the first drop-down option, “Physical and/or sexual violence by a partner or a non-partner since the age of 15”. While performing the tests that we describe later in this post, we also tried the next option, “Physical and/or sexual violence by a partner [i.e., not a non-partner] since the age of 15”, but this didn’t greatly change the results. In what follows, unless otherwise stated, we have used the numbers for VAWG perpetrated by both partners and non-partners.

First, Nick took his existing dataset with 23 countries for which the OECD supplied the antidepressant consumption numbers, and stripped it down to those 17 which are also EU members. Then, he ran the same Spearman correlations as before, looking for the correlations between UN World Happiness Index ranking and: /a/ antidepressant consumption (Nick did this last time, but the numbers will be slightly different with this new subset of 17 countries); /b/ violence reported by women. Here are the results, which at first sight are rather disturbing:

Let’s repeat that: Among the 17 largest economies within the EU, the degree of violence since age 15 reported by women is very strongly correlated with national happiness survey outcomes. When things turn out to be correlated at .831, you generally start looking for reasons why you aren’t in fact measuring the same thing twice without knowing it.

Trying to look for some way of mitigating these figures, Nick tried another approach, this time with parametric statistics. He took the percentage of women reporting being the victims of violence in all 28 EU countries, and compared it with the points score (out of 10) from the UN Happiness Survey. Here is the least pessimistic result obtained from the various combinations:

Across all 28 EU countries, violence against women correlated (Pearson’s r) .497 (p=.007) with national happiness.

This is still not very good news. If you’re hoping to show that two phenomena in the social sciences are correlated, and you find a correlation of .497, you’re generally pretty pleased.

Of course, correlation is not the same as causation. Probably nobody would suggest that higher levels of violence against women make for a happier society, or that higher levels of general societal happiness cause people to become more violent towards women.

So what is going on here? Maybe the FRA’s methods are indeed seriously flawed. We have difficulty imagining why Austrian women would report rates of interpersonal violence barely half those experienced by Luxembourgers, or that Scandinavians are assaulting women at over twice the rate of Poles, or that the domestic violence problem in the UK is 70% worse than in next-door Ireland.

But perhaps there are some other factors that might help to explain these numbers. Remember, these are answers being given to an interviewer from the EU Fundamental Rights Agency (FRA); they are not extracted from, say, police databases of complaints filed. Thus, while we can perhaps assume that the reports ought not to be affected too much by the perceived level of danger or social shame involved in revealing one’s situation to the authorities (it’s easy to imagine that people in countries with high levels of equality and openness, such as Denmark, might feel more able to file charges about violence than in some other countries that are perceived as being more “macho”), the degree to which these data reflect reality will depend to a large extent on people’s degree of willingness to admit being a victim to a stranger. While one would hope that the FRA had thought about that and done the maximum in terms of study and questionnaire design, training of interviewers, etc., to allow women to be frank about their experiences, this isn’t something we were able to find definitively in their reported methodology (available here).

There are huge issues, which have dogged this type of research for many decades, when it comes to asking women to disclose their experiences of abuse. The conventional wisdom amongst researchers and service providers is that victims of abuse are extremely unlikely to reveal their experiences to anyone, and short of the FRA interviewers spending months building rapport with each respondent (which, obviously, they did not do) there is little to be done to mitigate this. Here are just some possible reasons why experiences of abuse might not have been disclosed to researchers, and how this could impact on the results:

· The sampling method involved visiting randomly selected addresses. A common tactic used by abusive partners is to isolate their victim, primarily as a way of stopping any disclosure or attempt to seek support; so it is not unlikely that women currently in abusive relationships were “not allowed” to take part in the research at all. (If we wish to make great leaps of logic here, we could theorise that this could lead to a higher apparent incidence of VAW in countries with better support services, as women in those countries were more likely to have been able to leave an abusive situation, and therefore were more able to take part in the research. But we don’t have data for that…)

· Many women do not identify their experiences as violent or abusive, even when most
external observers would say that they plainly are. This may be a defence mechanism, allowing them
to avoid having to face up to the truth about their partner, the fragility of
their personal safety, or the frightening nature of the world. Admitting that they are the victims of
violence or abuse would also imply that they may have to act to change their
situation. Therefore, respondents could
simply be lying; and, even if a measure of social desirability might be able to
detect this (possibly a tall order for such a serious subject), it’s unlikely
that the interviewer would administer such a measure. Alternatively, the degree to which women deny
that their experiences are violent or abusive might have a substantial cultural
component; perhaps women in more “traditional” countries are more likely to
justify some behaviours towards them as “normal”.

· It is not clear, from the methodological background of the report, how issues of
confidentiality were explained to respondents. We can reasonably conjecture that if a
respondent disclosed that they were currently at serious risk from someone,
the interviewer would have been ethically obliged to do something
additional with this information. Many
abusers make threats of violence or serious reprisals should their victim make
a disclosure (something borne out by the fact that the majority of serious
injuries or murders of women by men they know occur at or shortly after the
point of separation or disclosure of the abuse to a third party), and this
would significantly impact whether or not a woman would answer these questions
truthfully. In addition, perceived fear
of the authorities may discourage a woman from disclosing; in many countries,
the police and social workers often do not have a glowing reputation for
providing support, and women may feel that involving them would exacerbate
their problems, rather than help to resolve them.

· Finally, victims who have disclosed their abuse often talk of their feelings of guilt, or
of feeling that they are to blame for the abuse. This
shame could be an additional barrier to giving a truthful answer.

We
can make some—admittedly sweeping—inferences from the fact that the data do not
tell us what we would intuitively expect. We could speculate that those countries we
might expect to be more socially “advanced” in terms of attitudes to violence
against women could have higher rates of disclosures of abuse in this research because
women in those countries feel more able to recognise and name their
experiences, or feel more confidence in the authorities being supportive, or
have greater trust in the confidentiality of the survey; and therefore are more
prepared to report having been the victims of violence. A further conjecture could be that in these
countries, women are socially “trained” that these experiences are neither
normal nor acceptable, and that victims of violence are entitled to be heard,
without being stigmatised. (However, a
skeptic might respond that, while these assumptions enable us to put a positive
spin on this slightly unusual dataset, they are still only assumptions for
which we have little evidence, and do little to address the initial observation,
namely that the countries in the EU deemed to be happiest also reported the
highest levels of violence against women.) We could add all sorts of social variables into the mix here: availability
of relationship education, social stigma towards single mothers, the perception
of the state as supportive (or not), and so on. Violence against women and girls is a melting
pot of individual, social, and cultural variables, and to date researchers have
not been able to neatly set out what it is which makes some men decide to be
abusive towards women, nor what makes some communities turn a blind eye to such
abuse or even place the blame on the women being abused. Respondents potentially have many more reasons
to conceal their experiences of violence and abuse than they might in other
research areas, and there is no straightforward way of controlling for these. (Psychologists have devised various ways of
controlling for social desirability biases, but it is not clear to us that
these take sufficient account of cross-cultural factors; see Saunders, 1991.)

However,
let’s assume for a moment that it might be valid to take the numbers in the
report as not being directly reflective of the underlying problem, but instead
as representing the actual prevalence multiplied by a “willingness
to acknowledge” factor. At a certain
point, this could mean that you could see higher numbers in the survey for
countries where there’s actually less of a problem. For example, let’s say that the true rate of
violence against women in Denmark is 60%, and that 87% of Danish women are
prepared to discuss their experiences of violence openly; multiply those
together, and there’s the 52% reported rate from the EU survey. Meanwhile, perhaps the true rate in Poland is
76% (note: we have no evidence for this; we are choosing Poland here only
because it is the country at the bottom end of the FRA’s list), but only 25% of
Polish women are prepared to discuss it; again, multiply those numbers together
and you get the reported rate of 19%. In
fact, this line of reasoning is commonly used by people working on the front
line of VAWG support. For example, in
one London borough, reports to the police of domestic abuse in 2014 were over 40%
higher than in 2013, and this is considered to be a good thing; it’s assumed that
the majority of domestic abuse goes unreported, and thus additional reports are
just that: additional reports, rather than additional instances. But without more data from other sources and
approaches, we just don’t (and can’t) know.
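
The arithmetic in the paragraph above can be sketched in a few lines of Python. This is only an illustration of the "prevalence times willingness" idea: the function names are ours, the prevalence and willingness figures are the made-up ones from the text (we have no evidence for them), and the assumption that the two factors simply multiply is itself a simplification.

```python
def reported_rate(true_prevalence, willingness):
    """Rate a survey would record if only a fraction of victims disclose."""
    return true_prevalence * willingness


def implied_prevalence(reported, willingness):
    """True prevalence implied by a reported rate, for an assumed willingness."""
    if not 0 < willingness <= 1:
        raise ValueError("willingness must be in (0, 1]")
    return reported / willingness


# Hypothetical (true prevalence, willingness to acknowledge) pairs from the text.
scenarios = {
    "Denmark": (0.60, 0.87),
    "Poland": (0.76, 0.25),
}

for country, (prevalence, willingness) in scenarios.items():
    rate = reported_rate(prevalence, willingness)
    print(f"{country}: true {prevalence:.0%} x willing {willingness:.0%} "
          f"= reported {rate:.0%}")

# The same reported figure fits very different underlying realities,
# depending on the willingness we choose to assume.
for willingness in (0.60, 0.87, 1.00):
    print(f"reported 52%, willingness {willingness:.0%} -> "
          f"implied prevalence {implied_prevalence(0.52, willingness):.0%}")
```

The second loop is the real point: a single reported number cannot separate prevalence from willingness, so any given ranking of countries is compatible with many different stories unless the willingness factor can be estimated from some other source.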

Here’s
the kicker, though: if you choose to take the line that these figures “can’t
possibly be right”, and that in fact they may even show the opposite of the
real problem, that raises the question of why it’s OK to look for an
alternative explanation for the figures on violence (or other social issues,
such as, perhaps, antidepressant usage), but not for those on other phenomena,
such as (self-reported) happiness. What
gives data on happiness the kind of objective quality that legitimises all the column
inches, TV airtime of happiness gurus, and government policy initiatives to try
and boost their country’s rank from 18 to 10 in the UN World Happiness Index,
if you’re simultaneously prepared to try to look very hard for reasons to explain
away numbers that appear to show that your favourite “happy” country is a
hotbed of violence against women?

And,
even more importantly: whatever your position, do you have evidence for it?

You can find the dataset for this post here.
(Yes, the filename does give away how
long we have been working on this post!) It also includes all the data you need to re-examine the post about
antidepressants from February 2014.

29 March 2015

When I used to work in an office, my boss used
to say that any time he had a good idea, he could come to me and in ten
minutes he'd know everything that might go wrong with it. He often went
ahead anyway, and his ideas often worked, but at least he was
forewarned. So in that spirit, here goes.

I'm a little concerned by all the hype around Open Access (OA) journals.

Yes,
I know that traditional journal publishers are evil, and make more
money and higher gross margins and have bigger car parking spaces than Apple, and
I agree that when taxpayers fund research then taxpayers should
have access to it. However, I'm not sure that, just because all of the
above may be true, our current model of OA journals is necessarily the solution. I have a number of concerns about what may happen as the OA model takes hold and, as everybody tells me is going to happen, becomes dominant. This post is intended to start a discussion on those concerns, if anyone's interested.

1. It's the economy, stupid

One of
the strengths
of the traditional publishing model is that, to a first approximation
and allowing for all kinds of special circumstances, the editor-in-chief
of a half-decent journal doesn't have to worry about filling it.
Indeed, many journals proudly promote their rejection rate on their home
page, next to their average turnaround time. "We reject 90% of
submissions; don't waste our time unless you've got a good story to
tell", is the message (with, of course, all of the predictable effects
on publication bias that this implies). Doubtless the editor-in-chief
has some financial targets to meet, perhaps in terms of not blowing the
production budget on full-page colour pictures of kittens, but this is
not a job whose holder is principally tasked with revenue generation.
The money is coming in
pretty steadily from sales of packages of journals to institutions
around the world (even if some of these institutions are starting to
take exception). The editor gets to concentrate on, among other things, maintaining the journal's impact factor --- hopefully using methods that are a little less blatant than this.

With
OA journals, funded principally by
article processing charges paid by authors, things are likely to be a
little different. No matter how dedicated to academic integrity and the
highest possible scientific standards the editorial staff want to be,
money is right there in the equation every day, especially for an online journal
with almost no physical limits to its size. How many articles can we
get through the
review process this month? Can we upsell the author to the full-colour
package? Why do so many people want hardship waivers? (Oh, and I have yet to see any suggestion that OA journals will be less concerned about their impact factor than traditional journals.)

The
idea, of course, is that authors will not be reaching into their own
pockets to pay the article processing charges. The intention is that
the fee of $1,000 or so to publish the results should be budgeted for out of the project's funding. After all, it's only a waffer-thin thousand bucks, the kind of money some projects
probably have slopping around at the end anyway if the participants didn't eat all of the M&Ms. But once funding
agencies catch on, will they allow grant proposals to include specific line items for
OA publishing, when publication in one of the high-prestige traditional
journals --- which you promised them, earlier in the proposal, were
definitely going to be interested in your groundbreaking project --- is
free? And what about the independent researcher with no budget, who may
have something interesting to say, but no money? Should such a person
have to fund publication from their own pocket?

I'm afraid that the money always, always
finds a way to affect things. Someone, somewhere in the process, will be directly incentivised to
increase revenue. (In France, where I live, gambling is a state
monopoly, which means that whatever arms-length construction they have
put together, somewhere there is someone who essentially works for the
government and yet has a performance target to sell more
scratchcards to the urban poor, even though gambling is officially a social problem.) How
does this affect you as the editor-in-chief of an OA journal?
Maybe you ask your action editors to tell reviewers to be less picky
about certain things.
Maybe you suggest to an author that splitting these results into two
articles will be to everyone's advantage - after all, the publication
fee is coming out of the grant money, and as it stands it is a pretty
long manuscript for someone to have to wade through at one
sitting...

The corollary of this is that the PI presenting an article for publication is a paying customer. Now when I go to make a $1,000 purchase, I'm generally greeted with open arms. I certainly don't expect to have to pass quality control checks before I'm allowed to spend my $1,000. The psychology of the OA model is going to be interesting indeed. (Compare what happened in the UK when public universities started to charge tuition fees; all of a sudden, the idea of a student being given a failing grade became, for many people, a consumer protection issue. "I paid to come here and get a degree, how dare you tell me I can't have one?", ran the argument. Too many unhappy punters, and the Vice-Chancellor is touring the stricter departments to ask them to be a little more, um, flexible in their marking criteria.)

I found a pertinent example shortly before putting this post (which has taken a while to draft) online. Here
is a note from Nandita Quaderi, who is "Publishing Director, Open
Research" at Scientific Reports, which is part of Nature Publishing
Group. Nandita is pleased to announce that henceforth, "a selection of
authors submitting a biology manuscript to Scientific Reports
will be able to opt-in to a fast-track peer-review service". Needless
to say, this service comes "at an additional cost", being provided by a
for-profit organisation called Research Square. (An editor of Scientific Reports has resigned over this.) So now, I'm paying to publish, and I'm paying to have my article
reviewed. What could possibly go wrong with the objectivity and rigour
of the scientific process?

2. Access is not the biggest problem science faces right now

Another issue is that most
OA journals do
not address the ongoing problems of the peer review system. I would
argue that
currently, failures of peer review are a bigger threat to science than
paywalls.
If reviewers are allowing bad science through --- or erroneously
recommending rejection of good articles --- then getting free access to
the
resulting error-filled literature is the least of our problems; and I
have yet to see a coherent argument why the OA review process might be
inherently any more rigorous than that at traditional journals.

Some
online journals, such as The Winnower, have adopted a radical solution
to this: anyone can publish an article, without any prior review process, with the idea that people will come along and
review it afterwards. This seems attractive at first sight, except that people
typically have even less incentive to act as a reviewer once the article
is "out there", even if it doesn't yet have the status of a citable
article with a DOI (a status which, incidentally, the article's own
author decides to award it, at a time of his or her own choosing).

It
seems to me that OA journals are to some extent hitching a ride on the
back of the traditional journals, which have created (and still sustain)
the fundamental mode of operation that we know and love/hate: author sends in MS,
editor checks it, editor selects reviewers, reviewers approve or request
changes, editor finally accepts or rejects. This system more or
less works --- give or take the criticisms of peer review as "broken",
which have a lot of merit but which, as I noted above, OA (in and of
itself) doesn't seem to me to do much to address --- because people generally
have confidence in it. Not necessarily absolute confidence, but we know how it's meant to work and how to spot when it isn't working. We (like to) believe that the editors do not
generally accept (too many) articles from themselves and their buddies (or at least, that they risk getting called out for it if they do), that they
select reviewers who are competent in the relevant subfields, that the
reviewers do an honest and unbiased job, etc. (Of course, the reviewer who is doing
"excellent quality control" with *your* article is an
incompetent idiot who has failed to understand even the most basic concepts
of *my* article, but that's part of the game.)

So, when something like Collabra,
the new OA mega-journal from the University of California, launches,
they can put pictures of respected people on the front page where they introduce their editorial board, thus
sending a message that the
review process will be every bit as rigorous as it is for a traditional
journal. Readers are reassured, and authors know they will need to
submit work of a high standard. But to me this only works because the
majority of people who are being held up as examples of the quality of
the journal have good reputations, which have been made within the
traditional process. How does this scale? What does the publication
process look like in 10 or 20 years' time, if the traditional journals
have mostly gone and we make our reputations with OA (web-)publishing,
blogs, and social media presence? (Yes, impact factor is broken. But where is the dominant, credible alternative that everyone will be prepared to switch to?)

This doesn't mean that Collabra will be full of articles promoting
homeopathy after a few months. But over time, the relationship between authors, reviewers, and journals will change, in ways that we can't necessarily predict. That doesn't mean the sky will fall, but it does mean that there will be perverse situations that may or may not be worse than what we have to put up with now.

3. Ham, spam, and all points in between

I also worry that the line between "legitimate" and
"spam"
OA journals will start to blur. Currently we can all point and laugh at
the semi-literate invitations to publish in (or join the Editorial
Board of) those pseudo-journals with
plausible-sounding names, strange salutation styles in their e-mails,
and an editorial address in a Regus suite in
San Antonio, from which manuscripts are presumably forwarded to the
journal's real staff in Cairo
or
Mumbai. But these fraudulent (whatever that means...) journals will
improve, and it will become hard to tell the
"fake" from the "real".
A few weeks ago, I was asked to review an
article by an OA journal that was part of a London-based publishing
outfit. I genuinely couldn't decide
if they were spammers or genuine: the journals mentioned on their web
site all seem to exist, and about a third of them are indexed in
PubMed. How good or bad is that? I recommended rejection, as the
article
would have been of little interest to the
readers of
the journal, according to its own profile. I wonder what the lead author did next (assuming that my recommendation to reject was the editor's verdict as well)? Did he appeal, as a "paying customer", to the editor in chief? Or did he maybe send the article to another OA journal, on the basis that he will eventually find somebody, somewhere, who wants $1,000? (*)

I think,
though, that perhaps the bigger risk in the meeting of "legitimate" and
"spam" journals is through the trimming of standards at the "legitimate"
end. Look at what happened when the Saudis decided
to throw some money at education, and suddenly King Abdulaziz University is ranked #7 in the world in mathematics.
Uh-huh. Sure. So what happens when that university, or others with rather more money to
burn than academic integrity, starts
its own OA mega-journal? Exactly what will be the conditions of
scientific neutrality under which the editor-in-chief reviews articles
by, say, the children of minor members of the Saudi ruling family?
Perhaps someone will create an authoritative
clearing house to administer a sliding scale of which
journals are "real" versus "spam". But who would run such an
organisation? The AAAS? ISO? Standard & Poor's? Google?
And who would ultimately be responsible for the "legit"/"spam" decisions?

Historically, publisher-led journals seem
to have been mostly
spam-free; it would be interesting to establish why this was.
High barrier to entry in the world of ink and paper? Old-fashioned
academic and intellectual integrity, despite the profits? Risk of
reputational
damage if, say, Springer (cough) or Sage (cough) were to acquire a reputation for publishing garbage? I
don't know what the reasons are, but it created the current situation
whereby --- whatever the other problems in the system --- a journal that exists in a print edition is generally regarded, at
least by default, as having some degree of seriousness. I worry that
we will end up in a situation where we don't have a simple way to tell
whether we can take a "journal" (in the widest possible sense) seriously
or not. In such situations, humans tend to apply some simple
heuristics, which scammers have many centuries' worth of experience exploiting.

4. A modest (and, as yet, barely sketched out) proposal

Do I
have an alternative? Well, when my boss came to me with his ideas, I
usually didn't, but in this case I do have a tentative suggestion. What
if the funding agencies ran a few journals? After all, these are (generally) the representatives of the
taxpayers, who --- as the Open Access movement is right to point out ---
pay for the research and ought to have free access to the results. Yet currently, they rely on "the system" to work, and on researchers to muddle their way through that system. In the traditional model, the readers pay, and in the OA model, the authors pay. Both systems have their deficiencies. Supposing we had a parallel model where nobody paid (except a general fund, set up to guarantee neutrality)?

Those of a libertarian bent might argue that the government shouldn't be
involved in academic publishing, but the stable door closed on that
when we started to take their money to do the research. Some might also argue that a funding agency-sponsored journal might be highly politicised, but then, /a/ why should it be more politicised than the handing out of the money, /b/ the Rind/Lilienfeld saga showed that politicians can pressure "independent" journal publishers into submission too, and /c/ there will always be other outlets; I'm just modestly proposing a "third way". (As a bonus, this would
seem to be a good fit with the aims of the pre-registration movement.)

Notes:

1. I'm aware that this is a rather long and at times rambling post. It started life in a frenzied evening of writing just
after I got out of hospital after a stay that lasted the better part of
three weeks, and that still shows. I should probably have scrapped it and started
again, or at least sat down and rearranged the paragraphs, but I wanted
to get the ideas out there within a reasonable time frame. I hope some of them are useful.

2. I want to thank Rolf Zwaan for some helpful discussions on an earlier draft of this post.
Rolf disagreed with much of what I had written, and I've only made a
few changes, so he probably still disagrees with a lot of it. I should point out that my use of the example of Collabra (for whom Rolf is an editor) above is not based on any specific criticism of that journal; it is merely a salient example. Rolf's tweet about his appointment as an editor at Collabra was the spark for my writing of this post.

(*) Update 2016-11-28: I was re-reading this post because reasons, and I noticed this dangling question. I googled the title of the article... sure enough, it was accepted, despite my recommendation to reject.