Sunday, 26 May 2013

A couple of weeks ago, the Observer printed a debate headlined “Do we need to change the way we are thinking about mental illness?” I read it with interest, as I happen to think that we do need to change, and that the new Diagnostic and Statistical Manual of the American Psychiatric Association (DSM5) has numerous problems.

The discussion was opened by Simon Wessely, a member of the Royal College of Psychiatrists, who responded No. He didn’t exactly defend the DSM5, but he disagreed with the criticism that it reduces psychiatry to biology. The Yes response was by Oliver James, an author and clinical psychologist, who attacked the medical model of mental illness, noting the importance of experience, especially childhood experience, in causing psychiatric symptoms. I happen to take a middle way here; there’s ample evidence of biological risk factors for many forms of mental illness, but in our contemporary quest for biomarkers, the role of experience is often sidelined. The idea that you might be depressed because bad things have happened to you goes unmentioned in much contemporary research on affective disorders, for instance.

But, rather than getting into that debate, I want to make a more general point about evidence. In his statement, Oliver James came out with some statistics that surprised me. In particular, he said:

13 studies find that more than half of schizophrenics suffered childhood abuse. Another review of 23 studies shows that schizophrenics are at least three times more likely to have been abused than non-schizophrenics. It is becoming apparent that abuse is the major cause of psychoses.

The frustrating thing about this claim is that no sources were given. I don't work in this area, so I thought I’d see if I could track down the articles cited by James. My initial attempt was based on a hasty search of Web of Knowledge on the morning that the article appeared. I described the results of my searches on a blogpost that day, but a commentator pointed out that I'd limited myself to looking at the link between schizophrenia and sexual abuse, whereas James had been referring to childhood abuse in general. I realised that a fair and proper appraisal of his claims should look at this broader category, and accordingly I removed the original post until I could find time to do a more thorough job.

Accurate assessment of child abuse is difficult because it is often hidden away and many cases may be missed. Retrospective accounts of abuse are notoriously hard to validate: false memories can be induced, but true memories may be suppressed. All those writing in this field note the problems of getting accurate data, and the wide variations in rates of abuse reported in the general population, depending on how it is defined. For instance, Fryers and Brugha (2013) noted that for the general population, estimated rates of child physical abuse have ranged from 10% to 31% in males and 6% to 40% in females, and child sexual abuse from 3% to 29% in males and 7% to 36% in females.

Fryers and Brugha focused on prospective, longitudinal studies, taking evidence from over 200 studies. They concluded that “most abuses were associated statistically with almost all classes of disorder (psychosis being largely an exception)” (p. 26), and “schizophrenia and closely related syndromes have not generally been much associated with previous child abuse but the picture is not simple” (p. 27). They noted that one Australian cohort study found an increase in schizophrenic disorders in children who had been sexually abused, though this was an unusual finding in the context of the research literature as a whole.

Other recent meta-analyses have included case-control studies, where participants are recruited in adulthood, and histories of participants with schizophrenia are compared with those of a control group. Matheson et al (2012) summarised seven studies involving a comparison between patients with schizophrenia and non-psychiatric controls, but definitions of adversity varied widely. In some studies, 'adversity' extended beyond abuse, though physical, sexual and emotional abuse predominated in the definitions. Overall, this review gave results similar to those reported by James, with an adversity rate of 58% in the schizophrenia group and 27% in the controls. This high rate in those with schizophrenia depends, however, on one large study that included adversity factors going beyond abuse, such as having a parent with nervous or emotional problems, or a lot of conflict and tension in the household. In this same study, 91% of controls compared with 71% of those with schizophrenia described their childhood as "happy". If this study is excluded, rates of adversity in those with schizophrenia fall to 28% compared to 8% in controls – still a notable and statistically reliable effect, but with less dramatic absolute rates of adversity than those cited by James.

A larger meta-analysis including a total of 36 studies was conducted by Varese et al (2012), who obtained similar results: a higher rate of childhood adversities in those who develop psychosis, with an odds ratio estimated at 2.78 (95% CI = 2.34-3.31). Significant associations of similar magnitude were found for all types of adversity other than parental death. The odds ratio is not, however, the same as a risk ratio, so should not be interpreted as indicating that those with schizophrenia are three times more likely to have suffered abuse. I don’t want to downplay the importance of the association, which is nevertheless striking, and supports the authors' conclusion that clinicians should routinely inquire about adverse events in childhood when seeing patients with psychiatric conditions.
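The gap between an odds ratio and "times more likely" is easy to see with a small worked calculation. The sketch below is only an illustration, not an analysis taken from any of the reviews: it plugs in the adversity rates quoted above from Matheson et al (58% vs 27%, and 28% vs 8% with the one large study excluded) and compares the simple ratio of proportions with the odds ratio. The function names are invented for the example.

```python
def odds(p):
    """Convert a proportion (0 < p < 1) into odds."""
    return p / (1.0 - p)


def ratio_of_proportions(p_cases, p_controls):
    """How many times more common the exposure is among cases than controls."""
    return p_cases / p_controls


def odds_ratio(p_cases, p_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    return odds(p_cases) / odds(p_controls)


def implied_case_rate(p_controls, or_value):
    """Exposure rate among cases implied by an odds ratio and a control rate."""
    case_odds = or_value * odds(p_controls)
    return case_odds / (1.0 + case_odds)


# Rates of childhood adversity quoted in the text (cases, controls).
examples = [
    ("all seven studies", 0.58, 0.27),
    ("large study excluded", 0.28, 0.08),
]
for label, p_cases, p_controls in examples:
    print(
        f"{label}: ratio of proportions = "
        f"{ratio_of_proportions(p_cases, p_controls):.2f}, "
        f"odds ratio = {odds_ratio(p_cases, p_controls):.2f}"
    )

# Reading an odds ratio of 2.78 against a 27% control rate (an assumed pairing,
# chosen only to show the arithmetic).
print(f"implied rate in cases: {implied_case_rate(0.27, 2.78):.2f}")
```

In both rows the odds ratio comes out larger than the ratio of proportions, and the gap widens as the exposure becomes more common. Pairing an odds ratio of 2.78 with a control rate of 27% implies a rate of about 51% among cases, which is a little under twice the control rate rather than nearly three times.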

Overall, the research literature confirms a reliable association between childhood adversity, including abuse, and schizophrenia in adulthood. The conclusion drawn by James, however, that “abuse is the major cause of psychoses” is not endorsed by any of the academic authors of the reviews I looked at. The complexity of causation in the field of neuroscience and mental health is a topic I hope to return to in a later blogpost, but for the time being, I would recommend another review of this literature by Sideli et al (2012), which discusses possible explanations for links between adversity and psychosis. Most researchers familiar with this area would endorse this quote by Fryers and Brugha (2013), reflecting on our state of knowledge in this area:

From all this work an understanding has emerged of the 'cause' of serious mental illness as complex, varied and multi-factorial, encompassing elements of genetic constitution, childhood experience, characteristics of personality, significant life events, the quality of relationships, economic and social situation, life-style choices such as alcohol and other drugs, and aging. Some of these factors have been elucidated to the point of representing acknowledged risk factors for specific forms of mental illness or mental illness in general, such as familial genes, relative poverty, major trauma, excessive alcohol consumption, extreme negative life-events, poor education, and long-term unemployment.

These may all be experienced in childhood and we do not need research to tell us that poverty, inadequate education and life events such as loss of a parent or displacement as a refugee by war, or trauma such as child sex abuse are bad. Nor should it need evidence of later consequences such as mental illness to argue for the prevention of such situations and experiences. The strongest argument is in terms of human rights. However, the issues are not generally given a high priority and people may think them exaggerated or assume that these things are just part of human life and children get over them anyway. But we should not be willing to accept these as inevitably part of human life, but fight for a better life for our children – and hope thereby for a better life for adults and the whole community.

But to come back to the impetus for the current blogpost, the point I’d really like to make is that if the Observer wants to run articles like this, where scientific evidence is cited, the editor should ask for sources for the evidence, and should provide these with the article. As noted by Prof Michael O'Donovan in a letter to the Observer, Oliver James is "unknown in the scientific community as a researcher into the origins of psychosis". This does not make his opinions worthless, but if he wants to argue his case from the scientific evidence, then we need to know what evidence he is using, just as we would expect for any reputable scientist making such claims. Most readers don't have the resources or skills to trawl through research databases trying to establish whether the evidence is accurately reported or cherry-picked. As O'Donovan points out, confident assertions about childhood causes of schizophrenia can only cause distress to families affected by this condition, and a responsible newspaper should take care to ensure that such claims have a verifiable basis.

Wednesday, 15 May 2013

This week, a paper by Woodley et al (2013) was widely quoted in the media (e.g. Daily Mail, Telegraph). The authors dramatically announced that the average intelligence of populations of Western industrialised societies has fallen since the Victorian era. This is provocative because previous analyses of large archived datasets of intelligence test scores by Flynn and others show the opposite. However, Woodley et al did not examine average intelligence test scores obtained from different generations. They compared 16 sets of data from Simple Reaction Time (SRT) experiments made on groups of people at various times between 1884 and 2002. In all of these experiments volunteers responded to a single light signal by pressing a single response key. Data for women are incomplete but averages of SRTs for men increase significantly with year of testing. Because Woodley et al regard SRTs as good inverse proxy measures for intelligence test scores, and as in some senses “purer” measures of intelligence than pencil and paper tests, they concluded that more recent samples are less intelligent than earlier ones.

Throughout their paper the authors argue that higher intelligence of persons alive during the Victorian era can explain why their creativity and achievements were markedly greater than for later, duller generations. We can leave aside the important question of whether there is any sound evidence that creativity and intellectual achievements have declined since a Great Victorian Flowering, because only two of the 16 datasets they compared were collected before Victoria’s death in 1901. The remaining 14 datasets date between 1941 and 2004 and, of these, only four were collected before 1970. So most of the studies analysed were made within my personal working lifespan. This provokes both nostalgia and distrust. Between 1959 and 2004 I collected reaction times (RTs) from many large samples of people but it would make no sense for me to compare absolute values of group mean RTs that I obtained before and after 1975. This is because, until 1975, like nearly all of my colleagues, the only apparatus I had were Dekatron counters, the Birren Psychomet or SPARTA apparatus, none of which measured intervals shorter than 100 msec. Consequently, when my apparatus gave a reading of 200 msec, the actual reaction time might be anywhere between 200 and 299 msec. Like most of my colleagues I always computed and published mean RTs to three decimal places, but this was pretentious because all the RTs I had collected had been, in effect, rounded down by my equipment. After 1975, easier access to computers and better programs gradually began to allow true millisecond resolution. More investigators took advantage of new equipment and our reports of millisecond averages became less misleading. I am unsurprised that mean RTs computed from post-1975 data were consistently and significantly longer than those for pre-1975 data.

Changes in recording accuracy are a sufficient reason to withhold excitement at Woodley et al’s comparison. It is worth noticing that other methodological issues also make it tricky to compare absolute values for means of RTs that were collected at different times and so with different kinds of equipment. For example, RTs are affected by differences in signal visibility and rise-times to maximum brightness between tungsten lamps, computer monitor displays, neon bulbs and LCDs. The stiffness and “throw” of response buttons will also have varied between the set-ups that investigators used. When comparing absolute values of SRTs, another important factor is whether or not each signal to respond is preceded by a warning signal, whether the periods between warning signals and response signals are constant or variable and just how long they are (intervals between, approximately, 200 and 800 ms allow faster RTs than shorter or longer ones). Knowing these methodological quirks makes us realise that, in marked contrast to intelligence tests, methodologies for measuring RT have been thoroughly explored but never standardised.

So I do not yet believe that Woodley et al’s analyses show that psychologists of my generation were probably (once!) smarter than our young colleagues (now) are. This seems unlikely, but perhaps if I read further publications by these industrious investigators I may become convinced that this is really the case.

Dr Woodley has published a response to my critique on James Thompson's blog. He asks me to answer. I am glad to do so. Sluggishness has been due only to the pleasure of reading the many articles to which Woodley drew my attention. Dorothy’s remorseless archaeology of this trove, summarised in the table below, has provoked much domestic merriment during the past few days. We are grateful to Dr Woodley for this diversion. Here are my thoughts on his comments on my post.

Woodley et al used data from a meta-analysis by Silverman (2010). I am grateful to Prof Silverman for very rapid access to his paper in which he compared average times to make a single response to a light signal from large samples in Francis Galton's anthropometric laboratories and from several later, smaller samples dating from 1941 to 2006. To these Woodley et al added a dataset from Helen Bradford Thompson's 1903 monograph "The mental traits of sex".

As Silverman (2010) trenchantly points out, there is a limit to possible comparisons from these datasets: “In principle, it would be possible to uncover the rate at which RT increased (since the Galton studies) by controlling for potentially confounding variables in a multiple regression analysis. However, this requires that each of these variables be represented by multiple data points, but this requirement cannot be met by the present dataset. Accurately describing change over time also requires that both ends of the temporal dimension be well represented in the dataset and that the dataset be free of outliers (Cohen, Cohen, West, & Aiken, 2003); neither of these requirements can be met …… Thus, it is important to reiterate that the purpose … is not to show that RT has changed according to a specific function over time but rather to show that modern studies have obtained RTs that are far longer than those obtained by Galton.”

Neither Silverman nor Woodley et al seem much concerned that results of comparisons might depend on differences between studies in apparatus and methods, which are shown here, together with temporal resolution where reported.

Since Galton's dataset is the key baseline for the conclusion that population mean RT is increasing, it is worth considering details of his equipment described here and in a wonderful archival paper “Galton’s Data a Century Later” by Johnson et al (1985): “……during its descent the pendulum gives a sight-signal by brushing against a very light and small mirror which reflects a light off or onto a screen, or, on the other hand, it gives a sound-signal by a light weight being thrown off the pendulum by impact with a hollow box. The position of the pendulum at either of these occurrences is known. The position of the pendulum when the response is made is obtained by means of a thread stretched parallel to the axis of the pendulum by two elastic bands one above and one below, the thread being in a plane through the axes of the pendulum, perpendicular to the plane of the pendulum's motion. This thread moves freely between two parallel bars in a horizontal plane, and the pressing of a key causes the horizontal bars to clamp the thread. Thus the clamped thread gives the position of the pendulum on striking the key. The elastic bands provide for the pendulum not being suddenly checked on the clamping. The horizontal bars are just below a horizontal scale, 800 mm. below the point of suspension of the pendulum. Galton provided a table for reading off the distance along the scale from the vertical position of the pendulum in terms of the time taken from the vertical position to the position in which the thread is clamped." (p. 347).

Contemporary journal referees would press authors for reassurance that the apparatus produced consistent values over trials and had no bias to over- or underestimate. Obviously this would have been very difficult for Galton to achieve.

In my earlier post I noted that over the mid- to late 20th century it became obvious that to report reaction times (RTs) to three decimal places is misleading if equipment only allows centi-second recording. In the latter case a reading of 200 ms will remain until a further 100 ms have elapsed, effectively "rounding down" the RT. Woodley argues that we cannot assume that rounding down occurred. I do not follow his reasoning on this point. He also offers a statistical analysis to confirm that if the temporal resolution of the measure is the only difference between studies, this would not systematically underestimate RT. Disagreement on whether rounding occurred can only be resolved with empirical data comparing recorded and true RTs across different equipment.
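What is at stake in this disagreement can be made concrete with a toy simulation. The sketch below makes no claim about how any real counter behaved: it draws hypothetical "true" RTs (a normal distribution with an invented mean of 280 ms and standard deviation of 60 ms) and records them on an imaginary timer with 100 ms steps, once by truncation (the display only advances after a full step has elapsed) and once by rounding to the nearest step. All names and parameter values are assumptions chosen for illustration.

```python
import math
import random


def record(true_rt_ms, resolution_ms, mode):
    """Return the reading a timer with the given step size would show."""
    if mode == "truncate":
        # The display only advances once a full step has elapsed.
        return math.floor(true_rt_ms / resolution_ms) * resolution_ms
    if mode == "nearest":
        # The display shows whichever step is closest to the true value.
        return math.floor(true_rt_ms / resolution_ms + 0.5) * resolution_ms
    raise ValueError(f"unknown mode: {mode}")


def mean(values):
    return sum(values) / len(values)


rng = random.Random(2013)  # fixed seed so the run is repeatable
RESOLUTION_MS = 100.0      # step size discussed in the text

# Hypothetical true RTs; the mean and spread are invented for illustration.
true_rts = [rng.gauss(280.0, 60.0) for _ in range(20000)]

truncated = [record(rt, RESOLUTION_MS, "truncate") for rt in true_rts]
nearest = [record(rt, RESOLUTION_MS, "nearest") for rt in true_rts]

bias_truncate = mean(truncated) - mean(true_rts)
bias_nearest = mean(nearest) - mean(true_rts)

print(f"mean true RT:          {mean(true_rts):.1f} ms")
print(f"bias when truncating:  {bias_truncate:.1f} ms")
print(f"bias when rounding:    {bias_nearest:.1f} ms")
```

Under these assumptions truncation pulls the group mean down by roughly half a step, while rounding to the nearest step leaves it almost unchanged. The whole argument therefore turns on which of the two the old equipment actually did, which is an empirical question that no simulation can settle.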

A general concern with comparisons of RTs between studies is that they are significantly affected by the methodology and apparatus used to collect them. This is not only a matter of differences in resolution: differences in apparatus can also lead to systematic bias in timing of trials. For a comprehensive account of how minor differences between different 21st century computers and commercial software can flaw comparisons between studies see Plant and Quinlan (2013), who write: "All that we can state with absolute certainty is that all studies are likely to suffer from varying degrees of presentation, synchronization, and response time errors if timing error was not specifically controlled for." I earlier suggested that apparently trivial procedural details can markedly affect RTs. Among these are whether or not participants are given warning signals, whether the intervals between warning signals and response signals are constant or vary across trials and how long these intervals are, the brightness of signal lamps and the mechanical properties of response keys. A further point also turns out to be relevant to assessment of Woodley et al's argument: average values will also depend on the number of trials recorded, and averaged, for each person, and whether outliers are excluded. Note, for instance, that the equipment used in the studies by Deary and Der, though appropriate for the comparisons that they made and reported, did not record RTs for individual trials but an averaged RT for an entire session. This makes it impossible to exclude outliers, as is normal good practice. The point is that comparisons that are satisfactory within the context of a single well-run experiment may be seriously misleading if made between equally scrupulous experiments using different apparatus and procedures. Johnson et al (1985) and Silverman (2010) stress that Galton’s data were wonderfully internally consistent. This reassures us that equipment and methods were well standardised within his own study. It cannot give any assurance that his data can be sensibly compared with those obtained with other very diverse equipments and methodologies.

Another excellent feature of the Galton dataset is that re-testing of part of his large initial sample allowed estimates of reliability of his measures. With his large sample sizes even low values of test/re-test correlations were better than chance. Nevertheless it is interesting that the test-retest correlation for visual RT, on which Silverman’s and Woodley’s conclusions depend, was only .17. This was lower than the next lowest (high frequency auditory acuity, .28), and well below those for the Snellen eye-chart (.58) and visual acuity (.76 to .79) (Johnson et al, 1985, Table 2).

We do not know whether warning signals were used in Galton's RT studies, or, if so, how long the preparatory intervals between warning and response signals might have been. Silverman (2010) had earlier acknowledged that preparatory interval duration might be an issue but felt that he could ignore it, because Teichner had reported that Wundt’s discovery of fore-period duration effects could not be independently substantiated, and also because he accepted Seashore et al’s (1941) reassurance that there are no effects on RT of fore-period duration.

Ever since a convincing study by Klemmer (1957) it has been recognised that the durations of preparatory intervals do significantly affect reaction times, that the effects of fore-period variation are large and that results cannot be usefully compared unless these factors are taken into consideration. Indeed during the 1960s fore-period effects were the staple topic of a veritable academic industry (see review by Niemi and Naatanen, 1981, and far too many other papers by Bertelson, Nickerson, Sanders, Rabbitt etc.). In this context Seashore et al’s (1941) failure to find fore-period effects does not increase our confidence in their results as one of the data points on which Woodley et al’s analysis is based.

Silverman’s lack of interest in fore-period duration was also heightened by Johnson et al’s (1985) comment that, as far as they were able to discover, each of Galton’s volunteers was only given one trial. Silverman implies that if each of Galton’s volunteers only recorded a single RT, variations in preparatory intervals are hardly an issue. It is also arguable that this relaxed procedure might have lengthened rather than shortened RTs. Well… Yes and No. First, it would be nice to know just how volunteers were alerted that their single trial was imminent. By a nod or a wink? A friendly pat on the shoulder? A verbal “Ready”? Second, an important point of using warning signals, and of recording many rather than just one trial, is that the first thing that all of us who run simple RT experiments discover is that volunteers are very prone to “jump the gun” and begin to respond before any signal appears, so recording absurdly fast “RTs” that can be as low as 10 to 60 ms. 20th and 21st century investigators are usually (!) careful to identify and exclude such observations. Many also edit out what they regard as implausibly slow responses. We do not know whether or how either kind of editing occurred in the Galton laboratories. Many participants would have jumped the gun and if this was their sole recorded reaction the effects on group means would have been considerable. If Galton’s staff did edit RTs, both acceptance of impulsive responses and dismissal of very slow responses would reduce means and favour the idea of “Speedy Victorians”.
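A little arithmetic gives a sense of scale for the gun-jumping problem when each person contributes a single trial. The sketch below treats the group mean as a simple mixture of genuine responses and anticipations. Only the 10 to 60 ms range for anticipations comes from the text; the 250 ms genuine mean and the shares of volunteers who anticipate are invented, and nothing here is an estimate of what happened in Galton's laboratories.

```python
def group_mean_with_anticipations(genuine_mean_ms, anticipation_mean_ms, share_anticipating):
    """Group mean when a share of single-trial records are anticipations.

    Each volunteer contributes exactly one recorded time, so the group mean
    is a weighted average of the two kinds of record.
    """
    if not 0.0 <= share_anticipating <= 1.0:
        raise ValueError("share_anticipating must be between 0 and 1")
    return (
        (1.0 - share_anticipating) * genuine_mean_ms
        + share_anticipating * anticipation_mean_ms
    )


GENUINE_MEAN_MS = 250.0      # hypothetical mean of genuine simple RTs
ANTICIPATION_MEAN_MS = 35.0  # midpoint of the 10 to 60 ms range in the text

# Hypothetical shares of volunteers whose only recorded trial was an anticipation.
for share in (0.0, 0.05, 0.10, 0.20):
    observed = group_mean_with_anticipations(
        GENUINE_MEAN_MS, ANTICIPATION_MEAN_MS, share
    )
    print(f"{share:.0%} anticipations -> group mean {observed:.1f} ms")
```

Each additional 5% of unedited anticipations lowers the group mean by about 11 ms under these assumptions, so the direction and rough size of the distortion depend entirely on an editing policy that we do not know.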

I would like to stress that my concerns are methodological rather than dogmatic. Investigators of reaction times try to test models for information processing by making small changes in single variables in tasks run on the same apparatus and with exactly the same procedures. This makes us wary of conclusions from comparisons between datasets collected with wildly different equipments, procedures and groups of people. My concerns were shared by some of those whose data are used by Silverman and Woodley et al. For example, the operational guide for the Datico Terry 84 device used by Anger et al states that "A single device has been chosen because it is very difficult to compare reaction time data from different test devices".

Because I have spent most of my working life using RTs to compare the mental abilities of people of different ages I am very much in favour of using RT measurements as a research tool for individual differences. (For my personal interpretation of the relationships between people’s calendar ages and gross brain status and their performance on measures of mental speed, of fluid intelligence, of executive function, and of memory see e.g. Rabbitt et al, 2007). I also strongly believe that mining archived data is a very valuable scientific endeavour and becomes more valuable as the volume of available data exponentially increases. For example, Flynn’s dogged analyses of archived intelligence test scores show that data mining has raised provocative and surprising questions. I also believe, with Silverman, that large population studies provide good epidemiological evidence of the effects of changes in incidence of malnutrition or of misuse of pesticides or antibiotics. I am more amused than concerned when, in line with Galton’s strange eugenic obsessions, they are also discussed as potential illustrations of growing degeneracy of our species due to increased survival odds for the biologically unfit. As I noted in my original post, my only concern is that it is a time-wasting mistake to uncritically treat measurements of Reaction Times as being, in some sense, “purer”, more direct and more trustworthy indices of individual differences than other measures such as intelligence tests. Of course RTs can be sensitive and reliable measures of individual differences but, as things stand, equipments and procedures are not standardised and, because RTs are liable to many methodological quirks, we obtain widely different mean values from different population samples even from apparently very similar tasks.

Thursday, 9 May 2013

Here’s an interesting question to ask any scientist: If you were to receive no more research funding, and just focus on writing up the data you have, how long would it take? The answer tends to go up with seniority, but a typical answer is 3 to 5 years.

I don’t have any hard data on this – just my own experience and that of colleagues – and I suspect it varies from discipline to discipline. But my impression is that people generally agree that the academic backlog is a real phenomenon, but they disagree on whether it matters.

One view is that completed but unpublished research is not important, because there’s a kind of “survival of the fittest” of results. You focus on the most interesting and novel findings, and forget about the rest. It’s true that we’ve all done failed studies with inconclusive results, and it would be foolish to try to turn such sow’s ears into silk purses. But I suspect there’s a large swathe of research that doesn’t fall into that category, but still never gets written up. Is that right, given the time and money that have been expended in gathering data? Indeed, in clinical fields, it’s not only researchers putting effort into the research – there are also human participants who typically volunteer for studies on the assumption that the research will be published.

I’m not talking about research that fails to get published because it’s rejected by journal editors, but rather about studies that don’t get to the point of being written up for publication. Interest in this topic has been stimulated by Ben Goldacre’s book Bad Pharma, which has highlighted the numerous clinical trials that go unreported – often because they have negative results. In that case the concern is that findings are suppressed because they conflict with the financial interests of those doing the research, and the AllTrials campaign is doing a sterling job to tackle that issue. But beyond the field of clinical trials, there’s still a backlog, even for those of us working in areas where financial interests are not an issue.

It’s worth pausing to consider why this is so. I think it’s all to do with the incentive structure of academia. If you want to make your way in the scientific world, there are two important things you have to do: get grant funding and publish papers. This creates an optimisation problem, because both of these activities take time, and time is in short supply for the average academic. It’s impossible to say how long it takes to write a paper, because it will depend on the complexity of the data, and will vary from one subject area to the next, but it’s not something that should be rushed. A good scientist checks everything thoroughly, thinks hard about alternative interpretations of results, and relates findings to the existing research literature. But if you take too much time, you’re at risk of being seen as unproductive, especially if you aren’t bringing in grant income. So you have to apply for grants, and having done so, you have then to do the research that you said you’d do. You may also be under pressure to apply for grants to keep your research group going, or to fund your own salary.

When I started in research, a junior person would be happy to have one grant, but that was before the REF. Nowadays heads of department will encourage their staff to apply for numerous grants, and it’s commonplace for senior investigators to have several active grants, with estimates of around 1-2 hours per week spent on each one. Of course, time isn’t neatly divided up, and it’s more likely that the investigator will get the project up and running and then delegate it to junior staff, putting in additional hours at the end of the project when it’s time to analyse and write up the data. The bulk of the day-to-day work will be done by postdocs or graduate students, and it can be a good training opportunity for them. All the same, it’s often the case that the amount of time specified by senior investigators is absurdly unrealistic. Yet this approach is encouraged: I doubt anyone ever questions a senior investigator’s time commitment when evaluating a grant, few funding bodies check whether you’ve done what you said you’d do, and even if they do, I’ve never heard of a funder demanding that a previous project be written up before they’ll consider a new funding application.

I don’t think the research community is particularly happy about this: many people have a sense of guilt at the backlog, but they feel they have no option. So the current system creates stress as well as inefficiency and waste. I’m not sure what the solution is, but I think this is something that research funders should start thinking about. We need to change the incentives to allow people time to think. I don’t believe anyone goes into science because they want to become rich and famous: we go into it because we are excited by ideas and want to discover new things. But just as bankers seem to get into a spiral of greed whereby they want higher and higher bonuses, it’s easy to get swept up in the need to prove yourself by getting more and more grants, and to lose sight of the whole purpose of the exercise – which should be to do good, thoughtful science. We won’t get the right people staying in the field if we value people solely in terms of research income, rather than in terms of whether they use that income efficiently and effectively.