Monday, June 03, 2013

UPDATE: Eli just saw a technical comment in Science, which with the change of a word or two summarizes this silliness

Tol criticizes the statistical analyses used to support the conclusions in our paper. His theory-biased criticism is disproportionate in view of the robustness of our findings even if different statistical methods are applied, and falls short in explaining the prepubertal* nature of his and others' criticisms.

Well Shub, I'd rate that abstract as a 3, which is what Cook et al. did (NOTE: I read what you posted before checking on it in the TCP database). It wasn't easy, but what tipped it was that the abstract talked about the preference for using biogas, as opposed to other sources (coal), for generating electricity. Use of the word "beneficial" provides a big hint. It doesn't require the Parse-o-matic(tm), but it does take reasonably close reading. The endorsement is implicit, but clear.

> A direct comparison of abstract rating versus self-rating endorsement levels for the 2142 papers that received a self-rating is shown in table. More than half of the abstracts that we rated as 'No Position' or 'Undecided' were rated 'Endorse AGW' by the paper's authors.

Willard, B is obviously -- in the extant case -- the better model. Indeed, as reported in the paper a certain number of abstracts (~10-15%) did have this sort of conflict. I've played the "rate the abstracts" game over at SkS and can understand why there might have been instances of disagreement.

FWIW, when I ran into something which gave me trouble I tended to rate on the conservative side (i.e., a 4 vs. a 3, which is where most of the difficulty occurred). But then sometimes life just ain't easy...

> Each abstract was categorized by two independent, anonymized raters.

http://iopscience.iop.org/1748-9326/8/2/024024/article
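Whether two raters working from the same rubric agree more often than chance would predict is exactly what chance-corrected agreement statistics measure. Here is a minimal, illustrative sketch of Cohen's kappa (not a statistic Cook et al report; just the standard tool for this kind of two-rater design):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater1)
    # observed proportion of items on which the raters agree
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # expected agreement if each rater assigned categories at random
    # according to their own marginal frequencies
    c1, c2 = Counter(rater1), Counter(rater2)
    p_expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# perfect agreement gives kappa = 1; chance-level agreement gives kappa = 0
kappa = cohens_kappa([3, 3, 4, 4], [3, 3, 4, 4])  # kappa = 1.0
```

Kappa near zero on the 3-vs-4 boundary would be the quantitative version of the disagreements discussed above.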

Compare with Richard's report:

> The abstracts were assessed by a team of 24 volunteers [the footnote index is misplaced] (who discussed their ratings with one another) [...]

https://docs.google.com/file/d/0Bz17rNCpfuDNM1RQWkQtTFpQUmc/edit

This sentence presents three independent ideas. All should be developed. By stringing them together, they are bulldozed into secondary arguments to create a piling-on effect.

The parenthesis is so general that one fears the worst; it should be replaced by the text of the footnote, which is more precise.

The accusation of lack of independence has not been made, contrary to Wiki's recommendation: Be Bold[1].

These should be three very big tells for anyone used to technical writing. Whatever the merits of these three claims, they are of little relevance to what the authors claimed: independence of inter-rating for each item. The authors have not claimed that the raters are not learning or training themselves along the way.

> When I ran into something which gave me trouble I tended to rate on the conservative side [...]

Exactly. In fact, considering the arbitrary nature of the task, it is only natural to expect raters to err on the conservative side. Not that we should expect raters to bear the same implicatures as the authors themselves. Tom Curtis certainly does not have the same implicatures as Richard Tol:

> That means by a process of elimination, Tol thinks that the following papers all have neutral abstracts (Cook et al rating in brackets): [There follows the analysis of abstracts (1) to (5)] He may have a point about (3). He is clearly incorrect about the others.

http://bybrisbanewaters.blogspot.ca/2013/05/tols-gaffe.html

***

Here's an example of a more automatic context, which presumably (h/t Richard Tol) might be analyzed with generalizability theory:

Reliability of assessment tools in rehabilitation: an illustration of appropriate statistical analyses

Gabrielle Rankin, Maria Stokes

Objective: To provide a practical guide to appropriate statistical analysis of a reliability study using real-time ultrasound for measuring muscle size as an example.

Method: The cross-sectional area (CSA) of the anterior tibial muscle group was measured using real-time ultrasonography.

Main outcome measures: Intraclass correlation coefficients (ICCs) and the 95% confidence interval (CI) for the ICCs, and the Bland and Altman method for assessing agreement, which includes calculation of the mean difference between measures (d), the 95% CI for d, the standard deviation of the differences (SDdiff), the 95% limits of agreement and a reliability coefficient.

Results: Inter-rater reliability was high, ICC (3,1) was 0.92 with a 95% CI of 0.72 → 0.98. There was reasonable agreement between measures on the Bland and Altman test, as d was -0.63 cm2, the 95% CI for d was -1.4 → 0.14 cm2, the SDdiff was 1.08 cm2, the 95% limits of agreement -2.73 → 1.53 cm2 and the reliability coefficient was 2.4. Between-scans repeatability was high, ICCs (1,1) were 0.94 and 0.93 with 95% CIs of 0.8 → 0.99 and 0.75 → 0.98, for days 1 and 2 respectively. Measures showed good agreement on the Bland and Altman test: d for day 1 was 0.15 cm2 and for day 2 it was -0.32 cm2, the 95% CIs for d were -0.51 → 0.81 cm2 for day 1 and -0.98 → 0.34 cm2 for day 2; SDdiff was 0.93 cm2 for both days, the 95% limits of agreement were -1.71 → 2.01 cm2 for day 1 and -2.18 → 1.54 cm2 for day 2; the reliability coefficient was 1.80 for day 1 and 1.88 for day 2. The between-days ICC (1,2) was 0.92 and the 95% CI 0.69 → 0.98. The d was -0.98 cm2, the SDdiff was 1.25 cm2 with 95% limits of agreement of -3.48 → 1.52 cm2 and the reliability coefficient 2.8. The 95% CI for d (-1.88 → -0.08 cm2) and the distribution graph showed a bias towards a larger measurement on day 2.

Conclusions: The ICC and Bland and Altman tests are appropriate for analysis of reliability studies of similar design to that described, but neither test alone provides sufficient information and it is recommended that both are used.
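For bunnies unfamiliar with the machinery in that abstract: the Bland and Altman analysis reduces to the mean of the paired differences (the bias d), their standard deviation (SDdiff), and d ± 1.96 × SDdiff as the 95% limits of agreement. A sketch with made-up numbers, not the study's data:

```python
from statistics import mean, stdev

def bland_altman(method_a, method_b):
    """Bland and Altman agreement: bias, SD of differences, 95% limits."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    d = mean(diffs)            # mean difference between paired measures (bias)
    sd_diff = stdev(diffs)     # sample standard deviation of the differences
    limits = (d - 1.96 * sd_diff, d + 1.96 * sd_diff)  # 95% limits of agreement
    return d, sd_diff, limits

# hypothetical paired CSA measurements (cm2) from two raters
d, sd_diff, limits = bland_altman([10.2, 11.5, 9.8, 12.0],
                                  [10.0, 11.0, 10.1, 11.6])
```

The ICC the authors also report answers a different question (consistency of rankings rather than absolute agreement), which is why the paper concludes that neither statistic alone suffices.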

Well, rattus, here's the thing. According to Tom Curtis, you should have rated that one as a 4, i.e., as neutral.

The raters apparently behaved like automatons. There is nothing in the mere reading of the text that would indicate 'implicit support'.

And to top it off, the phrase "global warming" is neither in the title nor the abstract. It appears in the citation as a keyword: "global warming contributions". So your guesswork is misplaced.

I chose this particular abstract as the first one I came across in the '3' (implicit) category.

As willard understands, your justification for including this in 3 proves my point.

I would contend, on the other hand, that such papers should be rated neutral, even if they profess undying support for the global warming theory, because theirs is not a considered opinion but merely an assumption required to frame certain points they make.

"It is a strange claim to make. Consensus or near-consensus is not a scientific argument. Indeed, the heroes in the history of science are those who challenged the prevailing consensus..." ~ Richard Tol

Over at wottsupwiththatblog, Richard has conceded that he only has destructive criticism as an option. He can't repeat the analysis because it is too much work, and shutting up is apparently "wrong". But Jay provides us with the quote that possibly explains that: he must attack the consensus somehow; otherwise he can't be a hero...

There is no doubt in my mind that the literature on climate change overwhelmingly supports the hypothesis that climate change is caused by humans. I have very little reason to doubt that that is indeed true and that the consensus is correct. Cook et al., however, failed to demonstrate this.


Thanks for the heads up. I've posted a response at Wott's. It starts thus:

Dear Richard,

I don’t think your trichotomy captures your options very well. You don’t have to reproduce Cook & al’s experiment to satisfy your (c). That is, you have forgotten about this option:

(c2) Prescribe how to redo that research by clearly stating a specification you’d consider valid.

You do have the resources to do that. You just choose, instead, to invest them in less constructive endeavours. See for instance this morning's tweet, where you took the pains to find duplicate records in the data.

Richard's remark about non-homoscedasticity rather disproves this point, unless we're talking about would-be automatons programmed to reproduce the behaviour of human raters.

***

> I would contend, on the other hand, that such papers should be rated neutral, even if they profess undying support for the global warming theory, because theirs is not a considered opinion but merely an assumption required to frame certain points they make.

"It is a strange claim to make. Consensus or near-consensus is not a scientific argument. Indeed, the heroes in the history of science are those who challenged the prevailing consensus..." ~ Richard Tol

He appears to have forgotten what comes next.

The challenge, if successful, then becomes the consensus.

And ignored the other outcome: unsuccessful challenges are very likely just wrong.

willard, Curtis schooled you in the previous thread that raters just read the abstract to rate (thanks, Eli, for breaking up the thread). Here we have rattus giving a perfect example of the opposite. He read it, understood it in his own way and rated it. The abstract itself, on the other hand, does not have material to support his classification. In response, you are shifting the field of argument?

It is solely the interpreted component of the Cook database that inflates numbers for the consensus position. Cook would still have had decent numbers, maybe not the 97 that he craved but 90-something, had he included the implicits in the neutrals or thrown them out. But no, that couldn't be done.

Please don't think that 'endorse the consensus' is a scientifically better term than claiming it, or contributing to it. If anything, it leads to more problems. 'Endorse' is already a jacked-up, non-standard term to begin with. To 'endorse' a consensus means a consensus position already exists and you have people agreeing with it. Cook et al take this agreement, and prove that there is a consensus! Totally and completely wrong: at most, you show that there is a consensus that there is a consensus.

Could there be a more stupid way of doing research?

A rigorous way would have been to just study climate papers that are squarely about AGW and attribution, and assert that most seem to believe, or disbelieve, or whatever, x or y with regard to 'human influence'. Everything else is just padding.
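Shub's "90-something" arithmetic above can actually be checked. Assuming the headline counts from Cook et al (explicit endorsements 64 + 922, implicit endorsements 2910, implicit rejections 54, explicit rejections 15 + 9, uncertain 40; figures recalled from the paper, not re-verified here), moving the implicits on both sides into the neutrals still leaves an explicit-only consensus in the nineties:

```python
def consensus_pct(endorse, reject, uncertain=0):
    """Percent of position-taking abstracts that endorse AGW."""
    return 100.0 * endorse / (endorse + reject + uncertain)

# all endorsement levels counted (the headline figure, as recalled)
full = consensus_pct(64 + 922 + 2910, 54 + 15 + 9, 40)  # about 97.1
# implicits on both sides moved to neutral: explicit ratings only
explicit_only = consensus_pct(64 + 922, 15 + 9, 40)     # about 93.9
```

So on his own accounting the "interpreted component" moves the figure by about three points, not from consensus to no consensus.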

A good example of going against the mainstream, stirring more research by other people, having evidence accumulate... would be Bill Ruddiman's work. Look for his new book, Earth Transformed, ~October. Much research has happened since 2005's Plows, Plagues and Petroleum. I think it's an interdisciplinary tour de force, but certain people will really, really hate it, because it explains the last 10,000 years of climate all too well.

The point is that going against the mainstream is no predictor of success, making Tol's comment stupid.

Being a fool is at least as likely as (probably much more likely than) becoming a hero when one bucks the consensus, and the more fundamental the consensus (say, the radiative properties of CO2), the more likely one is to be made a fool.

Since sundry are getting their knickers in a twist about the fact that the raters talked to each other to achieve a common view (not on each abstract, but in general), how about thinking about the Delphi method (from the wiki):

The Delphi method (/ˈdɛlfaɪ/ DEL-fy) is a structured communication technique, originally developed as a systematic, interactive forecasting method which relies on a panel of experts.[1][2][3][4] The experts answer questionnaires in two or more rounds. After each round, a facilitator provides an anonymous summary of the experts’ forecasts from the previous round as well as the reasons they provided for their judgments. Thus, experts are encouraged to revise their earlier answers in light of the replies of other members of their panel. It is believed that during this process the range of the answers will decrease and the group will converge towards the "correct" answer. Finally, the process is stopped after a pre-defined stop criterion (e.g. number of rounds, achievement of consensus, stability of results) and the mean or median scores of the final rounds determine the results.[5]

Delphi is based on the principle that forecasts (or decisions) from a structured group of individuals are more accurate than those from unstructured groups.[6] The technique can also be adapted for use in face-to-face meetings, and is then called mini-Delphi or Estimate-Talk-Estimate (ETE). Delphi has been widely used for business forecasting and has certain advantages over another structured forecasting approach, prediction markets.[7]
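The quoted mechanism is simple enough to caricature in a few lines: each round, every panelist sees an anonymous summary (here, the median) of the previous round and moves part-way toward it, and the spread of answers contracts. A toy sketch (the 0.5 pull factor and the use of the median are arbitrary choices for illustration, not part of any canonical Delphi):

```python
from statistics import median

def delphi_round(estimates, pull=0.5):
    """One Delphi-style round: each expert revises part-way toward the
    anonymized group median of the previous round."""
    m = median(estimates)
    return [e + pull * (m - e) for e in estimates]

estimates = [1.0, 5.0, 9.0]
for _ in range(3):
    estimates = delphi_round(estimates)
# the spread halves each round: 8 -> 4 -> 2 -> 1
```

The point for the knicker-twisters: structured discussion among raters is a recognized way to improve group judgment, not automatically a contamination of it.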

With regard to independence, most criticism of Cook et al on that ground is based on a simple misreading of the claim of rater "independence". However, some (around ten or so) abstracts were explicitly discussed by raters in the internal forum. The initial rating of these abstracts was not "independent" as described in Cook et al. This was a lapse in procedure and should not have happened. Arguably Cook et al should have excluded these abstracts from the final results, and should certainly have deleted the discussion and reinforced the requirement for independent ratings.

I do not think it is a significant lapse, given that the stated procedure in the paper called for dispute resolution by, first, rerating by the initial raters with the other rating before them, and then adjudication by a third party. Those who disagree are quite welcome, however, to identify the ten or so abstracts involved, exclude them from the sample and recalculate the results. If they think it will make a difference to the results, they are delusional. If they mention the error without mentioning the scale of the problem, they are not interested in generating informed analysis, but merely in generating "talking points" to allow those who are discomfited by the results of Cook et al to dismiss the results without thought. Those in the latter category deserve nothing but contempt.
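The recalculation invited above is trivial. Assuming the headline counts from Cook et al (3896 endorsing, 78 rejecting, 40 uncertain among position-taking abstracts; figures recalled, not re-checked here), and taking the worst case, that all ten forum-discussed abstracts were endorsements and get thrown out:

```python
# consensus percentage = endorsing / (endorsing + rejecting + uncertain)
before = 100.0 * 3896 / (3896 + 78 + 40)
# worst case: all ten discussed abstracts were endorsements, now excluded
after = 100.0 * (3896 - 10) / ((3896 - 10) + 78 + 40)
# the change is at the second decimal place: 97.06 vs 97.05
```

A rounding-level change, as claimed: anyone presenting the ten abstracts as material to the result has not done this one-line sum.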

There is a rather obvious response to all this brouhaha about the consensus. Or rather, there are a number of responses that could be pursued.

Firstly, the non-consensus claimants could trawl through the literature returned by Cook et al's search, and catalog those papers that they think are explicit non-endorsements of human-caused climate change. Show the world which papers they believe offer evidence that human carbon emissions are not warming the planet, and numerically quantify this body. A rigorous cataloger would do this as an annotated bibliography, and demonstrate their own thinking about the content of the listed publications. Let's see exactly what is the quality (and quantity) of evidence against human-caused global warming, as gathered by a Cook et al style of search.

Secondly, the non-consensus claimants could trawl through the literature returned by Tol's proposed search, and catalog those papers that they think are explicit non-endorsements of human-caused climate change. Which ones were excluded by Cook et al's search, and what is the relevance of the excluded papers to the claim that Cook et al biased the quantification of the consensus?

Then it becomes more interesting...

Thirdly, the papers returned in the above exercises could be assessed for their defensibility by following the WoS/Scopus links to all citing articles: papers that were subsequently and definitively refuted obviously have a credibility issue in any argument against a consensus. For balance the same analysis should be conducted on a sample of the papers returned in the search that explicitly endorse the human cause of current global warming. What do these analyses say about the science underpinning each side of the issue?

Fourthly, randomly selected subsets of papers from each of the two groups of papers defined in the preceding paragraph could be sent to a randomly-chosen selection of scientists in various disciplines, and these scientists asked to assess the merit of each paper in supporting or refuting the human cause of current global warming. For completeness the participating scientists could be asked to consider any subsequent response to the papers, whether in support or in refutation. With appropriate refinement such a survey would truly demonstrate what scientists think about the veracity of the work done in physics and climatology, and whether the science does actually demonstrate that humans are causing the planet to warm. After all, a true consensus should not be based merely on the numbers of papers that present 'for' and 'against' cases, but on those papers that withstand subsequent scrutiny - something largely absent from other surveys... (apologies for any recursive niggles that this might implant in people's minds).

Of course, as I and many others have said previously, the opinions of scientists (especially when offered outside their fields of expertise) don't change the science itself, or the laws of nature. However the results of such a survey would be further (clearer) evidence of the opinions of scientists, and it might help to more tightly profile the issues where understanding diverges with respect to the nature of the climatological work. If it can be empirically demonstrated that non-consensus opinions are based on flaws-in-understanding of one sort or another then non-consensus opinions become even less relevant than they are now.

1) In his third draft, Tol has moved his discussion of the subsidiary survey of rating (4) papers from the footnote and dropped his claim that "While the difference between 97% and 98% may be dismissed as insubstantial, it is indicative of the quality of manuscript preparation and review."

He still insists, however, that there is doubt as to whether the subsidiary survey found five of one thousand or forty of one thousand "uncertain" papers among those rated (4). This despite a public statement by a co-author that the number was five; a statement of which Tol was aware well before his third draft. His lack of clarity is, therefore, purely tactical rather than based on evidence. That is, he is unclear because he ignores evidence of which he is aware in order to retain an unjustified negative criticism in his comment.

2) Tol has now admitted in his third draft that the skewed sample of disciplines relative to a scopus search "introduces a bias against endorsement". He does not make the same admission regarding the WoS search even though based on the same data and logic; and even though he has made that admission in private correspondence.

This admission means that his claim of evidence of bias comes entirely from his unjustified claim that "impacts" and "mitigation" papers should not be rated.

"It is a strange claim to make. Consensus or near-consensus is not a scientific argument. Indeed, the heroes in the history of science are those who SUCCESSFULLY challenged the prevailing consensus..." ~ Richard Tol"
