Monday, November 20, 2006

SUMMARY: In astrophysics, Kurtz found that articles
that were self-archived by their authors in Arxiv
were downloaded and cited twice as much as those that were not. He
traced this enhanced citation impact to two factors: (1) Early
Access (EA): The self-archived preprint was accessible earlier than
the publisher's version (which is accessible to all research-active
astrophysicists as soon as it is published, thanks to Kurtz's ADS system). (Hajjem, however,
found that in other fields, which self-archive only published
postprints and do have accessibility/affordability problems
with the publisher's version, self-archived articles still have
enhanced citation impact.) Kurtz's second factor was: (2) Quality
Bias (QB), a selective tendency for higher quality articles to be
preferentially self-archived by their authors, as inferred from the
fact that the proportion of self-archived articles turns out to be
higher among the more highly cited articles. (The very same finding is
of course equally interpretable as (3) Quality Advantage (QA),
a tendency for higher quality articles to benefit more than lower
quality articles from being self-archived.) In condensed-matter
physics, Moed has confirmed that the impact advantage occurs early
(within 1-3 years of
publication). After article-age is adjusted to reflect the date of
deposit rather than the date of publication, the enhanced impact of
self-archived articles is again interpretable as QB, with articles by
more highly cited authors (based only on their non-archived articles)
tending to be self-archived more. (But since the citation counts for
authors and for their articles are correlated, one would expect much
the same outcome from QA too.) The only way to test QA vs. QB is to
compare the impact of self-selected self-archiving with mandated
self-archiving (and no self-archiving). (The outcome is likely to be
that both QA and QB contribute, along with EA, to the impact advantage.)

Michael Kurtz's papers have confirmed that in astronomy/astrophysics
(astro), articles that have been self-archived -- let's call this
"Arxived" to mark it as the special case of depositing in the central
Physics Arxiv -- are cited (and
downloaded) twice as much as non-Arxived articles. Let's call this the
"Arxiv Advantage" (AA).

Kurtz analyzed AA and found that it consisted of at least 2 components:
(1) EARLY ACCESS (EA): There is no detectable AA for old articles
in astro: AA occurs while an article is young (1-3 years). Hence astro
articles that were made accessible as preprints before publication show
more AA: This is the Early Access effect (EA). But EA alone does not
explain why AA effects (i.e., enhanced citation counts) persist
cumulatively and even keep growing, rather than simply being a
phase-advancing of otherwise unenhanced citation counts, in which case
simply re-calculating an article's age so as to begin at preprint
deposit time instead of publication time should eliminate all AA
effects -- which it does not.
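
The age re-calculation just described can be sketched in a few lines of
Python. This is only an illustration, not Kurtz's actual procedure: the
record layout (`deposited`, `published`, `citations`), the function
names, and the toy numbers are all assumptions made up for the example.

```python
from statistics import mean

def article_age(record, census_year):
    """Age in years, reckoned from preprint deposit when there is one,
    otherwise from publication (hypothetical record layout)."""
    start = record["deposited"] if record["deposited"] is not None else record["published"]
    return census_year - start

def advantage_by_age(records, census_year):
    """Ratio of mean citations (Arxived / non-Arxived) at each adjusted age."""
    groups = {}  # adjusted age -> citation counts for each group
    for r in records:
        age = article_age(r, census_year)
        key = "arxived" if r["deposited"] is not None else "other"
        groups.setdefault(age, {"arxived": [], "other": []})[key].append(r["citations"])
    ratios = {}
    for age, g in sorted(groups.items()):
        # Only report ages where both groups are present and comparable.
        if g["arxived"] and g["other"] and mean(g["other"]) > 0:
            ratios[age] = mean(g["arxived"]) / mean(g["other"])
    return ratios

# Toy records, invented purely for illustration.
toy_records = [
    {"deposited": 2003, "published": 2004, "citations": 12},
    {"deposited": None, "published": 2003, "citations": 6},
    {"deposited": 2002, "published": 2003, "citations": 20},
    {"deposited": None, "published": 2002, "citations": 10},
]
print(advantage_by_age(toy_records, census_year=2006))
```

If the ratio stays above 1 at equal adjusted ages, the advantage is not
merely a head start.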

(2) QUALITY BIAS (QB): (Kurtz called the second component
"Self-Selection Bias" for quality, but I call it self-selection Quality
Bias, QB): If we compare articles within roughly the same
citation/quality bracket (i.e., articles having the same number of
citations), the proportion of Arxived articles becomes higher in the
higher citation brackets, especially the top 200 papers. Kurtz
interprets this as resulting from authors preferentially Arxiving
their higher-quality preprints (Quality Bias).
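
The bracket comparison can be pictured as a simple tabulation. The
sketch below is a hedged illustration only: the bracket boundaries, the
function name and the toy data are assumptions, not Kurtz's brackets or
figures.

```python
def arxived_share_by_bracket(articles, bracket_edges):
    """Share of Arxived articles in each citation bracket.

    articles:      list of (citations, is_arxived) pairs
    bracket_edges: ascending lower bounds, starting at 0, e.g. [0, 10, 50]
    """
    counts = [[0, 0] for _ in bracket_edges]  # [arxived, total] per bracket
    for citations, is_arxived in articles:
        # Index of the highest lower bound that the citation count reaches.
        idx = max(i for i, lo in enumerate(bracket_edges) if citations >= lo)
        counts[idx][1] += 1
        if is_arxived:
            counts[idx][0] += 1
    # None marks an empty bracket.
    return [a / t if t else None for a, t in counts]

# Toy data, invented purely for illustration.
toy_articles = [
    (2, False), (5, True), (8, False), (8, False),
    (15, True), (20, False),
    (60, True), (90, True),
]
print(arxived_share_by_bracket(toy_articles, [0, 10, 50]))
```

A share that rises across brackets is the pattern Kurtz reports; the
tabulation by itself says nothing about what causes it.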

Of course the very same outcome is just as readily interpretable as
resulting from Quality Advantage (QA) (rather than Quality Bias
(QB)): i.e., that the Arxiving benefits better papers more. (Making a
low-quality paper more accessible by Arxiving it does not guarantee
more citations, whereas making a high-quality paper more accessible is
more likely to do so, perhaps roughly in proportion to its higher
quality, allowing it to be used and cited more according to its merit,
unconstrained by its accessibility/affordability.)

There is no way, on the basis of existing data, to decide between QA
and QB. The only way to measure their relative contributions would be
to control the self-selection factor: randomly imposing Arxiving on
half of an equivalent sample of articles of the same age (from
preprinting age to 2-3 years postpublication, reckoning age from
deposit date, to control also for age/EA effects), and comparing also
with self-selected Arxiving.
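
The random-imposition step could look something like the following
sketch. It only illustrates the assignment itself; the function name,
the fixed seed and the use of bare article identifiers are assumptions
for the example, and matching the sample on article age would have to
happen before this step.

```python
import random

def assign_arms(article_ids, seed=0):
    """Randomly split articles into an imposed-Arxiving arm and a control arm."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(article_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "imposed": sorted(shuffled[:half]),   # Arxiving imposed by the experimenter
        "control": sorted(shuffled[half:]),   # no Arxiving imposed
    }

arms = assign_arms(range(10))
print(arms)
```

Because the experimenter rather than the author decides which articles
go into the imposed arm, any citation difference between the two arms
cannot come from authors selecting their better papers.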

We are trying an approximation to this method, using articles deposited
in Institutional Repositories of institutions that mandate
self-archiving (and comparing their citation counts with those of
articles from the same journal/issue that have not been self-archived),
but the sample is still small and possibly unrepresentative, with many
gaps and other potential liabilities. So a reliable estimate of the
relative size of QA and QB still awaits future research, when
self-archiving mandates will have become more widely adopted.

Moed too has shown that in condensed-matter physics (cond-mat) the AA
effect (which he calls the "Citation Impact Differential", CID) occurs
early (1-3 years) rather than late (4-6 years), and that there is more
Arxiving by higher-quality authors (based on higher citation counts for
their non-Arxived articles) than by lower-quality authors. But this too
is just as
readily interpretable as the result of QB or QA (or both): We would of
course expect a high correlation between an author's individual
articles' citation counts and the author's average citation count,
whether the author's citation count is based on Arxived or non-Arxived
articles. These are not independent variables.

(Less easily interpretable -- but compatible with either QA or QB
interpretations -- is Moed's finding of a smaller AA for the "more
productive" authors. Moed's explanations in terms of co-authorships
between more productive and less productive authors, senior and junior,
seem a little complicated.)

The basic question is this: Once the AA has been adjusted for the
"head-start" component of the EA (by comparing articles of equal age --
the age of Arxived articles being based on the date of deposit of the
preprint rather than the date of publication of the postprint), how big
is that adjusted AA, at each article age? For that is the AA without
any head-start. Kurtz never thought the EA component was merely a head
start, however, for the AA persists and keeps growing, and is present
in cumulative citation counts for articles at every age since Arxiving
began. This non-EA AA is either QB or QA or both. (It also has an
element of Competitive Advantage, CA, which would disappear once
everything was self-archived, but let's ignore that for now.)

Moed's analysis, like Kurtz's, cannot decide between QB and QA. The
fact that most of the AA comes in an article's first 3 years rather
than its second 3 years simply shows that both astro and cond-mat are
fast-developing fields. The fact that highly-cited articles (Kurtz) and
articles by highly-cited authors (Moed) are more likely to be Arxived
certainly does not settle the question of cause and effect: It is just
as likely that better articles benefit more from Arxiving (QA) as that
better authors/articles tend to Arxive/be-Arxived more (QB).

Nor is Arxiv the only test of the self-archiving Open Access Advantage.
(Let's call this OAA, generalizing from the mere Arxiving Advantage,
AA): We have found an OAA with much the same profile as the AA in 10
further fields, for articles of all ages (from 1 year old to 10 years
old), and as far as we know, with the exception of Economics, these are
not fields with a preprinting culture (i.e., they don't self-archive
prepublication preprints but only postpublication postprints). Hence
the consistent pattern of OAA across all fields and across articles of
all ages is very unlikely to have been just a head-start (EA) effect.

Is the OAA, then, QB or QA (or both)? There is no way to determine this
unless the causality is controlled by randomly imposing the
self-archiving on a subset of a sufficiently large and representative
random sample of articles of all ages (but especially newborn ones) and
comparing the effect across time.

In the meantime, here are some factors worth taking into account:

(1) Both astro and cond-mat are fields where it has been repeatedly
claimed that the accessibility/affordability problem for published
postprints is either nonexistent (astro) or less pronounced than in
other fields. Hence the only scope for an OAA in astro and cond-mat is
at the prepublication preprint stage.

(2) In many other fields, however, not only is there no prepublication
preprint self-archiving at all, but there is a much larger
accessibility/affordability barrier for potential users of the
published article. Hence there is far more scope for OAA and especially
QA (and CA): Access is a necessary (though not a sufficient) causal
precondition for impact (usage and citation).

It is hence a mistake to overgeneralize the phys/math AA findings to
OAA in general. We need to wait till we have actual data before we can
draw confident conclusions about the degree to which the AA or the OAA
are a result of QB or QA or both (and/or other factors, such as CA).

For the time being, I find the hypothesis of a causal QA (plus CA)
effect, successfully sought by authors because they are desirous of
reaching more users, far more plausible and likely than the hypothesis
of an a-causal QB effect in which the best authors are self-archiving
merely out of superstition or vanity! (And I suspect the truth is a
combination of both QA/CA and QB.)