Cited references and Medical Subject Headings (MeSH) as two different knowledge representations: clustering and mappings at the paper level

Abstract

For the biomedical sciences, the Medical Subject Headings (MeSH) make available a rich feature which cannot currently be merged properly with widely used citing/cited data. Here, we provide methods and routines that make MeSH terms amenable to broader usage in the study of science indicators: using Web-of-Science (WoS) data, one can generate the matrix of citing versus cited documents; using PubMed/MEDLINE data, a matrix of the citing documents versus MeSH terms can be generated analogously. The two matrices can also be reorganized into a 2-mode matrix of MeSH terms versus cited references. Using the abbreviated journal names in the references, one can, for example, address the question whether MeSH terms can be used as an alternative to WoS Subject Categories for the purpose of normalizing citation data. We explore the applicability of the routines in the case of a research program about the amyloid cascade hypothesis in Alzheimer’s disease. One conclusion is that referenced journals provide archival structures, whereas MeSH terms indicate mainly variation (including novelty) at the research front. Furthermore, we explore the option of using the citing/cited matrix for main-path analysis as a by-product of the software.

Keywords

MeSH; Citation; Journal; Main path; Alzheimer

Introduction

The ability to define research fields is one of several great challenges in information science (Chen 2016). Early efforts relied on classifying publication sources, such as journals, to define research fields. In addition to disciplinary journals, however, the literature databases Web of Science (WoS, Thomson Reuters) and Scopus (Elsevier) contain multi-disciplinary journals such as Science and Nature. In recent years, new journals that are not organized along disciplinary lines have been added to the databases. PLoS ONE, for example, tends to disturb the existing classifications of journals (Leydesdorff and De Nooy, in press). In response to these changes, bibliometricians have begun to cluster the database at the level of documents instead of journals (e.g., Waltman and van Eck 2012; cf. Hutchins et al. 2016).

An alternative to clustering documents on the basis of direct citations could be to use databases that are more specialized than WoS and Scopus, but with professional indexing at the document level. The National Library of Medicine, for example, makes a huge investment to maintain a classification system of Medical Subject Headings (MeSH) as tags to the PubMed/MEDLINE database (which is publicly available at http://www.ncbi.nlm.nih.gov/pubmed/advanced).1 The classification at the article level is elaborated in great detail (Agarwal and Searls 2009), with a hierarchical tree covering sixteen separate branches that can reach up to twelve levels of depth. Diseases, for example, are classified under C.

“Alzheimer’s disease” (AD) for example is classified as C10.228.140.380.100 under “Dementia,” as C10.574.945.249 under “Neurodegenerative diseases,” and as F03.615.400.100 under “Neurocognitive disorders” in the F-branch covering “Psychiatry and psychology.” Unlike other disciplinarily specialized databases such as Chemical Abstracts (Bornmann et al. 2009), the multiple tree-structure of the Index Medicus allows for mapping documents differently across heterogeneous domains (Leydesdorff et al. 2012; Rotolo et al. 2016). Unlike WoS or Scopus, Medline does not cover the full range of disciplines; but a large part of the scholarly literature in the life sciences is included even more exhaustively than in the more comprehensive databases (Lundberg et al. 2006).

A version of MEDLINE is integrated in the databases of Thomson Reuters. The advantage of this installation is that the “times cited” of each record (if the document is also available in the WoS Core Collection of the Citation Indices) is available on screen; but this field is not included when the records are downloaded. Rotolo and Leydesdorff (2015) provide software for integrating the “times cited” from the citation indices at WoS into the MEDLINE data. One technical advantage of the installation at PubMed is that retrieval is not restricted: using WoS, one can download only 500 records at a time, and Scopus has a maximum of 2000 records.

The MeSH terms attributed to a paper can be considered as references to a body of knowledge stored as documents in a database. Whereas the cited references are provided by the authors themselves, the MeSH categories are attributed by professional indexers. Using MeSH terms as references, one can envisage a matrix of documents referencing MeSH comparable to the cited/citing matrix at the article level. Both cited references and MeSH terms can be considered as attributes of articles, and thus be combined and compared using various forms of multi-variate analysis. The two matrices can also be integrated into a 2-mode matrix of MeSH terms versus cited references. In this brief communication, we explore these options computationally and describe software that has been developed and made available for this purpose on the internet. We discuss the opportunities and the pros and cons of various approaches.

Methods

Data

At the professional suggestion of one of us (AS, the scientometrics editor of the Journal of Alzheimer’s Disease), we selected the amyloid cascade hypothesis in Alzheimer’s disease (AD) as a test case to develop software and routines to merge and analyze citation information from the Web of Science and MeSH. The amyloid cascade hypothesis in AD was formulated by Hardy and Allsop (1991) (cf. Hardy and Higgins 1992; Selkoe 1991). Reitz (2012: 1) summarized this hypothesis as follows:

Since 1992, the amyloid cascade hypothesis has played a prominent role in explaining the etiology and pathogenesis of Alzheimer’s disease (AD). It proposes that the deposition of β-amyloid (Aβ) is the initial pathological event in AD leading to the formation of senile plaques (SPs) and then to neurofibrillary tangles (NFTs), neuronal cell death, and ultimately dementia. While there is substantial evidence supporting the hypothesis, there are also limitations: (1) SP and NFT may develop independently, and (2) SPs and NFTs may be the products rather than the causes of neurodegeneration in AD. In addition, randomized clinical trials that tested drugs or antibodies targeting components of the amyloid pathway have been inconclusive.

For the purpose of this study, the search string ‘((“Alzheimer disease”[MeSH Terms] AND “amyloid beta-protein precursor”[MeSH Terms]) AND “mice, transgenic”[MeSH Terms])’ was proposed to encompass the relevant literature. This string provided us (on March 6, 2016) with a retrieval of 3558 records in both PubMed/MEDLINE and the MEDLINE version in WoS. Using PubMed Identifiers (PMID numbers), 3416 of these records could be retrieved in the WoS Core Collection. As noted, not all journals covered by PubMed/MEDLINE are also covered in the WoS Core Collection.

Methods

Two dedicated programs, MHNetw.exe and CitNetw.exe, have been developed to generate reference matrices using the PubMed/MEDLINE and the WoS data, respectively. The matrices are provided in the Pajek format. CitNetw.exe generates the cited/citing matrix with the citing documents as units of analysis in the rows and the cited references as variables in the columns; MHNetw.exe generates a similar matrix, but with the MeSH terms in the columns. The number of citing documents is determined by the retrieval from PubMed/MEDLINE or MEDLINE in WoS, respectively. Instructions for how to use the databases and routines are provided in Appendix 1.
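As an illustration of what such a 2-mode file can look like, the following minimal sketch writes a document-by-attribute matrix in Pajek's 2-mode notation (a "*Vertices" line with the total number of vertices followed by the number of row vertices). The function name and the choice of "*Edges" are assumptions made for this example; the exact layout written by MHNetw.exe and CitNetw.exe is not documented here and may differ.

```python
def to_pajek_two_mode(row_labels, col_labels, cells):
    """Return a 2-mode network in Pajek notation as a string.

    row_labels: labels of the row vertices (e.g., citing documents).
    col_labels: labels of the column vertices (e.g., MeSH terms).
    cells: dict mapping (row_index, col_index), both 0-based, to a value.
    This is a hypothetical helper, not part of the routines described above.
    """
    n_rows, n_cols = len(row_labels), len(col_labels)
    # First number: all vertices; second number: size of the first mode.
    lines = [f"*Vertices {n_rows + n_cols} {n_rows}"]
    for i, label in enumerate(list(row_labels) + list(col_labels), start=1):
        lines.append(f'{i} "{label}"')
    lines.append("*Edges")
    for (r, c), value in sorted(cells.items()):
        if value:  # skip empty cells
            # Column vertices are numbered after the row vertices.
            lines.append(f"{r + 1} {n_rows + c + 1} {value}")
    return "\n".join(lines) + "\n"


example = to_pajek_two_mode(
    ["doc1", "doc2"], ["Mice, Transgenic"], {(0, 0): 1, (1, 0): 1}
)
```

The resulting text can be saved with a ".net" extension and opened in Pajek for inspection.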

The routine MHNetw.exe presumes that the data from WoS with the citation information is already organized (by CitNetw.exe) in the same folder so that the citation information can be retrieved locally and attributed to the MeSH categories. If this data is not yet present, the user is first prompted with a search string in the file “string.wos” that can be used at the advanced search interface of WoS.4

The SPSS syntax file “mtrx.sps” reads the reference matrix (“mtrx.txt”) into SPSS and saves this file as an SPSS systems file (“mtrx.sav”). MeSH terms are included as variable labels in the case of MHNetw.exe; in the case of CitNetw.exe, the cited references are the variable labels. The user can combine the two matrices using, for example, Excel.

MHNetw.exe additionally provides:

(a) Cr_mh.net, which contains the 2-mode matrix of cited references (CR) in the rows and MeSH terms in the columns;

(b) Jcr_mh.net, which simplifies cr_mh.net by using only the abbreviated journal names in the cited references in the rows and MeSH terms in the columns;

(c) The file jcr_mh_a.net, which contains the same information (abbreviated journal names and MeSH categories), but organized differently: both CR and MeSH are attributed as variables to the documents under study as the cases (in the rows). Within Pajek, one can convert this matrix into an affiliations matrix (using Network > 2-Mode Network > 2-Mode to 1-Mode > Columns). One can also export this file (e.g., to SPSS) for cosine-normalization of the matrix.
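The relation between the two document-level matrices and the 2-mode matrix of cited references versus MeSH terms can be sketched as follows. Each cell counts the documents in which a cited reference and a MeSH term co-occur, which is equivalent to multiplying the transposed binary citing/cited matrix by the binary document/MeSH matrix. The function name and the dictionary-based input are assumptions for illustration only; they do not describe the internals of MHNetw.exe.

```python
from collections import Counter
from itertools import product


def cooccurrence(doc_refs, doc_mesh):
    """Count documents in which a cited reference and a MeSH term co-occur.

    doc_refs: dict mapping a document id to its cited references.
    doc_mesh: dict mapping a document id to its MeSH terms.
    Only documents present in both inputs contribute (hypothetical helper).
    """
    counts = Counter()
    for doc_id in doc_refs.keys() & doc_mesh.keys():
        # Sets make the underlying matrices binary: one count per document.
        for ref, term in product(set(doc_refs[doc_id]), set(doc_mesh[doc_id])):
            counts[(ref, term)] += 1
    return counts


refs = {"d1": ["REF A", "REF B"], "d2": ["REF A"], "d3": ["REF C"]}
mesh = {"d1": ["Alzheimer Disease"], "d2": ["Alzheimer Disease", "Animals"]}
cells = cooccurrence(refs, mesh)
```

In this toy input, "REF A" and "Alzheimer Disease" co-occur in two documents, so the corresponding cell value is 2; document "d3" is ignored because it has no MeSH record.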

CitNetw.exe, furthermore, provides a file “lcs.net” containing the cited/citing matrix for the bounded citation network of the citing documents under study. The bounded citation network corresponds with what was defined as the “local citation environment” in HistCite™ (Garfield et al. 1964, 2003). The cited references are matched against a string composed from the meta-data of the citing document using the standard WoS-format of the cited references: “Name Initial, publication year, abbreviated journal title, volume number, and page number” (e.g., “Zhang CL, 2002, CLIN CANCER RES, V8, P1234”). The matrix may be somewhat different from the one obtained from using HistCite™ because of different matching and disambiguation rules.
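The matching step can be illustrated with a short sketch: a key is composed from the meta-data of each citing document in the cited-reference format quoted above, and every cited reference that equals such a key becomes an arc in the bounded network. The record layout (dictionary keys such as "author" and "cited_refs"), the function names, and the case-insensitive comparison are assumptions made for this example and do not reproduce the matching rules of CitNetw.exe.

```python
def wos_ref_key(author, year, journal, volume, page):
    """Compose a string in the cited-reference format, e.g.
    'ZHANG CL, 2002, CLIN CANCER RES, V8, P1234' (upper-cased for matching)."""
    return f"{author}, {year}, {journal}, V{volume}, P{page}".upper()


def local_citations(records):
    """Return arcs (citing_index, cited_index) within the document set.

    records: list of dicts with the hypothetical keys 'author', 'year',
    'journal', 'volume', 'page', and 'cited_refs' (list of strings).
    """
    key_to_index = {
        wos_ref_key(r["author"], r["year"], r["journal"], r["volume"], r["page"]): i
        for i, r in enumerate(records)
    }
    arcs = []
    for i, rec in enumerate(records):
        for ref in rec["cited_refs"]:
            j = key_to_index.get(ref.strip().upper())
            if j is not None and j != i:  # ignore self-matches
                arcs.append((i, j))
    return arcs


records = [
    {"author": "Zhang CL", "year": 2002, "journal": "CLIN CANCER RES",
     "volume": 8, "page": 1234, "cited_refs": []},
    {"author": "Doe J", "year": 2005, "journal": "J EXAMPLE",  # invented record
     "volume": 1, "page": 10,
     "cited_refs": ["Zhang CL, 2002, CLIN CANCER RES, V8, P1234"]},
]
arcs = local_citations(records)
```

Exact string matching of this kind misses references with variant spellings, which is one reason why results can differ from tools that apply other disambiguation rules.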

In order to proceed with main-path analysis in Pajek, the network has to be a-cyclical (de Nooy et al. 2011, pp. 244f.). If needed, one can make the network a-cyclical within Pajek by using the following steps in the order specified in Table 1.

The choice of “Main Path > Global Search > Standard”, for example, leads to the extraction of the subnetwork with the main path; this subnetwork is selected as the active network. The main path can then be drawn and/or further analyzed.
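For readers who wish to see the logic behind a global main path outside Pajek, the following sketch computes search path count (SPC) weights on an acyclic citation network and then selects the source-to-sink path with the largest summed weight. The choice of SPC as the traversal weight, the arc direction (from cited to citing document), and all function names are assumptions for this illustration; Pajek offers several weighting and search options, and this sketch is not a reimplementation of its routines.

```python
from collections import defaultdict


def topological_order(nodes, succ):
    """Order nodes so that every arc points forward; fail on cycles."""
    indegree = {v: 0 for v in nodes}
    for u in nodes:
        for v in succ[u]:
            indegree[v] += 1
    order, stack = [], [v for v in nodes if indegree[v] == 0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                stack.append(v)
    if len(order) != len(nodes):
        raise ValueError("network contains a cycle; make it acyclic first")
    return order


def global_main_path(arcs):
    """Return (path, spc) for arcs given as (cited, citing) pairs."""
    arcs = sorted(set(arcs))
    succ, pred, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in arcs:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))
    order = topological_order(sorted(nodes), succ)
    # Number of paths from any source to each node.
    from_source = {}
    for v in order:
        from_source[v] = sum(from_source[u] for u in pred[v]) if pred[v] else 1
    # Number of paths from each node to any sink.
    to_sink = {}
    for u in reversed(order):
        to_sink[u] = sum(to_sink[v] for v in succ[u]) if succ[u] else 1
    # SPC of an arc: source-to-sink paths that traverse it.
    spc = {(u, v): from_source[u] * to_sink[v] for u, v in arcs}
    # Dynamic programming for the path with the largest total SPC.
    best, nxt = {}, {}
    for u in reversed(order):
        best[u], nxt[u] = 0, None
        for v in succ[u]:
            score = spc[(u, v)] + best[v]
            if score > best[u]:
                best[u], nxt[u] = score, v
    start = max((v for v in order if not pred[v]), key=lambda v: best[v])
    path = [start]
    while nxt[path[-1]] is not None:
        path.append(nxt[path[-1]])
    return path, spc


toy_arcs = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("E", "C")]
path, spc = global_main_path(toy_arcs)
```

In the toy network, the arc from "C" to "D" lies on all three source-to-sink paths and therefore carries the largest weight; ties between equally weighted paths are broken arbitrarily in this sketch.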

Note that the cited references are not disambiguated by these routines, but are used as they appear on the input file. The user may wish to disambiguate the references before entering this routine; for example, by using CRExplorer.EXE at http://www.crexplorer.net (Thor et al. 2016).

Results

Descriptive

Figure 1 shows the number of documents in the set over time and the development of the ratio of citations per publication (c/p). As noted, the research program under study was triggered by a paper in 1992 (Hardy and Higgins 1992). However, there are 11 papers in the set with publication dates in 1991 predating this formulation. In the first decade, the number of publications shows exponential growth; but over the full time span linear growth prevails. In other words, this line of research is no longer booming, but since around 2000 can be considered as “normal science.” The c/p ratio declines linearly with the subsequently shorter citation windows for more recent papers. However, the decline in this ratio may also indicate a diminishing attractiveness of this line of research (Hardy and Selkoe 2002). The sharp decline in the number of publications in the most recent years confirms this inference (Selkoe and Hardy 2016). Recently, Herrup (2015) concluded “that the time has come to face our fears and reject the amyloid cascade hypothesis,” albeit at the moment without an alternative explanation of Alzheimer’s Disease.

Fig. 1 Number of yearly papers (diamond) and citations per paper ratio (triangle) over time

Table 2 shows that the number of cited references in the papers under study (176,670) is almost three times the number of MeSH terms attributed (62,648). In terms of unique cited references (67,831) versus unique MeSH terms (3532), the imbalance is even larger (a ratio of about 19 to 1). On a map, the citations would completely overshadow the MeSH terms. However, the number of referenced journals (5345) is of the same order of magnitude as the number of unique MeSH terms.

Table 2 Some descriptive statistics of the data under study

                           PubMed/MEDLINE    WoS
N of documents             3558              3416
MeSH references            62,648            -
Unique MeSH terms          3532              -
Cited references           -                 176,670
Unique cited references    -                 67,831
Referenced journals        -                 5345

Figure 2 provides a map which can be generated using the 2-mode matrix of 5345 abbreviated journal names in the references (red) versus 3482 MeSH terms (green).5 (To generate this figure, the file jcr_mh.net was input into Pajek and from there into VOSviewer for the visualization). The figure shows the very central position of the Journal of Neuroscience among the references. Although there are more unique references to journals than to MeSH, their concentration indicates that the red-colored journals form a backbone structure with the MeSH terms spreading out as variations. This is the dominant structure in this data: the journals provide a core structure and the MeSH terms the variation. The journals are more concentrated than the MeSH terms (Table 3): the Gini coefficient of the journal distribution is 0.937 while it is 0.852 for the distribution of MeSH.
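The concentration measure used in this comparison can be illustrated with a short sketch of the Gini coefficient for a frequency distribution (for example, the number of occurrences of each journal or each MeSH term). The estimator below is one common formulation based on sorted values; whether the reported values of 0.937 and 0.852 were computed with exactly this formula is an assumption, and the function name is hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative frequencies (0 = perfectly even).

    Uses G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with the values
    sorted in ascending order and ranks i starting at 1.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


even = gini([5, 5, 5, 5])        # all items occur equally often
skewed = gini([0, 0, 0, 20])     # one item accounts for everything
```

With this estimator, a perfectly even distribution yields 0 and the maximum for n items is (n - 1) / n, so values close to 1 indicate that a few items account for most occurrences.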

Note to Table 3: While “Mice, transgenic” and “Amyloid beta-Protein Precursor” were both part of the original search string, the search also retrieves records with MeSH subsumed under these categories: these are “Mice, knockout” (333 times) and “Amyloid beta-Peptides” (2492 times), respectively.

Analysis and decomposition

Whereas multivariate analysis (e.g., factor analysis) is constrained by the capacity of systems and software, the new decomposition algorithms enable us to decompose large and even very large matrices. The above matrix (Fig. 2), for example, can robustly be decomposed into five clusters using the algorithm of Blondel et al. (2008); the modularity of the network is low (Q = 0.066).6 Figure 3, for example, shows the fourth component consisting of 598 cited journals versus 326 MeSH terms focusing on techniques such as neuro-imaging. This cluster can be further subdivided into nine components (Q = 0.375).
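To make the reported Q values easier to interpret, the sketch below computes the modularity of a given partition of an undirected weighted network, using the standard formulation in which each community contributes its share of internal edge weight minus the share expected from its total degree. It only evaluates a partition; it does not search for one as the algorithm of Blondel et al. (2008) does. The function name and input layout are assumptions for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected weighted network.

    edges: iterable of (u, v, weight), each undirected edge listed once.
    community: dict mapping every node to a community label.
    Q = sum over communities of  L_c / m - (d_c / (2 m)) ** 2,
    with L_c the internal weight, d_c the summed degree, m the total weight.
    """
    m = 0.0
    internal = defaultdict(float)
    degree = defaultdict(float)
    for u, v, w in edges:
        m += w
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    if m == 0:
        raise ValueError("network has no edges")
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two disconnected triangles, each assigned to its own community.
triangles = [(1, 2, 1), (2, 3, 1), (1, 3, 1), (4, 5, 1), (5, 6, 1), (4, 6, 1)]
q_split = modularity(triangles, {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"})
q_single = modularity(triangles, {n: "a" for n in range(1, 7)})
```

For the two disconnected triangles, the split partition gives Q = 0.5, whereas placing all nodes in one community gives Q = 0; a low value such as 0.066 thus indicates that the clusters are only weakly separated.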

The file jcr_mh_a.net organizes the same information as a matrix of the 3558 documents under study as the cases and both the MeSH terms and abbreviated journal titles as variables in the columns. Using this file, one can normalize the variables or proceed to multivariate analysis. After normalization using the Jaccard index—available in UCInet—the highly centralized structure, indeed, has disappeared. The resulting 1-mode similarity matrix can be decomposed into approximately 70 components by the algorithm of Blondel et al. (2008) and into 61 by the algorithm of VOSviewer (Waltman and van Eck 2012). The modularity is an order of magnitude larger than in the previous case (Q = 0.577). After this normalization, however, journal names come even more to the fore on the map (Fig. 4),7 indicating their structural role in this information.
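The normalization step can be sketched as follows: for each pair of variables (MeSH terms or abbreviated journal names), the Jaccard index is the number of documents carrying both divided by the number of documents carrying at least one of them. The function name and the dictionary input are assumptions for this example; the analysis reported above used UCInet, and the sketch is only meant to show what the index measures.

```python
from itertools import combinations


def jaccard_similarities(doc_attrs):
    """Jaccard index for every pair of attributes that co-occur at least once.

    doc_attrs: dict mapping a document id to its attributes (MeSH terms
    and/or abbreviated journal names). Returns {(a, b): similarity} with
    a < b in sort order; pairs without shared documents are omitted.
    """
    docs_by_attr = {}
    for doc_id, attrs in doc_attrs.items():
        for attr in set(attrs):
            docs_by_attr.setdefault(attr, set()).add(doc_id)
    sims = {}
    for a, b in combinations(sorted(docs_by_attr), 2):
        shared = len(docs_by_attr[a] & docs_by_attr[b])
        if shared:
            union = len(docs_by_attr[a] | docs_by_attr[b])
            sims[(a, b)] = shared / union
    return sims


docs = {
    "d1": ["J NEUROSCI", "Mice, Transgenic"],
    "d2": ["J NEUROSCI"],
    "d3": ["Mice, Transgenic", "NATURE"],
}
sims = jaccard_similarities(docs)
```

The pairwise loop is quadratic in the number of variables, so for several thousand variables a sparse-matrix implementation would be preferable; the sketch favors readability.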

In summary, the abbreviated journal names in the references provide us with far greater access to the structure in the matrix than do the MeSH terms. Referenced journals reflect the archival knowledge base on which the new knowledge claims build, whereas MeSH terms position papers as variation (including novelty; Boudreau et al. 2016) at the research front. The MeSH terms are attributed from the perspective of hindsight. In other words, the MeSH classification which operates at the paper level may be less suited for the normalization of citations than journals or journal categories, which can reveal archival structures.

Main-path analysis

As noted, CitNetw.exe also generates a file “lcs.net” containing the bounded network of the papers under study with “local citation scores” (Garfield et al. 2003). Using the instructions provided in Appendix 2, one can generate a main path using Pajek. Figure 5, for example, shows the so-called “key-route main path” as the most recommended option for this analysis (Liu and Lu 2012). Forty of the 3416 documents downloaded from WoS (or slightly more than 1%) are located on this main path.8

Fig. 5 Forty papers on the so-called “key route global main path” in the citations among the 3416 WoS documents under study. Decomposition using the Louvain algorithm in Pajek (Blondel et al. 2008; Q = 0.757); layout using Kamada and Kawai (1989)

It is beyond the scope of this paper to compare these results with other options for main-path or critical path analysis (Batagelj 2003; Hummon and Doreian 1989). A review of the various options is provided by Liu and Lu (2012), who suggest that a combination of the results of several algorithms into an integrated model can improve the quality of the main-path analysis (cf. Lucio-Arias and Leydesdorff 2008). The resulting main path can be further analyzed as a Pajek file; for example, the colors in Fig. 5 show the results of decomposition using the algorithm of Blondel et al. (2008).

The generation of a main path of forty articles for a line of investigation encompassing approximately 3500 papers is appealing due to the reduction by two orders of magnitude in the amount one would need to read to obtain an understanding of this subfield. However, a main path remains an algorithmic construct that one can use heuristically, but that otherwise requires validation. For example, the paper by Kawabata et al. (1991) published in December 1991 in Nature was retracted on March 19, 1992. This paper received 16 citations by other papers on the main path, thirteen of them in the years after the retraction. From an intellectual perspective, one might consider removing this article from the pool of candidate nodes before regenerating the main path.

The two main scientific awards within the field of AD research are the “Potamkin Prize for Research in Pick’s, Alzheimer’s, and Related Diseases” and the “MetLife Foundation Award for Medical Research in Alzheimer’s Disease.” Both prizes have been awarded since the late 1980s, thus capturing in full the time period of our analysis. Forty investigators have won both awards. The main path (as depicted in Fig. 5) includes one or more papers from twelve of these authors.

Conclusions

We have developed two routines that enable the researcher to generate matrices of citing versus cited documents and/or citing documents versus MeSH terms. The data from WoS and PubMed/Medline was integrated using the PubMed Identifier (PMID). Since the number of citing documents is (almost) the same in both cases, the two matrices can also be juxtaposed and then merged so that combinations of citations and MeSH terms can be analyzed. These combinations can perhaps be considered as hybrid indicators (e.g., Braam et al. 1991).
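The juxtaposition of the two matrices on the basis of the PMID can be sketched as follows: documents present in both sources form the rows, and the MeSH columns are placed next to the cited-reference columns. The function name, the column prefixes "MH" and "CR", and the dictionary inputs are assumptions for this illustration.

```python
def juxtapose(mesh_by_pmid, refs_by_pmid):
    """Build one binary matrix with MeSH and cited references as columns.

    Both inputs map a PMID to a list of attributes. Only PMIDs present in
    both sources are kept. Returns (pmids, columns, rows).
    """
    pmids = sorted(mesh_by_pmid.keys() & refs_by_pmid.keys())
    mesh_cols = sorted({t for p in pmids for t in mesh_by_pmid[p]})
    ref_cols = sorted({r for p in pmids for r in refs_by_pmid[p]})
    # Prefix the columns so that the two kinds of variables stay distinct.
    columns = [("MH", t) for t in mesh_cols] + [("CR", r) for r in ref_cols]
    rows = []
    for p in pmids:
        mesh, refs = set(mesh_by_pmid[p]), set(refs_by_pmid[p])
        rows.append(
            [int(t in mesh) for t in mesh_cols] + [int(r in refs) for r in ref_cols]
        )
    return pmids, columns, rows


mesh = {"111": ["Alzheimer Disease", "Animals"], "222": ["Animals"], "333": ["Mice"]}
refs = {"111": ["REF A"], "222": ["REF A", "REF B"]}
pmids, columns, rows = juxtapose(mesh, refs)
```

In the toy input, PMID "333" has no citation record and is dropped, which mirrors the difference between the 3558 records retrieved from PubMed/MEDLINE and the 3416 records found in the WoS Core Collection.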

Aggregation of the cited references at the journal level reduces the number of variables by orders of magnitude; the resulting numbers are comparable to the numbers of MeSH categories attributed. Further analysis leads to the conclusion that the abbreviated journal names in the cited references indicate a core structure of the set,9 whereas the MeSH terms are attributed with regard to their relevance to current research options. This classification therefore seems less suited for carrying the normalization of citations than journals or journal groups.

In the context of this study, main-path analysis provides another example of the research potential of organizing the data into primary matrices extracted from downloads of PubMed and WoS. As a perspective for further research, Hellsten and Leydesdorff (2016), for example, analyze translational research in medicine in terms of combinations of MeSH terms, institutional addresses, and journal names. By considering these and other (meta-)data as attributes of documents, one can merge matrices and combine dimensions in the data as we have done above for cited references and MeSH terms, but also beyond two dimensions in terms of n-mode arrays and therefore heterogeneous networks (Callon and Latour 1981; Law 1986).

Footnotes

The National Library of Medicine of the United States (NLM) has constantly received substantial funding to maintain and update its biomedical and health information services—for example, the 2015 budget for these services was $117 Million (National Library of Medicine, 2015). This has enabled a relatively uniform application of the MeSH classification to publications by indexers over many years (Hicks and Wang 2011, at p. 292; Petersen et al. 2016).

One can also use this string for computing the Relative Citation Ratios at https://icite.od.nih.gov/analysis (Hutchins et al. 2016). However, this facility currently has a limit of 200 PubMed identifiers.

Appendix 1

The two routines CitNetw.EXE and MHNetw.EXE (available at http://www.leydesdorff.net/software/citnetw/) can be used for making complete matrices at the article level in the Pajek and SPSS formats for the analysis of citations and medical subject headings, respectively. The two matrices can also be combined.

On the basis of a download of Web-of-Science data, CitNetw.EXE can generate the citation matrix with the citing papers in the rows and cited references in the columns in the following formats: (1) mtrx.net in the Pajek format and (2) mtrx.sps + mtrx.txt for SPSS. The matrix is binary, asymmetrical, 2-mode, and directed. (If so wished, one can transpose this matrix in Pajek or SPSS.) One can process the file “mtrx.net” further in Pajek, UCInet, or Gephi, etc. The file lcs.net (output of CitNetw.Exe) contains the bounded network of citations among the documents under study. This file can be used, for example, for main path analysis (see Appendix 2).

Input to both routines is a file “data.txt” containing downloads from WoS and Medline, respectively, in the “plain text” or “Medline” format (tagged). This file is first processed into a format for relational database management. (One is prompted for skipping this reorganization if it was already done in a previous round.) If one wishes to combine the outputs of the two routines, the files mtrx.* should first be saved and stored elsewhere, since these files are overwritten in subsequent runs.

The objective of using MHNetw.EXE is to combine Medical Subject Headings (MeSH) and citation information at the article level. The MeSH are first retrieved from the PubMed database and can be organized into relational data using the routine pubmed.exe at http://www.leydesdorff.net/pubmed. Note that one also needs the file “pubmed.dbf” to be present in the same folder as the data and pubmed.exe. Alternatively, one can retrieve the data from Medline in WoS. The advantage of retrieval from PubMed over retrieval from WoS is that there is no limit of 500 records per download. The data from either source first has to be organized in the same folder using PubMed.Exe. The program prompts with a question about either source. Input data must always be named “data.txt”.

Output of MHNetw.exe is:

mtrx.net (Pajek) and mtrx.sps (for SPSS) containing the citing papers as rows and the MeSH as variables in the columns (analogous to CitNetw.exe).

A file called “string.wos” which contains the search string for obtaining citation information at Web of Science (advanced search).

The citation scores are written into the file with article descriptors ti.dbf in a field “tc”; citation scores are summed for MeSH into mh1.dbf.

The file “string.wos” can be used to generate the corresponding file in the Science Citation indices of WoS; the file “string.pubmed” contains analogously the search string if one has worked from the WoS interface.

The file cr_mh.net contains the citation information (cited references, CR) in the rows and the medical subject headings (MH) in the columns. The cell values provide the number of documents in which cited references and MeSH co-occur.

The file jcr_mh.net contains the abbreviated journal names in the cited references (CR) in the rows and the medical subject headings (MH) in the columns. The cell values provide the number of documents in which the cited journals and MeSH co-occur.
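The reduction of a cited reference to its abbreviated journal name can be sketched with a simple parser that takes the field following the publication year in the WoS format. The function name and the rule for skipping volume or page fields are assumptions for this example; cited references in WoS are not always complete or regular, and the routine described above may handle such cases differently.

```python
import re

_YEAR = re.compile(r"^(1[5-9]|20)\d{2}$")       # four-digit year, 1500-2099
_VOL_OR_PAGE = re.compile(r"^[VP]\d+$", re.I)   # e.g. 'V8' or 'P1234'


def cited_journal(cited_ref):
    """Return the abbreviated source title of a cited reference, or None.

    Expects the format 'Name Initial, year, source, Vvolume, Ppage'.
    """
    parts = [p.strip() for p in cited_ref.split(",")]
    for i, part in enumerate(parts[:-1]):
        if _YEAR.match(part):
            candidate = parts[i + 1]
            # A volume or page directly after the year means no source title.
            if not candidate or _VOL_OR_PAGE.match(candidate):
                return None
            return candidate.upper()
    return None


journal = cited_journal("Zhang CL, 2002, CLIN CANCER RES, V8, P1234")
```

Aggregating the cited references by the returned title yields the journal-level rows; references without a recognizable year or source title return None and can be set aside.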

The file jcr_mh_a.net contains the same information (abbreviated journal names and MeSH categories), but differently organized: both are attributed as variables to the documents under study as the cases. Within Pajek, one can convert this matrix into an affiliations matrix (using Network > 2-Mode Network > 2-Mode to 1-Mode > Columns). One can also export this file to SPSS for cosine-normalization of the matrix.

The asterisks in MeSH terms are discarded in this version. All files operate only on files present in the same folder. Note that mtrx.net, mtrx.txt, and mtrx.sps are overwritten in each run of MHNetw.exe or CitNetw.exe. One is advised to save all files mtrx.* elsewhere or to rename them for this reason.
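How MeSH terms can be read from a download in the MEDLINE (tagged) format, with the asterisks discarded, is sketched below. The sketch assumes the usual tagged layout in which a field tag is padded to four characters and followed by a hyphen and a space (for example, "PMID- " and "MH  - "). The function name is hypothetical, qualifiers after a slash are kept as they are, and continuation lines are ignored; none of this is meant to describe the internal processing of pubmed.exe or MHNetw.exe.

```python
def parse_medline_mesh(text):
    """Map each PMID to its MeSH entries, with major-topic asterisks removed.

    text: contents of a file in the MEDLINE (tagged) format.
    """
    mesh_by_pmid = {}
    pmid = None
    for line in text.splitlines():
        tag, separator, value = line[:4].strip(), line[4:6], line[6:].strip()
        if separator != "- ":
            continue  # blank line or continuation of a previous field
        if tag == "PMID":
            pmid = value
            mesh_by_pmid[pmid] = []
        elif tag == "MH" and pmid is not None:
            mesh_by_pmid[pmid].append(value.replace("*", ""))
    return mesh_by_pmid


sample = (
    "PMID- 111\n"
    "TI  - Invented title for illustration\n"
    "MH  - Alzheimer Disease/*genetics\n"
    "MH  - *Mice, Transgenic\n"
    "\n"
    "PMID- 222\n"
    "MH  - Animals\n"
)
parsed = parse_medline_mesh(sample)
```

The record numbers and the title in the sample are invented; only the tag layout matters for the illustration.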

We suggest the following order of the routines:

1. Download data at PubMed from the user interface at http://www.ncbi.nlm.nih.gov/pubmed/advanced. At the results page thereafter, select under “Send to” the format option MEDLINE and download to a file which has to be (re)named “data.txt”;

Use the resulting string “string.wos” at the advanced user interface of WoS; save the retrieval via “Marked list” in portions of 500 records. Combine the data into a single file data.txt.

4. Run CitNetw.EXE; save the citation matrices in the files mtrx.* elsewhere;

5. Run MHNetw.EXE; save the matrices that one wishes to use for further analysis. This analysis may take a long time.
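Combining the portions of 500 records into a single input file can be done by hand or with a few lines of code, as in the sketch below. The function name and the file names of the portions are hypothetical, and the sketch simply concatenates the files in the given order; whether repeated file headers in the combined file are tolerated by the routines is an assumption that should be checked on a small sample first.

```python
from pathlib import Path


def combine_downloads(parts, target="data.txt"):
    """Concatenate downloaded portions into one file and return its path.

    parts: paths of the saved portions, in the desired order.
    target: name of the combined file (the routines expect 'data.txt').
    """
    target = Path(target)
    with target.open("w", encoding="utf-8") as out:
        for part in parts:
            # 'utf-8-sig' drops a byte-order mark if the export contains one.
            text = Path(part).read_text(encoding="utf-8-sig")
            out.write(text if text.endswith("\n") else text + "\n")
    return target
```

A call such as combine_downloads(["part1.txt", "part2.txt"]) would write both portions to “data.txt” in the current folder; for the 3416 WoS records of this study, seven portions of at most 500 records would be needed.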

Appendix 2

Main path analysis

Alongside other files, CitNetw.EXE generates a file lcs.net containing the citations within the bounded domain of the document set(s) under study. (This domain corresponds to the so-called local citation scores (lcs) in HistCite™.) However, the cited references are not disambiguated, but used as they are provided by WoS. The user may wish to disambiguate the references before entering this routine (for example, by using CRExplorer.EXE). The cited references are matched against a string composed from the citing document using the WoS-format of the cited references “Name Initial, publication year, abbreviated journal title, volume number, and page number” as follows: “Zhang CL, 2002, CLIN CANCER RES, V8, P1234”.

The output file lcs.net contains a matrix with the citing documents in the rows and the cited ones in the columns. The matrix may be somewhat different from the one which one can obtain from using HistCite™ because of different matching and disambiguation procedures.

In order to proceed with main-path analysis in Pajek, the network has to be made a-cyclical (de Nooy et al. 2011, pp. 244f.). One can make the network a-cyclical within Pajek using the following steps in this order:

The subsequent choice among the options of Main Path for “> Global Search > Standard”, for example, leads to the extraction of the subnetwork with the main path; this subnetwork is selected as the active network. The main path can then be drawn and/or further analyzed.

References

Hicks, D., & Wang, J. (2011). Coverage and overlap of the new social science and humanities journal lists. Journal of the American Society for Information Science and Technology, 62(2), 284–294.

Leydesdorff, L., & de Nooy, W. (in press). Can “Hot Spots” in the sciences be mapped using the dynamics of aggregated journal-journal citation relations? Journal of the Association for Information Science and Technology. http://arxiv.org/abs/1502.00229.

Liu, J. S., & Lu, L. Y. (2012). An integrated approach for main path analysis: Development of the Hirsch index as an example. Journal of the American Society for Information Science and Technology, 63(3), 528–542.

Waltman, L., & van Eck, N. J. (2012). A new methodology for constructing a publication-level classification system of science. Journal of the American Society for Information Science and Technology, 63(12), 2378–2392.

Copyright information

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.