This paper has two main objectives:

1. Verify whether it is possible to reliably identify the most highly-cited papers in Google Scholar.
2. Indirectly, validate empirically whether citations are the primary result-ordering criterion in Google Scholar for generic queries, or whether other factors substantially influence the rank order.

METHODOLOGY

Sample

64,000 documents published between 1950 and 2013 (1,000 per year)

Design

We ran a generic query by conducting a null query (the search box is left blank), filtering only by publication year with Google Scholar’s advanced search function. In this way, we avoided the sampling bias caused by the keywords of a specific query and by other academic search engine optimisation issues. To work with a sufficiently large data sample, we carried out a longitudinal analysis by performing 64 generic null queries covering 1950 to 2013 (one query per year). 2013 was the last complete year available when the data were collected, and 1950 was selected because coverage for that year increased in comparison to the preceding years.
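The sampling plan described above can be sketched as a simple list of per-year query descriptions. This is only an illustration: the dictionary keys (`keywords`, `year_from`, `year_to`) and the function name `build_query_plan` are hypothetical, and the sketch sends no requests to Google Scholar. It merely checks that one null query per year, with 1,000 results each, gives the stated sample size.

```python
# Hypothetical sketch of the sampling plan; no requests are sent to Google Scholar.
FIRST_YEAR = 1950
LAST_YEAR = 2013
RESULTS_PER_QUERY = 1000  # documents collected per yearly query (see Sample)


def build_query_plan(first_year=FIRST_YEAR, last_year=LAST_YEAR):
    """Return one null-query description per publication year.

    The empty keyword string stands for the blank search box; the only
    filter is the publication year. Key names are illustrative assumptions.
    """
    return [
        {"keywords": "", "year_from": year, "year_to": year}
        for year in range(first_year, last_year + 1)
    ]


plan = build_query_plan()
expected_sample_size = len(plan) * RESULTS_PER_QUERY
print(len(plan), expected_sample_size)  # prints: 64 64000
```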

Period analyzed

1950-2013

RESULTS

- The overall correlation between the number of citations received by the 64,000 documents and the position they occupied on the Google Scholar results page at the time of the query is r = −0.67 (p < 0.05). The average annual value of the correlation coefficient is very high (the correlation is negative because position 1 is better than position 1000) (Fig. 1).

- The correlation for the results placed amongst the top 900 positions is r = 0.97 (p < 0.01). However, the correlation obtained for results in the last 100 positions is only r = 0.61 (p < 0.01). The results located in the first 900 positions of each search are displayed in green, while the results in the last 100 positions are shown in red (Fig. 2). This shows clearly that, until approximately the 900th position, the Google Scholar sorting criteria are based largely on the number of citations received by each result. After approximately the 900th position, however, the relationship between citations and position becomes erratic (Fig. 2).

- The correlation between the position of a document and its number of versions is low but significant (r = −0.30; p < 0.01). The average correlation per year is slightly higher (r = −0.33; p = 0.04). Despite the wide dispersion of the data, there is a slight concentration of documents with between 100 and 300 versions amongst the first 100 rank positions (Fig. 3).

- The annual average percentage of documents in English amongst the results in the first 100 positions is 99.5%. The presence of documents in other languages within this range is therefore unusual. For the documents in the last 100 positions, the picture changes significantly: the annual average drops to 34.2% (Fig. 4).
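The segment-wise comparison reported above (top 900 positions versus last 100) can be illustrated with a minimal sketch. Several assumptions apply: the summary does not say which correlation coefficient was used, so Pearson's r is assumed here; the helper names `pearson_r` and `correlation_by_segment` are hypothetical; and the data are synthetic, built only to mimic the reported pattern (citations falling steadily with position up to rank 900, then behaving erratically). The numbers it produces are not the study's results.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_by_segment(positions, citations, cutoff=900):
    """Correlate position with citations separately up to and after `cutoff`."""
    pairs = list(zip(positions, citations))
    top = [(p, c) for p, c in pairs if p <= cutoff]
    tail = [(p, c) for p, c in pairs if p > cutoff]
    r_top = pearson_r([p for p, _ in top], [c for _, c in top])
    r_tail = pearson_r([p for p, _ in tail], [c for _, c in tail])
    return r_top, r_tail


# Synthetic data for one yearly query (NOT the study's data):
# citations decrease linearly up to rank 900, then follow an erratic pattern.
positions = list(range(1, 1001))
citations = [2000 - 2 * p if p <= 900 else (p * 37) % 101 for p in positions]

r_top, r_tail = correlation_by_segment(positions, citations)
print(f"top 900: r = {r_top:.2f}; last 100: r = {r_tail:.2f}")
```

With position 1 as the best rank, a strong citation-driven ordering appears as a strongly negative coefficient in the top segment, while the erratic tail gives a coefficient of smaller magnitude.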

CONCLUSIONS

A significant and high correlation between the number of citations and the ranking of the documents retrieved by Google Scholar was obtained for a generic query filtered only by year. The fact that we minimised the effects of academic search engine optimisation, together with the size of the sample analysed (64,000 documents), leads us to conclude that the number of citations is a key factor in the ranking of the results and, therefore, that Google Scholar is able to identify highly-cited papers effectively. Given the unique coverage of Google Scholar (no restrictions on document type and source), this makes it an invaluable tool for bibliometric analysis.

What this study adds

Google Scholar can be used to reliably identify the most highly-cited academic documents. Given its wide and varied coverage, Google Scholar has become a useful complementary tool for bibliometric research concerned with identifying the most influential scientific works.