Received citations as a main SEO factor of Google Scholar results ranking

Summary

The aim of this article is to analyse the web positioning factors that can influence the order of results, by relevance, in Google Scholar, and then to evaluate the importance of received citations in this ordering process. A reverse engineering methodology was applied, in which the Google Scholar ranking was compared with another ranking based solely on the number of citations received by documents.

This investigation was conducted employing four types of searches without the use of keywords: by publication, year, author, and “cited by”. The results were consistent across the four samples, with correlation coefficients between the two rankings exceeding 0.9. The present study demonstrates more clearly than previous research that citations are the most relevant off-page feature in the ranking of search results on Google Scholar. The other features have minimal influence. This information provides a solid basis for the academic search engine optimisation (ASEO) discipline. We also developed a new analysis procedure for isolating off-page features that might be of practical use in forthcoming investigations.

1. Introduction

Search engine optimisation (SEO) is the process employed to optimise websites and their content to place them in favourable positions in search engine results (Enge; Spencer; Stricchiola, 2015). SEO is also a well-established profession within the new industry of digital communication, as shown by the existence of a wide range of monographs, professional publications and academic work. Its purpose is to highlight and strengthen the quality of documents to increase their visibility to the algorithms that establish the ranking positions in search engines, particularly Google. This goal must be achieved without falsifying the characteristics of documents, i.e., without employing fraudulent means.

Google Search results pages are ordered by relevance (Google, 2017). According to Google, this relevance criterion is calculated based on more than 200 features. Google does not specify these features or their specific weight; it merely discloses partial and general information, including that the quality of the content and backlinks are the two predominant factors (Ratcliff, 2016; Schwartz, 2016). The reason provided by Google for this lack of transparency is to fight against spam (Beel; Gipp, 2010). If all of the details of ranking factors were made available, spammers could more easily place low-quality documents in favourable positions. Nevertheless, this black box policy works to the detriment of SEO professionals who conduct their activities ethically and whose work is hindered by a lack of reliable information.

Some SEO companies (Gielen; Rosen, 2016; Localseoguide, 2016; MOZ, 2015; Searchmetrics, 2016) conduct reverse engineering research to measure the impact of the factors involved in Google’s positioning process. In this research, many searches have been analysed to identify positioning factors based on the characteristics of pages placed in the first positions. Due to the great number of factors involved in the process of positioning, it is extremely difficult to establish the factors that are truly relevant and the extent to which they influence the final positioning of documents. In addition, Google’s positioning process is highly dynamic, with the algorithm undergoing dozens of changes per year (MOZ, 2017).

In recent years, SEO has been applied to academic search engines. This new process is known as Academic SEO (ASEO) (Beel; Gipp, 2009b, 2010; Codina, 2016; Martín-Martín et al., 2016a; Muñoz-Martín, 2015). Scholars are placing increasing emphasis on enhancing the visibility of their articles in academic search engines. Articles appearing in the leading positions are more visible, which increases the probability of their being read and cited and, as a consequence, makes them more likely to improve the personal h-indices of their authors (Farhadi et al., 2013).

In many cases, the same optimisation procedures used successfully on Google Search are being applied to Google Scholar. However, Google Scholar has its own algorithm. Few studies have addressed the specific ordering factors employed by Google Scholar; among them are Beel and Gipp (2009b; 2009c; 2010); Beel, Gipp and Wilde (2010); Martín-Martín et al. (2014; 2017); and Orduña-Malea et al. (2016).

The purpose of the present study was to analyse the features of documents that can influence relevance rankings in Google Scholar. We are particularly interested in the citations received by documents, and we aimed to assess the influence of the number of citations received on the ranking algorithm. The number of times that a document is cited is a key feature for characterising what is specific to the Google Scholar ranking process.

We believe that the influence of citations is much greater than authors and publishers might assume. For example, the instructions to authors from academic journals provide guidelines regarding how to improve their ranking positions in Google Scholar (Elsevier, 2012; Wiley, 2015; Emerald Publishing Limited, 2017). In these guides, the citations received are either not mentioned or not given the importance that they deserve.

This article reports the findings of a reverse engineering study that used a new method of analysis. This method allows us to isolate some factors of the positioning algorithm, specifically those depending on elements external to the ranked pages. In this manner, we could focus the study on a small set of factors with greater control. Our hypothesis is that if we compare a ranking based only on the number of citations received with the standard Google Scholar ranking, in searches in which only external factors participate, then we can identify the weight of the citations within this set of external factors. If the two compared rankings are similar, then the citations carry significant weight.
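The comparison described in this hypothesis can be sketched in a few lines of Python. The sketch is illustrative only and is not the code used in the study: it assumes Spearman's rank correlation as the similarity measure (the text above speaks only of comparing the two rankings), and the function names and sample citation counts are hypothetical.

```python
from statistics import mean


def average_ranks(values, descending=False):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next item ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def citation_rank_correlation(citations_in_result_order):
    """Spearman's rho between result position and citation-count rank.

    The input lists citation counts in the order the search engine
    returned the documents (position 1 first).
    """
    n = len(citations_in_result_order)
    scholar_rank = list(range(1, n + 1))
    # Rank 1 goes to the most cited document.
    citation_rank = average_ranks(citations_in_result_order, descending=True)
    return pearson(scholar_rank, citation_rank)


# Hypothetical citation counts for ten results of a keyword-less search.
sample = [950, 720, 810, 300, 310, 120, 95, 95, 40, 12]
rho = citation_rank_correlation(sample)
print(f"Spearman's rho: {rho:.3f}")
```

A coefficient close to 1 indicates that the order of results closely follows the citation counts, which under the hypothesis would mean that citations carry most of the weight among the external factors. A coefficient near 0 would indicate that other factors dominate.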

This new methodology is possible because of Google Scholar’s advanced search form, which allows users to restrict the search fields to the author, year and source. Only external factors participate in these types of searches in which there are no keywords. In this way, the results obtained herein are far more reliable than those of previous studies using reverse engineering on Google Search without this control of variables.

2. Related works

Google Scholar has become an alternative to classic scientific citation indexing services, such as Web of Science (WoS) or Scopus. The positions of these commercial indexing services in the market could be jeopardised if Google Scholar offers a free product of similar quality. For this reason, Google Scholar has been analysed using several approaches.

Limited research regarding the process of information retrieval and search effectiveness has, however, been conducted (Jamali; Asadi, 2010; Walters, 2008). Few works about the intervening factors in ranking algorithms according to relevance have been published (Beel; Gipp, 2009a; 2009b; 2009c; Beel; Gipp; Wilde, 2010).

Unlike the process of positioning in Google Search, that used in Google Scholar has aroused little scientific interest, which is somewhat unexpected considering that it influences the articles that are read. It is widely acknowledged that the first items appearing on a search result list receive more attention from users than subsequent items do (Marcos; González-Caro, 2010). A better position in the ranking implies better chances of being found and read.

Some conclusions can be drawn from the existing works regarding relevance rankings in Google Scholar:

The keywords used in the search must appear in the document’s title to enable favourable positioning of the document (Beel; Gipp, 2009a);

The frequency of keywords in the text of the document does not appear to be a determining factor in establishing its ranking order (Beel; Gipp, 2009a);

Recent articles are more highly ranked than older articles (Beel; Gipp, 2009a) to compensate for the Matthew effect (Merton, 1968): articles with many citations tend to be ranked first; therefore, these articles have more readers and more citations and consequently consolidate their positions at the top (Martín-Martín et al., 2016b); and

The number of citations received is a determining factor in establishing the ranking order by relevance (Beel; Gipp, 2009c; Martín-Martín et al., 2014).

The last of these conclusions is particularly relevant to the present study. However, these investigations have some limitations. In Beel and Gipp (2009c), all SEO features were analysed together; therefore, the variables related to on-page features were not blocked, and the results are not sufficiently clear.

In Martín-Martín et al. (2014), only searches by year were used. The central aim of the present research was to corroborate the conclusion that citations are a determining factor by applying a methodology that establishes stricter control over variables. This methodology allowed us to obtain an accurate insight into the relevance of received citations in relation to all external features of the ranking algorithm in Google Scholar.