CiteSeerx (originally called CiteSeer) is a public search engine and digital library for scientific and academic papers, primarily in the fields of computer and information science. CiteSeer holds United States patent # 6289342, titled “Autonomous citation indexing and literature browsing using citation context,” granted on September 11, 2001. The inventors are Stephen R. Lawrence, C. Lee Giles, and Kurt D. Bollacker, and the patent is assigned to NEC Laboratories America, Inc. It was filed on May 20, 1998, with a priority date of January 5, 1998. A continuation patent on this invention, US Patent # 6738780, was filed on May 16, 2001 and granted to the same inventors on May 18, 2004; it is also assigned to NEC Labs. CiteSeer is considered a predecessor of academic search tools such as Google Scholar and Microsoft Academic Search.[citation needed] CiteSeer-like engines and archives usually harvest documents only from publicly available websites and do not crawl publisher websites. For this reason, authors whose documents are freely available are more likely to be represented in the index.

CiteSeer’s goal is to improve the dissemination of and access to academic and scientific literature. As a non-profit service that can be freely used by anyone, it has been considered part of the open access movement, which is attempting to change academic and scientific publishing to allow greater access to scientific literature. CiteSeer freely provided Open Archives Initiative metadata for all indexed documents and, when possible, links indexed documents to other sources of metadata such as DBLP and the ACM Portal. To promote open data, CiteSeerx shares its data for non-commercial purposes under a Creative Commons license.[1]

The name can be explained in at least two ways. As a pun, a ‘sightseer’ is a tourist who looks at the sights, so a ‘cite seer’ would be a researcher who looks at cited papers. Alternatively, a ‘seer’ is a prophet, so a ‘cite seer’ would be a prophet of citations. CiteSeer changed its name to ResearchIndex at one point and then changed it back.

CiteSeer had not been comprehensively updated since 2005 due to limitations in its architecture. It had a representative sampling of research documents in computer and information science, but its coverage was restricted to papers that were publicly available, usually on an author’s homepage, or that were submitted by an author. To overcome some of these limitations, a modular and open source architecture for CiteSeer was designed, named CiteSeerx.

CiteSeerx

CiteSeerx replaced CiteSeer, and all queries to CiteSeer were redirected to it. CiteSeerx is a public search engine, digital library, and repository for scientific and academic papers, with a primary focus on computer and information science.[2] More recently, CiteSeerx has been expanding into other scholarly domains such as economics and physics. Released in 2008, it was loosely based on the previous CiteSeer search engine and digital library, and is built with a new open source infrastructure, SeerSuite, and new algorithms and their implementations. It was developed by researchers Dr. Isaac Councill and Dr. C. Lee Giles at the College of Information Sciences and Technology, Pennsylvania State University. It continues to support the goals outlined by CiteSeer: to actively crawl and harvest academic and scientific documents on the public web, and to support querying by citations and ranking of documents by the impact of citations. Lee Giles, Prasenjit Mitra, Susan Gauch, Min-Yen Kan, Pradeep Teregowda, Juan Pablo Fernández Ramírez, Pucktada Treeratpituk, Jian Wu, Douglas Jordan, Steve Carman, Jack Carroll, Jim Jansen, and Shuyi Zheng are or have been actively involved in its development. Recently, a table search feature was introduced.[3] CiteSeerx has been funded by the National Science Foundation, NASA, and Microsoft Research.

CiteSeerx continues to be rated as one of the world’s top repositories and was rated number 1 in July 2010.[4] It currently has over 6 million documents with nearly 6 million unique authors and 120 million citations.

CiteSeerx also shares its software, data, databases and metadata with other researchers, currently via Amazon S3 and rsync.[5] Its new modular open source architecture and software (previously available on SourceForge but now on GitHub) are built on Apache Solr and other Apache and open source tools, which allows it to be a testbed for new algorithms in document harvesting, ranking, indexing, and information extraction.

CiteSeerx caches some PDF files that it has scanned. As such, each page includes a DMCA link which can be used to report copyright violations.[6]

Current features

Automated information extraction

CiteSeerx uses automated information extraction tools, usually built on machine learning methods, such as ParsCit, to extract scholarly document metadata such as title, authors, abstract, and citations. As a result, there are sometimes errors in authors and titles. Other academic search engines have similar errors.
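
To give a rough idea of what extracting metadata from a citation string involves, the following minimal sketch uses a hand-written regular expression. It is not ParsCit and not the CiteSeerx pipeline, which rely on machine learning rather than fixed rules; the function name `parse_citation` and the assumed citation layout ("Authors. Title. Venue, Year.") are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical pattern for one simple citation layout:
#   "Authors. Title. Venue, Year."
CITATION_RE = re.compile(
    r"^(?P<authors>.+?)\.\s+"   # everything up to the first period
    r"(?P<title>.+?)\.\s+"      # everything up to the next period
    r"(?P<venue>.+?),\s+"       # everything up to a comma before the year
    r"(?P<year>\d{4})\.?$"      # four-digit year, optional final period
)


def parse_citation(text):
    """Return a dict of metadata fields, or None if the layout does not match."""
    match = CITATION_RE.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Split the author list on commas and on the word "and".
    fields["authors"] = [
        name.strip()
        for name in re.split(r",\s*|\s+and\s+", fields["authors"])
        if name.strip()
    ]
    fields["year"] = int(fields["year"])
    return fields


record = parse_citation(
    "Jane Doe and John Roe. A study of things. Journal of Examples, 2001."
)
print(record)
```

A rule this simple breaks easily: an author written with initials, such as "J. Doe", would be split at the wrong period, which is the kind of author and title error that automated extraction can produce.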

Focused crawling

CiteSeerx crawls publicly available scholarly documents, primarily from author webpages and other open resources, and does not have access to publisher metadata. As a result, citation counts in CiteSeerx are usually lower than those in Google Scholar and Microsoft Academic Search, which have access to publisher metadata.
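
The general idea of focused crawling, visiting the most promising links first rather than all links in discovery order, can be sketched as follows. This is an illustrative toy, not the CiteSeerx crawler: the scoring heuristic `score_url`, the function `focused_crawl`, and the `get_links` callback (which stands in for fetching a page and extracting its links) are all assumptions made for this example.

```python
import heapq


def score_url(url):
    """Hypothetical relevance heuristic: higher means more likely scholarly."""
    score = 0
    if url.endswith(".pdf"):
        score += 2
    if any(hint in url for hint in ("publications", "papers", "~")):
        score += 1
    return score


def focused_crawl(seeds, get_links, max_pages=10):
    """Visit up to max_pages URLs, always taking the highest-scoring one next."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score_url(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score_url(link), link))
    return visited


# A tiny in-memory "web" so the sketch runs without network access.
TOY_WEB = {
    "http://example.edu/~alice/": [
        "http://example.edu/~alice/papers/a.pdf",
        "http://example.edu/news",
    ],
    "http://example.edu/news": ["http://example.edu/sports"],
}

order = focused_crawl(
    ["http://example.edu/~alice/"],
    lambda url: TOY_WEB.get(url, []),
    max_pages=3,
)
print(order)
```

In this toy run the PDF under the author's page is visited before the news page, because the heuristic scores it higher. A real crawler would also need politeness delays, robots.txt handling, and a learned or tuned relevance model.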

Usage

CiteSeerx has nearly 1 million users worldwide, based on unique IP addresses, and receives millions of hits daily. Annual downloads of document PDFs were nearly 200 million for 2015.

Data

CiteSeerx data is regularly shared with researchers worldwide under a Creative Commons BY-NC-SA license and has been used in many experiments and competitions.

Other SeerSuite-based search engines

The CiteSeer model was extended to cover academic documents in business with SmealSearch and in e-business with . However, these were not maintained by their sponsors. An older version of both of these could once be found at but is no longer in service.

Other Seer-like search and repository systems have been built for chemistry (ChemXSeer) and for archaeology (ArchSeer). Another, BotSeer, was built for robots.txt file search. All of these are built on the open source tool SeerSuite, which uses the open source indexer Lucene.

^ For example, “CiteSeerx – DMCA Notice”. CiteSeerX 10.1.1.604.4916. The document with the identifier “10.1.1.604.4916” has been removed due to a DMCA takedown notice. If you believe the removal has been in error, please contact us through the feedback page, along with the identifier mentioned in this page.