Dear Participants, the competition organizers have decided to add a $10,000 intermediate prize, in addition to the original $30,000 prize pool, to reward teams that have committed effort by the end of September 29 (23:59 UTC). The prizes are: 1st place, $3,000; 2nd and 3rd places, $1,500 per team; 4th to 7th places, $1,000 per team. Please note that this prize is awarded according to the public leaderboard as of September 29 (23:59 UTC).

Due to this problem, we have decided to extend the deadline to 24:00, Oct 7, UTC. We will release the final test set at 0:00, Oct 8, UTC. All participants must submit their results (up to 5 times) on the final test set before 23:59, Oct 8, UTC. Accordingly, the deadline for changing teams is now 0:00, Oct, UTC.

Thank you for your understanding and good luck with the competition!

Introduction to Open Academic Data Challenge 2017

Academic data has grown exponentially in recent years: the total number of academic papers worldwide has exceeded 300 million, and the number of academic researchers has reached 100 million. However, only about 3% of all academic data contains semantic annotations. This severe lack of semantic annotation greatly restricts the service capacity of academic big data and its industrial development. Open Academic Data Challenge 2017 is hosted against this backdrop and is committed to increasing the semantic annotation information in academic databases.

Hosted by Tsinghua University, Microsoft Research, the Knowledge Center of Chinese Academy of Engineering and the National Science Library of Chinese Academy of Sciences, and co-organized by Tsinghua Big Data Industries Association and IEEE Computer Society, Open Academic Data Challenge 2017 aims to create accurate academic profiles by mining descriptions of scholars, their research interests and their academic influence, and to explore cutting-edge academic profiling techniques.

Based on the datasets provided by AMiner.org, a renowned academic data mining system, and Microsoft Academic Graph, participants are required to extract scholars’ personal descriptions, analyze their research interests, and predict the citation counts of their papers, so as to better provide information about related experts, assess their research results, monitor scientific research progress, and present academic development trends to the academic community.

Task 1: Extract a scholar’s profile information
Each scholar’s profile information consists of his or her homepage address, gender, position, etc. As Internet usage becomes increasingly widespread, web pages associated with scholars are far more numerous and far more complicated than before, and they normally contain a large amount of redundant information. A potentially effective scholar profiling technique is to integrate data about the scholar from various sources on the Internet and build a machine learning model to obtain the scholar’s accurate information.

Task 2: Predict the scholar’s research interests labels
Research interests are an important part of a scholar’s profile. They not only indicate the scholar’s own research experience or research direction, but also provide insight into how scholars from different backgrounds attend and respond to the hot spots of a research field or the research trends of a discipline. As in the first task, participants can determine a scholar’s research interests by integrating the large amount of information available from multiple sources on the Internet.

Task 3: Predict the scholar’s future influence
Academic influence is a measure of a scholar’s impact in the field of professional theory and technology. Commonly used indexes for evaluating academic influence include paper citation count, journal impact factor and the h-index, among which paper citation count is one of the most important and direct indicators. In this task, participants are asked to predict the total number of citations of a scholar’s papers over a given future period, based on the current relevant academic data about the scholar.
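As a small illustration of two of the indicators named above, the sketch below computes a total citation count and an h-index from a list of per-paper citation counts. The function names and the sample numbers are made up for illustration and are not part of the challenge data or tooling.

```python
from typing import List


def total_citations(citations: List[int]) -> int:
    """Sum of citation counts over all of a scholar's papers."""
    return sum(citations)


def h_index(citations: List[int]) -> int:
    """Largest h such that at least h papers have at least h citations each."""
    h = 0
    # Walk the papers from most cited to least cited.
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Made-up citation counts for one hypothetical scholar.
paper_citations = [25, 8, 5, 3, 3, 0]
print(total_citations(paper_citations))  # 44
print(h_index(paper_citations))          # 3
```

Task 3 targets the first quantity (the total citation count); the h-index is shown only for comparison.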

Task Description

Each team is required to complete the following three tasks by using given data of scholars:

Task 1: Extract a scholar’s profile information
A scholar’s name and organization are given, together with the URL of the cached first page that Google returns for the keywords “the scholar’s name + organization name” (a static page, typically displaying 10 search results). Participants have access to the links for all 10 search results and to the links on those pages. In this task, participants are required to extract the URL of the scholar’s personal homepage, the URL of his or her headshot picture, his or her email address, gender, title/position list, and the current location (country) of the scholar’s organization. Please read the detailed rule for Task 1 carefully.

Note:
1. If some information is missing on the scholar’s personal homepage, such information in our model answer will be left blank, so please do not include such information in your answer.
2. The information extracted from the homepage needs to follow the format on the homepage.
3. Please read the detailed rule for Task 1 given above carefully.
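To make notes 1 and 2 concrete, here is a minimal sketch of one of the extraction subtasks: pulling an email address out of a homepage's text. It assumes the page has already been downloaded and reduced to plain text. The regular expression, the function name `extract_email` and the sample strings are illustrative assumptions, not the organizers' method.

```python
import re

# Simplified pattern for plainly written addresses (an assumption; it does not
# cover obfuscated forms such as "name at domain dot edu").
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_email(page_text: str) -> str:
    """Return the first email address found, exactly as written on the page.

    Returns an empty string when no address is found, mirroring note 1:
    information missing from the homepage is left blank.
    """
    match = EMAIL_PATTERN.search(page_text)
    return match.group(0) if match else ""


print(extract_email("Contact: jane.doe@example.edu."))  # jane.doe@example.edu
print(repr(extract_email("No contact details here.")))  # ''
```

A real solution would need to handle images, obfuscated addresses and multiple candidates per page; this sketch covers only the simplest case.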

Task 2: Find the scholar’s research interests labels
The scholar’s paper information and co-author relationship network are given and the participants are required to attach five interest labels to the scholar. All candidate labels are given by the organizer.

Example

input:
name: Jiawei Han

output:
research interests: data mining, database, information networks, knowledge discovery, machine learning (Note: The order of interest labels has no effect on the results)
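Because the order of the labels does not matter, predicted and reference labels can be compared as sets. The sketch below counts how many predicted labels appear in a reference answer. The scoring rule is not stated in this section, so the overlap count, the case-insensitive normalization and the function names are all assumptions made for illustration.

```python
from typing import Iterable, Set


def normalize(labels: Iterable[str]) -> Set[str]:
    """Lowercase and trim labels so the comparison ignores case and spacing."""
    return {label.strip().lower() for label in labels}


def matched_labels(predicted: Iterable[str], reference: Iterable[str]) -> int:
    """Number of predicted labels that also appear in the reference set."""
    return len(normalize(predicted) & normalize(reference))


# Reference labels taken from the example above; the prediction is made up.
reference = ["data mining", "database", "information networks",
             "knowledge discovery", "machine learning"]
predicted = ["Machine Learning", "data mining", "social networks",
             "database", "text mining"]

print(matched_labels(predicted, reference))        # 3
print(matched_labels(reference[::-1], reference))  # 5, order is irrelevant
```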

Task 3: Predict the scholar’s future influence
The participants are given all of the scholar’s paper data up to the end of 2013 (including a detailed description of the reference relationships among the papers; see the ‘Data’ section), and are asked to predict the scholar’s total citation count as of June 2017.