This page documents a research project in progress. Information may be incomplete and change as the project progresses. Please contact the project lead before formally citing or reusing results from this page.

In this thesis, we address the task of recommending articles to edit to Wikipedia contributors. Our system is built on top of article embeddings constructed by applying a Graph Convolutional Network to the graph of Wikipedia articles. In an offline evaluation on users' past edit histories, these embeddings outperformed embeddings generated from text (via a Doc2Vec model) by 47% in Recall and 32% in Mean Reciprocal Rank (MRR) for English Wikipedia, and by 62% in Recall and 41% in MRR for Ukrainian Wikipedia. With an additional ranking model, we achieved a total improvement of 68% in Recall and 41% in MRR on the English edition of Wikipedia.

Graph Neural Networks (GNNs) are deep-learning-based methods that solve typical Machine Learning tasks such as classification, clustering, or link prediction for structured data (graphs) via a message-passing architecture. Following the explosive success of Convolutional Neural Networks (CNNs) in constructing highly expressive representations, similar ideas were recently carried over to GNNs. Graph Convolutional Networks (GCNs) are GNNs that, like CNNs, share the weights of convolutional filters across nodes in the graph. They have demonstrated especially good performance in Representation Learning via semi-supervised tasks such as the classification and link prediction mentioned above.
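
To make the idea of message passing with shared weights concrete, here is a minimal pure-Python sketch of a single graph convolution layer with a mean aggregator, in the spirit of GraphSAGE. The function names, the toy graph, and the choice of ReLU and identity weight matrices are assumptions made for illustration; this is not the project's actual model code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mean_vector(vectors, dim):
    """Element-wise mean of a list of vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def graph_conv_layer(features, neighbors, w_self, w_neigh):
    """One message-passing step (hypothetical helper).

    Every node combines its own features with the mean of its neighbours'
    features. The same two weight matrices are shared by all nodes, which is
    the graph analogue of sharing a convolutional filter across positions.
    """
    dim = len(next(iter(features.values())))
    updated = {}
    for node, x in features.items():
        neigh_mean = mean_vector([features[u] for u in neighbors.get(node, [])], dim)
        combined = [s + n for s, n in zip(matvec(w_self, x), matvec(w_neigh, neigh_mean))]
        updated[node] = [max(0.0, value) for value in combined]  # ReLU (assumed)
    return updated


# Toy graph: three articles, with links a-b and a-c.
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = graph_conv_layer(features, neighbors, identity, identity)
```

Stacking several such layers lets each node's representation depend on nodes several hops away.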

Our solution for Candidate Generation is based on a Graph Convolutional Network that learns to generate article embeddings used to measure the similarity between articles. Given a user's history of previously edited articles, we compute a user representation (we plan to compare several aggregation mechanisms; currently only averaging is available). This user representation then acts as the query for a Nearest Neighbors search, and the N articles most similar to the query are returned as candidates.
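
The candidate generation step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes cosine similarity as the similarity measure, uses a brute-force scan instead of an approximate Nearest Neighbors index, and excludes articles the user has already edited. All function names and the toy embeddings are hypothetical.

```python
import math


def average_embedding(vectors):
    """User representation: element-wise mean of the edited articles' embeddings."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity (an assumed choice of similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def generate_candidates(article_embeddings, edited_ids, n):
    """Return the n articles most similar to the user's averaged edit history."""
    user_vector = average_embedding([article_embeddings[a] for a in edited_ids])
    edited = set(edited_ids)
    scored = [
        (cosine_similarity(user_vector, vector), article_id)
        for article_id, vector in article_embeddings.items()
        if article_id not in edited  # assumption: skip already edited articles
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [article_id for _, article_id in scored[:n]]


# Toy two-dimensional embeddings, invented for illustration.
toy_embeddings = {
    "Kyiv": [1.0, 0.0],
    "Lviv": [0.9, 0.1],
    "Odesa": [0.8, 0.3],
    "Graph theory": [0.0, 1.0],
}
candidates = generate_candidates(toy_embeddings, ["Kyiv", "Lviv"], n=2)
```

At Wikipedia scale a brute-force scan would be too slow, so an indexed Nearest Neighbors search would take its place; the averaging and ranking logic stays the same.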

During the offline (preparation) stage, we train our GCN (GraphSAGE) model on the task of link prediction (internal links between articles), using text representations obtained from a Doc2Vec model as the initial state of each node.
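
As an illustration of what a link prediction objective can look like, the sketch below scores a pair of articles by the dot product of their embeddings and applies a logistic loss over observed links and sampled non-links. The text above does not state the exact loss used, so this formulation (similar to the unsupervised objective in the GraphSAGE paper) is an assumption, and the names and toy vectors are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def link_prediction_loss(embeddings, positive_edges, negative_edges):
    """Mean logistic loss over linked (positive) and unlinked (negative) pairs.

    Linked articles are pushed towards a high dot product, sampled unlinked
    pairs towards a low one. This loss is an assumed, commonly used choice.
    """
    loss = 0.0
    for u, v in positive_edges:
        loss -= math.log(sigmoid(dot(embeddings[u], embeddings[v])))
    for u, v in negative_edges:
        loss -= math.log(sigmoid(-dot(embeddings[u], embeddings[v])))
    return loss / (len(positive_edges) + len(negative_edges))


# Toy embeddings: A and B point the same way, C points the opposite way.
embeddings = {"A": [2.0, 0.0], "B": [2.0, 0.0], "C": [-2.0, 0.0]}

# Embeddings agree with the links: A-B is linked, A-C is not.
consistent_loss = link_prediction_loss(embeddings, [("A", "B")], [("A", "C")])

# Embeddings contradict the links: A-C is linked, A-B is not.
contradicting_loss = link_prediction_loss(embeddings, [("A", "C")], [("A", "B")])
```

In training, the embeddings would be the outputs of the GCN applied to the Doc2Vec node features, and the loss would be minimised by gradient descent over the model's weights.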