JHU Math Department arXiv Collaboration Analysis

To familiarize myself with the software, I decided to create a data set and load it into Centrifuge.

I needed to find a data set that made the best use of Centrifuge’s strengths — visualizing big data to discover patterns and relationships.

I decided to use the arXiv. Instead of traditional but expensive academic journals, researchers can submit papers to the arXiv. Some, but not all, of the papers are preprints of articles due to appear in academic journals.

Before I share my results, I should explain what a relationship graph is. Simply put, a relationship graph is a way to visualize data that highlights the interconnections between data elements.

A simple relationship graph

In the relationship graph above, each circle represents an author. If there is a line between the circles, the authors collaborated on a paper. For example, the relationship graph above shows that W Wilson collaborated with Nitu Kitchloo. In turn, Nitu Kitchloo collaborated with Jack Morava.
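
To make the idea concrete, here is a minimal Python sketch of how a graph like this could be built from article metadata. It is only an illustration, not how Centrifuge works internally: the `papers` list and the `build_coauthor_graph` function are hypothetical, and the two sample papers are invented to reproduce the three authors above.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: each entry lists the authors of one paper.
papers = [
    ["W Wilson", "Nitu Kitchloo"],
    ["Nitu Kitchloo", "Jack Morava"],
]

def build_coauthor_graph(papers):
    """Return an adjacency map: author -> set of coauthors."""
    graph = defaultdict(set)
    for authors in papers:
        for author in authors:
            graph[author]  # touch the key so solo authors still appear as circles
        # Every pair of authors on the same paper gets a line between them.
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)

graph = build_coauthor_graph(papers)
print(sorted(graph["Nitu Kitchloo"]))  # ['Jack Morava', 'W Wilson']
```

Each key is a circle and each entry in its set is a line. W Wilson and Jack Morava are not directly linked here, since they only share a coauthor.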

If I were only interested in a pair of authors, it would be easy to use a SQL database to determine if they had collaborated. However, if I needed to identify interconnected groups of authors, a SQL database would be more difficult to use. I would need complex nested joins to identify groups of collaborators. Even if I did manage to extract the data, I would still be unable to visualize the relationships.
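
The easy pairwise case can be sketched with Python's built-in `sqlite3` module. The `authorship` table, its columns, and the `collaborated` helper are assumptions made up for this example, not the schema of any real arXiv export.

```python
import sqlite3

# Hypothetical schema: one row per (paper, author) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authorship (paper_id INTEGER, author TEXT)")
conn.executemany(
    "INSERT INTO authorship VALUES (?, ?)",
    [
        (1, "W Wilson"),
        (1, "Nitu Kitchloo"),
        (2, "Nitu Kitchloo"),
        (2, "Jack Morava"),
    ],
)

def collaborated(conn, a, b):
    """True if authors a and b appear together on at least one paper."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM authorship AS x
        JOIN authorship AS y ON x.paper_id = y.paper_id
        WHERE x.author = ? AND y.author = ?
        """,
        (a, b),
    ).fetchone()
    return row[0] > 0

print(collaborated(conn, "W Wilson", "Nitu Kitchloo"))  # True
print(collaborated(conn, "W Wilson", "Jack Morava"))    # False
```

A single self-join answers the direct question. Following the chain one step further (W Wilson to Nitu Kitchloo to Jack Morava) would need another self-join, and every additional hop adds one more, which is where the queries start to get unwieldy.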

The JHU Math Department

The above graph shows the entire math department, plus any coauthors up to 2 degrees of separation away. Authors currently on the JHU faculty are shown in yellow. Those who are not currently on the JHU faculty are shown in blue. The circles for the JHU authors are all the same size. For the non-JHU authors, the size of the circle represents the number of articles. Observe that the graph has two large connected components. That is, there are two large groups of authors who are interconnected through collaboration relationships.
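
For readers who want to see what "degrees of separation" and "connected components" mean in code, here is a small Python sketch using breadth-first search. The adjacency map is toy data with placeholder names, not the real JHU graph, and `within_hops` and `connected_components` are hypothetical helpers written for this illustration.

```python
from collections import deque

# Hypothetical adjacency map (author -> coauthors); placeholder names only.
graph = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
    "D": {"E"},
    "E": {"D"},
    "F": set(),
}

def within_hops(graph, start, max_hops):
    """Authors reachable from start in at most max_hops collaboration links."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return set(hops)

def connected_components(graph):
    """Split the graph into groups of mutually reachable authors, largest first."""
    unvisited = set(graph)
    components = []
    while unvisited:
        start = unvisited.pop()
        # A hop limit equal to the number of authors reaches the whole component.
        component = within_hops(graph, start, len(graph))
        unvisited -= component
        components.append(component)
    return sorted(components, key=len, reverse=True)

print(sorted(within_hops(graph, "A", 1)))  # ['A', 'B']
print(sorted(within_hops(graph, "A", 2)))  # ['A', 'B', 'C']
print([sorted(c) for c in connected_components(graph)])
# [['A', 'B', 'C'], ['D', 'E'], ['F']]
```

In this toy data there are two multi-author components and one isolated author, which mirrors the shape of the department graph on a much smaller scale.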

The largest component.

The second largest component.

The remaining author.

I should note that there were a few minor issues with getting the article metadata from the arXiv. Not every author participated in the arXiv’s authority control system. As a result, some authors may have been confused with others with similar names. In addition, it’s important to remember that the arXiv isn’t the only venue for publication. Any collaboration outside of the arXiv isn’t shown in the graphs above.