Visualization of Large Dynamic Networks

Yibo Yao

Advisor: Dr. Larry Holder

School of Electrical Engineering and Computer Science

Washington State University, Pullman, WA 99164

We develop a general framework that uses Gephi's Graph Streaming API to visualize a large dynamic network arriving as a rapid stream of edges. We consider two popular dynamic networks, a paper citation network and a Twitter retweet network, and feed their edge streams into Gephi's visualization pool by communicating directly with the Graph Streaming API at a low level. The dynamic graphs also highlight a few prominent nodes that may be considered important components, i.e., nodes whose in-degree or out-degree exceeds a given threshold value.
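The pipeline described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it builds the JSON events that the Graph Streaming plugin accepts (`an` to add a node, `ae` to add an edge, `cn` to change a node) and tracks degrees to flag prominent nodes. The endpoint URL, the edge-id scheme, the threshold values, and the use of a `size` attribute for highlighting are all assumptions made for the example.

```python
import json
import urllib.request
from collections import defaultdict

# Assumed endpoint: a Gephi master server started by the Graph Streaming plugin.
# The host, port, and workspace name depend on the local Gephi session.
GEPHI_URL = "http://localhost:8080/workspace0?operation=updateGraph"


def edge_events(source, target, seen_nodes):
    """Build JSON events that add any unseen endpoints, then the directed edge."""
    events = []
    for node in (source, target):
        if node not in seen_nodes:
            seen_nodes.add(node)
            events.append(json.dumps({"an": {node: {"label": node}}}))
    edge_id = f"{source}->{target}"  # hypothetical edge-id scheme
    events.append(json.dumps(
        {"ae": {edge_id: {"source": source, "target": target, "directed": True}}}
    ))
    return events


class DegreeTracker:
    """Track in/out-degree per node and report nodes that pass a threshold."""

    def __init__(self, in_threshold, out_threshold):
        self.in_threshold = in_threshold
        self.out_threshold = out_threshold
        self.in_degree = defaultdict(int)
        self.out_degree = defaultdict(int)
        self.prominent = set()

    def add_edge(self, source, target):
        """Update degrees for one edge; return nodes that just became prominent."""
        self.out_degree[source] += 1
        self.in_degree[target] += 1
        newly_prominent = []
        for node in (source, target):
            if node in self.prominent:
                continue
            if (self.in_degree[node] > self.in_threshold
                    or self.out_degree[node] > self.out_threshold):
                self.prominent.add(node)
                newly_prominent.append(node)
        return newly_prominent


def highlight_event(node, size=30):
    """Build a change-node event that enlarges a prominent node (assumed styling)."""
    return json.dumps({"cn": {node: {"size": size}}})


def send_event(event, url=GEPHI_URL):
    """POST one event to a running Gephi master server; requires Gephi to be up."""
    request = urllib.request.Request(url, data=event.encode("utf-8"), method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status
```

A driver loop would call `edge_events` and `DegreeTracker.add_edge` for each incoming edge, pass every resulting event to `send_event`, and send a `highlight_event` for each node returned as newly prominent.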

This project may help researchers visualize the implicit structures of most popular time-evolving networks and facilitate the development of algorithms for finding interesting subgraph patterns in them.

The report can be found here. The following sections describe the two dynamic networks in more detail. The source code (Python) is also included.

Paper Citation Network

The paper citation dataset used in this project is Arxiv HEPTH (high energy physics theory), which was originally released for the 2003 KDD Cup and later refined into an XML-based representation. Each paper is denoted by a node with a unique identifier in the graph, and if a paper i cites another paper j, there is a directed edge pointing from node i to node j. The dataset contains papers submitted to Arxiv from January 1993 to April 2003. To make the citation data flow into Gephi's workspace in real time, we converted the XML-based data into a stream of edges and stored it in a plain text file (which can be found here).
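Replaying such a text file as a stream might look like the sketch below. The file layout is an assumption, since it is not described here: one edge per line, with the citing paper's identifier followed by the cited paper's identifier, separated by whitespace, and `#` marking comment lines. The function names and the delay parameter are hypothetical.

```python
import io
import time


def read_edge_stream(lines):
    """Yield (citing, cited) pairs from an iterable of text lines."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f"line {line_number}: expected two identifiers")
        yield fields[0], fields[1]


def replay(edges, handle_edge, delay_seconds=0.0):
    """Feed edges to a callback one at a time, pausing between edges."""
    count = 0
    for citing, cited in edges:
        handle_edge(citing, cited)
        count += 1
        if delay_seconds > 0:
            time.sleep(delay_seconds)
    return count


if __name__ == "__main__":
    # Made-up identifiers standing in for the real edge file.
    sample = io.StringIO("# citing cited\n9301001 9201001\n9302002 9301001\n")
    replay(read_edge_stream(sample), lambda i, j: print(f"{i} -> {j}"))
```

Reading the file lazily keeps memory use flat regardless of file size, and the delay controls how fast edges appear in the visualization.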

Readme: readme (instructions for setting up the configuration and running the scripts)

Twitter Retweet Network

On Twitter, a retweet is a re-posting of someone else's tweet that shares it with the public. A large part of the huge volume of tweets generated every second on Twitter are retweets. In this project, we focus on retweets about given topics (or hashtags) in Twitter's public streams. By accessing Twitter's Streaming API through Tweepy, we retrieve the retweets that match the specified topics (or hashtags) from the real-time public streams. Each retweet has an author who did the re-posting and a user who wrote the original tweet, which implies a retweet relation between the author and the original user. In the retweet network, nodes represent Twitter users, with usernames as the unique identifiers, and edges denote the retweet relations between them. If a user i re-posts another user j's tweet, there is a directed edge pointing from node i to node j in the retweet network.
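Turning one retweet into a directed edge can be sketched as below. The example works on a plain dictionary shaped like a Twitter API v1.1 status payload, where a retweet carries a `retweeted_status` object and each user has a `screen_name`. How the project wired this into Tweepy is not stated here, so the function names and the topic filter are hypothetical.

```python
def retweet_edge(status):
    """Return (retweeter, original_author) for a retweet, or None otherwise."""
    original = status.get("retweeted_status")
    if not original:
        return None  # an ordinary tweet, not a retweet
    return status["user"]["screen_name"], original["user"]["screen_name"]


def mentions_topic(status, topics):
    """Check whether any hashtag on the status matches one of the topics."""
    hashtags = status.get("entities", {}).get("hashtags", [])
    tags = {tag["text"].lower() for tag in hashtags}
    return any(topic.lower().lstrip("#") in tags for topic in topics)


if __name__ == "__main__":
    # Made-up payload containing only the fields the functions read.
    status = {
        "user": {"screen_name": "alice"},
        "retweeted_status": {"user": {"screen_name": "bob"}},
        "entities": {"hashtags": [{"text": "Gephi"}]},
    }
    if mentions_topic(status, ["#gephi"]):
        print(retweet_edge(status))
```

The returned pair follows the edge direction defined above: the first element is user i, who re-posted, and the second is user j, who wrote the original tweet.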