Complex Network Resources

Complex Networks Data Sets

In analyzing large-scale complex networks, it is important to establish a standard dataset against which algorithms and claims can be compared and verified. Currently, it is often difficult to track down the original data used for computational experiments. Much of it is floating around the net in various formats, embedded in papers, and often difficult to get from the authors. Moreover, the datasets are often modified (filtered) by research groups interested in different attributes, so that even when the name and description match a citation in a paper, there is no guarantee that the data is identical.

US Patent Citations: citations from nearly 3 million US patents (granted Jan. 1963 - Dec. 1999), with over 16 million citations for patents between 1975 and 1999. Based on a paper from the National Bureau of Economic Research. Categories include Chemical, Computer & Communications, Drugs & Medical, Electrical & Electronic, Mechanical, and Others.

Computational Geometry collaborations: author collaboration network (weighted graph) with 9072 vertices (authors) and 22577 edges (pairs of coauthors), with each edge weighted by the number of common publications between the two authors.

Erdös Collaborations: list of mathematician Paul Erdös's coauthors and their respective coauthors. (More background info can be found here.)

Associative Thesaurus: 23,219 vertices, 325,624 arcs (564 loops). This is not a traditional semantic network, but rather the result of experiments in which words were shown to several people, who responded with the first word that came to mind. (The response could be a synonym, an antonym, or be related in some other way.)

http://www.cheswick.com/ches/map/dbs/index.html:
Bill Cheswick, creator of several Internet maps, keeps some selected datasets of the internet here. (Rather large 2MB - 36MB files.)
Each file is an edge list in which every line gives two IP addresses forming a directed edge, followed by two numbers, e.g. "12.118.106.6 68.86.96.65 13 4". (According to Bill Cheswick, the last two numbers appear to signify the number of times the edge appears and the distance from the root node.)
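As an illustration of the line format described above, here is a minimal Python sketch that reads such lines into a directed adjacency map. The function names (parse_edge_line, build_adjacency) are hypothetical, and the interpretation of the last two fields as an edge count and a distance from the root is only the tentative reading given above, not a confirmed specification.

```python
from collections import defaultdict

def parse_edge_line(line):
    """Split one line into (source, target, count, distance).

    Assumption: the third field is how often the edge was seen and the
    fourth is the distance from the root node, as tentatively described.
    """
    src, dst, count, dist = line.split()
    return src, dst, int(count), int(dist)

def build_adjacency(lines):
    """Build a directed adjacency map {source: {target: count}}."""
    adj = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        src, dst, count, _dist = parse_edge_line(line)
        adj[src][dst] = count
    return dict(adj)

# The single example line quoted in the text.
sample = ["12.118.106.6 68.86.96.65 13 4"]
graph = build_adjacency(sample)
```

For the real files (2MB - 36MB), the same function can be fed an open file object instead of a list, since it only iterates over lines.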

http://math.nist.gov/matrixmarket:
Interpreted as connectivity graphs, the Matrix Market collection contains over 500 sparse matrices from various application areas. (Although these graphs are not self-organizing, they do represent unstructured complex geometries, usually the result of grid generation or human-inspected analysis.)
The files are in a text (Matrix Market) format, basically a coordinate list (i, j, val_ij) with a header and optional text. A simple filter (mm_extract_pattern) converts these files into a plain edge list "i j" and strips the header information. For example, "cat bcsstk05.mtx | mm_extract_pattern > bcsstk05.g" creates a portable graph text file.
The software tools for this conversion are in ~/projects/MatrixMarket/tools.
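The conversion described above can be sketched in a few lines of Python. This is not the mm_extract_pattern tool itself, whose source is not shown here; extract_pattern is a hypothetical stand-in that assumes the coordinate variant of the Matrix Market format (comment lines starting with "%", one size line "rows cols nonzeros", then one "i j value" entry per line). The sample matrix is made up for illustration.

```python
def extract_pattern(lines):
    """Yield (i, j) index pairs from Matrix Market coordinate-format text.

    Assumption: lines starting with '%' are header/comments, and the
    first remaining line is the size line, which is dropped as well.
    """
    size_seen = False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # banner, comments, blank lines
        if not size_seen:
            size_seen = True  # "rows cols nonzeros" line
            continue
        fields = line.split()
        # Keep only the two indices; any numeric value is discarded.
        yield int(fields[0]), int(fields[1])

# A tiny made-up matrix, not one of the Matrix Market files.
sample = """%%MatrixMarket matrix coordinate real symmetric
% illustrative example only
3 3 4
1 1 2.0
2 1 -1.0
3 2 -1.0
3 3 2.0
""".splitlines()

edges = list(extract_pattern(sample))
edge_text = "\n".join(f"{i} {j}" for i, j in edges)
```

Diagonal entries such as "1 1" come out as self-loops; whether to keep or drop them depends on the graph analysis being done.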

http://www.cs.cmu.edu/~enron/
http://bailando.sims.berkeley.edu/enron_email.html
Email database from Enron, made public by the Federal Energy Regulatory Commission.
According to the authors, this is one of the few public datasets of "real" email from the corporate world. It contains 200,399 messages from 158 users. This version has had artificial messages removed, such as folders containing discussion threads and other machine-generated emails. (Large, > 400 MB)
The second link contains a subset of about 1700 email messages (4.5MB) that focus on business-related topics, rather than jokes and personal messages.
(The first link is included in the Barabási collection above.)

http://www.cs.cornell.edu/projects/kddcup/index.html
KDD 2003 Challenge: citations from the Stanford Linear Accelerator Center High Energy Physics (HEP) literature database, online since 1974 and covering over 500,000 related articles. The citation graph has 27,771 vertices and 352,807 edges.
Note: This is a tarball archive of the arXiv:hep-th (High Energy Physics: Theory) citation database from www.arxiv.org, created specifically for the KDD Cup 2003 challenge held in conjunction with the 9th Annual ACM SIGKDD conference (Knowledge Discovery and Data Mining). See http://www.cs.cornell.edu/projects/kddcup/index.html

Wikipedia download: all articles in XML format (150 GB), from which a network graph can be extracted with one edge for each page that references another.
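A rough Python sketch of that extraction follows. It assumes pages are stored as page elements with a title and a revision/text child, and that references between articles appear in the text as [[Target]] or [[Target|label]] wiki links. The sample document is made up and simplified: a real dump declares an XML namespace on its elements and is far too large to parse in one piece, so it would need a streaming parser. The name page_links is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Capture the link target up to a closing bracket, a "|" label
# separator, or a "#" section anchor.
LINK_RE = re.compile(r"\[\[([^\]|#]+)")

def page_links(xml_text):
    """Yield (source_title, target_title) pairs, one per wiki link."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        title = page.findtext("title")
        text = page.findtext("revision/text") or ""
        for match in LINK_RE.finditer(text):
            yield title, match.group(1).strip()

# Made-up, namespace-free sample; not an excerpt from a real dump.
sample = """<mediawiki>
  <page>
    <title>Graph theory</title>
    <revision><text>See [[Vertex (graph theory)|vertices]] and [[Paul Erdos#Work]].</text></revision>
  </page>
</mediawiki>"""

wiki_edges = list(page_links(sample))
```

Each pair is a directed edge from the referencing page to the referenced page, so the output can be written directly as an edge list.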