Web image size prediction for efficient focused image crawling

This is a guest blog post by Katerina Andreadou.
Katerina is a research assistant at CERTH, where she specializes in multimedia analysis and web crawling.


In the context of using Web image content for analysis and retrieval, it is typically necessary to perform large-scale image crawling. In our web image crawler setup, we noticed that a serious bottleneck pertains to the fetching of image content, since for each web page a large number of HTTP requests need to be issued to download all included image elements. In practice, however, only the relatively big images (e.g., larger than 400 pixels in width and height) are potentially of interest, since most of the smaller ones are irrelevant to the main subject or correspond to decorative elements (e.g., icons, buttons). Given that there is often no dimension information in the HTML img tag of images, to filter out small images, an image crawler would still need to issue a GET request and download the respective files before deciding whether to index them.

To address this limitation, we decided to explore the challenge of predicting the size of images on the Web based only on their URL and information extracted from the surrounding HTML code. In order to do so, we needed a large amount of images accompanied by their HTML metadata with the purpose of training and testing our image size prediction system. To this end we decided to use a sample of the data from the July 2014 Common Crawl set, which is over 266TB in size and contains approximately 3.6 billion web pages. Since for technical and financial reasons, it was impractical and unnecessary to download the whole dataset, we created a MapReduce job to download and parse the necessary information using Amazon Elastic MapReduce (EMR). The setup is based on a blog post by Steve Salevan. The data of interest include all images and videos from all web pages and metadata extracted from the surrounding HTML elements.
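
The per-page extraction step of that job can be sketched as follows; the real job ran as a MapReduce job on EMR over the crawl files, and the helper names and the exact metadata kept here are illustrative:

import re

IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)
ATTR = re.compile(r'(\w+)\s*=\s*["\']([^"\']*)["\']')

def extract_image_records(page_url, html):
    """Yield one record per <img> element: image URL plus surrounding HTML metadata."""
    for tag in IMG_TAG.findall(html):
        attrs = dict(ATTR.findall(tag))
        yield {
            'page_url': page_url,
            'img_url': attrs.get('src', ''),
            'width': attrs.get('width'),    # often missing, hence the prediction task
            'height': attrs.get('height'),
            'alt': attrs.get('alt'),
        }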

To complete the task, we used 50 Amazon EMR medium instances, resulting in 951GB of data in gzip format. The following statistics were extracted from the corpus:

  • 3.6 billion unique images
  • 78.5 million unique domains
  • ≈8% of the images are big (width and height bigger than 400 pixels)
  • ≈40% of the images are small (width and height smaller than 200 pixels)
  • ≈20% of the images have no dimension information

To predict the size of Web images, we came up with three different methodologies, which are analyzed in the rest of this post. This work is described in detail in a paper presented at CBMI 2015 (13th International Workshop on Content-Based Multimedia Indexing). The paper is available online (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7153609).

Textual Features approach

The first, textual approach relies on character n-grams extracted from the image URL. An n-gram in our case is a contiguous sequence of n characters from the given image URL. The main hypothesis we make is that URLs corresponding to small and big images differ substantially in terms of wording. For instance, URLs of small images tend to contain words such as logo, avatar, small, thumb, up, down, pixels. URLs of big images, on the other hand, tend to lack these words and typically contain others. If the assumption is correct, it should be possible for a supervised machine learning method to separate items of the two distinct classes.
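
As a minimal sketch (the n-gram length is a parameter of the method and the helper name is ours), the character n-grams of a URL can be enumerated as follows:

def char_ngrams(url, n=4):
    """Return all contiguous character n-grams of the (lowercased) image URL."""
    url = url.lower()
    return [url[i:i + n] for i in range(len(url) - n + 1)]

char_ngrams("http://example.com/images/logo_small.png")
# ['http', 'ttp:', 'tp:/', ...]  -- n-grams such as 'logo' and 'mall' hint at a small image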

The disadvantage of this approach is that, although the frequencies of the n-grams are taken into account, what is not considered is the correlation of the n-grams with the two classes, BIG and SMALL. For instance, if an n-gram is very frequent in both classes, it makes sense to discard it and not consider it as a feature. On the other hand, if an n-gram is not very frequent but is very characteristic of a specific class, we should include it in the feature vector. To this end, we performed feature selection by taking into account the relative frequency of occurrence of each n-gram in the two classes, BIG and SMALL. We refer to this method as NG-trf, standing for term relative frequency.
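
The selection step can be sketched as follows; the scoring formula here illustrates the relative-frequency idea and is not the exact formula from the paper:

from collections import Counter

def select_features(big_urls, small_urls, n=4, k=1000):
    ngrams = lambda u: [u.lower()[i:i + n] for i in range(len(u) - n + 1)]
    big = Counter(g for u in big_urls for g in ngrams(u))
    small = Counter(g for u in small_urls for g in ngrams(u))
    total_big = float(sum(big.values()) or 1)
    total_small = float(sum(small.values()) or 1)

    def score(gram):
        # high when the n-gram is frequent in one class and rare in the other
        return abs(big[gram] / total_big - small[gram] / total_small)

    vocabulary = set(big) | set(small)
    return sorted(vocabulary, key=score, reverse=True)[:k]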

In a variation of the aforementioned approach, we decided to replace n-grams with the tokens produced by splitting the image URLs on all non-alphanumeric characters. The regular expression employed in Java is \W+, and the feature extraction process is the same as described above, but with the produced tokens instead of n-grams. We will refer to this method as TOKENS-trf.
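
For example, the tokenization amounts to a single split call (shown here in Python with the same regex):

import re

def url_tokens(url):
    return [t for t in re.split(r'\W+', url.lower()) if t]

url_tokens("http://example.com/thumbs/avatar_32.png")
# ['http', 'example', 'com', 'thumbs', 'avatar_32', 'png']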

Non-textual features approach

Our alternative, non-textual approach does not rely on the image URL text but rather on the metadata that can be extracted from the image HTML element. These features were chosen to reveal cues about the image dimensions. For instance, the first five features correspond to different image suffixes; they were selected because most real-world photos are in JPG or PNG format, whereas the BMP and GIF formats usually point to icons and graphics. Additionally, a real-world photo is more likely to have an alternate or parent text than a background graphic or a banner.
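
A sketch of such a feature vector is shown below; it takes a dictionary of the <img> attributes plus an assumed parent_text field for the surrounding text, and the handful of features shown are representative examples rather than the full set of 23 used in the evaluation:

def html_features(img):
    src = img.get('src', '').lower()
    return [
        int(src.endswith('.jpg') or src.endswith('.jpeg')),  # photo-like suffixes
        int(src.endswith('.png')),
        int(src.endswith('.gif')),                            # often icons or graphics
        int(src.endswith('.bmp')),
        int(bool(img.get('alt'))),                            # alternate text present
        int(bool(img.get('parent_text'))),                    # surrounding text present
    ]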

Hybrid approach

The goal of the hybrid approach is to achieve higher performance by taking into account both textual and non-textual features. Our hypothesis is that the two methods will complement each other when their results are aggregated, as they rely on different kinds of features: the n-gram classifier might be best at classifying images with specific image URL wording, while the non-textual features classifier might be best at classifying images with more informative HTML metadata.
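
One simple way to aggregate the two classifiers, sketched below, is to average their class probabilities; the exact aggregation used in the paper may differ, and this only illustrates combining two models that see different features:

def hybrid_predict(p_big_textual, p_big_nontextual, threshold=0.5):
    p_big = (p_big_textual + p_big_nontextual) / 2.0
    return 'BIG' if p_big >= threshold else 'SMALL'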

Evaluation

For training we used one million images (500K small and 500K big) and for testing 200 thousand (100K small and 100K big). The described supervised learning approaches were implemented on top of the Weka library. We performed numerous preliminary experiments with different classifiers (LibSVM, Random Tree, Random Forest), and Random Forest (RF) was found to strike the best trade-off between good performance and acceptable training times. The main parameter of RF is the number of trees. Typical values are 10, 30 and 100, while very few problems demand more than 300 trees. The rule of thumb is that more trees lead to better performance; however, they also considerably increase the training time.
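
The experiments used Weka's Random Forest implementation; as a rough stand-in, the same setup looks as follows with scikit-learn, with toy arrays in place of the real feature matrices and labels:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# toy stand-ins for the real n-gram / token / HTML feature vectors and labels
X_train, y_train = [[1, 0, 3], [0, 2, 0], [1, 1, 2], [0, 3, 1]], ['BIG', 'SMALL', 'BIG', 'SMALL']
X_test, y_test = [[1, 0, 2], [0, 2, 1]], ['BIG', 'SMALL']

clf = RandomForestClassifier(n_estimators=100)  # number of trees: 10, 30 or 100 in Table 1
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(f1_score(y_test, pred, average=None, labels=['SMALL', 'BIG']))  # F1small, F1big
print(f1_score(y_test, pred, average='macro'))                        # F1avg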

The comparative results for different numbers of trees in the Random Forest algorithm are displayed in Table 1. The first column contains the method name, the second the number of trees used in the RF classifier, the third the number of features used, and the remaining columns the F-measures for the SMALL class, the BIG class and their average. The reported results lead to several interesting conclusions.

  • Doubling the number of n-gram features improves the performance in all cases.
  • So does adding more trees to the Random Forest classifier.
  • The hybrid method outperforms all standalone methods, its best F-score being 4% higher than the best textual features score.

Table 1: Comparative results (F-measure)

Method         RF trees   Features   F1small   F1big   F1avg
TOKENS-trf           10       1000     0.876   0.867   0.871
TOKENS-trf           30       1000     0.887   0.883   0.885
TOKENS-trf          100       1000     0.894   0.891   0.893
TOKENS-trf           10       2000     0.875   0.864   0.870
TOKENS-trf           30       2000     0.888   0.828   0.885
TOKENS-trf          100       2000     0.897   0.892   0.895
NG-tsrf-idf          10       1000     0.876   0.872   0.874
NG-tsrf-idf          30       1000     0.883   0.881   0.882
NG-tsrf-idf         100       1000     0.886   0.884   0.885
NG-tsrf-idf          10       2000     0.883   0.878   0.881
NG-tsrf-idf          30       2000     0.891   0.888   0.890
NG-tsrf-idf         100       2000     0.894   0.891   0.892
features             10         23     0.848   0.846   0.847
features             30         23     0.852   0.852   0.852
features            100         23     0.853   0.853   0.853
hybrid                -          -     0.935   0.935   0.935

Acknowledgement

This work was carried out at the Multimedia Knowledge and Social Media Analytics Lab in collaboration with Symeon Papadopoulos in the context of the REVEAL FP7 project.

Announcing the Common Crawl Index!

This is a guest post by Ilya Kreymer.
Ilya is a dedicated volunteer who has gifted large amounts of time, effort and talent to Common Crawl. He previously worked at the Internet Archive and led the Wayback Machine development, which included building large indexes of WARC files. Ilya is currently developing a new set of archive replay and access tools and an impressive new on-demand archiving service, webrecorder.io, that allows anyone to create a high-fidelity web archive of their own. Check out his exciting projects, including our new index and query api in the post below.


We are pleased to announce a new index and query api system for Common Crawl.

The raw index data is available, per crawl, at:
s3://commoncrawl/cc-index/collections/CC-MAIN-YYYY-WW/indexes/

There is now an index for the Jan 2015 and Feb 2015 crawls. Going forward, a new index will be available at the same time as each new crawl.

To make working with the index a bit simpler, an api and service for querying the index is available at: http://index.commoncrawl.org. The index is fully functional, but we are looking for feedback to improve the usefulness and usability of the index for future updates.

Index Format
The index format is relatively simple: it consists of a plaintext index (with one line for each entry) compressed into gzipped chunks, plus a secondary index of the compressed chunks. This index is often called the ‘ZipNum’ CDX format and it is the same format that is used by the Wayback Machine at the Internet Archive.

Index Query API
To make working with the index a bit easier, the main index site (http://index.commoncrawl.org) contains a readily accessible api for querying the index.

The api is a variation of the ‘cdx server api’ or ‘capture index server api’ that was originally built for the wayback machine.

The site is built using pywb (https://github.com/ikreymer/pywb), a new collection of web archive replay and query tools, including the index query component.

The index can be queried by making a request to a specific collection.

For example, the following query looks up “wikipedia.org” in the CC-MAIN-2015-11 (Feb 2015) crawl:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org

The above query will only retrieve captures from the exact url “wikipedia.org/”, but a frequent use case may be to retrieve all urls from a path or all subdomains.

This can be done by using wildcard queries:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org/*
or
http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/

Pagination
For most prefix or domain prefix queries such as these, it is not feasible to retrieve all the results at once, so only the first page of results (by default, up to 15000) is returned.

The total number of pages can be retrieved with the showNumPages query:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true

This query returns:

{"blocks": 4942, "pages": 989, "pageSize": 5}

This indicates that there are 989 total pages, at 5 compressed index blocks per page!

Thus, to get all of *.wikipedia.org, one would need to perform the query for each page:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&page=0

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&page=988

This allows for the query process to be performed in parallel. For example, it should be possible to run a MapReduce job which computes the number of pages, creates a list of urls, and then runs the query in parallel.
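
As a minimal sketch with the Python requests library (the command-line client described below automates the same steps), one can fetch the page count and then request each page; the pages could just as easily be split across workers or a MapReduce job:

import requests

API = 'http://index.commoncrawl.org/CC-MAIN-2015-11-index'

def fetch_all(url_pattern):
    # first ask how many pages the query spans
    info = requests.get(API, params={'url': url_pattern, 'showNumPages': 'true'}).json()
    for page in range(info['pages']):
        resp = requests.get(API, params={'url': url_pattern, 'page': page})
        for line in resp.text.splitlines():
            yield line

for line in fetch_all('*.wikipedia.org/'):
    print(line)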

Command-Line Client
For smaller use cases, a simple client-side library is available to simplify this process: https://github.com/ikreymer/cdx-index-client. This is a simple Python script which uses the pagination api to perform a parallel query on a local machine.

First, a good idea is to verify the number of pages:
./cdx-index-client.py -c CC-MAIN-2015-11 *.wikipedia.org --show-num-pages
989

To perform the query, simply run:
./cdx-index-client.py -c CC-MAIN-2015-11 *.wikipedia.org -z -d ./wikipedia-index

This query will fetch all pages of the *.wikipedia.org index into a ./wikipedia-index directory and keep the data compressed (-z flag). For a full set of options, you may run
./cdx-index-client.py --help

The script will print out an update of the progress:

2015-04-07 08:35:18,686: [INFO]: Fetching 989 pages of *.wikipedia.org
2015-04-07 08:35:45,734: [INFO]: 1 page(s) of 989 finished
2015-04-07 08:35:46,577: [INFO]: 2 page(s) of 989 finished
2015-04-07 08:35:46,579: [INFO]: 3 page(s) of 989 finished

Adjusting Page Size
It is also possible to adjust the page size to increase or decrease the number of “blocks” in the page. (Each block is a compressed chunk and cannot be split further.)
The pageSize query param can be used to set the page size in blocks (the default is 5 blocks per page). For example:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true
{"blocks": 4942, "pages": 989, "pageSize": 5}

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true&pageSize=1
{"blocks": 4942, "pages": 4942, "pageSize": 1}

In general, blocks / pageSize + 1 = pages. Adjusting the page size can help adjust the parallelization and load of the query as needed.

Capture Index JSON (CDXJ) Line Format
The raw index format (stored and returned from the query api) is as follows:

org,wikipedia)/ 20150227035757 {"url": "http://www.wikipedia.org/", "digest": "PQE67QMKFGSZJU5SR2ESR7CMBKLSSBAJ", "length": "11996", "offset": "853671193", "filename": "crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/CC-MAIN-20150226074100-00147-ip-10-28-5-156.ec2.internal.warc.gz"}

This format consists of a ‘url<space>timestamp<space>’ header followed by a json dictionary. The header is used to ensure the lines are sorted by url key and timestamp.

Adding the output=json option to the query will ensure the full line is json. Example:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org&output=json&limit=1
{"urlkey": "org,wikipedia)/", "timestamp": "20150227035757", "url": "http://www.wikipedia.org/", "length": "11996", "filename": "crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/CC-MAIN-20150226074100-00147-ip-10-28-5-156.ec2.internal.warc.gz", "digest": "PQE67QMKFGSZJU5SR2ESR7CMBKLSSBAJ", "offset": "853671193"}

Currently, the index contains the urlkey (a canonicalized, reverse-order form of the url), the timestamp, the url, and the WARC filename, offset and length, as well as a checksum (digest) of the content. The digest can be used to identify duplicate captures, but also adds significantly to the index size and may be removed in future versions. Other fields may be added to the json dictionary as needed also.
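
Given the filename, offset and length, the corresponding WARC record can be fetched with an HTTP Range request and decompressed on its own, since each record is stored as an independent gzip member. The sketch below assumes the crawl data is served from the commoncrawl S3 HTTP endpoint; adjust the prefix if the data location changes:

import gzip, io, json, requests

def fetch_record(cdx_line, prefix='https://commoncrawl.s3.amazonaws.com/'):
    urlkey, timestamp, meta = cdx_line.split(' ', 2)
    meta = json.loads(meta)
    offset, length = int(meta['offset']), int(meta['length'])
    headers = {'Range': 'bytes=%d-%d' % (offset, offset + length - 1)}
    resp = requests.get(prefix + meta['filename'], headers=headers)
    # the slice is a standalone gzip member, so it can be decompressed independently
    return gzip.GzipFile(fileobj=io.BytesIO(resp.content)).read()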

It is possible to only select certain fields from the query with the fl field. For example, the following query will return only the url:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=http://wikipedia.org/&fl=url
http://wikipedia.org/

or via command-line tool:

./cdx-index-client -c CC-MAIN-2015-11 http://wikipedia.org --fl url

Multiple fields can also be specified, e.g. fl=url,length to return only the url and the WARC record length.

For a full reference of available query params, consult the latest CDX Server API reference.

Additional Java Tools
For Java users wishing to access the raw index, the IIPC webarchive-commons has support for reading the ZipNum format. Additionally, the openwayback-cdx-server provides the Java implementation of the original cdx server api. However, some modifications would be required to that codebase to support the cdx json format and it has not been tested with this index.

Building the Index / Running CDX Index Server
All the tools for building the index are also available at: https://github.com/ikreymer/webarchive-indexing

The index was built using EMR and the mrjob python library, together with the indexing tools from the pywb project. This should enable others to build the index in the future, or create customized versions of the index as needed. Please refer to the project for additional reference and do not hesitate to contact us with any specific questions.

The service running at http://index.commoncrawl.org is also available at:

https://github.com/ikreymer/cc-index-server

To run locally, the secondary index (for binary search) for each collection will need to be fetched locally, while most of the index will be read from S3.

Request for Feedback and Future Plans
Although the index format is pretty well-tested, there is lots of room for customization, especially of the index query api, as well as of which fields to include in the index. Feedback in the form of bug reports, feature requests, questions and suggestions about any aspect of the index is definitely welcome and will help make the index even easier to use. Please do not hesitate to get in touch with any feedback about the index.

After some additional testing of the newly released indexes, we plan to build an index for older crawls as well. A cumulative index of all data ever crawled by CommonCrawl is also under consideration if there is enough interest. We look forward to hearing about any use cases or other feedback that you may have about the indexing project.

Please help us continue our support of great efforts like this by making a donation to the Common Crawl Foundation and follow us @CommonCrawl on Twitter for the latest in Big Open Data.

Evaluating graph computation systems

This is a guest blog post by Frank McSherry.
Frank McSherry is a computer science researcher active in the area of large scale data analysis. While at Microsoft Research he co-invented differential privacy and led the Naiad streaming dataflow project. His current interests involve understanding and improving performance in scalable data processing systems.


The computer science systems and database research communities are abuzz with graph processing systems, large computer systems specialized to process and analyze very large graphs such as global social networks, internet topologies, and the hyperlink structure of the world-wide web. Academic researchers working in these areas have made significant strides in understanding how to process graph-oriented data, but their ability to validate their techniques at scale has been limited by a lack of access to very large graph datasets. The data made available by Common Crawl and Web Data Commons provide an excellent first opportunity for these researchers to understand the performance of graph processing systems at scales that justify their complexity.

Background on graph processing

Graphs are an interesting source of data in which each record (an “edge”) references two distinct entities (called “nodes”). In the case of the web graph, for example, each hyperlink can be viewed as an edge with the source and destination web pages as the nodes it references. Many important social graph analyses are functions of the graph structure: the existence of edges between nodes imposes a constraint on the answer, and many computations can be viewed as maintaining or updating per-node state as a function of the incident edges (and the states of adjacent nodes).

Researchers have since built computational engines based around algorithms in which information flows only along graph edges. These systems distribute the large set of edges across multiple threads, processors, and even computers, and for each edge ensure that the information at each node is shared with the other node. To define a computation, a data analyst then supplies the code for what should happen with this information each time it is presented, for example updating the information maintained by each node to reflect what it has learned from others. The separation of computational logic from system specifics allows each system to provide scalable implementations of many popular graph algorithms, without burdening the programmer with issues of data distribution, network communication, or recovery in the presence of machine failures.
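
As a toy illustration of this model, here is a single-threaded Python sketch of connected components computed by repeatedly sharing the smallest known label across each edge; the scalable systems distribute exactly this kind of per-edge work across threads and machines:

def connected_components(edges, num_nodes):
    label = list(range(num_nodes))          # per-node state
    changed = True
    while changed:
        changed = False
        for src, dst in edges:              # information flows along each edge
            low = min(label[src], label[dst])
            if label[src] != low or label[dst] != low:
                label[src] = label[dst] = low
                changed = True
    return label

print(connected_components([(0, 1), (1, 2), (3, 4)], 5))  # [0, 0, 0, 3, 3]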

Evaluation at scale

As popular as graph processing systems have become, their evaluation has largely been either on small to medium size data sets, or behind the closed doors of corporate data centers. Evaluation on moderately sized data is not without its use, but it does fail to stress the systems in ways that could inform users, programmers, or administrators faced with truly large data. Worse, from a research perspective, we shouldn’t expect to make progress in improving graph processing systems without relevant evaluation of when they perform well and when they perform badly.

For example, Gonzalez et al. (OSDI 2014) evaluated four of the most popular recent systems used for graph processing (Spark, Giraph, GraphLab, and GraphX) on two graph datasets, each containing more than one billion edges. Their performance numbers on a cluster of 16 machines, comprising 128 processing cores, reveal that their proposed system improved on prior work with the same features, without losing the performance of specialized systems.

Although one billion edges is quite a lot, the datasets still fit comfortably on a modern laptop. To demonstrate the limitations of evaluations on this size of data, we compared the performance numbers reported by Gonzalez et al. with the performance of the same algorithms executed on my laptop, using about one hundred lines of code. Eliding the specific details of the comparison, no graph processing system strictly out-performed the laptop, and some of the systems were consistently slower than the laptop. With a few improvements to the code, the laptop was able to out-perform all systems on all computations on all datasets evaluated by Gonzalez et al.

These conclusions surprised many people, who had assumed that the scalable systems would obviously out-perform a single core on a laptop. Indeed, the main point of these systems is that one can accomplish more by using multiple computers than one can accomplish using just a single computer. If the systems are not yet accomplishing this goal, doing something you couldn’t otherwise do, it is less clear when and why you would use them, especially given the additional resources (and complexity) they entail. Perhaps one simply should not use them, and instead do graph analysis on a laptop.

Scaling up evaluation

One common (and fair) response to our initial evaluation was that the datasets we used (those used by the GraphX evaluation) were not sufficiently large to distinguish between good scalable systems and bad scalable systems (including my laptop, presumably). We completely agree (and this was part of the point of using only existing measurements), but researchers struggle to access realistic data at meaningful scales, and the graph datasets used by the GraphX evaluation were among the largest publicly available at the time. Given this difficulty, it is perhaps unfair to blame the researchers for the limits of their evaluation.

This situation has now changed with the data provided by Common Crawl and Web Data Commons (who have processed the former’s crawl data into a compact graph dataset). The graph data they provide is substantially larger, by almost two orders of magnitude, than the graph datasets used by Gonzalez et al. As the dataset is made freely available, researchers can use it to evaluate the improvement their systems provide over simpler single-computer alternatives. If the systems do not improve as much as expected, the researchers can now pin-point what is slow, and can either improve their system’s implementation or improve the ideas underlying its design. In either case, the state of the art can now advance in a more principled manner.

To get the ball rolling on evaluation, we took the graph data supplied by Web Data Commons, 128 billion edges relating 4 billion nodes, approximately one terabyte uncompressed, and evaluated my laptop’s performance on this dataset. This collection is substantially larger than other datasets, and is near the limit of what can be processed by my laptop. The data required some further transformation and compression both to fit on my laptop’s drive, and to avoid exhausting memory when the computations were executed. But, the computations do run, and they take time that is only slightly longer than what one might expect based only on the increase in the number of edges.

The laptop’s performance is good news for people who want to run graph computations at scale, and a healthy challenge (and opportunity) for implementors of graph processing systems. These measurements provide an excellent baseline and proof of concept for the scalable systems to target. A laptop can perform the large-scale computation, admittedly under duress, and so the scalable systems should be expected both to perform the computation and to improve on my laptop’s performance. If a system’s performance is lacking, a comparative evaluation should reveal which components are limiting the system, and they can be improved.

Advancing graph processing systems

It should be mentioned that there are already several graph processing systems that do improve on laptop-scale performance. Ligra is a shared-memory (single machine) system whose performance appears to scale quite well from a single-core baseline when graphs fit into the computer’s random access memory. For larger graphs, FlashGraph uses an array of solid-state drives to provide performance without requiring nearly as much memory. Finally, Naiad distributes computation across multiple computers, but manages to avoid introducing much overhead when it does, scaling performance up from the appropriate single-machine baseline.

In my opinion, each of these systems represents progress in the state of the art, and we should understand what each does well (and what each does badly). Given our access to these systems and their ideas, as well as sufficient data to evaluate and distinguish good ideas from bad, there is no obvious reason not to expect the state of the art in graph processing to advance significantly and swiftly. In fact, to the extent that we are serious about performance in graph processing systems, we should demand it.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data. If you value Open Data, please make a donation to the Common Crawl Foundation.

Analyzing a Web graph with 129 billion edges using FlashGraph

This is a guest blog post by Da Zheng.
Da Zheng is the architect and main developer of the FlashGraph project. He is a PhD student of computer science at Johns Hopkins University, focusing on developing frameworks for large-scale data analysis, particularly for massive graph analysis and data mining.   


FlashGraph is a SSD-based graph processing framework for analyzing massive graphs. We have demonstrated that FlashGraph is able to analyze the page-level Web graph constructed from the Common Crawl corpora by the Web Data Commons project. This Web graph has 3.5 billion vertices and 129 billion edges and is the largest graph publicly available in the world. Thanks to the hard work of the Common Crawl and the Web Data Commons project, we are able to demonstrate the scalability and performance of FlashGraph as well as the graph algorithms designed for billion-node graphs.

You may ask why we need another graph processing framework when we already have quite a few, such as Pregel/Giraph, GraphLab/PowerGraph and GraphX. As pointed out by Frank McSherry in his blog posts 1 & 2, the current distributed graph processing frameworks have substantial overhead in order to scale out; we should seek both performance and capacity (the size of a graph that can be processed). On top of the runtime overheads Frank McSherry mentions, these frameworks also have very large memory overhead. For example, as shown in the performance evaluation of the GraphX paper, Giraph cannot even process a graph with 106 million vertices and 3.8 billion edges in a cluster with an aggregate memory of 1088 GB. A similar problem exists in the other frameworks, as shown here. The large memory overhead prevents them from scaling to larger graphs or unnecessarily wastes resources.

FlashGraph was created to seek performance, capacity, flexibility and ease of programming. We hope FlashGraph can offer performance comparable to the state-of-the-art in-memory graph engines while scaling to graphs with hundreds of billions or even trillions of edges. We also hope that FlashGraph can express a variety of algorithms while hiding the complexity of accessing data on SSDs and of parallelizing graph algorithms.

To scale graph analysis and achieve in-memory performance, FlashGraph uses the semi-external memory model, which stores algorithmic vertex state in memory and edge lists on SSDs. This model enables in-memory vertex communication while scaling to graphs that exceed memory capacity. Because vertex communication is the main source of computation overhead in many graph algorithms, it is essential to achieve in-memory performance for it. To optimize data access on SSDs, FlashGraph deploys two I/O optimizations: it accesses only the edge lists required by the application, and it conservatively merges I/O requests to achieve higher I/O throughput and reduce the CPU overhead caused by I/O.

The graph format used by FlashGraph is designed for both efficiency and flexibility. All graph algorithms in FlashGraph use the same graph format, so each graph only needs to be converted into the format once and to be loaded to SSDs once. FlashGraph stores both in-edges and out-edges in a graph image. In the Web graph, an out-edge is a hyperlink from a Web page to another page, and an in-edge is the reverse of a hyperlink. It is necessary to keep an edge twice for a directed graph because some graph algorithms require in-edges, some require out-edges and some require both. For efficiency, in-edges and out-edges of a vertex are stored separately. This reduces data access from SSDs if an algorithm requires only one type of edges.

FlashGraph provides a very flexible vertex-centric programming interface and supports a variety of graph algorithms. The vertex-centric programming interface allows programmers to “think like a vertex”: each vertex maintains some algorithmic state and performs user-defined computation independently. In FlashGraph, a vertex can communicate with any other vertex through message passing and can read the edge lists of any vertices from SSDs. We have implemented a set of graph algorithms such as breadth-first search, PageRank, connected components and triangle counting. All of these graph algorithms implemented in FlashGraph can run on the page-level Web graph in a single commodity machine and complete at an unprecedented speed, as shown in the table below. The performance results also show that FlashGraph has a very short initialization time even on this massive graph.
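
FlashGraph itself is written in C++; the Python-style pseudocode below is only meant to illustrate the vertex-centric model, using breadth-first search as the example, where each active vertex reads its edge list (from SSD in FlashGraph) and passes a message to its neighbours:

def bfs(adjacency, source):
    level = {source: 0}                      # algorithmic vertex state kept in memory
    frontier = [source]
    while frontier:
        next_frontier = []
        for v in frontier:                   # "think like a vertex"
            for u in adjacency[v]:           # edge list read (from SSD in FlashGraph)
                if u not in level:
                    level[u] = level[v] + 1  # message: neighbour learns its BFS level
                    next_frontier.append(u)
        frontier = next_frontier
    return level

print(bfs({0: [1, 2], 1: [3], 2: [], 3: []}, 0))  # {0: 0, 1: 1, 2: 1, 3: 2}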

Algorithm Runtime (sec) Init time (sec) Memory (GB)
BFS 298 30 22
Betweenness 595 33 81
Triangle counting 7818 31 55
Weakly connected components 461 32 47
PageRank (30 iterations) 2041 33 46
Scan statistics 375 58 83

 

The more detailed design of FlashGraph is documented in the paper published at FAST’15.

We further explore community detection with FlashGraph on billion-node graphs. Here we detect communities using only active vertices. The activity level of a vertex is measured by a locality statistic (the number of edges in the neighborhood of a vertex). Again, we use the large Web graph to demonstrate the scalability and accuracy of our procedure. The key is to quickly identify the most active vertices in a graph; having these vertices, we further cluster them into active communities. In the experiment in our paper, we identify the 2000 most active vertices in the Web graph and discover five communities. The sizes of communities 1 to 5 are n1 = 35, n2 = 1603, n3 = 199, n4 = 42 and n5 = 121 respectively. Community 1 is a collection of websites that are all developed, sold or to be sold by an Internet media company, networkmedia. Community 2 consists entirely of hyperlinks extracted from a single pay-level-domain adult website. In community 3, most links are social media websites often used in our daily lives, such as WordPress.org and Google. Community 4 consists of websites related to online shopping, such as the shopping giant Amazon and the bookseller AbeBooks. Community 5 is another collection of 121 adult web pages, where each web page in this cluster comes from a different pay-level-domain. In summary, the top 5 active communities in the Web graph are grouped with high topical similarity.

Active community detection is one application that demonstrates the power of FlashGraph. We look forward to seeing more cases where people use FlashGraph for mining massive graphs, and we are happy to help others develop algorithms to explore the Web graph as well as other graphs of similar or even larger size.

WikiReverse: Visualizing Reverse Links with the Common Crawl Archive

This is a guest blog post by Ross Fairbanks.

Ross Fairbanks is a software developer based in Barcelona. He mainly develops in Ruby and is interested in open data and cloud computing. This guest post describes his open data project wikireverse.org and why he built it.



What is WikiReverse?

WikiReverse [1] is an application that highlights web pages and the Wikipedia articles they link to. The project is based on Common Crawl’s July 2014 web crawl, which contains 3.6 billion pages. The results produced 36 million links to 4 million Wikipedia articles. Most of the results are from English Wikipedia (which had 32 million links) followed by Spanish, Indonesian and German. In total there are results for 283 languages.
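
The core extraction step can be sketched as follows: scan each page's HTML for links to Wikipedia articles and record the page, language and article. This is only an illustration; the real WikiReverse pipeline (linked at the end of this post) does considerably more work:

import re

WIKI_LINK = re.compile(
    r'href=["\']https?://([a-z\-]+)\.wikipedia\.org/wiki/([^"\'#?]+)', re.IGNORECASE)

def wikipedia_links(page_url, html):
    for lang, article in WIKI_LINK.findall(html):
        yield page_url, lang.lower(), article

html = '<a href="https://en.wikipedia.org/wiki/Web_crawler">crawler</a>'
print(list(wikipedia_links('http://example.com/', html)))
# [('http://example.com/', 'en', 'Web_crawler')]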

I first heard about Common Crawl in a blog post by Steve Salevan, MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl [2]. Running Steve’s code deepened my interest in the project. What I like most is the efficiency savings of a large web-scale crawl that anyone can access. Attempting to crawl the same volume of web pages myself would have been vastly more expensive and time consuming.

I found that the data can be processed relatively cheaply, as it cost just $64 to process the metadata for 3.6 billion pages. This was achieved by using spot instances, which is the spare server capacity that Amazon Web Services auctions off when demand is low. This saved $115 compared to using full price instances.

There is great value in the Common Crawl archive; however, that value is difficult to see without an interface to the data. It can be hard to visualize the possibilities and what can be done with the data. For this reason, my project runs an analysis over an entire crawl with a resulting site that allows the findings to be viewed and searched.

I chose to look at reverse links because, despite its relatively simple approach, it exposes interesting data that is normally deeply hidden. Wikipedia articles are often cited on the web and they rank highly in search results. I was interested in seeing how many links these articles have and what types of sites are linking to them.

A great benefit of working with an open dataset like Common Crawl’s is that WikiReverse results can be released very quickly to the public. Already, Gianluca Demartini from the University of Sheffield has released Who links to Wikipedia? [3] on the Wikimedia blog. This is an analysis of which top-level domains appear in the results. It is encouraging to see the interest in open data projects and hopefully more analyses of these types will be done.

Choosing Wikipedia also means the project can continue to benefit from the wide range of open data they release. The DBpedia [4] project uses raw data dumps released by Wikipedia and creates structured datasets for many aspects of data, including categories, images and geographic locations. I plan on using DBpedia to categorize articles in WikiReverse.

The code developed to analyze the data is available on Github. I’ve written a more detailed post on my blog on the data pipeline [5] that was developed to generate the data. The full dataset can be downloaded using BitTorrent. The data is 1.1 GB when compressed and 5.4 GB when extracted. Hopefully this will help others build their own projects using the Common Crawl data.


[1] https://wikireverse.org/
[2] http://blog.commoncrawl.org/2011/12/mapreduce-for-the-masses/
[3] http://blog.wikimedia.org/2015/02/03/who-links-to-wikipedia/
[4] http://dbpedia.org/About
[5] https://rossfairbanks.com/2015/01/23/wikireverse-data-pipeline.html

The Promise of Open Government Data & Where We Go Next

One of the biggest boons for the Open Data movement in recent years has been the enthusiastic support from all levels of government for releasing more, and higher quality, datasets to the public. In May 2013, the White House released its Open Data Policy and announced the launch of Project Open Data, a repository of tools and information–which anyone is free to contribute to–that help government agencies release data that is “available, discoverable, and usable.”

Since 2013, many enterprising government leaders across the United States at the federal, state, and local levels have responded to the President’s call to see just how far Open Data can take us in the 21st century. Following the White House’s groundbreaking appointment in 2009 of Aneesh Chopra as the country’s first Chief Technology Officer, many local and state governments across the United States have created similar positions. San Francisco last year named its first Chief Data Officer, Joy Bonaguro, and released a strategic plan to institutionalize Open Data in the city’s government. Los Angeles’ new Chief Data Officer, Abhi Nemani, was formerly at Code for America and hopes to make LA a model city for open government. His office recently launched an Open Data portal along with other programs aimed at fostering a vibrant data community in Los Angeles.1

Open government data is powerful because of its potential to reveal information about major trends and to inform questions pertaining to the economic, demographic, and social makeup of the United States. A second, no less important, reason why open government data is powerful is its potential to help shift the culture of government toward one of greater collaboration, innovation, and transparency.

These gains are encouraging, but there is still room for growth. One pressing issue is for more government leaders to establish Open Data policies that specify the type, format, frequency, and availability of the data  that their offices release. Open Data policy ensures that government entities not only release data to the public, but release it in useful and accessible formats.

Only nine states currently have formal Open Data policies, although at least two dozen have some form of informal policy and/or an Open Data portal.2 Agencies and state and local governments should not wait too long to standardize their policies about releasing Open Data; waiting will severely limit Open Data’s potential. There is not much that a data analyst can do with a PDF.

One area of great potential is for data whizzes to pair open government data with web crawl data. Government data makes for a natural complement to other big datasets, like Common Crawl’s corpus of web crawl data, that together allow for rich educational and research opportunities. Educators and researchers should find Common Crawl data a valuable complement to government datasets when teaching data science and analysis skills. There is also vast potential to pair web crawl data with government data to create innovative social, business, or civic ventures.

Innovative government leaders across the United States (and the world!) and enterprising organizations like Code for America have laid an impressive foundation that others can continue to build upon as more and more government data is released to the public in increasingly usable formats. Common Crawl is encouraged by the rapid growth of a relatively new movement and we are excited to see the collaborations to come as Open Government and Open Data grow together.

 

Allison Domicone was formerly a Program and Policy Consultant to Common Crawl and previously worked for Creative Commons. She is currently pursuing a master’s degree in public policy from the Goldman School of Public Policy at the University of California, Berkeley.

Web Data Commons Extraction Framework for the Distributed Processing of CC Data

This is a guest blog post by Robert Meusel.
Robert Meusel is a researcher at the University of Mannheim in the Data and Web Science Research Group and a key member of the Web Data Commons project. The post below describes a new tool produced by Web Data Commons for extracting data from the Common Crawl data.


The Web Data Commons project extracts structured data from the Common Crawl corpora and offers the extracted data for public download. We have extracted one of the largest hyperlink graphs that is currently available to the public. We also extract and offer large corpora of Microdata, Microformats and RDFa annotations as well as relational HTML tables. If you ask why we do this: because we share the opinion that data should be available to everybody and because we want to make it easier to exploit the wealth of information that is available on the Web.

For performing the extractions, we need to go through all the hundreds of terabytes of crawl data offered by the Common Crawl Foundation. As a project without any direct funding or salaried persons, we needed a time-, resource- and cost-efficient way to process the Common Crawl corpora. We thus developed a data extraction tool which allows us to process the Common Crawl corpora in a distributed fashion using Amazon cloud services (AWS).

The basic architectural idea of the extraction tool is to have a queue taking care of the proper handling of all files which should be processed. Each worker receives a new file from the queue whenever it is ready and informs the queue about the status (success or failure) of the processing. Successfully processed files are removed from the queue; failures are assigned to another worker or eliminated when a fixed number of workers could not process them.
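
The worker loop can be sketched as follows; the framework itself is implemented in Java, so this Python sketch with an Amazon SQS queue (the queue name and the file processor are hypothetical) only illustrates the pattern:

import boto3

sqs = boto3.resource('sqs')
queue = sqs.get_queue_by_name(QueueName='cc-files-to-process')  # hypothetical queue name

def worker(process_file):
    while True:
        messages = queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=20)
        if not messages:
            break
        msg = messages[0]
        try:
            process_file(msg.body)   # run the user-supplied file processor on one CC file
            msg.delete()             # success: remove the file from the queue
        except Exception:
            pass                     # failure: message becomes visible again for another worker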

We used the extraction tool, for example, to extract a hyperlink graph covering over 3.5 billion pages and 126 billion hyperlinks from the 2012 CC corpus (over 100TB when uncompressed). Using our framework and 100 EC2 instances, the extraction took less than 12 hours and cost less than US$500. The extracted graph had a size of less than 100GB zipped.

With each new extraction, we improved the extraction tool and turned it more and more into a flexible framework into which we now simply plug the needed file processors (for one single file) and which takes care of everything else.

This framework has now been officially released under the terms of the Apache license. The framework takes care of everything that is related to file handling, distribution, and scalability and leaves to the user only the task of writing the code needed to extract the desired information from a single one of the CC files.

More information about the framework, a detailed guide on how to run it, and a tutorial showing how to customize the framework for your extraction tasks is found at

http://webdatacommons.org/framework

We encourage all interested parties to make use of the framework. We will continuously improve it and are happy to hear from everybody who gives us feedback about their experiences with it.

Navigating the WARC file format

Wait, what’s WAT, WET and WARC?

Recently CommonCrawl has switched to the Web ARChive (WARC) format. The WARC format allows for more efficient storage and processing of CommonCrawl’s free multi-billion page web archives, which can be hundreds of terabytes in size.

This document aims to give you an introduction to working with the new format, specifically the difference between:

  • WARC files which store the raw crawl data
  • WAT files which store computed metadata for the data stored in the WARC
  • WET files which store extracted plaintext from the data stored in the WARC

If you want all the nitty gritty details, the best source is the ISO standard, for which the final draft is available.

If you’re more interested in diving into code, we’ve provided three introductory examples in Java that use the Hadoop framework to process WAT, WET and WARC.

WARC Format

The WARC format is the raw data from the crawl, providing a direct mapping to the crawl process. Not only does the format store the HTTP response from the websites it contacts (WARC-Type: response), it also stores information about how that information was requested (WARC-Type: request) and metadata on the crawl process itself (WARC-Type: metadata).

For the HTTP responses themselves, the raw response is stored. This not only includes the response itself, what you would get if you downloaded the file, but also the HTTP header information, which can be used to glean a number of interesting insights.

In the example below, we can see the crawler contacted http://102jamzorlando.cbslocal.com/tag/nba/page/2/ and received a HTML page in response. We can also see the page was served from the nginx web server and that a special header has been added, X-hacker, purely for the purposes of advertising to a very specific audience of programmers who might look at the HTTP headers!

WARC/1.0
WARC-Type: response
WARC-Date: 2013-12-04T16:47:32Z
WARC-Record-ID: 
Content-Length: 73873
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: 
WARC-Concurrent-To: 
WARC-IP-Address: 23.0.160.82
WARC-Target-URI: http://102jamzorlando.cbslocal.com/tag/nba/page/2/
WARC-Payload-Digest: sha1:FXV2BZKHT6SQ4RZWNMIMP7KMFUNZMZFB
WARC-Block-Digest: sha1:GMYFZYSACNBEGHVP3YFQNOSTV5LPXNAU

HTTP/1.0 200 OK
Server: nginx
Content-Type: text/html; charset=UTF-8
Vary: Accept-Encoding
Vary: Cookie
X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
Content-Encoding: gzip
Date: Wed, 04 Dec 2013 16:47:32 GMT
Content-Length: 18953
Connection: close


...HTML Content...

WAT Response Format

WAT files contain important metadata about the records stored in the WARC format above. This metadata is computed for each of the three types of records (metadata, request, and response). If the information crawled is HTML, the computed metadata includes the HTTP headers returned and the links (including the type of link) listed on the page.

This information is stored as JSON. To keep the file sizes as small as possible, the JSON is stored with all unnecessary whitespace stripped, resulting in a relatively unreadable format for humans. If you want to inspect the JSON file yourself, use one of the many JSON pretty print tools available.

The HTTP response metadata is most likely to be of interest to CommonCrawl users. The skeleton of the JSON format is outlined below.

  • Envelope
    • WARC-Header-Metadata
    • Payload-Metadata
      • HTTP-Response-Metadata
        • Headers
          • HTML-Metadata
            • Head
              • Title
              • Scripts
              • Metas
              • Links
            • Links
    • Container
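
To make the nesting concrete, here is a minimal sketch of pulling the link metadata out of one WAT record, assuming the JSON payload of the record has already been isolated (for example with a WARC parsing library); the nested keys follow the skeleton above:

import json

def links_from_wat_record(record_json):
    envelope = json.loads(record_json)['Envelope']
    response_meta = envelope['Payload-Metadata'].get('HTTP-Response-Metadata', {})
    html_meta = response_meta.get('HTML-Metadata', {})
    return html_meta.get('Links', [])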

WET Response Format

As many tasks only require textual information, the CommonCrawl dataset provides WET files that only contain extracted plaintext. The way in which this textual data is stored in the WET format is quite simple. The WARC metadata contains various details, including the URL and the length of the plaintext data, with the plaintext data following immediately afterwards.

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://advocatehealth.com/condell/emergencyservices3
WARC-Date: 2013-12-04T15:30:35Z
WARC-Record-ID: 
WARC-Refers-To: 
WARC-Block-Digest: sha1:3SJBHMFPOCUJEHJ7OMGVCRSHQTWLJUUS
Content-Type: text/plain
Content-Length: 5765


...Text Content...

Processing the file format

We’ve provided three introductory examples in Java for the Hadoop framework. The code also contains wrapper tools for making working with the Web Archive Commons library easier in Hadoop.

These introductory examples include:

  • Count the number of times various HTML tags are used across the internet using the WARC files
  • Counting the number of different server types found in the HTTP headers using the WAT files
  • Word count over the extracted plaintext found in the WET files (a minimal plain-Python sketch follows this list)
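
As a flavour of the third example, here is a rough plain-Python sketch (the linked examples use Java and Hadoop) of a word count over the plaintext records of a single WET file; it treats the first blank line after each record header as the start of the payload, which is a simplification of the full WARC record structure:

import gzip
from collections import Counter

def wet_word_count(path):
    counts = Counter()
    with gzip.open(path, 'rt', errors='ignore') as f:
        in_payload = False
        for line in f:
            if line.startswith('WARC/1.0'):
                in_payload = False           # a new record header begins
            elif line == '\n' and not in_payload:
                in_payload = True            # blank line ends the headers
            elif in_payload:
                counts.update(line.split())
    return counts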

If you’re using a different language, there are a number of open source libraries that handle processing these WARC files and the content they contain. These include:

If in doubt, the tools provided as part of the IIPC’s Web Archive Commons library are the preferred implementation.

This is a guest blog post by Stephen Merity.
Stephen Merity is a Computational Science and Engineering master’s candidate at Harvard University. His graduate work centers around machine learning and data analysis on large data sets. Prior to Harvard, Stephen worked as a software engineer for Freelancer.com and as a software engineer for online education start-up Grok Learning. Stephen has a Bachelor of Information Technology (Honours First Class with University Medal) from the University of Sydney in Australia.

Analysis of the NCSU Library URLs in the Common Crawl Index

Last week we announced the Common Crawl URL Index. The index has already proved useful to many people and we would like to share an interesting use of the index that was very well described in a great blog post by Jason Ronallo.

Jason is the Associate Head of Digital Library Initiatives at North Carolina State University Libraries. He used the Common Crawl Index to look at NCSU Library URLs in the Common Crawl Index. You can see his description of his work and results below and on his blog. Be sure to follow Jason on Twitter and on his blog to keep up to date with other interesting work he does!

 

Common Crawl URL Index

The Common Crawl now has a URL index available. While the Common Crawl has been making a large corpus of crawl data available for over a year now, if you wanted to access the data you’d have to parse through it all yourself. While setting up a parallel Hadoop job running in AWS EC2 is cheaper than crawling the Web, it still is rather expensive for most. Now with the URL index it is possible to query for domains you are interested in to discover whether they are in the Common Crawl corpus. Then you can grab just those pages out of the crawl segments.

Scott Robertson, who was responsible for putting the index together, writes in the github README about the file format used for the index and the algorithm for querying it. If you’re interested you can read the details there.

If you just want to see how to get the data now, the repository provides a couple python scripts for querying the index. I used the remote_read script. You’ll need to clone the git repository to get the script along with the library files:

git clone https://github.com/trivio/common_crawl_index.git

Then enter the cloned repository and make the file executable:

cd common_crawl_index
chmod u+x bin/remote_read

Since the data set is hosted for free as part of AWS open data sets, it appears that they allow anonymous access. This means that you may not have to sign up for an Amazon Web Services account. The current remote_read script does not have this anonymous access turned on, but there is an open issue and patch submitted to allow anonymous access. You may want to get that version of the remote_read script and use it until that issue is closed.

If you have an account you want to use, you’ll update these lines in remote_read with your own AWS key and secret.

  mmap = BotoMap(
    '<YOUR AWS KEY>',
    '<YOUR AWS SECRET>',
    'commoncrawl',
    '/projects/url-index/url-index.1356128792'
  )

Finally you’ll have to install boto:

pip install boto

Now you can run the script:

querying the Common Crawl URL index
bin/remote_read edu.ncsu.lib

Note that because of how the index is constructed you’ll be querying for domains in reverse order. This allows you to scope your queries to match everything from a TLD down to a specific subdomain. This will return every URL matching under http://lib.ncsu.edu as well as any subdomains like http://d.lib.ncsu.edu.

As I write this, the index is only partial, while folks provide feedback on the index, so your current results may not reflect everything that is currently in the Common Crawl corpus.

NCSU Libraries’ URLs in the Common Crawl Index

You can see the results for my query for edu.ncsu.lib. Here’s a snippet from the beginning of the set:

Query for edu.ncsu.lib in Common Crawl URL Index
edu.ncsu.lib.blogs/:http {'arcFileParition': 200, 'compressedSize': 2062, 'arcSourceSegmentId': 1346876860565, 'arcFileDate': 1346911439829, 'arcFileOffset': 1518768}
edu.ncsu.lib.d/:http {'arcFileParition': 2132, 'compressedSize': 855, 'arcSourceSegmentId': 1346876860782, 'arcFileDate': 1346908147933, 'arcFileOffset': 2759941}
edu.ncsu.lib.d/collections/:http {'arcFileParition': 2132, 'compressedSize': 5165, 'arcSourceSegmentId': 1346876860782, 'arcFileDate': 1346908633502, 'arcFileOffset': 81186482}
edu.ncsu.lib.d/collections/catalog/0228376:http {'arcFileParition': 2132, 'compressedSize': 5738, 'arcSourceSegmentId': 1346876860782, 'arcFileDate': 1346908633502, 'arcFileOffset': 60135728}
edu.ncsu.lib.d/collections/catalog/bh2127pnc001:http {'arcFileParition': 2132, 'compressedSize': 6003, 'arcSourceSegmentId': 1346876860782, 'arcFileDate': 1346908633502, 'arcFileOffset': 27791779}
edu.ncsu.lib.d/collections/catalog/unccmc00145-002-ff0003-002-004_0002:http {'arcFileParition': 2132, 'compressedSize': 6456, 'arcSourceSegmentId': 1346876860782, 'arcFileDate': 1346909064600, 'arcFileOffset': 7308443}
edu.ncsu.lib.databases/:http {'arcFileParition': 76, 'compressedSize': 5688, 'arcSourceSegmentId': 1346823846039, 'arcFileDate': 1346870337194, 'arcFileOffset': 37040682}

The result is a line-delimited file with information about one URL on each line. A space separates the URL from some JSON-like data. (You’ll need to convert the single quotes to double quotes for it to parse as JSON, or just eval the data with Python if you are filled with trust or like to live dangerously.) Again, the URL hostname is in reverse order followed by the path in normal order and finally the protocol. The data is a pointer to the location of the file within a segment of the Common Crawl dataset. This information can be used to retrieve the page from AWS S3.
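
As a sketch of parsing this output without a bare eval, ast.literal_eval handles the Python-style dicts directly; the file name below is a hypothetical dump of the remote_read output, and the Counter reproduces the kind of per-hostname breakdown shown further down:

import ast
from collections import Counter

def parse_index_lines(lines):
    for line in lines:
        urlkey, data = line.split(' ', 1)
        yield urlkey, ast.literal_eval(data)

hosts = Counter()
with open('edu.ncsu.lib.out') as f:          # hypothetical file holding the remote_read output
    for urlkey, meta in parse_index_lines(f):
        hosts[urlkey.split('/', 1)[0]] += 1  # reversed hostname, e.g. edu.ncsu.lib.www

print(hosts.most_common(5))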

What I’m interested in is what NCSU Libraries URLs are represented in the index. In total the URL index has 4033 URLs that all look to be from a crawl in early September. Here’s the breakdown for subdomains:

Query for edu.ncsu.lib in Common Crawl URL Index
 2278: www.lib.ncsu.edu
  801: repository.lib.ncsu.edu
  326: geodata.lib.ncsu.edu
  289: news.lib.ncsu.edu
  149: lib.ncsu.edu
   54: wikis.lib.ncsu.edu
   27: images.lib.ncsu.edu
   16: ncslaap.lib.ncsu.edu
   15: historicalstate.lib.ncsu.edu
   11: insidewood.lib.ncsu.edu
   10: ncarchitects.lib.ncsu.edu
    5: d.lib.ncsu.edu
    2: media.lib.ncsu.edu
    2: insight.lib.ncsu.edu
    2: hegel.lib.ncsu.edu
    2: libb.ncsu.edu
    2: library.ncsu.edu
    1: reserves.lib.ncsu.edu
    1: ww.lib.ncsu.edu
    1: linkinghub.elsevier.com.www.lib.ncsu.edu
    1: wais.lib.ncsu.edu
    1: vega.lib.ncsu.edu
    1: tricks.lib.ncsu.edu
    1: smithers.lib.ncsu.edu
    1: sirius.lib.ncsu.edu
    1: sfx.lib.ncsu.edu
    1: search.lib.ncsu.edu
    1: rppnt.lib.ncsu.edu
    1: rpp.lib.ncsu.edu
    1: web.ebscohost.com.www.lib.ncsu.edu
    1: isiknowledge.com.www.lib.ncsu.edu
    1: bst.sagepub.com.www.lib.ncsu.edu
    1: ncsulib4.lib.ncsu.edu
    1: ncsulib2.lib.ncsu.edu
    1: sag.sagepub.com.www.lib.ncsu.edu
    1: nclive.lib.ncsu.edu
    1: sciencedirect.com.www.lib.ncsu.edu
    1: ncarchitect.lib.ncsu.edu
    1: metcalf.lib.ncsu.edu
    1: springerlink.com.www.lib.ncsu.edu
    1: luna.lib.ncsu.edu
    1: libntapp1.lib.ncsu.edu
    1: jacob.lib.ncsu.edu
    1: proquest.umi.com.www.lib.ncsu.edu
    1: www3.interscience.wiley.com.www.lib.ncsu.edu
    1: pubs.acs.org.www.lib.ncsu.edu
    1: link.aps.org.www.lib.ncsu.edu
    1: avmajournals.avma.org.www.lib.ncsu.edu
    1: gopher.lib.ncsu.edu
    1: vetpathology.org.www.lib.ncsu.edu
    1: ftp.lib.ncsu.edu
    1: etd.lib.ncsu.edu
    1: ematb.lib.ncsu.edu
    1: dli.lib.ncsu.edu
    1: dewey.lib.ncsu.edu
    1: databases.lib.ncsu.edu
    1: wwwnew.lib.ncsu.edu
    1: blogs.lib.ncsu.edu
    1: bliss.lib.ncsu.edu
Total URLs: 4033
Total hostnames: 59
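
The counts above were produced with a small script (the Ruby versions are linked at the end of this post); a rough Python equivalent, assuming the query output has been saved locally to a hypothetical file named ncsu-urls.txt, looks like this:

    from collections import Counter

    counts = Counter()
    with open("ncsu-urls.txt") as f:  # hypothetical local copy of the index query output
        for line in f:
            line = line.strip()
            if not line:
                continue
            url_key = line.split(" ", 1)[0]                      # reversed-host/path:protocol
            reversed_host = url_key.split("/", 1)[0].split(":", 1)[0]
            hostname = ".".join(reversed(reversed_host.split(".")))
            counts[hostname] += 1

    for hostname, n in counts.most_common():
        print(f"{n:>6}: {hostname}")
    print("Total URLs:", sum(counts.values()))
    print("Total hostnames:", len(counts))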

Analyzing the Results

The results here are interesting, as I’m always trying to raise the discoverability of NCSU Libraries’ digital collections. At the top of the list is the main web site for NCSU Libraries; the hostnames www.lib.ncsu.edu and lib.ncsu.edu both point to the same resources. Looking closer, we find that of the 2427 URLs there, many are for pages related to digital collections. 636 are under the Special Collections Research Center, and some of these are pages for legacy collections. 407 URLs are for pages in our collection guides application, many of them for individual guides or, strangely, the EAD XML for the guides. Some of those collection guides do link to online digital collections.

The institutional repository (DSpace instances) is also well represented at the top of this list. The Technical Reports Repository accounts for 159 of those URLs, and the NCSU Institutional Repository accounts for just 3. The digital collections in the repository, mainly special collections, account for 626 URLs. 719 of the 801 repository URLs point directly to PDFs; evidently the PDFs rank higher than the landing pages.

NCSU Libraries has been providing Geospatial Data Services and paying attention to SEO for those pages for a long time, so it isn’t completely surprising that this directory of files has been indexed: http://geodata.lib.ncsu.edu. (Note that this server may not be accessible from off-campus.) Many of the URLs under www.lib.ncsu.edu are also GIS pages, so GIS data services and collections pages are better represented, and in more human-friendly form, than at first appears.

Other digital collections projects like Historical State, Inside Wood, North Carolina Architects & Builders, and NCSU Libraries’ Rare and Unique Materials are represented, but nowhere near exhaustively. Historical State now canonicalizes its URLs for individual resources to point to the Rare and Unique Materials site, but Common Crawl may not be paying attention to that hint. (Hopefully, at some point I’ll be able to do a similar analysis for historicalstate.lib.ncsu.edu to the one I’ve done below for d.lib.ncsu.edu.)

For http://d.lib.ncsu.edu these are the URLs listed:

So it appears that Common Crawl probably hasn’t decided to crawl this site to any extent (at least in this half of the index!). Instead it appears to be crawling only the pages that have been linked to. Once the rest of the index comes out, I’ll have to take another look and consider how to improve that number. The key, though, is obviously getting more links into the site.

Further down in the list there are a bunch of odd-looking URLs. I think these are all proxy URLs for user authentication to restricted resources.

  • linkinghub.elsevier.com.www.lib.ncsu.edu
  • web.ebscohost.com.www.lib.ncsu.edu
  • isiknowledge.com.www.lib.ncsu.edu

http://gopher.lib.ncsu.edu no longer seems to exist, so I don’t know where they got that page.

Double Checking in Web Data Commons

While the Common Crawl URL index is useful if you need the whole page, in many cases the extracted embedded semantic markup may be enough. The Web Data Commons project is already extracting Microdata and RDFa data from the crawl and makes indexes available, though it takes a bit more effort to parse through them. (I’d like a service or script that could query for an N-Quad context and get back all the related triples. Anyone know if there is already such a service? Do I have to write one?) They do have a helpful page on how to download the extracted data in whole or in part.
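
In the absence of such a service, the brute-force version is straightforward; here is a minimal Python sketch that scans a downloaded Web Data Commons N-Quad file (the filename below is a placeholder) and keeps only the quads whose context URL contains a given host:

    import gzip

    TARGET = "d.lib.ncsu.edu"  # host whose triples we want

    def context_of(nquad_line):
        # An N-Quad ends with "<context> ."; literals may contain spaces,
        # so only the final term is safe to grab with a simple split.
        body = nquad_line.rstrip().rstrip(".").rstrip()
        return body.rsplit(" ", 1)[-1]

    # "microdata-extract.nq.gz" stands in for one of the downloadable WDC extract files.
    with gzip.open("microdata-extract.nq.gz", "rt", encoding="utf-8") as f:
        for line in f:
            if TARGET in context_of(line):
                print(line.rstrip())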

The http://d.lib.ncsu.edu/collections/ site publishes Microdata with Schema.org vocabularies. Looking in the Web Data Commons Microdata index, I found the N-Quad file with triples extracted from ncsu.edu pages. It lists only the same URLs that the Common Crawl URL index reports. This leads me to believe that these may be the only URLs in the Common Crawl index right now, even though that index is incomplete.

What can libraries and archives do with this?

First, how much of your content is in the Common Crawl corpus? I’d be interested in hearing what your results are like.

We need to figure out how to get more cultural heritage content crawled and indexed by Common Crawl. Without our stuff in the Common Crawl, we are missing many opportunities to broaden the reach of our content. It doesn’t appear that Common Crawl accepts sitemaps; it works off of page rank and the link graph of popular sites. While my sites for rare and unique digital collections get most of their traffic from search engines, mainly Google, an increasing amount of traffic is due to referrals. Referrals, links from other sites, seem like the key to getting our stuff into the corpus. Efforts to add links to library special or digital collections to appropriate Wikipedia articles and the like would seem to be a good starting point.

Social sites are in the corpus and may also be a good way to get inbound links to our collections. There are 134,928+ Pinterest URLs in the Common Crawl index, and folks are actively pinning content from d.lib.ncsu.edu. Will the content pinned and repinned on Pinterest begin showing up in the crawl? Where else are crawlers likely to find links from people who make use of our content?

If more cultural heritage content is a part of the index, then there are all sorts of things we can begin to do. For web archiving projects it would be possible to begin with data already in the corpus, potentially saving some crawling expense. New targeted search engines (or aggregations) can be created for different slices of content. Implement Microdata (or RDFa Lite) with Schema.org vocabularies, and richer metadata can be extracted from your pages by the Web Data Commons and understood by many. This data can then be used in a variety of interfaces to save the time of the user in finding the content they really want.

What are some other ways that libraries, archives, and museums might be able to use the Common Crawl?

You can see the simple Ruby scripts I used for parsing the Common Crawl URL index results and the Web Data Commons N-Quads in this gist.

Common Crawl URL Index

We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io, graciously donated his time and skills to creating this valuable tool. You can read his guest blog post below, and be sure to check out the triv.io site to learn more about how they help groups solve big data problems.

Common Crawl URL Index
by Scott Robertson

Common Crawl is my go-to data set. It’s a huge collection of pages crawled from the internet and made available completely unfettered. Their choice to largely leave the data alone and make it available “as is” is brilliant.

It’s almost like I did the crawling myself, minus the hassle of creating a crawling infrastructure, renting space in a data center, and dealing with spinning platters covered in rust that freeze up on you when you least want them to. I exaggerate. In this day and age I would spend hours, days, maybe weeks agonizing over cloud infrastructure choices and worrying about my credit card bills if I wanted to create something on that scale.

If you want to create a new search engine, compile a list of congressional sentiment, monitor the spread of Facebook infection through the web, or create any other derivative work, it all starts when you think “if only I had the entire web on my hard drive.” Common Crawl is that hard drive, and using services like Amazon EC2 you can crunch through it all for a few hundred dollars. Others, like the gang at Lucky Oyster, would agree.

Which is great news! However, if you wanted to extract only a small subset, say every page from Wikipedia, you would still have to pay that few hundred dollars. The individual pages are randomly distributed across more than 200,000 archive files, each of which you would have to download and unzip to find all the Wikipedia pages. Well, you did, until now.

I’m happy to announce the first public release of the Common Crawl URL Index, designed to solve the problem of finding the locations of pages of interest within the archive based on their URL, domain, subdomain or even TLD (top level domain).

Keeping with Common Crawl tradition, we’re making the entire index available as a giant download. Fear not: there’s no need to rack up bandwidth bills downloading the entire thing. We’ve implemented it as a prefixed b-tree, so you can access parts of it randomly from S3 using byte-range requests. At the same time, you’re free to download the entire beast and work with it directly if you desire.

Information about the format, and samples of accessing it using Python, are available on GitHub. Feel free to post questions in the issue tracker and wikis there.

The index itself is located in the public datasets bucket at s3://commoncrawl/projects/url-index/url-index.1356128792.
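
To illustrate the byte-range idea (this is not the real b-tree lookup, which is documented on GitHub), the following Python sketch fetches an arbitrary 1 KB slice of the index file over HTTP; the public S3 endpoint form and the byte offsets are assumptions on my part:

    import urllib.request

    # Assumed HTTP form of the public S3 location given above.
    INDEX_URL = ("https://commoncrawl.s3.amazonaws.com/"
                 "projects/url-index/url-index.1356128792")

    # Ask S3 for just the first 1024 bytes instead of the whole file.
    req = urllib.request.Request(INDEX_URL, headers={"Range": "bytes=0-1023"})
    with urllib.request.urlopen(req) as resp:
        print(resp.status)        # 206 Partial Content for a ranged response
        chunk = resp.read()
    print(len(chunk))             # 1024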

This is the first release of the index. The main goals of the design are to allow querying of the index via byte-range requests and to make it easy to implement in any language. We hope you, dear reader, will be encouraged to jump in and contribute code to access the index in your favorite language.

For now we’ve avoided clever encoding schemes and compression. We expect that to change as the community has a chance to work with the data and contribute their expertise. Join the discussion; we’re happy to have you.