5 Good Reads in Big Open Data: Feb 6 2015

  1. The Dark Side of Open Data – via Forbes:

    There’s no reason to doubt that opening to the public of data previously unreleased by governments, if well managed, can be a boon for the economy and, ultimately, for the citizens themselves. It wouldn’t hurt, however, to strip out the grandiose rhetoric that sometimes surrounds them, and look, case by case, at the contexts and motivations that lead to their disclosure.


  2. Bigger Data; Same Laptop – via Frank McSherry: throwing more machines at a problem isn’t necessarily the best approach; a laptop, used effectively, can outperform clusters. This post demonstrates that using the Web Data Commons 128-billion-edge Hyperlink Graph, which was created from Common Crawl data.


  3. Fixing Verizon’s permacookie – via Slate: 9 lines of code could make Verizon’s controversial user-tracking system slightly less invasive and much less creepy.


  4. Interact with the Committee to Protect Journalists’ Data – via Reuters Graphics: an interactive map of journalists killed, over time and by location.

    Source: Committee to Protect Journalists. Graphic by Matthew Weber/Reuters Graphics


  5. The EU wants the rest of the world to forget too – via The New York Times:

    Countries have different standards for acceptable speech and for invasions of privacy. American libel laws, for example, are much more permissive than those in Britain. That’s why authors sometimes find it easier to have some books published in the United States than in Britain. There is no doubt that the Internet has made it harder for governments to enforce certain rules and laws because information is not easily contained within borders. But that does not justify restricting the information available to citizens of other countries.

    Follow us @CommonCrawl on Twitter for the latest in Big Open Data

The Promise of Open Government Data & Where We Go Next

One of the biggest boons for the Open Data movement in recent years has been the enthusiastic support from all levels of government for releasing more, and higher quality, datasets to the public. In May 2013, the White House released its Open Data Policy and announced the launch of Project Open Data, a repository of tools and information–which anyone is free to contribute to–that help government agencies release data that is “available, discoverable, and usable.”

Since 2013, many enterprising government leaders across the United States at the federal, state, and local levels have responded to the President’s call to see just how far Open Data can take us in the 21st century. Following the White House’s groundbreaking appointment in 2009 of Aneesh Chopra as the country’s first Chief Technology Officer, many local and state governments across the United States have created similar positions. San Francisco last year named its first Chief Data Officer, Joy Bonaguro, and released a strategic plan to institutionalize Open Data in the city’s government. Los Angeles’ new Chief Data Officer, Abhi Nemani, was formerly at Code for America and hopes to make LA a model city for open government. His office recently launched an Open Data portal along with other programs aimed at fostering a vibrant data community in Los Angeles.1

Open government data is powerful because of its potential to reveal information about major trends and to inform questions pertaining to the economic, demographic, and social makeup of the United States. A second, no less important, reason why open government data is powerful is its potential to help shift the culture of government toward one of greater collaboration, innovation, and transparency.

These gains are encouraging, but there is still room for growth. One pressing issue is for more government leaders to establish Open Data policies that specify the type, format, frequency, and availability of the data that their offices release. Open Data policies ensure that government entities not only release data to the public, but release it in useful and accessible formats.

Only nine states currently have formal Open Data policies, although at least two dozen have some form of informal policy and/or an Open Data portal.2 Agencies and state and local governments should not wait too long to standardize their policies on releasing Open Data; delay will severely limit Open Data’s potential. There is not much that a data analyst can do with a PDF.

One area of great potential is for data whizzes to pair open government data with web crawl data. Government data makes for a natural complement to other big datasets, like Common Crawl’s corpus of web crawl data, that together allow for rich educational and research opportunities. Educators and researchers should find Common Crawl data a valuable complement to government datasets when teaching data science and analysis skills. There is also vast potential to pair web crawl data with government data to create innovative social, business, or civic ventures.

Innovative government leaders across the United States (and the world!) and enterprising organizations like Code for America have laid an impressive foundation that others can continue to build upon as more and more government data is released to the public in increasingly usable formats. Common Crawl is encouraged by the rapid growth of this relatively new movement, and we are excited to see the collaborations to come as Open Government and Open Data grow together.

 

Allison Domicone was formerly a Program and Policy Consultant to Common Crawl and previously worked for Creative Commons. She is currently pursuing a master’s degree in public policy from the Goldman School of Public Policy at the University of California, Berkeley.

December 2014 Crawl Archive Available

The crawl archive for December 2014 is now available! This crawl archive is over 160TB in size and contains 2.08 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2014-52/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list the paths to all of the segment, WARC, WAT, and WET files in the archive.

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the full S3 or HTTP path, respectively.
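
For example, here is a short Python sketch of that path construction, assuming you have downloaded one of the gzipped path listings locally (the file name warc.paths.gz is an illustrative assumption):

    import gzip

    # Hypothetical local copy of one of the gzipped path listings for this crawl.
    PATHS_FILE = "warc.paths.gz"

    with gzip.open(PATHS_FILE, "rt") as f:
        for line in f:
            relative_path = line.strip()
            print("s3://commoncrawl/" + relative_path)                      # S3 path
            print("https://commoncrawl.s3.amazonaws.com/" + relative_path)  # HTTP path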

Thanks again to blekko for their ongoing donation of URLs for our crawl!

November 2014 Crawl Archive Available

The crawl archive for November 2014 is now available! This crawl archive is over 135TB in size and contains 1.95 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2014-49/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list the paths to all of the segment, WARC, WAT, and WET files in the archive.

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the full S3 or HTTP path, respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Please Donate To Common Crawl!

Big data has the potential to change the world. The talent exists and the tools are already there. What’s lacking is access to data. Imagine the questions we could answer and the problems we could solve if talented, creative technologists could freely access more big data.

At Common Crawl, we are passionate about getting big open data into the hands of talented and creative people. Increasing access to data enables everything from business innovation to groundbreaking research.

Common Crawl is proud of what we have accomplished in 2014 thanks to our dedicated team and the support of donors like you.

This year:

  • 19 academic publications using Common Crawl data were published
  • Several Open Educational Resources designed to teach big data tools and methods were created
  • 1.3 petabytes containing 18.5 billion web pages were added to the Common Crawl corpus
  • Numerous presentations and tutorials were given at international conferences, local meetup groups, and academic workshops in six countries

100% of our funding comes from donors like you — Thank you! We accomplish a great deal with a small, dedicated staff on a limited budget so your investment in us has a big impact.

More resources for Common Crawl means more access to data for everyone. In 2015 we have big plans to scale up crawling to more rapidly increase the Common Crawl corpus and to grow our educational programs and catalogue of tutorials in order to invest in the next generation of data-driven technologists.

Whether it’s $10 or $10,000, Common Crawl needs your donation today.

Donate here!

Thank you very much,
Lisa Green and the Common Crawl team

October 2014 Crawl Archive Available

The crawl archive for October 2014 is now available! This crawl archive is over 254TB in size and contains 3.72 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2014-42/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list the paths to all of the segment, WARC, WAT, and WET files in the archive.

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the full S3 or HTTP path, respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!

September 2014 Crawl Archive Available

The crawl archive for September 2014 is now available! This crawl archive is over 220TB in size and contains 2.98 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2014-41/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list the paths to all of the segment, WARC, WAT, and WET files in the archive.

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the full S3 or HTTP path, respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!

August 2014 Crawl Data Available

The August crawl of 2014 is now available! The new dataset is over 200TB in size containing approximately 2.8 billion webpages. The new data is located in the commoncrawl bucket at /crawl-data/CC-MAIN-2014-35/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list the paths to all of the segment, WARC, WAT, and WET files in the archive.

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the full S3 or HTTP path, respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Web Data Commons Extraction Framework for the Distributed Processing of CC Data

This is a guest blog post by Robert Meusel.
Robert Meusel is a researcher at the University of Mannheim in the Data and Web Science Research Group and a key member of the Web Data Commons project. The post below describes a new tool produced by Web Data Commons for extracting data from Common Crawl data.


The Web Data Commons project extracts structured data from the Common Crawl corpora and offers the extracted data for public download. We have extracted one of the largest hyperlink graphs currently available to the public. We also extract and offer large corpora of Microdata, Microformats and RDFa annotations, as well as relational HTML tables. Why do we do this? Because we share the opinion that data should be available to everybody, and because we want to make it easier to exploit the wealth of information that is available on the Web.

For performing the extractions, we need to go through the hundreds of terabytes of crawl data offered by the Common Crawl Foundation. As a project without any direct funding or salaried staff, we needed a time-, resource- and cost-efficient way to process the Common Crawl corpora. We therefore developed a data extraction tool which allows us to process the Common Crawl corpora in a distributed fashion using Amazon cloud services (AWS).

The basic architectural idea of the extraction tool is to have a queue that takes care of the proper handling of all files to be processed. Whenever a worker is ready, it receives a new file from the queue and reports the status (success or failure) of the processing back to the queue. Successfully processed files are removed from the queue; failed files are reassigned to another worker, or dropped once a fixed number of workers have failed to process them.
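
To make the pattern concrete, below is a minimal, in-process Python sketch of this queue/worker logic. It is not the framework’s actual distributed implementation (which runs on AWS); the file keys, retry limit, and worker count are illustrative assumptions.

    import queue
    import threading

    MAX_ATTEMPTS = 3   # assumed retry limit; the real framework makes this configurable
    NUM_WORKERS = 4    # assumed worker count; in practice these are separate machines

    # The queue holds (file_key, attempts) pairs; in the real framework this is a
    # distributed queue shared by all workers.
    work_queue = queue.Queue()
    for key in ["path/to/file1.warc.gz", "path/to/file2.warc.gz"]:  # illustrative keys
        work_queue.put((key, 0))

    def process_file(file_key):
        """Placeholder for the user-supplied, per-file extraction logic."""
        pass

    def worker():
        while True:
            try:
                file_key, attempts = work_queue.get(timeout=5)
            except queue.Empty:
                return  # queue drained, worker shuts down
            try:
                process_file(file_key)  # success: the file is done and not re-queued
            except Exception:
                if attempts + 1 < MAX_ATTEMPTS:
                    work_queue.put((file_key, attempts + 1))  # hand it to another worker
                # otherwise the file is dropped after repeated failures
            finally:
                work_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()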

We used the extraction tool, for example, to extract a hyperlink graph covering over 3.5 billion pages and 126 billion hyperlinks from the 2012 CC corpus (over 100TB when uncompressed). Using our framework and 100 EC2 instances, the extraction took less than 12 hours and cost less than US$500. The extracted graph was less than 100GB in compressed form.

With each new extraction we improved the tool, gradually turning it into a flexible framework: we now simply plug in the processor needed for a single file, and the framework takes care of everything else.

The framework has now been officially released under the terms of the Apache license. It takes care of everything related to file handling, distribution, and scalability, leaving the user only the task of writing the code that extracts the desired information from a single one of the CC files.

More information about the framework, a detailed guide on how to run it, and a tutorial showing how to customize it for your extraction tasks can be found at

http://webdatacommons.org/framework

We encourage all interested parties to make use of the framework. We will continue to improve it and are happy to hear from everybody who shares feedback on their experiences with the framework.

July 2014 Crawl Data Available

The July crawl of 2014 is now available! The new dataset is over 266TB in size containing approximately 3.6 billion webpages. The new data is located in the commoncrawl bucket at /crawl-data/CC-MAIN-2014-23/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list the paths to all of the segment, WARC, WAT, and WET files in the archive.

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the full S3 or HTTP path, respectively.

We’ve also released a Python library, gzipstream, that should enable easier access and processing of the Common Crawl dataset. We’d love for you to try it out!
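
As a rough illustration, here is a sketch of reading a multi-member gzip file (such as a WARC archive) with gzipstream. The GzipStreamFile class name and its file-like read() interface are assumptions based on the project’s documentation, and the local file name is a hypothetical placeholder.

    from gzipstream import GzipStreamFile

    # Hypothetical local copy of a Common Crawl WARC file (a multi-member gzip stream).
    with open("example.warc.gz", "rb") as raw:
        stream = GzipStreamFile(raw)
        total_bytes = 0
        while True:
            chunk = stream.read(1024 * 1024)  # read the decompressed stream in 1 MB chunks
            if not chunk:
                break
            total_bytes += len(chunk)
        print("decompressed bytes:", total_bytes)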

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Note: the original estimate for this crawl was 4 billion, but after full analytics were run, this estimate was revised.