The Promise of Open Government Data & Where We Go Next

One of the biggest boons for the Open Data movement in recent years has been the enthusiastic support from all levels of government for releasing more, and higher quality, datasets to the public. In May 2013, the White House released its Open Data Policy and announced the launch of Project Open Data, a repository of tools and information (to which anyone is free to contribute) that helps government agencies release data that is “available, discoverable, and usable.”

Since 2013, many enterprising government leaders across the United States at the federal, state, and local levels have responded to the President’s call to see just how far Open Data can take us in the 21st century. Following the White House’s groundbreaking appointment in 2009 of Aneesh Chopra as the country’s first Chief Technology Officer, many local and state governments across the United States have created similar positions. San Francisco last year named its first Chief Data Officer, Joy Bonaguro, and released a strategic plan to institutionalize Open Data in the city’s government. Los Angeles’ new Chief Data Officer, Abhi Nemani, was formerly at Code for America and hopes to make LA a model city for open government. His office recently launched an Open Data portal along with other programs aimed at fostering a vibrant data community in Los Angeles.1

Open government data is powerful because of its potential to reveal information about major trends and to inform questions pertaining to the economic, demographic, and social makeup of the United States. A second, no less important, reason why open government data is powerful is its potential to help shift the culture of government toward one of greater collaboration, innovation, and transparency.

These gains are encouraging, but there is still room for growth. One pressing need is for more government leaders to establish Open Data policies that specify the type, format, frequency, and availability of the data that their offices release. An Open Data policy ensures that government entities not only release data to the public, but release it in useful and accessible formats.

Only nine states currently have formal Open Data policies, although at least two dozen have some form of informal policy and/or an Open Data portal.2 Agencies and state and local governments should not wait too long to standardize their policies for releasing Open Data; waiting will severely limit Open Data’s potential. There is not much that a data analyst can do with a PDF.

One area of great potential is for data whizzes to pair open government data with web crawl data. Government data is a natural complement to other big datasets, like Common Crawl’s corpus of web crawl data; together they allow for rich educational and research opportunities. Educators and researchers should find Common Crawl data a valuable complement to government datasets when teaching data science and analysis skills. There is also vast potential to pair web crawl data with government data to create innovative social, business, or civic ventures.

Innovative government leaders across the United States (and the world!) and enterprising organizations like Code for America have laid an impressive foundation that others can continue to build upon as more and more government data is released to the public in increasingly usable formats. Common Crawl is encouraged by the rapid growth of a relatively new movement and we are excited to see the collaborations to come as Open Government and Open Data grow together.

 

Allison Domicone was formerly a Program and Policy Consultant to Common Crawl and previously worked for Creative Commons. She is currently pursuing a master’s degree in public policy from the Goldman School of Public Policy at the University of California, Berkeley.

Winners of the Code Contest!

We’re very excited to announce the winners of the First Ever Common Crawl Code Contest! We were thrilled by the response to the contest and the many great entries. Several people let us know that they were not able to complete their project in time to submit to the contest. We’re currently working with them to finish the projects outside of the contest and we’ll be showcasing some of those projects in the near future! All entries creatively showcased the immense potential of the Common Crawl data.

Huge thanks to everyone who participated, and to our prize and swag sponsors! A special thank you goes out to our panel of incredible judges, who had a very difficult time choosing the winners from among such great entries. All of the entries were in the Social Impact category, so the third grand prize (originally intended for the Job Trends category) goes to the runner-up in People’s Choice. And the grand prize winners are…

 

People’s Choice: Linking Entities to Wikipedia
It’s not surprising that this entry was so popular in the community! It seeks to determine the meaning of a word based on the probability that it maps to each of the possible Wikipedia concepts. It creates very useful building blocks for a large range of uses, and it’s also exhilarating to see how it can be tweaked and tuned toward specific questions.

Project description
Code on GitHub
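The core idea behind this kind of entity linking can be sketched as a tiny “commonness” lookup: pick the Wikipedia concept that a surface form most often links to. The article names and counts below are invented for illustration; the real entry derives such statistics from actual anchor text at scale.

```python
# Toy "commonness" entity linking: map a mention to the Wikipedia
# concept it most often links to. ANCHOR_COUNTS is hypothetical;
# real statistics come from large-scale anchor-text analysis.
from collections import Counter

# anchor text -> counts of Wikipedia articles it links to (made up)
ANCHOR_COUNTS = {
    "jaguar": Counter({"Jaguar_(animal)": 620, "Jaguar_Cars": 310, "Jacksonville_Jaguars": 70}),
    "python": Counter({"Python_(programming_language)": 820, "Pythonidae": 150}),
}

def link_entity(mention):
    """Return (concept, probability) for the most likely concept."""
    counts = ANCHOR_COUNTS.get(mention.lower())
    if not counts:
        return None, 0.0
    total = sum(counts.values())
    concept, n = counts.most_common(1)[0]
    return concept, n / total

concept, p = link_entity("Python")
print(concept, round(p, 2))  # → Python_(programming_language) 0.85
```

In practice the probabilities would be combined with context features, but even this bare lookup makes a surprisingly strong baseline.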


People’s Choice: French Open Data

Another very popular entry, this work maps the ecosystem of French open data in order to identify the players, their importance, and their relationships. The resulting graph grants insight into the world of French open data, and the excellent code could easily be adapted to explore terms other than “Open Data” or to create subsets based on language.

Project description
Code on GitHub

 

Social Impact: Online Sentiment Towards Congressional Bills
This entry correlated Common Crawl data and congressional data to look at the online conversation surrounding individual pieces of legislation. Contest judge Pete Warden’s comments about this work do a great job of summing up all the excitement about this project:

“There are millions of people talking online about laws that may change their lives, but their voices are usually lost in the noise of the crowd. This work can highlight how ordinary web users think about bills in congress, and give their views influence on decision-making. By moving beyond polls into detailed analysis of people’s opinions on new laws, it shows how open data can ‘democratize’ democracy itself!”

Code on GitHub
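As a rough, hypothetical illustration of the approach (the bill pattern and sentiment lexicon below are toy stand-ins for the project’s real analysis over the full crawl):

```python
# Sketch of the idea: scan page text for mentions of congressional
# bills and score the surrounding sentence with a tiny sentiment
# lexicon. Pattern and word lists are illustrative only.
import re

BILL_RE = re.compile(r"\b(H\.?R\.?|S\.?)\s*(\d+)\b")
POSITIVE = {"support", "great", "good", "favor"}
NEGATIVE = {"oppose", "bad", "harmful", "against"}

def bill_sentiment(text):
    """Return {bill_id: score}: +1 per positive word, -1 per negative."""
    scores = {}
    for sentence in text.split("."):
        m = BILL_RE.search(sentence)
        if not m:
            continue
        bill = f"{m.group(1).replace('.', '').upper()} {m.group(2)}"
        words = set(w.lower() for w in re.findall(r"[a-zA-Z]+", sentence))
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        scores[bill] = scores.get(bill, 0) + score
    return scores

print(bill_sentiment("I support HR 1234. Many oppose S 99."))
# → {'HR 1234': 1, 'S 99': -1}
```

The real project aggregates this kind of signal across millions of pages, which is exactly where a corpus like Common Crawl earns its keep.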

 

Honorable Mentions
The following four projects – listed in alphabetical order – all came very close to winning.

Facebook Infection
Project description

Is Money the Root of All Evil?
Project description
Code on GitHub

Reverse Link!
Web app
Code on GitHub

Web Data Commons
Project description
Code on Assembla

We encourage you to check out the code created in the contest and see how you can use it to extract insight from the Common Crawl data! To learn how to get started, check out our wiki, which also includes some inspiration and ideas to get your creative juices flowing.

Thank you to all our wonderful sponsors!

Prize Sponsors:

 

Swag Sponsors:

 

Amazon Web Services sponsoring $50 in credit to all contest entrants!

Did you know that every entry to the First Ever Common Crawl Code Contest gets $50 in Amazon Web Services (AWS) credits? If you’re a developer interested in big datasets and learning new platforms like Hadoop, you truly have no reason not to try your hand at creating an entry to the code contest! Plus, three grand prize winners will get $500 in AWS credits, so you can continue to play around with the dataset and hone your skills even more.

Amazon Web Services has published a dedicated landing page for the contest, which takes you straight to the data. Whether or not you decide to enter the code contest, if you’re looking to play around with and get used to the tools available, an excellent way to do so is with the Amazon Machine Image.

AWS has been a great supporter of the code contest as well as of Common Crawl in general. We are deeply appreciative for all they’re doing to help spread the word about Common Crawl and make our dataset easily accessible!

 

 


Strata Conference + Hadoop World

This year’s Strata Conference teams up with Hadoop World for what promises to be a powerhouse convening in NYC from October 23-25. Check out their full announcement below and secure your spot today.

Strata + Hadoop World

Now in its second year in New York, the O’Reilly Strata Conference explores the changes brought to technology and business by big data, data science, and pervasive computing. This year, Strata has joined forces with Hadoop World to create the largest gathering of the Apache Hadoop community in the world.

Strata brings together decision makers using the raw power of big data to drive business strategy, and practitioners who collect, analyze, and manipulate that data, particularly in the worlds of finance, media, and government.

The keynotes they have lined up this year are fantastic: Doug Cutting on Beyond Batch, Rich Hickey on The Composite Database (10/24), plus Mike Olson, Sharmila Shahani-Mulligan, Cathy O’Neil, and other great speakers. The sessions are also full of great topics; you can see the full schedule here. This year’s conference will include the launch of the Strata Data Innovation Awards. There is so much important work being done in the world of data that it is going to be a very difficult decision for the award committee, and I can’t wait to see who the award winners are. The entire three days of Strata + Hadoop World are going to be exciting and thought-provoking, so you can’t afford to miss it.

P.S. We’re thrilled to have Strata as a prize sponsor of Common Crawl’s First Ever Code Contest. If you’ve been thinking about submitting an entry, you couldn’t ask for a better reason to do so: you’ll have the chance to win an all-access pass to Strata Conference + Hadoop World 2012!

 

 


Common Crawl’s Brand Spanking New Video and First Ever Code Contest!

At Common Crawl we’ve been busy recently! After announcing the release of 2012 data and other enhancements, we are now excited to share with you this short video that explains why we here at Common Crawl are working hard to bring web crawl data to anyone who wants to use it. We hope it gets you excited about our work too. Please help us share this by posting, forwarding, and tweeting widely! We want our message to be broadcast loud and clear: openly accessible web crawl data is a powerful resource for education, research, and innovation of every kind.

We also hope that by the end of the video, you’ll be so inspired that you’ll be left itching to get your hands on our terabytes of data. Which is exactly why we’re launching our FIRST EVER CODE CONTEST. We’re calling all open data and open web enthusiasts to help us demonstrate the power of web crawl data to inform Job Trends and offer Social Impact Analysis, the two examples given in the video. If you’re up for the challenge, head over to our contest page to learn all the details of how to submit and get more ideas for ways to seek information from the corpus of data in these two very important fields of interest. The contest will be open for submissions for just six weeks, until August 29th, and we’ve got some seriously awesome prizes and stellar judges lined up. So get coding!

OSCON 2012

We’re just one month away from one of the biggest and most exciting events of the year, O’Reilly’s Open Source Convention (OSCON). This year’s conference will be held July 16th-20th in Portland, Oregon. The date can’t come soon enough. OSCON is one of the most prominent confluences of “the world’s open source pioneers, builders, and innovators” and promises to stimulate, challenge, and amuse over the course of five action-packed days. It will feature an audience of 3,000 open-source enthusiasts, incredible speakers, more than a dozen tracks, and hundreds of workshops. It’s the place to be! So naturally, Common Crawl will be there to partake in the action.

Gil Elbaz, Common Crawl’s fearless founder and CEO of Factual, Inc., will lead a session called Hiding Data Kills Innovation on Wednesday, July 18th at 2:30pm, where he’ll discuss the relationship between data accessibility and innovation. Other members of the Common Crawl team will be there as well, and we’re looking forward to meeting, connecting, and sharing ideas with you! Keep an eye out for Gil’s session and be sure to come say hi.

If you haven’t registered, it’s not too late to secure a spot today. If you’ve already registered, we hope to see you there! We’re curious: what are some other sessions you’re looking forward to at this year’s OSCON?

Learn Hadoop and get a paper published

We’re looking for students who want to try out the Hadoop platform and get a technical report published.

(If you’re looking for inspiration, we have some paper ideas below. Keep reading.)

Hadoop’s version of MapReduce will undoubtedly come in handy in your future research, and Hadoop is a fun platform to get to know. Common Crawl, a nonprofit organization with a mission to build and maintain an open crawl of the web that is accessible to everyone, has a huge repository of open data – about 5 billion web pages – and documentation to help you learn these tools.
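If you haven’t met MapReduce before, the pattern Hadoop parallelizes across a cluster can be simulated locally in a few lines. This toy word count over two stand-in “pages” shows the map, shuffle, and reduce stages:

```python
# Local simulation of the MapReduce pattern: map each page to
# (word, 1) pairs, shuffle (sort/group) by key, then reduce by
# summing. Hadoop runs the same stages distributed over a cluster.
from itertools import groupby
from operator import itemgetter

pages = [
    "open data is open",
    "hadoop makes big data simple",
]

def mapper(page):
    for word in page.split():
        yield word, 1

def reducer(word, values):
    return word, sum(values)

# map stage
pairs = [kv for page in pages for kv in mapper(page)]
# shuffle stage: Hadoop sorts and groups by key between map and reduce
pairs.sort(key=itemgetter(0))
# reduce stage
counts = dict(
    reducer(word, (v for _, v in group))
    for word, group in groupby(pairs, key=itemgetter(0))
)
print(counts["open"], counts["data"])  # → 2 2
```

Once this pattern clicks, translating it into Hadoop’s Java API (or a streaming wrapper) is mostly mechanical.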

So why not knock out a quick technical report on Hadoop and Common Crawl? Every grad student could use an extra item in the Publications section of his or her CV.

As an added bonus, you would be helping us out. We’re trying to encourage researchers to use the Common Crawl corpus. Your technical report could inspire others and provide citable papers for them to reference.

Leave a comment now if you’re interested! Then once you’ve talked with your advisor, follow up to your comment, and we’ll be available to help point you in the right direction technically.


Step 1: Learn Hadoop


Step 2:
Turn your new skills on the Common Crawl corpus, available on Amazon Web Services.


Step 3:
Reflect on the process and what you find. Compile these valuable insights into a publication. The possibilities are limitless; here are some fun titles we’d love to see come to life:

  • “Identifying the most used Wikipedia articles with Hadoop and the Common Crawl corpus”
  • “Six degrees of Kevin Bacon: an exploration of open web data”
  • “A Hip-Hop family tree: From Akon to Jay-Z with the Common Crawl data”

Here are some other interesting topics you could explore:

  • Using this data can we ask “how many Jack Blacks are there in the world?”
  • What is the average price for a camera?
  • How much can you trust HTTP headers? The response headers served with a webpage very often contradict the actual page content, on points such as what language it’s in or its byte encoding. Browsers use these headers as hints but must examine the actual content to decide what that content is. It would be interesting to measure how often the two disagree.
  • How much is enough? Some questions we ask of data, such as “what’s the most common word in the English language,” actually don’t need much data at all to answer. So what is the point of a dataset of this size? What value can someone extract from the full dataset? How does this value change with a 50% sample, a 10% sample, a 1% sample? For a particular problem, how should this sample be drawn?
  • Train a text classifier to identify topicality. Extract meta keywords from Common Crawl HTML data, then construct a training corpus of topically-tagged documents to train a text classifier for a news application.
  • Identify political sites and their leanings, then cluster and visualize their networks of links. (You could use Blekko’s /conservative and /liberal tag lists as a starting point.)
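To make the HTTP-headers idea above concrete, here is a minimal, hypothetical sketch that compares the charset declared in a Content-Type header against the charset declared in the page’s own meta tag; a real study over the crawl would also sniff the raw bytes.

```python
# Compare the charset an HTTP Content-Type header declares against
# the charset the HTML itself declares in a <meta> tag. The sample
# record below is illustrative.
import re

CHARSET = re.compile(r'charset=["\']?([\w-]+)', re.IGNORECASE)

def declared_charsets(content_type_header, html):
    """Return (header_charset, meta_charset); either may be None."""
    m = CHARSET.search(content_type_header or "")
    header = m.group(1).lower() if m else None
    m = CHARSET.search(html or "")
    meta = m.group(1).lower() if m else None
    return header, meta

record = {
    "content_type": "text/html; charset=ISO-8859-1",
    "html": '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">',
}
header, meta = declared_charsets(record["content_type"], record["html"])
print(header, meta, header != meta)  # the mismatch is one contradiction to count
```

Run over millions of crawl records, the fraction of such mismatches would answer the “how much can you trust headers?” question empirically.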

So, again — if you think this might be fun, leave a comment now to mark your interest. Talk with your advisor, post a follow up to your comment, and we’ll be in touch!

Big Data Week: meetups in SF and around the world

 

Big Data Week aims to connect data enthusiasts, technologists, and professionals across the globe through a series of meetups between April 19th-28th. The idea is to build community among groups working on big data and to spur conversations about relevant topics ranging from technology to commercial use cases. With big data an increasingly hot topic, it’s becoming ever more important for data scientists, technologists, and wranglers to work together to establish best practices and build upon each others’ innovations.

With 50 meetups spread across England, Australia, and the U.S., there is plenty happening between April 19th and 28th. If you’re in the SF Bay Area, here are a few noteworthy events that may be of interest to you!

  • Bio + Tech | Bio Hackers and Founders Meetup on Tuesday, April 24th, 7pm at Giordano in the Mission. This will be a great chance to network with a diverse group of professionals from across the fields of science, data, and medicine.
  • Introduction to Hadoop on Tuesday, April 24th, 6:30pm at Swissnex. This is a full event, but you can join the waiting list.
  • InfoChimps Presents Ironfan on Thursday, April 26th, 7pm at SurveyMonkey in Palo Alto. Hear Flip Kromer, CTO of Infochimps, present on Ironfan, which makes provisioning and configuring your Big Data infrastructure simple.
  • Data Science Hackathon on Saturday, April 28th. This international hackathon aims to demonstrate the possibilities and power of combining Data Science with Open Source, Hadoop, Machine Learning, and Data Mining tools.

See a full list of events on the Big Data Week website.

Common Crawl’s Advisory Board

As part of our ongoing effort to grow Common Crawl into a truly useful and innovative tool, we recently formed an Advisory Board to guide us in our efforts. We have a stellar line-up of advisory board members who will lend their passion and expertise in numerous fields as we grow our vision. Together with our dedicated Board of Directors, we feel the organization is more prepared than ever to usher in an exciting new phase for Common Crawl and a new wave of innovation in education, business, and research.

Here is a brief introduction to the men and women who have generously agreed to donate their time and brainpower to Common Crawl. Full bios are available on our Advisory Board page.

Our legal counsel, Kevin DeBré, is a well respected Intellectual Property (IP) attorney who has continually worked at the forefront of the evolving IP landscape. Glenn Otis Brown brings additional legal expertise as well as a long history of working at the forefront of tech and the open web, including currently serving as Director of Business Development for Twitter and on the board of Creative Commons. Another strong advocate for openness, Joi Ito, is Director of the MIT Media Lab and Creative Commons Board Chair, who brings with him years of innovative work as a thought-leader in the field.

We look forward to the advice of Jen Pahlka, founder and Executive Director at Code for America. Jen has led Code for America through a remarkable two years of growth to become a high-impact success, and we are delighted to have her insight on growing a non-profit as well as her experience working with government. Eva Ho, VP of Marketing & Operations at Factual who has also served on the boards of several nonprofits, brings additional insight into nonprofit management, as well as valuable experience around big data.

Big data is critical to our work of maintaining an open crawl of the web, and we are fortunate to have numerous experts who can advise on this critical area. Kurt Bollacker is the Digital Research Director of the Long Now Foundation and he formerly served as Technical Director at Internet Archive and Chief Scientist at Metaweb. Pete Skomoroch is a highly respected data scientist, currently employed by LinkedIn, who brings with him substantial knowledge about machine learning and search. Boris Shimanovsky is a prolific, lifelong programmer and Director of Engineering at Factual. Pete Warden, also a programmer, is the current CTO of Jetpac and a highly respected expert in large-scale data processing and visualization.

Danny Sullivan, widely considered a leading “search engine guru,” will bring valuable guidance and insight as Common Crawl grows and develops. Bill Michels is another member of our team with extensive experience in search from his years at Yahoo!, which included serving as Director of Yahoo! BOSS. We are also very lucky to have Peter Norvig, Director of Research at Google and a Fellow of the American Association for Artificial Intelligence and the Association for Computing Machinery.

We are delighted that such an array of talented people see the importance in the work we do, and are honored to have their guidance as we look forward to a year of growth and milestones for Common Crawl.

Video Tutorial: MapReduce for the Masses

Learn how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents. Check out the full blog post where this video originally appeared.