Common Crawl has started talking with the Open Cloud Consortium (OCC) about working together. If you haven’t already heard of the OCC, it is an awesome nonprofit organization managing and operating cloud computing infrastructure that supports scientific, environmental, medical and health care research. We’re very interested in facilitating the use of Common Crawl data by researchers and academics, so we are excited about the idea of working with the OCC.
The Open Cloud Consortium has four working groups, one of which is the Open Science Data Cloud (OSDC). The infrastructure of the OSDC has been designed to address the challenges inherent in transporting large datasets, to balance the needs of data management and data analysis, and to archive data. The OSDC is based on a shared community infrastructure where hardware and software are shared among researchers and projects at the scale where it is most efficient to centrally locate and process data.
The OSDC has carved out a space between small public infrastructures like AWS and the very large, dedicated infrastructures needed for projects like the Large Hadron Collider. The OCC’s diagram describes the distinction it makes between small, medium, and very large infrastructures.
More details about the OCC and its working groups can be found in a highly informative paper [PDF] that was presented by several members of the OCC team at the 2010 ACM International Symposium on High Performance Distributed Computing. The paper gives a technical overview and describes some of the challenges faced by the Open Science Data Cloud. You can also find more information on the Open Cloud Consortium website and on the Open Science Data Cloud website.
We are excited about the important work being done by the Open Cloud Consortium and about the possibility of working closely with its Open Science Data Cloud working group. Stay tuned for more news as our partnership with the organization develops.