Jobs

Come Help Build Common Crawl!

We’re looking for someone enthusiastic about open source, net neutrality, open data, and keeping the web truly open. Common Crawl is dedicated to building and maintaining an open repository of web crawl data in order to enable a new wave of innovation, education and research. If you’re looking to do work that matters, come join us!

We’re set to do amazing things this year, and there is no better way to hone your big data skills than by helping us manage and process our 100 TB corpus. Plus, you’ll be working within a passionate community and have the chance to work with plenty of talented researchers, educators, startup folks, and an incredible advisory board.

Crawl Engineer

    Responsibilities

  • Improve the stability, scaling, and visibility of our distributed web crawler.
  • Use, improve, and extend our post-crawl, Hadoop-based web data processing pipeline so we can provide better and more regular crawls.
  • Design and build an easy-to-use mechanism for specification and execution of custom crawls.

    Desired Skills & Experience

  • You have the necessary background to architect and code for a system with tens of billions of documents.
  • You have strong coding ability and experience with Java and at least one scripting language (e.g. Python, Ruby, Perl, Lua).
  • You have in-depth knowledge of HTTP and are familiar with web crawlers.
  • You have development and administrative experience with Hadoop and HDFS.
  • You have operations experience with Linux or another UNIX.
  • You have at least some familiarity with AWS, including one or more of EC2, S3, EBS, and EMR.
  • You like to build useful, thorough documentation of code and systems.
  • You’re a self-starter willing to take ownership of projects.


Data Scientist

    Responsibilities

  • Design and build a platform for extracting, transforming, and mining web crawls that we can give away to our users.
  • Run data analytics projects to provide examples and recipes for doing useful things with Common Crawl data.
  • Communicate with external users and help them make use of the tools and data we provide.
  • Coordinate with external organizations to design and build custom crawling and processing pipelines.

    Desired Skills & Experience

  • You’re interested in big data: sorting through billions of web pages and finding current, interesting, and relevant facts about the internet.
  • You have some familiarity with data mining toolkits (e.g. Weka, Mahout, R, NLTK) and understand how to use them in a scalable context.
  • You have development and administrative experience with Hadoop and HDFS.
  • You have experience working in a Linux/AWS environment.
  • You like to build useful, thorough documentation of code and systems.
  • You’re a self-starter willing to take ownership of projects.

Volunteers and Interns

We are always happy to have volunteers and interns!

Please contact Lisa Green at lisa@commoncrawl.org for more information on any of the above positions.