A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index. Today we are happy to announce a tool that makes it even easier for you to take advantage of the URL Index!
URL Search is a web application that allows you to search for any URL, URL prefix, subdomain or top-level domain. The results of your search show the number of files in the Common Crawl corpus that came from that URL and provide a downloadable JSON metadata file with the address and offset of the data for each URL. Once you download the JSON file, you can drop it into your code so that you only run your job against the subset of the corpus you specified. URL Search makes it much easier to find the files you are interested in and significantly reduces the time and money it take to run your jobs since you can now run them across only on the files of interest instead of the entire corpus.
We are excited to see examples of URL Search in action. Are you working with Common Crawl data? Would you like to win $100 in AWS credit for sharing how URL Search makes your life easier? The first five people who share open source code on GitHub that incorporates a JSON file from URL Search will each get $100 in AWS Credit!
Email a link to the GitHub repo to firstname.lastname@example.org for consideration. The code must be accompanied by a ReadMe file that explains. If you would like to write a guest blog post about your work we would be happy to publish it on the Common Crawl blog.