Common Crawl Discussion List

We have started a Common Crawl discussion list to enable discussion and encourage collaboration among the community of coders, hackers, data scientists, developers and organizations interested in working with open web crawl data. Please join our discussion mailing list to:

  • Discuss challenges
  • Share ideas for projects and products
  • Look for collaborators and partners
  • Offer advice and share methods
  • Ask questions and get advice from others
  • Show off cool stuff you build
  • Keep up to date on the latest news from Common Crawl

The Common Crawl discussion list uses Google Groups, and you can sign up here.

  • Rich

    I found some memory leaks in your code. I’ve been running up to 10 instances against the corpus in a MapReduce job; after 3 chunks of data per instance have been distributed to each of the 10 nodes, the memory continues to climb. I suspect you are doing something wrong with your buffers and maybe the queue? The memory never gets cleaned up, so the machines run out of memory and processing speed drops drastically. I’ve tested this in several cases and also double-checked my code, which is threaded in the Map method, and it returns clean. Is there any way you can fix your queue and/or buffers and put another distro out on Git? The sooner the better... Thanks. I am also getting this error at times:

    java.io.IOException: IO Timeout waiting for Buffer
    at org.commoncrawl.util.shared.ArcFileReader$2.read(ArcFileReader.java:163)
    at java.io.FilterInputStream.read(FilterInputStream.java:116)
    at java.io.PushbackInputStream.read(PushbackInputStream.java:169)
    at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:221)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
    at org.commoncrawl.util.shared.ArcFileReader.getNextItem(ArcFileReader.java:271)
    at org.commoncrawl.hadoop.io.ARCSplitReader.next(ARCSplitReader.java:171)
    at org.commoncrawl.hadoop.io.ARCSplitReader.next(ARCSplitReader.java:199)
    at org.commoncrawl.hadoop.io.ARCSplitReader.next(ARCSplitReader.java:42)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:194)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:178)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

    • Yash

      I am facing a similar problem. Has anyone found a solution for this?

      Following is the error I am getting:
      java.io.IOException: IO Timeout waiting for Buffer
      at org.commoncrawl.util.shared.ArcFileReader$2.read(ArcFileReader.java:163)
      at org.commoncrawl.util.shared.ArcFileReader$2.read(ArcFileReader.java:142)
      at java.io.FilterInputStream.read(FilterInputStream.java:66)
      at java.io.PushbackInputStream.read(PushbackInputStream.java:122)
      at java.util.zip.CheckedInputStream.read(CheckedInputStream.java:42)
      at org.commoncrawl.util.shared.ArcFileReader.readUByte(ArcFileReader.java:461)
      at org.commoncrawl.util.shared.ArcFileReader.readUShort(ArcFileReader.java:453)
      at org.commoncrawl.util.shared.ArcFileReader.readHeader(ArcFileReader.java:383)
      at org.commoncrawl.util.shared.ArcFileReader.readARCHeader(ArcFileReader.java:327)
      at org.commoncrawl.util.shared.ArcFileReader.hasMoreItems(ArcFileReader.java:234)
      at org.commoncrawl.hadoop.io.ARCSplitReader.next(ARCSplitReader.java:166)
      at org.commoncrawl.hadoop.io.ARCSplitReader.next(ARCSplitReader.java:199)
      at org.commoncrawl.hadoop.io.ARCSplitReader.next(ARCSplitReader.java:42)
      at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
      at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
      at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
      at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
      at org.apache.hadoop.mapred.Child.main(Child.java:170)