Erratum

Charset Detection Bug in WET Records

Originally reported by Javier de la Rosa.

The charset detection required to properly transform non-UTF-8 HTML pages in WARC files into WET records did not work before November 2016 due to a bug in IIPC Web Archive Commons (see the related issue in the CC fork of Apache Nutch). All subsequent crawls should contain significantly fewer such errors. The issue was originally discussed in Google Groups.
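To illustrate why charset detection matters when converting HTML to text, here is a minimal Python sketch. It is not the code used in Web Archive Commons or in the Common Crawl pipeline. The helper names `pick_charset` and `decode_html` are hypothetical, and the detection rule (HTTP header charset first, then a `<meta>` declaration, then a UTF-8 default) is a simplifying assumption.

```python
import codecs
import re

# Hypothetical illustration only; this is not Common Crawl's implementation.
# Finds a charset declared in an HTML <meta> tag, in either the
# <meta charset="..."> or the http-equiv Content-Type form.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def pick_charset(payload, http_charset=None):
    """Return a codec name for the page, defaulting to UTF-8 (assumed rule)."""
    candidates = [http_charset]
    match = META_CHARSET.search(payload[:4096])  # only sniff the page head
    if match:
        candidates.append(match.group(1).decode("ascii"))
    for name in candidates:
        if not name:
            continue
        try:
            return codecs.lookup(name).name  # normalise and validate the label
        except LookupError:
            continue  # unknown label, try the next candidate
    return "utf-8"


def decode_html(payload, http_charset=None):
    """Decode raw page bytes using the detected charset."""
    return payload.decode(pick_charset(payload, http_charset), errors="replace")


# A Latin-1 encoded page: decoding it with the declared charset keeps the
# accented character, while assuming UTF-8 turns it into U+FFFD.
page = (
    '<html><head><meta charset="iso-8859-1"></head>'
    "<body>café</body></html>"
).encode("iso-8859-1")

print(decode_html(page))
print(page.decode("utf-8", errors="replace"))
```

When detection is skipped and UTF-8 is assumed, as in the last line, the non-ASCII characters of a non-UTF-8 page are replaced or garbled in the extracted text. This is the kind of error to expect in WET records from crawls before the fix.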

Affected Crawls
Affected Web Graphs
No items found.