Erratum
Charset Detection Bug in WET Records
Originally reported by Javier de la Rosa.
The charset detection required to properly transform non-UTF-8 HTML pages in WARC files into WET records did not work before November 2016 due to a bug in IIPC Web Archive Commons (see the related issue in the Common Crawl fork of Apache Nutch). There should be significantly fewer errors in all subsequent crawls. The issue was originally discussed in Google Groups.
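To illustrate why charset detection matters for text extraction, the following is a minimal sketch in Python. It is not the IIPC Web Archive Commons or Nutch code; the helper names `detect_charset` and `to_text` are hypothetical, and the detection strategy (HTTP `Content-Type` header first, then a `<meta>` tag, then a UTF-8 default) is an assumption chosen for simplicity. It shows how an ISO-8859-1 page decodes correctly when its declared charset is honoured and is damaged when UTF-8 is assumed.

```python
import re

# Matches charset declarations such as <meta charset="iso-8859-1"> or
# <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)
HEADER_CHARSET = re.compile(r'charset=["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def detect_charset(content_type, body, default="utf-8"):
    """Hypothetical helper: pick a charset from the HTTP header, then <meta>, else default."""
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # Only scan the start of the document, where the declaration normally sits.
    match = META_CHARSET.search(body[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


def to_text(content_type, body):
    """Hypothetical helper: decode raw page bytes using the detected charset."""
    charset = detect_charset(content_type, body)
    try:
        return body.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label: fall back to UTF-8 rather than fail.
        return body.decode("utf-8", errors="replace")


# A small ISO-8859-1 page that declares its charset in a <meta> tag.
page = (
    '<html><head><meta charset="iso-8859-1"></head>'
    "<body>España</body></html>"
).encode("iso-8859-1")

good = to_text(None, page)                    # charset detected from <meta>
bad = page.decode("utf-8", errors="replace")  # detection skipped, UTF-8 assumed
```

In this sketch `good` retains the word "España", while `bad` contains a U+FFFD replacement character in place of "ñ", which is the kind of damage that failed detection can introduce into extracted text.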
Affected Crawls
Affected Web Graphs
No items found.