Erratum

SURT URLs do not properly encode non-UTF-8 percent-encoded characters

Originally reported by Tom Morris.

When SURT (Sort-friendly URI Reordering Transform) URLs were constructed, percent-encoded bytes that do not form valid UTF-8 sequences were not handled correctly. This could lead to inconsistent URL normalization and sorting, and potentially to incorrect deduplication or failed lookups in datasets that rely on SURT-based indexing. This was addressed in commoncrawl/nutch@6b2d9ea.
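
To illustrate the general class of problem, the following is a minimal Python sketch and not the Nutch code touched by the fix. It assumes one plausible failure mode: a normalizer that decodes percent-escapes as UTF-8 text and then re-encodes them. The function names `naive_normalize` and `byte_preserving_normalize` and the example paths are hypothetical; the actual cause and fix in commoncrawl/nutch@6b2d9ea may differ.

```python
from urllib.parse import quote, unquote, unquote_to_bytes


def naive_normalize(path: str) -> str:
    # Hypothetical normalizer: decode escapes as UTF-8 text, then re-encode.
    # unquote() defaults to errors="replace", so any byte sequence that is
    # not valid UTF-8 becomes U+FFFD and the original bytes are lost.
    return quote(unquote(path), safe="/")


def byte_preserving_normalize(path: str) -> str:
    # Hypothetical alternative: decode escapes to raw bytes and re-encode
    # those bytes, so sequences that are not valid UTF-8 survive unchanged.
    return quote(unquote_to_bytes(path), safe="/")


# %E9 and %E8 are Latin-1 bytes; alone they are not valid UTF-8 sequences.
a = "/caf%E9"
b = "/caf%E8"

# Two distinct URLs collapse to the same normalized form.
print(naive_normalize(a) == naive_normalize(b))

# The byte-preserving variant keeps them distinct.
print(byte_preserving_normalize(a), byte_preserving_normalize(b))
```

Under this assumption, two different URLs would end up with the same key in a SURT-based index, which is the kind of deduplication or retrieval inconsistency described above. Well-formed UTF-8 escapes such as `%C3%A9` pass through both functions unchanged.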

Affected Crawls
Affected Web Graphs
No items found.