Erratum
SURT URLs do not properly encode non-UTF-8 percent-encoded characters
Originally reported by Tom Morris.
When constructing SURT (Sort-friendly URI Reordering Transform) URLs, percent-encoded byte sequences that are not valid UTF-8 were not handled correctly. This could lead to inconsistent URL normalization and sorting, and in turn to incorrect deduplication or failed lookups in datasets that rely on SURT-based indexing. The issue was fixed in commoncrawl/nutch@6b2d9ea.
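The general failure mode can be illustrated with a short sketch. This is not the Nutch code and does not reproduce the actual fix; it uses Python's standard `urllib.parse` only to show how decoding percent-escapes as UTF-8 can lose information when the bytes are not valid UTF-8. The function names `lossy_roundtrip` and `byte_preserving_roundtrip` are hypothetical and chosen for illustration.

```python
from urllib.parse import quote, unquote, unquote_to_bytes


def lossy_roundtrip(path: str) -> str:
    """Decode percent-escapes as UTF-8 text, then re-encode.

    unquote() defaults to errors="replace", so any byte sequence that is
    not valid UTF-8 is replaced by U+FFFD before being re-encoded.
    """
    return quote(unquote(path))


def byte_preserving_roundtrip(path: str) -> str:
    """Decode percent-escapes to raw bytes, then re-encode those bytes.

    No text decoding takes place, so invalid UTF-8 bytes survive unchanged.
    """
    return quote(unquote_to_bytes(path))


# %E9 and %E8 are Latin-1 bytes, not valid UTF-8 on their own.
a = "/caf%E9"
b = "/caf%E8"

# The lossy variant maps two distinct URLs to the same normalized form.
print(lossy_roundtrip(a), lossy_roundtrip(b))

# The byte-preserving variant keeps them distinct.
print(byte_preserving_roundtrip(a), byte_preserving_roundtrip(b))
```

In the lossy variant both inputs collapse to the same string, which is the kind of collision that could cause incorrect deduplication in an index keyed on normalized URLs. Valid UTF-8 escapes such as `/caf%C3%A9` pass through both variants unchanged.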
Affected Crawls
Affected Web Graphs
No items found.