The main issue with web content is that web archive data is very heterogeneous. Depending on the policy of the institution, the data contains text documents in all kinds of text encodings, HTML content that loosely follows different HTML specifications, audio and video files encoded with a variety of codecs, and so on. To make any preservation decision, however, it is indispensable to have detailed information about the content of the web archive, especially the information that preservation tools depend on.
It is not possible to perform a data migration without knowing exactly what kind of digital object is encountered in the collection and what the logical and technical dependencies of that object are. It is not only necessary to identify the individual objects contained in an ARC/WARC file, but also to identify container formats such as packaged files. Video files, for example, are often available in so-called wrapper formats, like AVI, in which the audio stream and the video stream can each be encoded with a different codec. The content must be identified down to the level of these streams if the institutional policy foresees preserving all video and audio content contained in a web archive.
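The following Python sketch illustrates what identification "down to the stream level" could look like for the AVI example. It is only an illustration under stated assumptions: the function names `sniff_container` and `list_avi_streams` are hypothetical, the signature table is deliberately tiny, and the stream scan is a naive byte search rather than a proper RIFF chunk parser.

```python
# Magic-byte signatures for a few container formats (illustrative, not exhaustive).
CONTAINER_SIGNATURES = {
    "zip": b"PK\x03\x04",
    "gzip": b"\x1f\x8b",
}


def sniff_container(data: bytes):
    """Return a container label for the payload, or None if it is not recognized."""
    # RIFF containers carry their form type (e.g. 'AVI ') at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"AVI ":
        return "avi"
    for label, magic in CONTAINER_SIGNATURES.items():
        if data.startswith(magic):
            return label
    return None


def list_avi_streams(data: bytes):
    """Naively scan for 'strh' stream headers.

    Returns a list of (stream type, handler FourCC) pairs, e.g. ('vids', 'XVID').
    """
    streams = []
    pos = data.find(b"strh")
    while pos != -1 and pos + 16 <= len(data):
        # Layout: 'strh', 4-byte chunk size, stream type, handler FourCC.
        fcc_type = data[pos + 8:pos + 12].decode("latin-1")
        fcc_handler = data[pos + 12:pos + 16].decode("latin-1")
        streams.append((fcc_type, fcc_handler))
        pos = data.find(b"strh", pos + 4)
    return streams
```

For video streams the handler FourCC usually names the codec. For audio streams the handler is often empty and the codec is recorded in the stream format chunk instead, so a real workflow would have to parse further or rely on a dedicated characterisation tool.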
Furthermore, the issue has two different aspects. The first is the challenge of identifying content that is already known; here, the main goal is to identify the content correctly. The second aspect is unknown content in the web archive, which is measured by the coverage of identification tools, where coverage indicates the share of the content that can be identified. Coverage has to be read together with reliability: poor reliability can hide poor coverage when many objects are reported as identified but are identified incorrectly and are in fact unknown. The challenge regarding this second aspect is to arrive at a precise set of unknown objects, so that a plan can be derived that deals with exactly these objects.
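A small Python sketch can make the interplay of coverage and reliability concrete. It assumes that a ground truth is available for a sample of objects, which is itself an assumption, and it uses one possible definition of reliability (the share of reported identifications that are correct). The function names and the sample data are hypothetical.

```python
def coverage(results):
    """Share of objects for which the tool reported some format."""
    if not results:
        return 0.0
    reported = [r for r in results if r[1] is not None]
    return len(reported) / len(results)


def reliability(results):
    """Share of reported identifications that match the ground truth."""
    reported = [r for r in results if r[1] is not None]
    if not reported:
        return 0.0
    correct = [r for r in reported if r[0] == r[1]]
    return len(correct) / len(reported)


def verified_coverage(results):
    """Share of all objects that were identified AND identified correctly."""
    if not results:
        return 0.0
    correct = [r for r in results if r[1] is not None and r[0] == r[1]]
    return len(correct) / len(results)


# Each entry is (actual format, reported format); None means unknown / not reported.
sample = [
    ("text/html", "text/html"),
    ("image/png", "image/png"),
    (None, "text/plain"),  # unknown object that the tool misreports as known
    (None, None),          # unknown object that the tool leaves unidentified
]

print(coverage(sample), reliability(sample), verified_coverage(sample))
```

In this sample the tool appears to cover three quarters of the objects, although only half of them are identified correctly. The misreported object is exactly the kind of case that would be missing from the set of unknown objects a preservation plan needs.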
From a practical point of view, the challenge starts with the ARC/WARC file format that ONB and SB, the main stakeholders of this issue, use in their web archives. The Heritrix web crawler (https://webarchive.jira.com/wiki/display/Heritrix/Heritrix) produces these files as the result of its web crawls. The business logic and implementation are accessible, since Heritrix is available as a collaborative code project on GitHub: https://github.com/internetarchive/heritrix3. However, this logic has been integrated into the web crawler, not into web content preservation workflows. This leads to the subordinate issue of dealing with ARC/WARC files as the basis of web content preservation workflows.
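To illustrate what dealing with these files directly involves, here is a minimal Python sketch that iterates over the records of an uncompressed WARC stream using only the standard library. It is not the Heritrix implementation; the function name `iter_warc_records` is hypothetical, and the sketch ignores the older ARC format, compressed files, and folded header lines.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, content block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected record start: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # The content block is exactly Content-Length bytes long.
        length = int(headers["Content-Length"])
        yield headers, stream.read(length)


# Usage with a small in-memory record (synthetic example data).
example = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.org/a.txt\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)
for headers, block in iter_warc_records(io.BytesIO(example)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(block))
```

For `response` records the content block still contains the HTTP response headers in front of the actual payload, so an identification step would have to strip those before inspecting the object itself.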
The last aspect of this issue is that several tools are known to address these kinds of challenges in general. Even so, it must be ensured that the tools provided by work package PC.WP.1 can actually be integrated into real-life workflows.