[Figure 11. Validation process.]

[Figure 12. Set of real-world files: 10000 PDF files, of which 8993 were parsed, 478 were encrypted and 529 had syntax errors.]

extracted randomly from an English dictionary. In total, we gathered 10000 files. This method may have some biases: correct files are more likely to be indexed by the search engine, documents are likely to be written in English because of the queries that we make, and the search engine may favor certain kinds of high-quality content. However, we observed that the resulting set comprised a great variety of features and errors at every level of the validation process. Some of the files contained encrypted objects and others had syntax errors that were not recovered by our relaxed parser. In total, we could parse 8993 files (see Fig. 12). All possible versions of PDF – from 1.0 to 1.7 – were present in the set (Fig. 13). The majority of the files contained between 50 and 300 objects (Fig. 14).

  Version   Number of files
  1.0        0.1%
  1.1        0.5%
  1.2        4.1%
  1.3       19.7%
  1.4       32.2%
  1.5       23.6%
  1.6       15.6%
  1.7        4.2%

Figure 13. PDF version.

[Figure 14. Number of objects per file (8993 files).]

Direct validation. We first tested the validation chain directly. Only 1465 files were accepted by our strict syntactic rules, of which 536 were successfully type-checked. No error was found in the graph structure for them (Fig. 15).

[Figure 15. Direct validation: 1465 files parsed by the strict parser, 536 type-checked, no error found by graph checking.]

Limitations of the strict parser. In fact, the majority of the files made use of incremental updates, which were not allowed by our restrictions. Some files even contained several successive updates (Fig. 16). Apart from that, many files used object streams and/or contained free objects (Fig. 17).
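Incremental updates can be spotted without fully parsing a file: each save appends a new body, cross-reference section and end-of-file marker. The following sketch estimates the number of updates by counting `%%EOF` markers; this is an illustrative heuristic, not CARADOC's actual parser, and the sample bytes are fabricated.

```python
# Heuristic sketch: estimate the number of incremental updates in a PDF
# by counting "%%EOF" markers. A file saved once has a single marker;
# each incremental update appends another body, xref section and marker.
# This is an illustrative approximation, not CARADOC's implementation.

def count_incremental_updates(data: bytes) -> int:
    """Return the estimated number of incremental updates (markers - 1)."""
    markers = data.count(b"%%EOF")
    return max(markers - 1, 0)

# Minimal fabricated example: an original file plus one appended update.
sample = (
    b"%PDF-1.4\n... original body ...\nstartxref\n123\n%%EOF\n"
    b"... appended update ...\nstartxref\n456\n%%EOF\n"
)
print(count_incremental_updates(sample))  # prints 1
```

A real detector would also have to handle `%%EOF` occurring inside streams, which is why a structural parser is ultimately needed.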
Finally, of the files that did not contain these forbidden structures, a significant number still did not conform to our strict syntactic rules.

  Number of updates   Number of files
  0                   36.2%
  1                   43.1%
  2                   18.3%
  3                    1.3%
  4                    0.4%
  ≥ 5                  0.8%

Figure 16. Number of incremental updates per file.

  Problematic structure              Number of files
  Incremental updates                64%
  Object streams                     35%
  Free objects                       29%
  Encryption                          5%
  At least one of these structures   76%

Figure 17. Structures not allowed by the strict parser.

6.3. Normalization

Relaxed parser. In order to handle more files, we implemented a normalization tool that rewrites files into the restricted format when possible. This tool was based on our relaxed parser. Contrary to the strict parser – which processes

the file linearly from the beginning to the end – the relaxed parser uses the xref table(s) to obtain the positions of objects and extract them. Hence, it was able to decode files containing incremental updates and object streams. It did not recognize linearized files as such, but these files could be decoded by means of standard xref tables, because linearization only adds metadata to the classic structure. Our normalization tool then removed all objects that were not accessible from the trailer and renumbered the remaining objects in a continuous range starting from zero. In total, we could rewrite 8993 files out of 10000.

Ad hoc extensions. In practice, some files did not pass the relaxed parser, either because they contained encrypted objects or because of syntax errors with respect to the standard PDF specification. For example, we found files that contained ill-formed cross-reference tables. The cross-reference table makes it possible to declare not only in-use objects but also free objects, which is useful to remove content by means of incremental updates. The invalid files that we found incorrectly declared in-use objects at offset zero instead of using the appropriate syntax for free objects. Hence, our first version of the relaxed parser could not parse them, since it did not find an object at offset zero. However, since this bug was common, we added an ad hoc option to accept this mistake and normalize the files in question. We implemented similar options for other common bugs that we noticed, but did not intend to correct every possible mistake. Thus, some files could not be normalized. The 8993 normalized files were obtained with these options enabled.

Consistency. A required property of any normalization step is to be non-destructive: a normalized file must be equivalent to the original one from a semantic point of view. For this purpose, we checked manually – i.e.
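The offset-zero tolerance described above can be sketched as follows. This parses one 20-byte entry of a classic cross-reference table and, in lenient mode, reinterprets an in-use entry declared at offset zero as a free object. Names and structure here are illustrative assumptions; CARADOC itself is written in OCaml.

```python
import re

# One classic xref entry: 10-digit offset, 5-digit generation, "n" or "f".
ENTRY_RE = re.compile(rb"^(\d{10}) (\d{5}) ([nf])")

def parse_xref_entry(line: bytes, lenient: bool = True):
    """Return (offset, generation, in_use) for a classic xref entry.

    With lenient=True, the common real-world bug of declaring an in-use
    object at offset zero is accepted and the entry is treated as free.
    """
    m = ENTRY_RE.match(line)
    if m is None:
        raise ValueError("ill-formed xref entry: %r" % line)
    offset, gen = int(m.group(1)), int(m.group(2))
    in_use = (m.group(3) == b"n")
    if in_use and offset == 0:
        if not lenient:
            raise ValueError("in-use object declared at offset zero")
        in_use = False  # ad hoc tolerance: reinterpret the entry as free
    return offset, gen, in_use

print(parse_xref_entry(b"0000000017 00000 n \n"))  # (17, 0, True)
print(parse_xref_entry(b"0000000000 00000 n \n"))  # (0, 0, False)
```

In strict mode (`lenient=False`), the second entry is rejected, which mirrors the behavior of the first version of the relaxed parser.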
on a restricted set of files – that the normalized files could effectively be opened in PDF readers and that the graphical result was the same. Since our normalization simply rewrote the objects in a cleaner manner but did not modify their content, we have good confidence that the process is effectively non-destructive.

Benefits. Moreover, we found cases where the normalized file was better than the original. For example, PDF allows the integration of forms that the user can fill in and save into a new file. This new file often makes use of an incremental update to append the new content of the forms – and invalidate the previous content. However, we noticed that forms filled in with one reader could sometimes not be loaded by other readers. Yet, after normalization by CARADOC, these files were readable by all readers. Clearly, normalization made these files more portable. This example also shows that incremental updates are not well supported by some PDF readers, and that it made sense to disallow them in our restricted format.

6.4. Type checking

The whole PDF language contains a large number of types, described in more than 700 pages of the specification. Hence, we did not intend to be feature-complete, but to integrate the most common types. This subset can easily be extended in the future.

Choices of types. To define the types to include first, we worked with simple PDF files produced by LaTeX, such as an article and a Beamer presentation. We added types incrementally until these files were fully type-checked. Then, we also used the set of real-world files to add some widespread types that were not present in our LaTeX files. In the end, our set comprises 165 types, including 108 classes.
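A type checker over such a set of types can be sketched as follows. The `any` placeholder (used for not-yet-implemented types, as described in the results further below) accepts an object without further inspection. The type table, class names and object model here are simplified assumptions for illustration, not CARADOC's actual OCaml types.

```python
# Minimal illustration of partial type checking with a placeholder "any"
# type: attributes typed "any" are accepted without further inspection,
# so objects referenced through them are not traversed.

ANY = "any"
BASIC = {"int": int, "name": str}  # toy mapping of basic PDF types

# Hypothetical type table: a class maps attributes to expected types.
CLASSES = {
    "catalog": {"Type": "name", "Pages": ANY},
    "pages":   {"Type": "name", "Kids": ANY, "Count": "int"},
}

def check(obj: dict, class_name: str, errors: list) -> None:
    """Check obj against a class; skip attributes typed 'any'."""
    for attr, expected in CLASSES[class_name].items():
        if attr not in obj:
            errors.append(f"{class_name}: missing attribute {attr}")
        elif expected == ANY:
            continue  # placeholder: accept, do not inspect or traverse
        elif expected in BASIC and not isinstance(obj[attr], BASIC[expected]):
            errors.append(f"{class_name}.{attr}: expected {expected}")

errs = []
check({"Type": "Pages", "Kids": ["3 0 R"], "Count": "three"}, "pages", errs)
print(errs)  # ['pages.Count: expected int'] -- Kids is not inspected
```

Because unknown attributes default to `any`-like behavior in such a scheme, a file can be partially validated even when many of its types are not yet implemented.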
It contains the following types:
• overall structure of the document: the catalog, the page tree, name trees and number trees;
• graphical content of pages and resources: colors, images, fonts;
• interactive content: the outline hierarchy, simple annotations, common actions (such as destinations to a URI), transitions (for slideshows);
• various metadata: viewer preferences (default view of a document), page labels (to number pages in a specific range).

We did not implement the following elements yet, but they could be integrated in future work:
• advanced interactive content: JavaScript code, multimedia;
• the logical structure of a document (which makes it possible to identify elements such as chapters, sections or figures).

Results. We present here the results of the type checker on the normalized files. Although our set of types was incomplete, it gave promising results on our data set. In the majority of the files, we could infer the types of more than 90% of the objects (Fig. 18).

[Figure 18. Proportion of objects of inferred type per file (7597 files).]

In most cases, we could only partially check the file, which means that the types of some objects were not inferred, but no errors were found in the checked objects. This partial inference was possible because we used a placeholder any type for types that we did not implement yet. Consequently, objects with this any type were not further inspected – and objects referenced by them were not traversed. Similarly, we sometimes included only the most common attributes in some types, which means that