The file format through PDF 1.5 is well-supported, with the exception of the "linearized" or "optimized" output format, which this module can read but not write. Many specific aspects of the document model are not manipulable with this package (like fonts), but if the input document is correctly written, then this module will preserve the model integrity.

The writer produces PDF 1.4-compatible output, which means it cannot write compressed object streams. As a consequence, reading and then writing a PDF 1.5+ document may noticeably enlarge the resulting file.

This library grants you some power over the PDF security model. Note that applications editing PDF documents via this library MUST respect the security preferences of the document. Ignoring those preferences is contrary to Adobe's intellectual property position, as stated in the reference manual at the above URL.

Technical detail regarding corrupt PDFs: This library adheres strictly to the PDF specification. Adobe's Acrobat Reader is more lenient and can display some corrupted PDFs. It is therefore possible that some PDFs readable by Acrobat are unreadable by this library. In particular, files whose line endings have been converted to or from DOS/Windows style (i.e. CR-LF) may be rendered unusable even though Acrobat does not complain. Future library versions may relax the parser, but not yet.
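One reason a line-ending conversion is so destructive: a PDF's cross-reference table records byte offsets, so inserting a CR before every LF shifts everything after the first line (and also alters binary stream data). The pure-Perl sketch below illustrates the offset shift on a toy string. The sample text and variable names are illustrative only and are not taken from this library.

```perl
use strict;
use warnings;

# A toy stand-in for the start of a PDF body: three LF-terminated lines.
my $original = "%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\n";

# Byte offset of the object header, as a cross-reference table would record it.
my $offset_before = index $original, '1 0 obj';

# Simulate a text-mode transfer that rewrites every LF as CR-LF.
(my $converted = $original) =~ s/\n/\r\n/g;

# The same object now starts at a different byte offset, so a stored
# offset computed from the original file no longer points at it.
my $offset_after = index $converted, '1 0 obj';

printf "offset before: %d, after: %d\n", $offset_before, $offset_after;
```

Every additional line break before an object adds one more byte of drift, which is why a strict parser that trusts the recorded offsets fails on such files.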

Note: 'clean' as in cleansave() and cleanoutput() means write a fresh PDF document. The alternative (e.g. save()) reuses the existing document and just appends to it. Also note that the 'clean' functions sort the objects numerically. If you prefer that the new PDF documents more closely resemble the old ones, call preserveOrder() before cleansave() or cleanoutput().
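A minimal sketch of a fresh write that keeps the original object order. It assumes $pdf is an already-opened CAM::PDF object and that cleanoutput() accepts a destination filename; the helper name write_fresh_copy and the file names are hypothetical.

```perl
use strict;
use warnings;

# Write a fresh copy of the document instead of appending to the old one.
# $pdf is assumed to be an already-opened CAM::PDF object; $out is a path.
sub write_fresh_copy {
    my ($pdf, $out) = @_;
    $pdf->preserveOrder();      # keep objects in their original order
    $pdf->cleanoutput($out);    # rebuild the document, then write it out
    return;
}

# Typical use (needs CAM::PDF and a real file; 'in.pdf' is a placeholder):
#   use CAM::PDF;
#   my $pdf = CAM::PDF->new('in.pdf') or die "$CAM::PDF::errstr\n";
#   write_fresh_copy($pdf, 'out.pdf');
```

Skipping the preserveOrder() call gives the default behavior described above, where objects are sorted numerically.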

Instantiate a new CAM::PDF object. $content can be a document in a string, a filename, or '-'. The last of these indicates that the document should be read from standard input. If the document is password protected, the passwords should be passed as additional arguments. If they are not known, a boolean $prompt argument lets the programmer ask the constructor to prompt the user for a password. This is rudimentary prompting: passwords are shown in the clear on the console.

This constructor takes an optional final argument which is a hash reference. This hash can contain any of the following optional parameters:

Dereference a data object and return its value. Given a node object of any kind, this returns the raw scalar: hashref, arrayref, string, or number. This function follows all references and descends into all objects.

Each PDF page contains a list of resources that it uses (images, fonts, etc). getPropertyNames() returns an array of the names of those resources. getProperty() returns a node representing a named property (most likely a reference node).

where the $fontlabel is something like '/Helv'. The getFontMetrics() method is useful in the cases where you've forgotten which page number you are working on (e.g. in CAM::PDF::GS), or if your property list isn't part of any page (e.g. working with form field annotation objects).

If a font metrics hash is supplied (it is required for a font other than the 14 core fonts), then it is cloned and inserted into the new font structure. Note that if those font metrics contain references (e.g. to the FontDescriptor), the referred objects are not copied -- you must do that part yourself.

Removes embedded font data, leaving font reference intact. Returns true if the font exists and 1) font is not embedded or 2) embedded data was successfully discarded. Returns false if the font does not exist, or the embedded data could not be discarded.

The optional $basefont parameter allows you to change the font. This is useful when some applications embed a standard font (see below) and give it a funny name, like SYLXNP+Helvetica. In this example, it's important to change the basename back to the standard Helvetica when de-embedding.

De-embedding the font does NOT remove it from the PDF document, it just removes references to it. To get a size reduction by throwing away unused font data, you should use the following code sometime after this method.
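The code sample that originally followed this paragraph is missing here. A minimal sketch of the step it describes, assuming that cleanse() is the method that removes unused objects and that deEmbedFont() takes a page number, a font name and an optional base font. The helper name deembed_and_shrink is hypothetical.

```perl
use strict;
use warnings;

# De-embed one font, then discard whatever is no longer referenced.
# $pdf is assumed to be an already-opened CAM::PDF object.
sub deembed_and_shrink {
    my ($pdf, $pagenum, $fontname, $basefont) = @_;

    # Only pass the optional base font when the caller supplied one.
    my @args = ($pagenum, $fontname);
    push @args, $basefont if defined $basefont;

    # Stop early if the font is missing or its data cannot be discarded.
    $pdf->deEmbedFont(@args) or return 0;

    # De-embedding only drops the references; this removes the orphaned data.
    $pdf->cleanse();
    return 1;
}
```

Because removing unused objects can break some documents, it is worth checking the output in a viewer before replacing the original file.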

Returns an array of x, y, width and height numbers that define the dimensions of the specified page in points (1/72 inch). Technically, these are the MediaBox dimensions, which explains why x and y can be non-zero, although that is a rare case.

For example, given a simple 8.5 by 11 inch page, this method will return (0,0,612,792).
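The conversion behind that example can be sketched in a few lines. The helper name page_size_in_inches is hypothetical; it simply divides the width and height by 72 and ignores the x and y origin.

```perl
use strict;
use warnings;

use constant POINTS_PER_INCH => 72;

# Convert an (x, y, width, height) list in points to (width, height) in inches.
sub page_size_in_inches {
    my ($x, $y, $width, $height) = @_;
    return ($width / POINTS_PER_INCH, $height / POINTS_PER_INCH);
}

# The US Letter values quoted above.
my ($w_in, $h_in) = page_size_in_inches(0, 0, 612, 792);
print "$w_in x $h_in inches\n";
```

With a real document, the four-element list returned for a page can be passed straight to this helper.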

Return an array of the names of all of the PDF form fields. The names are the full hierarchical names constructed as explained in the PDF reference manual. These names are useful for the fillFormFields() function.

Return a hash reference representing the accumulated property list for a form field, including all of its inherited properties. This should be treated as a read-only hash! It ONLY retrieves the properties it knows about.
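A small read-only sketch that combines the field list with the accumulated property hash. It assumes $pdf is an already-opened CAM::PDF object, that getFormField($name) returns the field object for a name, and that getFormFieldDict() accepts that object. The helper name describe_fields is hypothetical.

```perl
use strict;
use warnings;

# Map each form field name to the sorted list of property keys it carries.
# Nothing in the returned property hashes is modified.
sub describe_fields {
    my ($pdf) = @_;
    my %summary;
    for my $name ($pdf->getFormFieldList()) {
        my $field = $pdf->getFormField($name) or next;
        my $dict  = $pdf->getFormFieldDict($field) or next;
        $summary{$name} = [ sort keys %{$dict} ];
    }
    return \%summary;
}
```

The result is a plain hashref, so it can be printed or compared without touching the document.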

Alter the document's security information. Note that modifying these parameters must be done respecting the intellectual property of the original document. See Adobe's statement in the introduction of the reference manual.

Important Note: Most PDF readers (Acrobat, Preview.app) only offer one password field for opening documents. So, if the $ownerpass and $userpass are different, those applications cannot read the documents. (Perhaps this is a bug in CAM::PDF?)

Note: any omitted booleans default to false. So, these two are equivalent:
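The two calls referred to above did not survive in this copy. A sketch of what such a pair looks like, assuming setPrefs() takes the owner password, the user password and then four permission booleans (taken here to be print, modify, copy and add). The wrapper names are hypothetical, and $pdf is assumed to be an already-opened CAM::PDF object.

```perl
use strict;
use warnings;

# Short form: the omitted permission booleans default to false.
sub restrict_everything {
    my ($pdf, $password) = @_;
    $pdf->setPrefs($password, $password);
    return;
}

# Long form: the same request with each boolean spelled out as 0.
sub restrict_everything_explicit {
    my ($pdf, $password) = @_;
    $pdf->setPrefs($password, $password, 0, 0, 0, 0);
    return;
}
```

Both wrappers pass the same value for the owner and user passwords, in line with the note above about readers that offer only one password field.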

Search the content of the specified page (or all pages if the page number is omitted) for embedded images. If there are any, replace them with indirect objects. This procedure uses heuristics to detect in-line images, and is subject to confusion in extremely rare cases of text that uses BI and ID a lot.

Remove unused objects. WARNING: this function breaks some PDF documents because it removes objects that are not strictly part of the page model hierarchy, but which are required anyway (like some font definition objects).

Set the default values of PDF form fields. The name should be the full hierarchical name of the field as output by the getFormFieldList() function. The argument list can be a hash if you like. A simple way to use this function is something like this:
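The example that belonged here is missing from this copy. One possible sketch, assuming $pdf is an already-opened CAM::PDF object: it fills only the names that the document actually reports as form fields. The helper name fill_known_fields and the field names in the usage comment are hypothetical.

```perl
use strict;
use warnings;

# Fill the fields that exist in the document and report which ones were set.
sub fill_known_fields {
    my ($pdf, %values) = @_;

    # Index the full hierarchical names the document knows about.
    my %known = map { $_ => 1 } $pdf->getFormFieldList();

    # Keep only the supplied values whose names match an existing field.
    my %fill = map { $_ => $values{$_} } grep { $known{$_} } keys %values;

    $pdf->fillFormFields(%fill) if %fill;
    return sort keys %fill;
}

# Typical use (field names are placeholders):
#   my %fields = (fname => 'John', lname => 'Smith', state => 'WI');
#   $fields{zip} = 53703;
#   my @filled = fill_known_fields($pdf, %fields);
```

Passing the hash directly to fillFormFields() works just as well when every name is known to be valid; the filtering step only guards against typos in field names.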

If the form field is set to auto-size the text to fit, then you may use these options to constrain the limits of that autoscaling. Otherwise, for example, a very long string will become arbitrarily small to fit in the box.

Disable any triggers set on data entry for the specified form field names. This is useful when, for example, the data-entry JavaScript forbids punctuation and you want to prefill with a hyphenated word. If you don't clear the trigger, the prefill may not happen.

If this PDF was previously saved in append mode (that is, if clean() was not invoked on it), return a new instance representing that previous version. Otherwise return void. If this is an encrypted PDF, this method assumes that previous revisions were encrypted with the same password, which may be an incorrect assumption.
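A sketch of walking back through appended revisions. It assumes the method described here is named previousRevision() and that it returns a false value once there is no earlier version. The helper name count_revisions is hypothetical.

```perl
use strict;
use warnings;

# Count the revisions of a document, including the current one.
# $pdf is assumed to be an already-opened CAM::PDF object.
sub count_revisions {
    my ($pdf) = @_;
    my $count = 1;    # the current revision
    while (my $older = $pdf->previousRevision()) {
        $count++;
        $pdf = $older;    # step back one revision and look again
    }
    return $count;
}
```

For encrypted documents, the caveat above applies to every step of the walk: each earlier revision is opened on the assumption that the password has not changed.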

Cache all parts of the document and throw away its old structure. This is useful for writing PDFs anew, instead of simply appending changes to the existing documents. This is called by cleansave() and cleanoutput().

Returns a boolean indicating whether the save() method needs to be called. Like save(), this has nothing to do with whether the document has been saved to disk, but whether the in-memory representation of the document has been serialized.

In many cases, it's useful to apply one action to every node in an object tree. The routines below all use this traverse() function. One of the most important parameters is the first: the $dereference boolean. If true, the traversal follows reference Nodes. If false, it does not descend into reference Nodes.

Optionally, you can pass in a hashref as a final argument to reduce redundant traversing across multiple calls. Just pass in an empty hashref the first time and pass in the same hashref each time. See changeRefKeys() for an example.

Remove any filters from an object. The boolean flag $save (defaults to false) indicates whether this removal should be permanent or just this once. If true, the function returns success or failure. If false, the function returns the defiltered content.

Alter all instances of a given string. The hashref is a dictionary of from-string and to-string. If the from-string looks like regex(...) then it is interpreted as a Perl regular expression and is eval'ed. Otherwise the search-and-replace is literal.
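The literal-versus-pattern rule can be illustrated outside the library. The sketch below is not this module's implementation: it avoids eval, so capture references such as $1 in the to-string are not expanded, and the helper name apply_replacements is hypothetical.

```perl
use strict;
use warnings;

# Apply a from => to dictionary to a string. Keys of the form regex(...)
# are treated as patterns; every other key is matched literally.
sub apply_replacements {
    my ($text, $dict) = @_;
    for my $from (sort keys %{$dict}) {
        my $to = $dict->{$from};
        if ($from =~ m/\A regex\( (.*) \) \z/xms) {
            my $pattern = qr/$1/;             # compile the inner expression
            $text =~ s/$pattern/$to/g;
        }
        else {
            $text =~ s/\Q$from\E/$to/g;       # quote regex metacharacters
        }
    }
    return $text;
}

print apply_replacements(
    'Invoice 2023',
    { 'Invoice' => 'Receipt', 'regex(\d{4})' => 'YEAR' },
), "\n";
```

Keys are applied in sorted order here so the result is deterministic; a plain hash gives no ordering guarantee, which matters when one replacement can affect the input of another.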

This library was primarily developed against the 3rd edition of the reference (PDF v1.4) with several important updates from 4th edition (PDF v1.5). This library focuses most deeply on PDF v1.2 features. Nonetheless, it should be forward and backward compatible in the majority of cases.

This module is written with good speed and flexibility in mind, often at the expense of memory consumption. Entire PDF documents are typically slurped into RAM. As an example, simply calling new('PDFReference15_v15.pdf') (the 13.5 MB Adobe PDF Reference V1.5 document) pushes Perl to consume 89 MB of RAM on my development machine.

All of these except "stream" are directly related to the PDF data types of the same name. Streams are treated as special cases in this library since they have a non-general syntax and placement in the document body. Internally, streams are very much like strings, except that they have filters applied to them.

All objects are referenced indirectly by their numbers, as defined in the PDF document. In all cases, the dereference() function should be used to deserialize objects into their internal representation. This function is also useful for looking up named objects in the page model metadata. Every node in the hierarchy contains its object and generation number. You can think of this as a sort of a pointer back to the root of each node tree. This serves in place of a "parent" link for every node, which would be harder to maintain.

The PDF document itself is represented internally as a hash reference with many components, including the document content, the document metadata (index, trailer and root node), the object cache, and several other caches, in addition to a few assorted bookkeeping structures.

The core of the document is represented in the object cache, which is only populated as needed, thus avoiding the overhead of parsing the whole document at read time.