Adventures in Enterprise Computing

Adobe XMP Packet Extraction for the Aperture Framework

When it comes to manipulating photographs, I live in Photoshop. One feature of all Adobe products that I like is the ability to annotate images and other documents using their eXtensible Metadata Platform, or XMP. XMP is a collection of RDF statements embedded in a document that describe many facets of that document. I’ve always wanted to be able to get that data out of these files and do something with it for application purposes.

There are projects like Jempbox, which works on manipulating XMP data but offers no facilities to extract the XMP packet from image files. The Apache XML Graphics Commons is closer to what I was looking for. The library includes an XMP parser that works by scanning a file for the XMP header. The approach works quite well and supports pretty much every format supported by the XMP specification. The downside of XML Graphics Commons is that it doesn’t properly read all of the RDF statements. Some of the data is skipped or missed completely. To top it off, neither framework allows you to get at the raw RDF data.
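To make the scanning idea concrete, here is a minimal sketch of how a packet scan could look in plain Java. This is not the XML Graphics Commons code or my extractor; the class and method names (`XmpPacketScanner`, `extractPacket`) are made up for illustration. It assumes the packet is stored uncompressed and in UTF-8, wrapped in the usual `<?xpacket begin=...?>` and `<?xpacket end=...?>` processing instructions, which will not hold for every file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper: finds an XMP packet by scanning raw bytes for the packet wrapper.
class XmpPacketScanner {
    private static final byte[] HEADER = "<?xpacket begin=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PI_CLOSE = "?>".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] TRAILER = "<?xpacket end=".getBytes(StandardCharsets.US_ASCII);

    // Naive byte search; returns the index of the first match at or after 'from', or -1.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Returns the text between the header and trailer instructions, or empty if no packet is found.
    // Assumption: the packet is UTF-8 and not compressed or split across segments.
    static Optional<String> extractPacket(byte[] data) {
        int header = indexOf(data, HEADER, 0);
        if (header < 0) {
            return Optional.empty();
        }
        int headerEnd = indexOf(data, PI_CLOSE, header + HEADER.length);
        if (headerEnd < 0) {
            return Optional.empty();
        }
        int bodyStart = headerEnd + PI_CLOSE.length;
        int trailer = indexOf(data, TRAILER, bodyStart);
        if (trailer < 0) {
            return Optional.empty();
        }
        String body = new String(data, bodyStart, trailer - bodyStart, StandardCharsets.UTF_8);
        return Optional.of(body.trim());
    }
}
```

Reading a file with `java.nio.file.Files.readAllBytes` and passing the result to `extractPacket` would give you the raw RDF/XML as a string, ready to hand to an RDF parser. Loading the whole file into memory is only reasonable for small files; a real extractor would scan a stream instead.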

What I really wanted to do was get the XMP packet in its entirety and load it into a triple store like Sesame or Virtuoso. This of course means that the data has to be available as RDF. Rather than inventing my own framework to do all of this, I found the Aperture Framework. Aperture is a simply amazing framework that can extract RDF statements from just about anything. Of course, the one thing that is missing is XMP support. So, I set out to implement my own Extractor that can suck out the entire XMP packet as RDF. It’s based on the work started in the XML Graphics Commons project, but modified significantly so that it pulls out the RDF data. Once extracted, it’s very easy to store the statements in a triple store and execute SPARQL queries on them.

Right now, this XMPExtractor can read XMP from the following formats:

JPEG Images (image/jpeg)

TIFF Images (image/tiff)

Adobe DNG (image/x-adobe-dng)

Portable Network Graphic (image/png)

PDF (application/pdf)

EPS, PostScript, and Adobe Illustrator files (application/postscript)

QuickTime (video/quicktime)

AVI (video/x-msvideo)

MPEG-4 (video/mp4)

MPEG-2 (video/mpeg)

MP3 (audio/mpeg)

WAV Audio (audio/x-wav)
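If you want to check up front whether a file is worth handing to the extractor, the list above can be captured as a simple lookup. This is only a sketch: the class name `XmpMimeTypes` and the method `isSupported` are hypothetical and not part of Aperture or my extractor. It needs Java 9 or later for `Set.of`.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical lookup of the MIME types listed above.
class XmpMimeTypes {
    static final Set<String> SUPPORTED = Set.of(
            "image/jpeg",
            "image/tiff",
            "image/x-adobe-dng",
            "image/png",
            "application/pdf",
            "application/postscript",
            "video/quicktime",
            "video/x-msvideo",
            "video/mp4",
            "video/mpeg",
            "audio/mpeg",
            "audio/x-wav");

    // Ignores case and any parameters such as "; charset=binary".
    static boolean isSupported(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return SUPPORTED.contains(base.trim().toLowerCase(Locale.ROOT));
    }
}
```

For example, `XmpMimeTypes.isSupported("image/jpeg")` is true, while `XmpMimeTypes.isSupported("text/plain")` is false.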

On the downside, I’ve found that if you use the XMPExtractor with a Crawler, you’ll run into some problems with Adobe Illustrator files. The problem is that the PdfExtractor mistakes these files for PDFs and then fails. But as long as you’re not using Illustrator files, you should be OK. There are also a few nitpicks with JPEG files and the JpgExtractor, in that the sample files included in the XMP SDK are flagged as invalid JPEG files. However, every JPEG file I created from Photoshop and iPhoto seems to work fine. After a little more testing, I’ll look at offering it up as a contribution to the project.

Please refer to http://www.adobe.com/devnet/xmp for the newest XMP specification and the related freely available C++ XMP SDK that implements support for most of the file formats listed above (and more).

The XMPExtractor was developed against the latest Adobe spec, and the unit tests run against the sample data from the latest SDK. Yes, the SDK supports more file formats than this XMPExtractor, but this is another implementation, and it is written in Java.

Yeah, Illustrator files look kinda like PDF, but they don’t seem to be 100% PDF. This is why the PdfExtractor fails when running a Crawler. However, the XMPExtractor works just dandy on Illustrator files. One other thing to note: if a file does not contain an XMP packet, it is just ignored. IMHO, it’s understood that files lacking an XMP packet are more common than not. So if it’s not there, you’ll just get an empty RDFContainer.

But I still have issues with some files where Adobe Bridge sees the header yet the extractor cannot. So, still some work to do.

This is pretty darned useful. I know a buddy of mine who would be interested in hacking away at this too.

One thought on the Illustrator file being mishandled as PDF: Illustrator saves its files as PDF, I believe, so there is probably no readily handy way to tell the difference, certainly not from XMP alone (see http://bit.ly/2kIQOi). I do think there is a way to discern the difference from other header information, but that would reduce the purity of the code here.

There are also a variety of bugs, if you dig around on Adobe’s site, related to malformed or missing XMP data in various files, usually when you try to export as xyz instead of merely saving as xyz. It seems that file type conversion mangles XMP and EXIF within files, so it’s smartest to perform all conversions in Aperture or Lightroom.
