NSFileWrapper serializedRepresentation

With the last update of Aquarii, I published online a small utility app that lets people extract data from their backup archives. The reason is that backups are generated using the NSFileWrapper class and its serializedRepresentation method[1]: this is an easy way to turn a directory and its contents into one big file, without having to use external compression libraries. But what if the user wants to get back things that are not exportable from the app, or full-size images from the items list? So I developed a small OS X app to do this job, basically a simple wrapper around NSFileWrapper[2].
But what if the user is on Windows or Linux? NSFileWrapper is an Apple class, included in Foundation, and I needed it for deserialization. The only solution left was to understand how NSFileWrapper builds a serialized archive, and how to manually extract the data I wanted from it. I had a single goal: to extract the files and subdirectories from an archive while preserving the hierarchy and names. Nothing more. It is important to note that the real NSFileWrapper implementation is private and can be changed by Apple whenever they need to (it may even have changed already, with backward compatibility).
Hence a bit of reverse engineering. I created a simple command-line tool to serialize any directory, and started running tests.

The rtfd string comes from the fact that NSFileWrapper is, among other things, built to generate rich text format directories, i.e. RTF files bundled with images, sounds and so on in a package.
We can notice that each time we have a human-readable string, the 4-byte number before it is equal to the length (in bytes) of the string. This gives us another hint: numbers are read little-endian. The strings __@PreferredName@__ and __@UTF8PreferredName@__ refer to properties of an NSFileWrapper instance. In our case, we would expect the values of those two properties to be Test1. So there seems to be a first part where properties are declared, and a second part where their values are stored, in the same order. But we don't know yet whether the same rule applies to files and subdirectories.
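The length-prefix rule above can be sketched in a few lines of Python. This is only an illustration of the observation, not NSFileWrapper's actual code: the helper names read_u32 and read_name are mine, and decoding the names as UTF-8 is an assumption.

```python
import struct

def read_u32(buf, offset):
    """Read a 4-byte little-endian unsigned integer; return (value, new_offset)."""
    (value,) = struct.unpack_from("<I", buf, offset)
    return value, offset + 4

def read_name(buf, offset):
    """Read a 4-byte length followed by that many bytes of text (assumed UTF-8)."""
    length, offset = read_u32(buf, offset)
    raw = buf[offset:offset + length]
    return raw.decode("utf-8"), offset + length

# Hand-built sample: the 19-byte property name seen in the dump, with its length prefix.
sample = struct.pack("<I", 19) + b"__@PreferredName@__"
name, end = read_name(sample, 0)
print(name, end)
```

Calling read_name repeatedly, each time from the offset it returned, walks through the declared names one after another.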

We replaced strings and their lengths as before. We notice that the rule seems to be the same for files and properties. In the second part, we see that each value is preceded by 8 bytes: the first four are always 01000000, probably denoting the beginning of a value/file. We see that for the value Test2, the next 4 bytes indicate its length. When checking this hypothesis on the text file, we read a value of 616, which is indeed its size in bytes.
When looking at the very beginning of the file, we see that by adding a file, the fourth group of four bytes has increased by one. Maybe the number of files/properties? But it seems to be off by one. And what about the weird repetition of loremipsum.txt at the end of the file? There seems to be one extra item in the serialized representation. We know that NSFileWrapper can preserve permissions and other attributes of each file, so they might be stored here. But if there is one extra value, there is one extra item in the list (first part of the file). If we read on after __@UTF8PreferredName@__, following the same logic as before (size of name followed by name), we read size:1 and the name . (a single dot). We recognize the standard name for the current directory (as in bash). This explains the overcount and the content at the end of the file. As all I wanted to do was extract the files' data, we won't explore this part further.
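Here is a minimal sketch of reading one value under that hypothesis: a 4-byte marker equal to 1, then a 4-byte little-endian length, then the content. The function name read_value is hypothetical, and treating any other marker as an error is my choice for the sketch.

```python
import struct

def read_value(buf, offset):
    """Read one value: marker (expected 1), length, then `length` bytes of content."""
    marker, length = struct.unpack_from("<II", buf, offset)
    if marker != 1:
        raise ValueError("unexpected marker %d at offset %d" % (marker, offset))
    start = offset + 8
    return buf[start:start + length], start + length

# Hand-built sample: 01 00 00 00, then length 5, then the bytes of "Test2".
sample = bytes.fromhex("0100000005000000") + b"Test2"
value, end = read_value(sample, 0)
print(value, end)
```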
We now have:
rtfd 00 00 00 00 03 00 00 00 | items:4 | size:14 loremipsum.txt | size:19 __@PreferredName@__ | size:23 __@UTF8PreferredName@__ | size:1 . | 70 02 00 00 0D 00 00 00 0D 00 00 00 32 00 00 00 | Begin:file size:616 Donec ... elit. | Begin:file size:5 Test2 | Begin:file size:5 Test2 | Begin:file size:42 directory-listing
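Putting the pieces together, the layout above can be walked with a short Python sketch. Read as little-endian numbers, the 16 bytes between the names and the values give 624, 13, 13 and 50, which is each content size plus its 8-byte header, so the sketch treats them as a table of header+content sizes. The names parse_flat_archive and build_sample are mine, the sample data is synthetic, and the two fields after rtfd (0 and 3 here) are skipped because their meaning is unknown. It only handles a flat archive with plain values.

```python
import struct

def parse_flat_archive(buf):
    """Return {name: content} for a flat archive laid out as described above."""
    if buf[:4] != b"rtfd":
        raise ValueError("missing rtfd magic")
    # Bytes 4..11 hold two fields of unknown meaning; the item count follows.
    (count,) = struct.unpack_from("<I", buf, 12)
    offset = 16
    names = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        names.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    # One 4-byte size per item: 8 header bytes plus the content.
    sizes = struct.unpack_from("<%dI" % count, buf, offset)
    offset += 4 * count
    items = {}
    for name, size in zip(names, sizes):
        marker, length = struct.unpack_from("<II", buf, offset)
        if marker != 1 or length + 8 != size:
            raise ValueError("item %r is not a plain value" % name)
        items[name] = buf[offset + 8:offset + 8 + length]
        offset += size
    return items

def build_sample():
    """Build a synthetic archive mimicking the Test2 layout (placeholder contents)."""
    entries = [("loremipsum.txt", b"x" * 616),
               ("__@PreferredName@__", b"Test2"),
               ("__@UTF8PreferredName@__", b"Test2"),
               (".", b"d" * 42)]
    out = b"rtfd" + struct.pack("<III", 0, 3, len(entries))
    for name, _ in entries:
        encoded = name.encode("utf-8")
        out += struct.pack("<I", len(encoded)) + encoded
    for _, data in entries:
        out += struct.pack("<I", len(data) + 8)
    for _, data in entries:
        out += struct.pack("<II", 1, len(data)) + data
    return out

items = parse_flat_archive(build_sample())
for name, data in items.items():
    print(name, len(data))
```

To extract the actual files, one would skip the two __@...@__ properties and the . entry and write the remaining contents to disk under their names.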

Except that the header+content size of the PNG seems huge.
Then follows the content of the loremipsum.txt file and the two preferred-name strings, with the same 8-byte header each time, just as before. And then, where we would expect to get the PNG part, we find: 01 00 00 00 00 00 00 80 82 88 01 00 E7 0C 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00... and so on, with 3303 zero (00) bytes before the real content of the file begins.

NSFileWrapper seems to work in blocks for binary/big files. If we look at the beginning of this sequence, we have: the usual 1 value, then a value indicating that this file has some padding (00 00 00 80)[3], and two other values: 100482 and 3303. The first one is equal to the real size of the file, and the other one is equal to the number of 00 bytes used for the padding. We can also notice that 100482+3303+4*4 = 103801[4], the size given in the first part of the serialized archive for the PNG file.
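The padded case can be sketched the same way. This assumes exactly what was observed above and nothing more: when the second field reads 0x80000000 (the bytes 00 00 00 80 in little-endian), two more 4-byte numbers follow, the real size and the padding count, and the content starts after the padding. The names read_item and PADDED_FLAG are mine, and the sample content is a placeholder rather than a real PNG.

```python
import struct

PADDED_FLAG = 0x80000000  # bytes 00 00 00 80 read as a little-endian number

def read_item(chunk):
    """Return the content of one item, plain or padded (layout assumed from the dump)."""
    marker, second = struct.unpack_from("<II", chunk, 0)
    if marker != 1:
        raise ValueError("unexpected marker %d" % marker)
    if second != PADDED_FLAG:
        # Plain value: the second field is the content length.
        return chunk[8:8 + second]
    real_size, padding = struct.unpack_from("<II", chunk, 8)
    start = 16 + padding  # 4 header fields of 4 bytes, then the zero padding
    return chunk[start:start + real_size]

# Synthetic chunk using the numbers from the dump: size 100482, padding 3303.
content = b"P" * 100482
header = bytes.fromhex("01000000 00000080 82880100 e70c0000")
chunk = header + bytes(3303) + content
print(len(chunk))  # 16 + 3303 + 100482
print(read_item(chunk) == content)
```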

Now we have a pretty good understanding of how files are stored in an NSFileWrapper serialized representation. We have enough information to read the archive through a buffer, detect the files, their names, their sizes and whether they use padding, and then extract their data and write it to disk.

But what if we have subdirectories?

With a subdirectory

We now create a Test4 directory. Inside it, we put our previous Test3 directory, renamed SubTest3, and we add another text-based file at the root, 1984.txt.

We notice that this doesn't begin with 01000000, but with 03000000. But the following 4-byte number can't be the length to read, as its value is also 3. So NSFileWrapper uses another rule when dealing with subdirectories. A reasonable idea is that it performs serialization recursively: it first serializes the subdirectory before including it in the main archive.