Monday, August 18, 2008

recovering removed file on zfs disk

I have used my modified mdb and zdb (see http://www.osdevcon.org/2008/files/osdevcon2008-proceedings.pdf and http://www.osdevcon.org/2008/files/osdevcon2008-max.pdf) to recover a file that was removed from a zfs file system. The technique is to locate the uberblock_t that was active after the file was created, but before the file was removed, and follow the data structures from that uberblock_t. This technique would probably not work on a nearly full file system, and probably not on a very busy file system, but it works here. Also, this will not work with RAID-Z, but works fine for mirrors. (I shall get around to figuring out RAID-Z, but not now...)

It is possible to follow all of the steps and still not have the right data because you chose the wrong uberblock_t, or because one of the blocks containing metadata (or the data itself) has been re-used.

The modified mdb and zdb have been updated to work with Nevada, build 94. It took about 15 minutes to merge the versions I was using for build 79 into build 94. For the source of the changes and the zfs dmod, send mail to me at max@bruningsystems.com.

It might be possible, with a bit more clever use of mdb and some shell scripting, to automate this... Also, it might be useful to add an option to zdb so that a transaction id other than the active one can be used for its traversals. Then you might be able to do everything using zdb.

The following describes the steps taken.

First, I copy a file with known contents to the zfs file system.

# cp /usr/dict/words /zfs_fs/words
#

We'll get the object id (inumber) for /zfs_fs. We'll use it later.

# ls -aid /zfs_fs
3 /zfs_fs
#

Next, I'll try to make sure everything is on the disk.

# sync
#

Now, I'll use zdb to get the root blkptr from the uberblock. This will also give me a transaction ID. Generally, you would not use zdb to get the uberblock_t every time that you add/remove a file to a zfs file system. That is ok. I have written a dcmd (output shown below) that walks the uberblock_t array on disk. Then you can, by trial and error, locate the uberblock_t you need (assuming it still exists in the array, and assuming the metadata it points to has not been re-used for another purpose).

Ok. So nothing changed when the file system was unmounted. Now, we'll use the modified mdb to examine the uberblock_t array on disk. The uberblock_t we want has transaction id 1282 decimal.

# ./mdb /export/home/max/zfsfile

First, convert decimal 1282 to hex.

> 0t1282=X
502

Now, load kernel CTF and a few dcmds that work with zfs on disk.

> ::loadctf
> ::load /export/home/max/source/mdb/i386/rawzfs.so

Walk the uberblock_t array on disk. This shows all possible 1024 uberblocks. Here, we'll only show the entry with ub_txg = 0x502. Again, if I had not retrieved the value of the active uberblock_t after the file was created, and before the file was removed, I could dump all uberblock_t using the following command, and then search backwards, trying all transaction ids that are less than the current one (i.e., current after the file was removed and the file system unmounted).
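The search through the uberblock_t array can also be sketched outside of mdb. The following Python is only an illustration, not the dcmd used here: it assumes 1024-byte slots and a header of five 64-bit fields (ub_magic, ub_version, ub_txg, ub_guid_sum, ub_timestamp) with magic 0x00bab10c, as in the published ZFS on-disk format, and the function names are hypothetical. The raw array would have to be read from the disk image first.

```python
import struct

UB_MAGIC = 0x00BAB10C   # uberblock magic number
UB_SLOT = 1024          # assumed size of one slot in the on-disk array
UB_HEADER = "5Q"        # ub_magic, ub_version, ub_txg, ub_guid_sum, ub_timestamp


def scan_uberblocks(ring):
    """Return (slot, txg, timestamp) for every slot holding a valid magic."""
    found = []
    for slot in range(len(ring) // UB_SLOT):
        raw = ring[slot * UB_SLOT: slot * UB_SLOT + 40]
        for order in ("<", ">"):   # try little-endian, then big-endian
            magic, _ver, txg, _guid, ts = struct.unpack(order + UB_HEADER, raw)
            if magic == UB_MAGIC:
                found.append((slot, txg, ts))
                break
    return found


def candidates(ring, current_txg):
    """Uberblocks older than current_txg, newest first (trial-and-error order)."""
    older = [u for u in scan_uberblocks(ring) if u[1] < current_txg]
    return sorted(older, key=lambda u: u[1], reverse=True)
```

Each candidate still has to be checked by hand, since the metadata it points to may already have been re-used.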

Using zdb with the offset specified for the first ditto block in the above blkptr output, we get the mos dnode array. Note that the "LEVEL: 0" blkptr output means there are no levels of indirection. On larger zfs file systems, you may need to go through block(s) of indirect blkptr_t's. An example of this is shown a bit later.

Now, we'll look at the metadnode for the DMU_OT_OBJECT_DIRECTORY. This will tell us about objects in the zfs file system. For every zfs file system that I have tried this on, this is dnode number 1 (starting from 0). Regardless, the field to check is "dn_type = 0x1". It is possible (I assume) for this to be at a different index into the metadnode array and, possibly, not in the 0x4000 bytes read and decompressed from 0x11000. In that case, the LEVEL field would not have been 0, and you would have to look at indirect blkptr_t's. But not here...

Back to mdb to look at the object directory. Object directories are "zap" objects. Zap objects contain name/value pairs. The first 64 bits identify the type of the zap (micro zap or fat zap). A "fat zap" is a zap object that uses indirection. Micro zaps contain name/value pairs directly (i.e., no indirection). I have not seen a fat zap, but the largest zfs file system I have used is only ~140GB, and I have not examined large directories. (Directory entries are stored in zap objects.)
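As a rough illustration of the type check, here is a small Python sketch that classifies a zap block by its first 64-bit word. The constant values are the ZBT_* block types as I understand them from the zfs headers, the block is assumed to be little-endian (x86), and zap_kind is a hypothetical helper name.

```python
import struct

ZBT_LEAF = (1 << 63) + 0     # fat zap leaf block
ZBT_HEADER = (1 << 63) + 1   # fat zap header block
ZBT_MICRO = (1 << 63) + 3    # micro zap block


def zap_kind(block):
    """Classify a zap block by its first 64-bit word (little-endian assumed)."""
    (block_type,) = struct.unpack_from("<Q", block, 0)
    kinds = {ZBT_MICRO: "micro", ZBT_HEADER: "fat header", ZBT_LEAF: "fat leaf"}
    return kinds.get(block_type, "unknown")
```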

Now, we go back to the mos metadnode array in /tmp/metadnode, and examine object id 2 (the third entry in the array). Each entry is 0x200 bytes, so we want the dnode_phys_t starting at 0x400 (2 * 0x200) bytes into the file.
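The offset arithmetic is simple enough to check in a few lines of Python. This is a sketch only: the 0x200 entry size comes from the text above, dn_type is taken to be the first byte of each dnode_phys_t, and both helper names are made up for illustration.

```python
DNODE_SIZE = 0x200   # size of one dnode_phys_t


def dnode_offset(object_id):
    """Byte offset of an object's dnode_phys_t within the dnode array."""
    return object_id * DNODE_SIZE


def dnode_types(block):
    """Return dn_type (assumed to be the first byte) of each dnode in a block."""
    count = len(block) // DNODE_SIZE
    return [block[dnode_offset(i)] for i in range(count)]
```

For example, dnode_offset(2) gives 0x400, and the entry in dnode_types() whose value is 1 would be the object directory.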

The blkptr_t for the dsl_dataset_phys_t is for another DMU objset. (The first DMU objset was from the uberblock_t rootbp and describes the set of objects. The objset described by the dsl_dataset_phys_t describes the set of objects in the file system, i.e., files and directories (and...?).) Back to zdb to get this data.

Note the "LEVEL: 6" in the above output. There are 6 levels of indirection to get to another array of dnode_phys_t. We will follow the levels, always using the first indirect blkptr_t at each level, since the file was in a directory whose object id is 3 (from "ls -aid /zfs_fs" back at the beginning). If I want the dnode_phys_t for a different object id, I can use the technique explained in the paper and slides referenced at the beginning.
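To see why object id 3 always lands on the first blkptr_t, here is a hedged Python sketch of the index computation. It is my own illustration, not necessarily the method from the paper. It assumes 0x4000-byte level 0 blocks holding 0x200-byte dnode_phys_t entries, 0x4000-byte indirect blocks (an assumption), and 0x80-byte blkptr_t entries; indirect_path is a hypothetical name.

```python
DNODE_SIZE = 0x200
BLOCK_SIZE = 0x4000
BLKPTR_SIZE = 0x80                                # size of one blkptr_t
DNODES_PER_BLOCK = BLOCK_SIZE // DNODE_SIZE       # 0x20 dnodes per level 0 block
PTRS_PER_INDIRECT = BLOCK_SIZE // BLKPTR_SIZE     # assumes 0x4000-byte indirect blocks


def indirect_path(object_id, top_level):
    """Return ([(level, blkptr index), ...], slot within the level 0 block)."""
    blkid = object_id // DNODES_PER_BLOCK         # which level 0 block holds the dnode
    path = []
    for level in range(top_level, 0, -1):
        index = (blkid // PTRS_PER_INDIRECT ** (level - 1)) % PTRS_PER_INDIRECT
        path.append((level, index))
    return path, object_id % DNODES_PER_BLOCK
```

With these assumptions, indirect_path(3, 6) selects index 0 at every level and slot 3 in the level 0 block, which matches always taking the first indirect blkptr_t.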

Level 0 will contain the beginning of the array of dnode_phys_t for files and directories within the file system. We'll again use zdb to retrieve the block containing the first 0x20 entries. (Again, decompressed size is 0x4000, dnode_phys_t size is 0x200, so there are 0x20 entries in the first level 0 block.)

At this point, we could go to the fourth entry in the above output (object id 3 at 0x600 bytes into the file) and look at the directory contents to see if the removed file is there. (Remember, ls -aid on the directory containing the removed file shows inumber 3.) However, we'll be safe and examine the master node to get to the root directory of the file system. The master node is object id 1 (at 0x200 in the above output). The block pointer for this dnode_phys_t is for a zap object. We'll use mdb to dump the master node blkptr_t.

The mzap_phys_t is 0x80 bytes large. Following this are zero or more mzap_ent_phys_t. Each mzap_ent_phys_t is 0x40 bytes. The following will dump all mzap_ent_phys_t following the mzap_phys_t in the block.
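For readers without the modified mdb, a micro zap block can be decoded roughly as follows. This Python sketch rests on assumptions that should be checked against the zfs headers: that the 0x80 bytes of mzap_phys_t consist of a 0x40-byte header plus one embedded mzap_ent_phys_t (so entries are scanned from 0x40), that each entry is laid out as mze_value (64 bits), mze_cd (32 bits), a 16-bit pad, and a 50-byte name, and that the data is little-endian. mzap_entries is a hypothetical helper name.

```python
import struct

MZAP_HEADER = 0x40   # assumed header bytes before the first (embedded) entry
MZAP_ENT = 0x40      # size of one mzap_ent_phys_t


def mzap_entries(block):
    """Return {name: value} for every used entry in a micro zap block."""
    entries = {}
    for off in range(MZAP_HEADER, len(block) - MZAP_ENT + 1, MZAP_ENT):
        # assumed layout: mze_value, mze_cd, 16-bit pad, 50-byte name
        value, _cd, _pad, raw_name = struct.unpack_from("<QIH50s", block, off)
        name = raw_name.split(b"\0", 1)[0].decode("ascii", "replace")
        if name:   # an empty name is treated as an unused slot
            entries[name] = value
    return entries
```

In a directory zap the value also carries file type bits, so it may need masking before it is used as an object id.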

2 comments:

Hi. The modified mdb and zdb are available by sending me email at max@bruningsystems.com. These are modifications that I made. The modification to mdb allows one to use kernel CTF information on raw disks (or any other raw data file). The modified zdb allows one to dump decompressed blocks from the disk. There are also 2 dmods, one for zfs on raw disk, the other for ufs on raw disk.