Over time, I have received email from various people asking for help recovering files, pools, or datasets, or asking for the tools I talked about in the blog post and at the OpenSolaris Developers Conference in Prague in 2008. Those tools were a modified mdb(1) and a modified zdb(1M).
It is time to revisit that work.

In this post, I'll create a ZFS pool, add a file to the pool, destroy the pool, and then recover the file. To do this, I'll use a modified mdb and a tool I wrote to uncompress ZFS compressed data and metadata (zuncompress). Since zdb does not seem to work with destroyed zpools (in fact, much of zdb does not work with pools that will not import), I will not be using it. The code for what I am using is available at mdbzfs. Please read the README file for instructions on how to set things up.

For those of you who are running ZFS on Linux, at the end of this blog post, I have a suggestion on how you might try this on your ZFS on Linux file system.

Before you try this on your own, please back up the disk(s) in question. Use the technique I am showing at your own risk (note that nothing I am doing should change any data in the zpool). If you are using a file the way I do here, there is of course no need to make a backup.

First, we'll create a ZFS pool using a file, add a file to the pool, and then destroy the pool.

Note that the first time I tried this, I skipped the sync. I created the pool, added the file, and destroyed the pool before ZFS had committed the transactions to disk, so the file did not show up.

The steps we'll take to get the words file back from the destroyed pool start at the uberblock and walk the (compressed) on-disk metadata structures until we reach the file. If I (or someone else) ever get around to adding a "ZFS on disk" target to mdb, this will be much simpler.

The zfs.so and rawzfs.so files are built when you build mdb from my github repo. If you gmake world, you may not need to do the two loads.

So, in this case, the highest transaction group id is 0x14. Note that I am assuming this is the last active uberblock_t. If it doesn't work, try the next-lower id. Let's print out the uberblock_t for that transaction group id.
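The uberblock array lives in each vdev label: the first 256K label starts at offset 0 of the device (or file), and its 128K uberblock ring starts 128K (0x20000) into the label, in 1K slots when ashift is 9 or 10. Picking the highest transaction group id, and falling back to the next-lower one, can be sketched in C as below. This is a minimal sketch that assumes a native-endian pool and uses the uberblock_t layout (ub_magic at offset 0, ub_txg at offset 16); `pick_uberblock` and `get_u64` are hypothetical helper names, not part of mdb or the mdbzfs repo.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UB_RING_OFFSET  0x20000ULL   /* ring starts 128K into each 256K label */
#define UBERBLOCK_MAGIC 0x00bab10cULL
#define UB_TXG_OFFSET   16           /* ub_magic, ub_version, then ub_txg */

/* Read a 64-bit field in native byte order (hypothetical helper). */
static uint64_t
get_u64(const uint8_t *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof (v));
	return (v);
}

/*
 * Return the slot index of the valid uberblock with the highest txg
 * strictly below 'below_txg' (pass UINT64_MAX for the newest one),
 * storing that txg in *txgp. Returns -1 if no slot qualifies.
 */
int
pick_uberblock(const uint8_t *ring, size_t ring_len, size_t slot_size,
    uint64_t below_txg, uint64_t *txgp)
{
	int best = -1;
	uint64_t best_txg = 0;

	if (slot_size == 0)
		return (-1);
	for (size_t i = 0; (i + 1) * slot_size <= ring_len; i++) {
		const uint8_t *ub = ring + i * slot_size;
		uint64_t txg;

		/* Empty slot (or an opposite-endian pool, not handled here). */
		if (get_u64(ub) != UBERBLOCK_MAGIC)
			continue;
		txg = get_u64(ub + UB_TXG_OFFSET);
		if (txg < below_txg && (best == -1 || txg > best_txg)) {
			best = (int)i;
			best_txg = txg;
		}
	}
	if (best != -1 && txgp != NULL)
		*txgp = best_txg;
	return (best);
}
```

If the newest uberblock_t does not lead anywhere useful, calling it again with `below_txg` set to the txg that failed gives the next-lower candidate.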

So, there are 3 copies of the objset_phys_t specified by the blkptr_t, at DVA offsets 0x84800, 0x1284800, and 0x2484800 on the first (and only) vdev (the leading 0 in 0:84800:200). Each copy is compressed with lzjb; on disk, each is 0x200 bytes, and decompressed, the objset_phys_t is 0x800 bytes. Currently, mdb has no way to decompress the data, so we'll use the new tool zuncompress to uncompress it into a file.
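The 0:84800:200 notation is vdev:offset:asize, decoded from the two 64-bit words of a dva_t. DVA offsets are relative to the start of the vdev's allocatable space, which begins 4 MB (0x400000 bytes) into the device, after the two front labels and the boot block; if you pull a copy out by hand (with dd or a hex viewer), add that 4 MB. A sketch of the decoding, assuming the illumos spa.h bit layout (24-bit vdev and asize fields, a 63-bit offset plus a gang bit, sizes and offsets in 512-byte units); `decode_dva` and `dva_phys_offset` are hypothetical names.

```c
#include <stdint.h>

#define SPA_MINBLOCKSHIFT     9            /* DVA fields count 512-byte units */
#define VDEV_LABEL_START_SIZE 0x400000ULL  /* 2 x 256K labels + 3.5M boot block */

typedef struct dva_info {
	uint64_t vdev;    /* top-level vdev id */
	uint64_t offset;  /* bytes, relative to the allocatable area */
	uint64_t asize;   /* allocated size in bytes */
	int gang;         /* 1 if this is a gang block */
} dva_info_t;

/* Decode dva_word[0] and dva_word[1] of a dva_t. */
dva_info_t
decode_dva(uint64_t w0, uint64_t w1)
{
	dva_info_t d;

	d.vdev = (w0 >> 32) & 0xffffff;
	d.asize = (w0 & 0xffffff) << SPA_MINBLOCKSHIFT;
	d.offset = (w1 & ~(1ULL << 63)) << SPA_MINBLOCKSHIFT;
	d.gang = (int)(w1 >> 63);
	return (d);
}

/* Byte offset of the block within the device or backing file. */
uint64_t
dva_phys_offset(const dva_info_t *d)
{
	return (d->offset + VDEV_LABEL_START_SIZE);
}
```

For the first copy here (vdev 0, offset 0x84800, asize 0x200), this gives a byte offset of 0x484800 into /var/tmp/zfsfile.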

Let's get the blkptr_t in the objset_phys_t. It points either to a block containing the array of dnode_phys_t for the meta object set (MOS) of the pool, or to an indirect block containing blkptr_ts that point to the dnode_phys_t blocks or to more indirect blocks.
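Whether a blkptr_t points at data or at an indirect block is recorded in its blk_prop word, along with the sizes, compression, and object type that mdb prints (for example "L0 DNODE"). A sketch of decoding it, assuming the illumos spa.h bit layout: lsize and psize are 16-bit fields holding (size in 512-byte units) minus one, compression starts at bit 32, checksum at bit 40, object type at bit 48, and level at bit 56. `decode_blk_prop` and `bf64` are hypothetical names.

```c
#include <stdint.h>

#define SPA_MINBLOCKSHIFT 9

typedef struct bp_props {
	uint64_t lsize;     /* logical (decompressed) size in bytes */
	uint64_t psize;     /* physical (on-disk, compressed) size in bytes */
	unsigned compress;  /* enum zio_compress: 3 is lzjb */
	unsigned checksum;  /* enum zio_checksum: 7 is fletcher4 */
	unsigned type;      /* dmu_object_type_t: 10 is DMU_OT_DNODE */
	unsigned level;     /* 0 = data block, > 0 = indirect block */
} bp_props_t;

/* Extract the 'len'-bit field that starts at bit 'low'. */
static uint64_t
bf64(uint64_t x, int low, int len)
{
	return ((x >> low) & ((1ULL << len) - 1));
}

bp_props_t
decode_blk_prop(uint64_t prop)
{
	bp_props_t p;

	p.lsize = (bf64(prop, 0, 16) + 1) << SPA_MINBLOCKSHIFT;
	p.psize = (bf64(prop, 16, 16) + 1) << SPA_MINBLOCKSHIFT;
	/* 7 bits: newer pools use bit 39 as the embedded-data flag. */
	p.compress = (unsigned)bf64(prop, 32, 7);
	p.checksum = (unsigned)bf64(prop, 40, 8);
	p.type = (unsigned)bf64(prop, 48, 8);
	p.level = (unsigned)bf64(prop, 56, 5);
	return (p);
}
```

For a block like the MOS block decompressed in this post (lsize 0x4000, psize 0xa00, lzjb), the decode should report level 0 and type DMU_OT_DNODE, matching the "L0 DNODE" in mdb's output.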

In this case, the blkptr_t is for a block containing the MOS (an array of dnode_phys_t). The L0 DNODE in the above output shows that there are 0 levels of indirection; a case with multiple levels of indirection from a blkptr_t is shown below. We'll decompress the block.

# ./zuncompress -p a00 -l 4000 -o 83e00 /var/tmp/zfsfile > /tmp/mos
#
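The -p and -l options to zuncompress appear to take the physical (compressed) and logical sizes from the blkptr_t, 0xa00 and 0x4000 here. For lzjb blocks, the work it has to do can be sketched as the LZJB decoder below, modeled on the lzjb_decompress() routine in the ZFS sources with bounds checks added; it is an illustration, not zuncompress's actual code. Each group of up to eight items is preceded by a flag byte, and each item is either a literal byte or a 2-byte match (6-bit length minus 3, 10-bit backwards offset).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBBY        8
#define MATCH_BITS  6
#define MATCH_MIN   3
#define OFFSET_MASK ((1 << (16 - MATCH_BITS)) - 1)

/*
 * Decompress s_len bytes of LZJB data into exactly d_len bytes.
 * Returns 0 on success, -1 on truncated or malformed input.
 */
int
lzjb_decompress(const uint8_t *src, size_t s_len, uint8_t *dst, size_t d_len)
{
	const uint8_t *s_end = src + s_len;
	uint8_t *d_start = dst, *d_end = dst + d_len;
	unsigned copymap = 0;
	unsigned copymask = 1 << (NBBY - 1);

	while (dst < d_end) {
		/* Load a new flag byte every eight items. */
		if ((copymask <<= 1) == (1 << NBBY)) {
			if (src >= s_end)
				return (-1);
			copymask = 1;
			copymap = *src++;
		}
		if (copymap & copymask) {
			/* Match: copy mlen bytes from 'offset' bytes back. */
			if (s_end - src < 2)
				return (-1);
			int mlen = (src[0] >> (NBBY - MATCH_BITS)) + MATCH_MIN;
			size_t offset = ((src[0] << NBBY) | src[1]) & OFFSET_MASK;
			src += 2;
			if (offset == 0 || offset > (size_t)(dst - d_start))
				return (-1);
			const uint8_t *cpy = dst - offset;
			/* Byte by byte, so overlapping matches repeat data. */
			while (--mlen >= 0 && dst < d_end)
				*dst++ = *cpy++;
		} else {
			/* Literal byte. */
			if (src >= s_end)
				return (-1);
			*dst++ = *src++;
		}
	}
	return (0);
}
```

For the MOS block above, s_len would be 0xa00 and d_len 0x4000; any padding after the compressed stream in the physical block is simply never read.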

As mentioned earlier, the MOS is an array of dnode_phys_t. The decompressed block is 0x4000 bytes large.

An "object directory" (DMU_OT_OBJECT_DIRECTORY) is a "ZAP" object containing information about the meta objects. Meta objects in the MOS include the root of the pool, snapshots, clones, the space map, and other information. The ZAP object is contained in the data specified by the blkptr_t at location 0x240 in the above output.
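The 0x240 comes from simple arithmetic: the object directory is object 1 in the MOS, each dnode_phys_t is 0x200 bytes, and its first blkptr_t (dn_blkptr[0]) starts 0x40 bytes in, after the 64-byte dnode header. The same arithmetic shows that the 0x4000-byte MOS block holds 32 dnodes. A sketch in C with hypothetical helper names, reading only a few single-byte header fields:

```c
#include <stddef.h>
#include <stdint.h>

#define DNODE_SHIFT   9                  /* sizeof (dnode_phys_t) == 512 */
#define DNODE_SIZE    (1 << DNODE_SHIFT)
#define DN_BLKPTR_OFF 0x40               /* dn_blkptr[0] after the 64-byte header */
#define BLKPTR_SIZE   128                /* sizeof (blkptr_t) */

/* Offset of object 'objid's dnode within its decompressed dnode block. */
size_t
dnode_offset(uint64_t objid, size_t block_size)
{
	size_t per_block = block_size >> DNODE_SHIFT;

	/* objid / per_block would give which dnode block to read. */
	return ((size_t)(objid % per_block) << DNODE_SHIFT);
}

/* Offset of dn_blkptr[i] for that object. */
size_t
dnode_blkptr_offset(uint64_t objid, size_t block_size, int i)
{
	return (dnode_offset(objid, block_size) + DN_BLKPTR_OFF +
	    (size_t)i * BLKPTR_SIZE);
}

/* Single-byte header fields at the start of a dnode_phys_t. */
unsigned dn_type(const uint8_t *dn)    { return (dn[0]); }
unsigned dn_nlevels(const uint8_t *dn) { return (dn[2]); }
unsigned dn_nblkptr(const uint8_t *dn) { return (dn[3]); }
```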

There are more entries, but this is the one we want ("root_dataset"). The value 2 in mze_value is an object id: an index into the MOS array of dnode_phys_t where the root dataset is described.
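An object directory this small is a microzap: a 64-byte header whose first word is the block type (ZBT_MICRO), followed by 64-byte entries made up of mze_value (8 bytes), mze_cd (4), mze_pad (2), and a 50-byte mze_name. Finding "root_dataset" and its mze_value can be sketched as below, assuming native byte order; larger directories use the "fat" ZAP format, which this sketch does not handle. `mzap_lookup` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ZBT_MICRO     ((1ULL << 63) + 3)
#define MZAP_ENT_LEN  64
#define MZAP_NAME_LEN (MZAP_ENT_LEN - 8 - 4 - 2)   /* 50 */
#define MZAP_HDR_LEN  64                           /* mzap_phys_t before mz_chunk[] */
#define MZE_NAME_OFF  14                           /* after value, cd, pad */

/*
 * Look up 'name' in a decompressed microzap block.
 * Returns 0 and stores mze_value in *valp on success, -1 otherwise.
 */
int
mzap_lookup(const uint8_t *blk, size_t blk_len, const char *name, uint64_t *valp)
{
	uint64_t block_type;

	if (blk_len < MZAP_HDR_LEN)
		return (-1);
	memcpy(&block_type, blk, sizeof (block_type));
	if (block_type != ZBT_MICRO)
		return (-1);   /* fat ZAP, or not a ZAP at all */

	for (size_t off = MZAP_HDR_LEN; off + MZAP_ENT_LEN <= blk_len;
	    off += MZAP_ENT_LEN) {
		const uint8_t *ent = blk + off;
		const char *ename = (const char *)(ent + MZE_NAME_OFF);

		if (ename[0] == '\0')
			continue;   /* unused chunk */
		if (strncmp(ename, name, MZAP_NAME_LEN) == 0) {
			memcpy(valp, ent, sizeof (*valp));   /* mze_value */
			return (0);
		}
	}
	return (-1);
}
```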

For the dataset object set, there are 2 copies of the metadata (unlike the three copies for the MOS), and the "L6" says there are 6 levels of indirection. Indirect blocks are blocks containing blkptr_ts that point to blocks containing blkptr_ts... that eventually point to blocks containing data; in this case, 6 levels deep. We'll look at the first blkptr_t in each of these. Note that even for a large file system with lots of data, we would probably still need the first blkptr_t at each level, since the beginning of the object set (where the root of the file system is described) is what gets us started. In this particular case, the only blkptr_t used in any of the indirect blocks is the first one; the rest are "holes" (placeholders for when/if the file system has more objects). Given an object id, the arithmetic needed to find the correct path through the indirect blocks is covered in the papers mentioned at the beginning of this post.
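The core of that arithmetic is one shift and one mask per level. The sketch below assumes 16K dnode blocks (32 dnodes each) and 16K indirect blocks (128 blkptr_ts each); confirm both from dn_datablkszsec and dn_indblkshift in the meta-dnode you are walking. With the "L6" root described above, the tree has 7 levels (6 of indirection plus level 0). `objid_path` is a hypothetical name.

```c
#include <stdint.h>

#define SPA_BLKPTRSHIFT 7   /* sizeof (blkptr_t) == 128 */
#define DNODE_SHIFT     9   /* sizeof (dnode_phys_t) == 512 */

/*
 * Compute which blkptr_t to follow at each level to reach object 'objid':
 *   path[0]             index into the meta-dnode's dn_blkptr[]
 *   path[1..nlevels-1]  index within the indirect block at level
 *                       nlevels-1 down to level 1
 * Returns the slot of the object's dnode_phys_t in the level-0 block.
 */
uint64_t
objid_path(uint64_t objid, int datablkshift, int indblkshift, int nlevels,
    uint64_t path[])
{
	int dn_shift = datablkshift - DNODE_SHIFT;     /* log2(dnodes per block) */
	int epbs = indblkshift - SPA_BLKPTRSHIFT;      /* log2(blkptrs per block) */
	uint64_t blkid = objid >> dn_shift;            /* which level-0 block */

	path[0] = blkid >> (epbs * (nlevels - 1));
	for (int k = 1; k < nlevels; k++) {
		int level = nlevels - k;
		path[k] = (blkid >> (epbs * (level - 1))) & ((1ULL << epbs) - 1);
	}
	return (objid & ((1ULL << dn_shift) - 1));
}
```

For object 4259, this gives the path 0, 0, 0, 0, 0, 1, 5 and slot 3 in the level-0 block. Any object id below 32 gives all zeros, consistent with only the first blkptr_t being in use in this small pool.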

At this point, we'll repeatedly decompress a block and follow its block pointer until we reach level 0 (the dnode_phys_t array for the objects in the root dataset).

That's a lot of work. Is there a way to just "see" all of the information? Yes, it's called zdb(1M). But zdb is not interactive, and it does not work with destroyed pools (or pools that won't import). Also, I find that using mdb this way forces you to understand the on-disk format. For me, that is much preferable to having it all done for me.

I mentioned at the beginning of this post that this only works on illumos-based systems, i.e., systems with mdb. Solaris 11 and newer are not included because there is no way to build mdb for them without the source code. But what if you are using ZFS on Linux?

You could upload your devices (or files) as files to Manta, along with the modified mdb, the zfs.so and rawzfs.so modules, and the zuncompress program. Then use mlogin to log in to a Manta instance and try it from there. I've included built copies of mdb, the modules, and zuncompress in the github repo. Note that I have not yet tried this, but it will likely be the subject of a blog post in the next week or so.
