To benchmark the 10Gbit connection between two hosts, I read a 10GB file on host1 and wrote it to host2 using netcat; the transfer ran at 410MB/s.
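A sketch of the kind of test described above, assuming netcat is available on both hosts; the port number and file paths are placeholders:

```shell
# On host2 (receiver): listen on a port and write the incoming stream to disk.
nc -l 9999 > /tank/testfile

# On host1 (sender): read the 10GB file and pipe it over the network to host2.
nc host2 9999 < /tank/testfile
```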

When I do a ZFS send/receive, again with netcat over the same dedicated 10Gbit connection, I only get 70MB/s. The snapshot is 2.5TB and contains 15 million files.
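The send/receive pipeline looks roughly like this; pool, dataset, and snapshot names are placeholders:

```shell
# On host2 (receiver): pipe the incoming network stream into zfs receive.
nc -l 9999 | zfs receive tank/dataset

# On host1 (sender): serialize the snapshot and stream it to host2.
zfs send tank/dataset@snap1 | nc host2 9999
```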

Question

What could the reason for this slowdown be? Is the bottleneck that it takes a lot of time to roll back a snapshot with this many files, or does the number of files have no influence on ZFS rollback speed?

Update

I assume the 10GB file transfer test, where I got 410MB/s, simulates a ZFS send/receive with rollback. Under that assumption, I am very surprised to see such different speeds. I am comparing the two tests by speed so that I don't have to generate a 2.5TB file of random data.

So I don't understand why "read file from host1, transfer with netcat, write file to host2" is much faster than "zfs send snapshot from host1, transfer with netcat, ZFS receive/rollback on host2".

Maybe another way to ask the same question would be:

If I have two 2.5TB snapshots of equal size, where snapshot1 contains 1 file and snapshot2 contains 15 million files, will zfs receive take the same time for both of them, or will one be faster than the other?

I think the 15 million files have more to do with it. You are, after all, moving relatively small files instead of one big one.
–
Nathan C May 30 '13 at 14:06

Does that mean that a tiny snapshot is made for each file, instead of all the filesystem blocks being put into one stream?
–
Sandra May 30 '13 at 14:17

You're probably hitting an I/O bottleneck at this point. You'd see this slowdown with virtually any filesystem.
–
Nathan C May 30 '13 at 14:35

This question somewhat confuses me, as you suggest that you transferred a file using netcat @ 410 MB/s, then transferred zfs send/recv over netcat @ 70 MB/s. Then you also mention ZFS rollback speed. Was the original transfer a zfs send/recv, or not? Where is ZFS rollback coming into play here?
–
Nex7 May 31 '13 at 5:55

It is, thank you. I've filed what I think is an appropriate answer, based on my own history with zfs send/recv. (Something I didn't bother to mention is that the amount of other I/O going on in the pool while doing the zfs send/recv also matters; but the mbuffer solution can actually help mitigate that as well, assuming there are lulls in the non-send/recv I/O.)
–
Nex7 Jun 1 '13 at 6:28

2 Answers

The number of files and directories involved in a zfs send/recv stream should have no direct impact on its transfer speed. Indirectly, it might, because the 'spread' of the dataset across your disks will usually be greater with more directories and files, depending on the workload that generated them. This matters because it is far easier for a hard disk to do a sequential read than a random read -- and if the stream in question is scattered all over your disks, it will be much more of a random read workload than a sequential one.

Further, it is my understanding that there is ZFS metadata associated with files on ZFS filesystems (not on zvols); I have no direct numbers, but I would not be surprised if a single 2.5 TB file had, on the whole, significantly fewer metadata blocks associated with it than 2.5 TB spread across 15 million files. These (potentially many) extra metadata blocks add more data that must be read, and thus more disk reading (and potentially seeking). So yes, it is likely that, indirectly, a send stream consisting of 15 million files will be slower to create than one consisting of a single file of the same size (especially if that one file was created all at once, as a sequential write, on a pool with plenty of contiguous free space at the time).

It is very common for ZFS send/recv streams that are sent out unbuffered to have very spotty performance - at times they seem to go quite quickly, then will drop to nearly nothing for potentially long periods of time. The behavior has been described and even analyzed to some extent in various forums on the internet, so I won't get into it. The general take-away is that while ZFS can and should do some work on making it a more efficient workflow internally, a 'quick fix' for many of the issues is to introduce a buffer on the sending and receiving side. For this, the most common tool is 'mbuffer'.
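A buffered pipeline of the sort described above might look like this; the buffer sizes, port, and dataset names are illustrative choices, not recommendations from the original answer:

```shell
# On the receiving side: mbuffer listens on a port, absorbs bursts in a 1G
# RAM buffer, and feeds zfs receive at a steady rate.
mbuffer -s 128k -m 1G -I 9090 | zfs receive tank/dataset

# On the sending side: zfs send fills a local 1G buffer before the data
# goes out over the network, smoothing out the stalls in stream generation.
zfs send tank/dataset@snap1 | mbuffer -s 128k -m 1G -O host2:9090
```

The buffers decouple the bursty stream generation on the sender from the bursty writes on the receiver, so neither side has to stall waiting for the other.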

The reason for the big speed difference is that transferring a file and transferring a snapshot cannot be compared. A file is sequential I/O, while a snapshot is random I/O, as it contains the blocks that have changed.

That is not technically accurate. A sequential snapshot and a non-sequential file are both entirely possible (even probable, depending on the environment) in ZFS. There is no guarantee that reading through a file will be sequential I/O, any more than there is a guarantee that reading through an incremental snapshot will be non-sequential I/O. One may well be more likely to be random than the other, but there are no guarantees either way. I'm sorry, but this is just false.
–
Nex7 Jul 24 '13 at 7:14