Pinned topic: data block striping algorithm

‏2013-03-08T21:25:55Z

Dear all,

Recently we saw, in an I/O benchmark test, a clear imbalance in the distribution of the benchmark files' data blocks across the NSDs. This imbalance was clearly more pronounced than the imbalance in free/used blocks on the NSDs.

We had 1/3 of the NSDs in the file system first and added the other 2/3 later, so initial FS occupation after this join was 11%. We trusted that over time usage would even out the discrepancy between "old" and "new" NSDs.
About 2 months after that, we ran some benchmarks and saw only vanishing differences between the old and new NSDs with respect to block allocations.

A recent benchmark run of the same kind, however, exhibited a block distribution of about 7 blocks on a new disk for every 5 blocks on an old one. At that time, the NSDs had free blocks (according to mmdf) of 61% and 70% of their capacity (old and new, respectively). Given the vanishing difference earlier, when the free-block fractions were probably closer to 80% and 90%, respectively, a ratio of 5/7 in block allocation appears high to me; I could have understood something like 6/7, corresponding to the free-block portions.

This is the more puzzling as I was told that GPFS does not do any balancing here. Is it so that GPFS just -- blindly -- targets a data block at a disk (block), and if that block is not free, jumps to the next disk in striping order? What is the allocation segment in this respect -- is it a portion of one NSD, or certain portions of multiple/all NSDs? Does GPFS pass over an NSD if, within a certain area of that NSD, there are no free blocks (and would that area be the allocation segment)?

I attach the block locations (NSD-wise) for 5 of the 1000 benchmark files (i.e. NSD index vs. block number). The benchmark writes all files at the same time (MPI barrier), so the block-number axis also gives a kind of timeline. We have 9 failure groups, with 56 consecutive disks in each.
The order of the block targeting is (normally) like
d,   d+56,   d+2*56,   ..., d+8*56,
d+1, d+1+56, d+1+2*56, ...
d+2, ...
...
of course wrapping around if d+l in the first column exceeds the disk range within that FG. d is just a (random) disk number, so not all writes go to the same disks in sync.
The 9 lines of one colour in the diagram are formed by the data points corresponding to the columns of the above matrix. If that order were undisturbed, we would simply see those lines throughout.
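To make the targeting order concrete, here is a small sketch (not GPFS source code, just an illustration of the pattern described above) that enumerates the disk indices for the layout from this post: 9 failure groups of 56 consecutive disks each (504 disks total), with a random starting disk d and wrap-around within each FG.

```python
# Illustrative sketch of the round-robin targeting order described above.
# Layout from the post: 9 failure groups (FGs) of 56 consecutive disks.
NUM_FGS = 9
FG_SIZE = 56

def striping_order(d):
    """Yield 1-based disk indices in targeting order:
    d, d+56, ..., d+8*56, then d+1, d+1+56, ..., wrapping the
    within-group offset once it runs past the FG boundary."""
    for row in range(FG_SIZE):                # rows of the matrix
        for fg in range(NUM_FGS):             # one column per FG
            offset = (d - 1 + row) % FG_SIZE  # wrap within the FG
            yield fg * FG_SIZE + offset + 1

order = list(striping_order(d=5))
print(order[:9])   # first row: [5, 61, 117, 173, 229, 285, 341, 397, 453]
```

Each disk appears exactly once per full cycle, which is why an undisturbed order would trace out the 9 straight lines per colour in the diagram.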
However, we do not. What we do see is that for certain files, one third of the disks (the old ones, disk indices 337-504) receives, for a certain block range -- i.e., I assume, also for a certain period of time -- less down to almost no data, but later it does again. For example, from about block 9500 on, there are almost no blocks of the file with inode 100742090 (pink dots) in this disk range, and many are missing in 20000..22500 and 25000..27000; but from block 17500 up to block 20000, and especially above block 27000, the blocks of this file land almost completely on those disks again.
That also supports the assumption that GPFS does not necessarily write data to an NSD even if that NSD still has (plenty of) free blocks.

The blocks not written to an NSD in order appear as "out-of-order" data points between the lines somewhere. Interestingly, we do find those out-of-order points in all disk ranges (1..504), albeit more of them seem to be in the 1..336 range (the "new" NSDs).
Would it not be best to maintain an "empty block inventory" for each NSD and pick allocation targets from there? The picture from my findings is that GPFS does not do so -- yet GPFS must eventually find out whether a place on disk is free anyway?
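To illustrate what such an inventory-based allocator might look like, here is a hypothetical sketch (explicitly NOT how GPFS works; the function name and data shapes are made up for this example): keep a per-NSD free-block count and always allocate from the NSD with the most free blocks, so imbalanced NSDs converge toward equal occupation.

```python
import heapq

# Hypothetical "empty block inventory" allocator, as proposed in the post.
# free_blocks maps NSD name -> current free block count.
def allocate(free_blocks, nblocks):
    """Return the list of NSDs chosen for nblocks allocations,
    always picking the NSD with the most free blocks."""
    heap = [(-free, nsd) for nsd, free in free_blocks.items()]  # max-heap
    heapq.heapify(heap)
    targets = []
    for _ in range(nblocks):
        neg_free, nsd = heapq.heappop(heap)
        if neg_free == 0:
            raise RuntimeError("file system full")
        targets.append(nsd)
        heapq.heappush(heap, (neg_free + 1, nsd))  # one fewer free block
    return targets

# The emptier NSD absorbs allocations first, and the counts converge:
print(allocate({"old": 2, "new": 4}, 4))   # ['new', 'new', 'new', 'old']
```

The trade-off, presumably, is that a global inventory is a point of contention in a parallel file system, which may be one reason GPFS partitions the allocation map instead.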

So, I would be very happy if somebody could describe the data block striping in a bit more detail than just as "plain round robin" -- which it surely is not.

I think the simple take-away point is that GPFS tries to do a reasonable job of round robin allocation when allocating new blocks to a file - BUT - does nothing to try to make up for deficiencies in old allocations, which are likely to look silly after you add or delete some new disks to the file system -- until you issue restripefs. There is also a restripefile command, if you just want to try to redistribute the blocks of some selected files.

Re: data block striping algorithm

I think the simple take-away point is that GPFS tries to do a reasonable job of round robin allocation when allocating new blocks to a file - BUT - does nothing to try to make up for deficiencies in old allocations, which are likely to look silly after you add or delete some new disks to the file system -- until you issue restripefs. There is also a restripefile command, if you just want to try to redistribute the blocks of some selected files.

Hi,
Here I cannot fully agree.
What we saw was an imbalance in the block distribution over the NSDs for newly written files which was clearly more severe than the imbalance in free blocks on the NSDs.

In plain round robin, all disks should receive the same rate, except full ones, which receive none. Under another approach (balancing -- though I understood that is what GPFS is not supposed to do), given an initial imbalance, when writing data into (new) files each NSD would get its own individual (but henceforth constant) rate, such that all NSDs reach 100% usage at the same point in time.
This means that the rates the NSDs receive should be proportional to the free blocks on the respective NSDs.

We saw none of the behaviours described. Our test created 1000 (new) files and wrote 256 GiB into each. The NSDs initially had (61+-0.5)% and (70+-0.5)% of their blocks free (two groups, "old" and "new"). I would thus have expected the old disks to receive about 87% (61/70) of the data the new ones got, but they received only about 69%, i.e. little more than two thirds. That might have been an accident, but it shows that the striping method may have some disadvantages (I suppose it also has advantages, though) -- or at least can surprise you.
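The arithmetic behind that expectation is simple to re-check (numbers taken from the measurements above; the "expected" figure assumes allocation rates proportional to free blocks, which is exactly the hypothesis under discussion):

```python
# If allocation rates were proportional to free blocks, the old NSDs
# (61% free) should have received 61/70 of what the new NSDs (70% free)
# got, i.e. about 87%. The measured share was only about 69%, close to
# the observed 5:7 per-disk block ratio (~71%).
expected = 61 / 70   # proportional-to-free-blocks hypothesis
observed = 5 / 7     # measured block ratio, old disk : new disk
print(f"expected old/new rate: {expected:.1%}")   # 87.1%
print(f"observed old/new rate: {observed:.1%}")   # 71.4%
```

The gap between the two ratios is the imbalance that neither plain round robin nor proportional balancing would predict.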