Pinned topic: Controlling stripe order on disks?

‏2013-02-05T19:02:14Z
This question has been answered.

We have 10 storage arrays, each SAS-attached to a pair of servers. As this is an HPC environment, we have been watching performance under extreme load, and it seems we can bottleneck at the storage array. Since we have multiple LUNs on each array, we would like to make sure large files stripe in such a way that as many storage arrays as possible (not just GPFS "disks") are involved. Yet there doesn't seem to be a way to indicate to GPFS that different disks are on the same storage array...

Unless (as some rather old postings suggest) the meaning of "failure group" is overloaded to communicate such groupings. But that doesn't seem appropriate for our purposes. Our servers and storage arrays are arranged in two racks, and for fault tolerance, it is much more appropriate to arrange the contents of each rack into a single failure group. (So we have two failure groups, one for each rack. Most data is not replicated, but some is, as is all metadata.)

Is there an appropriate way to indicate to GPFS that it should consider a set of disks to be in the same "storage array" so that it will stripe over multiple arrays (and a way to change this if I currently have it wrong)? Or does GPFS somehow infer this information by means I am not considering? Or, given how performance tuning often gives counterintuitive results, has research shown this would be a waste of effort and I should just be happy I am getting the performance I am?

Re: Controlling stripe order on disks?

‏2013-02-06T20:59:34Z

This is the accepted answer.

There is no way to tell GPFS this information. It would have been best not to have multiple LUNs on a single array. GPFS assumes the "disks" are independent and that the best performance comes from scheduling prefetch or writebehind IO to as many disks as possible in parallel.

Re: Controlling stripe order on disks?


dlmcnabb wrote:
> There is no way to tell GPFS this information. It would have been best not to have multiple LUNs on a single array. GPFS assumes the "disks" are independent and that the best performance comes from scheduling prefetch or writebehind IO to as many disks as possible in parallel.

Can you clarify this please?

Are your comments specifically for a configuration where arrays are
treated as JBOD storage, relying on GPFS to do native RAID?

When you write "disk" above, do you mean individual physical drives
within an array, or an NSD (one LUN)?

For configurations where RAID is done at the array level, are you saying
that the best practice is to create one LUN per physical array (join
all physical drives into a single RAID group and present the entire RAID
group as a single LUN for GPFS to use as one NSD)?

I can understand if it was best not to have multiple LUNs on a single
RAID group, but I was under the impression that for SAN-attached storage,
using RAID arrays, the arrangement is typically:

GPFS filesystem =>

composed of "N" NSDs (GPFS stripes data across NSDs) ==>

one LUN per NSD ===>

one LUN per RAID group ====>

each RAID group is composed of "N" drives
(the array stripes data across physical disks)

Are you suggesting a different arrangement?

If GPFS Native RAID is used, would you suggest:

GPFS filesystem =>

composed of "N" NSDs (GPFS stripes data across NSDs) ==>

one LUN per NSD ===>

one LUN per physical disk drive
(disable RAID and caching on the storage
arrays)

Re: Controlling stripe order on disks?

> (quoting the preceding clarification questions in full)

By the quoted GPFS "disk" I meant what gets passed to mmadddisk or mmcrfs, not a physical disk. So in this sense it is a LUN presented by some controller or JBOD, or ...

Best practice is to have one LUN per traditional RAID group, and the GPFS blocksize should be a small multiple of the RAID stripe width; e.g. a 4+P RAID where each disk segment is 64K would be good for a 4*64K = 256K blocksize, or a multiple of 256K. This way, full-block writes to the LUNs do not have to do read/modify/write at the physical disks.
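The blocksize arithmetic above can be sanity-checked with a tiny helper (a hypothetical sketch for illustration, not a GPFS tool):

```python
def full_stripe_kib(data_disks: int, segment_kib: int) -> int:
    """Full-stripe width of a RAID group in KiB (parity disks excluded)."""
    return data_disks * segment_kib

def avoids_rmw(blocksize_kib: int, data_disks: int, segment_kib: int) -> bool:
    """A full GPFS block write avoids read/modify/write on the array when
    the blocksize is a whole multiple of the RAID full-stripe width."""
    return blocksize_kib % full_stripe_kib(data_disks, segment_kib) == 0

# 4+P RAID with 64 KiB segments: full stripe = 4 * 64 = 256 KiB
assert full_stripe_kib(4, 64) == 256
assert avoids_rmw(256, 4, 64)        # 256 KiB blocksize: aligned
assert avoids_rmw(1024, 4, 64)       # 1 MiB = 4 full stripes: also aligned
assert not avoids_rmw(384, 4, 64)    # partial stripe: read/modify/write likely
```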

Also, a single RAID array should not consist of all the physical disks in the enclosure. GPFS scales its prefetch depth with the number of LUNs it sees, so that it does not schedule too many IOs that would just queue up. If there is only one LUN, GPFS will not do much prefetch and will therefore be slower than it could be.

When you have multiple LUNs on an array, and one array has more LUNs than another, GPFS will effectively schedule more IO to that array than to the others. Also, the seeks between the LUNs can be large.

GNR only works with JBOD. It does its own RAIDing over a large number of physical disks, presenting virtual arrays that look like N+nP for some subset of N and P, to give you tolerance of multiple disk failures. This eliminates the expensive controller in favor of cheaper drawers of physical disks. GNR is currently only supported with a few qualified drawers.

Re: Controlling stripe order on disks?

‏2013-02-06T23:27:49Z

This is the accepted answer.

GPFS does straight round-robin striping of blocks across all disks in a given storage pool, with disks arranged in the order they were specified at mmcrfs/mmadddisk time (you can examine the disk order with mmlsdisk -L). So if all you are after is even loading of the different disk storage stack components, specify a list of disks that round-robins across servers, disk controllers, and RAID arrays.

When GPFS prefetch/writebehind is working well, there will typically be simultaneous IOs outstanding for a range of blocks from a given file, and if the disks are judiciously ordered, those IOs will be spread across a number of servers, disk controllers, and RAID arrays.
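A minimal sketch of such an interleaved disk ordering, with hypothetical NSD names (the real names would come from your own configuration, and the interleaved list is what you would feed to mmcrfs/mmadddisk):

```python
from itertools import zip_longest

# Hypothetical NSDs grouped by the RAID array (and the server/controller
# pair behind it); real names would come from your own configuration.
arrays = [
    ["srv1_ctlA_lun1", "srv1_ctlA_lun2"],
    ["srv1_ctlB_lun1", "srv1_ctlB_lun2"],
    ["srv2_ctlA_lun1", "srv2_ctlA_lun2"],
]

# Interleave so consecutive disks in the list sit behind different
# servers/controllers/arrays, matching the round-robin advice above.
order = [nsd for group in zip_longest(*arrays) for nsd in group if nsd]

# order == ['srv1_ctlA_lun1', 'srv1_ctlB_lun1', 'srv2_ctlA_lun1',
#           'srv1_ctlA_lun2', 'srv1_ctlB_lun2', 'srv2_ctlA_lun2']
```

With this ordering, consecutive GPFS blocks of a large file land on different arrays, so a burst of prefetch/writebehind IO spreads across the hardware rather than queueing on one controller.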

Re: Controlling stripe order on disks?

dlmcnabb wrote:
> There is no way to tell GPFS this information. It would have been best not to have multiple LUNs on a single array. GPFS assumes the "disks" are independent and that the best performance comes from scheduling prefetch or writebehind IO to as many disks as possible in parallel.

OK, thanks. But I am afraid I was unclear -- by "storage array" I meant the entire "box", e.g. a DCS3700 or DS3500, not the 8+2P RAID "array" that can be broken into LUNs -- I am so used to 1 LUN per "RAID array" that I think of these as the LUNs.

But it remains the case that LUNs in a "box" can impact each other's performance, even if they are all using different spindles, just because they share a SAS bus or other controller resources, which serializes (some of) the communication. I was wondering if there were a way to help GPFS know it should stripe over a LUN in another "box" next, to gain parallelism here. (I'm picturing a file system operation that puts enough data in flight to require several GPFS "disks" to participate, but not necessarily every "disk" in the file system. Maybe this is too rare to worry about?)

Re: Controlling stripe order on disks?

> GPFS does straight round-robin striping of blocks across all disks in a given storage pool, with disks arranged in the order they were specified at mmcrfs/mmadddisk time (you can examine the disk order with mmlsdisk -L). So if all you are after is even loading of the different disk storage stack components, specify a list of disks that round-robins across servers, disk controllers, and RAID arrays.
>
> When GPFS prefetch/writebehind is working well, there will typically be simultaneous IOs outstanding for a range of blocks from a given file, and if the disks are judiciously ordered, those IOs will be spread across a number of servers, disk controllers, and RAID arrays.

Thanks. Of course, over time, we may be adding or removing disks (or maybe we didn't know about the need to balance load over controllers and paths when we created the file system). Is there a way to change the order after the fact?

For instance, could I sequentially remove and re-add disks to a file system (so that when I completed, I'd have done the mmadddisk's in an optimal order)? Right now I have a couple of largish file systems (approaching 1 PB between them) that only have about 10-15 TB of data in use; I'd be willing to go through this pain now, but not again... It would be nice if there were another way to set the order.

Re: Controlling stripe order on disks?

> Thanks. Of course, over time, we may be adding or removing disks (or maybe we didn't know about the need to balance load over controllers and paths when we created the file system). Is there a way to change the order after the fact?
>
> For instance, could I sequentially remove and re-add disks to a file system (so that when I completed, I'd have done the mmadddisk's in an optimal order)? Right now I have a couple of largish file systems (approaching 1 PB between them) that only have about 10-15 TB of data in use; I'd be willing to go through this pain now, but not again... It would be nice if there were another way to set the order.

There isn't a way to simply re-order the disks that are already a part of a file system (this has to do with the block allocation map layout). If you want to change the order of disks, there are two ways:

1) Use mmdeldisk and mmadddisk. New disks are added on a first-fit basis, filling any empty slots that are present; it's a simple and deterministic arrangement. So if you delete a disk in slot 3, and all lower slots are taken, the next disk added via mmadddisk will go to slot 3. This is obviously a time-consuming process.
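The first-fit slot behavior described in option 1 can be illustrated with a small simulation (a hypothetical sketch of the described behavior, not GPFS code):

```python
def add_disk(slots: list, disk: str) -> int:
    """Place a new disk in the first empty slot (None), mimicking the
    first-fit behavior described for mmadddisk; append if all are full."""
    for i, occupant in enumerate(slots):
        if occupant is None:
            slots[i] = disk
            return i
    slots.append(disk)
    return len(slots) - 1

slots = ["d1", "d2", "d3", "d4"]   # disks in mmcrfs order
slots[2] = None                     # mmdeldisk frees slot 3 (index 2)
assert add_disk(slots, "d5") == 2   # new disk fills the vacated slot
assert add_disk(slots, "d6") == 4   # all lower slots taken: appended at end
assert slots == ["d1", "d2", "d5", "d4", "d6"]
```

So re-ordering an existing file system this way means deleting and re-adding disks one at a time, in exactly the order you want them to land, which is why it is so time-consuming.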

2) Create a new storage pool, and use the ILM functions to put data there. This may be a more natural approach if you're considering adding disks of different speed/capacity. You'll be able to order the disks in the new pool any way you want.

Re: Controlling stripe order on disks?

‏2013-02-07T18:28:19Z

This is the accepted answer.

Hello Dr. Todd (from an RPI alum),

First let me say that you are correct that there is a potential for less-than-expected scaling within and across multiple storage arrays if the ordering of NSDs when the file system was created was not well balanced. I use the terms "micro-clumping" and "micro-starvation" to describe the behavior.

We have studied this behavior, and developed a straightforward method to lessen its impact that is NOT labor intensive, and as such can be used effectively with large disk farms with thousands of disks.

However, from our experience (with a non-traditional, non-HPC-like GPFS storage layout topology), the performance impact of less-than-perfect NSD sequencing is NOT the first area to focus on in GPFS disk IO stack performance optimization.

Based on this experience, I would caution against jumping at what could look like a "quick fix" without doing some due diligence at the lower layers of the IO stack. We have identified at least seven layers in the storage-Linux-GPFS IO stack whose defaults are sub-optimal for large-file IO.

I will also emphasize that there is no "perfect" solution. In cross-coordinating the seven IO stack layers, there are tradeoffs imposed by the current restrictions of each layer. The impact of these tradeoffs can be trivial or significant, and they often add a performance-monitoring distortion that is difficult to counteract.

You did not say what type of storage you are using, or what the storage layout is.
What performance level are you achieving per storage array, and what level are you expecting?

Is this the classic HPC NSD server topology with dual NSD servers connected to each storage array? How many storage arrays are connected to the NSD server pair? A diagram would be useful.

The reason I ask is that your NSD ordering may be sub-optimal, but not enough to be affecting performance to the degree you may be observing. The lack of expected scaling may be elsewhere in the stack ... and easier to change.

As a reference point, with PRRC enabled (required for silent-data-corruption handling), each 60-disk DCS3700 should yield about 1600 MB/sec. We have successfully achieved 92% scaling across 6 storage arrays, and understand how to scale further.

The DCS3700 "marketing" performance rating of ~4,000 MB/sec is somewhat misleading, as it represents a best-case scenario with critical data-integrity options (PRRC) disabled.

We also have identified a DCS3700 performance anomaly under severe write stress that was posted on the GPFS forum in the past. We have identified the "footprint" of the anomaly and developed a workaround that lessens its impact. At this point, I do not believe you are hitting the anomaly.

Once I understand more about the configuration, I will be able to comment better.