low IO performance on FS

‏2013-05-28T15:11:37Z
|Tags: ignoreprefetchluncount performance

Dear all,

I have two different disk systems running GPFS, each of them serving one large file system. Both use 4 NSD servers with 8 Gbps FC connections to each disk subsystem. Each LUN/NSD contains about 10 physical disks. The first disk system is a DDN SA2900 with RAID 6; the other is formed by two IBM DCS3700s with RAID 5.

The DDN servers' network interfaces are 10 Gbps, while the DCS3700 servers have 4 x 1 Gbps bonding (the network works fine in both cases according to an iperf test).

Checking the Fibre Channel path directly with a dd command, we obtain good values in both cases

Re: low IO performance on FS

‏2013-05-28T16:57:44Z

This is the accepted answer.

You can dynamically try changing (mmchconfig with -I) that setting to see if it makes a difference. Just be careful when this is turned on, because you may overload the disks with queued requests. The only limit on this will be the setting of prefetchThreads.
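As a rough sketch of what such a trial change could look like, the snippet below only composes an `mmchconfig` command line and prints it; nothing is executed. The attribute name is taken from this thread's tag, while the value spelling (`yes`) and the node names are assumptions for illustration and should be checked against the documentation for the installed GPFS release.

```python
import shlex


def build_mmchconfig(attribute, value, nodes=None):
    """Compose an mmchconfig command line as a string; nothing is run here."""
    # -I asks for the change to take effect immediately without being made
    # permanent, which suits a "try it and see" experiment.
    args = ["mmchconfig", f"{attribute}={value}", "-I"]
    if nodes:
        # -N limits the change to the listed nodes.
        args += ["-N", ",".join(nodes)]
    return shlex.join(args)


# Hypothetical node names; the value "yes" is an assumption.
command = build_mmchconfig(
    "ignorePrefetchLUNCount", "yes", ["nsd1", "nsd2", "nsd3", "nsd4"]
)
print(command)
```

Printing the command rather than running it keeps the experiment reviewable before it touches a production cluster.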

Re: low IO performance on FS

It seems that the change has had an effect (I applied it only on the 4 NSD servers that manage one FS).

The throughput per server has grown from 200 MB/s to 300 MB/s (the four servers have increased by ~400 MB/s in total) with the same workflow and workload. These servers run over 60 tiers of 10+2 RAID 6 with 2 TB SATA disks. All the servers have redundant 8 Gbps FC access to the same enclosures/dual controller.

Re: low IO performance on FS

‏2013-07-08T22:31:05Z

This is the accepted answer.

Hi,

You have not detailed how the 4 x 1gbit links are bonded. In general, GPFS makes a single socket connection from an NSD client to the NSD server. Most Ethernet bonding schemes that use a hashing function (and don't break packet ordering) operate such that multiple connections to the bonded link can scale throughput, but a SINGLE connection cannot exceed the throughput of the underlying 1 gbit link. So a 4 x 1gbit bonded channel can have 4 different streams running at 1 gbit each, but can NOT have a single stream running at an effective 4 gbits.
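To illustrate why a hashed bond cannot speed up a single connection, here is a minimal sketch. The hash below is a stand-in chosen for illustration, not the formula of any particular bonding driver, and the addresses and port numbers are hypothetical.

```python
import zlib


def pick_link(src_ip, src_port, dst_ip, dst_port, n_links=4):
    """Map a flow to one member link of a bond using an illustrative hash."""
    # Every packet of a connection carries the same addresses and ports,
    # so the hash, and therefore the chosen link, never changes.
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % n_links


# One client/server connection: 1000 packets, always the same link.
flow = ("10.0.0.11", 40000, "10.0.0.1", 1191)
links_single = {pick_link(*flow) for _ in range(1000)}

# Many clients: different flows may land on different links.
links_many = {
    pick_link(f"10.0.0.{h}", 40000 + h, "10.0.0.1", 1191) for h in range(11, 75)
}

print("links used by one connection:", sorted(links_single))
print("links used by 64 connections:", sorted(links_many))
```

The single connection is pinned to one 1 gbit member link, while only a population of distinct connections has a chance of using all four.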

You have already shown that when you are local to the NSD server, you are getting 550 - 600 MB/sec. This exceeds both the aggregate performance of the 4 x 1gbit bonded link (~ 500 MB/sec max) and the single-stream performance of 1 gbit (~ 125 MB/sec). So the channel-bonded performance will be much different from the 10 Gbit case, which can handle a single stream as large as 1250 MB/sec.
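The figures in the paragraph above follow from a plain unit conversion. A small sketch, assuming decimal units and ignoring protocol overhead:

```python
def gbit_to_mbytes(gbit_per_s):
    """Convert a line rate in Gbit/s to MB/s (decimal units, no overhead)."""
    return gbit_per_s * 1000 / 8


single_1g = gbit_to_mbytes(1)    # one stream on a bonded member link
bond_4x1g = gbit_to_mbytes(4)    # best-case aggregate of the 4 x 1gbit bond
single_10g = gbit_to_mbytes(10)  # one stream on a 10 Gbit link

local_low, local_high = 550, 600  # MB/sec measured locally on the NSD server

print(single_1g, bond_4x1g, single_10g)
print("local rate exceeds the bond aggregate:", local_low > bond_4x1g)
print("local rate fits a single 10 Gbit stream:", local_high < single_10g)
```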

If this is a classic NSD server topology with the dual DCS3700s connected to dual NSD servers in a high-availability mode, you have 2 NSD servers. For half of the NSDs, one of the NSD servers should be the primary NSD server, and for the other half, the second NSD server should be listed as the primary.

For equal sized NSDs, GPFS's striping policies are primarily round-robin, with the round-robin ordering being driven by the order of the NSDs when the file system was created.

For this ordered list of NSDs, you would like the primary NSD server to toggle between the two NSD servers, and NOT have the whole first half of the NSDs going to the first server and the whole second half serviced by the second NSD server. This NSD configuration scheme is typical, but causes IO "clumping". If you have 60 tiers, with the first 30 all being serviced by the first NSD server, you will not even use the second NSD server until 31 GPFS read-aheads are launched ... which is a difficult read-ahead level to maintain.

By alternating the primary NSD server that is servicing each NSD from one NSD server to the second NSD server, you hope that every other NSD request will go to the other NSD server, and thus a read-ahead level of only 2 will keep both NSD servers busy (with 1 gbit streams).
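The read-ahead depths quoted above (31 versus 2) can be reproduced with a short sketch. It assumes strict round-robin striping that starts at the first NSD; the server labels "A" and "B" are placeholders.

```python
def depth_to_reach_all(primaries):
    """Count consecutive requests, starting at the first NSD, needed
    before every server has received at least one request."""
    wanted = set(primaries)
    seen = set()
    for depth, server in enumerate(primaries, start=1):
        seen.add(server)
        if seen == wanted:
            return depth
    return None  # not reachable for a non-empty list


n_nsds = 60

# "Clumped" layout: first 30 NSDs on server A, last 30 on server B.
clumped = ["A"] * 30 + ["B"] * 30

# Alternating layout: the primary server toggles with every NSD.
alternating = ["A" if i % 2 == 0 else "B" for i in range(n_nsds)]

print("clumped:", depth_to_reach_all(clumped))
print("alternating:", depth_to_reach_all(alternating))
```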

If you alternate the primary NSD server as suggested, a single NSD client with a 1 gbit connection would evenly load the dual NSD servers, with 1 gbit/2 each. With 8 NSD clients running your IO test, you could ideally max out the NSD servers. In reality, you will need a few more ... like 10-12 clients ... and you should approach an aggregate of 8 gbit (2 x 4 x 1gbit).

If you don't properly alternate the NSD servers, the IO will "clump". For example, let's assume that your block size is 1 MB, and you have the first 30 of the NSDs defined in the file system all being serviced from the first NSD server. Each 1 MB request will take about 1/125th of a second best case, or 8 milliseconds. You will typically send 30 NSD requests to the same server: 30 x 8 milliseconds = 240 milliseconds, almost a quarter of a second.

So in this example, you are doing 1/4 second of network traffic to NSD server 1, and then 1/4 second of traffic to NSD server 2. Each one of those 1/4 second sections of network traffic is limited to 1 gbit.
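The timing in this example is simple arithmetic; the sketch below restates it with the same inputs (1 MB blocks, a 125 MB/sec best-case stream, 30 requests per clump).

```python
def request_seconds(block_mb, link_mb_per_s):
    """Best-case transfer time for one block over one stream."""
    return block_mb / link_mb_per_s


def clump_seconds(block_mb, link_mb_per_s, requests_in_clump):
    """Time one client spends talking to a single server during a clump."""
    return requests_in_clump * block_mb / link_mb_per_s


per_request = request_seconds(1, 125)   # seconds per 1 MB request
per_clump = clump_seconds(1, 125, 30)   # seconds per clump of 30 requests

print(f"per request: {per_request * 1000:.0f} ms")
print(f"per clump:   {per_clump * 1000:.0f} ms")
```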

Without showing all the math ... if you have a "clump" of 30 IOs going to the same NSD server, you will need many more NSD clients to use the 4x1gbit bonded channel ... than a single 10 gbit channel.

What you are indirectly showing is that in the DDN case, with each NSD server having a 10Gbit connection (20 Gbit for 2 NSD servers), your GPFS topology is good enough to allow 600 MB/sec throughput via 2 x 10 gbit NSD servers. If the 10gbit link can handle at most 1250 MB/sec, your network efficiency is 600 / (1250 x 2) = 600 / 2500 = 24%.

With your 4 x 1 gbit channel bonded topology, you have a maximum of 2 x 4 x 1gbit = 8 gbit of network bandwidth to the NSD servers, or about 1,000 MB/sec.

For you to sustain a desired 600 MB/sec with only 1,000 MB/sec of network bandwidth available, you have to be able to operate at over 600/1000 = 60% network efficiency, rather than just 24% network efficiency as in the 10gig case. Your configuration settings are not sufficiently optimized to obtain the desired 60% efficiency.

If you go backwards, and apply the demonstrated efficiency of the 10gbit case to the 4 x 1gbit channel bond, you end up with 24% of (1000 MB/sec) or about 240 MB/sec ..... which is right about where you are.
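The efficiency argument above can be written out as a few lines of arithmetic. This sketch uses only the numbers given in the post.

```python
def efficiency(observed_mb_s, available_mb_s):
    """Fraction of the available network bandwidth actually achieved."""
    return observed_mb_s / available_mb_s


ddn_network = 2 * 1250        # 2 NSD servers x 10 gbit, in MB/sec
bond_network = 2 * 4 * 125    # 2 NSD servers x 4 x 1 gbit, in MB/sec

ddn_eff = efficiency(600, ddn_network)        # demonstrated on 10 gbit
required_eff = efficiency(600, bond_network)  # needed on the bonded links
projected = ddn_eff * bond_network            # same efficiency, bonded links

print(f"10 gbit efficiency:      {ddn_eff:.0%}")
print(f"required on 4 x 1 gbit:  {required_eff:.0%}")
print(f"projected on 4 x 1 gbit: {projected:.0f} MB/sec")
```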

You can use the mmchnsd command to change the NSD server ordering for each NSD. It can be done after the file system is created, but the file system must be shut down.

I would suggest "rotating" the primary NSD server for each of the NSDs so it alternates across the available NSD servers. This is a pain to do manually, but is not that difficult to script.
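One possible way to script such a rotation is sketched below. The NSD and server names are placeholders, and the `name:server,server,...` descriptor layout is an assumption about what `mmchnsd` expects; verify the exact input format in the command reference for the installed release before using anything like this.

```python
def rotated_server_lists(nsds, servers):
    """Pair each NSD with the server list rotated by one position per NSD,
    so that the primary (first) server cycles through all servers."""
    plan = []
    for i, nsd in enumerate(nsds):
        k = i % len(servers)
        plan.append((nsd, servers[k:] + servers[:k]))
    return plan


# Placeholder names for illustration only.
nsds = [f"NSD{i:02d}" for i in range(1, 61)]
servers = ["SERVER_A", "SERVER_B", "SERVER_C", "SERVER_D"]

plan = rotated_server_lists(nsds, servers)

# Assumed descriptor layout: "<nsd>:<server>,<server>,..."
descriptors = [f"{nsd}:{','.join(order)}" for nsd, order in plan]

for line in descriptors[:5]:
    print(line)
```

With four servers and 60 NSDs, each server ends up as primary for 15 NSDs, and consecutive NSDs never share a primary.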

If you have 60 NSDs, named NSD01 - NSD60, and NSD servers SERVER_A, SERVER_B, SERVER_C, SERVER_D,

Re: low IO performance on FS

The NSD server assignment is as you suggested (a total of 12 NSDs, RAID 5 (8+1), 3 TB SATA disks, 1 MB block size):

NSD001  server1, server2    NSD007  server3, server4
NSD002  server2, server1    NSD008  server4, server3
NSD003  server1, server2    NSD009  server3, server4
NSD004  server2, server1    NSD010  server4, server3
NSD005  server1, server2    NSD011  server3, server4
NSD006  server2, server1    NSD012  server4, server3
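A quick way to check how the primaries in this table are spread is sketched below. It assumes the first server listed for each NSD is its primary, and that the file system's NSD order follows the NSD names; neither is stated explicitly in the post.

```python
from collections import Counter

# Assignment transcribed from the table above.
assignment = {
    "NSD001": ["server1", "server2"], "NSD007": ["server3", "server4"],
    "NSD002": ["server2", "server1"], "NSD008": ["server4", "server3"],
    "NSD003": ["server1", "server2"], "NSD009": ["server3", "server4"],
    "NSD004": ["server2", "server1"], "NSD010": ["server4", "server3"],
    "NSD005": ["server1", "server2"], "NSD011": ["server3", "server4"],
    "NSD006": ["server2", "server1"], "NSD012": ["server4", "server3"],
}

# How many NSDs each server is primary for.
primary_count = Counter(order[0] for order in assignment.values())

# Primary server per NSD, in name order (assumed to be the striping order).
primary_sequence = [assignment[name][0] for name in sorted(assignment)]

print(dict(primary_count))
print(primary_sequence)
```

Under those assumptions each server is primary for three NSDs, but the first six NSDs only involve server1 and server2, so some clumping at the level of server pairs remains.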

For the 10 Gb network there are 60 NSDs balanced across four servers (4 MB block size).

After setting the ignorePrefetchLUNCount=1 option, the 10 Gb FS now shows throughput very near 2 GB/s, sustained over 12 h, while the 4x1 Gb FS reaches about 600-700 MB/s.

I think this sentence is revealing:

"Without showing all the math ... if you have a "clump" of 30 IOs going to the same NSD server, you will need many more NSD clients to use the 4x1gbit bonded channel ... than a single 10 gbit channel."

And I think that the number of NSDs has a great impact on this equation.

Re: low IO performance on FS

Thank you for the information. There is likely much going on here, and remediating all the issues will take some effort.

First, please realize that the round-robin bonding is NOT a "standard" bonding scheme. It will balance outgoing network traffic, but it does NOT help incoming network traffic. In general, incoming traffic will be directed to only one of the four paths. Since the round-robin bonding method is "transparent" to both the end client and any switches and routers in the network path, there is no mechanism to alternate between the 4 paths (and their 4 MAC addresses) on incoming traffic. There is some ARP spoofing that can be done to attempt to assign incoming connections to different paths, but once a TCP stream is connected the incoming stream will be limited to a single path's worth of bandwidth.

A bigger concern could be that the round-robin bonding method does NOT ensure the preservation of packet ordering. The likelihood of out-of-order packets will increase with congestion and the number of hops between the end point and the server. Please monitor the IP-level and TCP-level statistics with "netstat" to check the level of packet reordering and retransmissions going on. Not all protocols tolerate out-of-order packets well. I've seen conditions where a single dropped packet once every 2 seconds (over 100,000 packets) reduced overall throughput by more than 50%, as the network flow control mechanisms backed off and paced the return to full throughput.
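As a sketch of the kind of monitoring suggested here: on Linux, the TCP counters that `netstat -s` summarises are also exposed in `/proc/net/snmp`, where `OutSegs` and `RetransSegs` allow a retransmission rate to be estimated between two samples. The sample text below is made up for illustration, and this sketch covers retransmissions only, not reordering.

```python
def parse_snmp(text, proto="Tcp"):
    """Parse one protocol's header/value line pair from /proc/net/snmp text."""
    header = None
    for line in text.splitlines():
        if line.startswith(proto + ":"):
            fields = line.split()[1:]
            if header is None:
                header = fields  # first matching line holds the counter names
            else:
                return dict(zip(header, (int(v) for v in fields)))
    raise ValueError(f"no {proto} counters found")


def retransmit_ratio(before, after):
    """Retransmitted segments as a fraction of segments sent between samples."""
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    return retrans / sent if sent > 0 else 0.0


# Made-up samples; real input would be the contents of /proc/net/snmp
# read twice, some seconds apart, during the IO test.
sample_1 = "Tcp: InSegs OutSegs RetransSegs\nTcp: 5000 1000 10\n"
sample_2 = "Tcp: InSegs OutSegs RetransSegs\nTcp: 9000 101000 60\n"

ratio = retransmit_ratio(parse_snmp(sample_1), parse_snmp(sample_2))
print(f"retransmit ratio: {ratio:.4%}")
```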

The non-round-robin bonding methods are based on various hashing schemes. They may require compatibility with the Ethernet switch, and switch-side configuration. Mode "802.3ad" or "4" is the standardized link aggregation protocol and is supported by many switches. 802.3ad DOES preserve packet ordering.
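To see which mode a bond is actually running in, the Linux bonding driver reports it in `/proc/net/bonding/<bond>` on a line beginning with `Bonding Mode:`. A minimal sketch of reading that line follows; the sample strings are illustrative and the exact wording may vary between kernel versions.

```python
def bonding_mode(status_text):
    """Return the text after 'Bonding Mode:' from a bonding status dump."""
    for line in status_text.splitlines():
        if line.startswith("Bonding Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def is_8023ad(status_text):
    """True if the reported mode mentions 802.3ad link aggregation."""
    mode = bonding_mode(status_text)
    return mode is not None and "802.3ad" in mode


# Illustrative samples of what such a status file may contain.
round_robin = "Bonding Mode: load balancing (round-robin)\nMII Status: up\n"
lacp = "Bonding Mode: IEEE 802.3ad Dynamic link aggregation\nMII Status: up\n"

print(bonding_mode(round_robin), is_8023ad(round_robin))
print(bonding_mode(lacp), is_8023ad(lacp))
```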

Now ... please don't "shoot the messenger".

You did not mention the model of the DDN system with the 10 Gbit NSD servers, but your last posting mentioned that there were 60 NSDs, which would often imply 60 x 10-disk RAID tiers with "SATAssure". This is a total of 600 disks, with a GPFS blocksize of 4 MB. From my experience with the IBM DCS9900 / S2A9900, such a storage configuration should be running about 6400 MB/sec read or write, and about 7,000 MB/sec combined, across 8 x 8gbit FC controllers. With "average" levels of IO stack tuning, you would expect at least 5,000 MB/sec. Please note that this is 5 THOUSAND MB/sec, not the 6 HUNDRED MB/sec that you have described coming from the 10 Gbit NSD servers connected to the DDN storage.

The fact that you can only realize 600 MB/sec from a 5000+ MB/sec storage system indicates that there are likely multiple levels of inefficient operation. We regularly run at 6,500 - 7,000 MB/sec on our DCS9900 systems with at least 300 disks. Your configuration is running about 10% of expected utilization.

On the dual DCS3700, you have similar issues, and additional limitations. You are using a small 1MB GPFS block size, which is inefficient for large file IO. Using 4MB GPFS block sizes, and 4MB host IO, you can expect 1800-2000 MB/sec per DCS3700 with the base controllers and 60 disks, for a total of 3600-4000 MB/sec for dual systems. Using a 2 MB GPFS block size roughly halves the bandwidth (with the same number of disks), and using a 1 MB block size roughly halves the bandwidth again. So you end up with an expectation of only 900-1000 MB/sec from the dual DCS3700s.
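The halving rule of thumb above can be tabulated with a few lines. This is only a restatement of the figures in the post for block sizes of 4, 2 and 1 MB; it is not a general performance model.

```python
def expected_range(block_mb, base_block_mb=4, base_range=(3600, 4000)):
    """Scale the dual-DCS3700 estimate linearly with GPFS block size,
    following the 'each halving of block size halves bandwidth' rule."""
    factor = block_mb / base_block_mb
    low, high = base_range
    return low * factor, high * factor


for block_mb in (4, 2, 1):
    low, high = expected_range(block_mb)
    print(f"{block_mb} MB blocks: {low:.0f}-{high:.0f} MB/sec")
```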

So you are potentially wasting 3/4 of the throughput capability. Also, when running at 900-1000 MB/sec the disks will be nearly 100% busy, leaving little headroom for the IOPs needed for metadata activity. Therefore, any metadata IO activity will lessen the throughput available for large-file IO. If the GPFS block size were 4 MB rather than 1 MB, and you were running at the same 900-1000 MB/sec level, your disks would only be about 25% busy, leaving about 75% of the disk resources for metadata IOPs. This would be about 5400 IOPS for metadata and 1000 MB/sec.
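The busy and headroom fractions follow directly from the throughput figures. The post does not say how the 5400 IOPS number was obtained; the breakdown in the sketch below (120 disks at about 60 IOPS each) is purely an assumption that happens to be consistent with it.

```python
def busy_fraction(running_mb_s, capable_mb_s):
    """Share of disk capability consumed by the streaming workload."""
    return running_mb_s / capable_mb_s


busy_1mb = busy_fraction(1000, 1000)  # 1 MB blocks: disks saturated
busy_4mb = busy_fraction(1000, 4000)  # 4 MB blocks: same load, more capability
headroom_4mb = 1 - busy_4mb

# ASSUMPTION (not from the post): 2 x 60 disks, roughly 60 IOPS per disk.
assumed_disks = 120
assumed_iops_per_disk = 60
metadata_iops = assumed_disks * assumed_iops_per_disk * headroom_4mb

print(f"busy at 1 MB blocks: {busy_1mb:.0%}")
print(f"busy at 4 MB blocks: {busy_4mb:.0%}, headroom {headroom_4mb:.0%}")
print(f"metadata IOPS under the stated assumption: {metadata_iops:.0f}")
```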

So the fact that you are now approaching 600-700 MB/sec across the GPFS clients, with a max of 1000 MB/sec across the channel-bonded network and 900-1000 MB/sec to the disks ... is admirable.

However, if not network-limited, you could be at ~ 3600-4000 MB/sec from the dual DCS3700s with a 4 MB GPFS block size, and 6000+ MB/sec from the DDN with 300+ disks.

The Linux IO stack can be viewed as an automotive drive train. However, like automotive drive trains, all the proper "gear ratios" need to be assigned depending on the capabilities of the sub-components. The end-to-end gear ratio is an arithmetic function of all the intermediate components with multiple combinations yielding similar end results. There can be more than one "right" combination.

The gear ratios and number of transmission speeds for a "car" will be much different from those of a tractor-trailer truck. The "out of the box" experience with the Linux and/or GPFS "defaults" is closer to a mid-range "car" than a tractor-trailer truck in this analogy. Putting a high power truck engine into a "car" drive train will be sub-optimal without changing some gear ratios. Correspondingly, putting a car engine into a "truck" drive train will also be sub-optimal without changing some gear ratios.

There are at least 7 layers of the IO stack on the NSD server, and 3 additional layers when including the connection to GPFS client.

As elements in the "drive chain", at least seven to ten layers of the Linux IO stack need to be cross-coordinated with ultra-large IO and wide striping in mind:

Storage

Fibre Channel / SAS / IB SRP Driver

SCSI Device Handler Module

Block Layer

DM-Multipath

Logical Volume Manager

File System (GPFS NSD-server side)

Cluster-interconnect network (NSD-server side)

Cluster-interconnect network (NSD-client side)

File System (GPFS client-side)

A constriction in any of the seven to ten layers can limit the overall end-to-end efficiency and throughput. Such a constriction can also mask the effect of changes in other layers.
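The masking effect can be shown with a toy model in which end-to-end throughput is simply the minimum over the layers. The per-layer figures below are hypothetical placeholders, not measurements from this thread.

```python
def bottleneck(layer_limits):
    """Return the name and limit of the most constrained layer."""
    name = min(layer_limits, key=layer_limits.get)
    return name, layer_limits[name]


# Hypothetical per-layer limits in MB/sec, for illustration only.
layers = {
    "storage": 4000,
    "block layer": 900,
    "cluster network": 1000,
    "GPFS client": 3000,
}

before = bottleneck(layers)

# Tuning a layer that is not the bottleneck changes nothing end to end.
tuned_elsewhere = dict(layers, **{"cluster network": 2500})
after_network_tuning = bottleneck(tuned_elsewhere)

# Only relieving the constricted layer moves the end-to-end result.
tuned_bottleneck = dict(layers, **{"block layer": 3500})
after_block_tuning = bottleneck(tuned_bottleneck)

print(before, after_network_tuning, after_block_tuning)
```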

Also, the DDN configuration is a more monolithic storage system with "strong" storage controllers. The DCS3700 is a more modular topology with relatively "weak" storage controllers. Thus the IO optimization techniques and/or focus are somewhat different.

Re: low IO performance on FS

On your DDN S2A9900 configuration with 4 NSD servers ... yes, you are correct that the maximum performance for read OR write is 4 x 1250 MB/sec = 5 GB/sec due to the single 10GbE connection to each NSD server.

However, when you run in mixed mode, with both read and write activity, you can use the full-duplex capabilities of both the 10GbE connections and the 8Gbit FC connections to the storage. So running a mixed read/write workload, you can approach the limits of the S2A9900 of 6,400 - 7,000 MB/sec ... but this requires near-perfect balance of the IO across both the NSD servers' 10GbE connections and the 8 x 8gbit FC ports on the storage processors themselves. Also, running your NSD interconnect on 10GbE (vs. Infiniband) adds additional layers of potential restrictions and/or bottlenecks, especially since GPFS does not yet support the more-efficient RDMA-like protocols on Ethernet. However, properly configured, the 10GbE environment can perform efficiently.
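A rough sketch of the one-way versus mixed-mode ceilings described above. The 10GbE figure comes from the thread; the value of about 800 MB/sec per direction for an 8 Gbit FC port is a nominal figure assumed here, and the 7,000 MB/sec controller ceiling is the upper end of the range quoted in the post.

```python
nsd_servers = 4
eth_per_direction = 1250   # MB/sec per 10GbE link, per direction
fc_ports = 8
fc_per_direction = 800     # MB/sec per 8 Gbit FC port (assumed nominal value)
controller_ceiling = 7000  # MB/sec, upper end of the range quoted in the post

# One-way traffic (read OR write) uses only one direction of each link.
eth_one_way = nsd_servers * eth_per_direction
fc_one_way = fc_ports * fc_per_direction
one_way_cap = min(eth_one_way, fc_one_way)

# Mixed traffic can use both directions of the full-duplex links,
# so the storage controllers become the limiting element.
mixed_cap = min(2 * eth_one_way, 2 * fc_one_way, controller_ceiling)

print("one-way cap:", one_way_cap, "MB/sec")
print("mixed cap:  ", mixed_cap, "MB/sec")
```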

I am willing to continue the discussion off-line by email. I would suggest that the final resolution be posted back on the GPFS forum for others to benefit from.