Brocade FabricOS v7.3x is now officially supported for IBM clients. Among all the new features and improvements there are some I would like to cover in small blog entries, especially the ones directly related to support and troubleshooting.

One command to rule them all

Investigating ongoing problems usually starts with setting a baseline. To tell the current problem from the battles of the past, you need to clear the counters carefully. Over the years, hardware platforms, and FOS versions these commands changed again and again. Portstatsclear was such a command. Years ago it was like Russian roulette - you never knew what it would really clear. This port? The ports in the same portgroup? All physical counters but not the stuff on the right side of porterrshow? Statsclear cleared all ports - at least the external FC ports. You needed another command for internal blade counters. And for the GigE interfaces you needed portstatsclear again.

All you need in FOS v7.3 is supportinfoclear. It clears all port counters and, in addition, the portlogdump. You only need to execute:

supportinfoclear --clear -force

The -force prevents it from asking you whether you are really, really sure about doing it. Additionally you can clear the error log, too, by using -RASlog (case-sensitive). But at least for anything support-related I don't recommend doing that unless instructed otherwise.

And another improvement: It will be in the clihistory, even if you execute it via plink or ssh without opening a shell on the switch. So no more worrying about how to execute it. Just use your favorite script or do it directly, and IBM support will see how reliable the data is.
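For example, from a management server (a sketch - switch name and user are placeholders; plink works analogously):

ssh admin@sanswitch01 "supportinfoclear --clear -force"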

Update Nov 3rd:

And another way it rules them all: As Serge writes in the comments below, it will clear the counters for all ports regardless of their VF-membership. So no hopping through logical switches or need to use fosexec! Thanks Serge!

And as described in "How to avoid support data amnesia" over there in the Storageneers blog: Please think about when to execute this command! While it's safe to clear the counters for really ongoing or 100% reoccurring problems, you need to gather supportsaves first if you want the root cause analyzed for something that happened in the past. Otherwise supportinfoclear might wipe all the indications and evidence needed to find out what happened!

Sometimes we notice that an ISL is actually a bottleneck in a fabric. Not a congestion bottleneck, where the throughput demand is just too high for the ISL's bandwidth - that one could be solved by putting another cable between the two switches. But if you have a latency bottleneck, your ISL won't be running at the maximum of its bandwidth. The contrary is the case: it lacks the buffer credits to ensure proper utilization. If you see a latency bottleneck on an ISL it's often back pressure from a slow drain device attached to the adjacent switch. But every now and then I get a case where it's just the ISL. Sometimes in one direction, sometimes in both. Even with lengths where you wouldn't think about using long distance settings at all.

But in the past we did exactly that!

When we encountered a situation like that, the first step was always to get rid of everything that reduces buffer credits for the real traffic flows, like an active QoS setting without QoS zones. If the problem was still there, the only way to give the link more buffers was to configure a long distance mode. We solved performance bottlenecks on ISLs by setting up a, let's say, 50m ISL in a 10km long distance mode (LE). I described this 2 years ago in the article How to NOT connect an SVC in a core-edge Brocade fabric. While this indeed gives you more buffers, it comes with a drawback.

Long distance and Virtual Channels

On normal ISLs we have Virtual Channels. They work in a way that the buffer credit management of the ISL is logically partitioned into 8 channels. For normal Class 3 open systems traffic, they are used this way:

VC   Used for
0    Class F
1    not used
2    Class 3 data
3    Class 3 data
4    Class 3 data
5    Class 3 data
6    Multicast
7    Multicast/Broadcast

VC 0 is used for inter-switch communication, for example when a new zoning configuration is distributed to all switches. The VCs 6 and 7 are not really of interest most of the time. We have to focus on VCs 2, 3, 4, and 5. (Mind the Oxford comma!) If you have a slow drain device that is reached using Virtual Channel 2 in your fabric, then at least the traffic of the other 3 data VCs is unaffected. With a long distance mode like LE you lose that advantage.

Buffer distribution on Virtual Channels on a normal ISL:

VC        0   1   2   3   4   5   6   7
credits   4   0   5   5   5   5   1   1

Buffer distribution on Virtual Channels on an LE-configured ISL:

VC        0   1   2   3   4   5   6   7
credits   4   0   80  0   0   0   1   1

While you have more buffers in total now, only the first data VC has them assigned. There is no partitioning of the data traffic anymore, and the result is the risk of Head of Line Blocking (HoLB). A latency bottleneck (for example due to back pressure from a slow drain device) will always impact ALL the user data going over that ISL! That's a high price for those additional buffers.

The solution

With FabricOS v7.2x Brocade introduced a new command:

portcfgeportcredits

It allows you to assign a freely configurable number of credits between 5 and 40 for that ISL. You might ask:

But LE mode gives me 80 on 16Gbps!?!

Yes, but look at the distribution:

VC        0   1   2   3   4   5   6   7
credits   4   0   40  40  40  40  1   1

The data VCs will not share the 40 buffers: each data VC gets its own 40 buffers and they are still handled independently! No Head of Line Blocking! And remember: this is not meant for long distance connections, and it still comes for free! It works on 8G switches, too, as long as they are running at least v7.2x.

To give 40 buffers to each data VC on an ISL at port 1 you would enter:

portcfgeportcredits --enable 1 40

With the --disable parameter you switch back to normal mode and with --show you can see the current configuration of a port.
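For example, for the ISL on port 1 (output omitted here, as it varies by FOS version):

portcfgeportcredits --show 1
portcfgeportcredits --disable 1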

And please keep an eye on the number of remaining buffers in portbuffershow :o)

So from now on, if you just need some more buffers on your ISLs to keep everything running smoothly: portcfgeportcredits is the way to go.

Well-made professional education is worth every cent, but in today's world, controlled by CFOs, everything that costs money will be challenged sooner or later. And if you search for freebies, you often end up with the first 3-4 sentences of an obviously good book about the topic and a prompt to register with your business information and email address. Weeks of business SPAM will follow, even if you unsubscribe again. Here are some good free books to get a solid understanding of SAN switching and how it's implemented by the two big players, Cisco and Brocade, without the need to register for anything.

Introduction to Storage Area Networks and System Networking

Working at IBM, I appreciate their Redbooks program. Experts from inside and outside IBM share their knowledge in the form of these comprehensive ebooks. This one is a good introduction to SAN and how IBM does it. You learn how Fibre Channel works, the hardware, the software, the management, the use cases and the design considerations. And of course it covers the IBM products in that area, too.

SNIA Dictionary

Regular readers of my blog (are there any?) may know my opinion about the SNIA Dictionary, but for learning Storage Networking it's still a good source of definitions and explanations for many of the common terms and concepts. Get it directly from snia.org.

Cisco MDS 9000 Family Switch Architecture

This document is also known as "A Day in the Life of a Fibre Channel Frame" and I like it. It certainly has seen some summers and winters since its release in 2006, but the general architecture is still the same. Of course everything is integrated and consolidated in the latest products, but if you ever understood how a frame is handled by an older-generation Cisco switch, it won't be a problem to work with, design for, or even troubleshoot the newest ones.

Brocade Fabric OS Administrator's Guide

While Brocade is certainly not revealing too much about the internals of their switches, the admin guide is still a good source of information about the Brocade features and implementations. Many SAN questions I'm asked in an average week could easily be answered by a glimpse into this guide. There is a new one for each new major codestream, so always look in the one for your installed FabricOS version. This is the link for FabricOS v7.2.

The remaining two ebooks on my list are specifically for performance troubleshooting... ...my hobbyhorse somehow.

Slow Drain Device Detection and Congestion Avoidance

This one is from Cisco and it covers the different types of performance problems pretty well. If you read the one about Cisco architecture before (see above), you can get much more out of this piece as well. It has some good examples, troubleshooting approaches and explanations for the counters you might see. A definite must-read.

IBM Redpaper: Fabric Resiliency Best Practices

This one is about Brocade switches and the IBM version of their "SAN Fabric Resiliency Best Practices". After explaining the fundamentals about SAN performance it shows you how performance troubleshooting is done on a Brocade fabric, especially by using built-in features like bottleneckmon.

I'm sure there are many other good learning materials out there that don't exist for the sole reason to catch your contact addresses by registration. If you know some that should be on this list as well, please let me know. Thanks!

I don't always write technical blog posts. But when I do, I make them long and the conclusion contains a request to you, my readers, to do this or that. I won't do that today. Today is about a behavior I observed, but I won't propose anything. Feel free to draw your own conclusions. Well, that might be considered a proposal :o)

This one is about the IBM System Storage SAN06B-R, a multi-protocol router or SAN extension switch. It consists of two ASICs - one handling the Fibre Channel part and one for FCIP. They also have some extra tasks like FC routing and compression, but for our example it's enough to know that there are two of them, and if you want to transfer SAN traffic over FCIP, it has to pass both.

The two ASICs are connected via 5 internal ports, all working with a line rate of 4Gbps. That doesn't sound like much compared to the 16 FC ports running at up to 8Gbps on the front side. But we have to keep in mind that those ports are only for connectivity: given the maximum IP connectivity of 6x 1GbE, the internal connections shouldn't be a bottleneck.

Shouldn't...

Internal connections are somewhat similar to external ISLs between switches when it comes to flow control. They use buffer-to-buffer credits ("buffer credits") and the links are logically partitioned into virtual channels, each of them with its own buffer credit counter. These virtual channels prevent head of line blocking in case of back pressure (for example due to slow drain devices on the other side of FCIP connections).

When it comes to buffer credits, it's important how they are assigned to these virtual channels. Within these internal connections each VC gets 1 buffer, but it can borrow 3 out of a pool. The pool is shared among all VCs for that port and contains 11 in total.

You might say, "Yeah, but hey, it's just a very short connection on the board. Who needs those buffer credits anyway?", but keep in mind they are not just for spanning the tiny distance. There are multiple reasons why frames need to be touched here and therefore buffered. Plus, of course, possible external back pressure. Often a few buffer credits make the difference between normal traffic flow and piling up of frames or even frame discards due to timeout.

I guess the last thing you want to have is an artificial bottleneck inside of your routers...

So the amount of buffers and buffer credits for each internal connection depends on how many VCs are in use. And that's the crux. The number of VCs per internal connection depends on the number of...

Tunnels!

A tunnel consists of 1-6 circuits, so you can bundle several GbE interfaces together. They call it FCIP trunking. Some features, e.g. Tape Pipelining, require the use of only one tunnel. There's not much we can do about that. For an environment that doesn't utilize such features, it starts to get interesting now: If you have only 1 tunnel, you have only 1 VC and therefore only 4 buffer credits, plus the risk of head of line blocking! In addition, if you actually spread the traffic across the low, medium and high priorities within a circuit, you would get a separate VC for each priority.

Using only the standard "medium" priority for the data traffic (F-class "administrative" fabric traffic uses its own VC and falls out of this equation, of course) would give you the following amount of buffers on each of the 5 internal connections between the ASICs:

# of tunnels   # of VCs   # of buffers
1              1          4
2              2          8
3              3          12
4              4          15
5              5          16
6              6          17

(1 buffer per VC + 3 to borrow per VC out of a pool of 11)
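By the way, the formula behind the table is easily checked with plain shell arithmetic - nothing switch-specific, just the rule from the parenthesis above:

vcs=4                                # VCs in use (= tunnels, medium priority only)
pool=11                              # shared borrow pool per internal port
borrow=$(( vcs * 3 ))                # each VC may borrow up to 3 credits...
(( borrow > pool )) && borrow=$pool  # ...but the shared pool caps the total
echo $(( vcs + borrow ))             # 1 dedicated buffer per VC + borrowed = 15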

Please be aware that the number of VCs/buffers is only one point that needs to be taken into consideration when planning and configuring the optimal FCIP connection. You can find a good overview of the other ones in Brocade's FCIP Administrator's Guide for your FabricOS version.

I thought I'd never have to write about fillwords. I thought: there would be a phase of some months and then this topic would be dead. Strangely enough it's still alive. I still get questions about them, I still see people blaming them and I still see avoidable problems caused by changing them.

Fillwords?

For every new line rate (now read "Generation" or "Gen"), the switch and HBA vendors are usually the first ones to adopt the new standard and release their products. It was the same for 8Gbps, which came with a new fillword. Fillwords are 4-byte words without a special task. A port sends them whenever it doesn't have to send something else. They're used to maintain the synchronization of the link, and therefore the fillword used up to and including 4Gbps was fittingly called IDLE. Depending on the workload, ports have one thing in common with the CPU utilization of a PC: You see a lot of IDLE. Therefore it made sense to think about the optimal fillword, and so it was changed for 8Gbps. In the first published version it was quite like "Let's replace all instances of IDLE with a better one: ARBff". The first products were developed accordingly, among them Brocade's 8Gbps switches.

Later it turned out that it would be better to not just replace all IDLEs out of hand, because they were not only used as a fillword, but in the link initialization, too. The standard was updated and then said, "Use ARBff as a fillword, but keep the IDLE for link initialization".

For products released after that point in time the vendors usually implemented the new version of the standard, which was not compatible with the first one. So clients bought new 8Gbps-capable devices, for example DS5000 boxes or SAN Volume Controllers, and failed to get them online. These devices tried to use the standard-compliant word during the critical link initialization phase and when they noticed that the switches sent the wrong ones, the link initialization failed.

I have to admit that most vendors' communication was very "unlucky" at that time. Everybody blamed everybody else. After some protocol traces it was clear that the problem was the use of ARBff during link initialization. So as a workaround we recommended configuring the switches to use IDLE again (mode 0). Eventually new firmware versions were written and Brocade came up with two new fillword modes - one of them compliant with the standard (mode 2) and another, more dynamic mode 3. The latter tries ARBff in link initialization first (like mode 1) and if that fails, it behaves like mode 2. So mode 3 became the natural choice.

For some time we had a lot of cases for that problem and many people in the broad area of storage got in touch with the term fillword. While the number of problem cases decreased, the memory of fillwords stayed active in people's minds. In addition there is a counter called "er_bad_os" for each port. It means "Error: bad ordered set" and increases basically in 2 situations: 1) if such a 4-byte word is corrupted, or 2) if the port receives an ordered set it didn't expect. The first situation is a problem, but you get other indications as well ("enc out", "enc in", ...). The second situation could for example happen if a running port expects the IDLE fillword (because it was configured to mode 0 as a workaround as stated above) but receives ARBff. Although the counter increases in the ASIC, there is no impact on a running connection. In fact the Fibre Channel protocol says that each well-encoded ordered set without any other function should be treated the same way as IDLEs. So as long as there is no bit error in them, it doesn't matter what kind of fillword is received - the switch must use it to maintain the synchronization.
However, the myth was already born: Blame it on the fillword! For a lot of totally unrelated problems - performance problems, CRC errors, occasional link resets and even SFP heat issues - SAN admins and even support personnel for the attached devices blamed the fillword. "The fillword is wrong!", "Change the fillword first!", "Look at this rapidly increasing error counter!" - Changing the fillword mode to 3 became the new mantra for every howsoever remote storage problem. And now it's very similar to bloodletting in the medicine of previous centuries: A sophisticated-sounding theory everybody could agree on, and a simple action plan.

But just like bloodletting, it only helps in certain situations and used as a general treatment it does more harm than good.

Changing the fillword mode is disruptive for a link. If you really have a problem with a wrong fillword setting, this is not very concerning, because as stated above, the link initialization would have failed and the device wouldn't be online at that moment anyway. But for all the cases where the port is actually up and running, there will be a new link initialization. All current I/O belonging to this port will be void. There will be command timeouts. Error recovery needs to take place. Depending on the robustness of the attached device, this alone could lead to problems. And as if that weren't enough, I saw a lot of SAN admins changing the fillword mode even for normal E-Ports, which is complete nonsense. Believe me, you don't want to disturb your fabric stability by bouncing each and every ISL in your SAN environment within a short time without a solid reason.

And changing running ports to a more compliant fillword is certainly NOT a solid reason.

The sad part is that often the perceived problems improved after this action. But then a simple portdisable/portenable would most probably have had the same effect, too. It's like patients recovering - not because of bloodletting, but despite it.

There is a function in FabricOS with a command name so unsuspicious and innocent you might want to enable it without even knowing what it does:

portcfgislmode

Now doesn't that sound like: "If you plan to attach an ISL to this port, enable me! Enable me and I will prepare this callow port to be a real E-Port."?

Well, to some ears it seems to sound like that. But it's something completely different. This command enables the so-called "ISL R_RDY Mode". Hmm... that still doesn't sound dangerous. So what is it?

Virtual Channels 101

We talked about virtual channels before - and with "talked about" I mean I wrote about them here and here and briefly in some other articles. But for the sake of completeness let me explain what we need here.
A normal, classic ISL of a Brocade switch is partitioned into 8 so-called virtual channels. In some documentation you'll find the term "virtual circuit", which is basically the same.

Unlike a multiplexed link, an ISL still transports only one signal at a time. So what's partitioned is not really the ISL but its buffer credit management. That means you have a distinct buffer credit counter for each virtual channel, which is decreased every time a frame is transmitted on that particular virtual channel. The receiving switch recognizes for which virtual channel a frame is sent. To give buffer credits back as soon as it can receive more frames, the switch will send a VC_RDY. This is an ordered set (4-byte word) that contains the information for which virtual channel it carries a buffer credit.

In the classic ISL mentioned above we have basically 8 virtual channels, and only 5 of them are needed for most purposes in an FCP SAN: virtual channels 0 and 2-5. VCs 2-5 are the "data VCs"; they carry the user traffic - mainly the I/O between end-devices, but also control frames and administrative stuff between end-device and switch. Additionally, switches talk with each other in an own service class "Class F" using virtual channel 0. If you change the zoning, the new zoning configuration will be distributed in the fabric as Class F traffic. The same goes for devices coming online or when the principal distributes the current time.

Why Virtual Channels?

The advantage of virtual channels is that if there is a bottleneck further down the way - for example due to a slow drain device - not all traffic is blocked, but only the traffic mapped to the same virtual channel as that slow drain device. The other devices can still talk freely. It's even more important to separate normal traffic from Class F traffic. You don't want your fabric to stay in an inconsistent state because a slow drain device consumed the buffer credits. And you don't want it for another reason I will explain later.

Here you see the buffer credits of a classic ISL as they appear in the output of the root-level command portregshow (the register address below is just an example). The VCs count from left to right beginning with VC0; for this article you can ignore the credits for VC6 and VC7:
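bbc registers
=============
0xcdc32800: bbc_trc 4 0 5 5 5 5 1 1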

You can increase the buffers of an E-Port (which you would see in the other switch's portregshow then) in 2 ways:

1) With portcfgeportcredits as described here. This would give you additional buffer credits on each data virtual channel. It would look like this:

bbc registers
=============
0xcdc32800: bbc_trc 4 0 20 20 20 20 1 1

All of the data virtual channels have 20 buffer credits now. The 4 buffers for Class F would stay the same.

2) With a long distance mode. By doing that, the data VCs collapse to only VC2. The rest of the VCs stay the same again.

bbc registers
=============
0xcdc32800: bbc_trc 4 0 300 0 0 0 1 1

You see: You might change buffer credits, but it will only affect the data VCs. Class F traffic stays with 4 buffers regardless of the distance. And that's a good thing. Because Class F is a beast.

Principal ISLs

Normal I/O can be routed over any available ISL, as long as it is on one of the shortest paths according to the FSPF protocol (Fabric Shortest Path First). Class F traffic will usually not do that; it goes over principal ISLs. When the fabric is built, a spanning tree is built as well. It is rooted at the fabric's principal switch and reaches every switch in the fabric exactly one time (thus a spanning tree). You can see the involved ISLs in switchshow, marked with "upstream" and "downstream" depending on the direction towards or away from the principal.

So why is Class F a beast?

Class F has the highest priority (together with the F_RJT and ACK traffic from Class 2) - especially higher than any user traffic. But it is caged in VC0 with the fixed 4 buffers. So regardless of how long the distance is, it is always limited to these 4 buffers. It wants to bite the ones outside the cage, but it can't. In the fabric that means it can just take some time until all the fabric information is distributed, but that's fine. The processes are designed to cope with that.

But what if the beast breaks out of the cage?

ISL R_RDY mode is intended to be used on long distance links going over legacy SAN extenders or gateways that cannot cope with VC_RDYs or only transport the plain frames without any ordered sets at all. If you use ISL R_RDY mode, VC_RDYs cannot be used, and so there are no virtual channels. All virtual channels collapse into one big channel and it gets all the buffer credits. So in portregshow it looks like this:

bbc registers
=============
0xcdc32800: bbc_trc 200 0 0 0 0 0 0 0

The cage is broken, the beast is free. The highly prioritized Class F traffic is not limited to 4 buffers anymore but can consume all of them freely. And it's strange how much Class F traffic there suddenly is. Especially in situations where something changes in the fabric, user traffic is effectively blocked, and in extreme cases the user traffic waits for such a long time that it's dropped due to timeout. Additionally, a lot of back pressure spreads into other parts of the fabric and causes performance problems even for apparently unrelated devices.

Conclusion

If it's not absolutely necessary to use ISL R_RDY mode, you should NEVER enable it. It's a special purpose mode that should ONLY be used for that special purpose of connecting two parts of a fabric with legacy SAN extenders or gateways. You don't have that? Then don't use ISL R_RDY mode. You even have ISL R_RDY mode enabled on local ISLs? Switch it off! You really have a reason for having ISL R_RDY mode? In my eyes, now is the time to re-think your SAN architecture and figure out what changes have to be made to be able to live without it.
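For the record, checking and switching it off could look like this (a sketch - please verify the exact syntax in the command reference for your FOS version):

portcfgshow 1
portcfgislmode 1, 0

The first command lists, among many other settings, whether ISL R_RDY mode is enabled on port 1; the second one disables it.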

Almost a year ago I wrote an article about congestion bottlenecks in Brocade switches. I said you should avoid them, because they mean that you probably have no redundancy because of too much workload or you don't use it properly. You can use the bottleneckmon to detect them. On the other hand I cared much more about latency bottlenecks, often caused by slow drain devices and their implications. And so I do today.

Well...stop! Didn't you talk about congestion bottlenecks?

Yes! Today I want to explain how a congestion bottleneck can cause exactly the same symptoms on the devices as a latency bottleneck - and exactly the same performance degradations. This is how it happens. In the middle you see a SAN director with 2 portcards and 2 core cards. While the devices are connected to the portcards, the core cards provide the backend connections between them. They are internally connected via the backplane. So for example host 1's traffic to storage array A would traverse host 1's portcard, then one of the two core cards, and finally leave through the other portcard to reach storage array A. It could even be that two devices connected to the same portcard have to go over the core cards, because so-called local switching is only done within an ASIC, and a portcard can have more than one ASIC depending on the number of ports.

Now please meet host 2. Host 2 is a wonderful, modern server. One of the work horses of the datacenter. It's fully packed with virtual machines, but its many cores and memory, as well as its state-of-the-art HBA, provide enough horsepower to cope with the workload. This baby is more than capable of doing the work and it's in no way a slow drain device. It's zoned and mapped to the storage arrays A, B, C and D and it uses them heavily, mostly for read operations. The tiny green bars are read requests, and as you see in the next picture it sends them to all of the arrays, all of the time.

Of course the other hosts send requests, too, but let's focus on our diligent host 2. Yes, the pictures are too simplistic, but I'm sure you'll get the point. On the next one you see the first responses flowing back to host 2. As host 2 communicates with several storage arrays, the link towards it is used heavily, but host 2 processes the incoming frames quickly and gives buffer credits back to the switch in proper time. So far so good.

If you didn't enable bottleneckmon, the congestion bottleneck would still be there... you just wouldn't know it.

The crux is: you will hardly find a congestion bottleneck that just flows with high link utilization and no negative effects. The probability is much higher for the following scenario:

Although there are enough buffer credits for this highly utilized link, frames are piling up towards it, because there is just too much workload and the link is busy sending frames. There is no slow drain device, and to stay with the bathing metaphor: the drain works very well and transports as much water as it's physically able to. But there is so much more water in the tub than can go through the drain at the same time. And in addition, imagine you have not only one water tap (in our case storage arrays) but four of them. They fill the tub quicker than the drain can empty it. As a result the internal buffers for all the hops through the SAN director fill up (that's basically the tub) and finally the director needs to do something about it: It will slow down the sending of buffer credits to the devices. Not only devices that want to send frames directly to host 2, but due to back pressure also the ones that send frames in that rough direction (using the same internal connections, for example). And finally you'll end up in something like this:

The SAN director just behaves like a slow drain device itself!

Frames pile up inside the storage arrays and other end devices impaired by the slow drain behavior. If their RAS package is good, they will yell about credit starvation and probably even drop frames within their FC adapters. In extreme situations these frame drops could happen in the director, too. At least you would then see something that points you to a performance problem. Because otherwise - if there were substantial delays in the traffic but all frames finally got transferred to the next internal or external hop within the 500ms ASIC hold time - you would only see the congestion bottleneck. And without bottleneckmon you wouldn't see anything at all. The switch would look clean. Nothing in porterrshow or portstatsshow - both show only external port counters anyway. As a SAN administrator you would not suspect anything in the director to cause this.

And still it would be there. A big performance problem caused by a device communicating with too many other devices. Not a slow drain device, but still causing a slow drain in the SAN.

So how to solve it?

It's basically what I wrote a year ago, plus points 3 and 4 from How to deal with slow drain devices. You just have to ensure - from an architectural design point of view - that all components of the SAN are able to cope with the workload at any given time. It's both that easy and that complex. But the first step towards resolving such a situation is to detect it properly and to keep in mind what could happen.

I found an interesting question in our worldwide IBM-internal SAN support community the other day. A colleague asked if there is a command that returns all zones for a given WWPN on a Brocade switch. Although I thought there would be a command for a simple task like that, I did not find anything. So I thought about how to solve it.

Update:

Nick Lanneau made me aware in the comments that there is indeed a FabricOS command to do this: nszoneshow. A few months after this article IBM made FOS v7.1.0c available to its clients which allows you to do that by just executing:

nszoneshow -wwn <wwn>

Of course I should have noticed that as the initial version v7.1 was released around the start of 2013 and I have access to pre-release documents. I just didn't see it. Sorry for that and a big Thank you to Nick!

Original blog:

To be honest, I'm not a great Linux shell script expert. At work I troubleshoot SAN problems, also in conjunction with Linux, and I use what Linux provides on Brocade switches. But at home Linux is certainly not my hobby. So I have to admit my scripting kung fu is a bit rusty.

Enough excuses! - but a small disclaimer:

The following code snippets are provided "as is" without any warranty. If you use them, your switch might explode in a big colorful fireball or just melt through the rack. This blog article is written for the sole reason to give you an idea how the question could be answered. I'm not responsible for any problems you may face. To use the snippets you need to log in as root (for example to utilize "cut"), which is not really best practice.

Let's start

The easiest thing would be to grep the zoneshow output and just show some lines of context before the match on the filter argument (the WWPN).

zoneshow | grep --before-context=3 01:23:45:67:89:ab:cd:ef

But you usually get a lot of stuff you don't want to see if you use a high number for before-context, or you don't see your zone name if the number is too low, depending on how many other members are in the zone. So this is not cool.

If you use configshow instead of zoneshow you get a different output - and I really mean configshow here, not cfgshow. (Just force your brain for a moment to accept that there are different meanings of the word "configuration" in a Brocade switch ... *sigh*) In configshow, a zoning object - e.g. a zone - occupies a single line. That's much better for grep, but not exactly beautiful.

And in addition you might not find the actual zone, because aliases were used. So the first step was to write something that returns the alias name for a given WWPN (it works with domain,port, too). A sketch of what that could look like - note that the configshow line formats assumed below (alias.<name>:<members> and zone.<name>:<members>) are from memory and may vary:
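revalias () {
    # Search configshow for an alias containing the given WWPN (or domain,port)
    # and remember the alias name for the zone search.
    ALIASNAME=$(configshow | grep "^alias\." | grep "$1" | cut -d. -f2 | cut -d: -f1)
}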

If it finds an alias name for the WWPN, it stores it so that the second function can search for both the WWPN and the alias - along these lines:
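revzone () {
    # List all zones containing either the WWPN (or domain,port) itself
    # or the alias name found by revalias (falling back to the WWPN if none).
    revalias "$1"
    configshow | grep "^zone\." | grep -e "$1" -e "${ALIASNAME:-$1}" | cut -d. -f2 | cut -d: -f1
}

After you've executed both functions, you can find the zones in which a particular WWPN (or domain,port) is used by: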

revzone 01:23:45:67:89:ab:cd:ef

The output is then simply:

Zone1_littlebear
Zone2_bigbear

The functions will of course only exist in your current session. Next time you'll have to execute both definitions again to make them work, or you add them to your profile.

I'm sure there are smarter ways to script that, but again, this is just an idea for an approach. I really appreciate any feedback.

But how to do this on Cisco MDS switches?

How can you solve this problem in the Cisco CLI where you normally don't have the possibility to create functions or to use cut?
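(If I remember correctly, NX-OS offers a built-in lookup along the lines of show zone member pwwn <wwpn> - treat the exact syntax as an assumption and check the command reference. Corrections welcome in the comments!)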

There are some good videos out there on the STG Europe Youtube channel about infrastructures able to cope with analytics workloads. Distinguished Engineer John Easton discusses the requirements for this kind of workload in the video "IBM Big Data with John Easton" below:

He points out that it is more efficient to use large memory systems with high computing power like Power Systems or System z instead of multiple System x nodes working in parallel. The reason is the high I/O demand and the high wait times that result from using disk-based storage systems to share the data between the nodes during processing. Especially for real-time analytics he recommends having all the computation within the same box.

The same preference for a scale-up approach with high-powered systems over scale-out infrastructures is explained by Paul Prieto, Technical Strategist for Business Analytics, in the video "Choosing the right platform for Cognos Analytics":

Can flash make a difference?

With I/O performance being the main reason for avoiding a scale-out strategy, there is of course the question: What if the I/O performance could be drastically enhanced? Before IBM acquired Texas Memory Systems in 2012, their RamSan systems were rarely used to accelerate scale-out infrastructures, as far as I know. The main use case was to boost the few big boxes running highly productive applications but waiting for their I/O due to the inadequate I/O latencies provided by traditional disk storage systems. With their I/O latencies in the range of two-digit to lower three-digit microseconds and their capability to sustain several hundred thousand IOPS, they were used as a Tier 0 storage for only the most demanding and business-critical workloads.

With the integration of the product - now called IBM FlashSystem - into the IBM storage portfolio, another use case emerged and has since played a growing role in these deployments: IBM FlashSystem behind IBM SAN Volume Controller.

The pair "FlashSystem plus SVC" represents in fact two approaches:

Using SVC to virtualize the all-flash FlashSystem and enrich its raw I/O performance with the features you expect from today's virtualized storage solutions like seamless migrations, remote copy, thin provisioning, snapshots (FlashCopy) and many more.

Using FlashSystem to boost existing SVC-virtualized storage environments by using it for Easy Tier as well as for pure flash-based volumes.

Especially the second approach, combined with the wide range of supported host systems, HBAs, and operating systems, now makes a former no-go interesting: running applications with really high I/O demand, like analytics, on scale-out commodity systems while relying on impressive I/O performance available outside in the SAN. But of course - as always - it's not that simple. Yes, there will still be scenarios where such a scale-out approach is just not applicable. Especially then it might make much sense to speed up the storage even for the scale-up, purpose-built business analytics systems. However, for many companies - SMBs for example - it'd make perfect sense to run their analytics on flash-accelerated clusters of x86-based commodity hardware...

...if they do it right.

So how to do it right?

Well, this blog is not intended to explain reference architectures or architectural best practices for analytics. But I want to add the SAN point of view. (I guess you already wondered when this would start - given the usual topics of "seb's sanblog") And from my perspective as a SAN troubleshooter I can at least tell you what should be taken into consideration to not let it fail from the beginning. There are two major points: The general architecture and the hardening of the SAN. The proper architecture (for example by keeping the FlashSystem and SVC attached to the core) is the base, but a handful of issues could have an unacceptable impact on the performance. Many of them I already covered in earlier blog posts and some of them will be the topics of future ones.

With disk-based storage we talked about good average latencies of around 3ms. As the combination of FlashSystem plus SVC now works with a tenth of that and lower, the storage network's performance really starts to make a difference. Usually we talk about single-digit microseconds one-way from device to device in a well-designed SAN. But the issues described above could increase this into the range of hundreds of milliseconds. Then of course it will hardly be possible to provide real-time business analytics. Therefore it is important to harden the SAN with the possibilities you have today, like - speaking of Brocade fabrics - Fabric Watch, bottleneckmon, Advanced Performance Monitoring, port fencing, traffic isolation zones, and so on. Brocade's "Fabric Resiliency Best Practices" are a good first step in this direction.

Conclusion

I think it's still possible to create a scale out infrastructure for business analytics even - and especially - with SAN based storage, as long as it's optimally prepared and using IBM FlashSystem solutions to overcome the mechanically caused latencies of disk storage. But it's crucial to ensure that these benefits are not rendered void by avoidable performance problems.

IBM experts are more than willing to support you in this challenge. ;-)

To like working in tech support, you have to be the most optimistic guy around. You have to be even more optimistic about the product you support than the sales guy trying to sell it. Why? Because the product can be as fantastic as possible - jam-packed with jaw-dropping features - as a tech support guy you will only witness the bugs. However, the bugs are not what's annoying me. Well, at least most of them aren't. :o) Every software necessarily has bugs. They are my job - the very reason my job exists. What's really annoying is when I know that there is a problem, but the RAS package is just not good enough to let me troubleshoot it.

Therefore, I was pleasantly surprised when I read the release notes of the Fabric OS v7.1 codestream. There are a lot of tweaks and features that make the life of a troubleshooter easier. And it's not only about finding problems, it's about preventing them, too. So here is just a first selection of what I like:

Can I trust the counters?

"FOS v7.1 has been enhanced to display the time when port statistics were last cleared." says the release note. This sounds trivial, but it's essential for the troubleshooting of many problem types like performance problems, physical problems and so on. Times when we had to go through the CLI history - in the hope that the counters were cleared via CLI after a proper login - seem to be over now.

Link Reset Type in the fabriclog

A small enhancement, but a time-saving one. To get a time-based overview of the state changes of the ports, you usually have a look into the fabriclog. But there you often only see that there were link resets. The interesting thing would be to find out who initiated them - the local port or the remote one. The LR_IN and LR_OUT counters in portshow were an insufficient source of information here, as they show only absolute numbers. In Fabric OS v7.1 the type is simply part of the message and you see it at a glance.

Better SFP-awareness

For many admins the best practice to replace an SFP is to disable the port, then replace the SFP and afterwards re-enable the port. I know many people who did this, and I always felt uncomfortable telling them, "Rip it out while it runs, otherwise the switch won't recognize it correctly." But that's the way it is before v7.1: If the port is not running while you replace an SFP, the switch might not notice that, for example, the 4G LW SFP that was in there before is now an 8G SW SFP. Beside any ugly additional bugs that were possible based on that later on, the behavior itself was a pain. In v7.1 you don't have to care about that. Sfpshow will show you the correct information. Additionally, sfpshow will also tell you when the last automatic polling of the SFP's serial data took place.

Honest long distance

If you read SAN Myths Uncovered 2: The LD mode (Brocade) on my blog before, you know that the whole long distance stuff in Brocade switches is a little bit... let's say "optimistic". For long distance ISLs (other than long distance end-device connections) you only configure the length of the connection and the switch calculates the necessary amount of buffers. But as it does that using the maximum frame size, you'll end up with a buffer shortage for basically all real-world use cases. In Fabric OS v7.1 new functions take account of this fact. The command portbuffershow (by the way, a mandatory candidate for every data collection) will now show you the average frame size. So sooner or later I can mothball my article about How to determine the average frame size. And this value can then be used to optimize the buffer settings in the completely overhauled portcfglongdistance command. Now it will calculate the buffers based on your average frame size. Furthermore, it allows you to configure the absolute number of buffers yourself if you want. You don't need to tell your switch anymore that a distance is 200km just to assign enough buffers to span 60km with your real-world average frame size being far less than the maximum one. It's that kind of clarity that prevents misconceptions and avoidable performance problems.

This is not an exhaustive list of all the good new things. There are definitely more good features in direction of RAS like enhancements for credit recovery, Diagnostic Ports, FDMI, Edge Hold Time, FCIP and many others. In my eyes they'll make the platform even more robust and after all, it will hopefully give me a little more time to write more blog articles in the future. :o)

Oh wait... is this the call to update to v7.1 immediately?

Well, no, it's not. It's just an outlook for the things to come. Better plan your updates carefully. You know, it's just a blog article by the most optimistic guy around... ;o)

It's summertime again and for some of our customers it's the time to do their Fabric OS updates. Maybe you want to do that, too? I personally recommend a six month interval to go to the latest or the latest "mature" code, depending on your policy.

When you update to one of the latest v6.3x, v6.4x or v7x codes you might see your switch error log flooded by a new error message after the update:

No need to panic. Brocade just implemented a check for "stuck VCs" and it found one in your director. So the condition was there before; now, after the update, the Fabric OS is able to point at it and generates a warning message about it.

What is a stuck VC?

I explained VCs (Virtual Channels) a bit in the updated version of my article about "How to NOT connect an SVC in a core-edge Brocade fabric" and the one about Quality of Service. As I wrote there, each VC has its own buffer management - its own buffer credit counter and special VC-related 4-byte words (VC_RDYs) that re-fill only the buffer credits of a certain VC. A normal link to a device usually has only one buffer credit management, and if buffer credits are lost over time, performance usually decreases; once the last buffer credit is lost, a link reset will be issued after 2 seconds to regain the credits. Internal backlinks between cards in a director can lose buffer credits, too. But as they can only lose a buffer credit belonging to a VC, other VCs may still have buffer credits. So while the other VCs continue to run without any problems, only the VC which lost credits is affected. It's a so-called "stuck VC" now.

Wait! How can buffer credits be lost?

There are some reasons but I think the likeliest and most understandable one is a bit error corrupting the VC_RDY. If a bit is flipped in the VC_RDY the receiving port cannot recognize it anymore. The credit is lost. But "a few" bit errors are acceptable even in the Fibre Channel protocol. So this can happen even if everything works within the specs. The important thing is to detect it and react properly.

So I get these new messages and they tell me I have a problem. What now?

With FabOS v6.4.2a (and v6.3.2d, v7.0.0) Brocade extended the bottleneckmon command with an additional agent. This agent reacts to stuck VC conditions by doing a link reset on the specific backlink. This is a big improvement compared with the older codes: Stuck VCs on internal links between two blades used to require reseating one of the blades or powering it off for a moment.

But it's disabled and you have to switch it on!

To enable it, run:

bottleneckmon --cfgcredittools -intport -recover onLrOnly

Once enabled, the agent will monitor the internal links, and if there is a 2-second window without any traffic on a backlink with a stuck VC, it will reset it to resolve the stuck VC. This approach minimizes the impact of the link reset. But it could still happen that you see a few aborts in the host logs - which is usually self-recoverable. After that the messages should stop and you can use the full internal bandwidth of your switch again.
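A matching show option should let you verify the configuration afterwards - from memory, so treat the exact spelling as an assumption:

bottleneckmon --showcredittools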

Please have a look into the help page of the bottleneckmon command ("help bottleneckmon") for more information. And if you still get messages pointing to lost credits, please open a case and we'll have a look.

Let's imagine a core-edge fabric. A powerful switch (or director) in its center is the core. The SVC and its backend storage subsystems are directly connected to it. Beside that, there are also the ISLs to the edge switches where the hosts are connected. As there is an SVC in the fabric, all host traffic usually goes to the SVC, and the SVC is the only host for all the other storage. From time to time I see a cabling like the one below. The devices are connected in a common pattern. For example, SVC ports are always on port 0, 4, 8, ... or, on a director, for example on ports 0 and 16 of each card... Something like that. The reason behind that is often to spread the workload over several cards/ASICs to minimize the impact in case of a hardware failure. But there's a risk in doing so.

In the situation described above, all host traffic is passing the ISLs from the edge switches to the core. ISLs are logically "partitioned" into so-called virtual channels. Of course the ISL is still just one fibre and only one signal passes it physically at any time. The virtual channels are just dedicated portions of buffer credits, and the decision which virtual channel a frame takes - and therefore which portion of the buffer credits it uses - is made by looking into the destination Fibre Channel address.

Technical deep dive

A normal non-QOS ISL has 4 virtual channels for data traffic. For an 8G link each one of them has 5 buffers. They can only work with these 5 buffers and there is no possibility to "borrow" some out of a common pool like for QoS links. With the command "portregshow" you can see the buffer credits assigned to the virtual channels (I added the first line):

VC 0 1 2 3 4 5 6 7
0xe6692400: bbc_trc 4 0 5 5 5 5 1 1

Only VCs 2-5 are used for data traffic. This makes 20 usable buffers, which normally should be enough for a normal multimode connection between two switches in the same room with only some metres of cable length. Basically, the switch picks the data VC from the last two bits of the second byte (the area byte) of the destination address. The mapping looks like this (the concrete bit-to-VC assignment below is my assumption - the pattern is what matters):
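area byte ending in ...00 -> VC2 (ports 0, 4, 8, 12, ...)
area byte ending in ...01 -> VC3 (ports 1, 5, 9, 13, ...)
area byte ending in ...10 -> VC4 (ports 2, 6, 10, 14, ...)
area byte ending in ...11 -> VC5 (ports 3, 7, 11, 15, ...)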

In our imaginary core-edge fabric, where for example all SVC ports are connected to ports 0 (bin 00), 4 (bin 100), 8 (bin 1000), 12 (bin 1100), ... all host I/O towards the SVC would use the same virtual channel. As this is the only traffic that passes the ISLs from the edges to the core, only a quarter of the buffers is actually used! 5 buffers are very heavily in use and 15 are idling around, never to be filled. And 5 buffers are actually pretty few for an edge switch full of hosts that want to speak with the core switch where the SVC is connected. The result would be credit starvation and congestion on a virtual channel level.

How to solve that?

There are 3 possibilities:

1.) You could re-cable your SAN in a manner that all VCs are used. But beside the risk of physical problems and problems introduced by maintenance actions, the devices have to learn the new addresses of the SVC ports. For many operating systems this still means reboots or reconfigurations. It could involve a lot of work and risk of outages.

2.) You could just change the addresses with the portaddress command. This command is usually used in virtual fabric environments, and whether you can use it depends on the installed firmware and platform. While it avoids the physical actions, it still has the same disadvantages for the hosts because of the changed addresses.

3.) The best and least disruptive possibility might be to set the ISLs to LE mode. This is the long distance mode dedicated to links under 10km in length. It will not only put more buffers on the link (40 for user traffic on an 8G link compared with the 20 for a normal 8G E-Port) but will also collapse the 4 user traffic VCs into just one. It looks like this then:

VC 0 1 2 3 4 5 6 7
0xe6602400: bbc_trc 4 0 40 0 0 0 1 1

So all buffers - and therefore also all buffer credits - will be used by the hosts and nothing idles. There will of course be a short interruption while changing the ISL to LE mode, but beside that nothing changes for the hosts, because all the addresses stay the same. This is clearly the way to go in the situation described above.
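Configuring LE mode on, say, port 1 could look like this (a sketch - check portcfglongdistance in the command reference for your FOS version and platform, as there are further optional parameters):

portcfglongdistance 1 LE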

Just something strange for the end: Some switches are delivered from manufacturing with an alternative addressing pattern. For example, port 1 of domain 3 won't have the address 030100 then, but something like 030d00. In that case the problem can happen similarly, but on other ports. Using LE mode would solve it in pretty much the same way.

Please keep in mind that the whole article relates to a very special (although very common) SAN layout in an SVC-centered environment. This is clearly not a standard action plan for all performance problems but it could help if you have a customer in a situation like this. For any questions, feel free to contact me.

Additionally, please be aware that this is not an SVC problem by itself but will happen with every central storage connected to a switch using a pattern as described above and being used by hosts connected to another switch over an ISL!

Update from May 9th:

I was made aware that readers of this article queried their vendors, maintenance providers or business partners with the idea of just setting all their ISLs to LE mode, regardless of whether the condition described above is actually met. Because of that, I would like to state more clearly: Using LE mode as a general approach for your ISLs can cause severe problems!

If the SVC ports are not connected in a way that only one Virtual Channel would be used, it actually makes sense to have ISLs with more than one VC. Virtual Channels are a good feature to prevent a latency bottleneck due to back pressure from impairing the traffic of all devices using the same ISL. If devices on the edge switches communicate with other devices connected to other ports of the core (or other edges) as well, the impact of using LE mode would be even more extreme in the case of slow drain devices.

I made some drawings to illustrate this. The first one shows 1 normal ISL between the edge and the core. You can see the 4 VCs used for data traffic. (I left out the other VCs for better visibility):

Here hosts 1 and 2 drive traffic against the SVC (green), host 3 against an additional disk subsystem (purple) and host 4 against a tape drive (orange). Based on the ports these devices are connected to, different VCs are used for that traffic.

If you would use an LE-port instead, it would look like this:

Now all 4 data traffic VCs collapsed to a single one. As long as everything runs smoothly, you won't see an impact.

But if, for example, one of the devices connected to the core is slow-draining, the following will most probably happen:

In the picture above the purple disk is a slow drain device. Due to back pressure the whole ISL will become a latency bottleneck, because all data traffic shares the same VC in LE mode. The back pressure goes further towards the edge switch, and all 4 hosts of our example are affected now, although only host 3 communicates with the slow drain device!

With a normal E-port it looks like this:

Now only VC4 is affected while VC2, 3 and 5 are running smoothly, because they have their own, unaffected buffer management. Therefore only host 3 will face a performance problem while the hosts 1, 2 and 4 are running fine.

You see: Using LE mode for the purpose described in my original article only makes sense if these special conditions are really met. In all other cases it can impair the SAN performance tremendously!

I claim that in 2012 performance problems will keep their place amongst the most frequent and most impacting problems in the SAN. In many of the cases the client's users really notice a performance impact and so the admin calls for support. Other support cases are opened because of performance-related messages like the ones from Brocade's bottleneckmon or Cisco's slowdrain policy for the Port Monitor. Beside that, there are also cases that don't really look like performance problems from the start but turn out to occur for the same reasons. "I/O abort" messages in the device log, link resets, messages about frame drops, failing remote copy links, failing backup jobs or - even worse - failing recoveries: these could all be "performance problems in disguise".

When I analyze the data and find out that a slow drain device or congestion is the real reason for the problem, I write my findings down and try to give the client some hints about possible next steps, for example by mentioning my earlier blog article about How to deal with slow drain devices.

Do you know what's mean about it?

Often clients have never heard of slow drain devices before. Longtime storage administrators are confronted with a term that sounds like a support guy made it up to fingerpoint at another vendor's product. Of course I usually explain what it is and what it means for the fabric and for the connected devices. But to be honest, I would be sceptical, too. I would go to the next search engine and query "slow drain device". The first hits are from this blog and from the Brocade community pages, and there are some questions about that topic. Considering the substance of posts in public forums, I would check Brocade's own SAN glossary. Guess what? Not a word about slow drain devices - which is no surprise, as it's from 2008. I would check Wikipedia. Nothing. My fellow blogger Archie Hendryx mentioned that it's missing in the SNIA dictionary, too. And he's right: Nothing!

So why is that so?

Why are the terms "HTML" and "export" explained in the dictionary of the Storage Networking Industry Association while there is not a single appearance of the term "slow drain device" on the complete SNIA website (according to its built-in search function)? Well, I don't know, but of course we can change that. The SNIA dictionary makers are asking for contributions, so if you have a term that has a meaning in the storage industry, feel free to send them a definition for the next release. I thought about doing that as well for some of the SAN performance-related terms I didn't find in the dictionary. Below you'll find some definitions that I wrote. But I'm not infallible and therefore I would like to have an open discussion about them. Let me know what you think about them. Let me know if your understanding of a term (used in the area of SAN performance, of course) differs from mine. Let me know if my wording hurts the ears of native English speakers. Let me know if you have a better definition. Let me know if there are important terms missing. And let me know if you think a term is not really so generally used or important that it should appear in the SNIA dictionary - side by side with sophisticated terms like Tebibyte :o).

My definitions:

slow drain device - a device that cannot cope with the incoming traffic in a timely manner. Slow drain devices can't free up their internal frame buffers and therefore don't allow the connected port to regain its buffer credits quickly enough.

congestion - a situation where the workload for a link exceeds its actual usable bandwidth. Congestion happens due to overutilization or oversubscription.

buffer credit starvation - a situation where a transmitting port runs out of buffer credits and therefore isn't allowed to send frames. The frames will be stored within the sending device, blocking its buffers, and eventually have to be dropped if they can't be sent for a certain time (usually 500ms).

back pressure - a knock-on effect that spreads buffer credit starvation into a switched fabric, starting from a slow drain device. Because of this effect a slow drain device can affect apparently unrelated devices.

bottleneck - a link or component that is not able to transport all frames directed to or through it in a timely manner (e.g. because of buffer credit starvation or congestion). Bottlenecks increase the latency or even cause frame drops and upper-level error recovery.

Feel free to use the comment feature here or tweet your thoughts with hashtag #SANperfdef. If you add @Zyrober in the tweet, I'll even get a mail :o)

I updated the definitions with an additional sentence. Feel free to comment.

When Brocade released FabricOS v6.0 in 2007, Quality of Service sounded like a great idea: it allows you to prioritize traffic flows down to the level of certain device pairs. There are 3 levels of priority:

High - Medium - Low

Inter Switch Links (ISLs) are logically partitioned into 8 so-called Virtual Channels (VCs). Basically each of them has its own buffer management, and the decision which Virtual Channel a frame should use is based on its destination address. If a particular end-to-end path is blocked or really slow, the impact on the communication over the other VCs is minimal. Thus only a subset of devices should be impaired during a bottleneck situation.

Quality of Service takes this one step further.

QoS-enabled ISLs consist of 16 VCs. There are slightly more buffers associated with a QoS ISL, and these buffers are equally distributed over the data VCs (there are some "reserved" VCs for fabric communication and special purposes). The number of VCs is what makes the priority work - the most VCs (and therefore the most buffers) are dedicated to the high priority, the fewest to the low one. Medium lies in the middle, obviously. So the more important I/Os benefit from more resources than the not so important ones.

Sounds like a great idea!

Theoretically you can configure the traffic flows in your fabric - in terms of buffer credit assignment - in a very fine-grained way. But that's in fact also the big crux: you have to configure it! That means you actually have to know which host's I/O to which target device should have which priority. Technically you create QoS zones to categorize your connections: low priority zones start with QOSL, high priority zones start with QOSH, and zones without such a prefix are considered medium priority.
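A minimal sketch - the zone names and WWNs here are made up, only the naming prefix matters:

zonecreate "QOSH_host1_svc", "10:00:00:00:c9:2b:c1:11; 50:05:07:68:01:40:aa:bb"
zonecreate "QOSL_backup_tape1", "10:00:00:00:c9:2b:c2:22; 20:04:00:a0:b8:17:44:32"

After adding these zones to the effective configuration (cfgadd / cfgenable), traffic between the members of the first zone runs with high priority, traffic in the second with low priority, and everything else stays medium.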

But how to categorize?

That's the tricky part. The company's departments relying on IT (virtually all of them) have to bring their needs into the discussion. Maybe there are already different SLAs for different tiers of storage and an internal cost allocation in place. The I/O prioritization could go along with that, and of course it has to be taken into account to effectively meet the pre-defined SLAs. If you have to start from scratch, it's more a project of weeks and months than a simple configuration task. And there is a lot of psychology in it. Besides that, you really have to know in detail how QoS works to design a prioritization concept. For example, if you have 20 high priority zones and 50 with medium priority but only 3 low priority zones, the low ones could even perform better - because only 3 zones share the resources of the low priority VCs, while 20 share the high priority ones. In the four years since its release I have seen only a couple of customers really attempting to implement it.

In addition you need to buy the Adaptive Networking license!

So why should I care?

If QoS is such a niche feature, why blog about it? Usually a port is already configured for QoS when it comes from the factory. You can see it in the output of the command "portcfgshow". A new switch will have QoS in the state "AE", which means auto-enabled - in other words "on". An 8Gig ISL will be logically partitioned into the 16 VCs as described above, and the buffer credits will be assigned to the high, the low and the medium priority VCs. But that does not mean that you actually benefit from the feature, because you most probably have no QoS zones! And so all your I/O shares only the resources allocated for the medium priority. A huge part of the available buffers is reserved for VCs you cannot use! So as a matter of fact you end up with fewer buffers than without QoS, and in many cases this made the difference between a smoothly running environment and immense performance degradation.
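You can verify this on your own gear (output details vary with FOS version and platform):

portcfgshow

Look for the "QOS E_Port" row: AE or ON means an ISL on that port will come up QoS-enabled with the 16-VC layout described above.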

Conclusion:

If you don't plan to design a detailed and well-balanced concept for the priorities in your SAN environments, I recommend switching off QoS on the ports. I don't say QoS is bad! In fact, with the Brocade HBAs' possibility to integrate QoS even into the host connection - enabling different priorities for virtualized servers - you have the possibility to better cope with slow drain device behavior. But done wrong, QoS can have a very ugly impact on the SAN's performance!

Better know the features you use well - or they might turn against you...

Update:

As this was not clear enough in the text above and I got a question about it, please be aware: disabling QoS is disruptive for the link! With most FabricOS versions in combination with most switch models, the link will be taken offline and online again as soon as you disable it. In some combinations you'll get the message that it will take effect with the next reset of the link. In that case you have to portdisable / portenable the port yourself.

As this is a recoverable, temporary error, your application most probably won't notice anything, but to be on the safe side you should do it in a controlled manner and - if really necessary in your environment - in times of little traffic or even a maintenance window. The command to disable it is:
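portcfgqos --disable [slot/]port

(To turn it back on later, it's portcfgqos --enable [slot/]port accordingly.)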

A slow drain device often has a huge impact on the performance of many other devices in a SAN environment. That happens because it blocks resources in the fabric that other devices use as well. The prime example of such a resource is an ISL, particularly the Virtual Channel(s) within that ISL that are used to reach the slow drain device. But as soon as you have an appliance in the SAN, this appliance can turn into such a blocked resource as well.

Disclaimer: There are several definitions and types of appliances. Within this article an appliance is a device "in the middle" between the hosts and the storage systems with a specific task, such as a compression, encryption, virtualization or deduplication appliance. While I had the SAN Volume Controller (SVC) in mind when I wrote this, it applies to many other products matching this definition. What they have in common is that the performance they can provide depends to some degree on the performance of their destination devices.

Fortunately many of the fabrics I have seen in recent years were designed using a core-edge approach. If a device is in the communication path of many of the devices in a SAN, it's best practice to attach it directly to the core. But a slow drain device can still block it. This is how it happens:

In this sketch the appliance sends data towards a slow drain device. The slow drain device is not able to process the incoming frames quickly enough - they pile up in its HBA's ingress buffers (1). As the appliance keeps sending frames but the edge switch cannot forward them to the slow drain device, they also pile up in the ingress buffer of the ISL port of the edge switch (2). This could already impair the performance of the other host connected to the same edge switch as the slow drain device - if the frames towards it use the same VC. Some microseconds later the same might happen to the frames from the appliance entering the core (3). They pile up there as well, and as soon as that happens, this so-called back pressure reaches the appliance itself. As there are no VCs on the F-to-N-port connection used to attach the appliance to the core, the chance is high that the appliance cannot send any frames out to the SAN anymore - no matter to which destination (4).

That means?

Well, that means you just turned your appliance into a slow drain device itself! The performance of the whole environment is heavily impaired now:

In step (5) the frames from the other hosts towards the appliance pile up in the core as well, and then the back pressure spreads further to the hosts connected to the edge switches (6).

Worst case, hmm?

After the ASIC hold time is reached (usually 500ms) the switches will begin to drop frames to free up buffers again. But as all switches have the same ASIC hold time, you'll end up in a situation where the edge switch reaches these 500ms first, but the core switch likewise starts to drop frames before the buffer credit replenishment information (VC_RDY) from the edge switch arrives. So not only the frames from the communication with the initial slow drain device will be dropped, but most of the others down the path as well. And as the appliance itself has turned into a slow drain device, the same might happen to the frames that piled up because of that, too.
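By the way, you can see whether this is already happening: the resulting class-3 discards show up in the "disc c3" column of porterrshow, and FOS v7.x can even log the source/destination pairs of the discarded frames - as far as I recall via:

porterrshow
framelog --show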

So what to do against it?

The first thing is: give the F-ports of the appliance as many buffers as possible. Priority 1 should be that the appliance is able to send its frames out into the fabric, so the chances are higher that, once the frames of the open I/Os against the slow drain device are out there, some buffer credits are still left to send frames to other devices. For clustered appliances like the SVC it's even more important, because they use these ports for their cluster-internal communication as well. Blocked ports could then result in cluster segmentation (SVC: single nodes rebooting due to "Lease expiry"). To assign more buffers to the switch port (= more buffer credits for the port of the appliance), use

portcfgfportbuffers --enable [slot/]port buffers

Update: Please keep in mind that adding more buffers to an F-port is of course disruptive for the link!

To check how many buffers are available, you can use

portbuffershow
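A sketch of the whole sequence, with a made-up port 2/15 and a made-up buffer count of 12 - pick values that fit the remaining buffers shown for the port's ASIC group:

portbuffershow
portcfgfportbuffers --enable 2/15 12
portbuffershow

And as per the update above: expect the link to bounce when the new allocation is applied.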

But in many cases this is not enough. Some time ago Brocade released the Fabric Resiliency Best Practices paper with some good advice. In my opinion every SAN admin with Brocade gear should have read it. It recommends:

Use Fabric Watch to get alarms for frame timeouts (Erwin von Londen wrote a good article about that).

Configure bottleneckmon to get alarms for latency and congestion bottlenecks (a sketch follows below).
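For the bottleneckmon part, a minimal sketch of the switch-wide alerting (the thresholds and times are examples only and need tuning for your environment; syntax as of FOS v6.4, as far as I recall):

bottleneckmon --enable -alert -lthresh 0.1 -cthresh 0.8 -time 300 -qtime 300
bottleneckmon --status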

While Fabric Watch is used more and more - especially in the FICON world, but also for open systems, where I see some of our customers using port fencing - I hardly see anyone utilizing the Edge Hold Time feature. For a situation as described above it could really improve things for the appliance and the other hosts dramatically. It can be set to any value between 100ms and 500ms and was introduced in FOS v6.3.1b. So if you expect hosts connected to an edge switch to behave slow draining in certain situations, in my opinion the Edge Hold Time of that switch should be set as low as possible. Of course it always depends on your environment and how likely it is to be impaired by a slow drain device, but 100ms is a long time in a SAN. If you also have some legacy devices connected to these edge switches, check whether a decreased hold time could be a problem for them.

It can be enabled and configured using the "configure" command, where it can be found in: