To like working in tech support, you have to be the most optimistic guy around. You have to be even more optimistic about the product you support than the sales guy trying to sell it. Why? Because the product can be as fantastic as possible - jam-packed with jaw-dropping features - as a tech support guy you will only witness the bugs. However, the bugs are not what's annoying me. Well, at least most of them. :o) Every piece of software necessarily has bugs. They are my job, the very reason for its existence. What really annoys me is when I know that there is a problem, but the RAS package is just not good enough to let me troubleshoot it.

Therefore, I was pleasantly surprised when I read the release notes of the Fabric OS v7.1 codestream. There are a lot of tweaks and features that make the life of a troubleshooter easier. And it's not only about finding problems, it's about preventing them, too. So here is just a first selection of what I like:

Can I trust the counters?

"FOS v7.1 has been enhanced to display the time when port statistics were last cleared," says the release note. This sounds trivial, but it's essential for troubleshooting many problem types, like performance problems, physical problems and so on. The times when we had to go through the CLI history - hoping that the counters had been cleared via CLI after a proper login - seem to be over now.

Link Reset Type in the fabriclog

A small enhancement, but a time-saving one. To get a time-based overview of the state changes of the ports, you usually have a look into the fabriclog. But there you often only see that there were link resets. The interesting thing would be to find out who initiated them - the local port or the remote one. The LR_IN and LR_OUT counters in portshow were an insufficient source of information here, as they only show absolute numbers. In Fabric OS v7.1 the type is simply part of the message and you see it at a glance.

Better SFP-awareness

For many admins the best practice for replacing an SFP is to disable the port, replace the SFP and afterwards re-enable the port again. I know many people who did this, and I always felt uncomfortable telling them, "Rip it out while it runs, otherwise the switch won't recognize it correctly." But that's the way it is before v7.1: If the port is not running while you replace an SFP, the switch might not notice that, for example, the 4G LW SFP that was in there before is now an 8G SW SFP. Apart from any ugly additional bugs that could build on that later on, the behavior itself was a pain. In v7.1 you don't have to care about that anymore. Sfpshow will show you the correct information. Additionally, sfpshow will also tell you when the last automatic polling of the SFP's serial data took place.

Honest long distance

If you read SAN Myths Uncovered 2: The LD mode (Brocade) on my blog before, you know that the whole long distance stuff in Brocade switches is a little bit... let's say "optimistic". For long distance ISLs (as opposed to long distance end-device connections) you only configure the length of the connection and the switch calculates the necessary amount of buffers. But as it does that using the maximum frame size, you'll end up with a buffer shortage in basically all real-world use cases. In Fabric OS v7.1 new functions take this fact into account. The command portbuffershow (by the way a mandatory candidate for every data collection) will now show you the average frame size. So sooner or later I can mothball my article about How to determine the average frame size. This value can then be used to optimize the buffer settings in the completely overhauled portcfglongdistance command, which will now calculate the buffers based on your average frame size. Furthermore, it allows you to configure the absolute number of buffers yourself if you want. You no longer need to tell your switch that a distance is 200km just to assign enough buffers to span 60km when your real-world average frame size is far less than the maximum one. It's that kind of clarity that prevents misconceptions and avoidable performance problems.
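To see why the average frame size matters so much here, a back-of-the-envelope sketch in Python may help. This is not Brocade's actual algorithm, just the underlying physics, assuming roughly 200,000 km/s signal speed in fiber and the nominal FC data rates:

```python
import math

# Nominal payload data rates in bytes/s for common FC speeds (e.g. 8GFC
# is usually quoted as ~800 MB/s). Rough assumptions for an estimate only.
DATA_RATE = {4: 400e6, 8: 800e6, 16: 1600e6}

LIGHT_SPEED_FIBER_KM_S = 200_000  # ~c divided by the fiber's refractive index

def buffer_credits(distance_km, speed_gbps, frame_bytes):
    """Estimate the BB credits needed to keep a long-distance ISL busy.

    One credit is consumed per frame in flight, regardless of frame size,
    so dividing the round-trip time by the per-frame serialization time
    gives the number of credits needed to avoid stalling the link.
    """
    rtt_s = 2 * distance_km / LIGHT_SPEED_FIBER_KM_S
    frame_tx_s = frame_bytes / DATA_RATE[speed_gbps]
    return math.ceil(rtt_s / frame_tx_s)

full = buffer_credits(60, 8, 2112)   # maximum-size frames: 228 credits
small = buffer_credits(60, 8, 512)   # a smaller average frame: 938 credits
```

At 60 km and 8 Gbps, a link carrying only maximum-size frames needs about 228 credits, while the same link with a 512-byte average frame size needs roughly 938 - which is exactly why a distance-only configuration based on the maximum frame size runs out of buffers in real life.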

This is not an exhaustive list of all the good new things. There are definitely more RAS-related features, like enhancements for credit recovery, Diagnostic Ports, FDMI, Edge Hold Time, FCIP and many others. In my eyes they'll make the platform even more robust, and hopefully that will give me a little more time to write more blog articles in the future. :o)

Oh wait... is this the call to update to v7.1 immediately?

Well, no, it's not. It's just an outlook on things to come. Better plan your updates carefully. You know, it's just a blog article by the most optimistic guy around... ;o)

The term ecological footprint describes the total impact of someone or something on the environment. To achieve sustainability, this footprint should be kept as low as possible. We should not demand more from Mother Nature than she can provide, and of course we should not demand more than we actually need. Sounds simple, but the reality is way more complex. In the area of IT, the term Green IT was coined to describe and consolidate all the rules, actions and requirements to decrease the ecological footprint for the sake of sustainability. And IBM has a broad agenda about this. But we often forget what each one of us could do to be a little greener.

In technical support we deal with defects. Our clients have the right to have a product working within the specifications. If a part is working outside its specifications, it has to be repaired or replaced. That's it.

And what's "green" about that?

The impact on nature happens when a part gets replaced that was not really broken. No manufacturing process can be so "green-optimized" that it beats simply not replacing a part that is in good order. There is the mining (and/or recycling) of the materials, the chemicals and energy used during processing, the packaging, the stocking and of course the logistics, too. In the end, even a small part like a fan can have a huge ecological footprint. This can only be avoided by replacing only the broken part. There's just one problem with that:

What if you can't tell which part is broken?

A classic example of that is a physical error in the SAN. In my article about CRC I pointed out how to use porterrshow to find physical errors and - even more importantly - how to find the connection where the physical error is really located. But that's all the data allows: You can only track it down to the connection. The connection usually consists of the sending SFP, the cable (plus any additional patch panels and couplers in between), and the receiving SFP. There is no reliable and technically justifiable way to tell which one is the culprit from porterrshow alone. I know there are some "whitepapers" available on the web stating that this combination of "crc err" and "enc in" means this, and that combination of "crc err" and "enc out" means that. But from a technical point of view that's nonsense.

So you have a physical problem, what to do?

When it comes to cables, my fellow IBM blogger Anthony Vandewerdt just released a great article about the impact of dust today. Other reasons for a cable to cause physical problems could be a too-small bending radius or loose couplers. In times of fully populated 48- or even 64-port cards, the front side of a SAN director often looks like the back of a hedgehog. With every maintenance action on one of the cables, you can expect the CRC error counters of the surrounding ports to increase. So in many situations the cable is not really broken, and replacing it wholesale just because of the counter is not eco-friendly.

The same goes for SFPs. You see physical errors increasing in porterrshow for a specific port. That could mean that the SFP in there is broken, because its "electric eye" doesn't interpret the (good) incoming signal correctly. It could also mean that the SFP on the other end of the cable is broken, because it sends out a signal in bad condition. Both will lead to the very same counter increases in porterrshow. If you replace them both as the first action, you most probably replaced at least one good one.

Given that you have redundancy in your SAN environment (which you should ALWAYS have), you have free ports available, and the multipath drivers of the hosts using the affected path are working properly, you can track down the culprit by plugging the cable into another SFP in another port and seeing whether the error stays with the port or with the cable.
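The decision logic of that swap test is simple but worth writing down explicitly. A hypothetical sketch - the function name and input are my own illustration, not any Fabric OS feature:

```python
def swap_test_verdict(errors_moved_with_cable: bool) -> str:
    """Interpret the cable-swap test described above.

    After moving the suspect cable (with its far-end SFP still attached)
    to a spare port, check the error counters again:
      - errors now increase on the spare port: the cable or the remote,
        sending SFP is suspect
      - errors keep increasing on the original port (with a known-good
        cable plugged in): the local port/SFP is suspect
    """
    if errors_moved_with_cable:
        return "suspect cable or remote SFP"
    return "suspect local port or SFP"
```

Only one of the parts ends up on the replacement list instead of all three.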

Please keep in mind that the port address ("the IP address of the SAN") could change along with the port (if you don't have Cisco switches). On Brocade switches you need to do a "portswap" to swap the port addresses as well.

If you cannot touch the other ports, Brocade built some tests into Fabric OS for you, like "porttest", "portloopbacktest" and "spinfab". Please have a look into the Command Line Interface Reference Guide for your Fabric OS version for more information about them. With these tests in combination with a so-called loopback plug, it's easy to find out which part is really broken. Loopback plugs look like the end of a cable but just physically redirect the SFP's TX signal into its RX connector.

Mother Nature will be thankful

There is just one thing from above I want to pick up: parts working within their specifications. Not every single CRC error is a reason to replace hardware. The Fibre Channel standard requires a BER (Bit Error Rate) of at most 10^(-12) for the protocol to work properly. At 8 or even 16 Gbps that means it's allowed - and fully compliant with the FC protocol - to have bit errors quite often. Here is where common sense must come into play. If you see 2-digit increases of the CRC error counter within an hour, it might be a good idea to determine which part to replace with the steps mentioned above.
If you see a single CRC from time to time - sometimes with days of no errors, sometimes with "some" per day - that's perfectly fine with the FC protocol and well within the specifications. It could lead to single temporary and recoverable errors on a host, but nothing has to be replaced as long as the rate doesn't increase significantly. You wouldn't replace your one-year-old tires just because the tread is only 90% of what it was when you bought them.
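The arithmetic behind "quite often" is simple. A quick sketch, assuming the nominal data rate in bits per second and ignoring line coding overhead:

```python
def worst_case_seconds_between_errors(data_rate_gbps: float,
                                      ber: float = 1e-12) -> float:
    """Mean time between bit errors on a link that just meets BER = 1e-12."""
    bits_per_second = data_rate_gbps * 1e9
    return 1.0 / (bits_per_second * ber)

t8 = worst_case_seconds_between_errors(8)    # 125 s, i.e. ~29 errors/hour
t16 = worst_case_seconds_between_errors(16)  # 62.5 s
```

So a link that only just meets the spec may, in the worst case, show a bit error roughly every two minutes at 8 Gbps - which is why a handful of CRCs per day is no cause for panic.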