March 1, 2010

I’ve recently started working extensively with iSCSI-based SAN storage, having previously worked only with Fibre Channel storage. One unexpected benefit of iSCSI is the ease of capturing SCSI traffic for analysis. Because iSCSI rides on the TCP/IP stack, with Ethernet as the physical layer, capturing traffic is as trivial as installing Wireshark. Capturing Fibre Channel traffic for analysis, on the other hand, typically requires a fiber tap and expensive analyzer hardware.
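As a rough sketch (the interface name eth0 and the capture filename are just examples; 3260 is the standard iSCSI target port), a capture can be as simple as:

```shell
# Capture all iSCSI traffic (TCP port 3260 by default) with full-length
# packets (-s 0) and write it to a file for later analysis.
tcpdump -i eth0 -s 0 -w iscsi-trace.pcap 'tcp port 3260'

# The resulting capture file can then be opened in Wireshark, where the
# "iscsi" display filter isolates the SCSI CDBs carried inside the iSCSI PDUs.
```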

I have had experience with SCSI-3 Persistent Reservations in the past, but while recently troubleshooting a SCSI-3 PR issue, I decided to capture some sample SCSI-3 Persistent Reservation traffic with Wireshark for reference analysis. I built a CentOS 5u3 virtual machine, installed and configured the iscsi-initiator-utils package, and then installed the sg3_utils package. sg3_utils consists of a number of extremely useful tools which allow you to manipulate devices directly via SCSI commands. The tool I wanted to use is called sg_persist and is used to send PROUT and PRIN commands (more on that in a moment).
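To give a flavor of what sg_persist usage looks like (assuming the iSCSI LUN shows up as /dev/sdb; your device name will differ), the PRIN side queries the current reservation state of a LUN:

```shell
# PRIN (Persistent Reserve In) service actions: read-only queries of the
# device's persistent reservation state. /dev/sdb is an assumed device name.

# List the reservation keys currently registered with the LUN
sg_persist --in --read-keys /dev/sdb

# Show the current reservation holder (if any) and the reservation type
sg_persist --in --read-reservation /dev/sdb
```

Since these are read-only service actions, they are safe to run against a shared LUN while troubleshooting.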

Before we go any further, a little background is required…

The SCSI protocol standards, maintained by the T10 technical committee of the International Committee for Information Technology Standards (INCITS), are split into a large number of specifications covering various aspects of SCSI. The overarching standard is known as the SCSI Architecture Model, or SAM, now in its fifth generation (SAM-5). The SAM ties together all of the SCSI standards and provides the requirements which the myriad SCSI specifications and standards must meet.

Of primary concern to us now is the SPC, or SCSI Primary Commands standard, which defines SCSI commands common to all classes of SCSI devices. A given SCSI device will conform, in the least, to the SPC and a standard specific to that class of device. For example, a basic disk drive will conform to the SPC and the SBC (SCSI Block Commands) specifications.

Device reservations are handled in one of two ways: the older RESERVE/RELEASE method specified in the SCSI-2 specs, and the newer persistent reservation method of SCSI-3. Reservations allow a SCSI device (typically a LUN on a SAN-based storage array) to maintain a list of initiators which can and cannot issue commands to it. Reservations, whether the older method or PR, are what allow more than one server to access a shared set of storage without stepping on one another.

SCSI-3 Persistent Reservations offer some advantages over the older RESERVE/RELEASE method–primarily by allowing the reservation data to be preserved across server reboots, initiator failures, etc. The reservation will be held by the array for the LUN until it is released or preempted.

One important thing to keep in mind is that the function of persistent reservations is to prevent a node from writing to a disk when it does not hold the reservation. The reservation system will not prevent another node from preempting the existing reservation and then writing to the disk. It is the responsibility of the server-side application making use of the persistent reservations to ensure that cluster nodes act appropriately when dealing with reservations.
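That preemption scenario can be sketched with sg_persist (the device name /dev/sdb and the keys 0x1 and 0x2 are made up for the example; type 1 is a Write Exclusive reservation):

```shell
# Node A: register a reservation key, then take a Write Exclusive reservation
sg_persist --out --register --param-sark=0x1 /dev/sdb
sg_persist --out --reserve --param-rk=0x1 --prout-type=1 /dev/sdb

# Node B: nothing in the protocol stops it from registering its own key...
sg_persist --out --register --param-sark=0x2 /dev/sdb

# ...and then preempting node A's reservation outright, after which node B
# holds the reservation and may write to the disk
sg_persist --out --preempt --param-rk=0x2 --param-sark=0x1 --prout-type=1 /dev/sdb
```

The array happily carries out the preempt; keeping node A from corrupting data afterward is the cluster software's job, not the array's.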

In part II we will look at the two SPC commands dealing with persistent reservations, PRIN (Persistent Reserve In) and PROUT (Persistent Reserve Out), and their associated service actions.

November 10, 2008

Over the past six months a large portion of my job function has been to develop courseware and then use that material to teach, for both internal employees and customers. (This was the primary reason I was in Japan in May/June).

While I think I’ve been doing pretty well at it, I’d love to be able to further polish my skills in the area. To that end, I tried doing some research on the Web on this topic. Unfortunately, I didn’t find all that much of use.

I’d like to hear from other readers out there of resources available to the technical trainer, both on teaching techniques and courseware development. Keep in mind that the material I am teaching is designed for highly skilled sysadmins and SAN and network admins, so info relating to this kind of high-end topic would be beneficial. Information on professional groups or organizations that cater to the field would also be appreciated.

June 5, 2008

I’ve been in Japan for the past two weeks and I’ve been having a blast. I arrived on Sunday May 24th and unfortunately I’ll be leaving tomorrow. I miss my family, but feel like there is a lot of stuff I haven’t seen. Of course, since I’m here for work, I haven’t had a lot of time to see things. However, I’ve used my camera at every opportunity I could. The photos are available for viewing here, Japan Photo Album. Please pop in and take a look and let me know what you think.

I will write more when I return to the states next week. I’ve been so busy with work and sightseeing that every available waking hour has been used.

March 27, 2008

There was a flurry of activity earlier this week by bloggers and reporters as Dell announced it was partnering with Egenera, Inc. to OEM Egenera’s PAN Manager virtualization management software on its PowerEdge servers.

Egenera, Inc. is the leader in a market segment that IDC has begun calling “Virtualization 2.0”, or as Egenera defines it, “Data Center Virtualization.” Until now, Egenera’s PAN Manager management software was only available on their high-end BladeFrame hardware platform. With the Dell announcement, Egenera has begun to expand its PAN Manager software framework to other platforms.

The aforementioned Ideas International story essentially sums up the value of the Dell-Egenera partnership with the following statement: “PAN Manager allows IT managers to create an entire virtual datacenter where nothing is tied to physical hardware. Compute, storage, and network resources can be dynamically allocated when needed and where needed…With PAN Manager, Dell leaps over many of its competitors with the ability to create the virtualized datacenter of the future today using inexpensive industry-standard components.”

March 13, 2008

It seems to me that 64-bit computing is the wave of the future, requiring very little effort to adopt, yet from where I stand, fewer companies are going the 64-bit route than I would have thought.

Nearly every x86 processor sold today, if not every one, is of the AMD64 or EM64T (IA-32e) architecture. Memory prices continue to drop, making it more and more common to see x86 (or technically x86_64) based systems with 32GB, 64GB, or even 96GB of RAM.

Through Physical Address Extension (PAE), 32-bit processors have long been able to address more than 4GB of physical memory, given proper OS support. The addition of just four extra address bits allows a 32-bit processor to support up to 64GB of RAM.
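The arithmetic is easy to check in any shell: 32 address bits reach 4GB, and PAE's 36 bits reach 64GB.

```shell
# 2^32 bytes = the 4GB limit of plain 32-bit physical addressing
echo $((1 << 32))   # 4294967296

# 2^36 bytes = the 64GB limit with PAE's 36-bit physical addresses
echo $((1 << 36))   # 68719476736
```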

In Red Hat Enterprise Linux (hereafter referred to as RHEL), support was added for up to 64GB of memory in RHEL4 via the hugemem kernel. The RHEL4 Release Notes state that the hugemem kernel provides a 4GB per-process address space and a 4GB kernel space. It is also noted, though, that running the hugemem kernel has a performance impact, as the kernel needs to switch between separate address translation tables every time it crosses between kernel and user space. It is not stated in the release notes, but I have heard conjecture that the performance impact could be up to 30%.

RHEL5, the latest major release of Red Hat Enterprise Linux, actually removes support for the hugemem kernel. 32-bit RHEL5 will support at most 16GB of memory. See the RHEL Comparison Chart for details. I cannot find specific references as to why hugemem was removed in RHEL5, but I have heard that the performance impact of hugemem was a hassle to deal with from a support perspective. (Not to mention the assertion that 32-bit is dead!)

So, if a user is running 32-bit RHEL4 with greater than 16GB of memory their upgrade path to 32-bit RHEL5 is limited by the 16GB maximum in RHEL5. One would have to do a fresh 64-bit install of RHEL5 to take advantage of the increased memory.

64-bit RHEL, on the other hand, can address a full 2TB of memory. There is also no longer a distinction between LOWMEM and HIGHMEM in the kernel. The elimination of the 1GB/3GB split (or the 4GB/4GB translation with hugemem) increases the stability of the kernel under memory-intensive loads. I’ve seen many cases where 32-bit systems with 8, 12, or even 16GB of RAM have fallen over because the kernel could no longer assemble contiguous blocks of memory fast enough from its limited 1GB address space, while HIGHMEM sat with many GB of usable memory pages.
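Before planning a migration it is worth confirming what you already have; on Linux, two quick checks tell you whether the CPU is 64-bit capable and whether the running kernel is actually 64-bit:

```shell
# The "lm" (long mode) flag in /proc/cpuinfo means the CPU is 64-bit capable,
# even if it is currently running a 32-bit kernel
grep -qw lm /proc/cpuinfo && echo "CPU is 64-bit capable"

# uname -m reports the running kernel's architecture:
# x86_64 for 64-bit, i686/i386 for 32-bit
uname -m
```

Plenty of 32-bit installs turn out to be running on 64-bit capable hardware, which makes the migration a reinstall rather than a hardware purchase.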

In cases like this, a migration to 64-bit has nearly always resolved the kernel memory starvation. (In a couple of cases there was a runaway app that consumed every available page of memory on the system, so there was no difference between 32-bit and 64-bit.)

The moral of the story? Go 64-bit with any new server implementations. Begin putting plans into place now to migrate legacy 32-bit systems to 64-bit in the near future. Having a solid, actionable plan will go a long way to ensure a smooth transition.

March 11, 2008

Slashdot recently linked to an article titled Is There Really an IT Labor Shortage? which argues that the alleged IT labor shortage is merely an argument of convenience for various companies. The article also questions whether the skills shortage is really just that or a result of unrealistic hiring practices.

While I cannot disagree with some of the reasoning behind the argument of hiring practices, I’d like to propose another angle on the ‘labor shortage’ theory–too much ‘book knowledge’ and not enough critical thinking among IT job candidates.

My company has recently been looking for qualified support personnel. The typical support engineer we hire has 7 to 10 years of industry experience in UNIX/Linux, Windows, networking, SAN, or some combination thereof. These are high-level positions for high-level people. Granted, the ‘support’ portion tends to scare away a good number of people from the outset.

I tend to be involved in either the first-level (phone screen) or second-level (first face-to-face) interviews. Not only do most of the candidates not have strong enough technical skills, but of those that do, or appear to (according to their resumes), all are lacking one critical skill: troubleshooting. Troubleshooting, or more simply put, thinking for oneself, is a very basic skill that seems to be missing in the majority of candidates I speak with.

I don’t expect every candidate to know the answer to every technical question that I ask. However, I do expect that they will be able to admit that they don’t know, tell me how they would go about looking for the right answer, and attempt to break an unknown problem down to a more basic level by asking the right questions. Problem definition is a critical part of troubleshooting and something that our organization tries to instill in our engineers.

For example, with one recent batch of phone screens, I provided a simple scenario, warning beforehand that there was no single correct answer. The question was as follows:

“If you received a [call | email | trouble ticket] from a customer who reports that users of his Oracle database on Linux are reporting slowness, how would you go about troubleshooting this issue?”

Not a single candidate was able to provide a set of problem definition steps to my satisfaction. Most attempted to solve the problem right off the bat by spouting off possible solutions to an unknown problem (“I’d check the server’s disk space!” “I’d look at subnet masks!”). I was hoping to hear logical questions that would narrow down the problem: “Is the problem new? Does it only occur at certain times of the day? Are all users affected? Are all queries affected? Were any changes made recently? Define slowness. How are you measuring slowness?”

Where does this lack of critical thinking come from? I’m not entirely sure, but I think part of the blame can be laid at the feet of the myriad industry certification programs. The bulk of the certification programs whose acronyms you see tossed around only push people to know what is on the test. No attention is given to troubleshooting, to trying to solve problems in a methodical fashion. Not all certifications are like this; the CCIE and RHCE are two I can think of off the top of my head that are lab-based rather than ‘multiple-guess’.

I don’t know what the solution to this whole problem may be. It almost seems like a self-perpetuating problem: companies put out ads for IT people with certifications X, Y, and Z, except those certification programs are not producing qualified candidates. I for one would like to see an end to employment ads requiring a candidate with a bunch of three- and four-letter acronyms after their name.

This was a four-day course designed to help those in the network and security fields analyze TCP packet dumps for both performance problems and security issues. As I tend to look at a number of tcpdumps taken during performance issues, I thought this class would be helpful.

My overall impression of the class was favorable. Unlike other supposedly advanced classes I’ve been in, there was no coddling of the unqualified. You were expected to have a solid understanding of TCP/IP in order to keep up.

The course book was a bit sparse, but the entire point of the class was to look at packet dumps in Wireshark. To that end, the lab materials (provided on DVD in the back of the book) were excellent. In addition, the instructor supplemented the provided trace files by capturing live traffic on the network and analyzing it on the overhead. To highlight some of the security segments, she installed a honeypot on an unpatched Windows XP box to capture virus and worm infection attempts.

Of course, in any type of training class there is always room for improvement. I would have liked the class to be five days rather than four; as we were all very competent with networking, we frequently came up with ‘what if’ scenarios to explore, taking us off track. The course book could also use a bit more ‘meat’.

Despite these few shortcomings, I wouldn’t hesitate to recommend this class to anyone who has a solid networking background and needs to capture or analyze traffic using Wireshark.