Part 2: Live Partition Mobility in IBM Power Servers – Debugging with IBM’s devscan Utility

I recently performed some Live Partition Mobility (LPM) testing for a customer, and ran into a number of problems – see Part 1 of this post for the full story on all of the problems and solutions.

The most complicated problem required us to investigate our SAN storage connectivity. IBM Support introduced us to the devscan utility to aid in this task. I found devscan to be a powerful and useful utility, but I didn’t find a whole lot of real-world documentation on it, so I thought I would document my usage of it in this article.

Introduction to devscan

The devscan utility is a free tool developed by IBM to provide information on SAN storage and connectivity and to aid in debugging problems. Information about downloading and using devscan can be found at IBM’s AIX Support Center Tools website.

Problem Recap

After overcoming several minor LPM problems, our attempt to perform an LPM validation from the HMC command line failed with error HSCLA319, indicating that the destination VIO Servers could not host the Virtual Fibre Channel (VFC) adapters required by the LPM client partition. A Google search and IBM Support both pointed to SAN zoning as the likely problem. Our SAN team disagreed, so we set about proving or disproving that there was a SAN zoning issue.

Running devscan on the LPM Destination Server’s VIO Servers

We began our debugging by running devscan on the VIO Servers on the LPM destination server. When running devscan on a VIO Server, we run it in NPIV mode, which lets us supply the WWPNs of the LPM client LPAR’s VFC adapters so that devscan can test for SAN connectivity through the physical fibre channel adapters in the VIO Server. In particular, we can specify the secondary (LPM) WWPNs to verify that the destination server has the connectivity required to support the LPM migration.

In NPIV mode, devscan cannot gather information about specific LUNs, but it does gather data regarding the connections on the storage side. IBM Support had us run the following command for each client VFC adapter WWPN (primary and secondary) on all fscsi adapters on both destination VIO Servers:

# devscan -t f -n <WWPN> --dev=fscsi<#>
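Since this has to be repeated for every WWPN/fscsi combination, a small loop helps. The sketch below only *prints* the devscan invocations rather than running them (a dry run), and the WWPN values are placeholders, not our real client WWPNs:

```shell
# Dry run: print the devscan command for each client WWPN (primary and
# secondary) against each fscsi adapter on the VIO Server.
# The WWPNs below are placeholders -- substitute the client LPAR's real values.
for wwpn in C050760000000010 C050760000000011; do
    for dev in fscsi0 fscsi1 fscsi2 fscsi3; do
        echo "devscan -t f -n $wwpn --dev=$dev"
    done
done
```

Dropping the echo would actually execute the scans; given the hang we ran into below, you may want to limit the WWPN list to just the secondary (LPM) WWPNs.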

In retrospect, I would NOT recommend doing this for all primary and secondary WWPNs, as doing so caused our client LPAR to hang and then crash. I’m not sure whether this *should* have caused a problem, but establishing connectivity for the primary WWPNs on the destination server seemed to remove connectivity from the client LPAR on the source server. In theory, you should only need to run this for the secondary WWPNs on the destination server, to verify that the zoning will allow the mobility operation to complete successfully. However, despite the crash, running it with both sets of WWPNs turned out to be fortunate, because it turned up some interesting information that helped us figure out where the real problem was.

The devscan command returns a lot of good info, but we focused on the highlighted section. Notice that the LUN IDs are all “0”. As I mentioned, in NPIV mode devscan cannot find actual LUN information. However, it does demonstrate the connectivity by providing info about the 6 target ports that the client VFC adapter’s WWPN can “see” via the SAN fabric that this physical fibre channel adapter is connected to.

In our configuration, we utilize two SAN fabrics, so each AIX LPAR has a pair of VFC adapters, one for each fabric. On the VIO Server side, the server VFC adapters are mapped across four physical fibre channel adapters to spread the load – two of these adapters are cabled to SAN fabric A while the other two are cabled to fabric B. So, devscan commands for a specific WWPN from one of the client VFC adapters returned the 6 SCSI IDs listed above for half of the adapters (for the two cabled to the SAN fabric where that WWPN is included in the zoning), and the other half returned with a message saying “No targets found”, since that WWPN isn’t zoned in the other fabric. For a WWPN from the other client VFC adapter, the opposite adapters returned SAN connection info, but pertaining to the other SAN fabric. Example from the other fabric:

We presented this information to the SAN team, but they once again merely confirmed that the secondary WWPNs were indeed zoned properly to the SVC storage. So, we had to continue our debugging.

Finding SCSI IDs with lspath on LPM Client LPAR

We decided to run some commands directly on the client LPAR, to see what the connectivity looked like from there. For each disk/VFC adapter combination we were able to find the SCSI ID info that devscan returned on the VIO Servers by using this lspath command:

# lspath -AHE -l <disk> -p <parent adapter> -w <connection>

First, we needed the “connection” information to use with the “-w” flag. We could get that for each disk by using the “-F” flag with the lspath command. For example, for hdisk1:
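The two steps can be chained so you don’t have to copy connection strings by hand. The sketch below just prints the lspath commands it would run; the path data is a stand-in for real `lspath -F` output (the WWPN/LUN values are placeholders, in AIX’s ww_name,lun_id connection form):

```shell
# Stand-in for the output of: lspath -l hdisk1 -F "parent connection"
# (parent fscsi adapter, then the connection string as ww_name,lun_id).
# These values are placeholders, not our real WWPNs and LUN IDs.
cat > paths.txt <<'EOF'
fscsi0 5005076800000001,1000000000000
fscsi1 5005076800000002,1000000000000
EOF
# Print the per-path attribute query for each parent/connection pair.
while read -r parent conn; do
    echo "lspath -AHE -l hdisk1 -p $parent -w $conn"
done < paths.txt
```

On AIX itself, replacing the echo with the command (and the stand-in file with the real lspath pipeline) dumps the attributes, including the SCSI ID, for every path of the disk.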

Now, we compare the SCSI IDs that lspath found to the SCSI IDs that devscan found on the destination VIO Servers:

6e0050 - not found at all by lspath
6e0051 - not found at all by lspath
6e5424 - lspath found with hdisk2
6e5426 - lspath found with hdisk2
6e5578 - lspath found with hdisk0 & hdisk1
6e5579 - lspath found with hdisk0 & hdisk1
780043 - not found at all by lspath
780044 - not found at all by lspath
784c2b - lspath found with hdisk2
784c2c - lspath found with hdisk2
784f15 - lspath found with hdisk0 & hdisk1
784f17 - lspath found with hdisk0 & hdisk1

So the four SCSI IDs that devscan found only with the primary WWPNs on the VIO Servers do not appear to be used on the client LPAR at all.
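This cross-check is easy to script. Here is a sketch using comm(1) on the two sorted ID lists; the IDs are the ones from our output above:

```shell
# SCSI target IDs that devscan reported on the destination VIO Servers
printf '%s\n' 6e0050 6e0051 6e5424 6e5426 6e5578 6e5579 \
              780043 780044 784c2b 784c2c 784f15 784f17 | sort > devscan_ids.txt
# SCSI target IDs that lspath found on the client LPAR
printf '%s\n' 6e5424 6e5426 6e5578 6e5579 \
              784c2b 784c2c 784f15 784f17 | sort > lspath_ids.txt
# IDs seen by devscan but absent from the client's paths
comm -23 devscan_ids.txt lspath_ids.txt
```

This prints 6e0050, 6e0051, 780043, and 780044, exactly the four target IDs that only the primary WWPNs could see.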

The plot thickens.

Putting it all Together

We next decided to run devscan on the LPM client LPAR to see what we could learn. We ran the following command on the client to find all of the SCSI IDs:

This information revealed that ALL of the SCSI IDs are indeed seen on this client LPAR (which isn’t surprising, since it is using the primary WWPNs, which did see all of the connections). But why are some of these connections unused?

Running the same command but looking at the full output revealed the answer. The command produces very wide output, so for the sake of clarity I’ll run the command through awk to just print out several of the most important columns:

Note that the unused SCSI IDs in question all have a different device type of IBM 2107900, while the “good” SCSI IDs are all IBM 2145. This LPAR is currently allocated storage through an IBM SVC (device type 2145), but it used to get its storage from an IBM DS8x00 (device type 2107). Seemingly, it was still connected to the DS8x00 unit, even though storage was no longer allocated from that unit.
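A quick way to spot such stale connections is to filter the output by device type. The snippet below runs against a simplified stand-in for devscan’s wide output (only the SCSI ID and device type columns are kept; the real layout is far wider and varies by version), using the SCSI IDs and device types from our case:

```shell
# Simplified stand-in for devscan output: SCSI ID and device type only.
# The real devscan output has many more columns; this layout is illustrative.
cat > scan.txt <<'EOF'
6e0050 IBM 2107900
6e0051 IBM 2107900
6e5578 IBM 2145
6e5579 IBM 2145
EOF
# Flag targets that are still the old DS8x00 (2107) rather than the SVC (2145)
awk '$3 == "2107900" { print $1, "stale DS8x00 target" }' scan.txt
```

In our case this immediately separates the unused DS8x00 targets from the SVC targets actually serving storage.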

The SAN team confirmed that the primary WWPNs were indeed still zoned to the DS8x00, in addition to the SVC, while the secondary WWPNs had only been zoned to the SVC (since all storage was allocated solely via the SVC at the time that they established the zoning for LPM testing). They didn’t think that would be a problem since no DS8x00 storage was allocated – but apparently LPM requires the secondary WWPNs to have the same connectivity as the primary WWPNs, whether or not any storage is actually allocated via those connections.

We were able to fix our LPM problem by removing the DS8x00 from the zoning of the client VFC adapters’ primary WWPNs. Once the zoning of the primary WWPNs matched the zoning of the secondary WWPNs, LPM worked like a charm.