Prior to AIX 5.3 TL7 and AIX 6.1, there was an 8-character limit on AIX user passwords. If you need passwords longer than 8 characters, you must enable one of the supplied Loadable Password Algorithms (LPAs). The following table lists the available algorithms and the limitations of each:

For example, to enable the MD5 algorithm I can modify the /etc/security/login.cfg file with the chsec command as follows:

# chsec -f /etc/security/login.cfg -s usw -a pwd_algorithm=smd5

# tail -2 /etc/security/login.cfg

pwd_algorithm = smd5

This algorithm (smd5) allows passwords of up to 255 characters. Each of the available algorithms is listed in the /etc/security/pwdalg.cfg file.
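For example, the stanza for the smd5 algorithm enabled above typically looks something like this (the output is illustrative; check the file on your own system):

# grep -p smd5 /etc/security/pwdalg.cfg

smd5:
        lpa_module = /usr/lib/security/smd5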

* /usr/lib/security/ssha is a password hashing load module using SHA and
* SHA2 algorithms. It supports password length up to 255 characters.
*
* This LPA accepts three options. The options are separated by commas.

...etc...

Once you've enabled the LPA of your choice, and you set/change a user's password, you'll notice that the /etc/security/passwd stanza for that user looks different when compared to the stanzas of users that have not had their password set/changed using the new LPA:

fred:
        password = E7nOaTrrz9Q16
        lastupdate = 1330986703
        flags = ADMCHG

joe:
        password = {smd5}z9JrHDJB$Oq/cZXr0jUyAWvfFyjt161
        lastupdate = 1330987903
        flags = ADMCHG

In the example above,
user joe’s password has been set using the smd5 algorithm.

For those of you who
run PowerHA (HACMP) and are thinking about using one of the LPAs with the clpasswd utility, you may want to
review this APAR first:

The APAR states “HACMP
cluster-wide C-SPOC password administration does not support use of the feature
allowing passwords longer than 8 characters which became available with the
Loadable Password Algorithm as part of AIX 53 TL 7.”

The last time I tested this with PowerHA, the problem was that the password entry in /etc/security/passwd was corrupted/truncated when a user's password was changed using the clpasswd utility.

For example, if the passwd utility is linked to clpasswd and I changed a user's password, the password field appeared to be corrupted/truncated and the user could not log in successfully:

I’ve not tried this again recently but I am curious if the
same behaviour can be expected on a PowerHA system today. When I first encountered
this problem (in 2008) I opened a PMR for the issue. In that call I was told
that the “clpasswd utility is corrupting the encrypted password when
distributing to the nodes, so that a login fails”. I’ll configure a HA
cluster soon and try it again with PowerHA 6.1 and AIX 6.1 and report back with
the results.

UPDATE: I built a HA 6.1 cluster (on AIX 6.1) this afternoon in
my lab and tested this successfully. Based on the tests I’ve performed so far,
it appears that this limitation no longer exists. Thanks to hafeedbk@us.ibm.com for the help on this
one.

The following IBM
tech note has more information on the available Loadable Password Algorithms
and support for longer than 8 character passwords on AIX:

If restoring a workload
partition, target disks should be in available state.

So I tried the command again, this time with the -O flag as suggested in the error message. It failed again, stating that it could not remove cgvg from hdisk5. This was also good. At this point, an experienced AIX admin would look for the existence of a volume group on hdisk5. However, a junior AIX admin might not! :)

Global# mkwpar -O -D rootvg=yes devname=hdisk5 -n rootvgwpar1

mkwpar: 0960-620 Failed to remove cgvg on disk hdisk5.

Global# lsvg -l cgvg

cgvg:
LV NAME    TYPE      LPs  PPs  PVs  LV STATE     MOUNT POINT
cglv       jfs2      1    1    1    open/syncd   /cg
loglv00    jfs2log   1    1    1    open/syncd   N/A

But what if the file systems were already unmounted prior to running mkwpar with the -O flag? What would happen?

So, to test my theory, I
unmounted my file system so that the logical volumes in cgvg were all now
closed.

This could be a trap for “first time” users of RootVG WPARs! So
look out! :)

Apparently this is working as designed, as the -O flag is really only meant to be used by WPAR tools such as the WPAR Manager.

The man page for mkwpar states:

-O    This flag is used to force the overwrite of an existing volume group on the given set of devices specified with the -D rootvg=yes flag directive. If not specified, the overwrite value defaults to FALSE. This flag should only be specified once, for its setting will be applied to all devices specified with the -D rootvg=yes flag directive.

After reading about
the latest AIX updates here: http://t.co/bkIpnXkS
I decided to download and install the latest TL & SP for AIX 6.1 and 7.1
and take a peek at some of the latest features.

Here’s what I found
so far.

There appears to be some new integration between NIM and the VIOS. The nim command now has an updateios operation, e.g. nim -o updateios.

So you can update
your VIO servers from NIM now. This is nice.

On my lab NIM master
I checked the nim man page and found
the following new information:

NIM [/] # oslevel -s

6100-07-02-1150

NIM [/] # man nim

...

updateios

Performs
software customization and maintenance on a virtual input output server (VIOS)
management server that is of the vios or ivm type.

updateios

1. To install fixes or to update the VIOS with the vioserver1 NIM object name to the latest maintenance level, type:

nim -o updateios -a lpp_source=lpp_source1 -a preview=no vioserver1

The updates are stored in the lpp_source resource named lpp_source1. Note: The updateios operation runs a preview during installation. Running the updateios operation from NIM runs a preview unless the preview flag is set to no. During the installation, you must run a preview when using the updateios operation with updateios_flags=-install. With the preview, you can check that the installation is running accurately before proceeding with the VIOS update.

2. To reject fixes for a VIOS with the vioserver1 NIM object name, type:

nim -o updateios -a updateios_flags=-reject vioserver1

3. To clean up partially installed updates for a VIOS with the vioserver1 NIM object name, type:

nim -o updateios -a updateios_flags=-cleanup vioserver1

4. To commit updates for a VIOS with the vioserver1 NIM object name, type:

nim -o updateios -a updateios_flags=-commit vioserver1

5. To remove a specific update such as update1 for a VIOS with the vioserver1 NIM object name, type:

There’s also mention
of a new resource type, specifically for VIOS mksysbs. This resource type is
called ios_mksysb:

...

ios_mksysb

Represents
a backup image taken from a VIOS management server that is of the vios or ivm
type.

26. To define an ios_mksysb resource such as ios_mksysb1, and create the ios_mksysb image of the VIOS client vios1 during the resource definition, where the image is located in /export/nim/ios_mksysb on the master, type:

nim -o define -t ios_mksysb -a server=master \
    -a location=/export/nim/ios_mksysb -a source=vios1 \
    -a mk_image=yes ios_mksysb1

This is all starting
to come together now, since the introduction of the new “management” object
class, vios, with AIX 6.1 TL3.

Next I thought I'd take a look at the TCP Fast Loopback option. This new option should help to reduce TCP/IP (CPU) overhead when two TCP communication end points reside in the same LPAR. This could be useful where a database and an application run in the same LPAR, e.g. SAP and Oracle together in one LPAR. It can also be used when two or more WPARs in the same LPAR need to communicate with each other over TCP/IP.

I turned on this new
feature on my AIX 7.1 LPAR.

AIX7[/] # oslevel -s

7100-01-02-1150

AIX7[/] # netstat -p tcp | grep fastpath

        0 fastpath loopback connection
        0 fastpath loopback sent packet (0 byte)
        0 fastpath loopback received packet (0 byte)

AIX7[/] # no -p -o tcp_fastlo=1

Setting tcp_fastlo to 1
Setting tcp_fastlo to 1 in nextboot file
Change to tunable tcp_fastlo, will only be effective for future connections

AIX7[/] # no -p -o tcp_fastlo_crosswpar=1

Setting tcp_fastlo_crosswpar to 1
Setting tcp_fastlo_crosswpar to 1 in nextboot file
Change to tunable tcp_fastlo_crosswpar, will only be effective for future connections

AIX7[/] # no -a | grep tcp_fast

        tcp_fastlo = 1
        tcp_fastlo_crosswpar = 1

Initially I did not
see any traffic via the fastpath.

AIX7[/] # netstat -s -p tcp | grep fastpath

        0 fastpath loopback connection
        0 fastpath loopback sent packet (0 byte)
        0 fastpath loopback received packet (0 byte)

So I created two WPARs
in the same LPAR and started transferring files between them via FTP.

So I start with a
standard JFS2 filesystem, without any additional mount options.

NIM [/] # mount | grep cg

/dev/cglv    /cg    jfs2    Jan 18 21:14    rw,log=/dev/hd8

Then I dynamically remounted it with the rbr (release-behind-read) option. This option prevents user data pages from being cached after a file is read from this filesystem.

NIM [/] # mount -o remount,rbr /cg

NIM [/] # mount | grep cg

/dev/cglv    /cg    jfs2    Jan 18 21:14    rw,rbr,log=/dev/hd8

Still can’t
dynamically mount a filesystem with CIO however. But that’s OK.

NIM [/] # mount -o remount,cio /cg

mount: cio is not valid with the remount option.

According to the
presentation, there are several options that can now be changed dynamically
e.g. atime,rbr,rbw,suid,dev,
etc. Take a look at the presentation if you are interested in this
new functionality.

By the way, just
to make sure, I tried changing the same mount option, dynamically, on an AIX 6.1
TL6 system, and it failed as expected. I’d have to umount and mount the
filesystem to do this on TL6 (or lower).

# oslevel -s

6100-06-04-1112

# mount -o remount,rbr /cg

mount: remount,rbr,log=/dev/loglv00 is not valid with the remount option.

OK, let's look at the new LVM Infinite Retry capability. It is designed to improve system availability by allowing LVM to recover from transient failures of storage devices. Sounds interesting!

AIX7[/] # oslevel -s

7100-01-02-1150

The man page for mkvg states the following:

-O y / n    Enables the infinite retry option of the logical volume.

    n    The infinite retry option of the logical volume is not enabled. The failing I/O of the logical volume is not retried. This is the default value.

    y    The infinite retry option of the logical volume is enabled. The failed I/O request is retried until it is successful.

I think “logical
volume” should be “volume group”. But anyway, I get the idea.

So let’s create a new
VG with infinite retry enabled.

AIX7[/] # mkvg -O y -S -y cgvg hdisk6

cgvg

AIX7[/] # lsvg cgvg

VOLUME GROUP:       cgvg                     VG IDENTIFIER:  00f6048800004c0000000134f4236851
VG STATE:           active                   PP SIZE:        128 megabyte(s)
VG PERMISSION:      read/write               TOTAL PPs:      1599 (204672 megabytes)
MAX LVs:            256                      FREE PPs:       1599 (204672 megabytes)
LVs:                0                        USED PPs:       0 (0 megabytes)
OPEN LVs:           0                        QUORUM:         2 (Enabled)
TOTAL PVs:          1                        VG DESCRIPTORS: 2
STALE PVs:          0                        STALE PPs:      0
ACTIVE PVs:         1                        AUTO ON:        yes
MAX PPs per VG:     32768                    MAX PVs:        1024
LTG size (Dynamic): 128 kilobyte(s)          AUTO SYNC:      no
HOT SPARE:          no                       BB POLICY:      relocatable
MIRROR POOL STRICT: off
PV RESTRICTION:     none                     INFINITE RETRY: yes

AIX7[/] #

Now, let’s disable it.

AIX7[/] # chvg -On cgvg

AIX7[/] # lsvg cgvg

VOLUME GROUP:       cgvg                     VG IDENTIFIER:  00f6048800004c0000000134f4236851
VG STATE:           active                   PP SIZE:        128 megabyte(s)
VG PERMISSION:      read/write               TOTAL PPs:      1599 (204672 megabytes)
MAX LVs:            256                      FREE PPs:       1599 (204672 megabytes)
LVs:                0                        USED PPs:       0 (0 megabytes)
OPEN LVs:           0                        QUORUM:         2 (Enabled)
TOTAL PVs:          1                        VG DESCRIPTORS: 2
STALE PVs:          0                        STALE PPs:      0
ACTIVE PVs:         1                        AUTO ON:        yes
MAX PPs per VG:     32768                    MAX PVs:        1024
LTG size (Dynamic): 128 kilobyte(s)          AUTO SYNC:      no
HOT SPARE:          no                       BB POLICY:      relocatable
MIRROR POOL STRICT: off
PV RESTRICTION:     none                     INFINITE RETRY: no

AIX7[/] #

The man page for mklv states the following:

-O y / n    Enables the infinite retry option of the logical volume.

    n    The infinite retry option of the logical volume is not enabled. The failing I/O of the logical volume is not retried. This is the default value.

    y    The infinite retry option of the logical volume is enabled. The failed I/O request is retried until it is successful.

And last, but not
least, let’s take a brief look at Active
System Optimiser (ASO). To be
honest, I’m still not entirely sure how ASO works. But I have no doubt that
more information will be available from IBM soon. According to the presentation
material, ASO can “increase system performance by autonomously tuning system
configuration”. Wow, cool! It focuses on optimizing cache and memory affinity.
Hmmm, interesting. How the heck does it do that!? Only works with POWER7 and
AIX 7.1.

So can I enable this
on my p7 LPAR? Let’s give it a try!

AIX7[/] # oslevel -s

7100-01-02-1150

AIX7[/var/log/aso] # asoo -a

aso_active = 0

AIX7[/var/log/aso] # asoo -p -o aso_active=1

Setting aso_active to 1 in nextboot file
Setting aso_active to 1

AIX7[/var/log/aso] # asoo -a

aso_active = 1

Is the aso daemon
running already? Nope.

AIX7[/] # ps -ef | grep aso

AIX7[/] # lssrc -a | grep aso

 aso                                              inoperative

Can I start it now? Nope.

AIX7[/var/log] # startsrc -s aso

0513-059 The aso Subsystem has been started. Subsystem PID is 7209122.

There may be better ways to do this, and if there are, please let me know. But lately I've been "hacking around" with cloud-init on AIX and trying to make it behave the way I want it to. There were two problems I faced and solved.

My first challenge: the AIX /etc/hosts file isn't updated with the IP address and hostname of the new AIX VM after deployment from PowerVC.

To work-around this niggle*, I added the following short but effective shell script to the Activation Input in the PowerVC GUI.
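The script itself isn't reproduced here, but a minimal sketch of the idea looks like this (the en0 interface name and the hostname/ifconfig parsing are assumptions; adapt them to your environment):

#!/usr/bin/ksh
# Hypothetical sketch: append this VM's IP address and hostname to /etc/hosts after deployment.
HN=$(hostname)
IP=$(ifconfig en0 | awk '/inet /{print $2; exit}')    # en0 is an assumption
grep -w "$HN" /etc/hosts >/dev/null 2>&1 || echo "$IP $HN" >> /etc/hosts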

And the second niggle: my customer's lab environment has no DNS at all; everything is /etc/hosts only. Every time they deployed an AIX VM, there was a significant delay for ssh sessions. Even with netsvc.conf set to hosts=local4, the dodgy**, default resolv.conf (search localdomain) forced DNS lookups first, and ssh connections would hang for a minute or so before a login prompt appeared, waiting on name resolution to complete.

Some people told me that I could change "manage_resolv_conf" to false in my cloud-init config file and this would prevent the resolv.conf file from being managed (over-written) by cloud-init. But changing that option did nothing. And I really didn't want a resolv.conf file at all anyway!

What I really wanted was for cloud-init to deploy the AIX VM and to NOT create an /etc/resolv.conf file. But how? Well, I managed to fudge it***. I made a change to the aix.py python script. With this change, the script now writes out an /etc/resolv.conf.cloud file instead. This works OK.

And I guess the next time I install the latest release of cloud-init for AIX, I'll need to modify the script again. But I'm OK with this, as I expect the newer release may actually provide me with a fix to each of the problems I faced.

*niggle: To cause slight but persistent annoyance, discomfort, or anxiety.

Before
the change, I was unable to compile anything. Note the highlighted text below.

# /usr/vac/bin/xlc cgc.c

The
license for the Evaluation version of IBM XL C/C++ for AIX V11.1 compiler
product has expired. Please send an email to compiler@ca.ibm.com
for information on purchasing the product. The evaluation license can be
extended to 74 days in total by either a) setting an environment variable XLC_EXTEND_EVAL=yes; or, b) specifying a compiler command line option
-qxflag=extend_eval. The extended evaluation license will expire on Sat
Jun 18 10:26:33 2011. Use of the Program continues to be subject to the terms
and conditions of the International License Agreement for Evaluation of
Programs, including the accompanying License Information document (the
"Agreement"). A copy of the Agreement can be found in the
"LicAgree.pdf" and "LicInfo.pdf" files residing in the root
directory of the installation media. If you do not agree to the terms and
conditions of the Agreement, you may not use or access the Program.

With
the environment variable in place, I was able to compile again.

# /usr/vac/bin/xlc cgc.c

# ./a.out

Hello World!

Of
course I could also export the environment variable as required instead e.g.

# export XLC_EXTEND_EVAL=yes

# /usr/vac/bin/xlc cgc.c

# ./a.out

Hello World!

If
you do place this environment variable in /etc/profile, you may need to restart
any processes on the system that need to call the compiler.

I enjoy it when I open my email in the morning and find a new message with a subject line of “weird one….”! I immediately prepare myself for whatever challenge awaits. Fortunately I do delight in helping others with their AIX challenges so I usually open these emails first and start to diagnose and troubleshoot the problem!

This week I was contacted by someone that was having a little trouble with a mksysb backup on one of their AIX systems.

“Hi Chris,

This one has me stumped, any ideas? I'll have to log a call I think, as I'm not sure why this is happening. I run a mksysb and it just backs up 4 files! I also can't do an alt_disk_copy; that fails too.

My /etc/exclude.rootvg is empty.

# cat /etc/exclude.rootvg
# mksysb -i /mksysb/aixlpar1-mksysb

Creating information file (/image.data) for rootvg.

Creating list of files to back up.

Backing up 4 files

4 of 4 files (100%)
0512-038 mksysb: Backup Completed Successfully.

# lsmksysb -f /mksysb/aixlpar1-mksysb
New volume on /mksysb/aixlpar1-mksysb:
Cluster size is 51200 bytes (100 blocks).
The volume number is 1.
The backup date is: Wed Oct 21 22:12:04 EST 2015
Files are backed up by name.
The user is root.

     5911 ./bosinst.data
       11 ./tmp/vgdata/rootvg/image.info
    11837 ./image.data
   270567 ./tmp/vgdata/rootvg/backup.data

The total size is 288326 bytes.
The number of archived files is 4.”

This little tip was passed on to me by
a friendly IBM hardware engineer many years ago.

When entering a capacity on demand (CoD) code into a Power system, you can tell how many processors and how much memory will be activated just by looking at the code you've been given by IBM.

For example, the following codes, when
entered for the appropriate Power system, will enable 4 processors (POD) and
64GB of memory (MOD). I can also tell* that once the VET code is entered, this
system will be licensed for PowerVM Enterprise Edition (2C28).

This is a significant enhancement, as it will allow AIX administrators to install TLs and SPs (and ifixes) without restarting their AIX systems. From the announcement:

"AIX Live Update for Technology Levels, Service Packs, and Interim Fixes

Introduced in AIX 7.2, AIX Live Update is extended in Technology Level 1 to support any future update without a reboot, with either the geninstall command or NIM.

The genld command is enhanced to list processes that have an old version of a library loaded so that processes can be restarted when needed in order to load the updated libraries."

In this post I'll show you how to install updates without rebooting your AIX server. I recommend you first review my original article on Live Updates (from Oct 2015) in order to better understand the Live Update process, how it works and the requirements.

TL and SP Live Update support is delivered in AIX 7.2 TL1 (available November 11th 2016).

One of the biggest differences between the Live Update process for TL/SPs versus ifixes is that you must back up your system prior to the update. This backup will be used in case you need to back out the update. The easiest way to do this is to create an alternate rootvg (alt disk clone).

On my system, I first applied TL1 for AIX 7.2 and verified the correct level was installed.

root@AIXmig / # oslevel -s

7200-01-00-0000

root@AIXmig / # cat /proc/version

Oct 10 2016

11:53:17

1640C_72D

@(#) _kdb_buildinfo unix_64 Oct 10 2016 11:53:17 1640C_72D

I had several "free" disks that I could use for the Live Update process. I'd need at least 3 disks, one for my alternate rootvg (back out), one for the mirror disk and one for the new rootvg. In this case I used hdisk6 (alt rootvg), hdisk3 (mdisk) and hdisk2 (ndisk). This was specified in the lvupdate.data configuration file. All three disks were large enough to hold a complete copy of my existing rootvg.

root@AIXmig / # lspv

hdisk0 00f94f58cecabed6 rootvg active

hdisk1 00f94f58697b768f datavg active

hdisk2 00f94f58697b7655 None

hdisk3 00f94f58ce74a739 None

hdisk4 00f94f58a3b2f963 None

hdisk5 00f94f58a3b2f9d4 None

hdisk6 00f94f58def77f2c None

root@AIXmig / # cat /var/adm/ras/liveupdate/lvupdate.data

...

disks:

nhdisk = hdisk2

mhdisk = hdisk3

Note, with 7.2 TL1, only two disks need to be specified in the lvupdate.data file. The tohdisk and tshdisk are not needed with TL1 unless you have a paging or dump device outside of rootvg.
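For completeness, the Live Update itself is later driven by geninstall in Live Update (-k) mode; a sketch, assuming the TL1 updates are staged in /tmp/7200-01-01 (the final operand shown here is an assumption, so check the geninstall documentation for the exact form):

root@AIXmig / # geninstall -k -d /tmp/7200-01-01 all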

I cloned my rootvg to a spare disk (hdisk6) first. I specified the -B flag to ensure the boot list was not changed.
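The clone command itself isn't shown above; it would have been along these lines (hdisk6 being my spare disk, with -B leaving the boot list untouched):

root@AIXmig / # alt_disk_copy -B -d hdisk6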

If I needed to back out, to the previous level, I could change the boot list to point to the alternate rootvg (hdisk6), and restart the system.

root@AIXmig / # lspv

hdisk0 00f94f58cecabed6 rootvg active

hdisk1 00f94f58697b768f datavg active

hdisk2 00f94f58697b7655 lvup_rootvg

hdisk3 00f94f58ce74a739 None

hdisk4 00f94f58a3b2f963 None

hdisk5 00f94f58a3b2f9d4 None

hdisk6 00f94f58def77f2c altinst_rootvg

root@AIXmig / # bootlist -m normal -o

hdisk0 blv=hd5 pathid=0

hdisk0 blv=hd5 pathid=1

root@AIXmig / # bootlist -m normal hdisk6

hdisk6 blv=hd5 pathid=0

hdisk6 blv=hd5 pathid=1

root@AIXmig / # bootlist -m normal -o

hdisk6 blv=hd5 pathid=0

hdisk6 blv=hd5 pathid=1

root@AIXmig / # at now

shutdown -Fr

Job root.1477276394.a will be run at Mon Oct 24 13:33:14 AEDT 2016.

This is very cool technology. Gone are the days of needing to plan reboots shortly after applying a new TL or SP to a critical AIX system. You simply "live update" your system, without disrupting your workloads or your users. This is a win for AIX administrators everywhere!

Please refer to the AIX 7.2 Knowledge Center for more information on Live Update.

A customer was attempting to move an LPAR from one POWER7 750
to another POWER7 750 using Live Partition Mobility (LPM). But he was receiving
the following error message when performing an LPM validation.

This was an
easy one to fix. The reserve policy on the hdisk on the VIOS (yes, it was a
single VIO server not dual VIOS, don’t ask why!) was set to PR_exclusive instead of no_reserve.
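The fix itself is a one-liner on the VIOS; something like the following, where the hdisk number is just an example (if the disk is busy you may need the -perm flag and a reconfig):

$ chdev -dev hdisk10 -attr reserve_policy=no_reserve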

In a perfect world, 99.9% of AIX administrators would prefer their systems to look like this:

# lspv | grep rootvg

hdisk0          00c342c68dfcbdfb                    rootvg          active

However, in reality, 99.9% of AIX administrators live with systems that look something like this:

# lspv | grep rootvg

hdisk39         00c342c68dfcbdfb                    rootvg          active

And 99.9% of them don’t have time to tidy up their systems so that rootvg resides on hdisk0.

Most of them have much bigger fish to fry, such as performance, virtualisation, automation, security, project delivery, TPS reports, etc!

If they did have time, they could use the mirrorvg and rendev commands to ‘bring order to the Universe’.

WARNING! Let me make this perfectly clear! The procedure that is shown below is NOT SUPPORTED by IBM. If you choose to follow these procedures, DO NOT contact IBM support for help. They will not be able to assist you. YOU HAVE BEEN WARNED!

Note: Disk drive devices that are members of the root volume group, or that will become members of the root volume group (by means of LVM or install procedures), must not be renamed. Renaming such disk drives may interfere with the ability to recover from certain scenarios, including boot failures. Some devices may have special requirements on their names in order for other devices or applications to use them. Using the rendev command to rename such a device may result in the device being unusable.

Note: To protect the configuration database, the rendev command cannot be interrupted once it has started. Trying to stop this command before completion, could result in
a corrupted database.
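For what it's worth, the renaming step itself boils down to something like this (illustrative only, and exactly the kind of operation the warnings above refer to when applied to rootvg disks):

# rendev -l hdisk39 -n hdisk0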

IBM's PowerVP tool became available in November 2013. It was designed to provide Power Systems administrators with performance information in an enhanced visual format. The aim was to accelerate the identification of performance bottlenecks so that performance analysts could make better decisions based on more detailed and comprehensive data from POWER7 (and POWER8) systems. PowerVP presents both System (frame) and Partition level views of performance data. This has not been possible in the past using any single tool. Administrators would typically need to use many different tools and interfaces to obtain a single, system-wide performance view across an entire CEC and to drill down to all individual partitions.

The tool was originally developed for IBM internal use only (known as Sleuth) which helped the IBM development team with rapid development of prototype technology and performance analysis. After a brief demonstration of the tool during an internal, invitation only, event for customers at IBM Austin, almost all of the customer attendees requested that the tool be made available for use outside of IBM.

In this post I will briefly discuss how to quickly install and configure PowerVP in an AIX environment. I will start by discussing how to install the PowerVP GUI on a Windows laptop and then cover how to install the PowerVP agents on an AIX and/or VIOS partition. I'll then show you how to monitor your system and collect system-wide metrics for an entire frame by recording and playing back your PowerVP sessions.

To begin, let’s download, extract and install the latest version of PowerVP. Customers that are entitled (PowerVM Enterprise Edition customers) can download the PowerVP software directly from the IBM Entitled Systems Support (ESS) website.

You can be forgiven for thinking that the latest version of PowerVP isn’t available in ESS but if you look closely and expand everything out, you’ll find v1.1.3 is listed for download.

Once you’ve downloaded the software you’ll end up with a PowerVP package that is named something similar to ESD_-_PowerVP_Standard_Edition_v_1.1.3_62015.zip. Extract the zip file and you’ll discover the following directory structure:

To install the PowerVP GUI for Windows, simply run the PowerVP.exe from the Windows folder. Choose PowerVP Client GUI and click next.

When prompted, select to install Liberty for PowerVP as part of the GUI installation. The IBM PowerVP Redbook explains why:

“Starting with Version 1.1.3 PowerVP has a web based GUI. It is packaged in the Web Application Archive (WAR) format and it must be deployed onto an application server. By default, PowerVP GUI uses IBM WebSphere® Application Server Liberty Core. Liberty profile is a new server profile of IBM WebSphere Application Server V8.5. Liberty profile provides all features required to run the PowerVP, it is lightweight, has a small footprint and fast startup time. PowerVP and a configured Liberty profile are packaged into a compressed file. This provides for an easy and efficient distribution and a simplified installation procedure. Because the new PowerVP GUI is web based it is now possible that a single instance of this GUI be accessed by multiple users using web browsers. This eliminates the need to install a console for each PowerVP user and avoids the potential overhead generated by additional performance data requests initiated from multiple consoles. PowerVP users can connect to the web GUI using web browsers. Users must be able to connect to the ports on which the application server is listening. Default port numbers are 9080 for HTTP traffic and 9443 for HTTPS traffic. Port numbers can be changed during the installation process.”

The following diagram, also from the Redbook, provides a visual representation as to where Liberty fits in with the new GUI, the System and Partition level agents.

Once the GUI is installed on your Windows desktop, the next step is to extract the PowerVP agent for AIX/VIOS. To do this you, once again, run the PowerVP.exe installer. Select PowerVP Server Agents and click next.

Select AIX/VIOS and click next.

When prompted for a System Level Agent Hostname or IP Address, you can enter anything here as it is ignored. All we want to do here is extract the installation software, not connect to an agent. I entered localhost, even though this is not where the PowerVP agent will reside. Click next.

Once the install process is complete you’ll find the extracted AIX/VIOS installation filesets in the C:\Program Files (x86)\IBM\PowerVP\PowerVP_Installation\PowerVP_Agent_Installation_Instructions\AIX directory.

You can now transfer these files to your AIX and/or VIOS system of choice, essentially wherever you'd like to install and run the PowerVP server agent and any partitions you'd like to monitor as a partition level agent. Many customers have chosen to install the PowerVP System level agent on their VIOS. This seems like a logical place to install it as these systems are typically always up and available. Ensure that you copy the powervp.1.1.30.bff fileset and the GSKit filesets to the destination system, as both are needed for installation. Of course, you should download and install the latest fixes for PowerVP from the IBM Fix Central site as well.

The agent installation on AIX is very simple. Make sure that your hardware and system firmware support PowerVP before you install it. The IBM Redbook, IBM PowerVP Introduction and Technical Overview REDP-5112-00, has a comprehensive list of supported systems and minimum requirements.

To install the agent on an AIX or VIOS partition, simply copy the filesets to the system and use installp to install both the GSKit and powervp filesets. The GSKit filesets are required for SSL support with PowerVP. Even if you don’t plan on using SSL with PowerVP, these filesets must be installed when powervp.rte is installed, regardless.

With the agent installed successfully, you can simply start the PowerVP agent. No further configuration is required at this point. The agent will run as a System level agent and allow you to connect to it with the PowerVP client GUI.

However, if you wanted to configure this agent as a Partition level agent, you need to run the PowerVP iconfig tool to point the Partition agent at an existing System level agent. For example, we could configure the newly installed agent to communicate with an existing System level agent at IP address 10.1.50.59. Then we start the agent using the SPowerVP script. We then confirm that the agent has registered with the System level agent by reviewing the output in the /var/log/powervp.log file on the client partition.

Now that the agent is installed we can connect to it with the PowerVP GUI. You can start the GUI by double-clicking on the PowerVP icon on your Windows desktop. This starts the Liberty server, opens your web browser and connects you to the PowerVP interface.

You should see the following messages as the PowerVP GUI server is started on your Windows desktop/laptop.

Note: Please ensure that Java is in your path on your Windows machine. If it is not, PowerVP will fail to start and you may be presented with an error stating that “javaw” cannot be found. You can check if Java is in the path by opening a DOS prompt and entering a java command. For example:
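C:\> java -version

If Java is installed and in the PATH, this prints the installed Java version; if the command is not recognised, add the JRE bin directory to the PATH before launching PowerVP.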

To connect to the System level agent, click on New Connection and enter the IP address or hostname of the partition running the System level agent, followed by the user name and password of the root user (or padmin if running the agent on a VIOS). Then click on Connect.

You will be presented with the PowerVP main panel. From here you can start exploring each of the main views available, such as System Topology, Node Drill Down and Partition Drill Down.

The System Topology view shows the hardware topology of the system we are connected to in the current session. In this view, we can see the topology of a POWER8 S824 with two processor modules. We can see each node has two chips/sockets. We can also see numbers in the boxes which indicate how busy each of the chips is on the system. The lines between the nodes show the traffic on the SMP fabric between each node. If you select the Toggle Buses button, the PowerVP GUI will show lines between the processor module boxes and processor nodes which represent buses. The Toggle Affinity button is intended to show affinity, where every partition has a different colour.

The Node Drill Down view appears when you click on one of the nodes and allows you to see the resources being consumed by the partitions running on the system. In this view we can see this processor module has 12 cores/processors. We can also see lines showing the buses between the chips. We can also see the memory controllers and the PHB buses, which show traffic to and from our remote I/O. We can also see the connections to the other processor module; these are the SMP connections to the other nodes, and they show the traffic between them.

The Partition Drill Down view allows us to drill down on resources being used by a specific partition that we clicked on. This view opens in a new tab in our web browser. In this view, we see CPU, Memory, Disk Iops, and Ethernet being consumed. We can also get an idea of cache and memory affinity (under Detailed LSU Breakdown).

The main panel also provides you with a view of processor utilisation for each LPAR on the system. You can easily sort LPARs based on utilisation to quickly understand which LPARs are consuming the most (or least) CPU across a single system.

Overall system processor utilisation is available from the main panel also. This view provides a graph of total processor utilisation, over time, for the entire POWER system. Directly above this graph useful information is displayed for items such as clock frequency, total cores, platform (AIX, Linux, VIOS or IBM i), system model/serial number and sample rate.

One very useful feature of PowerVP is the ability to record and play back your PowerVP sessions. By clicking on the Start Recording button, PowerVP will start to record your session to your local machine (in my case, my Windows laptop). I can then load this recording at a later date for playback inside the PowerVP GUI.

A new feature, in version 1.1.3, now allows you to run the VIOS advisor (part) directly from the PowerVP GUI. When you connect to a System level agent on a VIOS, you will be presented with the VIOS Performance Advisor panel in the GUI. You can configure PowerVP to run the VIOS advisor at a particular time or you can run it on demand. You can also retrieve previously created VIOS advisor reports from the GUI. This is a very nice feature.

When I clicked on the Run Advisor button I noticed, on my VIOS, that a new topas_nmon and part process were started. The process ran for 10 minutes (default) and then a new tar file was created in /opt/ibm/powervp/advisor.

A new tab was automatically opened in my web browser which showed me the VIOS advisor report for my VIOS. Impressive stuff!

Please refer to the PowerVP Redbook for more information how to configure and use this option.

Several features and functions have changed since the last release of PowerVP. Here is the short list of important changes I’ve encountered so far:

The new PowerVP web interface does not support the older level of PowerVP agents. You will need to update both the System level and Partition level agents to v1.1.3 in order for the new version to function.

The old PowerVP GUI was previously installed on Windows as a PowerVP.exe application. This has been replaced by a launch-powervp.bat file (a shortcut will be created on your desktop). This starts the Liberty server for the GUI. You must select to install Liberty for this file to be installed. The following screenshot lists the contents of my PowerVP GUI Installation directory for my Windows laptop.

I also came across this useful tip in the Redbook. PowerVP can record large amounts of data when recording is enabled, so you should make sure you have sufficient space available to store recorded data on your local machine. It is recommended to increase the sample rate from the default of 1 second to reduce the amount of data collected during recording. The sample rate can be changed by editing the /etc/opt/ibm/powervp/powervp.conf file and changing SampleInterval to a larger value. You need only change the sample interval on the System level agent (the Partition level agents pick up the sample interval from the System level agent). Once you've modified the powervp.conf file you must restart the PowerVP System level agent (syslet on AIX).

The aim of this post was to help you quickly install and configure PowerVP in your AIX environment. I encourage the reader to review the available PowerVP material from IBM, in particular the PowerVP Redbook, to learn more about the features and functions of the tool. This tool, finally, provides Power System administrators with a single method of obtaining some important performance data in their POWER7/POWER8 systems environment.

I finally got PowerVP installed in my lab environment today. As part of the server agent installation process I needed to install the latest service pack (SP1) for PowerVP. After I downloaded the service pack from IBM Fix Central, I tried to run the install script from the command line on my AIX partition. It failed every time I ran it, no matter what options I supplied to the installer script!

[root@gibbo]/tmp/cg # ./PowerVP.bin.SP1 -i Silent

Preparing to install...

Extracting the installation resources from the installer archive...

Configuring the installer for this system's environment...

Launching installer...

Graphical installers are not supported by the VM. The console mode will be used instead...

The installer cannot run in this UI mode. To specify the interface mode, use the -i command-line option, followed by the UI mode identifier. The valid UI modes identifiers are GUI, Console, and Silent.

The only way I could get the SP installed was to use the GUI installer. I installed VNC server on my AIX partition, connected to the VNC server and then ran the installer script (as shown in the following screenshots). This worked fine.

I was attempting to install the new AIX Runtime Expert on an LPAR today. I found that the base level fileset for this new utility was not included with the supporting Technology Level i.e. TL4 for AIX 6.1.

I was presented with the following 'Requisite Failures' message:

Requisite Failures

------------------

SELECTED FILESETS: The following is a list of filesets that you asked to
install. They cannot be installed until all of their requisite filesets
are also installed. See subsequent lists for details of requisites.

  artex.base.rte 6.1.4.1                      # AIX Runtime Expert

MISSING REQUISITES: The following filesets are required by one or more
of the selected filesets listed above. They are not currently installed
and could not be found on the installation media.

  artex.base.rte 6.1.0.0                      # Base Level Fileset

However, I discovered that the base fileset IS included with the latest Fix Pack for the Virtual I/O Server i.e. FP22 for VIOS 2.1.

I was then able to install the base filesets, i.e. the AIX Runtime Expert and the sample profiles:

artex.base
  + 6.1.4.0  AIX Runtime Expert
  + 6.1.4.0  AIX Runtime Expert sample profiles

After the filesets installed successfully I needed to then apply the latest level from TL4. e.g.
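The update itself was nothing special; something along these lines, where the directory is a placeholder for wherever your TL4 filesets live:

# install_all_updates -Yd /mnt/aix61tl4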

The AIX minidump facility was introduced with AIX 5.3 TL3. A mini dump is a small compressed dump that is stored to NVRAM when a system crashes or a dump is initiated, and then written to the AIX error log on reboot. It can be used to see some of the system’s state and do some debugging when a full dump is not available. It can also be used to get a quick snapshot of a crash without having to transfer the entire dump from the crashed system to IBM support.

Please refer to the following official guide on "How to examine a minidump in AIX".

"Using this crash stack IBM support personnel can then search through the database to find what the fault may mean".

"The RAS effort mentioned..is part of an ongoing effort by AIX to increase stability and to make more information available for troubleshooting when a problem occurs. The ability to look at minidump data has helped solve many issues that would otherwise go unresolved".

Of course, minidumps have their limitations and do not replace the need for a full system dump in many cases.

"Limitations of minidumps
A minidump is of limited or no use in situations where a server has hung. In this situation we can use mdmprpt to see what was running on each cpu. Practically speaking a full dump is usually needed to determine how and why a server has hung."

Here are some examples of how I've used minidump to assist me (and IBM support) in diagnosing the root cause of a system crash.
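In each case the basic workflow was roughly the same (the error label, the mdmprpt location and its flags are from memory, so check the guide linked above):

# errpt -a -J MINIDUMP_LOG | grep -i sequence         <- find the minidump entries and note their sequence numbers
# /usr/lib/ras/mdmprpt -l 12345 > /tmp/minidump.rpt   <- format that minidump into a readable report (12345 is an example sequence number)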

The first example is from a customer that found that one of their AIX partitions would crash when they ran optmem DPO (Dynamic Platform Optimiser) against one of their new POWER8 E880s. Immediately after optmem ran, one specific LPAR would crash. The LPAR reference LED code would show "888 102 300 C20". This LPAR was installed with AIX 7.1 TL3 SP4. When we tried to restart the partition, it would crash (and dump) several times before it would start successfully. Once the partition booted successfully, we noticed that there were several system dumps (SYSDUMP) and minidumps (COMPRESSED MINIMAL DUMP) shown in the AIX error report.

In this case, the stack trace showed messages relating to trc_generate and trc_inmem_recor. Using this information, I was able to find several hits (both internally to IBM and using my preferred internet search tool) for the potential cause of the crash. The problem was a known issue and related to the following APAR.

We verified our findings with IBM support (who performed their own internal search) and we concluded that an update would be required. We chose to update the AIX system to AIX 7.1 TL4 SP3, and the problem went away.

In the next example, the customer had updated one of their AIX partitions to AIX 7.1 TL4 SP3 (previously on AIX 7.1 TL4 SP1). This LPAR also housed a single AIX 5.3 versioned WPAR. The update was successful, but whenever they ran the stopwpar command, the entire partition would crash (dump). So, once again, I employed the mdmprpt utility to check the stack traces.

The report showed us pcmUserGetDevIn and sddUserInterfac in the stack trace. Both are related to the IBM sddpcm device driver. They appeared to be the likely culprits. Searching for these presented us with a couple of potential reasons for the crash, such as too many paths configured for a sddpcm device (which we checked and confirmed was far less than the maximum of 32). So, my next question was, what version of sddpcm was installed? We found that the customer had (unexpectedly) two different versions of sddpcm installed. In the Global LPAR, the sddpcm version was 2.6.9.0 (the latest) and in the vWPAR, 2.6.6.0. During the TL update, the customer had updated sddpcm to the latest level available, but had forgotten to update sddpcm inside the vWPAR as well. We then discovered, that if they simply ran "pcmpath query device" from inside the vWPAR, this would crash the partition. The latest sddpcm readme file contained the following information "5557 Fix for AIX 7.1 crash during pcmpath query device"! This was indeed the cause of the system crash. After the customer updated sddpcm, inside the vWPAR, the system was stable and the problem was resolved.

If you encounter a system crash in the future, and there's minidump data available, why not consider using the minidump reporting tool to analyse the issue? It might just help speed up the root cause analysis of the problem.

“This option improves the performance on 10 Gigabit Ethernet and faster adapters for workloads that manage data streaming (such as file transfer protocol (FTP), RCP, tape backup, and similar bulk data movement applications). The virtual Ethernet adapter and shared Ethernet adapter (SEA) devices are exceptions, where the large send offload option is disabled by default due to inter operability problems with the Linux or IBM i operating system. Enabling Large Send and other performance features can be done in AIX and virtual Ethernet adapter or SEA environments.”
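For reference, enabling large send in a virtual Ethernet/SEA environment typically involves something like the following (adapter names are examples; check the documentation for your AIX and VIOS levels):

# chdev -l en0 -a mtu_bypass=on            (on the AIX client interface)
$ chdev -dev ent5 -attr largesend=1        (on the VIOS SEA)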

A colleague of mine contacted me during the week to ask how one could
determine if a POWER7 system was capable of Active Memory Expansion. I thought
I'd share my response with everyone who follows my blog.

You can check if the system is capable of providing Active Memory Expansion (AME) from the HMC command line using the lssyscfg command, as shown here:

hscroot@hmc1:~> lssyscfg -r sys -m Server-8233-E8B-SN1000 -F active_mem_expansion_capable
1

Alternatively you can view the capabilities of
the managed system from the HMC GUI, under System properties/Capabilities, as
shown in the following image.

Once you've concluded that your POWER7 system is AME capable, the next step is to check if AME is enabled or disabled for an LPAR. This is easily done from the AIX command line with the lparstat and/or amepat commands, as shown in the following example: running lparstat and amepat on an LPAR with AME disabled.

Note
that I deliberately ran these commands as a non-root user to highlight the fact
that you don’t need root access to ascertain whether or not AME is active on an
LPAR.
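A quick, non-root way to see whether AME is active is to check the memory mode reported by lparstat, e.g.:

$ lparstat -i | grep -i "memory mode"

The mode is reported as "Dedicated-Expanded" (or "Shared-Expanded") when AME is enabled, and plain "Dedicated" (or "Shared") when it is not.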

By the
way, did you know that AME is now supported in SAP production environments running
on AIX and POWER7? Both SAP application servers, as well as database servers
running DB2 LUW are supported. Unfortunately there’s no support for LPARs with
Oracle RDBMS at this stage. Thanks to George Manousos from IBM Australia for
providing me with this update.

As AME is not recommended with large page support, AME will disable AIX 64KB pages by default.

AME monitoring capability in CCMS: monitoring capabilities are available with saposcol v12.46; for more details see SAP note 710975.

It looks as
though SAP have been quick to support AME on AIX/POWER7. This is a great
benefit to SAP customers running with PowerVM on the IBM POWER platform. The
following screenshot shows the SAP CCMS with AME statistics being reported.

The SAP AIX porting team continues to further improve the integration of PowerVM into SAP system monitoring. While most processor and AIX virtualization metrics could already be monitored via CCMS in the past, POWER7 Active Memory Expansion (AME) has now been included. Customers will find a new memory section in the respective CCMS panel, as depicted in this screenshot.

For those
who are not familiar with what AME actually is and how it can benefit you, I
suggest you take a look at the IBM AME Wiki site first:

I was working
with a customer recently on a Power Blade that was running the Integrated
Virtualisation Manager (IVM). They’d installed a VIO partition onto the Blade
and had hoped to install a couple of AIX LPARs on the system. However they didn’t
get very far.

As soon as they
attempted to NIM install the LPARs, they would get stuck at trying to ping the
NIM master from the client. Basically, the Shared Ethernet Adapter (SEA) was
not working properly and none of the LPARs could communicate with the external
network. So they asked for some assistance.

The Blade server
name was Server-8406-71Y-SN06BF99Z. The SEA was configured as ent7.

On the network
switch port, the native VLAN (PVID), was configured as 11, with VLAN tag 68
added as an allowed VLAN. If the client LPARs tried to access the network using
a PVID of 68, instead of a VLAN TAG of 68, they would get stuck at the switch
port i.e. the un-tagged packets for 10.1.68.X via PVID 11 would fail. The
packets for 10.1.68.X needed to be tagged with VLAN id 68 in order for the
switch to pass the traffic.

So the question
was, how do we add VLAN tags in the IVM environment? If we’d been using a HMC,
then this would be simple to fix. Just add the VLAN tags into the Virtual
Ethernet Adapter used by the SEA and we’d be done.

We had to use the
lshwres and chhwres commands to resolve this one. First we listed the virtual
adapters known to the VIO server (IVM). At slot 12, we found our SEA adapter
with port_vlan_id set to 68 and addl_vlan_ids set to none.

We needed to
change port_vlan_id to 11 and addl_vlan_ids to 68. We also required
the ieee_virtual_eth value set to 1.

First we removed
the existing SEA adapter, as we would not be able to make changes to it while
it was “active”. We then removed the adapter from slot 12 and then re-added it,
again at slot 12, with port_vlan_id
and addl_vlan_ids set to the desired
values.
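From the IVM command line, the sequence looked roughly like this (the VIOS partition name is a placeholder, and the exact attribute string should be checked against your IVM level; on an HMC you would also add -m <managed system>):

$ lshwres -r virtualio --rsubtype eth --level lpar -F lpar_name,slot_num,port_vlan_id,addl_vlan_ids,ieee_virtual_eth,is_trunk

$ chhwres -r virtualio --rsubtype eth -o r -p vios1 -s 12

$ chhwres -r virtualio --rsubtype eth -o a -p vios1 -s 12 -a "ieee_virtual_eth=1,port_vlan_id=11,\"addl_vlan_ids=68\",is_trunk=1,trunk_priority=1"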

Recently I've come across an odd issue at two different customers. I thought I'd share the experience, in case others also come across this strange behaviour.

In both cases they reported j2pg high CPU usage.

Similar to this...

And, in both cases, we discovered many /usr/sbin/update processes running. Unexpectedly.

When we stopped these processes, j2pg's CPU consumption dropped to nominal levels.

The j2pg process is responsible for, among other things, flushing data to disk and is called by the sync(d) process.

The /usr/sbin/update command is just a script that calls sync in a loop. Its purpose is to "..periodically update the super block... execute a sync subroutine every 30 seconds. This action ensures the file system is up-to-date in the event of a system crash".

Because of the large number of /usr/sbin/update (sync) processes (in some cases over 150 of them), j2pg was constantly kept busy, assisting with flushing data to disk.

It appears that the application team (in both cases) was attempting to perform some sort of SQL "update" but due to an issue with their shell environment/PATH setting they were calling /usr/sbin/update instead of the intended update (or UPDATE) command. And yes, a non-root user can call /usr/sbin/update - no problem. So, in the "ps -ef" output we found processes that looked similar to this:

The application teams were directed to fix their scripts to prevent them calling /usr/sbin/update and instead call the correct command.
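A quick way to see which update command a script will actually pick up, and to spot this problem early, is to ask the shell, e.g.:

$ whence update
/usr/sbin/update        <- the problem case: the system script is first in the PATH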

And here’s some information (more than you’ll probably ever need to know) about j2pg on AIX.

"j2pg - Kernel process integral to processing JFS2 I/O requests.

The kernel thread is responsible for managing I/Os in JFS2 filesystems, so it is normal to see it running when there is a lot of I/O or syncd activity.

And we could see that j2pg runs syncHashList() very often. The sync is done in syncHashList(). In syncHashList(), all inodes are extracted from the hash list, and whether an inode needs to be synchronized or not is then judged by iSyncNeeded().

** note that a sync() call will cause the system to scan *all* the memory currently used for file caching to see which pages are dirty and have to be synced to disk

Therefore, the cause of j2pg having this spike is determined by the two calls that were being made (iSyncNeeded ---> syncHashList). What is going on here is a flush/sync of the JFS2 metadata to disk. Apparently some program went recursively through the filesystem accessing files, forcing the inode access timestamps to change. These changes would have to be propagated to the disk.

Here's a few reasons why j2pg would be active and consume high CPU:
1. If there are several processes issuing sync then the j2pg process will be very active using CPU resources.
2. If there is file system corruption then j2pg will use more CPU resources.
3. If the storage is not moving data fast enough then the j2pg process will be using a high amount of CPU resources.

j2pg will get started for any JFS2 dir activity.
Another event that can cause j2pg activity is syncd.
If the system experiences a lot of JFS2 dir activity, the j2pg process will also be active handling the I/O.
Since syncd flushes I/O from real memory to disk, any JFS2 dirs with files in the buffer will also be hit."

It appears that the number of sync calls, which cause j2pg to run, is what causes the spikes.

We see:
/usr/sbin/syncd 60

J2pg is responsible for flushing data to disk and is
usually called by the syncd process. If you have a
large number of sync processes running on the system,
that would explain the high CPU for j2pg

The syncd setting determines the frequency with which
the I/O disk-write buffers are flushed.

The AIX default value for syncd as set in /sbin/rc.boot
is 60. It is recommended to change this value to 10.

This will cause the syncd process to run more often and not allow the dirty file pages to accumulate, so it runs more frequently but for a shorter period of time. If you wish to make this permanent then edit the /sbin/rc.boot file and change the 60 to 10.

You may consider mounting all of the
non-rootvg file systems with the 'noatime' option.
This can be done without any outage:

However selecting a non-peak production hours is better:

Use the commands below, for example:

# mount -o remount,noatime /oracle

Then use chfs to make it persistent:

# chfs -a options=noatime /oracle

- noatime -
Turns off access-time updates. Using this option can
improve performance on file systems where a large
number of files are read frequently and seldom updated.
If you use the option, the last access time for a file
cannot be determined. If neither atime nor noatime is
specified, atime is the default value."

While attending the IBM Power Systems
Symposium this week, I learnt that starting with AIX 7.1 (and AIX 6.1 TL6) JFS2
logging is disabled during a mksysb restore. You may be familiar with disabling
JFS2 logs, if not, take a look at this IBM technical note:

I’ve been unable to find any official
documentation from IBM that mentions this new enhancement to the mksysb restore
process. However, when I checked my own AIX 7.1 system I found the following
statement/code in the /usr/lpp/bosinst/bi_main script:

I received the following
question from an AIX administrator in Germany.

“Hi Chris,

on your blog, you explain how to find out the active value
of

num_cmd_elems of an fc-adapter by using the kdb. So you can
decide, if the

value of lsattr is active or not ...

I wonder if you can find out the values fc_err_recov and
dyntrk of the

fscsiX device.?

# lsattr -El fscsi0

attach        switch        How this adapter is CONNECTED          False
dyntrk        yes           Dynamic Tracking of FC Devices         True
fc_err_recov  delayed_fail  FC Fabric Event Error RECOVERY Policy  True
scsi_id       0x1021f       Adapter SCSI ID                        False
sw_fc_class   3             FC Class for Fabric                    True

I try to use echo efscsi fscsi0 | kdb .. but I can't figure
it out..

Can you help my please?”

I did a little research on his behalf
and came up with an answer. However, I’m not at all surprised he had trouble
finding the right information. It's not easy, clear or documented!

I received the following information
from my IBM AIX contacts.

“The following relies on internal structures that are subject to
change.

The procedure was tested on 6100-06, 6100-07, and 7100-01. I don't
have a lab system with physical HBAs and 5.3 at the moment.

Hopefully the same steps should work for 5.3. You may need to
first run efscsi without arguments to load the kdb module before running efscsi
fscsiX.

# kdb

(0)> efscsi fscsi1 | grep efscsi_ddi

struct efscsi_ddi ddi = 0xF1000A060084A080

(0)> dd 0xF1000A060084A080+20 2

F1000A060084A0A0: 0101020202010200 000000B400000028 ...............(
                          FFDD     NNNNNNNN

FF = fc_error_recov: 01=delayed_fail 02=fast_fail
DD = dyntrk: 00=disabled 01=enabled
NNNNNNNN = num_cmd_elems - 20 (20 reserved), e.g. 200 - 20 = 180 = B4

So in
this example, fc_err_recov is set to fast_fail (02), dyntrk is set to yes (01)
and num_cmd_elems is set to 200.“

I tested this on a lab system
running AIX 6.1 TL6 and AIX 7.1 TL1. Starting with an FC adapter with dyntrk disabled (set to no), fc_err_recov disabled (set to
delayed_fail) and num_cmd_elems set
to 500.

# lsattr -El fscsi1

attach        none          How this adapter is CONNECTED          False
dyntrk        no            Dynamic Tracking of FC Devices         True
fc_err_recov  delayed_fail  FC Fabric Event Error RECOVERY Policy  True
scsi_id                     Adapter SCSI ID                        False
sw_fc_class   3             FC Class for Fabric                    True

# lsattr -El fcs1 -a num_cmd_elems

num_cmd_elems 500 Maximum number of COMMANDS to queue to the adapter True

# kdb

(0)> efscsi fscsi1 | grep efscsi_ddi

struct efscsi_ddi ddi = 0xF1000A060096E080

(0)> dd 0xF1000A060096E080+20 2

F1000A060096E0A0: 0101020201000100 000001E000000028 ...............(
                          FFDD     NNNNNNNN

OK, let's break it down. From the kdb output we can determine the following:

- fc_error_recov is currently set to delayed_fail (FF = 01 = delayed_fail).
- dyntrk is currently set to no (DD = 00 = disabled).
- num_cmd_elems is set to 500 (NNNNNNNN = 0x1E0 = 480, plus the 20 reserved command elements).

For a virtual FC (VFC) adapter example (output not shown here), num_cmd_elems was set to 200 (C8) and max_xfer_size to 1048576 (100000).

The max_xfer_size for VFC is tricky because it is contained in a structure that can and does change between SPs and TLs. In 6100-06-01, max_xfer_size is offset 3932 bytes into the structure, so we get the value like this:

Perhaps the easiest way to handle
changes between versions is to use the fact that max_xfer_size is immediately after num_cmd_elems and that is very unlikely to change. So, knowing that
the structure size does not change by very much you can grep in the general area:

Attention: just a note about max_xfer_size
and virtual FC adapters. In my experience, if the values for this attribute on
the VIO client do not match those on
the VIO server, then you will have trouble configuring the virtual FC adapters.
Possible side effects may include your system never booting again!

So if I change the value to
0x200000 on the client, without mirroring this value on the VIO server, I may encounter
the following effects:

# rmdev -Rl fcs1

sfwcomm1 Defined
fscsi1 Defined
fcnet1 Defined
fcs1 Defined

# chdev -l fcs1 -a max_xfer_size=0x200000

fcs1 changed

The cfgmgr command will report errors for the FC adapter.

# cfgmgr

Method error (/usr/lib/methods/cfgefscsi -l fscsi1 ):
        0514-061 Cannot find a child device.
Method error (/usr/lib/methods/cfgstorfworkcom -l sfwcomm1 ):
        0514-040 Error initializing a device into the kernel.

Errors, similar to the
following, may appear in the AIX error report.

# errpt | grep fcs

0E0C5B31   0726123812  U S fcs1           Undefined error
8C9E9221   0726123812  I S fcs1           Informational message

You’ll observe messages in
the error report that claim a request from the client was rejected by the VIOS.
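If you want to see the detail behind those summary entries, errpt can show the full record for a given error ID. Using the identifier from the output above, something like:

# errpt -a -j 0E0C5B31 | more    # display the detailed error report entry for that error ID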

If you encounter this
problem, restore the client’s FC adapter attributes to their previous values
before restarting the system. If you don’t, then your LPAR may no longer boot
and may hang on LED 554. Change your VIOS first, then update your VIO clients.
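To make the order of operations concrete, here’s a rough sketch. The adapter names and value are illustrative only; the point is simply that the VIOS changes first and the client change is deferred until its next reboot:

On each VIO server (as padmin):
$ chdev -dev fcs0 -attr max_xfer_size=0x200000 -perm    # -perm updates the ODM only; takes effect after the VIOS restarts

On the VIO client (as root):
# chdev -l fcs1 -a max_xfer_size=0x200000 -P            # -P defers the change until the LPAR's next reboot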

I came across this strange LPM issue recently. Thought I’d share it with you.

All of the customer’s VIOS were configured with the viosecure level set to high.

When a VIOS is configured with a security profile of high, PCI or DoD, a new feature is enabled during LPM. This feature is called “Secure LPM”. As a result, when you initiate an LPM operation, Secure LPM automatically enables the VIOS firewall and configures a secure (IPsec) tunnel for all LPM traffic over the network.
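As an aside, you can check which security level a VIOS is running at with the viosecure command; the high profile is applied with -level high -apply. A quick sketch (the output format varies between VIOS levels):

$ viosecure -view                  # display the current security level settings
$ viosecure -level high -apply     # how a VIOS ends up at the "high" profile in the first place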

Before we started LPM, the VIOS firewall was off:

$ viosecure -firewall view

IPv4 Firewall OFF

ALLOWED PORTS

                Local   Remote
Interface       Port    Port    Service   IPAddress   Expiration
                                                      Time(seconds)
---------       ----    ----    -------   ---------   ---------------

And it was activated after we started LPM:

$ viosecure -firewall view

IPv4 Firewall ON

ALLOWED PORTS

                Local   Remote
Interface       Port    Port    Service   IPAddress   Expiration
                                                      Time(seconds)
---------       ----    ----    -------   ---------   ---------------

In the customer’s case, we moved a partition from a 795 running a pair of VIOS at 2.2.1.3 to another 795 running another pair of VIOS at 2.2.2.1. This worked OK. When we attempted to move the partition back, we received an error stating that the source MSP had rejected the request.

HSCLA230

The mover service partition on the source managed system has rejected the request to stop the migration. Verify that the migration state of the partition is Migration Starting, and try the operation again.

After a while we discovered that when the LPM operation started, the source VIOS would enable its firewall and we would immediately lose connectivity (i.e. our SSH session to the VIOS would hang). We also found that RMC connectivity between the source VIOS and the HMC dropped. Somehow the VIOS firewall was blocking network connectivity to pretty much everything.

Eventually, after much digging we found that the source VIOS was the victim of an errant firewall rule (configured on the source VIOS itself).

0 permit 0.0.0.0 0.0.0.0 0.0.0.0 0.0.0.0 yes all any 0 any 0 both both no all packets 0 all 0 none Default Rule

Once we removed that rule, we were able to move the partition back to the original frame. We are still searching for an answer to how this happened. The IBM support team believe the rule must have been created by an administrator at some point. Of course, the administrator claims that is untrue.

“The trace data shows that IVM is calling the migmover -m end command on the destination partition. That command in turn calls a system call. When everything is in the correct state, that system call will end the migration and cleanup. The problem is it is not being called because the mover has not yet been notified that VASI is in the END state as indicated by the mover's state being: MIG_STATE_COMPLETED. The trace shows this notification arriving a split second after the migmover -m end command was issued.

On a good migration, the mover is notified by VASI of the state change before IVM calls migmover -m. IVM runs migmover -m end once it gets a migration completed message from phyp. If this message is issued about the same time as the message that tells VASI that the migration has ended, a timing issue seems a real possibility.

The problem was fixed in the VIOS code. A new call was added when the "end" state change comes in from IVM/HMC. If the state is complete we are good to go, otherwise we issue an error.”

Installing this fix on my VIOS resolved the problem. My VIOS are running 2.1.1.10-FP21. If you are running FP22 for VIOS 2.1, then you must request a different ifix from IBM support.

If you run out of space in the root
file system, odd things can happen when you try to map virtual devices to
virtual adapters with mkvdev.

For example, a colleague of mine was
attempting to map a new hdisk to a vhost adapter on a pair of VIOS. Both VIOS
were running a recent level of code. He received the following error message
(see below). It wasn’t a very helpful message. At first I thought it was due to
the fact that he had not set the reserve_policy
attribute for the new disk to no_reserve
on both VIOS. Changing the value of that attribute did not help.
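For context, the commands involved were along these lines. The disk, adapter and device names here are made up for illustration:

$ chdev -dev hdisk10 -attr reserve_policy=no_reserve       # on both VIOS, before mapping the disk
$ mkvdev -vdev hdisk10 -vadapter vhost0 -dev lpar1_rootvg   # map the hdisk to the client's vhost adapter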

I found the same issue on the second VIOS, i.e. a full root file system due to a core file (from cimserver). I also found no trace of a full file system event in the error report. Perhaps someone had taken it upon themselves to “clean house” at some point and had removed entries from the VIOS error log.

Make sure
you monitor file system space on your VIOS. Who knows what else might fail if
you run out of space in a critical file system.
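A quick manual check is simple enough and is easily dropped into whatever monitoring you already run. A minimal sketch, run as root via oem_setup_env (the 90% threshold is arbitrary):

$ oem_setup_env
# df -g / /var /tmp | awk 'NR==1 || $4+0 > 90'    # flag any of these file systems above 90% used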

It’s possible, with PowerVC, to manage existing AIX and Linux partitions in your environment. This may be useful for customers that have recently installed PowerVC and already have existing AIX partitions. The question that is often asked is “Can I manage these existing partitions with PowerVC?”. These existing partitions were deployed outside of PowerVC, often long before PowerVC was even available as a product from IBM. Fortunately, the answer to this question is most likely yes!

Before you add an existing virtual machine to PowerVC, ensure that the following requirements are met:

Here’s an example of managing an existing partition with PowerVC. Click on the Hosts icon from the PowerVC home screen then click on the “Manage Existing Virtual Machines” icon.

You are presented with a drop-down list of all the (Power Systems) Hosts known to PowerVC. Select the appropriate host where the existing partition currently resides.

Next you’ll be asked if you want to manage any supported virtual machines that are not currently being managed by PowerVC. “Supported” (with PowerVC Standard Edition v1.2.1) means any virtual machine that is either using virtual Fibre Channel adapters attached to supported SAN and storage devices (typically Brocade switches and a V7000), or using virtual SCSI adapters backed by disk in a VIOS shared storage pool that is already managed by PowerVC.

Or you can select a specific virtual machine (or machines) to manage. I’ve selected one specific partition I wish to manage with PowerVC. It is an AIX partition that is connected to a shared storage pool (SSP). The SSP is already managed by PowerVC but the AIX partition is not.

A list of virtual machines (partitions) will be displayed next. Select the partition you’d like to manage with PowerVC and click on Manage.

Under the Virtual Machines view you’ll see the partition that you selected. It will appear in an Active and Pending state.

After a minute or two, the partition will change to an Active and OK state. The virtual machine is now under the management of PowerVC. You can now stop, start, restart, resize, migrate, attach volumes to, capture and/or delete the virtual machine through PowerVC.

Just as a point of interest, you can also access the Manage Existing option from the Virtual Machines view in PowerVC.

If you need to remove a partition (virtual machine) from PowerVC management (without deleting the partition) you can do so using the following procedure. From the Hosts view, select the desired Host and under Virtual Machines, select the partition you want to un-manage.

PowerVC will prompt you to confirm the removal of the virtual machine from PowerVC management. This does not delete the virtual machine; it will continue to run without disruption.

Starting with AIX 7.1, CSM is no
longer supported or available. It has been replaced by Distributed Systems
Management (DSM). Section 5.2 of the IBM AIX 7.1 Differences Guide Redbook provides
details of the new DSM capabilities.

Fortunately, DSM still provides access
to the dsh command. I’ve written about how I’ve used this utility
in the past. The new dsh command (and other tools) are
provided in the new DSM filesets named dsm.core
and dsm.dsh.

These filesets are NOT installed by default. You must
manually install them. They can be found on your AIX 7.1 media.
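Installing them is a standard installp operation. A minimal sketch, assuming the AIX 7.1 installation media (or a copy of it) is available under /mnt/aix71:

# installp -aXYgd /mnt/aix71 dsm.core dsm.dsh    # apply, auto-extend file systems, accept licences, pull in prerequisites
# lslpp -l "dsm.*"                               # verify the filesets are now installed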

If dsh is something you use, then I recommend you read the section on
DSM in the Redbook. Also take a look at section 5.2.7, Using DSM and NIM,
which describes how you can integrate DSM and NIM and completely automate
the installation of AIX:

“The
AIX Network Installation Manager (NIM) has been enhanced to work with the Distributed
System Management (DSM) commands. This integration enables the automatic
installation of new AIX systems that are either currently powered on or off.”

Although I’ve written about the dsh command before, there’s one usage
I’ve not covered, and that is using dsh to manage users across a group of
LPARs. In particular, changing a user’s password.
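To give you an idea of where this is heading, the end result is a one-liner along the following lines. This is only a sketch: the user name and password are made up, and you should check the chpasswd options on your own systems before relying on it:

# dsh "echo 'fred:N3wPassw0rd' | chpasswd -c"    # -c clears the password flags so the user isn't forced to change it at next login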

Before I go any further, I should
state that for the following to work you must first configure ssh keys on your
NIM master (or central management AIX system) so that you can communicate with all of
your AIX systems via SSH, as root, without being prompted for a password. Read my article on dsh to find out how to do this if
necessary.

In the following example, I use dsh from my NIM master. It is my
central point of control for my AIX environment.

My ssh keys for root on my NIM master
have been generated and distributed to all of my LPARs.

root@nim# ssh-keygen -d

Generating public/private dsa key pair.

Enter file in which to save the key (/.ssh/id_dsa):

Enter passphrase (empty for no passphrase):

Enter same passphrase again:

Your identification has been saved in /.ssh/id_dsa.

Your public key has been saved in /.ssh/id_dsa.pub.

The key fingerprint is:

ed:18:e9:00:37:13:7c:7c:74:6a:a9:e0:ad:c0:09:a9
root@nim

The key's randomart image is:

+--[ DSA 1024]----+
|    ...etc...    |
+-----------------+

root@nim# ls -ltra

total 40
-rw-------    1 root     system       214 17 Sep 2010  authorized_keys
drwxr-xr-x    7 root     system      4096 16 Nov 11:43 ..
-rw-r--r--    1 root     system      3615 16 Nov 12:04 known_hosts
-rw-r--r--    1 root     system       601 16 Nov 12:06 id_dsa.pub
-rw-------    1 root     system       672 16 Nov 12:06 id_dsa
drwx------    2 root     system       256 16 Nov 12:06 .

On my AIX LPARs, the authorized_keys file has been updated with
the public ssh key from my NIM master:

On the NIM master, the root user was
configured for the DSH environment. The following entry was placed in root’s .profile:

root@nim# cat /.profile

ENV=$HOME/.kshrc

The following entries were placed in
root’s .kshrc file:

root@nim# cat /.kshrc

export DSH_NODE_RSH=/usr/bin/ssh

export DSH_NODE_LIST=/usr/local/etc/nodes

A /usr/local/etc/nodes
file was created on the NIM master. This file contains a list of each of the
nodes that dsh can communicate with
from NIM:

root@nim# cat /usr/local/etc/nodes

aixlpar1

aixlpar2

aixlpar3

aixlpar4

aixlpar5

aixlpar6

aixlpar7

aixlpar8

aixlpar9

aixlpar10

aixlpar11

The first time that the dsh command is run against a new host,
the following message will be displayed. dsh
uses the FQDN, and the FQDN needs to be added to ssh’s known_hosts file. Therefore you must first make an ssh connection to the
host using its FQDN:

root@nim# dsh uptime

aixlpar1.cg.com.au: Host key verification failed.

dsh:2617-009 aixlpar1.cg.com.au remote shell had exit code 255

It is necessary to ssh directly to each node using its
FQDN. This step is only required once for each node. For example:

root@nim# ssh aixlpar1.cg.com.au

The authenticity of host 'aixlpar1.cg.com.au (172.1.6.17)' can't be established.