Are you backing up your AIX systems over Virtual Ethernet adapters? Of course you are; who isn't, right? Are your backup server and clients on the same physical POWER system? You are most likely backing up over Virtual Ethernet to another AIX LPAR that is running your enterprise backup software, such as TSM or Legato NetWorker. And you probably have a dedicated private virtual network (and adapters) on both the clients and the server to handle the traffic for the nightly backups. The next question is: have you tuned your Virtual Ethernet adapters?

There are several tips available for tuning your Virtual Ethernet adapters for better performance on AIX. These tips include changing settings such as the MTU size and TCP window sizes, enabling largesend, and so on. I highly recommend the blog posts from Anthony English and Nigel Griffiths on this subject.
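As one example of that kind of tuning, largesend (TCP segmentation offload) can be enabled on an AIX interface with chdev or ifconfig. This is only a sketch; the interface name (en0) is illustrative, and you should check the guides above (and test in your environment) before applying anything:

```shell
# Check the current largesend setting on interface en0 (name is illustrative)
lsattr -El en0 -a mtu_bypass

# Enable largesend persistently (recent AIX levels expose it as the
# mtu_bypass attribute on the en interface)
chdev -l en0 -a mtu_bypass=on

# On older levels, largesend can be toggled at runtime instead
# (this form does not persist across a reboot):
ifconfig en0 largesend
```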

OK, so you've got everything humming along nicely, your backups are flying over the virtual network (across the POWER hypervisor) and everybody is happy. After a period of time, you notice that the backups have started to “slow down”. They are taking longer to finish. The overall throughput of a backup drops. Some backups start in the evening around 9pm and are still running the next morning at 7am! In some cases you need to kill the backups or even reboot the backup server LPAR for things to return to normal.

“What is going on!?” you cry.

Well, there are a number of reasons
why this could be happening. For example, your shared processor pool may be
overwhelmed during the backup window. As we know, Virtual Ethernet adapters require
CPU to do their work. If the CPU pool is running low on available CPU
resources, this could contribute to the problem. And of course there could be
tuning issues with the Virtual Ethernet adapters or the AIX OS in general. Or
there may be issues with other pieces of the infrastructure, like network and
SAN switches, adapters, etc. Perhaps there’s an issue with the applications
and/or databases on the AIX systems? They often have their own mechanisms/tools
for backing up their data to your enterprise backup software. Is the backup server sized to cope with the load, i.e. sufficient CPU, memory, disk layout and I/O, tape drives, disk storage pools, etc.?

So assuming you’ve checked all of
the above (and more), then perhaps you’ve hit a problem that I encountered
recently. In my particular case, backups “over the hypervisor” were slowing
down, without any discernible cause. Initially the backups would be “very fast”
but after a month or so, things would start to slow down dramatically.

We noticed that there were very large (and increasing) values for “Packets Dropped”, “Hypervisor Send/Receive Failures” and “No Resource Errors” in the output from the netstat -v command.

ETHERNET STATISTICS (ent1) :
Device Type: Virtual I/O Ethernet Adapter (l-lan)
Hardware Address: 41:ba:13:e7:25:0b
Elapsed Time: 42 days 4 hours 3 minutes 34 seconds

Transmit Statistics:                  Receive Statistics:
--------------------                  -------------------
Packets: 5978589961                   Packets: 26139832411
Bytes: 779465989202                   Bytes: 711051516630458
Interrupts: 0                         Interrupts: 6804561727
Transmit Errors: 0                    Receive Errors: 0
Packets Dropped: 0                    Packets Dropped: 86012309

...

Max Collision Errors: 0               No Resource Errors: 46113807

...

Hypervisor Send Failures: 0
  Receiver Failures: 0
  Send Errors: 0
Hypervisor Receive Failures: 46113807

After some discussion with IBM AIX support, we discovered that we should increase some of the buffer sizes for our Virtual Ethernet adapter (the entX device). This would alleviate the “No Resource Errors” we’d been experiencing. Looking at the output from the netstat -v command, we also noticed that the Medium, Large and Huge buffers had all reached their maximum values in the past.

...

Receive Information
  Receive Buffers
    Buffer Type          Tiny    Small   Medium  Large   Huge
    Min Buffers           512      512      128     24     24
    Max Buffers          2048     2048      256     64     64
    Allocated             513      535      148     28     64
    Registered            512      510      127     24     13
  History
    Max Allocated         576      951      256     64     64
    Lowest Registered     502      502       64     12     11

...

The advice from IBM support was to increase these buffers using the chdev command (they also advised that we should reboot for the changes to take effect).
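I won't reproduce the exact numbers IBM support gave us, but the change looks something like the following sketch. The adapter name (ent1) and the buffer values here are illustrative assumptions for this example, not the specific figures from our case; check the valid ranges on your system with lsattr -Rl entX -a <attribute>:

```shell
# Raise the minimum and maximum Medium, Large and Huge receive buffers on the
# virtual adapter (ent1 and all values are illustrative). The -P flag writes
# the change to the ODM only, so it takes effect at the next reboot.
chdev -l ent1 -a min_buf_medium=256 -a max_buf_medium=512 -P
chdev -l ent1 -a min_buf_large=48 -a max_buf_large=128 -P
chdev -l ent1 -a min_buf_huge=48 -a max_buf_huge=128 -P

# Confirm the pending attribute values before rebooting
lsattr -El ent1 | grep buf
```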

Since implementing this
tuning change (to the adapter on the backup server), we have not had a repeat
of the problem. We will continue to monitor the performance and I’ll be sure to
let everyone know if we have further issues.
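If you want to keep an eye on these counters over time, one simple approach is to snapshot the relevant fields periodically (from cron, say) and watch for growth. A rough sketch; the adapter name (ent1) and log file location are assumptions:

```shell
#!/bin/sh
# Periodically record the error counters that flagged this problem.
# Adapter name (ent1) and log path are illustrative assumptions.
ADAPTER=ent1
LOG=/tmp/veth_buffer_stats.log

date >> "$LOG"
entstat -d "$ADAPTER" | egrep -i \
    'Packets Dropped|No Resource Errors|Hypervisor Receive Failures' >> "$LOG"
```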

Whenever I’m building a new AIX system I always make sure to install lsof. I really like the fact that I can quickly list processes that are connected to TCP and UDP ports on my system. For example, to check the current SSH connections on my system I can run lsof and check port 22 (SSH). Immediately I have a good idea of the existing SSH sessions/connections. I can also check to see if the SSH server (the sshd daemon) is running and listening (LISTEN) on my AIX partition.
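That check looks something like this (the output will obviously differ per system):

```shell
# List processes with sockets on TCP port 22 (SSH): both the listening sshd
# and any established client sessions show up, with their PIDs.
lsof -i tcp:22
```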

But
sometimes I work on systems that don’t have lsof installed. It may not be practical or appropriate for me to
install it either. So I have
to find another tool (or tools) that will do something similar.

Of course, I could use netstat to check that a server daemon was listening on a particular TCP port and view any established connections. But this doesn’t give me the associated process IDs.

$ netstat -a | grep -i ssh

tcp4       0      0  *.ssh                 *.*                    LISTEN
tcp4       0     48  aix01.ssh             172.29.131.16.50284    ESTABLISHED

Fortunately, the rmsock command can provide that information. So if I wanted to find the process ID for the sshd daemon that is listening on my system, I’d do the following. First, I need to find the socket ID using netstat*.

# netstat -@aA | grep -i ssh | grep LIST | grep Global

Global  f1000700049303b0  tcp4       0      0  *.ssh             *.*               LISTEN

Then I can use rmsock to discover the process ID associated with the socket. In this case it’s PID 282700.

$ rmsock f1000200003e9bb0 tcpcb

The socket 0x3e9808 is being held by process 282700 (sshd).

Despite what its name implies, rmsock does not remove the socket if it is being used by a process. It just reports the process holding the socket. Note that the second argument of rmsock is the protocol; it's tcpcb in this example to indicate that the protocol is TCP. The results of the command are also logged to /var/adm/ras/rmsock.log.
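Putting the two steps together, a small helper along these lines can map a listening TCP port to the PID holding it. This is only a sketch under a few assumptions: it must run as root, it uses the numeric output of netstat -Aan (where -A prints the socket PCB address in the first column), and a wildcard listener of the form *.<port>:

```shell
#!/bin/sh
# Usage: ./port2pid.sh 22
# Find the kernel socket address(es) for a LISTENing TCP port, then ask
# rmsock which process holds each one. Assumes AIX and root privileges.
PORT=$1

# First column of netstat -Aan is the PCB address; match "*.<port>" listeners.
SOCKETS=$(netstat -Aan | grep LISTEN | grep "\*\.${PORT} " | awk '{print $1}')

for s in $SOCKETS; do
    rmsock "$s" tcpcb
done
```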

#
tail /var/adm/ras/rmsock.log

socket
0xf100020001c45008 held by process 434420 (writesrv) can't be removed.

socket
0xf100020000663008 held by process 418040 (java) can't be removed.

socket
0xf1000200012ad008 held by process 418040 (java) can't be removed.

socket
0xf100020000dec008 held by process 163840 (inetd) can't be removed.

socket
0xf100020000deb008 held by process 163840 (inetd) can't be removed.

socket
0xf10002000016f808 held by process 192554 (snmpdv3ne) can't be removed.

socket
0xf100020001c51808 held by process 442596 (dtlogin) can't be removed.

socket
0xf1000200012a4008 held by process 418040 (java) can't be removed.

socket
0xf100020000666008 held by process 315640 (java) can't be removed.

socket
0xf100020000deb808 held by process 163840 (inetd) can't be removed.

*Note: In my example I specified the @ symbol with the netstat command. I also grep’ed for the string Global. You may have to do the same if you have WPARs running on your system. In my case I have two active WPARs, both of which have their own sshd process. My Global environment also has an sshd process. So in total there are three sshd daemons that I can view from the Global environment. By specifying the @ symbol with netstat, I can quickly determine which process belongs to the Global environment and which exist within each WPAR.