Then I went to http://support.oracle.com and searched for patch 122300-63. The patch's information page says I'll need the "Vintage Solaris download access" privilege to download this patch, but none of my CSIs had that privilege.

As this account issue may take some time to resolve, I chose the cluster patch (also called patchset) method to do the patching on Solaris 9. Here are the steps for cluster patching on Solaris 5.9:

1. Download the Recommended patch cluster (patchset) zip for Solaris 9.

2. Unzip the package and read the Recommended.README file that comes with it.

3. Ensure there's enough free space on / and /var (preferably >4 GB).

4. Now run ./install_patchset or ./install_cluster (you can add the -nosave parameter if you have limited free space on / and /var, but then you will not be able to back out individual patches if the need arises).

5. For more installation messages, refer to the installation logfile: /var/sadm/install_data/<patchset-name>_log

6. Reboot your machine so that all patches take effect on your host.
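The steps above can be sketched as a shell session. This is only a sketch: the patchset name 9_Recommended and the working directory are illustrative, so follow the Recommended.README that ships with the cluster you actually downloaded.

```shell
# Sketch only -- "9_Recommended" is an example patchset name.
cd /var/tmp
unzip 9_Recommended.zip
cd 9_Recommended
more Recommended.README        # read the install notes first

df -k / /var                   # confirm / and /var have enough free space (>4 GB)

# -nosave skips the patch backout data; only use it if space is tight,
# since it removes the ability to back out individual patches later.
./install_cluster              # or: ./install_cluster -nosave

# Watch detailed progress in the install log:
tail -f /var/sadm/install_data/9_Recommended_log

init 6                         # reboot to make the patches take effect
```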

NB:

If you have RAID 1 (mirroring) on your Solaris system, you can patch one submirror first, and then apply the patches to the whole system once the server boots up cleanly. You can refer to the following for more information:

2. Next, we tried booting up without the corrupted /apps/kua (this seems to be a tricky one in RHCE). After commenting out /apps/kua in /etc/fstab, we got a "read only" error. We tried mount -o rw,remount /, but it still didn't work.

3. Finally we thought of the rescue CD. After commenting out /apps/kua in /etc/fstab, the system finally booted up, and it automatically updated the related SELinux policy. Now the only things left are to back up the contents under /apps/kua, create a new partition mounted at /apps/kua, and finally copy the contents back to the newly created partition.
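Those remaining steps might look like this on a RHEL-style system. The device name /dev/sdb1, the ext3 filesystem type, and the backup location are all assumptions for illustration; adjust them to your layout.

```shell
# Assumes the old /apps/kua data is still readable and /dev/sdb1 is the
# newly created partition -- both are placeholders.
mkdir -p /backup
cp -a /apps/kua/. /backup/kua/     # back up the old contents

mkfs -t ext3 /dev/sdb1             # put a filesystem on the new partition
mount /dev/sdb1 /apps/kua
cp -a /backup/kua/. /apps/kua/     # copy the contents back

# Re-enable the mount in /etc/fstab once the new partition mounts cleanly:
echo '/dev/sdb1 /apps/kua ext3 defaults 1 2' >> /etc/fstab
```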

PS:
Actually, this problem may be resolved by mounting with an alternative superblock. More details can be found here:
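As a concrete sketch for an ext2/ext3 filesystem, that usually means pointing fsck or mount at a backup superblock. The device name and block numbers below are typical defaults, not guaranteed values for your filesystem; list the real backup locations first.

```shell
# List where the backup superblocks were written (ext2/ext3):
dumpe2fs /dev/sdb1 | grep -i superblock

# Check the filesystem using a backup superblock -- 32768 is the usual
# first backup on a filesystem with 4 KB blocks; yours may differ.
fsck -b 32768 /dev/sdb1

# Or mount directly with an alternative superblock. mount's sb= option
# counts in 1 KB units, so 4 KB block 32768 becomes 131072 here:
mount -o sb=131072 /dev/sdb1 /apps/kua
```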

If you're running NetBackup to back up data to tapes and you find that some NetBackup jobs hang, you may want to unmount the filesystems NetBackup is using in order to release those jobs.

But a small note here:

umount -l (lazy unmount: detach the filesystem from the filesystem hierarchy now, and clean up all references to it as soon as it is no longer busy) will not kill the NetBackup job. umount -l is potentially dangerous, and I would discourage its use unless it's clearly the only way forward; it's certainly not something we would want to put in a script.

Essentially it marks the filesystem as unmounted, effectively blocking any new processes from accessing it, but all processes that already have open handles can still traverse directories, read and write files, etc. This means mounting the filesystem again in the same location will create a situation where:
a) data in the backup is not consistent (some data from the old BCV snapshot, some from the newer one);
b) metadata is corrupted (the kernel will eventually unmount the FS once the first NetBackup process ends and record this in the superblock; the next umount will likely fail to handle this).
Worst of all, this masks the issue instead of resolving it, so I agree that we should chase the NetBackup team for the correct resolution.
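Before unmounting anything out from under NetBackup, it's safer to see who is holding the filesystem open. A minimal check might look like this; the mount point /backup_mnt is an example name.

```shell
# List PIDs with files open under the mount point (works on Solaris and Linux):
fuser -c /backup_mnt

# More detail on Linux, if lsof is installed:
lsof +f -- /backup_mnt

# Only after identifying (and stopping) the owning processes, do a
# normal umount -- avoid umount -l for the reasons given above:
umount /backup_mnt
```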

If Solaris SVM (Solaris Volume Manager) is broken, and the broken metadevice is the one for the root disk, the system will fail to boot. We can then try booting from the mirror disk directly rather than through SVM. If the mirror is in good condition, the system will boot up, and once it's up we can repair the broken SVM configuration.

Here are the steps to boot Solaris from the mirror disk without SVM:

1. Prepare a CD/DVD with the Solaris version matching your host.

2. Go to the ok prompt.

3. ok> boot cdrom -s (or boot net -s)

4. Mount the root slice on /a.

5. Back up the /a/etc/vfstab and /a/etc/system files.

6. Modify the entries in the vfstab and system files under /a/etc:

7. Edit the /a/etc/system file and remove the "rootdev" line shown below:

# vi /a/etc/system
*rootdev:/pseudo/md@0:0,0,blk #yours may be different
------> Do not comment the line. Remove it.

8. In the /a/etc/vfstab file, replace the lines for the system filesystem
metadevices with their underlying partitions.

For example, change lines from:

/dev/md/dsk/d0 /dev/md/rdsk/d0 / ufs 1 no -

to:

/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -

ONLY change the lines for root (/) and the filesystems that were affected. All other metadevices may stay as-is in this file.
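Steps 4 through 8 might look like this from the single-user shell after booting off the media. The slice c0t0d0s0 is an example; use the slice that actually underlies your root metadevice.

```shell
# Booted from media with: boot cdrom -s
mount /dev/dsk/c0t0d0s0 /a            # mount the root slice of the good submirror

cp /a/etc/vfstab /a/etc/vfstab.orig   # back up before editing
cp /a/etc/system /a/etc/system.orig

vi /a/etc/system                      # delete (do not comment) the rootdev line
vi /a/etc/vfstab                      # change /dev/md/dsk/d0 -> /dev/dsk/c0t0d0s0
                                      # (and the rdsk entry to match)

umount /a
init 6                                # reboot from the physical slice
```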

If you find inconsistent paths on your VxFS-based filesystems, you may consider re-initializing the Veritas device layouts, i.e. removing all entries from /etc/vx/dmp and /etc/vx/rdmp and recreating them later.

---Before starting the implementation, freeze the VCS cluster on each node

hasys -freeze testhost

---Kill the vxconfigd daemon #This step is not required on Solaris 10 with VxVM 5.0. Note that the "-k" argument does not kill vxconfigd, it restarts it, so use kill -9:

# kill -9 <pid of vxconfigd>

---Stop the eventsource daemon

# vxddladm stop eventsource

---Remove all rdmp and dmp entries from /etc/vx/dmp and /etc/vx/rdmp

---Move /etc/vx/array.info to something like /etc/vx/array.info.old

---Do the same for /etc/vx/jbod.info or /etc/vx/disk.info #optional

You may also need to make changes to the underlying storage layer, if part of your issue involves device path confusion in the underlying OS. Run cfgadm or devfsadm -C (on Solaris), or whatever your platform requires, to get the OS's view of the devices into the state you want. There are further things that can be done which might seem extreme or risky; they require a sound knowledge of device configuration too large to include here. A reasonably current OS would not normally require a reboot to address issues like these, though it may sometimes be more expedient to reboot anyway.

---Start vxconfigd if you had to kill it earlier

# vxconfigd -x syslog -m boot

Wait a minute for it to return, or just nohup it in the first place.
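Put together, the DMP re-initialization might be scripted roughly like this. The host name testhost and the pgrep usage are assumptions; on a real cluster you would verify each step interactively rather than run this blind.

```shell
hasys -freeze testhost                      # freeze VCS on each node first

kill -9 "$(pgrep -x vxconfigd)"             # skip on Solaris 10 with VxVM 5.0
vxddladm stop eventsource                   # stop the event source daemon

rm -f /etc/vx/dmp/* /etc/vx/rdmp/*          # clear stale DMP device nodes
mv /etc/vx/array.info /etc/vx/array.info.old
# optionally: mv /etc/vx/jbod.info /etc/vx/jbod.info.old

devfsadm -C                                 # clean up stale OS device paths (Solaris)

vxconfigd -x syslog -m boot                 # restart vxconfigd if you killed it
```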

This article shows how to extend VxVM filesystems using LUNs from an EMC DMX3-24 array (on HP-UX).

First, do the WWN zoning and LUN masking on the SAN switches/storage controller. (That part will not be shown in this article; more is here. LUN mapping is assigning a logical number to a LUN and presenting it to a host. LUN masking allows only selected hosts to see a particular logical unit, whereas zoning allows hosts to see a particular storage array or other Fibre Channel device in the fabric.)

Now suppose you already have LUN 23AB allocated from the DMX array to this host; here are the steps to add it to the OS and extend a VxFS-based filesystem.
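On HP-UX with VxVM, those steps usually boil down to something like the following sketch. The disk group datadg, volume datavol, disk media name, device path, and the 10 GB growth are all illustrative values, not taken from the original setup.

```shell
ioscan -fnC disk                 # rescan the SCSI/FC buses for the new LUN
insf -e                          # create device files for newly found devices

vxdctl enable                    # make vxconfigd rescan for new disks
vxdisk list                      # the new LUN should show up here

# Add the LUN to the disk group (names are placeholders):
vxdg -g datadg adddisk dmx23ab=c10t0d1

# Grow the volume and its VxFS filesystem in one step:
vxresize -g datadg -F vxfs datavol +10g
```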

The -xn directives instruct iostat to report extended disk statistics in tabular form, and to display the names of the devices in descriptive format (for example, server:/export/path). The following example shows the output of iostat -xn 20 during NFS activity on the client, while it concurrently reads from two separate NFS filesystems. The server assisi is connected to the same hub as the client, while the test server paris is on the other side of the hub and across the building's network switches. The two servers are identical; they have the same memory, CPU, and OS configuration:

The iostat utility iteratively reports the disk statistics every 20 seconds and calculates its statistics based on a delta from the previous values. The first set of statistics is usually uninteresting, since it reports the cumulative values since boot time. You should focus your attention on the following set of values reporting the current disk and network activity. Note that the previous example does not show the cumulative statistics. The output shown represents the second set of values, which report the I/O statistics within the last 20 seconds. The first two lines represent the header, then every disk and NFS filesystem on the system is presented in separate lines. The first line reports statistics for the local hard disk c0t0d0. The second line reports statistics for the local floppy disk fd0. The third line reports statistics for the volume manager vold. In Solaris, the volume manager is implemented as an NFS user-level server. The fourth and fifth lines report statistics for the NFS filesystems mounted on this host. Included in the statistics are various values that will help you analyze the performance of the NFS activity:

r/s

Represents the number of read operations per second during the time interval specified. For NFS filesystems, this value represents the number of times the remote server was called to read data from a file, or read the contents of a directory. This quantity accounts for the number of read, readdir, and readdir+ RPCs performed during this interval. In the previous example, the client contacted the server assisi an average of 34.1 times per second to either read the contents of a file, or list the contents of directories.

w/s

Represents the number of write operations per second during the time interval specified. For NFS filesystems, this value represents the number of times the remote server was called to write data to a file. It does not include directory operations such as mkdir, rmdir, etc. This quantity accounts for the number of write RPCs performed during this interval.

kr/s

Represents the number of kilobytes per second read during this interval. In the preceding example, the client is reading data at an average of 1,092.4 KB/s from the NFS server assisi. The optional -M directive would instruct iostat to display data throughput in MB/sec instead of KB/sec.

kw/s

Represents the number of kilobytes written per second during this interval. The optional -M directive would instruct iostat to display data throughput in MB/sec.

wait

Reports the average number of requests waiting to be processed. For NFS filesystems, this value gets incremented when a request is placed on the asynchronous request queue, and gets decreased when the request is taken off the queue and handed off to an NFS async thread to perform the RPC call. The length of the wait queue indicates the number of requests waiting to be sent to the NFS server.

actv

Reports the number of requests actively being processed (i.e., the length of the run queue). For NFS filesystems, this number represents the number of active NFS async threads waiting for the NFS server to respond (i.e., the number of outstanding requests being serviced by the NFS server). In the preceding example, the client has on average 3.2 outstanding RPCs pending for a reply by the server assisi at all times during the interval specified. This number is controlled by the maximum number of NFS async threads configured on the system. Chapter 18 will explain this in more detail.

wsvc_t

Reports the time spent in the wait queue in milliseconds. For NFS filesystems, this is the time the request waited before it could be sent out to the server.

asvc_t

Reports the time spent in the run queue in milliseconds. For NFS filesystems, this represents the average amount of time the client waits for the reply to its RPC requests, after they have been sent to the NFS server. In the preceding example, the server assisi takes on average 93.2 milliseconds to reply to the client's requests, where the server paris takes 336.7 milliseconds. Recall that the server assisi and the client are physically connected to the same hub, whereas packets to and from the server paris have to traverse multiple switches to communicate with the client. Analysis of nfsstat -s on paris indicated a large amount of NFS traffic directed at this server at the same time. This, added to server load, accounts for the slow response time.

%w

Reports the percentage of time that transactions are present in the wait queue ready to be processed. A large number for an NFS filesystem does not necessarily indicate a problem, given that there are multiple NFS async threads that perform the work.

%b

Reports the percentage of time that actv is non-zero (at least one request is being processed). For NFS filesystems, it represents the activity level of the server mount point. 100% busy does not indicate a problem, since the NFS server has multiple nfsd threads that can handle concurrent RPC requests. It simply indicates that the client has had requests continuously processed by the server during the measurement time.

First, this is a known bug in the Intel e1000e driver on Linux platforms. It is a driver problem with the Intel 82574L (an MSI/MSI-X interrupt issue). The internet connection drops now and then, and nothing is logged about it, which is very bad for troubleshooting.
You can see more bug reports about this at https://bugzilla.redhat.com/show_bug.cgi?id=632650

Fortunately, we can resolve this by installing the kmod-e1000e package from ELRepo.org. To solve this, do the following (ignore the lines with strikeouts):

pcie_aspm is short for Active-State Power Management. It is related to the PCIe power-saving mechanism; you can get more info here.

acpi is short for Advanced Configuration and Power Interface; you can refer here.

apic is short for Advanced Programmable Interrupt Controller; it is related to IRQ (Interrupt Request) handling. The APIC is one of many kinds of PICs, and Intel and some other NICs support this feature. You can read more about it here.
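If you go the kernel-parameter route, options like these are appended to the kernel line in the GRUB config. Treat this as a sketch: the kernel version shown is a placeholder, and whether you need each option (or others such as noapic) is system-specific.

```shell
# /boot/grub/grub.conf -- append the parameter(s) to your existing kernel line.
# The kernel version and root device below are placeholders:
#
#   kernel /vmlinuz-2.6.18-194.el5 ro root=/dev/VolGroup00/LogVol00 pcie_aspm=off
#
# pcie_aspm=off disables PCIe Active-State Power Management, which has been
# implicated in e1000e/82574L link drops on some systems.
```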

Now reboot your machine, and you should have much more stable networking!

PS2:

The reason there are so many strikeouts in this article is that I struggled a lot with this bug. At first I thought it was caused by a kernel bug in the e1000e driver, so after some searching, I installed the kmod-e1000e driver and modified the kernel parameters. Things got better for a short time. Later, I found the issue was still there, so I tried compiling the latest e1000e driver from Intel. But neither of these worked.

Later, I ran a script that monitored the network around the times the NIC went down. After the NIC had failed several times, I found that Tx traffic was very high each time it failed (TX bytes went up by about 5 Gb in a very short time). Based on this, I realized there might be a DoS attack on the server. Using ntop and tcpdump, I found that DNS traffic was very large, but my host was not providing DNS services at all!

Then I wrote some iptables rules to disallow DNS queries and so on, and after that the host became stable again! Traffic went back down to normal, and everything is back on track. I'm so happy and excited about this, as it's the first time I've stopped a DoS attack!
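For a host that shouldn't serve DNS at all, the rules might look like this sketch. The rate-limit value and the RHEL-style save command are assumptions; adapt them before applying, and test from a console in case you lock yourself out.

```shell
# Drop inbound DNS queries -- this host is not a DNS server:
iptables -A INPUT -p udp --dport 53 -j DROP
iptables -A INPUT -p tcp --dport 53 -j DROP

# Optionally rate-limit new connections to blunt floods; packets over
# the limit fall through to later rules instead of being accepted here:
iptables -A INPUT -m state --state NEW -m limit --limit 50/s -j ACCEPT

service iptables save     # persist the rules on RHEL/CentOS
```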

This problem is due to a bug in the MSI and/or MSI-X interrupt handling of Intel NICs. To solve it, download the latest Intel 82574L driver here. After downloading the source tarball to your server, follow these steps from the driver's README file:

untar: tar zxf e1000e-x.x.x.tar.gz

cd e1000e-x.x.x/src/

make CFLAGS_EXTRA=-DDISABLE_PCI_MSI install #this step is critical

rmmod e1000e; modprobe e1000e

add e1000e to /etc/modprobe.conf

reboot server

After that, when you check the Intel e1000e driver module, you should see:

If the check shows DISABLED, first check your hardware connections. After that, run symcfg discover to refresh the SYMAPI database files on the host, and then use vxdmpadm enable ctlr=<your controller tag> to enable the controller, or vxdmpadm enable path=<your path tag> to enable DMP on the specific path.