
Background
Assume a common three-tier architecture on a web farm, with web, application and database server layers. Load balancing is performed by an F5 BIG-IP LTM 1600 load balancer, and logging takes place on the web tier, which runs Apache web servers.

Problem
When you review the access logs of the Apache web servers, the only client IP address recorded for all requests is that of the F5 load balancer. Assuming that the load balancer address is 10.10.10.10, every log entry shows 10.10.10.10 as the requestor.

Mistake
By default this F5 load balancer performs SNAT (Source Network Address Translation), which is why the requestor IP address is always that of the load balancer.

Resolution
The solution is to use the X-Forwarded-For (XFF) HTTP header field. On the load balancer side, first follow these steps in the BIG-IP configuration utility:
– Go to “Local Traffic”
– Select “Profiles”
– On the “Services” menu choose “HTTP”
– Create a new profile by clicking on “Create”
– Activate “Insert X-Forwarded For” check box and select “Enabled” from the menu
– Finally click on “Update”
Finally, assign this new HTTP profile to the virtual servers that should receive the XFF HTTP header field.
On the web server side, create a custom log format on the virtual hosts for which you want proper source IP address logging. The custom log format must include the XFF field.
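The original example is not part of this copy, so what follows is only a minimal sketch of such a log format. It takes Apache's standard "combined" format and replaces the client address (%h) with the X-Forwarded-For request header. The format nickname "xff_combined" and the log file name are hypothetical; the small shell function just prints the fragment so it can be redirected into the virtual host's configuration file.

```shell
#!/bin/sh
# Print an Apache configuration fragment that logs the X-Forwarded-For
# header in place of the client address (%h) of the "combined" format.
# Assumptions: "xff_combined" and "logs/example_access_log" are made-up names.
xff_log_fragment() {
    cat <<'EOF'
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" xff_combined
CustomLog logs/example_access_log xff_combined
EOF
}

xff_log_fragment
```

Requests that reach the web server without passing through the load balancer carry no XFF header, so Apache logs "-" in that field for them.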

Background
After hardware maintenance performed by a company, some virtual machines could no longer be backed up by the VMware solution of Symantec NetBackup.

Problem
All the affected virtual machines were hosted on the same VMware ESX server, and when you logged in to VMware vSphere you received the error below.

Unable to apply DRS resource settings on host 'hostname.somewhere' in SomeDatacenter'(Reason:A general system error occured: Invalid fault). This can significantly reduce the effectiveness of DRS.

Mistake
After the hardware maintenance the engineers did not check that all the VMware services were running properly. In this case Distributed Resource Scheduler (DRS) had some issues on this specific server.

Resolution
This is very simple. A restart of the hostd daemon will almost certainly fix the problem. In this case after restarting the management services everything went back to normal operation.
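The post does not show the exact restart command. As a hedged sketch, the helper below only prints the command one would typically run on the host's console: it assumes that classic ESX ships the mgmt-vmware init script and that ESXi ships /etc/init.d/hostd. The function name and its optional alternate-root argument are hypothetical and exist only to keep the example testable.

```shell
#!/bin/sh
# Print (do not run) the command that restarts the management agent (hostd).
# Assumption: classic ESX has /etc/init.d/mgmt-vmware, ESXi has /etc/init.d/hostd.
# $1 is an optional alternate root directory, used only for dry runs.
hostd_restart_cmd() {
    root="${1:-}"
    if [ -x "$root/etc/init.d/mgmt-vmware" ]; then
        echo "service mgmt-vmware restart"
    elif [ -x "$root/etc/init.d/hostd" ]; then
        echo "/etc/init.d/hostd restart"
    else
        echo "no known hostd init script found" >&2
        return 1
    fi
}

hostd_restart_cmd || true
```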

Many people have difficulty configuring RAID-10 with MegaRAID because if you have, for example, four hard disks and add them all to a single Disk Group, the only available options are RAID-0 and RAID-1.
Here is how to do this on an IBM System x3650 server with four 300GB SAS hard disks.

During the system boot you will be given the following options.

Select “Diagnostics” (in this case using the F2 key) and, once MegaRAID has loaded, press the “Ctrl+H” key combination to enter the WebBIOS configuration utility.

From the above capture you can also see that there is no virtual disk configured and the controller detected four JBOD disks.
When you enter the utility, you will have the ability to select the adapter you want to configure. In our case we only have one adapter so it is very straightforward.

Next, we have the MegaRAID BIOS configuration main menu for the selected adapter.

From the main menu you select option “Configuration Wizard” and you will get the following screen.

From the above configuration types, we select “New Configuration” since there is no prior configuration. This can also be used if you want to replace the existing configuration with a new one. Before proceeding you will get a warning that this selection will erase the current configuration as shown below.

Next, you select the hard disk drives you want to add to the RAID array. To select more than one, hold down the CTRL key. In our case all four disks will be selected in order to implement the RAID-10 level.

In the next window you will have to select “Manual Configuration” if you want to create a RAID-10 array.

Here you select the first two disks (once again using the CTRL key) and click “Add To Array” to add them to “Drive Group0” in the right-hand panel.

After adding the first two disks you click on “Accept DG” to complete the setup of this drive group and create a new one.

Follow the same procedure and add the other two disks to the new drive group and then click “Accept DG” and “Next” to continue.

You add the two newly created arrays to a span by selecting each one of them and then clicking on “Add to SPAN”.

As you can see below, the selected RAID level is 10. Here you can tune your RAID-10 configuration, and when you are ready you click on “Update Size” and then the “Accept” button.

Finally, you can continue by hitting “Next” and, after the usual warning messages and a final review of the configuration, the RAID will start initializing.

On the bottom right you have some additional options that you can use but in any case, when the initialization process is completed the RAID-10 virtual disk will be ready to use.
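As a quick sanity check on the size the wizard reports, here is a small sketch of the RAID-10 capacity arithmetic: every disk is mirrored, so the usable space is half of the raw total. The function name is hypothetical, and the figure is nominal; the controller will report somewhat less.

```shell
#!/bin/sh
# Nominal usable capacity of a RAID-10 array: half of the raw capacity,
# because each disk has a mirror.
# Usage: raid10_usable_gb NUMBER_OF_DISKS DISK_SIZE_GB
raid10_usable_gb() {
    disks="$1"
    size_gb="$2"
    if [ "$disks" -lt 4 ] || [ $((disks % 2)) -ne 0 ]; then
        echo "RAID-10 needs an even number of disks, at least 4" >&2
        return 1
    fi
    echo $((disks / 2 * size_gb))
}

raid10_usable_gb 4 300   # prints 600
```

For the four 300GB disks used here, that is two mirrored pairs of 300GB each, striped together into roughly 600GB.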

Mistake
The files were archived on a Mac (Snow Leopard), which uses BSD TAR, and the destination host was Linux (which uses GNU TAR). As you might have guessed, there is an incompatibility between BSD and GNU TAR regarding the handling of vendor extended attributes. Specifically, BSD TAR supports them (as defined in IEEE Std 1003.1-2001 (POSIX.1-2001)) while GNU TAR doesn’t.

Resolution
There are a few different ways to avoid this mistake. The best one is simply to use either BSD or GNU TAR on both ends rather than mixing them. The other option is to use the “--format” option to create the archive in a format that both systems handle. Here is the relevant documentation for BSD TAR:

--format format
(c, r, u mode only) Use the specified format for the created archive. Supported formats
include ``cpio'', ``pax'', ``shar'', and ``ustar''. Other formats may also be supported; see
libarchive-formats(5) for more information about currently-supported formats. In r and u
modes, when extending an existing archive, the format specified here must be compatible with
the format of the existing archive on disk.
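As an illustration of the second option, here is a minimal sketch that creates the archive in the plain "ustar" format, which has no extended header records and therefore carries no vendor extended attributes. The function name and the paths in the usage line are hypothetical; the "--format=ustar" spelling is assumed to be accepted by the tar on the creating host.

```shell
#!/bin/sh
# Create an archive in ustar format so that it contains no vendor
# extended attributes.
# Usage: make_portable_tar ARCHIVE DIRECTORY
make_portable_tar() {
    archive="$1"
    src="$2"
    # -C switches to the parent directory so the archive holds relative paths.
    tar --format=ustar -cf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
}

# Example (hypothetical paths):
#   make_portable_tar /tmp/website.tar "$HOME/Sites/website"
```

The trade-off is that any extended attributes on the source files are dropped rather than transferred.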

Background
So, you have a request to upload and configure a new website on some specific web server. The policy is to have a separate configuration file for each website (each new virtual host) under /etc/httpd/conf.d/ directory.

Problem
After you finish writing the configuration file (which was about 200 lines long due to numerous special requirements) you run the following command

# /etc/init.d/httpd configtest
Syntax OK

in order to check that there are no syntax errors. And then you reload Apache’s configuration…

# /etc/init.d/httpd reload
Reloading httpd: [ OK ]

However, when you check for running Apache processes you see that there are none.

# ps -C httpd
PID TTY TIME CMD
#

Now, let’s move to the next section to see what caused this problem.

Mistake
After having another look at the newly added configuration I noticed that the ‘ErrorLog’ directive was pointing to an invalid directory due to a typo. If Apache is not able to access the configured log files, it won’t start and this is what happened.

Resolution
Since each web server could host numerous websites and these were maintained by many different people, I wrote a simple shell script that reports any missing log files.
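The script itself is not included in this copy, so the following is a hedged sketch of the idea rather than the original. It scans every *.conf file in a directory for ErrorLog, CustomLog and TransferLog directives and reports each log path whose directory does not exist. The function name, the default ServerRoot of /etc/httpd and the handling of relative paths are assumptions; piped loggers and syslog targets are skipped, and paths containing spaces are not handled.

```shell
#!/bin/sh
# Report log directives whose target directory does not exist.
# Usage: check_log_dirs [CONF_DIR] [SERVER_ROOT]
# Assumptions: defaults of /etc/httpd/conf.d and /etc/httpd; log paths
# contain no spaces; relative paths are resolved against SERVER_ROOT.
check_log_dirs() {
    conf_dir="${1:-/etc/httpd/conf.d}"
    server_root="${2:-/etc/httpd}"
    status=0
    for conf in "$conf_dir"/*.conf; do
        [ -f "$conf" ] || continue
        # First argument of each log directive, with quotes stripped.
        paths=$(awk 'tolower($1) ~ /^(errorlog|customlog|transferlog)$/ {
                         gsub(/"/, "", $2); print $2 }' "$conf")
        for path in $paths; do
            case "$path" in
                "|"*|syslog*) continue ;;          # no directory to check
                /*) ;;                             # absolute path, use as is
                *) path="$server_root/$path" ;;    # relative to ServerRoot
            esac
            dir=$(dirname "$path")
            if [ ! -d "$dir" ]; then
                echo "$conf: log directory missing: $dir"
                status=1
            fi
        done
    done
    return "$status"
}

check_log_dirs "$@" || true
```

Running it before every reload, next to configtest, catches the typo described above, because configtest reported "Syntax OK" even though the log directory did not exist.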

So, with this post I’m introducing another new category named mistakes where I’ll be posting some mistakes I have done that will hopefully help other sysadmins avoid them.

Background
It is late afternoon and you have about 15 SSH sessions open on various servers. The operating systems include Linux, AIX, Solaris and a couple of BSD derivatives. While running a benchmark you realize that the benchmark application is probably stuck in an infinite loop and is eating up all of the system’s memory (which was not expected). So, you try to kill it…

killall bench_application

Since this was not a production system we didn’t really care and as we all know ‘kill’ could take some time to terminate the application depending on the application’s signal handler. So, after calling ‘kill’ you go back to some other urgent tasks you were doing on the other servers.
Then you start getting alert emails for the memory usage on this server, so you open up the SSH session and you accidentally type…

killall

Instead of:

killall -s KILL bench_application

But then you realize that you’re on a Solaris server (not on the Linux you thought you were)! And more specifically, on a development server where software developers have shell access for testing and building.

Problem
On Solaris, ‘killall’ terminates all active processes. As a result, all users got disconnected and every running build process was terminated.

Mistake
So, of course the mistake was that I was not paying enough attention to which SSH session I was typing in.

Resolution
Just think at least twice before you hit that return key. :P
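For those who would rather not rely on attention alone, here is a small sketch of a guard that could live in a shell profile. It refuses to run ‘killall’ without a process name and on any operating system that is not on a short allow-list. The function names are hypothetical, and the allow-list (Linux, FreeBSD, Darwin) is a deliberately cautious assumption: anything else, including Solaris (reported by uname as SunOS), is refused.

```shell
#!/bin/sh
# Return 0 only if killall is assumed to take a process name on this OS.
# $1 optionally overrides the OS name, for testing.
killall_takes_name() {
    case "${1:-$(uname -s)}" in
        Linux|FreeBSD|Darwin) return 0 ;;
        *) return 1 ;;   # e.g. SunOS, where killall terminates all active processes
    esac
}

# Wrapper that refuses the dangerous cases instead of running killall.
safe_killall() {
    if [ "$#" -eq 0 ]; then
        echo "safe_killall: refusing to run killall without a process name" >&2
        return 2
    fi
    if ! killall_takes_name; then
        echo "safe_killall: killall is not name-based on $(uname -s); use pkill" >&2
        return 1
    fi
    killall "$@"
}
```

With this in place, ‘safe_killall -s KILL bench_application’ behaves like the Linux command above, while a bare ‘safe_killall’ typed on the wrong server does nothing.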