Author: Brandon

I spent the last several hours troubleshooting an annoying error with the MongoDB init script on Ubuntu. The script would start the daemon easily enough, but would report a failure. When subsequently trying to stop the daemon, it would say it was successful, but in fact the daemon would still be running.

Fortunately, I found that somebody else had already reported the bug, but the comments were pointing developers to the wrong place: inside the mongod code instead of the init script.

After digging into it for longer than I'd like to have spent, I found the cause was the --make-pidfile option being used in the init script.

My understanding of the process is that the start-stop-daemon command was creating the pidfile (as root) before spawning the actual mongod process (as the mongodb user). In some cases (at least when configsvr=true is not set), mongod must fork again before saving its own pidfile. Since the file created by start-stop-daemon is owned by root, the less-privileged mongodb user cannot overwrite it (perhaps this failure should be logged, or logged at a less verbose level?), leaving the pidfile containing a PID that is no longer correct.

On my machine, the PID in the pidfile created with the --make-pidfile option was consistently exactly three less than the PID shown in the output of 'ps'.
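The fix is to remove --make-pidfile from the start-stop-daemon invocation and let mongod write its own pidfile. A sketch of the change (the exact paths and arguments in your init script will likely differ, and mongod needs pidfilepath set in /etc/mongodb.conf so it writes the file itself):

# Before (broken): start-stop-daemon writes the pidfile as root
start-stop-daemon --start --make-pidfile --pidfile /var/run/mongodb.pid \
    --chuid mongodb --exec /usr/bin/mongod -- --config /etc/mongodb.conf

# After: drop --make-pidfile; mongod writes its own pidfile after forking
start-stop-daemon --start --pidfile /var/run/mongodb.pid \
    --chuid mongodb --exec /usr/bin/mongod -- --config /etc/mongodb.conf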

After making that change to the init file, I can now reliably start/stop the mongod process using the expected commands.

Hopefully that bug will be closed soon and released so that I don’t have to customize the init script on every mongo server I have.

I recently fixed an issue where I wanted my Monit monitoring process to restart a daemon that was segfaulting and causing 100% CPU usage according to top and most other system tools. I had seen configuration examples where Monit could detect that and restart the process, so I figured that adding a configuration like the one below would fix it easily enough:
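# A sketch of my first attempt; the process name and paths are
# placeholders for the actual daemon.
check process mydaemon with pidfile /var/run/mydaemon.pid
    start program = "/etc/init.d/mydaemon start"
    stop program = "/etc/init.d/mydaemon stop"
    if cpu usage > 90% for 8 cycles then restart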

After letting that run for a number of cycles, the process remained running, and Monit did nothing to acknowledge the problem, even in its log files. (FYI, a "cycle" is defined in the monitrc config file on the "set daemon" line and defaults to 120 seconds.)

After some research, I finally came upon a post on the Monit mailing list where somebody explained that the CPU usage Monit bases its numbers on is a percentage of the CPU available across all processors. My machine had 4 processors, so what I was seeing as 100% CPU usage in top, Monit saw as only 25%.

I quickly changed my Monit config to check for CPU usage > 22%, as seen in the following. That now works perfectly, even acknowledging in the log each of the 8 times that the CPU was over the limit before restarting the process:
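# Same sketch with the threshold lowered for 4 processors:
# 100% of one core is 25% of the total, so 22% leaves a little headroom.
check process mydaemon with pidfile /var/run/mydaemon.pid
    start program = "/etc/init.d/mydaemon start"
    stop program = "/etc/init.d/mydaemon stop"
    if cpu usage > 22% for 8 cycles then restart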

I've had a need for this on several occasions, and never came up with a decent solution until now. The problem happens for me most frequently when dealing with a group of transactions, where I need to find a subset of them that adds up to a specific amount.
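At its core this is the subset-sum problem. As a rough sketch (my own illustration, not the original code; it assumes amounts are stored in whole cents and the transaction list is small enough for brute force):

from itertools import combinations

def find_subset(transactions, target):
    # Try subsets from smallest to largest; return the first that matches.
    for size in range(1, len(transactions) + 1):
        for combo in combinations(transactions, size):
            if sum(combo) == target:
                return combo
    return None

# Example: which transactions add up to $33.25?
print(find_subset([1250, 399, 2075, 660, 1841], 3325))  # -> (1250, 2075)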

I had a customer complaining lately that messages sent via Gmail to one of my mail servers were occasionally receiving SMTP authentication failures and bounce-backs from Gmail. Fortunately, he noticed it happening mainly when he sent a message to multiple recipients, and he was able to send me some of the bounces so I could track it down pretty specifically in the Postfix logs.

The error message from Gmail was:

Technical details of permanent failure:
Google tried to deliver your message, but it was rejected by the recipient domain. We recommend contacting the other email provider for further information about the cause of this error. The error that the other server returned was: 535 535 5.7.0 Error: authentication failed: authentication failure (SMTP AUTH failed with the remote server) (state 7).

This was a little odd, because an SMTP AUTH failure is what I would typically expect with a mistyped username or password. However, I could see that plenty of messages were being sent successfully from the same client. By looking at the specific timestamp of the bounced message, I tracked down the relevant log segment, which indicated 5 concurrent smtpd sessions: SASL authentication succeeded on 4 of them and failed on the 5th.

So it seemed likely that the 4 saslauthd child processes were all in use, and Postfix couldn't open a connection to a 5th simultaneous SASL authentication server, so it responded with a generic SMTP AUTH failure.

To fix it, I simply added a couple of extra arguments to the saslauthd command that is run: a '-c' parameter to enable caching, and '-n 10' to increase the number of server processes to 10. On my CentOS server, I accomplished that by modifying /etc/sysconfig/saslauthd to look like this:

# Directory in which to place saslauthd's listening socket, pid file, and so
# on. This directory must already exist.
SOCKETDIR=/var/run/saslauthd
# Mechanism to use when checking passwords. Run "saslauthd -v" to get a list
# of which mechanism your installation was compiled with the ability to use.
MECH=rimap
# Additional flags to pass to saslauthd on the command line. See saslauthd(8)
# for the list of accepted flags.
FLAGS="-r -O 127.0.0.1 -c -n 10"

I've had this error occur when something went terribly wrong on a master server and it had to be rebooted. It appears that the master's binary log file was not completely written to disk before the reboot, although part of it had already been transmitted to the slave. The error comes from the slave server when it tries to reconnect to the master: it is attempting to resume copying the binary log file from a position which the master does not have.

The proper way to fix this error is to completely resync the data from the master and begin replication over again. However, in many cases, that is impractical, or not worth the hassle, so I was able to fix the problem by setting the slave back to a valid position on the master and then skipping forward over the missing entries.

First, you need to find out what the last valid position is on the master. Run 'SHOW SLAVE STATUS' on the slave to figure out which master log file it was reading when the master crashed. You'll need the value of the Relay_Master_Log_File parameter. Also take note of the Exec_Master_Log_Pos value.
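The relevant fields look something like this (the file name and position here are illustrative):

mysql> SHOW SLAVE STATUS\G
...
Relay_Master_Log_File: mysql-bin.000123
  Exec_Master_Log_Pos: 8796998
...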

Next, go to the master and ensure that the log file exists in the MySQL data directory. The file size will probably be less than the Exec_Master_Log_Pos value that the slave had stored. To find the last valid position in the log file, use the mysqlbinlog utility. The command

mysqlbinlog [LOGFILE] | tail -n 100

should show the final entries in the binary log. Find the last line in the file that looks similar to this:
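# at 8796216
#091215 10:23:45 server id 1  end_log_pos 8796393  Query  thread_id=52  exec_time=0  error_code=0

(The timestamp, server id, and thread details above are illustrative; only the end_log_pos number matters.)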

The number after end_log_pos is the last valid position that the master has available. The difference between the Exec_Master_Log_Pos value that the slave had and this highest position on the master will give you some idea of how far ahead the slave got before the master crashed.

You can now go back to the slave and issue the commands:

STOP SLAVE;
CHANGE MASTER TO MASTER_LOG_POS=8796393;
START SLAVE;

This will tell the slave to try to start from that valid position, and continue from there.

There is a strong possibility that you'll run into some duplicate key errors if the slave was very far ahead of where the master's log ended. In that case, you can issue these commands to bypass the errors one at a time (or increase the counter if you want to skip more than one):

STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;

!! Note that if you have to do this, the data on the master and slave is almost certainly inconsistent !! Be sure you understand your data well enough to know whether that is something you can live with.

This is due to changed requirements for the Product Advertising API. The AssociateTag parameter is now a required and validated field; you can see in the list of changes that it is the first thing mentioned.

These changes are made to keep the Product Advertising API in line with its purpose of driving affiliate sales through the Amazon.com website. The AssociateTag is a tag generated from your Amazon Associates account.

So far, it looks like the tag is not entirely validated against your account. I've had luck using fictitious tags such as aztag-20.
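For reference, the tag is just another query parameter on the request. A sketch of an ItemLookup call (the access key, signature, timestamp, and item ID are placeholders):

http://webservices.amazon.com/onca/xml?Service=AWSECommerceService
    &Operation=ItemLookup
    &ItemId=B00008OE6I
    &AssociateTag=aztag-20
    &AWSAccessKeyId=[YOUR_ACCESS_KEY]
    &Timestamp=[TIMESTAMP]
    &Signature=[SIGNATURE]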

I have a system at home with a pair of 640 GB drives that are getting full. The drives are configured in a Linux software RAID1 array. Replacing those drives with a pair of 2 TB drives is a pretty easy (although lengthy) process.

First, we need to verify that our RAID system is running correctly. Take a look at /proc/mdstat to verify that both drives are online:
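cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      192640 blocks [2/2] [UU]
md1 : active raid1 sda3[0] sdb3[1]
      623200640 blocks [2/2] [UU]
unused devices: <none>

(The devices and sizes above are from my system and will differ on yours; the important part is that each array shows [2/2] [UU], meaning both members are active.)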

After verifying that the system is running, I powered the machine off and removed the second hard drive and replaced it with the new drive. Upon starting it back up, Ubuntu noticed that the RAID system was running in degraded mode and I had to hit yes at the console to have it continue booting.

Once the machine was running, I logged in and created a similar partition structure on the new drive using the fdisk command. On my system, I have a small 200 MB partition for /boot as /dev/sdb1, a 1 GB swap partition, and then the rest of the drive as one big block for the root partition. I copied the partition table that was on /dev/sda, but made the final partition take up the entire rest of the larger drive. Make sure to set the first partition as bootable. The partitions on /dev/sdb now look like this:
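fdisk -l /dev/sdb

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          25      200781   fd  Linux raid autodetect
/dev/sdb2              26         147      979965   82  Linux swap / Solaris
/dev/sdb3             148      243201  1952331255   fd  Linux raid autodetect

(The cylinder numbers and block counts are illustrative for a 2 TB drive.) The new partitions can then be added back into the arrays and allowed to resync; a sketch, assuming md0 is the /boot array and md1 is the root array:

mdadm --add /dev/md0 /dev/sdb1
mdadm --add /dev/md1 /dev/sdb3
watch cat /proc/mdstat    # wait for the rebuild to complete before continuing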

The system is now running with one old drive and one new drive. The next step is to repeat the process with the other old drive: remove it, replace it, partition it, and re-add it to the RAID array. The steps are the same as above, except performed on /dev/sda. I also had to change my BIOS to boot from the second drive.

Once both drives are installed and working in the RAID array, the final part of the process is to increase the size of the file system to the full size of the new drives. I first had to disable the ext3 file system journal; this was necessary so that the online resize doesn't run out of memory.
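Before the file system itself can grow, the md array has to be expanded to fill the new, larger partitions. Assuming /dev/md1 is the root array, that looks something like:

mdadm --grow /dev/md1 --size=max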

Edit /etc/fstab and change the file system type to ext2, and the options to include "noload", which disables the file system journal. My /etc/fstab looks like this:
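# Root entry only; the device name and remaining options are illustrative.
# The relevant parts are the ext2 type and the noload option.
/dev/md1    /    ext2    noload,errors=remount-ro    0    1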