Kernel nasties during backup

Message 1 of 11, Aug 14, 2008

This morning I noticed my slug wasn't responding (lights were still
on), so I gave it the old reboot and started checking the logs. Looks
like I got some bad mojo with my kernel last night while it was
running its cron'd backups. These backups have been working
flawlessly for a few weeks now, so I'm at a loss as to what really
happened here. Here's the log:

The backup script simply calls two other scripts, one for the data
backup and one for the system backup. The system backup didn't happen.
It looks like the data backup did happen, but the permissions on the
files are wrong, so I think that's my clue as to where in the backup
script the crash happened. Here's a copy of my data backup script:

#!/bin/bash
# ----------------------------------------------------------------------
# mikes handy rotating-filesystem-snapshot utility
# Edited by Drew for my nslu2 system
# ----------------------------------------------------------------------
# this needs to be a lot more general, but the basic idea is it makes
# rotating backup-snapshots whenever called
# ----------------------------------------------------------------------

unset PATH # suggestion from H. Milz: avoid accidental use of $PATH

# ------------- system commands used by this script --------------------
ID=/usr/bin/id;
ECHO=/bin/echo;

# step 3: make a hard-link-only (except for dirs) copy of the latest snapshot,
# if that exists
# Note: Since the slug does not have GNU cp, it doesn't support the
# -l option, so this has been replaced with a combination of
# find and cpio as described at www.mikerubel.org/computers/rsync_snapshots

# step 4: rsync from the system into the latest snapshot (notice that
# rsync behaves like cp --remove-destination by default, so the destination
# is unlinked first. If it were not so, this would copy over the other
# snapshot(s) too!)
$RSYNC \
-va --delete --delete-excluded \
--exclude-from="$EXCLUDES" \
$STUFF_TO_BACKUP $SNAPSHOT_RW/daily.0 ;

# Change data file permissions so that no one but root can modify the files.
# Only do this to files shared through samba, i.e. /mnt/storage crap
# File permissions will be preserved on the rest of the files
# (i.e. full system backup)
if [ -d "$SNAPSHOT_RW/daily.0" ] ; then
    $CHOWN -R root:root "$SNAPSHOT_RW/daily.0"
    $CHMOD -R 644 "$SNAPSHOT_RW/daily.0"
fi

# step 5: update the mtime of daily.0 to reflect the snapshot time
$TOUCH $SNAPSHOT_RW/daily.0 ;

Any gurus out there see anything wrong with this? Like I said, this
has been running perfectly, and I haven't changed my scripts in some
time. Thanks!

-Drew
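
A side note on the permission step in the script above: chmod -R 644 also clears the execute (search) bit on every directory in the snapshot, so only root can descend into it afterwards. That may be exactly what is wanted here, and nothing suggests it is related to the crash. If the goal is read-only files that other users can still browse, a minimal sketch might look like the following. The function name lock_down_tree is made up for illustration, and the chown step is left out because it needs root.

```shell
#!/bin/sh
# lock_down_tree DIR: make directories 755 and regular files 644.
# Hypothetical helper, not part of the original script.
lock_down_tree() {
    dir=$1
    [ -d "$dir" ] || return 1
    # Directories keep their search bit so they stay traversable.
    find "$dir" -type d -exec chmod 755 {} \;
    # Regular files become read-only for everyone except the owner.
    find "$dir" -type f -exec chmod 644 {} \;
}
```

It would be called as lock_down_tree "$SNAPSHOT_RW/daily.0" in place of the chmod line. The "\;" form of -exec is used instead of "+" on the assumption that the find on a small embedded system may not support the latter.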

Mike (mwester)

Message 2 of 11, Aug 14, 2008

Good info, all, but sadly incomplete. I can deduce from the log and the
script that it would appear you are not running Unslung. But apart from
that, it could be anything else. It would help if you could share which
firmware you are running, which version of that firmware, and, so that
we can make some sense of the addresses in the kernel crash dump, the
specific kernel version (uname -a).
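
The details asked for here can be collected in one go. The following is only a sketch: collect_sysinfo is a made-up name, and /etc/version as the place where the firmware records its release is an assumption that may not hold on every image.

```shell
#!/bin/sh
# collect_sysinfo: print kernel and firmware details for a bug report.
collect_sysinfo() {
    echo "== uname -a =="
    uname -a
    echo "== /proc/version =="
    cat /proc/version 2>/dev/null || echo "(not available)"
    echo "== firmware release file (assumed path) =="
    cat /etc/version 2>/dev/null || echo "(no /etc/version on this system)"
}

collect_sysinfo
```

Pasting that output next to the crash log lets the addresses in an oops be matched against the right System.map.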

Message 4 of 11, Aug 14, 2008

Well, looks like it has nothing to do with my backup scripts. It just
happened again. I did have it under a fairly heavy load, copying lots
of big files to the system through samba. Hopefully I don't have a bad
slug; this is my second one.

Message 5 of 11, Aug 14, 2008

Most interesting to me is the log entry at 01:14:13,
where there's a kernel "Alignment trap" caused by rtorrent a
couple minutes before the backup problems began. I think I
would look for rtorrent problems before anything else.

Message 7 of 11, Aug 14, 2008

Drew Kirkpatrick wrote:

> Rtorrent does that all the time. Every time it has crashed, Rtorrent
> was running, so I've stopped. I'll see if it happens again...

There are many known issues with the SLUG going wrong when it runs out
of memory; but those usually show up in the logs with an 'OOM Killer'
message, so it could be unrelated. You may like to try turning off
overcommit in the VM:

echo -1 > /proc/sys/vm/overcommit_memory

(I think.)

This may help. Or it may not. Or it may cause your system to crash
differently...

Message 8 of 11, Aug 14, 2008

Thanks for the tip. At the least you gave me something else to google
around on. I have allocated a bunch of swap space, but if the memory
management is out of whack a bit it might not help. If worst comes to
worst, I can throw gentoo back on there and roll my own kernel, but I
have enough gentoo boxes to contend with at the moment. I'm getting
lazy in my old age. Thanks to all for the help...

Message 9 of 11, Aug 26, 2008

Well, my problems with my slugos 4.8 beta BE kernel have continued even
without running rtorrent. Now all the slug does is samba, ssh, and
nightly rsync backups to a second drive. I still get kernel oopses
regularly. Sometimes it's the chown process. Often it's the kswapd
process. They always start like this:

As you can see, this particular kernel oops was from kswapd. This is
with kernel 2.6.21.7, which I'm guessing is the only option in slugos
4.8. Nothing memory-intensive was going on at the time. The
crashes in the chown process look exactly the same (I have some chowns
in my backup scripts). This is a new slug. I bricked my old one, and
just put this one in about 2 weeks ago. The slug is on a UPS, so it
should be getting clean power.

I've updated and upgraded from both the ipkg and ipkg-opt feeds, hoping
that newer versions of packages would help. They haven't.

I'm about at the point that I think I might need to try another
firmware. Is slugos commonly this unstable, or do I have something
screwy going on (most likely)? Should I be trying debian instead? All
I need is openssh (not dropbear, I do a lot of ssh tunneling and
such), samba, rsync, and a system that is stable. Crashing every other
day just doesn't cut it for what I need this little server to do.

I was going to try setting overcommit_memory to -1 as David Given
suggested, but after googling around I don't think -1 turns off
overcommit. I think 0 (the default) is the off setting for overcommit.
I see a number of references to setting this to 1 or 2, usually in
regard to problems with the OOM killer. But I have yet to have an out
of memory/process killed event. I just get these damn crashes.
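
For anyone following the overcommit discussion: the kernel's sysctl documentation lists three modes for vm.overcommit_memory. 0 is heuristic overcommit (the default), 1 is always overcommit, and 2 is strict accounting with no overcommit. So neither -1 nor 0 actually switches overcommit off; 2 is the closest thing to "off". A small sketch for checking the current value follows. The helper name describe_overcommit is made up, and whether mode 2 helps with these particular crashes is an open question.

```shell
#!/bin/sh
# describe_overcommit MODE: explain a vm.overcommit_memory value.
describe_overcommit() {
    case "$1" in
        0) echo "heuristic overcommit (kernel default)" ;;
        1) echo "always overcommit" ;;
        2) echo "strict accounting, no overcommit" ;;
        *) echo "not a documented value" ;;
    esac
}

# Report the current setting; reading it needs no special privileges.
current=$(cat /proc/sys/vm/overcommit_memory 2>/dev/null)
echo "overcommit_memory=${current:-unknown}: $(describe_overcommit "$current")"

# To try strict mode (as root, and at your own risk):
#   echo 2 > /proc/sys/vm/overcommit_memory
```

On a box with very little RAM, strict mode can make allocations fail early instead of crashing later, which at least changes the symptoms and may help narrow things down.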

Mike (mwester)

... That's the first page of the kernel. Odd.
...
c0092d88 t shrink_icache_memory
This is code in the middle of a bunch of inode-management code.
...

Yes. (But of course you could build a more recent SlugOS with a newer
kernel, if you wished.)

> There was nothing high memory use going on at this time. The
> crashes on process chown look exactly the same (I have some chown's in
> my backup scripts). This is a new slug. I bricked my old one, and just
> put this one in about 2 weeks ago. The slug is on a ups, so it should
> be getting clean power.
>

What about the USB devices? Have you swapped any of those? Assuming
that there was indeed nothing going on (which isn't really true, of
course, otherwise the slug wouldn't be doing I/O, but I know what you
mean...), that is, nothing other than normal background and
housekeeping stuff, then I'm beginning to wonder if perhaps you have a
USB device that's going to sleep, and something goes terribly wrong
when SlugOS attempts the I/O that would presumably usually wake the
device up.

The stack dump is quite clear that this is I/O-related. Corruption such
as this may also be an event that happened some time earlier; the stack
dump then would reflect nothing unusual. So another question would be
if there *was* some I/O activity, possibly intensive, that may have
caused swapping or some other memory shortfall, perhaps some time in the
past -- and the system just had the misfortune to finally get around to
touching the damaged data structure.

I'd run an fsck on all partitions, testing for bad-blocks as well. I'd
also format and test the swap partitions thoroughly before returning
them to being swap space. If you have done a turnup to disk, you'll
have a syslog that survives reboots -- it won't have data from close to
that crash obviously, but it might have other events from earlier -
paging errors, disk errors, etc.
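
As a rough illustration of the checks described above, the sketch below scans a surviving syslog for suspicious lines and lists the disk checks as comments. The function name scan_log, the log path, and the device names are all placeholders, and the pattern list is only a starting point.

```shell
#!/bin/sh
# scan_log FILE: print log lines that hint at I/O, paging, or memory trouble.
# The pattern list is a starting point, not an exhaustive catalogue.
scan_log() {
    grep -i -E 'I/O error|oom-killer|out of memory|alignment trap|oops' "$1"
}

# Typical use (log path is an assumption; adjust for your syslog setup):
#   scan_log /var/log/messages
#
# Disk checks, run with the partitions unmounted, e.g. from another machine.
# /dev/sdX1 and /dev/sdX2 are placeholders, and ext2/ext3 is assumed:
#   e2fsck -f -c /dev/sdX1    # force a check and scan for bad blocks
#   mkswap -c /dev/sdX2       # re-create swap, checking for bad blocks first
```

The -c options depend on the full e2fsprogs and util-linux tools; cut-down embedded versions may not support them, which is another reason to run the checks from a desktop machine.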

> I've updated and upgraded both ipkg, and ipkg-opt feeds hoping that
> newer versions of packages would help. It hasn't.
>

> I'm about at the point that I think I might need to try another
> firmware. Is slugos commonly this unstable, or do I have something
> screwy going on (most likely)? Should I be trying debian instead? All
> I need is openssh (not dropbear, I do a lot of ssh tunneling and
> such), samba, rsync, and a system that is stable. Crashing every other
> day just doesn't cut it for what I need this little server to do.
>

SlugOS is very stable. There are known issues with the OOM stuff in
ARM, and have been for a very loooong time -- other than that there are
no known problems with the stock-memory NSLU2s.

Looking at your software list, though, you are certainly stressing the
device, so I think looking more closely at the swap and OOM behavior is
the most likely direction to find a solution. OpenSSH takes a lot of
CPU, Samba is just I/O intensive -- the combination can stress a system.
But I suspect rsync may be the issue. It builds its entire comparison
list in memory, and if you are rsyncing anything like a typical rootfs
for a backup, or such, you'll certainly be at or beyond the amount of
RAM the little box has before you even transfer a single byte of file data.
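
To get a feel for how large that in-memory list might be, one can count the entries rsync would have to walk and multiply by a per-entry cost. The sketch below assumes the commonly quoted rule of thumb of roughly 100 bytes per file-list entry; that figure is an assumption, not something measured on a slug, and estimate_rsync_list_kb is a made-up helper name.

```shell
#!/bin/sh
# estimate_rsync_list_kb DIR [BYTES_PER_ENTRY]
# Print a rough size, in kB, of the file list rsync would build for DIR.
estimate_rsync_list_kb() {
    dir=$1
    per_entry=${2:-100}   # assumed rule of thumb, not a measured figure
    # Count every file and directory under DIR (including DIR itself).
    count=$(find "$dir" | wc -l)
    echo $(( count * per_entry / 1024 ))
}
```

Running it against each tree in the backup, for example estimate_rsync_list_kb /mnt/storage, gives a number to compare with the free memory reported on the slug. Excludes are not taken into account, so the result is an upper bound for that tree.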

It would be interesting to note if Debian behaves differently, but I
certainly have no expectation of that based on the stack trace.

> I was going to try setting the overcommit_memory to -1 as David Given
> suggested, but after googling around I don't think -1 turns off the
> overcommit. I think 0 (the default) is the off setting for overcommit.
> I see a number of references to setting this to 1 or 2, usually in
> regards to problems to the OOM killer. But I have yet to have an out
> of memory/process killed event. I just get these damn crashes.
>
> Any tips out there?

I've attached the System.map file for that kernel for your reference.
I'd keep a close watch on my I/O for errors, fsck and test *all* disk
and flash memory space, do whatever is necessary to make sure that
*nothing* is going to sleep or spinning down on me, and I'd pursue the
OOM settings -- and I'd start to monitor the rsync stuff closely to see
if there is a correlation that can be drawn between the rsync activity
and future crashes.

Personally, just to perturb things, I'd turn off my swap space and see
what happened. That'll tell you in a hurry if you're swapping. If I
was, I'd try a swap file -- and move the file about on the various
partitions just for fun.
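
A quick way to see whether the box is actually swapping, before and after those experiments, is to read the swap counters from /proc/meminfo. This is only a sketch: swap_used_kb is a made-up name, and the swap file path and size in the comments are placeholders.

```shell
#!/bin/sh
# swap_used_kb [MEMINFO]: print swap currently in use, in kB.
# Reads SwapTotal and SwapFree from /proc/meminfo unless another file is given.
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END           { print total - free }' "${1:-/proc/meminfo}"
}

# Experiments suggested above, all as root (paths are placeholders):
#   swapoff -a                                   # run with no swap at all
#   dd if=/dev/zero of=/mnt/storage/swapfile bs=1024 count=131072   # 128 MB
#   mkswap /mnt/storage/swapfile
#   swapon /mnt/storage/swapfile
```

Calling swap_used_kb with no argument from cron every few minutes during the nightly backup, and logging the result, would show whether swap use climbs before a crash.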

A serial console is a must, but netconsole might be acceptable.

And I'd try a different set of USB devices if available.

Mike (mwester)

Drew Kirkpatrick

Message 11 of 11, Aug 26, 2008

I haven't swapped my USB devices; I assumed this was memory related.
Many thanks for helping me interpret that data. I never would have
thought to look at the I/O stuff.

I only have the one external hard drive box (dual enclosure, one disc
for root, swap, and storage, the other drive for backups), but I'll
plug that into one of my gentoo boxes and check the partitions and
such carefully.