Bug Description

Binary package hint: gnome-utils

Hi there,
here is some additional information on this bug:

While creating a .bin file (with ImgBurn) from my MS Windows Vista client, the application hits a
timeout and loses the connection to the Samba server for a while.
The Ubuntu 9.10 Samba server writes these messages to syslog.

Here are two new tasks blocked for more than 120 seconds, but this time the task "kjournald2", followed by "rsync", is affected. This happens when I try to copy from one ext4 volume to another with rsync.
So it's not Samba-specific as I assumed at first; it seems to be a problem with ext4.

Since I converted two of my volumes from ext3 to ext4 last week, I have had these problems.
So I have to go back to ext3 to avoid losing data.

Bug 276476 states that we should file separate bugs for the blocked processes (if I interpret the comments correctly). Can we instead dump everything here?

The easiest way to reproduce this is to run rsync or (s)cp on a big file. I mainly see blocked kvm, pdflush and kjournald tasks in kern.log before the server goes down and I have to find the reset button. (It doesn't always go down, but I have disabled the rsync backup for now.)
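For anyone trying to trigger this, a minimal reproduction sketch along these lines may help. The paths and file size below are placeholders; on a real test the file should be several GiB so it actually overwhelms the page cache and forces heavy writeback:

```shell
#!/bin/sh
# Reproduction sketch: write a large file, then copy it, while
# watching the kernel log for blocked-task messages.
# SRC/DST are example paths -- point DST at the affected ext4 volume.
SRC=${SRC:-/tmp/ext4-test-src}
DST=${DST:-/tmp/ext4-test-dst}
mkdir -p "$SRC" "$DST"

# 8 MiB here just for illustration; use count=4096 or more (GiB
# range) to really stress writeback on a production-sized machine.
dd if=/dev/zero of="$SRC/bigfile" bs=1M count=8 2>/dev/null

# Copy with rsync if available, otherwise plain cp.
if command -v rsync >/dev/null 2>&1; then
    rsync -a "$SRC/bigfile" "$DST/"
else
    cp "$SRC/bigfile" "$DST/"
fi

# In another terminal, watch for the symptom:
#   dmesg | grep 'blocked for more than 120 seconds'
```

If the bug is present, the blocked-task warnings should show up within a few minutes of sustained copying.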

I've been getting the same errors (INFO: task xyz blocked for more than 120 seconds) since I upgraded to ext4 on my 4TB /home partition. It gets a lot of I/O including rsync backups, smbd and nfs fileserver for ~20 people. I have attached a dmesg.

Unfortunately, apport-collect doesn't work for me because the machine doesn't have X and the launchpad logon page doesn't have a selectable [Continue] button in w3m; only cancel works.

In the first crash I had, trackerd was the first blocked task:
(from kern.log)
Apr 19 11:28:43 localhost kernel: [156600.910099] INFO: task trackerd:3486 blocked for more than 120 seconds.
(full kern.log on demand)

The second time it was cron (but I didn't check whether the script it ran was I/O-intensive).
(from kern.log)
Apr 20 14:20:59 localhost kernel: [68160.710147] INFO: task cron:28148 blocked for more than 120 seconds.
(the log of Apr 20 seems to be polluted by the SysRq I misused)
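For what it's worth, SysRq can also be used deliberately here: the 'w' function dumps the stacks of all uninterruptible (blocked) tasks into the kernel log, which is exactly the state these hung tasks are in. A sketch (the /proc paths are standard; the trigger itself needs root):

```shell
#!/bin/sh
# Check whether Magic SysRq is enabled (0 = disabled, 1 = all
# functions allowed, other values are a bitmask).
cat /proc/sys/kernel/sysrq

# As root, dump the stacks of all blocked (state D) tasks into the
# kernel log, then read them back:
#   echo w > /proc/sysrq-trigger
#   dmesg | tail -n 50
```

Attaching that dmesg output to a bug report shows the developers exactly what kjournald2 and friends are waiting on.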

No "strange" updates in the last few days, but I enabled a new django site the day before the first crash (in debug mode) => maybe a memory hog.

For my specific case (which may differ from the one reported here) I suspect two causes:
* hardware failure,
* or bad behavior of something under low memory combined with heavy I/O activity.

As suggested by Brian Rogers in bug #276476 I'm posting an update here. Hope to attract some attention!

I'm running a file server that shares home directories and other directories (~4TB, ext4) to 4 linux clients via nfs and about 20 windows and mac clients via samba. Rsync runs several times per day to mirror all files onto a second server. This seems to be the main trigger for the freezes to happen but I've seen it being caused by other I/O intensive operations, too.
Mostly, the file server recovers 'quickly' and the freezes only last for 10 to 60 seconds. Such interruptions occur at least once per hour.
When working on a linux client, everything stops (possibly due to the shared ~) until the file server has recovered. On windows clients, the program accessing the share freezes. During that time the load on the server increases rapidly as more and more processes queue up for a read/write.
Rebooting the server helps, but the first 120-second block normally recurs within several hours, and rebooting every time is not an option.
Blocked tasks include pdflush, nfsd and kjournald2.
OS is Ubuntu karmic 9.10 on 2.6.31-21-server #59-Ubuntu SMP Wed Mar 24 08:26:06 UTC 2010 x86_64
All packages updated

I have had success with setting the block I/O scheduler to something other than »deadline« for each block device in our RAID arrays.
This is with kernel: 2.6.32-24-server

I tried »noop« and it worked fine for me. This is not fully verified yet, because each backup takes more than 12 hours. Today I'll test cfq and report back.

You can read the current scheduler, or set a new one, for each block device through /sys/block/sd*/queue/scheduler.
Alternatively, you can set it globally with the boot option »elevator=cfq« (or »noop« or »anticipatory«).
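A concrete sketch of both approaches (device names like sda are examples and will differ per system; writing to sysfs needs root):

```shell
#!/bin/sh
# Print the scheduler for every SCSI/SATA block device; the active
# scheduler is shown in [brackets], e.g. "noop [deadline] cfq".
for f in /sys/block/sd*/queue/scheduler; do
    [ -e "$f" ] || continue   # no matching devices on this machine
    printf '%s: %s\n' "$f" "$(cat "$f")"
done

# Switch one device to noop at runtime (root required):
#   echo noop > /sys/block/sda/queue/scheduler

# Or set it globally at boot by adding this to the kernel command
# line (e.g. in /boot/grub/menu.lst):
#   elevator=noop
```

Runtime changes via sysfs take effect immediately but do not survive a reboot; the boot option is the persistent variant.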

Lars,
by "have success with setting" do you mean the freezes are avoided?
I'm also running 2.6.32-24-server (but with an Adaptec hardware RAID) and I haven't had a "task blocked for more than 120 seconds." message in my kern.log since Jul 8.
However, I think the freezes are still happening, just for less than 120 seconds, so nothing is logged. Will keep an eye open and try noop.
Cheers

With our setup it was reproducible. After some (3-4) hours of heavy I/O load on the HDDs, our process gets blocked for more than 120 s and then stops itself because it can no longer write to the tape drive.

With noop, the backup (actually two of them running in parallel) has run for 13 hours without any strange log entries.

Happening to me too, several times on the 11th and 12th. It may be triggered by load from outside; it looks like a SYN flood after an incident yesterday. I added mod_evasive, blocked a few bad bots, and reduced MaxClients to 50 (prefork). No events since last night. https://www.bijk.com/p/2199b5ea shows yesterday's outage, which will roll off tomorrow. I can grant access to anyone who wants a closer look.

In my case, I suspect the events are associated with external web server load and are accompanied by general dysfunction. This is my only server that gets significant traffic; the others are operating peacefully.

The kernel update from last week has made our troubles worse.
I have not managed to complete one backup without the 120 s blocking since the update,
although I didn't change anything else. Still using noop as the I/O scheduler.

I'll open a new bug report if no one responds here.
Sorry, but this is really important to us.

If there are no server kernels to test, there should be enough information for the developers to work towards resolving this issue. For all reporters, it would be good to file individual bugs for each system with issues. The bugs can be filed against "linux" as the package, using "ubuntu-bug linux". Please add this bug 494476 to the comments for reference. The individual bugs will give the developers the needed log files and allow them to decide if all of the fixes will be the same.

This is a very serious problem. It's causing unpredictable lockups on my server every 2-3 days, requiring a force-reboot. There are many related reports from other users: e.g. #588046, #667656, #628530, and my particular one, #684654.

This bug is not:

* Just on server kernels. My kernel is the latest -generic.
* File-system specific. Here it's on ext4, mine is on reiserfs, others have reported it on xfs.
* A hardware problem, comment #29 notwithstanding. My hardware shows no signs of failure, and too many people are reporting it for it to be caused by simultaneous hardware problems.

I hope that someone is going to work on and fix this very soon.
Andrew.