fflush Crash in BFS

Description

We're having trouble with a KDL (Kernel Debugging Land) crash in fflush().

Unfortunately it's hard to reproduce without waiting days. Also unfortunately, on that time scale it happens often, with the same crash again and again. It's in Inode::TransactionDone() / Transaction::NotifyListeners(), when an fflush() apparently triggers a write (at least it's calling down through the file system's write functions).

The application code logs to text files via redirected stdout and stderr (two separate files), calling fflush(stdout) once a second even if no new data was output. Seen somewhere around hrev50163.

The last item ends in a panic, so you probably would have noticed. That leaves items 1 and 2 -- the second leaves a note in the syslog, while the first one doesn't, so that would help differentiate between the two. Also, if the first one happens, the syslog should mention that the system is low on memory.

Item 2 would hint at an interrupt or driver issue, or possibly even broken hardware.

So even if we fix the bug in BFS (which we should), the problem might just choose a different outcome for you.

It's quite likely that it's running out of memory, since we often see other programs (like SoundPlay) consuming hundreds of megabytes. (There's a logging program monitoring the others once per second; it's kind of ironic that it triggers the crash while writing its log.) I'll see if I can get a syslog the next time it crashes, though if the file system crashes while writing, that may not work for capturing the error :-)

Is there an API for getting the total and free memory? It would be useful to see whether memory is running low shortly before the crash. The kernel system info just has a page count, which I assume isn't actual memory used. I did sum the sizes of areas for particular teams to find the memory used by a program; I guess I could do that for all teams. I wonder if that's what ProcessController does...

Thanks Axel. That could indeed be related to the kernel team running out of memory. With further system monitoring (now periodically listing the top memory users to a log file), we're seeing the kernel running out of memory as a major cause of the long-duration crashes (the kernel team grows to 1.5 GB, then bam!).

Leak finding is on the to-do list. The plan is to make a dummy device driver that just iterates over the memory areas and dumps them to a file. Then see what's in those gigabytes (audio, bitmaps, disk sectors?).