You know it's bad when I'm posting random cries for tech help here... So yeah. It's that bad.

The webcast machine at the club loses its mind at least once a week: it appears to run out of memory and crash, but I can't figure out what the culprit is.

The machine is a dual CPU Athlon 2400+ with 1GB RAM and 500MB swap. It's running Fedora Core 3, but I was also experiencing this problem on FC2 and RH9. Memtest86 says the RAM is fine. It's got an Osprey 100 BT848 video capture card and an SB Live EMU10k1 audio card.

I set up a cron job that once a minute captures the output of "top -bn1" and "ps auxwwf" to a file. Here are a pair of those files as it loses its mind. Note that the load goes from 3.44 to 22.73 in a minute and a half.
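A minimal sketch of that kind of monitoring job, assuming a hypothetical /tmp/memwatch log directory and filename scheme (the actual paths weren't posted):

```shell
#!/bin/sh
# Hypothetical once-a-minute snapshot job; log directory and filenames
# are assumptions, not the actual setup described above.
LOGDIR=/tmp/memwatch
mkdir -p "$LOGDIR"
STAMP=$(date +%Y%m%d-%H%M)
{
    echo "=== top -bn1 ==="
    top -bn1
    echo "=== ps auxwwf ==="
    ps auxwwf
} > "$LOGDIR/snapshot-$STAMP.txt"
```

Run it from cron with a `* * * * *` entry and you get one timestamped file per minute to diff after a crash.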

I've compared the two files character by character, and I don't see a smoking gun. The differences look quite trivial to me.

So while I was sitting there staring at this, I saw something very interesting happen: "top" was running on the machine's console, and showed 380MB swap available -- and the oom-killer woke up and shot down an xemacs and an httpd.

So, how's that even possible? Does this mean that some process has gone nuts and started leaking wired pages, so that it can't swap at all? Or what?

Update, Dec 29: It looks like something is leaking in the kernel; /proc/slabinfo shows the size-256 slab growing to 3,500,000 entries (over 800MB.) Current suspect is the bttv/v4l driver (since one of the things this machine does is run "streamer" to grab a video frame every few seconds.) That would be about 525 leaked allocations per minute, or around 26 leaks per frame.
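As a sanity check on those numbers, 3.5 million objects in a 256-byte slab works out to:

```shell
# 3,500,000 objects x 256 bytes each, expressed in MiB
echo "$((3500000 * 256 / 1048576)) MiB"   # roughly 854 MiB, i.e. "over 800MB"
```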

When the system starts going psycho, hit Alt-SysRq-T, Alt-SysRq-M, and Alt-SysRq-W. If the system is alive enough, it'll write a whole bunch of nearly nonsensical shit into /var/log/messages. If not, you'll want to get a serial console hooked up to the system to capture it instead. Post it here and we can take a look.
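If reaching the console keyboard is awkward, the same dumps can usually be triggered from a remote shell through /proc instead, assuming the kernel was built with magic-SysRq support (needs root):

```shell
# Enable the magic SysRq interface, then request the same dumps:
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger   # T: task list with kernel stacks
echo m > /proc/sysrq-trigger   # M: memory usage info
echo w > /proc/sysrq-trigger   # W: tasks stuck in uninterruptible sleep
# The output lands in the kernel log:
dmesg | tail -50
```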

Anyhow, if he can throw a serial console onto it (and said serial console is something like another system running minicom, instead of a vintage VT100 or somesuch) it shouldn't have trouble catching it. At least, IME I can't recall seeing customers have problems catching all of the output.

Swap space seems light from what I recall of modern kernels. Heck, you might be better with swap off, though I would tend to put it at 2GB. Heck, make a 1.5GB swap file, give it a lower (numerically smaller) swap priority, and see what happens.
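A sketch of that, with a hypothetical /swapfile path and device name; on Linux a bigger number means higher priority, so giving the partition the bigger number keeps the file as overflow:

```shell
# Create a 1.5GB swap file and add it at lower priority than the
# existing partition (the /dev/hda2 device name is an assumption).
dd if=/dev/zero of=/swapfile bs=1M count=1536
chmod 600 /swapfile
mkswap /swapfile
swapon -p 10 /dev/hda2    # partition: preferred
swapon -p 1  /swapfile    # file: only used once the partition fills
swapon -s                 # check the resulting priorities
```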

My only real qualifications at this point are being a regular reader of kernel traffic.

They're probably an effect -- I bet by this point enough of the parent "httpd" process is swapped out that it just isn't calling wait() very promptly when its children are being OOM-killed.

A few zombie processes won't cause many problems by themselves -- all the process's resources are already released; it just can't disappear from the process table until its parent calls wait() and retrieves the exit code.

As I said in my other reply, this looks exactly like a kernel memory leak to me.

Given that it sounds like a kernel/driver issue, I'd try running that system in single-CPU mode and see if the problem goes away. It looks like he might have just enough horsepower to get away with this, and I wouldn't be surprised to see that some random video capture driver has bugs with SMP Athlon systems...

Once my memory usage gets close to the 768MB of RAM my machine has, I see a similar spike in system load until the OOM kills Mozilla or Thunderbird (the usual large memory chewers here). It, thankfully, won't eat X11. This is on Slackware 10 and Slackware current, both using the latest 2.6.x kernel.

I think what really needs to happen is to have a good review of the Linux MM system and get these strange problems out (because, at least in my case, shrinking the disk cache would release ~150-200MB, depending on the size it's at). They seem to be an effect of half-baked MM behaviour to me.

There has been a lot of talk lately on lkml about making the OOM killer a little less "twitchy" in certain circumstances. Maybe that is the problem you are having but it isn't what jwz is seeing. He's actually seeing the machine die due to lack of resources -- the OOM kills are just symptoms of that (but they're helpless since the kernel is leaking away the RAM)

Yes, but is the I/O heavy at the point that the OOM kill happens? It could well be that the system needs that amount of cache to keep your working set in memory.

People hear "out of memory" and assume that it really means completely out of all RAM and swap with all caches purged. Which is understandable, but it's wrong.

Imagine a machine with a couple gigabytes of swap and a process that's quickly leaking RAM. Pretty soon all of the inactive processes will end up on swap and you'd be in a thrashing scenario where even the active processes (including the leaker) are pretty much running in swap and going about a million times slower than they should. Now the memory leak is only growing very slowly (since everything is going very slowly) and there's basically no way to log in and kill it. In theory eventually the leaker will run totally out of swap but this could literally take years of thrashing. Obviously this would make the OOM killer pretty useless since the admin would have to hit the Big Red Switch long before it would trigger; this is exactly what OOM-kill is trying to avoid.

So rather than meaning "I'm totally out of memory" it means "I seem to be approaching thrashing; I'm worried that if I continue I might soon be unresponsive so I should do something NOW before that happens"

This obviously requires some heuristics to determine, and they can never be perfect (simply because the kernel can't predict the future -- for instance there could be a big simulation that's a millisecond away from completing and releasing all its RAM). Like any MM heuristic it's a constant tuning battle to make it work reasonably well on all workloads. As I said earlier there is a lot of talk that it's too trigger-happy for some people, and various tweaks are being tried both in 2.4 and 2.6 that will hopefully make it work better.

So I guess what I'm trying to say is this -- maybe it really is busted for you, but how much disk cache you currently have isn't really evidence one way or another. Thrashing is thrashing regardless of how much cache or swap you have.

Well to really know for sure you'll need to wait until the problem is clearly exhibiting itself. However I do think the growth of the size-256 slab looks interesting -- up from 66300 to 417480 objects. That's a rate of about 200MB/day which sounds fairly in line with what you're experiencing.

So what that tells us is that whatever seems to be leaking is a kmalloc() allocation of a size between 129 and 256 bytes (requests in that range get rounded up into the size-256 slab).

Isn't debugging kernels fun? Now you can see how I became such a bitter person.

Yeah, it looks pretty clear there that almost all of the pages on the machine are in the slab cache (i.e. what the kernel uses for dynamic memory allocations... its version of malloc(), so to speak). On your 1G system you have a grand total of 262140 4K pages and 243624 of them are in the slab.
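The arithmetic checks out, since each page here is 4KB:

```shell
# 262140 pages x 4KB is about 1GB total; 243624 of those are slab pages
echo "total: $((262140 * 4 / 1024)) MB"   # 1023 MB
echo "slab:  $((243624 * 4 / 1024)) MB"   # 951 MB -- nearly all of RAM
```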

The next step is watching /proc/slabinfo to see what is getting really large before the box croaks -- that at least narrows the bug down a bit.
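One way to do that watching, sketched against the 2.x slabinfo format where field 1 is the cache name and field 2 the active object count (reading /proc/slabinfo may require root):

```shell
# Take two snapshots of <name, active_objs> a minute apart, then list
# the slabs that grew the most in between.
snap() { grep -v '^#' /proc/slabinfo | awk '{print $1, $2}' | sort; }
snap > /tmp/slab.1
sleep 60
snap > /tmp/slab.2
# join on slab name; print growth and name, biggest first
join /tmp/slab.1 /tmp/slab.2 | awk '$3 > $2 {print $3 - $2, $1}' | sort -rn | head
```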

Also could you send me an "/sbin/lsmod" output so I can see exactly what's loaded on the box? Thanks.

The bttv driver in the kernel is known to be quite buggy, and I don't know if Fedora is patching it or not. My MythTV/Freevo box had similar problems until I applied a lot of v4l patches. Latest and greatest patches can be found here: http://dl.bytesex.org/patches/

hrm, but as I understood it, pretty much everything hits the slab first, even kmalloc() and the like, and if the request can be satisfied from that, then it is. I am not a super kernel hacker, just a hobbyist, but I was somewhat under the impression there was no way to say 'i want a chunk of memory and it had better not come from the slab damn you'

however, nonetheless, your advice of watching slabinfo probably isn't a bad idea.

jwz, a less scientific method, if the problem proves to be in user space, is to start slaughtering processes one by one until it no longer happens.

> it, pretty much everything hits the slab first, even kmalloc() and the likes

Yes, exactly -- kmalloc is implemented on top of slab... it basically just converts the size you ask for into the next-larger slab, so if you say kmalloc(600, GFP_KERNEL) it says "ah, you want something from the size-1024 slab". (It's actually a touch more complicated than that, but that's the basic idea.)
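A toy illustration of that rounding (the 32-byte minimum is an assumption; the real cutoffs vary by architecture and kernel version):

```shell
# kmalloc picks the smallest power-of-two size-N slab that fits.
kmalloc_slab() {
    n=32
    while [ "$n" -lt "$1" ]; do n=$((n * 2)); done
    echo "size-$n"
}
kmalloc_slab 600   # size-1024
kmalloc_slab 300   # size-512
kmalloc_slab 200   # size-256 -- the suspect slab in this case
```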

Other high-volume users of the allocators make their own slab caches to allow them to get allocations of EXACTLY the size they want (plus some other cool features of having your own cache too esoteric to get into here)

Slab caching originally appeared in Solaris around 1993 and rapidly spread to most other current OSes. If you're interested in this sort of thing I'd highly recommend you read Jeff Bonwick's original paper on slab caching and his 2001 follow-up paper describing some of the more recent enhancements.

So watching what slab grows will at least indicate about how big the leaking allocation is which might narrow it down a bit. If we're really lucky it'll be leaking from a special-purpose slab which would be even more revealing.

> but I was somewhat under the impression there was no way to say 'i want a chunk of memory and it had better not come from the slab damn you'

Under Solaris there's pretty much not (at least for the kernel heap area) but linux is pretty different in that regard. I'm 99% certain that __get_free_pages() and friends bypasses the slab allocator. Also linux has vmalloc() which builds larger-than-1-page contiguous allocations by using the MMU to glue a bunch of pages together; it is also separate. There is a lot more in Chapter 7 of the Linux Device Drivers book which is conveniently online.

> Slab caching originally appeared in Solaris around 1993 and rapidly spread to most other current OSes.

Yeah, I knew that. So many people knock Solaris, but they have been 'industry-strength' (tm) for a long time now. Then again, maybe I just think people are like that because I sit next to 2 people at work who hate Solaris.

> So watching what slab grows will at least indicate about how big the leaking allocation is which might narrow it down a bit. If we're really lucky it'll be leaking from a special-purpose slab which would be even more revealing.

Ah, for some reason I thought you were suggesting that all of his memory was in slab so that he was running out of memory, which I suppose could be happening. With your explanation above, however, it makes total sense.

Going from memory, I am pretty sure most of the calls that deal with pages bypass the slab, vmalloc() as well, because (again going from memory) it uses some of those *pages() functions. Also, I am probably wrong here, but I was under the impression vmalloc() took a bunch of non-contiguous memory and mapped it so that it was contiguous from the caller's point of view; however, that may have been what malloc() itself did. I find myself not using this stuff often enough to remember it.

Those papers you point out are good, as is 'Understanding the Linux Virtual Memory Manager' (which I seem to remember is also online somewhere). I'm probably about halfway through that book, but I've set it and Linux Device Drivers aside for a Hermann Hesse book at the moment.

Didn't he say he had more than one of these boxes? If the others are doing fine and they are the same software-wise, then I would be inclined to think it may be hardware-related.

1) Do add swap to cover your RAM -- I believe the recent VMM gets upset if it cannot swap all dirty pages while memory usage gets high. Just add an extra 1G swap file of lower priority than your swap partition and see if it helps.

2) Some people are concerned about the stability of apache 2.0.x and still recommend 1.3.x. Personally I do not have any problems with a very similar setup, but my load is much lower (and I have 6GB of swap space -- 2GB on each hard disk I have in the machine).

See, sysadmins are superstitious types. For instance back in Version 7 on the PDP-11 there wasn't a real way to formally shut the system down; you just had to make sure the system was quiet, then run "sync" to get all the buffered data out to disk.

The problem was that when "sync" returned there were still some cases where things might still be streaming from the disk controller or whatnot, so if you were too quick on the switch you could still lose some data. A rule of thumb was created -- just run sync three times. By the time you were done typing the third command the first one would definitely be done. Simple and easy to remember -- even a barely-literate tape monkey could handle it.

But here's the weird part -- people still do it. Hell, there's people who weren't even alive when it was needed who still dutifully run three syncs when shutting a machine down. Even funnier is that they often run "sync; sync; sync" which never would have made sense because the whole point of the rule was just to slow down the process a tiny bit.

The "have more swap than physical RAM" thing is the same deal.

There's a grain of truth to it: back in SunOS 4 (and maybe some other 80's-vintage *NIXs; I don't know) this really was a requirement. The swap allocation was done at page mapping time so for every non-pinned location in RAM you needed a space reserved to swap it to.

This isn't the case in any modern OS I'm aware of. Yet still I hear this claim made a couple times a year -- it's just a rumor that's been traveling in circles among sysadmins for the last 15 years. I'll probably be hearing it for another 15 too.

> 2) Some people concern with stability of apache 2.0.x and still recommend 1.3.x.

Only a problem if you're using weird PHP modules *AND* you're using one of the multi-thread models of apache 2 (you can configure apache 2 to use the "pre-fork" model just like apache 1.3 if you think this might be a problem for you). Even then you'll just have an apache core-dump, not a hosed computer.

RedHat has been shipping apache 2 exclusively since at least RHL9. It has been stable for years now.

The bottom line is that the current vm does its best when it has essentially unlimited swap. swap >= 2*RAM seems a reasonable approximation of that. If you have trouble allocating a couple gigs off your hard disk, you probably need to add another drive.

Apache2 may be great, but I've had problems with mod_fastcgi and apache2 that went away when using apache1.3. When a fastcgi child process died the fastcgi process manager also died during the process of spawning a new child.

It's still a good idea to have at least as much swap as RAM, at least for smaller desktop systems. Other things can use swap too besides just offloading huge memory hogging programs - swsusp, for one; VMWare 5 Beta can also optionally put a portion of the running guest's memory directly into swap so as not to hog so much of the physical RAM.

Putting aside the usual use a different distribution/recompile your kernel/sacrifice a chicken suggestions: can you afford to switch off overcommit altogether? See vm/overcommit-accounting in your kernel docs.
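For reference, a sketch of flipping that switch at runtime (needs root; mode 2 means strict accounting, and the ratio of 50 is the kernel default):

```shell
# Strict accounting: allocations beyond swap + 50% of RAM fail with
# ENOMEM at malloc() time instead of invoking the OOM killer later.
echo 2  > /proc/sys/vm/overcommit_memory
echo 50 > /proc/sys/vm/overcommit_ratio
grep -i commit /proc/meminfo    # CommitLimit / Committed_AS show the ceiling
```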

I was about to say "well, clearly you're running into one of the relentlessly stupid OOM situations recent Linux kernels frequently produce" but you have plenty of swap. The high idle percentage and 22.7 load suggest that your machine is spending a great deal of time thrashing; this is common behavior (unfortunately).

While there could be a kernel memory leak, I suspect there's a more pedestrian explanation. It would be better if your ps/top monitoring overlapped the OOM notifications in the syslog, but here's my best guess, based on very similar experiences. First, a fairly busy system has a burst of activity (for whatever reason), runs out of memory, and the system starts thrashing. Something happens to demand real RAM. Then the OOM comes up and kills stuff.

Some things that are different from what I'm used to with similar situations on my little server: you have swap available. With the CPU at 56% idle with a load of 22.73, you're certainly maxing out your I/O bandwidth hitting the swap partition, and the OOM events may be triggered by the kernel concluding that it couldn't possibly write data out as fast as it comes in. But the only situation I can think of where that might come up is in real time events.

So, how do you fix this? Well, my first reaction is "add more RAM". The OOM I'm not so sure about (although it's indicative of this general class of problem), but going from 3.44 to 22.73 with a 56% idle time is very characteristic of thrashing, and even if there weren't any OOMs the system would still be unusably slow.

Additionally, I would suggest that you add more swap. Yes, even though you're not running out. Since you don't have enough swap to back all of RAM (and some change) there are different stupid thrashing problems that can occur with that.

Incidentally, the above highlights that I am particularly unhappy with the Linux OOM behavior, although I find the general VM performance a damn sight better than the BSDs I've run. I deal with it by choking the machine with RAM and swap, which is not a real answer but is frequently a lot cheaper than my time and deals with it.

Now that I read <lj user="bodyfour">'s post, the idea that the OOM is coming up because the kernel itself is snarfing all the RAM is more plausible. I would check that out before throwing more RAM at the problem; if there is a big kernel leak more RAM will just put off the crash.

I know you already replied again on this thread but I thought I'd point a couple other things out.

> but you have plenty of swap.

OOM is almost entirely unrelated to how much swap you have.

> So, how do you fix this? Well, my first reaction is "add more RAM".

Look at the top output after the machine has been up for an hour -- there's over 400M of RAM that hadn't even been touched yet (so the page cache probably still has every fs page we've ever even looked at, for instance). I'd say this box probably has at least twice as much RAM as it really needs (not that that's a bad thing, mind you; having plenty of RAM is great, especially at today's prices)

Thanks for the comment; it's been some years since I seriously followed kernel development, and on my main server I've only ever seen OOM errors after it has eaten every bit of real and swap. I've also never had the Linux kernel seriously leak memory on me--the fact that I didn't realize to look for it or even what to look for embarrasses me somewhat.

Yeah, I think there's something to this idea. We recently experienced a problem where two systems running the same software (C/Perl and mysql based network stats software) experienced a memory leak after a FC2 errata kernel autoupgrade. That same kernel running on other hardware ran without a leak, but, for whatever reason, the load or hardware on these machines produced a memory behavior that would fill swap and then physical memory and cause the system to become unresponsive to user-land tasks.

In particular, FC2 kernel 2.6.5-1.358 was fine, and both 2.6.9-1.3_FC2 and 2.6.9-1.6_FC2 experienced a steady apparent memory leak. Booting on the older kernel removed the problem.

Beyond that, it's hard for me to say much of merit. I'm pretty unhappy that the problem is apparently the *kernel* or a major device driver.

I've seen drastic increases in load average like that before. The basic problem was that processes waiting for IO are counted, and there was an unreliable nfs mount.

You have quite a few processes waiting for IO in the "going down" list. I wouldn't be the least bit surprised if the pages of the waiting-for-IO processes are locked in physical memory as well, so maybe your oom problem is actually a problem with an IO bottleneck. I'm probably full of shit, but it's a theory I would check out.

Are any of these processes using remote mounted filesystems, or is it local filesystems?

You're generally better off looking at the RSS (resident set size) numbers since that's how much physical RAM is really being taken up by the process. VSZ includes any virtual memory areas in the process including things that aren't paged in (like parts of the program that have never been accessed) or things that aren't even RAM (unused parts of mmaped files) If you look at an X server the VSZ will be huge since it usually includes all of the video card's memory.
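To eyeball that, sort the process list by resident size rather than virtual size:

```shell
# Top memory users by RSS (column 3, in KB); VSZ shown for contrast.
ps -eo pid,vsz,rss,comm | sort -k3,3 -rn | head
```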

But, yeah, the overall story is that none of the userland processes are using a particularly large amount of RAM; the kernel is eating it all.

I figured the VSZ would be the worst-case scenario, showing the absolute most memory the processes could be eating, which, presumably, they are in the "Boot +1hour" file, as there is no swap used yet. It seemed that if those numbers weren't even approaching the system's memory, that would definitely lend weight to the naughty-kernel theory.

> none of the userland processes are using a particularly large amount of RAM; the kernel is eating it all.

Every once in a while, I need a reminder of why I wanted to forget so much about sysadminning. It's a thankless job - people only notice you when something goes wrong.

Memory looks OK. CPU usage from the processes is near zero. I'd say the oom-killer is probably just making a system call to get the total free memory and, if it falls below a certain threshold, it finds the process taking up the most memory and kills it.

I'd say the problem is somewhere in the kernel. Wouldn't be the first time a kernel has leaked memory, and oh God, that brings back some bad memories. I'm stopping now.

You've had this problem in 2.4 and 2.6 kernels. This machine does video capture. I seem to recall that video capture cards on x86 boxes had very specific RAM requirements -- they had to dump data to specific areas of physical RAM, which conflicted with how the vm people wanted to do things.

What's happening is something buggy is waking up all your swapped out processes and making it shit the bed because suddenly you have 100megs to swap in.

Another weird thing that can be happening is that the oom killer is running because you're low on kernel memory, not system memory, killing processes will usually free up kernel (non-swappable) memory.