Memory Leak

Recently reflashed my Asus RT-N16 with shibby's build 100. Looking at the main status page I can see the free memory dropping constantly. Within a 6 hour window, I usually lose 40MB and I'm down to approximately 60MB free. I don't think it's an issue with the firmware as it does the same for toastman's. Usually with toastman's build it levels out around 95MB and stays constant.

I did clear nvram settings and reconfigured from scratch. What I'm wondering is: is there a way to tell what could be causing it? Also, is it possible it's a hardware issue?

I'm running build 100 on my RT-N66U. Mind you I have very few services running, but my Total / Free Memory is 249.74 / 237.41 (95.06%) after 21 hours uptime.

When you clear NVRAM and reconfigure from scratch, apply your config settings one at a time and observe memory usage after each one. I know the BitTorrent client and Tor both use a decent chunk of memory, especially BitTorrent if you have a lot of torrents going.

Thanks GhaladReam, I'll give that a try when I reconfigure it. As for the Bittorrent Client and TOR, I'm currently not using either of them. Just the regular DHCP Services, IPv6 Tunnel, DDNS, Virtual Wireless, Port Forwarding, File Sharing, VPN (OpenVPN & PPTP)

Yes, but no tools that come with Busybox (thus Tomato/TomatoUSB) out-of-the-box will show you the information you need. Even commands like top -- specifically the Busybox version -- do not show you RES/RSS. Likewise, Busybox ps also does not provide a way to see RES/RSS. Both tools only show VSZ/VIRT, which is not helpful in this situation. As usual, Busybox = pile of junk.

You could install Entware and then install htop (opkg install htop) + run htop --sort-key=RSS, which will provide the necessary information. How to install Entware + its pre-requisites is outside of the scope of this thread.
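If installing Entware isn't an option, a rough per-process RSS listing can also be pulled straight from /proc, since Linux exposes a VmRSS line in each /proc/<pid>/status file. What follows is only a sketch under that assumption, not something shipped with Tomato: the function name rss_top and its optional directory argument are made up for illustration, and kernel threads (which have no VmRSS line) are simply skipped.

```shell
# rss_top [procdir] [count]: list processes by resident memory (VmRSS, in kB), largest first.
# procdir defaults to /proc; it is only a parameter so the function can be tried on a copy.
rss_top() {
    procdir="${1:-/proc}"
    count="${2:-10}"
    for f in "$procdir"/[0-9]*/status; do
        [ -r "$f" ] || continue
        # Each status file has "Name:" and, for userland processes, "VmRSS: <n> kB".
        awk '
            $1 == "Name:"  { name = $2 }
            $1 == "VmRSS:" { rss = $2 }
            END { if (rss != "") print rss, name }
        ' "$f" 2>/dev/null
    done | sort -n -r | head -n "$count"
}
```

Running rss_top with no arguments prints lines of the form "<rss in kB> <process name>". It is cruder than htop (no shared-memory breakdown), but it should be enough to spot a single bloated daemon.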

There is also the possibility that the kernel contains a memory leak. cat /proc/meminfo would be helpful here. Please do not look at the output and start reaching your own conclusions; it requires someone familiar with the attributes to know what's important / what isn't. Output from lsmod would also be helpful.
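To make the /proc/meminfo output easier to paste and compare between snapshots, something like the following sketch can trim it down to a handful of fields. The function name meminfo_summary is hypothetical, and the choice of fields (including Slab, which I believe 2.6 kernels report) is my assumption about what matters for a suspected kernel leak. Posting the full, unfiltered output is still the safer option.

```shell
# meminfo_summary [file]: print a few /proc/meminfo fields, one per line.
# The file argument defaults to /proc/meminfo; it exists so a saved snapshot can be summarised too.
meminfo_summary() {
    # The anchored regex keeps "Cached:" but not "SwapCached:".
    awk '/^(MemTotal|MemFree|Buffers|Cached|Slab):/ { printf "%-10s %8d kB\n", $1, $2 }' \
        "${1:-/proc/meminfo}"
}
```

Taking one summary shortly after boot and another a few hours later shows whether Slab is growing while MemFree shrinks.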

I am not noticing this memory leak. I am using the services described by you guys above, but I am NOT using IPv6 Tunnel, DNSCrypt or OpenVPN. Does disabling any or all of these features fix the memory leak? Might be worthwhile to pinpoint what is causing it by disabling them one at a time.

Output from free will probably show something similar, but with different numbers (htop does its calculations differently than what free looks at; free looks at /proc/meminfo, if I remember right).

The htop output doesn't show any processes taking up excessive amounts of RAM in userland; the biggest is init taking up roughly 600KBytes.

/proc/meminfo shows LowFree: 34108 kB which indicates the lowest amount of memory available/free in the past 4 hours was 34MBytes, which may be an indicator of some process that may have died + been restarted, or possibly kernel-level issue (I'm thinking this).

Thus I am left to believe the RAM issue you may be experiencing (I need actual proof) over time is probably related to the Web Usage or IP Traffic features, since these store massive amounts of data in kernel memory space, which makes it very hard to examine. Please read this thread fully to understand that fact (don't skim it, read it -- I provide insights near the end):

The output from the previous post was after about 3 hours runtime from previously being rebooted.

Here is the current output after running the command you provided. This was after a reboot around 5:00; I also disabled the IPv6 Tunnel prior to the reboot. Currently I am sitting at Total / Free Memory 123.79 MB / 95.96 MB (77.52%) and it seems to be holding steady.

I am running Shibby 100 now and was running 099 on an E4200. I do not have reboots, but I am NOT using IPv6 Tunnel, DNSCrypt or OpenVPN. My wife and I watch YouTube videos plus Netflix movies and programs using a Roku box. Router has never rebooted. --bill

slabtop output is probably going to be the most helpful here... I hope. It may or may not tell me what in the kernel is actually allocating the memory; welcome to the world of UNIX and *IX kernels.

Thanks.

P.S. -- Your amount of free NVRAM is nearing low levels (4148 bytes; slightly more than 4KBytes), so that may be something you look into as well. It isn't related to the RAM utilisation issue however, so don't misunderstand.

Here is the output from the three commands
...
The command 'slabtop -o' didn't seem to give any output. It just went back to a blank prompt. Here is the output from slabtop alone.

The -o flag (also known as --once) is supposed to basically do one iteration of slabtop and then exit. It works fine on my setup; possibly there's a terminal emulation problem causing the issue you're having with it.

Alternately, you can just do cat /proc/slabinfo and look at the output, or this:

...which shows you the top 10 slabinfo entries which are taking up the most memory (based on active number of slab objects). You can change that sort command to sort -n -r -k6 if you want to see the top 10 slabinfo entries which are taking up the most memory (based on the total number of slab objects, not necessarily active).

The last two columns in the output are objsize*activenumberofslabobjs and objsize*totalnumberofslabobjs (should be obvious from the awk command), represented in bytes.
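The exact command didn't survive in this copy of the thread, so here is a sketch of a pipeline matching the description above. It assumes the 2.x /proc/slabinfo layout, where the first four fields are the cache name, active objects, total objects and object size. The function name slab_top and its arguments are hypothetical, and the six-column output layout is my guess based on the sort -k6 remark.

```shell
# slab_top [file] [count] [sortcol]: rank slab caches by estimated memory use.
# Output columns: name active_objs num_objs objsize objsize*active objsize*total
slab_top() {
    file="${1:-/proc/slabinfo}"
    count="${2:-10}"
    sortcol="${3:-5}"    # 5 = rank by active objects, 6 = rank by total objects
    # Skip the two header lines ("slabinfo - version: ..." and "# name ...").
    awk '$1 != "slabinfo" && $1 !~ /^#/ {
        printf "%s %s %s %s %.0f %.0f\n", $1, $2, $3, $4, $2 * $4, $3 * $4
    }' "$file" | sort -n -r -k"$sortcol" | head -n "$count"
}
```

Calling slab_top with no arguments gives the top 10 by active objects; slab_top /proc/slabinfo 10 6 ranks by total objects instead. Reading /proc/slabinfo normally requires root, which is the default login on these routers.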

This is absolutely, 100%, something within the kernel allocating all of your memory. That's 64MBytes of memory right there, and if you break down the rest of the slabcache you'll see that bits/pieces go to other things.

So your next question is obviously going to be "what is size-2048?" The kernel can allocate memory (for itself, drivers, modules, etc.) in differently sized chunks. The number you see after the hyphen is the size of each allocation in bytes, so in this case, 2048-byte allocations. The problem is that you have 31,786 of those allocations. 31786 * 2048 = 65097728 bytes.
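The same arithmetic can be repeated against a live system to see whether one particular cache keeps growing. A minimal sketch, assuming the usual /proc/slabinfo column order (name, active objects, total objects, object size); the function name slab_cache_bytes is made up for illustration:

```shell
# slab_cache_bytes <cache-name> [file]: bytes held by one slab cache (active_objs * objsize).
slab_cache_bytes() {
    awk -v cache="$1" '$1 == cache { printf "%.0f\n", $2 * $4 }' "${2:-/proc/slabinfo}"
}
```

For example, slab_cache_bytes size-2048 run every few minutes gives a single number to track over time.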

So how do you figure out what's allocating all of those size-2048 objects? Wonderful question. I don't know how to do this on Linux (I do know how to do this on FreeBSD).

But if I had to take a guess, it would be this (which I have mentioned in this thread already):

I'm betting this is the root cause, because it's the only thing in the kernel (that shouldn't, IMO, be in the kernel) that can "grow out of control".

Please disable the Web Usage and IP Traffic features in your configuration, reboot your router (the reboot is absolutely required), and then report back in a few days or a week (whatever).

Remember that TomatoUSB is based on a very old version of the Linux kernel (almost 6 years old, I think) -- 2.6.22.19. Digging around Google I found lots of examples of people complaining about memory leaks in the kernel and related drivers:

Are you sure? Has Riddlah actually rolled back to 097 and confirmed the problem goes away? There's no indication of that in this thread, so how are you so sure? Your setup may not be 100% identical to his; I'm certain you two have different router configuration settings and so on. Be realistic here -- think outside the box, turn off tunnel vision, etc...

Otherwise, someone needs to provide a full, very technical + very detailed changelog of what exactly changed between 097 and 100. Because whatever changed is something kernel-level (device drivers, features, etc.) -- not userland programs! -- and therein lies the problem.

@koitsu: I understand what you're saying, I just don't think it's a coincidence it happened to 3 of us. Both 99 and 100 forced router reboots within minutes of flashing and then somewhat sporadically thereafter, something like once per hour.

@mbreslin: I wouldn't be surprised if the "router reboot" was actually a kernel panic (you'd have no way of determining this unless you had "modded" your router to have a serial console port + had it hooked up to something that stored all serial I/O) as a result of memory being exhausted.

It's pretty clear that whatever is causing these memory leaks / random reboots is something that was added or changed in build 099. Looking at the 099 changelog, I see (amongst other things) IPSec support was added, DNSCrypt was updated, IPP2P was updated, and Transmission was updated. I don't think it's Transmission since you guys are still getting the reboots / memory leaks even when it's disabled.

And all of this is assuming the changelog is absolutely 100% accurate and absolutely nothing else was modified/changed other than those things.

Every person who's experiencing this will need to go through the recommendations I listed off earlier in the thread + run the commands I provided (many require Entware) to try and figure out what's taking up all the memory. I imagine the situation may be the same for some people but one cannot assume that is the case for everyone. For example, remember that earlier in the thread, on Riddlah's system for some reason dnsmasq was taking up 13MBytes of memory. The daemon was obviously restarted by him (or maybe by init if someone killed the daemon off) and the memory bloat for that process specifically disappeared. Then later, he provided more evidence of memory bloat, this time in what appears to be the kernel, and I speculate that IP Traffic or Web Usage is what's causing the problem (it's very difficult for me to determine this without a kernel debugger however).

For all I know, your situation may be purely with bloated userland processes/daemons. For all I know Cyberian75's issue may be something completely different. Every case needs to be handled separately/individually.

I've done about as much as I can with this thread -- I'm going to unsubscribe/unwatch it now. I can't do anything more, and juggling 5 or 6 sets of balls while people are doing jigs and cartwheels is simply not feasible for me to do.

Build 100, and just as an update, the following have been disabled: DNSCrypt, IPv6, Web Usage, Logging. Also no extras have been installed aside from Entware. OpenVPN is still enabled for the purpose of monitoring any issues and any necessary reboots to make it usable again.

The reason I say this is because Riddlah was still experiencing the problem yet turned off "IPv6 Tunnel" entirely. See the link/post (it's in this thread) for verification. Starting to see why I said every person's situation may be different?

Furthermore, can you explain what "IPv6 Tunnel" means in this context? Maybe in Shibby it's labelled differently than in Toastman's builds, but the only IPv6 tunnelling feature I've seen is a 6in4 tunnel through places like Hurricane Electric, etc... Is that what you're referring to? It's called "6to4 Static Tunnel" in Toastman, not sure about Shibby. This is different than "6in4 Anycast Relay" (and that behaves quite differently too).

I myself use native IPv6 (specifically: DHCPv6 with Prefix Delegation), since Comcast delegates IPv6 to customers directly (in my area), and do not have issues.

Possibly the issue is related to ICMPv6 router advertisements coming from the tunnel provider? I don't have a good way to tackle this kind of situation. There are lots of adjustables and useful things in /proc but I'm simply not familiar enough with the IPv6 stack on Linux to use those to troubleshoot. About the only thing I can think of is to look at the number of IPv6 routes you're seeing using route -A inet6 -n | wc -l via CLI. On my Comcast connection, I see between 60 and 400 routes (it varies, as it should). Most of my route table entries are for /128s (i.e. a single IPv6 address).
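If the route command isn't handy, the kernel's own table can be counted directly. This is a sketch that assumes /proc/net/ipv6_route is present (it should be whenever IPv6 support is loaded) and holds one entry per line; the function name ipv6_route_count is hypothetical. The number won't match the route | wc -l figure exactly, since route also prints header lines.

```shell
# ipv6_route_count [file]: number of entries in the kernel IPv6 routing table.
ipv6_route_count() {
    awk 'END { print NR }' "${1:-/proc/net/ipv6_route}"
}

# Example: log the count next to free memory once a minute (Ctrl-C to stop).
# while true; do
#     echo "$(date +%H:%M:%S) routes=$(ipv6_route_count) $(grep MemFree /proc/meminfo)"
#     sleep 60
# done
```

A route count that climbs steadily alongside falling free memory would at least point the finger at the IPv6 side of things.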

Yes, I'm using Hurricane Electric. When it's on, memory takes a dive; when it's off, memory is stable. What more can I say?

It would help if you could run some (or all) of the commands I've asked throughout this thread and provide the output from them here (in code blocks for readability). Some may require you to install Entware, but for example the last one I just provided doesn't.

Good luck getting this guy posting anything useful in another form than a one-liner... completely hopeless...

There are 44,934 kernel slab objects of 2048 bytes in use. 44934 * 2048 = 92024832 bytes, or roughly 92MBytes. Lucky for you your router is 256MBytes of RAM, but as shown from /proc/meminfo's LowFree column, the lowest amount of memory you've had is 15MBytes. That's pretty rough indeed.

What's using/allocating these slabs is difficult to determine without a kernel debugger.

So once again, this is a kernel-level problem (whether it be a network stack, device driver, or some other anomaly).

Let me add that I have this problem too (the latest builds from shibby, toastman, avenard, etc. all have the same problem, and it's not a surprise, because all of them share the same code base) and yes, I have a 6in4 tunnel enabled (using HE.net). The leak progresses very fast if the connection is actively used (say, a torrent is downloading with a number of IPv6 peers).

Build 1.28.0500.2 by toastman has no leak, and 0500.4 has the leak, so the Git repository may be used to track the problem down.

Unfortunately, there's nothing I can do until mid-October as I'm away, and I have no wireless routers where I could load tomato to try

Note that the official Asus firmware also has known issues with IPv6, which were supposedly addressed in their latest firmware. We would need to check which of their changes are related to the Broadcom drivers.

@jya my conclusions:
- v097 works correctly, with no memory leak. (This version has the new BCM driver and everything works great.)
- The memory leak occurs ONLY when IPv6 is enabled. Cyberian75 and I are both using an HE.net tunnel. Memory leaks when the static IPv6 tunnel to HE.net is enabled and we are (for example) watching YouTube. It leaks very fast: after about 5-6 minutes we get a reboot.
- It is not a problem with 6RD support (this was added in v100, but the memory leak is also present in v99).
- It is not a problem with these commits:
DDNS: fix HE.net Tunnel Broker messages
Added IPSec support for K26 builds
Transmission: update to 2.61

I reverted those commits and the problem still exists. All tests were made on an RT-N66U with Tomato Mega-VPN 64k, WAN DHCP (public static IP) and an HE.net IPv6 tunnel without DDNS.

I suppose there may be a problem with the "Fix virtual wireless (MultiSSID)" commits. I'm now compiling 097 (as I said, this version doesn't have the memory problem) with only your four MultiSSID commits, and will let you know.

Compiled, and we have the memory leak problem. Going back to v097 makes the problem disappear.

This is why all new builds (mine, jya's and Toastman's) have the memory leak problem.

@jya can you look into that?

The multi-SSID fix only changes the order in which the wireless interface is initialised. The only change that is relevant is that the MAC address of the interface is forced. This occurs only once, right after a boot.
So there's no way that code introduces a leak. Even if that code had a leak, it would be a one-off.

If this change causes a leak, all it does is expose a bug in the kernel driver... Not much you can do there...

It may be a bug in the Broadcom LAN driver, exposed by changing the MAC address on the interface. I have seen such things before, on the x86 platform. I suggest downgrading the Broadcom drivers (or upgrading to the latest ones from the latest Asus beta firmware) and retesting.

If radvd is the source of the problem, then the issue is that the daemon itself does not "behave correctly" with the kernel. The point I'm trying to make: the daemon itself does not grow in RSS/RES. Here are posts with relevant output that prove the daemon itself does not have memory bloat (but of course may still be causing memory bloat within kernel slab space):

View these posts and search for "radvd" and look at the RSS/RES column.

I imagine radvd must be doing something bad with the related RAs (router advertisements). There must be some form of ABI used between userland and the kernel for handling RAs, and the userland program is expected to tell the kernel "free this", which would explain how there could be a bug in radvd. But the actual memory exhaustion itself happens within the kernel.

When I backported that fix (this one was by me, as I was also trying to track down the source of the memory leak at the time) it didn't solve the issue. It's possible however that there were two distinct memory leaks, and while I happened to fix that one, Asus fixed another one with the radvd update (though I agree, it also sounds doubtful).

No idea which of these two would be responsible for Tomato's specific case however. I'd start by also implementing the kernel patch, and if it still leaks then also try the radvd upgrade.

Riddlah, at what % of RAM does your router reboot? Mine reboots at around 50%.

I haven't actually seen a reboot. When the free memory drops to around 20MB, the router drops all connections and Windows shows the connection as limited. I let it sit for about 5 minutes and it did not reboot. I had to manually pull the power on the router to reboot it.