Now, a traceroute -v showed that it was receiving ping replies from 'itself to itself', which is a bit odd. Then I remembered that I've an external device which pings my WAN address from outside every second. I turned this off and now traceroute completes: the non-responding hops (5, 6 & 7) now get '*' marks and time out. Re-enabling the external ping box stops Tomato's traceroute from working.

The output you show looks to be DNS-related. However, the description you give in your paragraph after the traceroute has to do with ICMP limiting rules in the Linux kernel. Add this to Scripts -> Init to relieve yourself of that issue:
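The actual commands weren't preserved in this quote, but the kernel's ICMP limiting referred to here is normally controlled through the icmp_ratelimit/icmp_ratemask sysctls. A hypothetical reconstruction of what a Scripts -> Init fragment for this might look like (values are my guess, not the original post's):

```shell
# Hypothetical reconstruction -- NOT the original commands from this thread.
# Disable the Linux kernel's ICMP rate limiting:
echo 0 > /proc/sys/net/ipv4/icmp_ratelimit   # 0 = no limiting interval
echo 0 > /proc/sys/net/ipv4/icmp_ratemask    # empty mask = no ICMP types limited
```

Note these writes need root and a /proc filesystem, so they belong in an init script or a root shell on the router.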

I don't think it's a case of rate limiting or packets being dropped. It's as if the incoming pings (1 per second) are being heard by the traceroute command even though traceroute didn't initiate those echo requests to the monitoring box, and hence traceroute never hits the 'no reply' timeout.

Did you simply paste what I told you into Scripts -> Init, click Save, and expect the commands to be executed (without a reboot)? If so, that doesn't happen. You'll need to go into the router via CLI and issue the echo commands by hand for them to take effect immediately (alternatively, you can reboot).


I still can't reproduce this behaviour, but I wouldn't be surprised if this turned out to be a bug in Busybox traceroute (really -- ICMP hand-off between the kernel and userland is a little tricky, it's not quite the same as TCP and UDP).

At the same time this is running, I have a hosted VPS box that uses ping to hit my WAN IP to check for connectivity issues and log the results (so what you're going to see in a packet capture is predominantly ICMP ECHO/ECHO REPLY). That VPS box is 206.125.172.42.

Here's a packet capture which was running at the same time (started before the traceroute). I can't show it in a code block due to "number of characters exceeded" from the forum, so it's an attachment (DOS CR format).

You can see quite clearly in the capture that the ICMP ECHO and ICMP ECHO-REPLY packets are going back and forth between my router and 206.125.172.42 at the same time the traceroute is running. You can see the traceroute running because of all the ICMP TIME EXCEEDED messages. The source IPs vary because that's how traceroute works (read up on how traceroute works, re: incrementing TTL).
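For anyone reading along who hasn't dug into how traceroute works, here's a toy model (no network involved, and purely my illustration, not Busybox code): each router along the path decrements the TTL, and the router that decrements it to zero sends back ICMP TIME EXCEEDED, so each probe's TTL reveals one hop -- which is why the source IPs differ per hop.

```shell
# Toy model of traceroute's TTL trick. The hop names are made up.
hops="routerA routerB routerC destination"

probe() {  # probe <ttl> -> name of the hop that answers this probe
  n=0
  for hop in $hops; do
    n=$((n + 1))
    [ "$n" -eq "$1" ] && { echo "$hop"; return; }
  done
}

for ttl in 1 2 3 4; do
  echo "probe ttl=$ttl answered by $(probe "$ttl")"
done
```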

And again: please stop using traceroute -v on Busybox, it's broken.


Thanks for trying to replicate this, and for the capture, which makes perfect sense. I'll try some different firmware versions to see if it's 'always' been like this, or maybe something to do with the WNR3500Lv2... tomorrow.

TomatoUSB therefore does not have this fix. If Busybox were upgraded to 1.19.x or newer, it would.

Like I said, Busybox is such a piece of junk. So many bugs that are utterly catastrophic on so many levels. It should amaze people that there are commercial embedded devices (like cable modems) that use Busybox.

You can contact the committer yourself and ask him why he didn't backport that fix. I'm sure his response will be "you shouldn't be running that old of a Busybox anyway".

I should also note that the diff/fix could be backported by shibby20 or Toastman, but I dunno what the TomatoUSB policy is on manual patches/backports for Busybox. It may, overall, just be better to upgrade to 1.19.x, but given how important/key Busybox is to the firmware, that may be a bigger undertaking than simply backporting the patch.

Well, Busybox is updated in Shibby's 106 release to 1.2? (I suspect code shared with the RT-Merlin project). I hope Shibby updates the git repo soon, as I custom-build with some dnsmasq fixes (latest version as of two days ago is 2.66test16, but it's on-going); as it currently stands I've not tested the 106 release. I've handed Shibby a copy of dnsmasq 2.66test10(ish), which he's said will probably make it into 107. Meanwhile I'm keeping an eye on dnsmasq and seem to be maintaining the Tomato additions; when they get to a 2.66 release I'll make sure Shibby gets that, then pause a bit.

I personally would like to see radvd gone from Tomato and replaced by the IPv6 RA & DHCPv6 with DNS integration of dnsmasq -- it'll do all of what radvd does and more. I (now) have a system at home which is happily handing out addresses to iPhones/Androids and Windows boxes, both IPv4 & IPv6, and it's even maintaining the DNS lookups, forward and reverse... try doing that with radvd.
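For the curious, a hypothetical dnsmasq.conf fragment (my sketch, not the poster's actual config) showing the RA + DHCPv6 + DNS combination described above:

```
# Hypothetical dnsmasq.conf sketch -- interface name and lease range assumed.
enable-ra                                       # send router advertisements
dhcp-range=::100,::1ff,constructor:br0,ra-names,12h
# constructor:br0 builds the prefix from the br0 interface address;
# ra-names lets dnsmasq put RA/DHCPv6 client hostnames into DNS,
# which is the forward/reverse lookup integration radvd alone can't do.
```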

Although I'm really impressed by and respectful of Jonathan Z's work (and that of other maintainers), there's a decided lack of documentation as to how any of this stuff really works, and a lack of flags saying 'this release of xyz has some custom Tomato code buried in it', which makes maintaining harder than it perhaps should be. Mind you, what do I really know... I can't even program in C.

I just examined the tomato-shibby-RT-N branch via git (repo.or.cz). The included busybox is 1.18.5. So unless Shibby has pending git updates he hasn't pushed out yet, I wouldn't expect any change.

If what you say is true: if you can get your hands on the /bin/busybox binary from 106, rename it to traceroute and run it (stick it in /tmp or something; it's dynamically linked, so you'd better hope libcrypt, libm, libgcc_s, libc, and ld-uClibc haven't changed between 105 and 106!), and you can test it in advance.
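The rename trick works because busybox is a multi-call binary that picks its applet from argv[0]. A self-contained demo of that dispatch mechanism (the throwaway script below stands in for the real busybox binary; paths and names are mine):

```shell
# Demo of argv[0] dispatch -- /tmp/fake-busybox plays the role of busybox.
cat > /tmp/fake-busybox <<'EOF'
#!/bin/sh
case "${0##*/}" in
  traceroute) echo "traceroute applet selected" ;;
  *)          echo "unknown applet: ${0##*/}" ;;
esac
EOF
chmod +x /tmp/fake-busybox
ln -sf /tmp/fake-busybox /tmp/traceroute   # invoke via the symlink's name
/tmp/traceroute
```

With the real binary you'd copy the 106 busybox into /tmp and symlink (or rename) it the same way.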

From my review of the (buggy) Busybox traceroute code, you should be able to use -w 1 as a workaround. Their commit message describing the problem is accurate but not verbose enough: while the wait-interval code is running (controlled indirectly via -w), a large amount of ICMP traffic (of any type) arriving during that interval can cause traceroute to lock up (I believe in an infinite loop).

Setting -w 1 assigns 1 to the internal variable waittime (default 5), which is later multiplied by 1000 (milliseconds) in the safe_poll() call. By decreasing that window of time, you effectively decrease the window of opportunity for "too many" ICMP packets to arrive and lock up the program.

Like I said: ICMP is handled very, very differently between kernel and userland than things that use classic layer-4 sockets (UDP, TCP, etc.), because ICMP sits at layer 3 of the OSI model. Effectively, userland programs can "see" all ICMP traffic received by the kernel (see the first line of my second paragraph above).
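A sketch of the timeout math described above (the variable name waittime comes from my reading of the Busybox source; the arithmetic is the point here):

```shell
# Per-hop wait window in Busybox traceroute, as I read the source:
waittime=1                       # as set by the workaround: traceroute -w 1 <host>
timeout_ms=$((waittime * 1000))  # value handed to safe_poll(), in milliseconds
echo "per-hop poll timeout: ${timeout_ms} ms"

default_ms=$((5 * 1000))         # default waittime of 5 -> a 5000 ms window
echo "default poll timeout: ${default_ms} ms"
```

So -w 1 shrinks the vulnerable window from 5000 ms to 1000 ms per probe, which is why the lock-up becomes far less likely.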