For a long time now I've had a very strange problem with my wi-fi network at home. I have a BT Voyager 2100 ADSL modem, and an iMac, an ageing PowerBook and a PC that connect to it wirelessly. The problem is that I can never access a small number of websites because they always time out.

There's nothing apparent that connects these websites in any way. Some examples that I've come across are www.adobe.com, www.microsoft.com, www.portsmouthguildhall.co.uk (a local venue) and subtraction.com (a blog). I can ping some of the sites without problems; there are no timeouts. In fact, I used to be able to access subtraction.com and can still get its RSS feed. I just can't view the site in a web browser any more. This is a very isolated problem—for the majority of my Internet use everything works fine.

It's clearly not a problem with the individual computers because they all have this problem, so it must be a problem further upstream, with my router or even my ISP. I've upgraded the router to the latest firmware and tried resetting it, but that didn't fix the problem.

How can I even diagnose where the problem is? I'm at a loss as to where to start! Are there any UNIX networking commands that I can use (I have Mac OS X)?

Thanks for any help.

EDIT: Following Alnitak's suggestion, I tried a traceroute and ping with adobe.com. As you can see, the traceroute never gets there:

It could be a problem with the router, but it could also be a strange network topology or routing issue with your network provider. If you know anyone else using the same provider, especially if they live close to you, see if they have similar issues.
– Eddie, May 1 '09 at 14:47

10 Answers

There's likely something between you and those sites that doesn't support the typical 1500 byte MTU, and on top of that probably a firewall blocking the ICMP packets that are used for "Path MTU Discovery", so your end can't tell that the normal MTU can't be used.

Try a traceroute, and then for each hop in turn, try sending a large ping packet (1492 bytes) and see if any of those hops refuse to return the packet.
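The per-hop probe described above could be scripted roughly as follows. This is only a sketch under stated assumptions: it assumes the 1492-byte figure means the whole IP packet (so the ICMP payload is 1492 minus a 20-byte IP header and an 8-byte ICMP header), that `ping` is the Mac OS X/BSD version where `-D` sets the Don't Fragment bit (Linux ping uses `-M do` instead), and the hop addresses are placeholders to be replaced with the ones from your own traceroute.

```python
IP_HEADER_BYTES = 20    # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_for(packet_size: int) -> int:
    """Return the ping payload size that yields an IP packet of packet_size bytes."""
    payload = packet_size - IP_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError("packet_size is smaller than the IP and ICMP headers")
    return payload


def ping_command(host: str, packet_size: int = 1492, count: int = 3) -> list:
    """Build (but do not run) a Mac OS X/BSD ping command with Don't Fragment set."""
    payload = icmp_payload_for(packet_size)
    # -D: set the Don't Fragment bit, -c: number of probes, -s: payload size
    return ["ping", "-D", "-c", str(count), "-s", str(payload), host]


if __name__ == "__main__":
    # Placeholder hop addresses; substitute the hops from your own traceroute.
    hops = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
    for hop in hops:
        print(" ".join(ping_command(hop)))
```

Running each printed command in turn and noting the first hop that stops replying to the large packet (while still replying to a normal ping) points at where the smaller MTU begins, provided ICMP is not being filtered at that hop.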

EDIT - your tcpdump output shows that your end is still trying to initiate TCP's "three-way handshake", because the SYN bit is set in the packets from your end. However, the packets coming back from Adobe appear to be truncated or malformed. That's pretty weird, because there shouldn't be any payload in the packets, just the far end's SYN response. I'd need to see a full dump (including the -X option) of just those first 4 or so packets to know more.

EDIT2 - based on your detailed tcpdumps I believe that your router is corrupting the TCP response from some sites. The best way to test this is to borrow another brand of router.

Can we slap upside the head all the netadmins that still cling to the false belief that all ICMP should be blocked? :-) I can't believe we're STILL dealing with this all these years later. C'mon, PPPoE has been out forever. I can understand not "getting it" before then, since the problem never really came up, but really, nowadays everyone should know better.
– Brian Knoblauch, May 5 '09 at 20:02

Plug one of your computers directly into your internet connection and let it get all its network settings from your ISP. If you can't access the sites then it's an ISP issue; if you can, then it's a router issue and you can go from there.

I definitely agree that the basic symptoms of this problem sound like a path MTU problem. There are other possibilities, but that is the most likely place to start.

Given the prominence of the sites you mention, and presumably the extended period of time over which this has been occurring, it seems unlikely that the problem is within the ISP's network... although given the traceroute result shown in the question, the path depth and total latency don't reflect very well on your ISP. Generally speaking, any decent ISP should get you to any major web property (within the USA) in something [well] under 120ms... but I digress.

Using traceroute and ping to diagnose the problem, as others have mentioned, is very helpful, but it is far from a definitive tool given the likelihood of ICMP blocking or filtering in various locations. Because of this, except in the hands of a skilled analyst, it is pretty hard to tell the difference between specific problems and firewalls messing with ICMP.

The best way to rule out an MTU problem is to start by reducing the MTU of the Ethernet interface in one of the computers that is having the problem. See the procedure located here for Mac systems, since you mentioned you have a Mac.

Lower your interface MTU as that procedure describes, in steps of say 100 bytes at a time, checking functionality at each step from 1400 down to 500 bytes. If the problem suddenly goes away at one of the steps, then you definitely have a path MTU problem. If dropping all the way down to 500 doesn't solve it, then it is not a path MTU problem and you can move on to investigating other possibilities (after you switch your MTU back up to where it started, which was probably 1500 bytes).
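The stepping procedure can be written down as a small loop. This is an illustrative sketch only: `site_loads` is a hypothetical callback standing in for "set the interface MTU to this value by hand, then try the failing site in a browser", and the simulated path MTU of 1150 in the example is made up.

```python
from typing import Callable, List, Optional


def candidate_mtus(start: int = 1400, stop: int = 500, step: int = 100) -> List[int]:
    """MTU values to try, from start down to stop (inclusive) in steps of step."""
    return list(range(start, stop - 1, -step))


def first_working_mtu(site_loads: Callable[[int], bool]) -> Optional[int]:
    """Return the first (largest) MTU at which the site loads, or None if none does.

    site_loads is a hypothetical check: it should apply the MTU and report
    whether the previously failing site now works.
    """
    for mtu in candidate_mtus():
        if site_loads(mtu):
            return mtu
    return None  # not a path MTU problem; restore the original MTU


if __name__ == "__main__":
    # Simulated example: pretend the path only passes packets up to 1150 bytes.
    result = first_working_mtu(lambda mtu: mtu <= 1150)
    print("first working MTU:", result)
```

A result of `None` corresponds to the "move on to other possibilities" branch; any other value says the problem went away at that step.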

So I should try reducing the MTU on my Mac and leave the two router MTU settings the same?
– John Topley, May 9 '09 at 9:25

@John Topley - The router MTU settings should not affect this experiment as long as it (or they) are larger than the Mac's settings. (I'm assuming the router is set to something around 1400 or larger). In other words, the MTU setting at the source prevails as long as it is smaller.
– Tall Jeff, May 9 '09 at 11:01

I've fixed the problem now and in the end the fix was deliciously simple. I logged a support call with my ISP (PlusNet) and they sent me a link to a forum post explaining that this problem is a bug in my router's firmware. The fix was simply to set the router's Internet connection MTU to 1500 (the default is 1400) so that it matches the router's LAN side MTU.

Thanks to everyone who offered help and advice. I'm going to accept Alnitak's answer simply because he/she stuck with me on this and kept coming back with more advice and things to try.

Glad it's fixed, and that it was indeed an MTU problem. There must be something very odd in that firmware if it can't even complete the three-way handshake (which has very small packets) in these circumstances.
– Alnitak, May 10 '09 at 14:53

You did not mention whether you are going through a proxy server. It might be interesting to see if your ISP is transparently proxying you, a practice I consider very evil but which I think is quite common. Maybe you could try http://tracetcp.sourceforge.net/usage_proxy.html and do a TCP trace to the hosts that are not working; that could be interesting.

In the meantime going through a proxy server should allow you to access the sites so you at least have a workaround.

Have you tried contacting your ISP about this issue?

To me your traceroute and ping results look totally normal. The lack of reply at the end is normal: that is the last hop that sends ICMP "max hop reached" replies. tracepath is a utility for diagnosing MTU problems, which may help you.

The solution to this problem for me (on Linux) was to enable advanced router support in the kernel and the TCPMSS target support in the netfilter/core netfilter section of the kernel config, and then to tell iptables to force the maximum segment size down:

I have now sent a similar TCP connection request packet to www.adobe.com from my local machine (the only difference being the source IP address) and compared the response packet I get with the one in your latest tcpdump.

I have found 3 differences in the IP/TCP headers:

the "Differentiated Services" field in the IP is set to 0x80 in your case and 0x00 in my case - I am pretty sure this is caused by PlusNet's traffic prioritisation.

the 4 bytes at offset 0x20 are "0000 5012" in your case and "5012 0000" in my case - these are the data offset, flags and window size fields in the TCP header. It looks like something is swapping these 2-byte words in your case, and this is definitely what results in an invalid TCP packet.

the connection response has a TCP MSS option (with value 1460) added in your case, but there are no TCP options in my case
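To see why the swapped words in the second difference make the packet invalid, the 4 bytes can be decoded by hand. The sketch below is only an illustration using the two byte sequences quoted above; it looks at the data offset, the six classic flag bits and the window field, and ignores the reserved/ECN bits for simplicity.

```python
import struct

# The six classic TCP flag bits, lowest bit first.
FLAG_BITS = [("FIN", 0x01), ("SYN", 0x02), ("RST", 0x04),
             ("PSH", 0x08), ("ACK", 0x10), ("URG", 0x20)]


def decode_offset_flags_window(raw: bytes) -> dict:
    """Decode bytes 12..15 of a TCP header: data offset, flags and window size."""
    word, window = struct.unpack("!HH", raw)  # two big-endian 16-bit words
    data_offset = word >> 12                  # header length in 32-bit words
    flags = [name for name, bit in FLAG_BITS if word & bit]
    return {
        "data_offset": data_offset,
        "flags": flags,
        "window": window,
        # A TCP header is at least 20 bytes, so the data offset must be >= 5.
        "valid_offset": data_offset >= 5,
    }


if __name__ == "__main__":
    print("expected:", decode_offset_flags_window(bytes.fromhex("50120000")))
    print("swapped: ", decode_offset_flags_window(bytes.fromhex("00005012")))
```

With the words in the expected order the header decodes as a SYN/ACK with a data offset of 5. With the words swapped, the data offset becomes 0 and no flags are set, so the receiving host cannot treat it as a reply to its SYN.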

My guess would be that your router tries to be clever by adding an MSS TCP option, but in some cases messes up the TCP header. Does your router have any "MSS clamping" settings? If so, I would try disabling those settings. Otherwise I would suggest asking PlusNet support (showing them the tcpdump output).

I had a similar problem with my router locking up when accessing certain streaming audio/video resources. Updating the WMP network settings resolved that particular issue; not sure if it might be relevant in your case.

Gonna go out on a limb and say it is a subnet mask problem, either with your local LAN (should be 255.255.255.0) or with your WAN-side.

I suggested this because if the subnet mask were incorrectly set to something like 255.254.255.0, you could end up with strange results, such as seemingly random reachability for big sites (with multiple A records).
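As a rough illustration of how such a mask could cause trouble, the sketch below checks whether a mask is contiguous and whether two addresses fall in the same subnet under it. The host and remote addresses are made-up examples, not taken from the question.

```python
import ipaddress


def to_int(addr: str) -> int:
    """Convert a dotted-quad IPv4 address or mask to an integer."""
    return int(ipaddress.IPv4Address(addr))


def is_contiguous_mask(mask: str) -> bool:
    """True if the mask is a run of 1 bits followed only by 0 bits."""
    inverted = ~to_int(mask) & 0xFFFFFFFF
    # A contiguous mask inverts to 2**n - 1, which shares no bits with 2**n.
    return (inverted & (inverted + 1)) == 0


def same_subnet(host: str, other: str, mask: str) -> bool:
    """True if the host would treat other as being on its own local network."""
    m = to_int(mask)
    return (to_int(host) & m) == (to_int(other) & m)


if __name__ == "__main__":
    host = "192.168.1.10"    # example LAN address (assumption)
    remote = "192.169.1.50"  # example Internet address (assumption)
    for mask in ("255.255.255.0", "255.254.255.0"):
        print(mask,
              "contiguous:", is_contiguous_mask(mask),
              "remote treated as local:", same_subnet(host, remote, mask))
```

Under the broken mask the host would treat the remote address as if it were on the local network and never send the traffic to the router, which is the kind of selective unreachability described above.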