
The steps to reproduce the issue are somewhat complex, but we encountered it when proxying concurrently to many different back-end servers. Somewhere between 30 and 50% of the time the handshake would get corrupted. The code appears to assume that if the second socket.write has been flushed, the first one has been flushed as well, but that is apparently not always the case. Or perhaps, because noDelay is set, the two writes were being split into separate responses instead of the single response they are supposed to form? Either way, the source of the problem was the double socket.write being used to push the string data and the binary data.

I condensed these into a single write by reconstructing the buffer if modifications were necessary and the handshake failures seem to have stopped.
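For reference, a minimal sketch of the kind of change described (the helper name `combineHandshake` and the variable names are made up for illustration; the actual node-http-proxy code is structured differently):

```javascript
// Before (two writes): the handshake headers and the first payload bytes
// went out as separate socket.write calls, which may be flushed separately:
//   socket.write(headersString);
//   socket.write(firstPayload);

// After (one write): reconstruct a single buffer and write it once.
function combineHandshake(headersString, firstPayload) {
  // Concatenate the UTF-8 header bytes and the binary payload so the
  // data leaves in a single write rather than two.
  return Buffer.concat([Buffer.from(headersString, 'utf8'), firstPayload]);
}

// Usage:
//   socket.write(combineHandshake(headersString, firstPayload));
```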

@ashaffer Copying all data into a single buffer makes this particular problem go away, but it doesn't solve the underlying cause. Making two write calls should just work. So I'm curious what kind of "corruption" you're seeing, e.g. are buffers truncated, are bytes changed, do packets arrive in the wrong order?

Unfortunately I never pinned down the exact nature of the corruption that was occurring. My best guess is that the first write was sometimes being flushed completely before the second write happened, and that this response splitting was causing issues in some way. Perhaps a chunked transfer-encoding header would fix it as well, since during the handshake it's still standard HTTP, which cares about such things?

If there are any decent debugging tools out there for WebSockets, I'd be happy to run some tests for you, but the best I've found was manually poring through tcpdump logs, which was incredibly tedious and didn't end up yielding much.

@ashaffer Unless your backend service is very poorly written, response splitting should never be a problem. Remember that TCP is a stream-oriented protocol. If this really is the cause, then you probably want to fix your backend. What I would advise you to look into, however, is whether the handshake makes it to the backend in its entirety. If you see a TCP FIN or RST packet (an 'end' or 'error' event, in Node terms) before the handshake is done, then that could be caused by a node-http-proxy bug.
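A minimal sketch of what that check could look like in Node (the helper name `watchHandshake`, the `log` parameter, and the crude end-of-headers heuristic are all made up for illustration; you would attach these listeners to the proxy's backend socket):

```javascript
// Hypothetical helper: watch a backend socket and report whether the
// connection ended or errored before the handshake response completed.
function watchHandshake(socket, log = console.log) {
  let handshakeDone = false;
  socket.on('data', (chunk) => {
    // Crude heuristic: a blank line terminates the 101 response headers.
    if (chunk.toString().includes('\r\n\r\n')) handshakeDone = true;
  });
  socket.on('end', () => {
    if (!handshakeDone) log('backend sent FIN before handshake completed');
  });
  socket.on('error', (err) => {
    if (!handshakeDone) log('backend error (RST?) before handshake: ' + err.message);
  });
}
```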

Sorry I should elaborate on the structure of our setup. We are proxying twice (both times through node-http-proxy) with the final backend of SockJS (0.3.1). The foremost proxy splits requests between 3 backend node-http-proxy servers, which in turn only proxy to one final server instance. We have this somewhat odd setup because we have extended the Meteor project to support instanced sub-applications and so have sat our own proxy in front of several Meteor instances, each of which already does one layer of proxying. However, the Meteor proxy instances are a 1:1 mapping between proxy and backend, whereas our subapp proxy is not.

Each client will at any given time be connected (via the proxies) to all 3 backend server instances. We have localized the issue to the foremost proxy instance, where the fan-out occurs. It seems to occur around 30% of the time, which means we would often get 1 socket failing and two working, although plenty of the time we'd get other configurations, like all 3 working or all 3 failing.

The response splitting issue is only my best guess, as I can't think of why else a single socket.write, as opposed to two, would so completely resolve the issue for us. I'm definitely not saying that is the issue, and based on my understanding of TCP it shouldn't be, but it (or something like it) fits the evidence more closely than any other explanation I can think of. I'd certainly be interested to hear other ideas, as I've been trying to nail this down for a few weeks now and this was the solution I came up with.

Also, this code is proxying data on its way back to the client, not going to the back-end, so I think a SockJS bug is fairly unlikely, as SockJS isn't touching the data between when this happens and when the connection breaks.

The only possibilities seem to be a bug in node-http-proxy, a race condition in Node's own handling of write flushes, or a bug in Chrome's and Firefox's WebSocket implementations. Whichever happens to be the culprit, condensing to a single socket.write fixes the problem.