How common are data corruption/errors in TCP? I mean errors that slip by the correction facilities.

I had my game server running for 30 minutes, during which 2 clients sent about 150 KB of data to the server. After about 30 minutes, my server crashed because it read an illegal value (my server does not yet deal well with illegal client data). The server ran on the same computer as the clients, and although it is possible that I made a mistake, that seems like a poor explanation, because for those 30 minutes basically the same messages were sent over and over. It has happened 3 times in the last 2 days. So it's rare, too rare for me to think it is a coding mistake, yet common enough to make me wonder what it could be.

I was under the impression that TCP is very reliable, especially when just sending to 127.0.0.1.

You say you don't handle illegal client data. Do you know if anything else could possibly be connecting to the server and sending data you're not expecting? What port are you using?

Good point! Maybe that's what's happening. I'll look into this. I was using port 2345 or 1300, something like that, just typing 4-digit numbers. Although I'm behind a firewall... where do the standard ports that might be open end?

Think of it this way.

Unaugmented TCP is basically what is used to download .EXE files and such from the web.

When was the last time you downloaded something and discovered it was corrupted? When was the last time you found a spelling mistake on a web page that magically went away when you refreshed, because it was actually a TCP transmission error?

Do you have any ideas of what might be wrong?

TCP data basically doesn't get corrupted. Each segment is protected by a checksum, and the link layer underneath generally adds a CRC as well (depending on the specifics). Consider: when verifying md5sums of downloaded files, I've never had a download fail, even for a 4 GB ISO.
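
As an aside, the kind of application-level check mentioned above (an md5sum over the whole payload) takes only a few lines; here's a minimal Java sketch using nothing beyond the standard library (class and method names are mine):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Demo {
    // Hex-encode the MD5 digest of a byte array -- the same check md5sum does.
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Well-known reference value for the string "hello":
        System.out.println(md5Hex("hello".getBytes("UTF-8")));
        // prints 5d41402abc4b2a76b9719d911017c592
    }
}
```

If the digest of what the server received ever differed from the digest of what the client sent, that would be hard evidence of corruption somewhere below the application.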

My guess is that your client or server code is wrong, or that someone else connected to your server and sent it garbage. For example, a memory corruption bug could cause random data to get corrupted.

Hm, just tried again; the server ran for about 20 minutes before the same problem occurred. I keep track of all the connected clients, and no new one ever connects; the message is from one of my own clients. But I'm using asserts in the client to make sure it never sends illegal data. Moreover, I send the exact same message about 30,000 times or so before the problem occurs; I just connected the client and let it sit there sending and receiving updates. This is starting to annoy me. I find it totally believable that I made a mistake, I just find it strange that it happens all of a sudden after the exact same thing has worked ten thousand times before.

Any ideas?

Some code (in Clojure, but using Java socket channels; if you know Java and are curious you can probably figure out what I'm doing). The error occurs in this first file when I'm reading from the channel, in this piece (I get a negative value out of getShort on the buffer):
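
To make the symptom concrete (this is a standalone Java illustration, not the actual code): getShort on a ByteBuffer returns a negative number whenever the two bytes it consumes happen to have the high bit set, which is exactly what you see if the read position ever drifts off a message boundary in a length-prefixed stream.

```java
import java.nio.ByteBuffer;

public class ShortReadDemo {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(4);
        buf.putShort((short) 300);   // length prefix: bytes 0x01 0x2C
        buf.put((byte) 0xAB);        // first payload byte
        buf.put((byte) 0x01);        // second payload byte
        buf.flip();

        System.out.println(buf.getShort());  // 300: correctly aligned read

        // If the read position drifts past the prefix, the same payload
        // bytes get reinterpreted as a (negative) length:
        buf.position(2);
        System.out.println(buf.getShort());  // -21759: 0xAB01 as signed short
    }
}
```

The bytes on the wire can be perfectly intact and the value still come out negative; only the offset is wrong.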

I believe Google published an article on errors in their data centers. Data corruption was on the order of 1 per petabyte (or maybe per that many messages) or so. It's absurdly low, but it's there, and only application-level checksums detected it.

It's also highly unlikely that's what happened here; there are way too many other, more likely factors to consider.

However, nio used to have some obscure bugs and unexpected or undefined behavior, some of it even conflicting with the official documentation. My bet in this case would be that the errors are caused by those. They were never fixed, even in the standard library, and most ended up as WONTFIX. They also depend on the individual platform's network stack, so they are not universally reproducible. The problems are caused by inconsistent implementations of the Berkeley socket API and an incorrect interpretation of something or other; I forget the details.

Ideally, you would go rummaging through Sun's bug tracker for nio-related issues and try to ensure they don't occur in the Clojure bindings, but that's a rather tedious task. The Mina project (IIRC) built a framework on top of nio which might contain some more insight into this.

I also seem to remember there being a race condition or synchronization bug in the nio selector.

Besides that, there could be a lot wrong in the Clojure bindings. The selectors work fine most of the time, but I do recall some obscure edge cases that simply aren't handled. I think even most Java examples are broken in this way.

Something to consider: the maximum value of a signed short is 32767. If you use a message sequence number or similar and store it as a short, it will wrap to a negative value once the counter passes 32767, and you would start seeing errors at exactly that point.
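
The wraparound is silent in Java; no exception, the counter just goes negative:

```java
public class ShortWrapDemo {
    public static void main(String[] args) {
        short seq = Short.MAX_VALUE;   // 32767, the largest signed short
        seq++;                         // overflow wraps silently in Java
        System.out.println(seq);       // -32768
    }
}
```

That would fit a crash that appears only after tens of thousands of identical messages.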

Sounds like you have a reproducible case, though. That's good! Print out the number of messages received every 100 messages or so, and check what the count is after the crash. Then set a breakpoint at that message count and re-run the case, so you can debug when the crash happens. Or, if you have access to VMware Workstation, try using Replay Debugging.

You can also log all the data to a big file as it is received, for later analysis. You may then be able to pipe that file back into the server to replay the traffic the clients already sent, reproducing the crash faster so you can debug it.
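
A tap like that is cheap to bolt on. A hypothetical sketch (class and method names are mine): write each buffer's remaining bytes to an append-only file right after the socket read, using mark/reset so the normal parsing path is undisturbed.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TrafficLog {
    final FileChannel log;

    TrafficLog(Path file) throws IOException {
        log = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Tap a buffer just after a socket read: append its remaining bytes to
    // the log, then restore the position so parsing still sees all of them.
    void tap(ByteBuffer buf) throws IOException {
        buf.mark();
        log.write(buf);
        buf.reset();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("received", ".bin");
        TrafficLog t = new TrafficLog(file);
        ByteBuffer b = ByteBuffer.wrap(new byte[] {1, 2, 3});
        t.tap(b);
        t.log.close();
        System.out.println(b.remaining());                    // 3: buffer untouched
        System.out.println(Files.readAllBytes(file).length);  // 3: bytes logged
    }
}
```

Replaying the file later is then just feeding its bytes back through the same parsing code.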

I'm not that familiar with Clojure (it's been 20 years since I did Scheme :-) so I didn't take the time to read through all your code, sorry. Maybe someone else on the board?

The number is the length of the message, not a sequence number of some kind. If it were off I would notice right away, because I use a ByteArrayInputStream to read the messages back, and it would not work unless I gave it a byte array containing a whole object (the byte array is the next <length> bytes of the socket channel). When I said it was the exact same messages over and over, I didn't lie.
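
One thing worth ruling out with length-prefixed framing: TCP delivers a byte stream, not messages, so a single channel read can return a partial message, and a getShort issued at that moment may consume bytes that straddle a frame boundary. A defensive reader accumulates bytes and parses only once a whole frame is present. A hypothetical sketch in plain Java (not based on the actual code):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical defensive reader for a length-prefixed stream: bytes are
// accumulated as they arrive, and a message is returned only once all of
// its <length> bytes are present in the buffer.
public class FrameReader {
    private final ByteBuffer acc = ByteBuffer.allocate(64 * 1024); // write mode

    void feed(byte[] chunk) {          // called with whatever a read returned
        acc.put(chunk);
    }

    byte[] poll() {                    // next complete payload, or null
        acc.flip();                    // switch to read mode
        try {
            if (acc.remaining() < 2) return null;
            int len = acc.getShort(acc.position()) & 0xFFFF; // peek, unsigned
            if (acc.remaining() < 2 + len) return null;      // frame incomplete
            acc.getShort();            // consume the length prefix
            byte[] payload = new byte[len];
            acc.get(payload);
            return payload;
        } finally {
            acc.compact();             // back to write mode, keep leftovers
        }
    }

    public static void main(String[] args) {
        FrameReader r = new FrameReader();
        r.feed(new byte[] {0});                         // half a length prefix
        System.out.println(r.poll());                   // null: not enough yet
        r.feed(new byte[] {3, (byte) 'a', (byte) 'b', (byte) 'c'});
        System.out.println(Arrays.toString(r.poll()));  // [97, 98, 99]
    }
}
```

A parser that assumes every read returns a whole message works almost all the time on localhost, which could explain a failure that only shows up once every few hundred thousand messages.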

Hm, now the server has run for an hour or so without hitting the problem, and the client has sent over 200,000 messages to it. Not so reproducible after all...

But maybe it's a good idea for me to use a framework such as MINA anyway. Do any of you have experience with it? Any tips?

By the way, Antheus, I have not been able to find any info on bugs in nio. It sounds strange that Sun (or Oracle?) would just not fix bugs in it. And MINA is built on top of it, so it must be manageable somehow? You are making me paranoid.