I very frequently get network errors (connection lost, connection reset, etc.) because of the very bad internet where I currently live. But why can't boinc-client handle that while trying to connect? Every time, I either get lots of "exited with zero status but no 'finished' file" or simply "Unrecoverable error"! WTF? Why are computation and network linked so closely together? Shouldn't they be separate processes which don't interfere with each other? Considering how many hours of work I'm losing (today about 12 hours in 4 WUs), I'd rather not run BOINC then...

This has happened on various Windows and Linux OSes, in various BOINC versions from 5.x.x to 7.x.x, and on different computers. I wasn't bothered by it so much before, because I lived in a 1st world country with OK or better internet, so it happened maybe once a month. Where I live now it happens daily, even several times a day if I'm not micro-managing, yet we're told all the time that we're not supposed to micro-manage BOINC (and I don't want to). The way it is now I have to do just that: manually disable computation before allowing network activity, wait for uploads / downloads / server contact to finish, then disable network activity and enable computation. :(

This one though, is a computation error, and has to do with either the application being not so stable, or trouble with your memory, your virtual memory (page file) or a bad batch of tasks (it happens).

This one happens when something outside of BOINC is interfering with the running of BOINC, like an anti-virus, anti-spyware or other anti-malware program actively scanning the BOINC Data directory.

That you see all three at the same time in your log can be due to extra stresses that BOINC brings along on a normal day. Running all kinds of calculations through BOINC stresses a computer out already, but when there are network problems, the computer will go into extra stress. With the network card these days being standard integrated in the motherboard, it's (part of) the CPU that will have to cater for the network connection. And when that CPU is busy doing intricate calculations...

I see that here as well on an otherwise stable system. Throw a slow network transfer in the bunch and my computer struggles. But then when I use a separate PCI 1000Mbit add-on card, the whole system flies, no matter what. And it ain't an old & slow one either. ;-)

So, first checks first:
1. Do you have an anti-virus or other anti-malware program scanning actively in the background?
1a. Is your BOINC Data directory excluded from being scanned?
2. What kind of system is it?
3. The network card, is it integrated into the motherboard, or a separate add-on card?
3a. If integrated, do you have the option to try an add-on card? (No, I didn't say you have to go out and buy one... :-))

Now, there is a problem with BOINC that I reported recently: when BOINC comes out of hibernation or sleep and the network card hasn't reinitialized yet, and BOINC has downloads waiting, it will try to do those before there is an internet connection, which results in corrupted files. However, this is a difficult one to track and reproduce. (Not all projects have download problems ;-))

Jord

To my knowledge the BOINC network access, i.e. the internet connection with the server, is completely separate from the running of the tasks. I don't think I've ever had an internet problem or server access problem affect how my tasks run. You shouldn't need to suspend the running of your tasks while your files upload.

I suggest you look at the possible causes of your task crashes. Every crashed task should generate an stderr file (you need to go to your account on the project and then find your task list to see these files). These stderr files are sometimes rather cryptic and you need to get used to what some of the strange phrases mean. Often you see far more details than what's in the Event Log.

For example, from your messages, the Signal 11 error. Here's what Jorden tells us about it in the BOINC FAQs:

Thanks to both of you, Ageless and mo.v for the quick replies :) I'm almost 100% sure it's not a coincidence, because it happens _every_ time I have network-problems (if I don't suspend computation), but not exclusively then (busy system sounds plausible, like the old "no heartbeat from client"?). Perhaps the asteroids-application can't handle that situation so they fail. Sadly can't investigate further now as I'm preparing for a longer trip, so it'll be some weeks to get back on this. In the meantime I'll micro-manage ;) Still would like to know what log-flags could maybe help investigate this issue...

It is not a coincidence. During the internet problem times, DNS resolution can be affected. If boinc tries connecting to a project server and can't resolve the DNS quickly, it causes the no heartbeat error. Some science applications error out with signal 11 when receiving the no heartbeat. My Linux (Lubuntu 11.10, ver. 7.0.27) has recently errored on 9 Asteroids tasks when DNS resolution was having problems.

Editing the hosts file to include the IP address of projects and running a local DNS cache has helped, but still has not completely eliminated the problem. So it may be more than DNS and possibly just general internet connection problems that can cause the no heartbeat error.

Are the boinc client communications with the internet and the client communications with the science applications linked together?

I agree that the task errors and the internet problems are linked, and I also agree that DNS name resolution on the flaky internet connection is likely to be implicated in that linkage.

My suspicion is that when the BOINC client asks the libcurl sub-component to connect (by name) to a project server, everything is put on hold until, at least, the resolved IP address comes back from DNS. If that involves a wait of more than 30 seconds and a timeout (which, in non-corporate environments, is plausible, because the DNS server is likely to live with your ISP at the other end of the local loop), then the heartbeat mechanism may be stalled and the errors follow.
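To illustrate the suspicion above, here is a minimal Python sketch (not BOINC's actual code, which is C++ and goes through libcurl) of how a blocking name lookup can be pushed into a worker thread, so the caller gives up after a bounded wait and stays free for other work such as heartbeats. The names `resolve_with_timeout` and `slow_resolver` and the hostname `project.example` are made up for illustration.

```python
import concurrent.futures
import socket
import time


def resolve_with_timeout(hostname, timeout_s, resolver=socket.gethostbyname):
    """Run a blocking resolver in a worker thread.

    Returns the resolved address, or None if the lookup fails or does not
    finish within timeout_s seconds. The caller is never blocked longer
    than timeout_s, even if the lookup itself hangs for much longer.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(resolver, hostname)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return None  # lookup still running in the background; stop waiting
    except OSError:
        return None  # socket.gaierror (a subclass of OSError) lands here
    finally:
        # Do not wait for a stuck lookup; let the worker finish on its own.
        pool.shutdown(wait=False)


def slow_resolver(hostname):
    """Hypothetical stand-in for a DNS lookup that stalls on a bad link."""
    time.sleep(0.5)
    return "192.0.2.1"  # address from the documentation-only range


if __name__ == "__main__":
    # Lookup takes 0.5 s but we only wait 0.05 s: the caller gets None.
    print(resolve_with_timeout("project.example", 0.05, resolver=slow_resolver))
    # With a generous timeout the same lookup succeeds.
    print(resolve_with_timeout("project.example", 2.0, resolver=slow_resolver))
```

Whether the client could be restructured this way is a separate question; the sketch only shows that a slow lookup does not have to stall the thread that started it.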

An added complication is that libcurl handles all TCP/IP communications for the client, and - as well as project internet comms - that includes localhost loopback messages between the client and BOINC Manager, and any remote RPC calls that might be issued by a local aggregator like BoincTasks or BoincView.

Comms are tricky things, and failures anywhere can cause delays and problems. I recently lost a host which was listed by name in my remote_hosts.cfg file: I noticed the other machines on my LAN stuttering as they generated the "Can't resolve hostname in remote_hosts.cfg: xxx" message and notice, far more often than I would have thought was necessary.

Edit - communications between the client and the science applications are handled by files written into a shared memory area - a virtual solid-state disk. They should be exempt from the TCP/IP problems.

So basically I have to micro-manage for connection until ticket 113 is fixed (not likely) to avoid errors. Was looking for that one yesterday before posting, but couldn't find it so thought it was fixed...

I realize now that the WUs erroring out is actually the application's fault, not BOINC's directly, but still it bothers me... I only crunch on 3 out of 8 cores on my main system (to make less heat), so there should be (and there is) plenty of room left for other things. Still this happens, not only when the network gets problematic, but also when the (mechanical) hard drive gets busy: try creating a big non-dynamic virtual hard drive for use in VirtualBox, say 100GB, and watch BOINC Manager get unresponsive and WUs error out or report "no finished file" after creation is finished.

The trouble is, all you're doing is to assume that the ISP's DNS is the cause of the problem, and bypassing it.

For that solution to work, the ISP's routers and connectivity (both upstream and downstream) have to be fully present and correct.

If the problem is BOINC's use of synchronous DNS resolving (as the very interesting quote from Nicolas in that trac ticket suggests), then a better solution would be the installation of a local caching DNS server. But then you have some tricky management decisions to make regarding caching: SETI's download server url, for example, deliberately has a TTL of 5 seconds, and according to the rules shouldn't be cached. There may be others - it depends which projects are running.

In Windows, the command

ipconfig /displaydns

is a useful tool for getting an idea of what caching is allowed on the sites you visit regularly - I've just discovered that SETI have slowed down their round-robin DNS with a TTL of at least 50 seconds (ipconfig shows the remaining TTL since the last lookup, not the full value).
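The caching rule mentioned above (an answer may only be reused until its TTL runs out) can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular DNS cache is implemented; the class name `TtlCache`, the hostname and the address are made up for the example.

```python
import time


class TtlCache:
    """Tiny hostname -> address cache that honours a per-entry TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the behaviour can be tested
        self._entries = {}    # hostname -> (address, expiry time)

    def put(self, hostname, address, ttl_s):
        """Store an answer that stays valid for ttl_s seconds."""
        self._entries[hostname] = (address, self._clock() + ttl_s)

    def get(self, hostname):
        """Return the cached address, or None if missing or expired."""
        entry = self._entries.get(hostname)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[hostname]  # stale: force a fresh lookup
            return None
        return address


if __name__ == "__main__":
    cache = TtlCache()
    cache.put("download.example", "192.0.2.10", ttl_s=5)
    print(cache.get("download.example"))  # still inside the 5 second TTL
```

With a TTL as short as 5 seconds, nearly every request ends up as a fresh lookup anyway, which is why a local cache on its own may not help much for such hosts.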

Also Anycast (used by OpenDNS) can have some influence:
"With TCP anycast, there are cases where the receiver selected for any given source may change from time to time as optimal routes change, silently breaking any conversations that may be in progress at the time" (http://en.wikipedia.org/wiki/Anycast)

Here is another example of the no heartbeat error causing tasks to error out with signal 11 on Linux. In this case it was not an internet connection problem, but a misbehaving project (DNA@home) that was holding up the client from communicating with the science applications. First time, 4 Asteroids and 1 WUProp tasks errored, second time Correlizer exited but managed to recover, and the third time 4 WCG HCMD2 and 1 WUProp tasks errored. DNA is now on NNT, so it is not contacting the project anymore.

Sorry for resurrecting this one, but during my current trip to a country with decent internet-connections, I found I can very easily replicate this behavior of boincclient (linux / windows, doesn't matter) locking up when trying to connect to internet:

Connect to a router (dhcp / manual, wire / wireless, ISP-dns / router-dns / manual-dns, doesn't matter) so your computer registers it's connected to a network. Disconnect the cable from the router to "the internet" ;) and the next time boincclient wants to connect, it starts locking up. You can't use boincmanager to suspend network activity (I didn't try with boinccmd this time, but I seem to remember that didn't work either).

To get response again from boincclient, first stop boinc-service (I don't have any non-service boinc installation, so didn't check). Then edit "client_state.xml". Locate "<user_network_request>2</user_network_request>" near the end and change the value to "3" and save. Restart boinc-service and all is well again.
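The manual edit above could be scripted. A rough Python sketch, assuming the element appears in the file exactly as quoted above and that the BOINC service has already been stopped; the helper names `set_network_request` and `patch_state_file` are made up, and the path to "client_state.xml" depends on where your BOINC Data directory lives.

```python
import re
import shutil
from pathlib import Path

# Matches the element quoted above, whatever number it currently holds.
TAG_RE = re.compile(
    r"(<user_network_request>)\s*\d+\s*(</user_network_request>)"
)


def set_network_request(xml_text, new_value):
    """Return xml_text with the user_network_request value replaced."""
    updated, count = TAG_RE.subn(
        lambda m: f"{m.group(1)}{new_value}{m.group(2)}", xml_text
    )
    if count == 0:
        raise ValueError("no <user_network_request> element found")
    return updated


def patch_state_file(path, new_value=3):
    """Back up the state file next to itself, then rewrite the value."""
    path = Path(path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_text(set_network_request(path.read_text(), new_value))


if __name__ == "__main__":
    sample = "<user_network_request>2</user_network_request>"
    print(set_network_request(sample, 3))
```

Only run something like `patch_state_file` while the client is stopped, otherwise the client may overwrite the file again when it next saves its state.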

My main complaint is that boincclient completely locks up while waiting for connection. Since it obviously has no problem handling multiple processes, why does network-connection seemingly change it into a single-process-only program?