Here's another catch-up tech news report. No big news, but more of the usual.

Last week we got beyond the annoying limits with the Astropulse database. There's still stuff to do "behind the scenes" but we are at least able to insert signals, and thus the assimilators are working again.

The upload server (bruno) keeps locking up. This is load related - it happens more often when we are maxed out, and of course we're pretty much maxed out all the time these days. We're thinking this may actually be a bad CPU, so we'll swap it out and see if the problem goes away. Until then... we randomly lose the ability to accept uploads, and human intervention (power cycling the machine locally or remotely) is required.

We've been moving back-end processes around. I mentioned before how we moved the assimilators to synergy since vader seemed overloaded. This was helpful. However, one thing we forgot about is that the assimilators have a memory leak. This has been an issue forever - like, since we were compiling/running this on Sun/Solaris systems - yet it has been completely impossible to find and fix. But an easy band-aid is a cron job that restarts the assimilators every so often to clear the pipes. Well, oops, we didn't have that cron job on synergy, and the system wedged over the weekend. That cron job is now in place. But still... not sure why it's so easy for user processes to lock up a whole system to the point where you can't even get a root prompt. There should always be enough resources to get a root prompt.
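For the curious, that kind of band-aid is typically just a crontab entry. A minimal sketch - the interval, path, and script name here are assumptions for illustration, not our actual setup:

```shell
# Hypothetical crontab entry: every 6 hours, bounce the assimilators
# so their leaked memory is returned to the OS before the box wedges.
0 */6 * * *  /usr/local/bin/restart_assimilators.sh
```

The restart script itself would just stop the assimilator daemons, wait for them to exit cleanly, and start fresh copies.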

The mysql replica continued to fall behind, so the easiest thing to try next was upgrading mysql from 5.1.x to 5.5 (which supposedly employs better parallelization, and therefore better i/o in times of stress). However, Fedora Core 15 is the first version of Fedora to have mysql 5.5 in its rpm repositories. So I upgraded jocelyn to FC15... only to find that, for some reason, this version of Fedora cannot load the firmware/drivers for the old QLogic fibre channel card, and therefore can't see the data drives. I've been beating my head against this problem for days now to no avail. We could downgrade, but then we can't use mysql 5.5. I guess we could install mysql 5.5 ourselves instead of yumming it in, but that's given us major headaches in the past. This should all just work like it did in earlier versions of Fedora. Jeez.
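For anyone following along: "falling behind" is visible in the Seconds_Behind_Master field of SHOW SLAVE STATUS. A tiny sketch of pulling that number out for monitoring - here a canned sample line stands in for the live query, which on the real replica would come from `mysql -e 'SHOW SLAVE STATUS\G'`:

```shell
#!/bin/sh
# Parse replica lag out of SHOW SLAVE STATUS \G style output.
# 'sample' is a canned stand-in for the real mysql output.
sample='Seconds_Behind_Master: 8640'
lag=$(printf '%s\n' "$sample" | awk -F': ' '/Seconds_Behind_Master/ {print $2}')
echo "replica lag: ${lag}s"
```

A cron job comparing that value against a threshold is a common way to get paged before the replica is hours behind.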

Thanks for the kind words in the previous thread. Don't worry - I won't let it get to my head :).

- Matt-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude


If you end up needing a new CPU for Bruno, let me/us know what type and we'll get a replacement sent to you ASAP.

You should be able to get a root prompt, but you might not be able to launch (page in) ssh/login/bash to get any prompt at all. Not sure how you're configured, but you might need to leave a terminal logged in and set to above-normal priority. Obviously that's a security risk, so it needs to be behind a physically locked door.
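One way to pre-arm that: keep a root shell resident on the secured console and raise its scheduling priority so it stays responsive under load. The priority value here is illustrative:

```shell
# From the physically secured console, logged in as root:
renice -n -15 -p $$   # raise this shell's priority above normal user processes
# leave the shell logged in; it's already paged in, so it can still be
# scheduled when runaway user processes have the box on its knees
```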

As for that leak: not sure what debugging tools you have, but unless it's one of the leaks designed into POSIX, you should be able to find and quash it. Perhaps a little personal development time reading up on the different available tools might turn up a new path to try. Worst case, you'll find the right tool but it isn't available for Fedora, e.g. Malloc Debug: http://www.manpagez.com/man/3/malloc/

I've spent more than my fair share of time with malloc debug and various equivalents. It can work, but for non-trivial cases it can take, like, forever (I've spent weeks on this sort of problem).

Hopefully there are some decent memory profilers for *nix. If so, can you dummy up a bucketload of either simulated or actual data and throw it at a testbed assimilator? Memory profiling should at least help you with where to look, if it doesn't give you the smoking gun.
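Valgrind's Memcheck is one such *nix tool. A hedged sketch of that testbed run - the binary name and its arguments are made-up stand-ins, not the real assimilator:

```shell
# Run a testbed assimilator over canned workunits under Memcheck.
# 'assimilator_test' and '--input' are hypothetical placeholders.
valgrind --leak-check=full --show-reachable=yes \
    ./assimilator_test --input /tmp/canned_wus 2> leak.log

# If the binary was compiled with -g, the "definitely lost" records
# in leak.log carry stack traces back to the allocation sites.
grep -A4 'definitely lost' leak.log
```

The catch, as noted below, is that Memcheck slows the process down a lot, so an infrequent leak may need a long soak run.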

That said, as it's all DB backed, unclosed queries/result sets would be a place to start.

Stats site - http://www.teamocuk.co.uk - still alive and (just about) kicking.

As for MySQL, I've rolled my own (I use Gentoo, ergo I have no choice), and have had no troubles (but then I don't use InnoDB, replicas or countless other features you do). The trick, as ever, is finding a 'good' version and then figuring out what arcane combination of configure options pushes the right buttons to make it have all the features you want, in the right order. Not sure if Fedora have 'volunteers' as such, but if they do, maybe one of them could help.
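Worth noting for anyone attempting this: MySQL 5.5 dropped the autoconf build in favour of cmake, so the from-source route now looks roughly like this. The prefix and option list are purely illustrative, not a recommended configuration:

```shell
# Illustrative MySQL 5.5 source build; run from the unpacked source tree.
cmake . -DCMAKE_INSTALL_PREFIX=/usr/local/mysql-5.5 \
        -DWITH_INNOBASE_STORAGE_ENGINE=1
make
make install
```

Installing under its own prefix (rather than over the distro's packages) keeps yum from fighting you later.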

Thanks for the update Matt. Hope things start getting better for everyone there in the lab. Thanks for your hard work and dedication to the project. Good luck with the music.... Play a few songs for me. ;)

I assume that when Matt is silent either things are going according to plan, so there is nothing really to report, or things are going so badly that he hasn't got time to report. I hope that in the next few weeks it is the former that dominates, and that his plans to divert more time to his music are well fulfilled, without too much interruption from the lab.

(off topic - Matt, what's the guitar in your sig, and how's your young feline apprentice doing? Looks as if it could be a mean picker.....)

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

If the driver is still part of the Linux kernel source code, you could just compile a custom kernel as part of your Fedora installation. Copy the /boot/config-2.6.xx file to /usr/src/kernels/2.6.xx/.config before running make menuconfig.
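The custom-kernel route might look roughly like this - the version numbers and the QLogic config symbol are examples to verify against the actual tree:

```shell
# Build a custom kernel starting from Fedora's shipped config.
cd /usr/src/kernels/2.6.xx
cp /boot/config-2.6.xx .config
make oldconfig      # carry the old config forward, answering prompts for new options
make menuconfig     # confirm the QLogic FC driver (e.g. CONFIG_SCSI_QLA_FC) is enabled
make && make modules_install && make install
```

The win is that the driver gets built against a config known to work with the card; the cost is re-doing this on every kernel update.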

If the event causing the leak happens infrequently, finding it in the mounds and mounds of output can take forever. Hence finding a tool that suppresses output while all memory is still reachable makes the task at least feasible. There will be a delay in the output, and going backwards from it to find the issue is another matter. If the culprit is library calls that leak - there are some - then the problem may be intractable. If enabling debugging makes the process too slow, that's another issue. But if you can find out what the leaking block is used for, you can design in debugging to catch where it goes missing, if a read-through doesn't tell you.

Something I ran across recently that is probably right on target for you guys (and a lot of others): if you want the stability of RHEL/CentOS AND the latest key software packages (like MySQL 5.5), I don't know of a better way to go. It's sponsored by RackSpace, which obviously knows something about this sort of thing and has a vested interest in making it all work.