The Lattice Project - How did we miss this one?!?

The current work units are much better behaved than the last time I played with this project. Each one consistently takes 925 MB of RAM, so a quad core with 4 GB of RAM will run four of them if nothing else is going on. Each work unit takes about 4 to 6 hours and is otherwise fairly well behaved.

The problem I'm having is that even running the Lattice WUs one at a time, they noticeably slow down the POEM GPU app. I haven't seen this level of slowdown with Yoyo, Docking, QMC, LHC, or Correlizer. The POEM GPU apps are CPU hogs, though, for sure.

We might squeak past LAF, but it would be tough to hold. There is nobody within a million behind us to worry about. Whizbang is certainly doing a bang-up job on Lattice right now. I just snuck past Beyond, but I don't think he noticed. I am going to keep a machine or two on here until SIMAP in a few days.

Something weird is up with the credit on the latest batch. 15k seconds on an i7 used to give around 90 credits (about 200 cobbles/hour/core). It now gives around 30. There is some moaning going on on the official forum. It affects everyone equally but makes it a bit harder to play catchup.

I wonder if Lattice has updated their server software and picked up the infamous "CreditNew" system that DA put into all BOINC server software sometime in early 2011. If you want to use an old credit system, you now need to roll your own server code, as support for all the older credit-granting mechanisms has been pulled from the newer software. I know CreditNew is used at SETI@home, since that project is DA's official beta test site for all BOINC server software, and its users complained hardcore last year when it was first rolled out. DA has basically told all BOINC projects on the new credit system to lump it: either use a fixed credit per WU, use his system, or roll your own from scratch, since he pulled all the hooks out of the generic validator code in the newer server-side software. Another name CreditNew got on SETI was the "random magic credit or no credit" system. If the amount of credit per WU varies too much from one WU to the next, the system goes unstable for days at a time. The new credit system is interesting to watch, as it does a bit of averaging across CPU type and model number and a whole lot of other things. I got lost and confused when someone tried to explain it to me.

EDIT: When the new system rolled out at SETI, the amount of credit per WU fell at the same time. All bug fixes and patches for the BOINC server software now only land on versions with CreditNew in them, so if a project updates the software for a bug fix or security patch, it has no choice but to also take CreditNew.

EDIT 2: Some projects have had CreditNew go so unstable that at times one WU would get 10 credits and the next sequentially generated WU from the same data set would get 1,000. Some projects have defaulted to a fixed credit per WU while they have a vigorous discussion with David Anderson (DA), the PI and chief programmer for the BOINC server and client software.
DA does not want to redo the CreditNew system from scratch and does not want to change his baby, so they are doing a lot of code tracing to find bugs and quash them. DA wants CreditNew to keep projects from overpaying in credits to attract users away from projects that pay "correct" credit for work done.

Just had a post discussion with Adam over on Lattice. He said the project will be in a downtime, as the last big push was to finalize data for publication. He said all the PIs are off writing right now, and that any new work will be small batches with short deadlines until the writing is done. Edit: spelling; iPad autocorrect for the win.

Another batch going out now. Unfortunately, I have two long-running WUs from PrimeGrid and CPDN, so none for me due to the BOINC v7 cache rules. I suspended one WU, and the server is now telling me it has no work for OS X Intel, as the app does not exist.

Edit: Looking at posts on the latest batch, they are having difficulties with the WUs: long runtimes (30-180 hours), a large memory footprint (2-6 GB), and a long wait (2-4 hours) until the first checkpoint is reached. Looks like this data set is thorny, exposing supposedly fixed flaws in the GARLI science application. I have not seen the checkpointing flaw since we went to the new GARLI 2.0 algorithm a few versions ago.

Edit 2: Looking at the Lattice forums some more, the latest WUs are Win64 only, since they need more than 4 GB of free memory to run. New Linux/OS X 64-bit builds of the science apps are coming real soon now (as soon as Adam has some free dev time).

Adam just posted on a new bug; he has a patch deployed, but some older WUs may still throw an error and fail. http://boinc.umiacs.umd.edu/forum_thread.php?id=392&nowrap=true#5222 Looks like an output file is written to more often with the new super-large WUs, and the maximum size bound on that output file is being reached. This bound is in place to prevent runaway WUs and has never been triggered before this; having a WU resume several times from a checkpoint pushes it over the edge. If you keep WUs in memory when not running, you can finish them even if you are being bitten by this bug. Adam said the patch is in for any new work.

I don't know if Adam pulled the plug on the WUs with the output file bound, as they will most likely fail around the halfway point, depending on how often they are written out of memory. Adam did say he was looking into fixes, but I have seen nothing posted on the forums, so I have no specific info. I'm also sick as a dog, so I will not be posting much for the next few days. I recommend asking Adam over on the Lattice forums during the workweek; he usually responds within a few hours during working hours here on the east coast of the USA. From talking to Adam, something else has been eating his time in a big way for the last week. He said he wants to do 64-bit science apps for OS X and Linux but does not have the time right now.

I got 10 GARLI WUs this morning, and a word of caution: they're 1.5 GB of RAM, or more, each.

Yeah, from what I see, the PI for the GARLI project has started a whole new phylogenetic tree for Lepidoptera (moths and butterflies), with a lot of genes being used to build the tree. This makes for a very large database in the WUs at the start of the project. The last project is out for publication, and this is the start of the new one.

Also, Adam said the last batch had a few runtime errors that he is trying to pull back with cancelled-by-server commands. They have a misconfiguration that results in a termination several hours into the run. Adam implied this was an automatic data-formatting error from when the WUs were auto-generated from the raw data files; I think it is a WU size issue. I am getting the vibe that the PI may be pushing the limits of the new GARLI 2.0 algorithm. The algorithm was updated to handle larger data sets, but the new trees seem to be pushing the limits of the new science app.

More work. I'm full with a PrimeGrid challenge WU, so I can't check the requirements to run. Most likely it will be another jumbo set.

Edit: I have a PM in with Adam for updates on the status of 64-bit science apps for Linux/OS X, along with an update on several other things that have been under development but seem to have stalled in the last year.

Ok, just heard back from Adam. He had some nice updates in his PM, so I'm going to post it here. Hope you all enjoy it.

Quote:

Adam:
This last round has a memory requirement < 2 GB, so it does not require a 64-bit app.

Currently there has been no progress on a GPU version of GARLI, but the possibility for the future is not completely ruled out.

Some of this work is for a project we're calling "LepTree II" - which we are currently pursuing funding for. It is an extension of the successful LepTree project (http://www.leptree.net), which will use next generation transcriptome sequencing.

Quote:

Bernd:
Just updating the Ars Technica team as to what is working on Lattice. We have several folks who crunch on Linux and OS X, and the last several rounds of work have kept them on the sidelines due to the lack of 64-bit apps. Also, does this last round of work require a 64-bit science application? The second question is more open ended: are you still working on a GPU science application? What is the status of that work? Any new projects besides GARLI on the horizon? Anything you want to comment on?

Watch out for the gnarly GARLIs. I had an i7 with 16 GB RAM thrashing when I came home tonight with eight of these monsters. I aborted two and things got out of swap.

I had the same thing happen - they were using 2.7 GB each. I just suspended a few of them to calm it down, and re-enabled them when the others finished. Wish there was a way to limit how many run simultaneously...

Just set a limit on the total amount of memory to be used by BOINC. It is in the website preferences, to control it at the account level, or you can set it at the individual computer level. I believe the setting is "Use at most xxx% of memory when the computer is in use (idle)". It is not super precise, but at least you can keep your machine from thrashing in swap. I haven't used the function since the move to 7.x.x, but it worked in 6.x.x.
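For the per-computer version of that setting, the client also reads a local override file, global_prefs_override.xml, from the BOINC data directory. A minimal sketch (tag names as I believe the BOINC client configuration docs list them; the 50/90 values are just examples, not a recommendation):

```xml
<!-- global_prefs_override.xml: place in the BOINC data directory.
     Overrides the website preferences for this machine only. -->
<global_preferences>
    <!-- use at most 50% of RAM while the computer is in use -->
    <ram_max_used_busy_pct>50</ram_max_used_busy_pct>
    <!-- use at most 90% of RAM while the computer is idle -->
    <ram_max_used_idle_pct>90</ram_max_used_idle_pct>
</global_preferences>
```

The client picks it up via Advanced → Read local prefs file in the Manager, or `boinccmd --read_global_prefs_override`.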

There is, with newer versions of BOINC: use an app_config.xml file in your project directory. Here's one I use for Yoyo on low-memory machines:
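(The file itself seems to have gotten lost in the post. For reference, here is a minimal example in the standard app_config.xml format that 7.x clients understand; the app name is hypothetical and has to match the `<name>` field for that app in client_state.xml.)

```xml
<!-- app_config.xml: goes in the project's subdirectory under the BOINC
     data directory, e.g. projects/<project_url>/ -->
<app_config>
    <app>
        <!-- hypothetical app name; use the short name from client_state.xml -->
        <name>ecm</name>
        <!-- run at most 2 instances of this app at once -->
        <max_concurrent>2</max_concurrent>
    </app>
</app_config>
```

The client reads it at startup, or on Advanced → Read config files in the Manager.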

Sorry folks, been away for a bit. I am down on crunching for now; finances. I have only intermittent net access and have put all expenses in the deep freeze. Hope all is well with everyone. Hope someone else can watch over this project, as I happen to like this one. Hope to be back on my feet soonish. Bernd

Your posts are appreciated. Sorry to hear about that - good luck (may the situation be short-lived)!

I'm sorta back, folks. Looks like I have semi-reliable net access again. I will still need to watch my bills, so I will be crunching very little for a few more weeks, but it is starting to get better. I looked at the Lattice website earlier this week; they said they just updated all their back-end software, and at the time I looked they had work again.

Good to be back and to see all the familiar names still here. Best wishes for the holidays, folks, and I hope to see more of you in the new year. Bernd

Oops. Made the mistake of running this on a bunch of machines blindly assuming the issues of yore were bygone and had been addressed. I was naive, unjustifiably hopeful and wrong. I have probably 100 WUs that have monopolized time on over 10 machines for a week and they all appear to be bad WUs that would continue crunching indefinitely were I not to abort them (http://boinc.umiacs.umd.edu/forum_thread.php?id=751&sort_style=&start=20 is one among many threads dealing with this problem).

Talk about thread necromancy. I had to go back quite a while to find this sucka. I believe the DCA is exempt from necromancy rules ... or at least I hope it still is.

I built a machine last year to play with VM projects, atlas@home and others. Many of my current builds have 1 GB RAM per core; a few have two. Atlas and friends really wanted four. Since getting 32 GB for my i7s was kinda expensive, I found myself a Dell with an i5 on fleabay pretty cheap, stuck 16 GB RAM in it, and played with atlas@home and a few other projects. Then, when I was done, I dropped it on Lattice.

Lattice still needs hand-holding and daily monitoring. It is easy to get all of your cores locked onto never-ending work. A good third of the work units fail to download, putting your client into a 24-hour backoff which you need to manually update. You need to go through your list of ready-to-start or running tasks and abort anything that starts with 1974. Too bad; that was a good year. Oh, and most work finishes in seconds, while some takes hours. Combined with the download errors, it is very easy to end up with an idle machine.
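That daily "abort everything starting with 1974" chore could be scripted with boinccmd, assuming the client allows RPC from the local host; the project URL and name prefix below are taken from this thread, and the script itself is just my sketch:

```shell
#!/bin/sh
# Abort all Lattice tasks whose names start with "1974" via boinccmd.
URL="http://boinc.umiacs.umd.edu/"   # Lattice project URL

# boinccmd --get_tasks prints one "   name: <wu_name>" line per task;
# keep only the names that start with the bad 1974 prefix.
list_bad_tasks() {
    awk '/^[[:space:]]*name: 1974/ {print $2}'
}

if command -v boinccmd >/dev/null 2>&1; then
    boinccmd --get_tasks | list_bad_tasks | while read -r task; do
        boinccmd --task "$URL" "$task" abort
    done
else
    # No client here: demonstrate the filter on sample --get_tasks output.
    printf '   name: 1974_stuck_wu\n   name: healthy_wu\n' | list_bad_tasks
fi
```

Dropping that into cron would at least keep the cores from sitting on never-ending work between manual check-ins.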

Most work takes a few hundred KB of RAM. Some takes 1 GB. Rarely do you see a 4 GB one; occasionally you see a 6 GB one. Life sucks when you get four big ones at a time on a 16 GB machine; syslog says it has happened a few times this year.

Imagine my surprise when I looked today and noticed that my crappy little i5 was the #2 machine in terms of RAC on the project!

Well, maybe not so surprising when you consider how the project has been run for so many years.

The never-ending WUs tend to get stuck at (IIRC) exactly 11.750%, which makes them easy to spot if you know what to look for. Congrats on the #2 machine!