I have released a new version of Nbody. This version corrects some bugs in the algorithm. We also optimized the initialization code, which now runs much faster. We will be looking to change the time estimate to fully account for this initialization in the future. As always, thank you for your patience.

Oh, I was wondering why it was only using one thread out of the 15 (of my 16) I have allocated for CPU tasks. Restarting BOINC and setting it to use all cores and threads had no effect in getting the app to use everything.

Don't know if the "stalling at 100%" issue is still in effect like it was with 1.46, but there's a whole new can of worms opened with this one.

App is obviously v1.48, workunit that it's chewing through is ps_nbody_2_13_orphan_sim_1_1422013803_462792_0, BOINC client is 7.4.36 x64.

If there's any more information you need, let me know and when I wake up you'll get it to the best of my abilities.

Quick edit: About 8 minutes into running single-threaded, the program finally realized that there's more than one thread available, and started running on another 10 threads. Still not fully utilizing what's been made available, but it's better than nothing. I'll let it chew through units throughout the night and see how it goes.

This was expected. The initialization routine takes longer than before, since we now do it by a more rigorous and scientifically accurate method. However, we have not yet multi-threaded it. We attempted to previously, but it made the program non-deterministic. This is something we will work on.

We had multithreaded the assignment of radii and velocities to bodies. Both were done through rejection sampling, using random numbers. However, when that code ran with multiple threads, the assigned radii and velocities differed between runs, even with the same random number seed and parameters. This was because the order in which the threads ran was indeterminate, meaning which body was assigned what radius and velocity was unpredictable. This was a very nasty bug, made nastier because it did not present itself: runs would complete normally. However, because of the non-deterministic nature of the algorithm, a poorer likelihood was reported than would be expected for a given set of parameters, even if they were close. Overall, it led to poor convergence.
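One common way to make this kind of initialisation order-independent is to derive a separate RNG stream for each body from the global seed and the body index, so the result is the same no matter which thread handles which body. This is a minimal sketch of that idea, not the actual Nbody code; the density used in the rejection sampler is a toy choice.

```python
# Sketch: order-independent random initialisation via per-body RNG streams.
# Assumption: function names and the r^2 density are illustrative only.
import random

def sample_radius(rng):
    # Toy rejection sampling: accept r in [0, 1) with density ~ r^2.
    while True:
        r = rng.random()
        if rng.random() < r * r:
            return r

def init_bodies(n_bodies, seed):
    radii = [None] * n_bodies
    for i in range(n_bodies):
        # Each body gets its own stream keyed on (seed, index), so the
        # outcome does not depend on which thread processes which body.
        rng = random.Random(f"{seed}:{i}")
        radii[i] = sample_radius(rng)
    return radii
```

With this structure, processing bodies in any order (or on any number of threads) reproduces the same radii for the same seed, which is exactly the property the shared-RNG version lost.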

Therefore, for now, we removed all multithreading from the initialization of the dwarf galaxy. This is why it runs on only one thread until the initialization completes. However, after profiling the code, we found that the majority of the run time of the previous version was spent in a single function (thanks to Roland Judd for catching that!). We optimized this function, greatly reducing the run time.

Thank you very much for the explanation - it helps more than you know. I was really trying to determine if it was the cause of the problems I've been noticing.

For 1.46 nbody mt tasks, I noticed on multiple Windows PCs that they would get to 100%, and then continue to crunch in the "Running" state well beyond 100%, single-threaded, leaving the PC's other CPUs idle. Also, that same task's "Elapsed" value would be over 24 hours, yet "CPU time at last checkpoint" would be blank, indicating that it never checkpointed at all, during those 24+ hours.

Do you know if that behavior (going single-threaded, going past 100%, going without checkpoint)... is expected? And would that task ever complete? And is any of it a possible bug? And might any of it be fixed with the 1.48 version?

I was just a bit shocked to see idle resources on my PCs, all due to an nbody task that wasn't checkpointing, and was in fact restarting entirely every time BOINC was restarted.

We had multithreaded the assignment of radii and velocities to bodies. Both were done through rejection sampling, using random numbers. However, when that code ran with multiple threads, the assigned radii and velocities differed between runs, even with the same random number seed and parameters. This was because the order in which the threads ran was indeterminate, meaning which body was assigned what radius and velocity was unpredictable. This was a very nasty bug, made nastier because it did not present itself: runs would complete normally. However, because of the non-deterministic nature of the algorithm, a poorer likelihood was reported than would be expected for a given set of parameters, even if they were close. Overall, it led to poor convergence.

Therefore, for now, we removed all multithreading from the initialization of the dwarf galaxy. This is why it runs on only one thread until the initialization completes. However, after profiling the code, we found that the majority of the run time of the previous version was spent in a single function (thanks to Roland Judd for catching that!). We optimized this function, greatly reducing the run time.

Cheers,
Sidd

So, as I understand it: on startup, 1.46 was multi-threading the computation of some starting parameters for the dwarf galaxy's constituent stars. However, the way the code worked meant that the starting parameters would differ between runs, even with the same settings, because the order in which the threads ran was unpredictable. This in turn resulted in poor overall results even though the run itself completed.

Right/wrong?

And is there anything you can do to help lower the amount of time resources spend idle? On my machine the initialization stage takes a good 5-10 minutes, during which there is only a single thread in use, with the other 14 allocated threads sitting there twiddling their thumbs and unable to do anything else. When it does go to multi-threaded mode, it only uses about 10 threads, leaving 5 free to sit there doing nothing. That's about a third of my CPU sitting there doing nothing.

I think I might have a problem here. This project has been stuck on 100% for almost 40 hours. Yet it won't complete and run the next job. Should I abort so another job can start? Here's a screenshot of my BOINC.

I have read the thread to try to understand what is going on, but I don't fully understand all of it. From what has been posted, I gather that this is not acceptable behavior for this project. It's using all my CPU, yet not getting anything done.

Having similar problems here on my servers running Mint 17.1 with a PNY [GeForce 6x0] series graphics card.

Have one now crunching for over 193 hours, using all 4 CPUs and the GPU in rotation. It says it was due Mon, Feb 16th, but it's well past that and shows 100% complete.

Looking at properties though I get something different...

CPU Time at last Checkpoint: ---
CPU Time: 194:13:54
Elapsed Time: 193:19:20

But shows 100% complete!

CPU Time and Elapsed Time chase each other but never meet!

Please, whoever is doing this to us: make sure you use completion as the "done" condition, or else the work results will never get returned to you!

I think I might have a problem here. This project has been stuck on 100% for almost 40 hours. Yet it won't complete and run the next job. Should I abort so another job can start? Here's a screenshot of my BOINC.

I have read the thread to try to understand what is going on, but I don't fully understand all of it. From what has been posted, I gather that this is not acceptable behavior for this project. It's using all my CPU, yet not getting anything done.

Andrew-PC and cowboy2199: check the version of the CPU client you are running. If it reads version 1.46, that's the bugged version, and those units will never finish. Abort those units, then force an update for milkyway@home to get v1.48.

Heads up though, v1.48 has a lengthy initialization period in which it runs single-threaded but ties up all available cores.

If it reads version 1.46, that's the bugged version, and those units will never finish. Abort those units, then force an update for milkyway@home to get v1.48.

Did we ever get absolute confirmation from a project admin, that some of the 1.46 units would never complete?

You should ask whether the tasks will ever complete *successfully*. They are guaranteed to complete, because the BOINC client will kill them eventually for over-running.

I'm not sure how long they will run on for. The traditional setting for a BOINC project in general is ten times the initial estimated runtime, but I wouldn't be surprised if this project had given themselves a bit of extra headroom to protect themselves from what they might see as premature exits.

If you're feeling masochistic, check the ratio between <rsc_fpops_est> and <rsc_fpops_bound> for the tasks in question.
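Those two fields live in the workunit entries of the BOINC client state file. As a sketch (the Linux path below is the usual default; other platforms keep the file elsewhere), you could extract the ratio like this:

```python
# Sketch: read <rsc_fpops_est> and <rsc_fpops_bound> for each workunit
# from BOINC's client_state.xml and report the bound/est ratio, which
# governs how far past its estimate a task may run before being killed.
import xml.etree.ElementTree as ET

def fpops_ratios(client_state_xml):
    """Yield (workunit_name, bound/est) pairs from client_state.xml text."""
    root = ET.fromstring(client_state_xml)
    for wu in root.iter("workunit"):
        name = wu.findtext("name")
        est = float(wu.findtext("rsc_fpops_est"))
        bound = float(wu.findtext("rsc_fpops_bound"))
        yield name, bound / est

# Typical usage (path is the common Linux default; adjust for your install):
# with open("/var/lib/boinc-client/client_state.xml") as f:
#     for name, ratio in fpops_ratios(f.read()):
#         print(name, ratio)
```

A ratio of 10 would match the traditional "ten times the estimate" setting mentioned above; a larger ratio would confirm the extra headroom theory.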

Therefore, for now, we removed all multithreading of the initialization of the dwarf galaxy. This is why it will run only on one thread until the initialization completes. However, after performing a speed profile on the code, it was determined that a majority of the run time of the previous code was spent on a single function (thanks to Roland Judd for catching that!). We optimized this function, leading to a great decrease in the run time.

Cheers,
Sidd

To put some figures on that. I'm running on an i5 laptop (2 cores, 4 threads). Current task was estimated at 5 hours 26 mins. The single-threaded initialisation phase lasted for 18 minutes. (I think the initialisation lasted roughly the same time for the previous task, estimated at 50 minutes, but I didn't have Process Explorer open to monitor). Would I be right in assuming that the initialisation would be expected to have a constant duration for any given host, no matter how long the expected task duration?

During initialisation, the application doesn't checkpoint, and doesn't report any progress %age. I'm running the current recommended BOINC v7.4.36, which estimates and reports a 'pseudo progress %age' to reassure the casual observer if the application fails to supply any real %age.

But when the application switches to true multithreaded mode after initialisation, it

a) checkpoints
b) reports the true progress at that stage, which is zero%

So, the quite significant 'pseudo progress' (5.4%, for this task), is thrown away, and the progress bar regresses to the origin.

That's all 'by the book' - just reporting it as a 'boincification artefact' that might catch some users unawares.

Therefore, for now, we removed all multithreading from the initialization of the dwarf galaxy. This is why it runs on only one thread until the initialization completes. However, after profiling the code, we found that the majority of the run time of the previous version was spent in a single function (thanks to Roland Judd for catching that!). We optimized this function, greatly reducing the run time.

Cheers,
Sidd

To put some figures on that. I'm running on an i5 laptop (2 cores, 4 threads). Current task was estimated at 5 hours 26 mins. The single-threaded initialisation phase lasted for 18 minutes. (I think the initialisation lasted roughly the same time for the previous task, estimated at 50 minutes, but I didn't have Process Explorer open to monitor). Would I be right in assuming that the initialisation would be expected to have a constant duration for any given host, no matter how long the expected task duration?

I think the initialization period varies between workunits. IMO the big problem right now is that the initialization period blocks the other cores/threads from doing meaningful work. I have an 8c/16t CPU in my main rig (E5-2690, 2.9 GHz stock, 3.3 GHz all-core turbo, a very powerful chip) that I have BOINC set to use 15 of those 16 threads explicitly for CPU tasks, with the remaining thread set aside to power the GPUs. What happens when BOINC has Nbody 1.48 running is that it allocates the 15 threads like it should, but then runs in single-threaded mode for about 10-30 minutes depending on the unit.

IMO the way Milkyway runs needs further tweaking. I had a couple of ideas for helping to prevent excessive idle time, but getting them implemented would require a lot of extra work on the part of the devs. If anyone's interested I'll post those ideas, but they may seem a bit outlandish since I know little about programming.

Interesting thoughts. I wonder if Sidd can enlighten us where the parallelisation 'sweet spot' is, or if they'd like us to try and find it. Considering the MT phase only, I'd imagine that there comes a point where the overhead of managing and synchronising multiple threads exceeds the benefit - but I wouldn't know whether the tipping point is above or below 15 threads.

I'm currently verifying that the Application configuration tools available in BOINC v7.4.36 allow thread control - initially, limiting the active thread count to 3, so that other projects can continue to make progress on one core while nbody runs.

The next test is to run a bundle of tasks to the first checkpoint with an app_config thread limit of one, with the intention of changing app_config when they've all been prepped, and running the MT phase with a 3-thread app_config. Very labour intensive, and not amenable to scripted automation, but might be an interesting proof-of concept. If it works, the project might consider splitting the app at the end of the initialisation phase.

a) Send out a single threaded task to perform initialisation
b) Return the initialisation data generated as an output file
c) Send out the initialisation data as an input file to a new, multithreaded, simulation task.
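For reference, the app_config.xml thread control described above might look something like the fragment below. This is a sketch from memory: the app name (milkyway_nbody), the mt plan class, and the --nthreads option are assumptions and should be checked against the project's actual application names before use.

```xml
<!-- app_config.xml, placed in the milkyway.cs.rpi.edu_milkyway project
     directory. Limits the nbody MT app to 3 threads so other projects
     can keep one core busy. App name and plan class are assumptions. -->
<app_config>
  <app_version>
    <app_name>milkyway_nbody</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>3</avg_ncpus>
    <cmdline>--nthreads 3</cmdline>
  </app_version>
</app_config>
```

After editing the file, BOINC needs a "re-read config files" (or a client restart) for the change to take effect, and it applies only to newly started tasks.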

Interesting thoughts. I wonder if Sidd can enlighten us where the parallelisation 'sweet spot' is, or if they'd like us to try and find it. Considering the MT phase only, I'd imagine that there comes a point where the overhead of managing and synchronising multiple threads exceeds the benefit - but I wouldn't know whether the tipping point is above or below 15 threads.

I'm currently verifying that the Application configuration tools available in BOINC v7.4.36 allow thread control - initially, limiting the active thread count to 3, so that other projects can continue to make progress on one core while nbody runs.

The next test is to run a bundle of tasks to the first checkpoint with an app_config thread limit of one, with the intention of changing app_config when they've all been prepped, and running the MT phase with a 3-thread app_config. Very labour intensive, and not amenable to scripted automation, but might be an interesting proof-of concept. If it works, the project might consider splitting the app at the end of the initialisation phase.

a) Send out a single threaded task to perform initialisation
b) Return the initialisation data generated as an output file
c) Send out the initialisation data as an input file to a new, multithreaded, simulation task.

I think with the new version the parallelization of the work tops off at about 10 active threads. Once my CPU gets going, it only uses about two-thirds of the available core count.

As for your tests, that is precisely what I was thinking of. Here's what I cooked up for an earlier post but cut out:

What I would like to know is if the initialization and actual computation of a run can be split for more effective use of available resources. Instead of doing it like this:
Download work units
Initialize one on one thread (blocking all other cores/threads from useful work)
Process the workunit
Send in completed work
Repeat

I propose the work be done like this (2 methods):
Batch mode
1. Download workunits
2. Initialize a number of workunits simultaneously. Each unit that is initialized has its state saved to await open resources, forming a work queue in the order each unit's prep completes. This way the execution resources aren't sitting there doing nothing.
3A. When there are sufficient workunits ready, the app switches to multi-threaded mode, grabs an initialized unit, and begins processing.
3B. When complete, the unit is turned in and another initialized unit is pulled from the client-side work queue for processing.
4. When the initialized work queue is exhausted, it switches back to initialization mode and preps another batch of work.

Stream mode
1. Download workunits
2. Initialize a workunit, then begin processing it the moment it's done, using the allocated threads as is currently done, but minus one thread.
3. While the ready unit is getting chewed through, the open thread is used to prep another unit for processing. Units that are ready before resources open up have their states saved to disk. The open thread is then used to ready another unit.
4. As each unit completes, a ready unit is slotted in and begins processing. This keeps going until the work dries up (either because nothing is coming from the project, or because the user has set no new work for the project).

In either case, two programs are necessary: one to initialize, and one to actually process.
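The "stream mode" idea above can be sketched with standard producer/consumer machinery: one initializer thread keeps feeding a bounded queue of prepped units while a pool of workers processes them. The init and process functions here are trivial stand-ins for the real single-threaded prep and multi-threaded simulation.

```python
# Sketch of "stream mode": an initializer thread preps workunits into a
# small client-side queue while worker threads process ready units, so
# no thread sits idle waiting for initialization to finish.
# init_unit/process_unit are stand-ins, not real Nbody functions.
import queue
import threading

def init_unit(unit):
    return ("ready", unit)          # stand-in for single-threaded prep

def process_unit(ready):
    return ("done", ready[1])       # stand-in for the MT simulation

def run_pipeline(units, n_workers=3):
    ready_q = queue.Queue(maxsize=2)    # bounded work queue of prepped units
    results = []
    lock = threading.Lock()

    def initializer():
        for u in units:
            ready_q.put(init_unit(u))   # blocks while the queue is full
        for _ in range(n_workers):
            ready_q.put(None)           # one stop signal per worker

    def worker():
        while True:
            item = ready_q.get()
            if item is None:
                break
            done = process_unit(item)
            with lock:
                results.append(done)

    threads = [threading.Thread(target=initializer)]
    threads += [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The bounded queue naturally covers the "batch" variant too: with a larger maxsize the initializer runs ahead and builds a batch before the workers drain it.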

Now, I'll admit I know very little about programming, and thus don't know the viability of switching back and forth between prepping a batch of workunits and chewing through them one at a time as they become ready.

A variation of this would have it going from Batch to stream mode as workflow gets moving, and instead of one monolithic MT unit, it does 2 or 3 at once depending on thread allocation count. On my rig for example, 2 blocks of 7 threads would run MT, with the 15th thread initializing units.

Batch mode would be suitable for systems that run under the effective thread count (11), while batch to stream mode would be better suited for machines with high core counts. Machines with extremely high thread counts (>=24 threads) would dedicate more than one thread for maintaining the work queue (1 init thread per every work block). So a crazy person running an i4P loaded with 16c/32t Xeons (120T) could end up with 14 8-thread work blocks, with the remaining 8 threads feeding the beast so to speak. Tuning that for optimal workflow would take some time though.

I wonder if BOINC allows running apps to have their own daughter processes.

It looks as if the new v1.48 application has been deployed as a 64-bit application only.

32-bit Windows computers are still being allocated work, but are trying to run it with the old v1.46 application - and failing. That causes unnecessary waste and delay in validating the results from hosts which are still 'unreliable' and need the full 3-way validation.