I may be misunderstanding something here, but why don't you use futures? Create a thread pool with e.g. Executors.newFixedThreadPool() and then submit some anonymous objects of type Callable to it. That gives you a list of Future objects, and then you simply call get() on all of them to block until all threads are finished. That way you don't need to synchronize anything and you don't need volatile either. As long as the data you pass into your Callables is immutable and your Callables are side-effect free, you are guaranteed to never run into any kind of threading problem.
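As a minimal sketch of that suggestion (the task itself, a chunked sum, is just a hypothetical stand-in for your real work):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureSum {
    // Split a sum over a fixed pool; each Callable only reads immutable
    // input and returns a value, so no synchronization is needed.
    static long parallelSum(int[] data, int chunks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        try {
            List<Future<Long>> futures = new ArrayList<Future<Long>>();
            int len = (data.length + chunks - 1) / chunks;
            for (int c = 0; c < chunks; c++) {
                final int from = c * len;
                final int to = Math.min(data.length, from + len);
                futures.add(pool.submit(new Callable<Long>() {
                    public Long call() {
                        long sum = 0;
                        for (int i = from; i < to; i++) sum += data[i];
                        return sum;
                    }
                }));
            }
            long total = 0;
            for (Future<Long> f : futures) total += f.get(); // blocks until done
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(parallelSum(data, 8)); // prints 500500
    }
}
```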

Some notes: this whole "happens before" business and the related formal speak... don't bother with it. You need it if you're going to write a compliant VM or if you need to have a formal conversation. For actually writing your average to not-so-average multi-threaded code it just needlessly complicates everything and will be more confusing than helpful. Atomics and volatiles are straightforward (and you're not going to run out and make your own custom concurrent data structures, because you're too smart to do that).

On volatiles: most people should probably completely ignore their existence... at least until your feet are wet with multithreading and you have a basic low-level understanding of how a program is translated into machine code. You don't really need them, as you can use the more flexible atomic wrappers instead to do the same thing and much more. Volatiles are pretty limited in usage without very careful thought, and that extra thought isn't worth saving the extra memory the atomic wrappers cost (heck, the extra memory can even be a good thing by lowering the chance of false sharing). And for those that care: performance-wise, atomic wrappers and volatiles are about the same. On Intel-alikes they both use the 'lock' prefix on the instructions (this locks the memory bus for the duration of the op's execution and flushes all pending memory operations).

Roughly speaking, marking a field volatile says:

1) Every time I specify a read of this field, you must read it (and not use a local copy, register or otherwise).
2) Every time I specify a write to this field, you must write it... even if it seems pointless or redundant.
3) Don't move this around, and make sure memory is up-to-date before doing the read or write.

If this isn't perfectly clear: Use the atomic wrappers (just to repeat myself).
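A minimal, hypothetical example of what those rules buy you in practice - the classic stop-flag: because `running` is volatile, the worker loop must re-read it every pass instead of hoisting it into a register, so the write from another thread is guaranteed to become visible:

```java
public class StopFlag {
    // volatile: every read in the loop below is a real read, every write in
    // stop() is a real write that becomes visible to the looping thread.
    private volatile boolean running = true;

    void stop() { running = false; }

    int spinUntilStopped() {
        int iterations = 0;
        while (running) { // must be re-read each iteration (rule 1 above)
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) throws Exception {
        final StopFlag f = new StopFlag();
        Thread t = new Thread(new Runnable() {
            public void run() { f.spinUntilStopped(); }
        });
        t.start();
        Thread.sleep(10);
        f.stop();             // visible to the spinning thread
        t.join(2000);
        System.out.println(t.isAlive() ? "stuck" : "stopped"); // prints "stopped"
    }
}
```

Without the volatile keyword, the JIT would be free to compile the loop as if `running` could never change, which is exactly the field-to-register promotion discussed later in this thread.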

Roquen, I understand your PoV, given the intended audience, but I'll have to slightly disagree.

- The atomic classes are nothing more than wrappers around a volatile field, plus CAS-based utility methods.
- The most common use of atomic classes is to concurrently count or sum values. For those usages, atomic classes have been superseded by LongAdder/LongAccumulator and DoubleAdder/DoubleAccumulator in JDK 8, which have much higher throughput under high contention.
- Then there's (ab)using atomic classes for concurrency control. CyclicBarrier, CountDownLatch and Phaser are much better options here.
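To illustrate the counting point, here's a small sketch that counts the same total with both an AtomicLong (all threads CAS on a single cell) and a JDK 8 LongAdder (contention is striped across cells, which is why it has higher throughput when contended):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

public class AdderDemo {
    // Counts to threads * perThread twice: once via AtomicLong, once via
    // LongAdder. Both end at the same total; they differ in throughput
    // under contention, not in correctness.
    static long[] count(final int threads, final int perThread) throws Exception {
        final AtomicLong atomic = new AtomicLong();
        final LongAdder adder = new LongAdder();
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            ts[t] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < perThread; i++) {
                        atomic.incrementAndGet(); // single contended cell
                        adder.increment();        // striped cells
                    }
                }
            });
            ts[t].start();
        }
        for (Thread t : ts) t.join();
        return new long[] { atomic.get(), adder.sum() };
    }

    public static void main(String[] args) throws Exception {
        long[] totals = count(4, 100000);
        System.out.println(totals[0] + " " + totals[1]); // prints "400000 400000"
    }
}
```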

There's not much utility left for atomics, except for people writing libraries. I think a novice programmer would be better off sticking with volatile; all those methods with the funny names in the atomic classes would just be confusing. The real trouble is getting familiar with and understanding when and how to use the java.util.concurrent classes, though. That should always be the first option.

For those wanting to learn more, these 3 posts are interesting: 1st, 2nd, 3rd. Not Java specific, but it's better to understand what happens at a lower level.

3) Don't move this around and make sure memory is up-to-date before doing the read or write.

meaning that totalTasks should be uncached on every iteration, yes?

This is assuming calls to add() and take() are properly paired up and there is no way the totalTasks value can be changed during the time a task is being processed. Just saying I think something else is the cause of this issue, not in any way saying that I think this code is a good idea!

I may be misunderstanding something here, but why don't you use futures?

That's what I said earlier, though it's complicated by the need for task dependencies. You could possibly look at tasks blocking on the Futures of subtasks; assuming tasks are added in depth-first order this might work without creating too many threads? The other thing I'd look at is using two queues, i.e. a queue for completed worker tasks passed back to the main thread - that way the actual task counting / management is single-threaded.
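The two-queue idea can be sketched like this (hypothetical names; the point is that only the main thread ever touches the counter, so it needs neither volatile nor locks):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CompletionQueue {
    // Workers report finished task ids on 'done'; the main thread drains it
    // and does all the counting single-threaded.
    static int runTasks(final int totalTasks) throws InterruptedException {
        final BlockingQueue<Integer> done = new LinkedBlockingQueue<Integer>();
        for (int id = 0; id < totalTasks; id++) {
            final int taskId = id;
            new Thread(new Runnable() {
                public void run() {
                    // ... process the task here ...
                    done.add(taskId); // hand the completed task back
                }
            }).start();
        }
        int finished = 0;
        while (finished < totalTasks) {
            done.take(); // blocks until some worker reports completion
            finished++;  // plain int: only the main thread reads/writes it
        }
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runTasks(10)); // prints 10
    }
}
```

The take() call establishes the happens-before edge with each worker's add(), so anything a task wrote before completing is visible to the main thread here.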

I think a novice programmer would be better off sticking with volatile and all those methods with the funny names in the atomic classes would just be confusing. The real trouble is getting familiar and understanding when and how to use the java.util.concurrent classes though. That should always be the first option.

Yes, personally I think the novice, and usually everyone else, is better off working with higher-level constructs such as producer-consumer queues or executors from java.util.concurrent. No shared state ftw!

An atomic operation A that performs a release operation on an atomic object M synchronizes with an atomic operation B that performs an acquire operation on M and takes its value from any side effect in the release sequence headed by A.

That would explain why my original multithreading code was working and should also prove that it should always work, assuming that what's written in that link also applies to Java.

Anyway, the race condition only started occurring when I first ran a TaskTree with a single task (totalTasks = 1), and then ran a different TaskTree with 10 tasks. When the second TaskTree was run, a worker thread would sometimes process a single task, see that (taskCounter = totalTasks = 1) and signal that all tasks had been completed.

i.e. it works correctly only the first time through the loop, then the optimization kicks in.

This makes sense and fits very well with what I saw. I usually ran the game single-threaded for a short time by simply calling task.process() on all tasks from the main thread. Then I'd press a button to run the game multithreaded and it'd explode. Sometimes it'd "just" corrupt the task queue by finishing early, effectively corrupting the whole executor and task tree. Rarely it would also freeze, which could be explained by the executor waiting for 10 tasks to finish when there was only 1 task to run.

3) Don't move this around and make sure memory is up-to-date before doing the read or write.

meaning that totalTasks should be uncached on every iteration, yes?

This is assuming calls to add() and take() are properly paired up and there is no way the totalTasks value can be changed during the time a task is being processed. Just saying I think something else is the cause of this issue, not in any way saying that I think this code is a good idea!

I'm a bit worried about this though. What could it possibly have been in that case? >_< If add()s and take()s aren't properly causing synchronization between the threads, wouldn't that break their guarantees? If we imagine a computer with an infinite (or sufficiently large) number of registers, wouldn't that mean that literally any variable not marked as volatile would be cacheable by the compiler, if it always assumes that only a single thread can modify each variable?

Here's a live example from my engine (which works, of course). I do frustum culling on all 3D models for each perspective (camera + shadow maps) to determine which models need their bone animation skeletons updated.

Task 1-8: Loop through all models, setting boolean needsUpdate = false for each model. Each task processes 1/8th of all models. (It also does other things, like updating bounding boxes and stuff.)
Task 9-16: Go through all perspectives, checking which models are visible to which perspectives. Once a model is visible to a perspective, I set needsUpdate = true for that model. Each task handles a subset of the perspectives.
Task 17: This task simply loops over all models, counting them and adding them to a list. This list then holds all models that need to have their skeletons updated.
Task 18-25: Loop over all models that need to be updated, computing the interpolated bones and generating matrices for each skeleton.
Task 26 (run on OpenGL thread): Maps a VBO, stores all matrices in it and unmaps it again. (Runs in parallel to terrain culling.)

If the compiler is legally allowed to ignore synchronization by storing variables in registers, wouldn't that effectively mean that the compiler could pull all model.needsUpdate variables into registers on the first run? If it can't do that since it's a loop over multiple models, what if there's only one model? The resetting to false done by the first 8 tasks wouldn't be visible to the subsequent tasks, the code that sets needsUpdate to true wouldn't be visible, and the counting task would also fail since it wouldn't know which flags had been set to true.

Isn't this what we can call a compiler or VM bug?


It's much more efficient (and convenient) to push all models that are 'in view' in a queue, which can be popped by other threads immediately, without waiting for all models to be frustum-tested. Your 'barrier' causes 7 out of 8 threads/cores to be idling prior to the last thread finishing. The whole needsUpdate field can be removed, as the state of a model is implicit to whether it was pushed in queue.
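A hypothetical sketch of that queue-based scheme (names invented for illustration): culling threads push every model that passes the frustum test into a shared queue, and consumers pop from it, so the per-model needsUpdate flag disappears entirely:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class CullQueue {
    // Culling threads push visible models into 'visible'; being in the queue
    // IS the "needs update" state, so no per-model flag is required.
    static int cullAndUpdate(final int models, int cullThreads) throws InterruptedException {
        final ConcurrentLinkedQueue<Integer> visible = new ConcurrentLinkedQueue<Integer>();
        final AtomicInteger next = new AtomicInteger();
        Thread[] ts = new Thread[cullThreads];
        for (int t = 0; t < cullThreads; t++) {
            ts[t] = new Thread(new Runnable() {
                public void run() {
                    int m;
                    while ((m = next.getAndIncrement()) < models) {
                        // frustumTest(m) would go here; assume every model passes
                        visible.add(m); // model is in view: queue it for update
                    }
                }
            });
            ts[t].start();
        }
        for (Thread t : ts) t.join();
        // Consumer side: pop and update skeletons. In a real engine this runs
        // concurrently with the producers instead of after them.
        int updated = 0;
        while (visible.poll() != null) updated++;
        return updated;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(cullAndUpdate(100, 4)); // prints 100
    }
}
```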


It's much more efficient (and convenient) to push all models that are 'in view' in a queue, which can be popped by other threads immediately, without waiting for all models to be frustum-tested. Your 'barrier' causes 7 out of 8 threads/cores to be idling prior to the last thread finishing. The whole needsUpdate field can be removed, as the state of a model is implicit to whether it was pushed in queue.

I'm not so sure. Each model may be visible from lots of different perspectives at the same time, but I only want to compute the skeleton for it once. If the threads write to a shared queue they'd need some kind of mechanism to ensure that the model is only added once. That kind of synchronization will result in worse performance, I think.

I'm not quite sure what you mean by saying that 7/8 threads will idle before it finishes. Since the perspectives have very different shapes and sizes, the amount of computation they need varies a lot. My first implementation simply divided up the list of perspectives into equal shares for each core, but that resulted in widely different loads on the cores. Instead, I'm now using an atomic integer which allows the threads to grab the next perspective that needs culling, to achieve some basic load balancing. With 8 threads on an i7 (quad-core + Hyperthreading) I've gotten around 4.1x scaling when benchmarking these 8 tasks only.
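That grab-the-next-index scheme can be sketched as follows (cullPerspective is a hypothetical stand-in for the real per-perspective work):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class GrabNext {
    // Each worker grabs the index of the next perspective to cull with
    // getAndIncrement(), so cheap perspectives don't leave cores idle while
    // expensive ones are still in flight.
    static int cullAll(final int perspectives, int threads) throws InterruptedException {
        final AtomicInteger next = new AtomicInteger();
        final AtomicInteger processed = new AtomicInteger();
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            ts[t] = new Thread(new Runnable() {
                public void run() {
                    int i;
                    while ((i = next.getAndIncrement()) < perspectives) {
                        // cullPerspective(i) would go here (variable cost per i)
                        processed.incrementAndGet();
                    }
                }
            });
            ts[t].start();
        }
        for (Thread t : ts) t.join();
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(cullAll(20, 8)); // prints 20
    }
}
```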

EDIT: Oh, I think I misunderstood you slightly. You mean that I'd want to run both the culling AND the skeleton computation tasks at the same time, right? I think my point still stands though. The synchronization to check if a model is already being or has been updated would scale worse than what I have now.

@DrZoidberg: the issue is that frustum culling is only efficient when looking at it from the Perspective point of view (no pun intended). You don't ask each model whether it is in view for a certain perspective, but look at the scene from each Perspective and gather all the models it intersects with.


Riven is correct. By using, for example, a tree you can reduce the number of checks per perspective drastically, by being able to cull large amounts of objects per test. Your code requires (perspectives * models) tests.

I'd get rid of the need to synchronize (through volatile fields or locks) for every model, by adding the intersecting models to the Perspective, which does the culling on a single thread. The advantage here is that you don't have these expensive memory flushes and CPU pipeline stalls.

You'd only need to sync when gathering/merging the culling results of each Perspective.


Well, I was trying to solve the problem, instead of kind of figuring out why the JVM trips over bytecode that is clearly incorrect. To me it's not really interesting to understand each step towards the state where the code failed (regarding how/why the JIT optimizes). At this point it is (with all due respect), "garbage in, garbage out": the totalTasks field was supposed to be volatile, or you'd have to use the concurrency classes to do the heavy lifting for you.

So... I thought about moving on to improving the solution, suggesting more efficient ways to reduce the amount of stall-induced latency and contention.


I'd get rid of the need to synchronize (through volatile fields or locks) for every model, by adding the intersecting models to the Perspective, which does the culling on a single thread. The advantage here is that you don't have these expensive memory flushes and CPU pipeline stalls.

You'd only need to sync when gathering/merging the culling results of each Perspective.

Why would I need volatile fields in this case? When all tasks finish, the changes will be visible to the following tasks that explicitly required that task, since that's what the concurrent collections guarantee.

I can agree that writing to the same field from multiple threads, even without synchronization, will have bad performance due to cache flushes, but why are you assuming that this is such a problem in the first place? It scales well, and this culling code is not a very big performance hog anyway.

Well, I was trying to solve the problem, instead of kind of figuring out why the JVM trips over bytecode that is clearly incorrect. To me it's not really interesting to understand each step towards the state where the code failed (regarding how/why the JIT optimizes). At this point it is (with all due respect), "garbage in, garbage out": the totalTasks field was supposed to be volatile, or you'd have to use the concurrency classes to do the heavy lifting for you.

So... I thought about moving on to improving the solution, suggesting more efficient ways to reduce the amount of stall-induced latency and contention.

That's a bit aggressive, isn't it? What problem are you trying to solve? Because I did not ask for advice on optimizing my frustum culling; I wanted to know why it works but my totalTasks counter does not. I AM having the concurrency classes do the heavy lifting.

You seem to know exactly what the problem is (garbage bytecode), so why don't you enlighten me by telling me how (from my limited, unenlightened perspective) the compiler is breaking the synchronization that my PriorityBlockingQueue<Task> is supposed to guarantee according to the specs. And if the problem is indeed what Spasi said, that the totalTasks variable is only read once, how come that does not happen anywhere else? Because if it can happen anywhere, then you are right: my threading code is not thread-safe. But more importantly (from how my unenlightened mind sees this), the concurrent collections' synchronization is not to be trusted, since the compiler can break it, and ALL fields that are at any point read or written by more than one thread, regardless of the synchronization from concurrent collections, have to be volatile.

Frankly, I have no use for knowing how to fix the code in my first post, because 1. I've already fixed it myself, and 2. the goal is to avoid this issue in the future.

And I'm sorry for going slightly off topic myself now. Anyway, on Riven's not-so-nicely-conveyed tip, I decided to use a bytecode viewer for the first time and check out my .class files. By looking at the run(TaskTree) method, I can see that all synchronization happens inside calls to "invokevirtual" instructions. Is it possible that the JIT compiler does not know that a command inside a call to invokevirtual puts a restriction on ordering (like ArrayBlockingQueue.take()), so it would optimize the load instruction to outside the loop? I'm sorry, I'm running out of ideas here...

There are two misconceptions: I didn't mean to be aggressive in conveying what I did, and I didn't actually mean that the bytecode was garbage. The bytecode will be functionally identical to your source code. I meant to say that the 'input' to the JIT was wrong, be it the Java source code or the derived Java bytecode. Don't expect to find any surprises in the bytecode; javac is very straightforward and barely optimizes anything, let alone promoting fields to local variables.

If you make assumptions outside of the guarantees made by the specification of the Java Memory Model, then it can very well work, but it might trip up once in a billion cases. That your unit tests or manual tests work as expected is sadly no way to deem code thread-safe.

and ALL fields that are at any point read or written by more than one thread, regardless of the synchronization from concurrent collections, have to be volatile

It is indeed true that you simply must not read and write fields from multiple threads unless they are volatile or accessed in a critical section - especially when more than one thread is writing, because at that point you actually must resort to reading and writing solely inside critical sections (think: ReentrantReadWriteLock). You basically have to inform the JIT about every field that is accessed concurrently by using volatile, to disable its full-blown, super aggressive optimisations, like promoting fields to variables - which gets even more complex when inlining is involved, as you can't assume that method boundaries are also boundaries of these promotions. So spell it out for the JIT, however clear you think your source code is, that it should take concurrent access into account for specific fields.


and ALL fields that are at any point read or written by more than one thread, regardless of the synchronization from concurrent collections, have to be volatile

It is indeed true that you simply must not read and write fields from multiple threads unless they are volatile or accessed in a critical section - especially when more than one thread is writing, because at that point you actually must resort to reading and writing solely inside critical sections (think: ReentrantReadWriteLock). You basically have to inform the JIT about every field that is accessed concurrently by using volatile, to disable its full-blown, super aggressive optimisations, like promoting fields to variables - which gets even more complex when inlining is involved, as you can't assume that method boundaries are also boundaries of these promotions. So spell it out for the JIT, however clear you think your source code is, that it should take concurrent access into account for specific fields.

Just to make this clear: the JIT does not care that there is an explicit memory barrier/fence/whatever that synchronization point from take() is called in the loop, when it decides whether or not to move the read out of the loop? What's the point of doing such optimizations if the result is that I have to make everything volatile to counter it?

The JIT simply adheres to the Java Memory Model, and makes the most aggressive optimisations within those rules. Maybe these guarantees are too narrow for your taste, pushing the burden onto the developer, but this is the environment we operate in.

Given the JIT team is only advancing their technology, code that used to work (guaranteed with older JITs) on incorrect assumptions, will break in future JREs.

Just to make this clear: the JIT does not care that there is an explicit memory barrier/fence/whatever that synchronization point from take() is called in the loop, when it decides whether or not to move the read out of the loop? What's the point of doing such optimizations if the result is that I have to make everything volatile to counter it?

This optimization will work in all cases where there isn't any kind of take()-style barrier:

for (int i = 0; i < objects.size(); i++) { count++; }
// there's a memory barrier, so objects.size() needs to be read each iteration.
// however, the plain field 'count' can obviously be accumulated in a register
// and written out once at the end, since that doesn't violate any synchronization.

if (taskCounter.incrementAndGet() == totalTasks) { ... }
// the read of totalTasks must be done each iteration, for the same reason as objects.size() above.

As the quote says, the memory *will be* synced/updated for the other thread - you just can't see it, because you're not reading back that specific memory location, due to *potential* field->localvar promotion, so better disable that, by flagging it volatile.


Often the JIT optimizes methods once they have been executed N times, and then again much later, more aggressively. As you never leave your methods (N=0), that might hold the JIT optimisations back.

Making a test that simply pulls the loop body outside it, and putting it into its own method, might or might not show field->localvar promotion, as the JIT might already have inlined it back into the loop, or some other condition applied that I'm not aware of, disabling the promotion.

Hence using the word 'lucky' earlier. You just have to account for it, because what works today, might break tomorrow. </cheesy>


As the quote says, the memory *will be* synced/updated for the other thread - you just can't see it, because you're not reading back that specific memory location, due to *potential* field->localvar promotion,

hmm .. I fail to see how localvar promotion in that case wouldn't break the Java Memory Model. Any localvar caching should be invalidated by crossing the memory barrier or that would appear to be a bug?

It doesn't surprise me that this short example is working. I still believe the bug in the original code is elsewhere (probably in the signalling of task completion) and you're chasing the rabbit down the wrong hole!

I can't explain it either, but using ArrayBlockingQueue(1) instead of using SynchronousQueue got rid of the deadlock

SynchronousQueue would more accurately be equivalent to ArrayBlockingQueue(0), i.e. it has no storage at all.

The issue is the use of offer() vs put() - non-blocking vs blocking. Check the return value of offer() and it should be false. The offer() in Thread 1 does not block and cannot return successfully unless Thread 2 is already in the take() method, and then Thread 1 blocks on a take() that can never be serviced. Thread 2 is doing the reverse, hence deadlock.

The use of an ArrayBlockingQueue with a single storage slot would allow offer() to work.
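This difference is easy to see directly: with no taker waiting, offer() on a SynchronousQueue fails immediately, while the single slot of an ArrayBlockingQueue(1) lets the same offer() succeed:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.SynchronousQueue;

public class OfferDemo {
    public static void main(String[] args) {
        // SynchronousQueue has no storage: a non-blocking offer() can only
        // succeed if another thread is already parked in take().
        SynchronousQueue<String> sync = new SynchronousQueue<String>();
        System.out.println(sync.offer("x")); // prints false

        // ArrayBlockingQueue(1) has one slot, so the same offer() succeeds
        // even with no consumer present yet.
        ArrayBlockingQueue<String> buffered = new ArrayBlockingQueue<String>(1);
        System.out.println(buffered.offer("x")); // prints true
    }
}
```

This is why checking offer()'s return value (or using the blocking put()) matters: a silently failed offer() is exactly the handshake mismatch that produces the deadlock described above.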

Any localvar caching should be invalidated by crossing the memory barrier or that would appear to be a bug?

Remember, we're talking about the compiler seeing the value of a field as being constant across the lifetime of a method - not about a main-memory value that's stored in the cache hierarchy - because the programmer implicitly states that nobody can change its value behind the thread in question's back. No amount of synchronization or memory barriers will address this issue. The compiler is performing a perfectly legal transformation which doesn't consult main memory.

(EDIT: by "No amount of synchronization..." I meant wrapping in a mutex)
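As a sketch of the transformation being described (loopAfterPromotion is hand-written to mimic what the JIT *may* effectively emit; it is not actual JIT output), note that both versions behave identically in a single thread - the difference only matters if another thread changes totalTasks:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class Promotion {
    int totalTasks = 10; // not volatile: the JIT may assume it never changes
                         // during a method's execution

    // The loop as written in source: a field read every iteration.
    int loopAsWritten(AtomicInteger counter) {
        int fired = 0;
        for (int i = 0; i < 20; i++) {
            if (counter.incrementAndGet() == totalTasks) fired++;
        }
        return fired;
    }

    // What field->localvar promotion is allowed to turn it into: the field
    // is read once, before the loop, and cached for the method's lifetime.
    // A concurrent write to totalTasks would never be observed here.
    int loopAfterPromotion(AtomicInteger counter) {
        int cachedTotal = totalTasks; // single hoisted read
        int fired = 0;
        for (int i = 0; i < 20; i++) {
            if (counter.incrementAndGet() == cachedTotal) fired++;
        }
        return fired;
    }

    public static void main(String[] args) {
        Promotion p = new Promotion();
        System.out.println(p.loopAsWritten(new AtomicInteger()));      // prints 1
        System.out.println(p.loopAfterPromotion(new AtomicInteger())); // prints 1
    }
}
```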

hmm .. I fail to see how localvar promotion in that case wouldn't break the Java Memory Model. Any localvar caching should be invalidated by crossing the memory barrier or that would appear to be a bug?

It doesn't surprise me that this short example is working. I still believe the bug in the original code is elsewhere (probably in the signalling of task completion) and you're chasing the rabbit down the wrong hole!

The question then would be, why does the bug go away when counting backwards? A bug elsewhere (e.g. calling run(TaskTree) before the current batch of tasks has completed) would still be problematic in that case. Making totalTasks volatile would also make no difference. Btw, theagentd, have you actually tried that with the original code?

Any localvar caching should be invalidated by crossing the memory barrier or that would appear to be a bug?

Remember, we're talking about the compiler seeing the value of a field as being constant across the lifetime of a method - not about a main-memory value that's stored in the cache hierarchy - because the programmer implicitly states that nobody can change its value behind the thread in question's back. No amount of synchronization or memory barriers will address this issue. The compiler is performing a perfectly legal transformation which doesn't consult main memory.
