Using task isolation will likely have some performance impact, but hopefully it's negligible with the right choice of implementation.

The tricky part is that mutex locking in one part of the code can cause problems in parts of the code quite far away from it. It is hard to tell whether a function called from within a mutex-locked region uses parallelism, and harder still to ensure that assumption remains correct as the code changes.
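To make the failure mode concrete, here is a minimal sketch of the pattern using plain TBB and std::mutex (the function and data names are made up for illustration):

```cpp
#include <mutex>
#include <tbb/parallel_for.h>

static std::mutex cache_mutex;

void ensure_cache()
{
  std::lock_guard<std::mutex> lock(cache_mutex);
  /* Nested parallelism while holding the mutex: while this thread blocks
   * waiting for the parallel_for to finish, TBB may let it steal an
   * unrelated pending task. If that stolen task also calls ensure_cache(),
   * the thread tries to re-lock cache_mutex and deadlocks on itself. */
  tbb::parallel_for(0, 1000, [](int /*i*/) { /* build cached data */ });
}
```

Task isolation (tbb::this_task_arena::isolate) prevents exactly this kind of cross-task stealing while a thread is blocked.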

I can see a few ways of doing this:

Use task isolation inside every task being executed. I expect this would have the most overhead, likely too much for small tasks. But it would probably solve the entire problem without changes to other code (apart from the potential performance impact).
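As a rough illustration of this option, the task scheduler could wrap every task callback in an isolated region. A hedged sketch, where the Task struct and run_task_isolated helper are hypothetical and not existing Blender API:

```cpp
#include <tbb/task_arena.h>

struct Task {
  void (*run)(void *userdata);
  void *userdata;
};

/* Hypothetical wrapper: any parallelism spawned inside the task stays in
 * this isolated region, so a worker blocked inside the task cannot steal
 * unrelated outer tasks (which is what triggers the deadlock). */
void run_task_isolated(const Task &task)
{
  tbb::this_task_arena::isolate([&] { task.run(task.userdata); });
}
```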

A variation of the above, with the ability to turn off task isolation in specific cases where we know there will be no further nested parallelism (e.g. in sculpting). It would still need to be enabled for all depsgraph tasks, for example, which may be too much overhead.

Use task isolation in BLI_task_pool_work_and_wait and BLI_task_parallel_range. The downside is that any other usage of TBB will not be handled automatically. For example, OpenVDB uses TBB, so any OpenVDB processing inside a mutex-locked region would still need its own task isolation code.
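For illustration, this is roughly what it could look like inside BLI_task_parallel_range, assuming a TBB-backed implementation and a heavily simplified signature (the real function takes more parameters):

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/task_arena.h>

/* Simplified sketch; not the actual BLI_task_parallel_range signature. */
void BLI_task_parallel_range(int start, int stop, void *userdata,
                             void (*func)(void *userdata, int i))
{
  /* Isolation keeps the calling thread from stealing outer tasks while it
   * waits here, so a call from inside a locked region cannot deadlock. */
  tbb::this_task_arena::isolate([&] {
    tbb::parallel_for(tbb::blocked_range<int>(start, stop),
                      [&](const tbb::blocked_range<int> &range) {
                        for (int i = range.begin(); i != range.end(); i++) {
                          func(userdata, i);
                        }
                      });
  });
}
```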

Abstract the mutex locking / lazy initialization type of code, and make task isolation part of it. This would be the most efficient approach, applying task isolation only where it is actually needed, but it is also easy to forget to use the abstraction in all the right places.
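A hedged sketch of such an abstraction, with a made-up helper name (with_lock_isolated):

```cpp
#include <mutex>
#include <utility>
#include <tbb/task_arena.h>

/* Hypothetical helper: combines locking with task isolation so callers
 * cannot apply one without the other. The isolation cost is only paid at
 * lock sites, not for every task. */
template<typename Fn> void with_lock_isolated(std::mutex &mutex, Fn &&fn)
{
  std::lock_guard<std::mutex> lock(mutex);
  tbb::this_task_arena::isolate(std::forward<Fn>(fn));
}
```

A caller would then write with_lock_isolated(cache_mutex, [&] { build_data(); }); instead of taking the lock manually, making it impossible to hold the lock without isolation.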

Based on Brecht's backtrace, I wonder if the following would be a better long-term solution to this particular problem.

It seems we hold cache_rwlock for the entire time a new BVH tree is built (e.g. in bvhtree_from_mesh_looptri_ex). So only a single BVH tree can ever be built at a time when BKE_bvhtree_from_mesh_get is used. I don't see why it has to be this way. It should be possible to change the locking mechanism so that multiple BVH trees can be built in parallel. Once we have that, the deadlock shown in Brecht's backtrace should never happen.

Below is a rough sketch of how I think it might work. I'm not sure how to do it more precisely; I haven't done any thread synchronization work in quite a while. All of the locking logic could probably be abstracted away by some nice cache data structure.
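Very roughly, and with made-up names (BVHCacheEntry, bvh_cache_ensure), something like this double-checked, per-entry scheme: the global lock would only guard the cache lookup, while each tree is built under its own mutex, so different trees can be built concurrently and no thread holds a global lock while building.

```cpp
#include <atomic>
#include <mutex>

struct BVHTree; /* opaque, stands in for Blender's BVHTree */
struct Mesh;    /* opaque, stands in for Blender's Mesh */

struct BVHCacheEntry {
  std::mutex build_mutex;            /* serializes builders of this one tree */
  std::atomic<BVHTree *> tree{nullptr};
};

/* Double-checked per-entry build: threads asking for *different* trees
 * never contend with each other, avoiding the global serialization (and
 * the deadlock) that holding a single cache_rwlock during builds causes. */
BVHTree *bvh_cache_ensure(BVHCacheEntry &entry, const Mesh *mesh,
                          BVHTree *(*build)(const Mesh *mesh))
{
  BVHTree *tree = entry.tree.load(std::memory_order_acquire);
  if (tree != nullptr) {
    return tree;
  }
  std::lock_guard<std::mutex> lock(entry.build_mutex);
  tree = entry.tree.load(std::memory_order_relaxed);
  if (tree == nullptr) {
    tree = build(mesh);
    entry.tree.store(tree, std::memory_order_release);
  }
  return tree;
}
```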