Hello, I've been trying to use the unified binary for accelerators in the 13.9 release with OpenACC and I'm running into some confusing results. While the GPU code is generated and works fine, the resulting CPU code is serial. Is that intentional? Specifically, it seems odd that the compiler wouldn't also generate the OpenMP code needed for the CPU to run in parallel, since it currently prints an error when both "acc parallel" and "omp parallel <for/do>" are placed on the same construct.

If not, perhaps it's the way I'm building our codes; the make output is below with the full commands and output.

Based on the user guide example, which shows that when the unified binary is produced -Minfo prints two statements, one each for the GPU and the CPU devices, I'm thinking it's just not generating what I want. Is there perhaps an option I'm missing?

While the GPU code is generated and works fine, the resulting CPU code is serial. Is that intentional?

"-ta=host" targets a serial host and is meant for portability. We discussed internally about targeting a multi-core system as if it were an accelerator, but it hasn't been added to our road-map as of yet. Once support for AMD is out (Open beta will be in 13.10), we'll decide on our next target. I'll let management know you're looking for multi-core CPU support.

In the meantime, is there some way to use both omp and acc parallel constructs in conjunction with it to get that effect? I mean just supplying both, such that the host version uses the OMP directives and the NVIDIA version uses the ACC directives. My attempts resulted in errors, but it seems like it would be a simple way to at least let users do it when necessary.
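For example, something along these lines is what I tried (a made-up saxpy loop, just to show the pattern; the two directives on the same construct are what trigger the error):

/* Hypothetical example: the hope was that the GPU version would use the
   acc directive and the host version would use the omp directive. */
void saxpy(int n, float a, float *restrict x, float *restrict y)
{
    #pragma acc parallel loop
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}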

In the meantime is there some way to use both omp and acc parallel constructs in conjunction with it to get that effect?

You can use OMP and ACC in combination to utilize multiple GPUs (or a single GPU if your device supports Hyper-Q). Basically, you're adding an additional layer of parallelism above OpenACC. So if your algorithm can take advantage of this extra layer (or needs it due to memory limits on a single GPU), then it's a good way to go.
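A rough sketch of the pattern (assumed setup, not tested here) would be one OpenMP thread per GPU, with each thread binding to its own device and working on its own slice of the data:

#include <omp.h>
#include <openacc.h>

void scale_on_gpus(int n, float a, float *x)
{
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    if (ngpus == 0) return;                /* no GPU found; handle elsewhere */

    #pragma omp parallel num_threads(ngpus)
    {
        int tid   = omp_get_thread_num();
        int chunk = (n + ngpus - 1) / ngpus;
        int start = tid * chunk;
        int end   = (start + chunk > n) ? n : start + chunk;

        acc_set_device_num(tid, acc_device_nvidia);  /* bind this thread to GPU 'tid' */

        #pragma acc parallel loop copy(x[start:end-start])
        for (int i = start; i < end; ++i)
            x[i] *= a;
    }
}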

In my opinion, what you don't want to do is try to have part of the work done by the GPU and part by a multi-core CPU via OpenMP. It's certainly possible to do, but it makes things very complex since you then need to balance work performed by different resources (it becomes a hard scheduling problem unless you know your exact workload and compute resources).

Personally, I much prefer using MPI over OpenMP for multi-GPU programming. It's much easier to write since the domain decomposition is natural and data is already discrete between MPI processes. In OpenMP, the domain decomposition is typically done by the compiler and global data is often shared. When putting OpenACC under OpenMP, you now need to have each OpenMP thread manage its GPU and that GPU's data. Very possible to do, but it isn't how OpenMP is typically programmed.
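A minimal sketch of the device selection with MPI (the actual computation and halo exchange are omitted) could look like this:

#include <mpi.h>
#include <openacc.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ngpus = acc_get_num_devices(acc_device_nvidia);
    if (ngpus > 0)
        acc_set_device_num(rank % ngpus, acc_device_nvidia);  /* one GPU per rank */

    /* ... each rank runs its OpenACC kernels on its own device and
       exchanges boundary data with MPI as usual ... */

    MPI_Finalize();
    return 0;
}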

Also, if you did decide to distribute work across both multi-core and GPUs, I'd still recommend using MPI at a higher level. Each MPI process would then either run its work on the GPU using OpenACC or on the multi-core via OpenMP. You'd still have a scheduling problem, but it would be a bit easier to manage.
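Roughly, each rank would dispatch based on whether it sees a GPU. In the sketch below, the Domain type and the compute_* routines are just placeholders for your own OpenACC and OpenMP code paths:

#include <openacc.h>

void compute_domain(Domain *d)
{
    if (acc_get_num_devices(acc_device_nvidia) > 0)
        compute_openacc(d);   /* loops annotated with #pragma acc parallel loop */
    else
        compute_openmp(d);    /* the same loops annotated with #pragma omp parallel for */
}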

My question goes in the same direction. I want to ship my application as a single binary to my customer, but I don't know whether they will execute it on a multicore host with or without an attached GPU. Previously I was using a hybrid MPI-OpenMP implementation, i.e. domain decomposition with MPI and, within each domain, OpenMP parallel for loop constructs to parallelize the work.
Now, if the customer runs the application on a cluster where each node has a GPU attached, I would like to disable the OpenMP parallel for and instead offload the work to the GPU.

I understand that the if-clauses don't help too much at compile time, and the compiler probably cannot figure out that I want either OpenMP or OpenACC. As you mentioned before, there are good reasons to use both at the same time, e.g. for handling multiple GPUs.
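For example, as far as I can tell the acc if-clause only selects at run time, and when it is false the region just runs serially on the host rather than falling back to OpenMP (sketch below, with a made-up loop):

#include <openacc.h>

void saxpy(int n, float a, float *x, float *y)
{
    int use_gpu = acc_get_num_devices(acc_device_nvidia) > 0;

    /* if(use_gpu) is evaluated at run time; when it is false the loop
       executes on the host, but serially, not as an OpenMP parallel for */
    #pragma acc parallel loop if(use_gpu) copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}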

Do you know of any workaround to achieve my goal, or will I always have to produce two binaries, i.e. one compiled with -mp only and the other compiled with -acc only?