The Khronos Group - a non-profit industry consortium to develop, publish and promote open standard, royalty-free media authoring and acceleration standards for desktop and handheld devices, combined with conformance qualification programs for platform and device interoperability.


How to avoid double allocation on CPU

I am a newbie to OpenCL, and I am using my CPU with OpenCL to process remote sensing images. According to the OpenCL documentation, we need to allocate a host buffer and transfer the data to device memory. But when using OpenCL on a CPU, host and device memory are the same. Is there any way to avoid this double allocation, so that the buffer allocated on the host can be used directly in the OpenCL kernel?

Re: How to avoid double allocation on CPU

I have a similar problem. I am using an Intel CPU with the Intel OpenCL SDK (version 1.5). I am trying to avoid CPU-to-CPU memory copies, since they would be a waste of resources. In my case I assume that the data is allocated in a different section of the program - I don't want to modify that part.

What I currently have is found below:

For each input and output array, I create a new pointer as follows using CL_MEM_USE_HOST_PTR:

Both of these calls seem to take significant time and it appears some actual copying is involved.

Furthermore, I've tried using the 'clEnqueueMapBuffer' and 'clEnqueueUnmapMemObject', instead of the 'clEnqueueReadBuffer', but that doesn't seem to work (or I'm not doing it at the right place or with the right arguments).

I've also read something about alignment of the original memory allocation, but I don't want to modify that part. Is it possible at all to omit the CPU-CPU copy without modifying the original memory allocation scheme?

Re: How to avoid double allocation on CPU

You have to use either read buffer or map/unmap buffer to get the data back. ReadBuffer obviously has to copy the data since you provide it with a pointer, and the api must honour the same semantics which make it work on all devices.

Try using enqueueMapBuffer instead.

Note that the api provides a certain view of the world but internally even the cpu driver could do things very differently so you can't really expect to be able to process in-place.

And what do you mean by 'significant time'? If copying the data twice takes a significant amount of time versus the algorithm execution, then opencl simply won't provide you any benefit and worrying about it is pointless.

Do not base your timing on a single run; the first-run time (which cannot be changed, as it depends on the operating system) will probably be very large and therefore meaningless.
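The timing advice above can be sketched in plain C. This is only a minimal illustration, not code from the thread: `mean_seconds`, `work_fn` and `count_call` are hypothetical names, and `clock()` reports processor time rather than wall-clock time, so a wall-clock timer would be the better choice when the device code runs on several threads.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Signature of the workload being timed; `ctx` is an opaque user pointer. */
typedef void (*work_fn)(void *ctx);

/*
 * Calls `fn` `warmup` times without timing (to absorb first-run costs),
 * then returns the mean processor time in seconds over `runs` timed calls.
 */
double mean_seconds(work_fn fn, void *ctx, int warmup, int runs)
{
    assert(fn != NULL && warmup >= 0 && runs > 0);

    for (int i = 0; i < warmup; ++i)
        fn(ctx);                        /* result discarded: warm-up only */

    clock_t start = clock();
    for (int i = 0; i < runs; ++i)
        fn(ctx);
    clock_t stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC / runs;
}

/* Hypothetical workload used only for illustration: counts its own calls. */
void count_call(void *ctx)
{
    ++*(int *)ctx;
}
```

In a real measurement, `fn` would wrap the enqueue, the wait for completion and the read-back that are being compared.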

Re: How to avoid double allocation on CPU

Originally Posted by notzed

You have to use either read buffer or map/unmap buffer to get the data back. ReadBuffer obviously has to copy the data since you provide it with a pointer, and the api must honour the same semantics which make it work on all devices.

Try using enqueueMapBuffer instead.

Thank you for your response.

As I've said in the earlier message, I have tried the map/unmap scheme, but it doesn't seem to work for me. I have one input array and one output array, both of the same size. My complete code now looks like this (using map/unmap):

Originally Posted by notzed

Note that the api provides a certain view of the world but internally even the cpu driver could do things very differently so you can't really expect to be able to process in-place.

So what you are saying is that with the map/unmap scheme and the 'CL_MEM_USE_HOST_PTR' I might be able to omit the memory copies, but it is not certain that it would work for every case / every machine / every OpenCL driver instance. If that is true though, I should still get correct results with the above code, right? The cause might be that the allocated memory is not properly aligned to 128-byte boundaries, for example (as Intel states in their documents).

Originally Posted by notzed

And what do you mean by 'significant time'? If copying the data twice takes a significant amount of time versus the algorithm execution, then opencl simply won't provide you any benefit and worrying about it is pointless.

I understand the problem with memory-copy versus kernel execution time. However, what I'm working on is just a toy example, in which I have 16M integers in an input array and the same amount in an output array. My kernel runs with 16M threads each incrementing the input by 3. What I'm trying to do here is to omit the CPU-CPU memory copies, because I believe that should be possible somehow!

What I mean by significant time is that it takes more time than the kernel (which also reads/writes from and to the same memory), while it should take a small number of CPU cycles in the ideal case, since it should simply create a new pointer and point it to the correct place in memory!

Originally Posted by notzed

Do not base your timing on a single run; the first-run time (which cannot be changed, as it depends on the operating system) will probably be very large and therefore meaningless.

I understand this. I run some memory copy and a kernel before I start the useful part (and the timers).

Re: How to avoid double allocation on CPU

The problem is in the last line. There's no guarantee that the most up-to-date data will be in original_output once you have destroyed device_output. OpenCL only guarantees that you will see the most up-to-date bits in the period when the buffer is mapped.

Re: How to avoid double allocation on CPU

Originally Posted by david.garcia

Code :

...

The problem is in the last line. There's no guarantee that the most up-to-date data will be in original_output once you have destroyed device_output. OpenCL only guarantees that you will see the most up-to-date bits in the period when the buffer is mapped.

In other words, you would need to do this:

Code :

...

OK. That would mean I have to use the 'clEnqueueUnmapMemObject' and the 'clReleaseMemObject' at the very end of my program, just before I use 'free' on the original arrays (if it was malloc-ed). What will OpenCL do when the unmap function is called on the array? It will not free-up the memory space allocated for the original array, will it? This also constrains larger projects that split over multiple files. It would be nice to have the OpenCL part in a single file and not have to use OpenCL API calls somewhere else to clean-up a pointer.

Also, as far as I understand, I don't have to map/unmap the input array, right?

I've tried your suggestion by simply printf-ing a value from the 'original_output' array, but it gives 0 (which it shouldn't). Only after re-enabling the 'clEnqueueReadBuffer' it prints the correct value. I am printing straight after the kernel has ended.

Re: How to avoid double allocation on CPU

Originally Posted by CNugteren

So what you are saying is that with the map/unmap scheme and the 'CL_MEM_USE_HOST_PTR' I might be able to omit the memory copies, but it is not certain that it would work for every case / every machine / every OpenCL driver instance. If that is true though, I should still get correct results with the above code, right?

Obviously a discrete gpu will need to copy the data across the pci bus anyway: either on an 'as needs' basis as the cpu requests it, or at map/unmap time via DMA - which options are available depends on the hardware, driver and os.

The specification clearly states that 'implementations are allowed to cache the buffer contents' - i.e. they may be copied anyway. You need to read the specific vendor documentation for details of their implementation - i've not read the intel stuff but the amd stuff has a fair bit about their gpu stuff and how to get 'zero copy' access.

And no, you shouldn't expect correct results above: you're using the api incorrectly, as David followed up on. More on that in another post.

Originally Posted by CNugteren

The cause might be that the allocated memory is not properly aligned to 128-byte boundaries, for example (as Intel states in their documents).

If the intel SDK states that buffers must be aligned (for zero copy?), and you're not aligning them: you have to expect that the driver will need to copy the data. It's as simple as that.

If their CPUs work much faster on aligned data, then it would usually be a win to copy the whole buffer to align it, under the assumption that you're going to be doing more than a trivial amount of work, and accessing that data more than once on the 'device side'. For all you know it could be copying it regardless: e.g. to match the workgroup topology more closely in an SMP context.

If you are doing a trivial amount of work and accessing the data only once to/from the host (i.e. outside opencl), it will be faster just using the cpu in all cases, so such tests should not be used as a guide for opencl usage.
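For reference, the toy workload described earlier in the thread (adding 3 to each integer of an input array) could be written as a plain CPU loop like the one below. This is a sketch only; the function name `add_three` is hypothetical and does not come from the original posts. Timing it gives the baseline that any OpenCL version, copies included, would have to beat.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Plain C equivalent of the toy kernel: out[i] = in[i] + 3.
 * `in` and `out` must each hold at least `n` elements.
 */
void add_three(const int *in, int *out, size_t n)
{
    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] + 3;
}
```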

Originally Posted by CNugteren

What I mean by significant time is that it takes more time than the kernel (which also reads/writes from and to the same memory), while it should take a small number of CPU cycles in the ideal case, since it should simply create a new pointer and point it to the correct place in memory!

Well, 'significant time' has no meaning if you're not doing any work - i.e. it isn't significant no matter its absolute value. OpenCL has a lot more overhead than a simple function call: that overhead is worth it (and insignificant) when you're doing a lot of work, but will still be there when you're doing simple work.

Don't get caught up on worrying about memory copies before you've even started ...

Re: How to avoid double allocation on CPU

Originally Posted by CNugteren

Originally Posted by david.garcia

Code :

...

The problem is in the last line. There's no guarantee that the most up-to-date data will be in original_output once you have destroyed device_output. OpenCL only guarantees that you will see the most up-to-date bits in the period when the buffer is mapped.

In other words, you would need to do this:

Code :

...

OK. That would mean I have to use the 'clEnqueueUnmapMemObject' and the 'clReleaseMemObject' at the very end of my program, just before I use 'free' on the original arrays (if it was malloc-ed). What will OpenCL do when the unmap function is called on the array? It will not free-up the memory space allocated for the original array, will it? This also constrains larger projects that split over multiple files. It would be nice to have the OpenCL part in a single file and not have to use OpenCL API calls somewhere else to clean-up a pointer.

On your first question: Not really, unless you're only generating one result for the whole application.

It must be mapped/unmapped every time you access the data. This makes sure you do not have stale data either on the host or device when either side needs it (it's pretty simple to understand why this is needed). You cannot enqueue tasks which write to a mapped buffer anyway: they need to be unmapped before being used. i.e. your proposed suggestion is invalid, and when taking into account this bracketing the opencl code will be grouped and not scattered all over the place.

opencl 1.1 spec 5.4.2.1: "Behaviour of opencl commands that access mapped regions of a memory object" has the (simple) rules.

Originally Posted by CNugteren

Also, as far as I understand, I don't have to map/unmap the input array, right?

You can do whatever you want. But if it's only being initialised with one set of fixed data either using USE_HOST_PTR or COPY_HOST_PTR (if you want to free the pointer straight away) are probably the easiest.

Originally Posted by CNugteren

I've tried your suggestion by simply printf-ing a value from the 'original_output' array, but it gives 0 (which it shouldn't). Only after re-enabling the 'clEnqueueReadBuffer' it prints the correct value. I am printing straight after the kernel has ended.

Inside the map buffer/unmap buffer bracket?

I noticed in the spec that when using MapBuffer, with a CL_MEM_USE_HOST_PTR buffer, it will make sure the data is in the buffer you provided. So this will probably mean a copy: for a gpu it will because you cannot map device memory into the application heap, and for a cpu device it might because of alignment restrictions (depending on what the specific vendor documentation says).

What you possibly want (i'm speculating here as i haven't done this so far): just let opencl do the allocations, you know then that whatever it chooses will be optimal for the processing stages (which is the important bit). Then use map to access the area it has allocated and write directly to that - and hope that it does the best it can - i.e. zero copy. If you have control over the allocations it just changes the way you allocate/access the memory.

But if you need to interact with another library in which you cannot control the allocation, it's probably going to have to be copied anyway (in general at least), so it's up to you whether you want to do it yourself or let opencl do it.

Re: How to avoid double allocation on CPU

First of all, thank you for your very long and thorough responses, it is greatly appreciated.

Originally Posted by notzed

...

Obviously a discrete gpu will need to copy the data across the pci bus anyway: either on an 'as needs' basis as the cpu requests it, or at map/unmap time via DMA - which options are available depend on the hardware, driver and os.

The specification clearly states that 'implementations are allowed to cache the buffer contents' - i.e. they may be copied anyway. You need to read the specific vendor documentation for details of their implementation - i've not read the intel stuff but the amd stuff has a fair bit about their gpu stuff and how to get 'zero copy' access.

And no you shouldn't expect correct results above: you're using the api incorrectly, as David followed up on. More on that in another post.

...

If the intel SDK states that buffers must be aligned (for zero copy?), and you're not aligning them: you have to expect that the driver will need to copy the data. It's as simple as that.

I understand. I'm targeting an Intel CPU only, no GPUs involved at any point. I'm therefore also reading through the Intel OpenCL document, in particular the guide found here: http://software.intel.com/file/39189. In particular, I am trying to follow section 3.1 of this document, where they give the programmer two choices: either you use OpenCL's memory allocation system and map/unmap, or you make use of the 'CL_MEM_USE_HOST_PTR' flag:

Originally Posted by http://software.intel.com/file/39189 section 3.1

If your application uses a specific memory management algorithm, or if you want
more control over memory allocation, you can allocate a buffer and then pass the
pointer at clCreateBuffer time with the CL_MEM_USE_HOST_PTR flag. However, the
pointer must be aligned to a 128-byte boundary. Otherwise, the framework may
perform memory copies.

Although my originally allocated arrays might not be properly aligned, the framework should then perform the copies and still give me correct results.
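Whether an existing pointer meets the 128-byte requirement quoted above can be checked before it is passed to clCreateBuffer. The sketch below is an illustration under assumptions, not part of the Intel SDK: `is_aligned`, `alloc_for_zero_copy` and `ZERO_COPY_ALIGNMENT` are hypothetical names, and `aligned_alloc` requires a C11 library. The allocation helper only applies where the allocation code may be changed; for memory allocated elsewhere, the check alone tells you whether copies should be expected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Alignment taken from the Intel guide quoted in this thread. */
#define ZERO_COPY_ALIGNMENT 128u

/* Returns 1 if `ptr` is a multiple of `alignment` (a power of two), else 0. */
int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/*
 * Allocates at least `bytes` bytes aligned to ZERO_COPY_ALIGNMENT.
 * C11 aligned_alloc expects the size to be a multiple of the alignment,
 * so the size is rounded up first. Release the result with free().
 */
void *alloc_for_zero_copy(size_t bytes)
{
    size_t rounded = (bytes + ZERO_COPY_ALIGNMENT - 1)
                     / ZERO_COPY_ALIGNMENT * ZERO_COPY_ALIGNMENT;
    return aligned_alloc(ZERO_COPY_ALIGNMENT, rounded);
}
```

A pointer that fails the check can still be used with CL_MEM_USE_HOST_PTR; according to the quoted guide the framework may then perform memory copies.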

Originally Posted by notzed

If their CPUs work much faster on aligned data, then it would usually be a win to copy the whole buffer to align it, under the assumption that you're going to be doing more than a trivial amount of work, and accessing that data more than once on the 'device side'. For all you know it could be copying it regardless: e.g. to match the workgroup topology more closely in an SMP context.

If you are doing a trivial amount of work and accessing the data only once to/from the host (i.e. outside opencl), it will be faster just using the cpu in all cases, so such tests should not be used as a guide for opencl usage.

...

Well, 'significant time' has no meaning if you're not doing any work - i.e. it isn't significant no matter its absolute value. OpenCL has a lot more overhead than a simple function call: that overhead is worth it (and insignificant) when you're doing a lot of work, but will still be there when you're doing simple work.

Don't get caught up on worrying about memory copies before you've even started ...

I understand all that, I do have some experience programming CUDA and OpenCL for GPUs. The only thing I'm trying to do now is to omit the memory copies on the CPU.

Originally Posted by notzed

Originally Posted by CNugteren

OK. That would mean I have to use the 'clEnqueueUnmapMemObject' and the 'clReleaseMemObject' at the very end of my program, just before I use 'free' on the original arrays (if it was malloc-ed). What will OpenCL do when the unmap function is called on the array? It will not free-up the memory space allocated for the original array, will it? This also constrains larger projects that split over multiple files. It would be nice to have the OpenCL part in a single file and not have to use OpenCL API calls somewhere else to clean-up a pointer.

On your first question: Not really, unless you're only generating one result for the whole application.

It must be mapped/unmapped every time you access the data. This makes sure you do not have stale data either on the host or device when either side needs it (it's pretty simple to understand why this is needed). You cannot enqueue tasks which write to a mapped buffer anyway: they need to be unmapped before being used. i.e. your proposed suggestion is invalid, and when taking into account this bracketing the opencl code will be grouped and not scattered all over the place.

opencl 1.1 spec 5.4.2.1: "Behaviour of opencl commands that access mapped regions of a memory object" has the (simple) rules.

Also, as far as I understand, I don't have to map/unmap the input array, right?

You can do whatever you want. But if it's only being initialised with one set of fixed data either using USE_HOST_PTR or COPY_HOST_PTR (if you want to free the pointer straight away) are probably the easiest.

I've tried your suggestion by simply printf-ing a value from the 'original_output' array, but it gives 0 (which it shouldn't). Only after re-enabling the 'clEnqueueReadBuffer' it prints the correct value. I am printing straight after the kernel has ended.

Inside the map buffer/unmap buffer bracket?

I noticed in the spec that when using MapBuffer, with a CL_MEM_USE_HOST_PTR buffer, it will make sure the data is in the buffer you provided. So this will probably mean a copy: for a gpu it will because you cannot map device memory into the application heap, and for a cpu device it might because of alignment restrictions (depending on what the specific vendor documentation says).

This is what I understand, please correct me if I'm wrong:
* If I have a memory object which is created using 'CL_MEM_USE_HOST_PTR', it is meant to be accessed by the accelerator only (read/write).
* After I've created such an object, I should not access the host version of it, as it contains undefined data.
* If I map the memory object, it is accessible by the host from that point on (either for read or write, specified as a flag to the API call), but the accelerator should not access it anymore, as it contains undefined data from accelerator perspective.
* If I unmap the memory object, it goes back to the state it previously was (accessible by the accelerator, not by the host).
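The four points above can be written down as a toy state machine. This is only a model of the understanding stated in the list, not OpenCL code and not the specification's exact rules: `model_buffer` and the `model_*` functions are hypothetical, no OpenCL calls are made, and real implementations also allow things this sketch ignores (for example several simultaneous mappings).

```c
#include <assert.h>
#include <stddef.h>

/* Which side may currently touch the buffer contents in this toy model. */
typedef enum { OWNER_DEVICE, OWNER_HOST } owner_t;

/* Hypothetical stand-in for a buffer created with CL_MEM_USE_HOST_PTR. */
typedef struct {
    owner_t owner;
} model_buffer;

/* After creation the buffer belongs to the device (first two points). */
void model_create(model_buffer *b)
{
    assert(b != NULL);
    b->owner = OWNER_DEVICE;
}

/* Mapping hands the buffer to the host; returns 0 if it is already mapped. */
int model_map(model_buffer *b)
{
    if (b->owner == OWNER_HOST)
        return 0;
    b->owner = OWNER_HOST;
    return 1;
}

/* Unmapping hands the buffer back to the device; returns 0 if not mapped. */
int model_unmap(model_buffer *b)
{
    if (b->owner == OWNER_DEVICE)
        return 0;
    b->owner = OWNER_DEVICE;
    return 1;
}

/* Kernels should only be enqueued while the buffer is not mapped. */
int model_can_enqueue_kernel(const model_buffer *b)
{
    return b->owner == OWNER_DEVICE;
}

/* The host should only read or write while the buffer is mapped. */
int model_host_may_access(const model_buffer *b)
{
    return b->owner == OWNER_HOST;
}
```

Under this model, a printf of the output data is only meaningful between a successful `model_map` and the matching `model_unmap`.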

Therefore, I print inside this map/unmap region (and at various other places), but it does not seem to work. I've made a link to the full version of the code here: http://dl.dropbox.com/u/26669157/opencl-cpu.tar.gz (I'm not asking you to go through the code, but maybe somebody is interested anyway - the printf is in line 247 of the example6_host.c file).

Originally Posted by notzed

What you possibly want (i'm speculating here as i haven't done this so far): just let opencl do the allocations, you know then that whatever it chooses will be optimal for the processing stages (which is the important bit). Then use map to access the area it has allocated and write directly to that - and hope that it does the best it can - i.e. zero copy. If you have control over the allocations it just changes the way you allocate/access the memory.

But if you need to interact with another library in which you cannot control the allocation, it's probably going to have to be copied anyway (in general at least), so it's up to you whether you want to do it yourself or let opencl do it.

I do understand that writing everything (including the mallocs) using OpenCL is the best choice, but I'm not allowed to touch the existing part. At first I wanted to get zero-copy access anyway, but that doesn't seem to be working, does it? Now I'm just trying to get the map/unmap part working, so that _if_ the original mallocs meet certain requirements (e.g. aligned to 128 bytes for the Intel SDK) the copy would be omitted. However, that just doesn't seem to work right now.

A small question to end with: why do I want to 'unmap' the object? After my OpenCL kernel has run I will do a lot of computations on the resulting data outside of OpenCL, with no more kernels. Ideally I just want to 'map' the object directly after the kernel has ended to give it back to the CPU and never 'unmap' it. That doesn't seem to be possible, as the 'map' requires you to specify either read or write.