Data persistence between offloads

Data persistence between offloads

Is it possible to have data persistence between offloads in different functions without using global variables? How do you let the MIC know to reuse the same memory? I know about using "nocopy" in the offload pragma, but I only seem to be able to get this to work when using the same variable name as in the previous offload (and not, for example, when using a pointer to the same memory in the host).

If my question needs more clarification, below is a simple stand-alone program that illustrates my question. It has three offloads. The intention is to initialize an array in the first offload, do some computation in the second offload without copying any data to or from the MIC, and read back the values (after additional modification) in the third offload. I want to time just the middle offload, in order to get just the time for computation without any data transfers to or from the card. Presumably I could add a chain of such operations before eventually transferring the data back to the host.

There are two #define's at the top of the program to illustrate three cases:1) When none of the #defines are uncommented, all three offloads are in the main function, and the middle offload takes about 0.2 seconds.2) When just the #define FUNCTION_CALL is uncommented, the middle offload occurs in another function, with the "nocopy" clause, and it results in an error: "offload error: process on the device 0 was terminated by signal 11".3) When FUNCTION_CALL and GLOBAL_ARRAY are uncommented, the middle offload occurs in another function, but it uses the same global variable for the array as the main function (rather than a pointer passed to it as a parameter). This has no error but the middle offload is much slower than Case 1 (2.3 seconds).

It seems a little awkward to have to use global variables (or confine all offloads the same function) in order to let the MIC know to reuse memory from a previous offload. I am also unclear as to why the offload is slower in another function even when the global variable is used. Of course, I may be doing something wrong, or there may be a way around this that I have not realized.

I was also recently trying to implement data persistence on the coprocessor in an offload from a function. The only method that I was successful with is declaring a static array in the scope of the function to hold the data, and marking it with "__attribute__((target(mic)))" (or __declspec(target(mic)) ). It actually makes sense that "nocopy" fails for non-static and non-global pointers: the variable that holds persistent data must survive between multiple function calls. So static or global variables seem like the only way to go. Here is a (dirty) code that illustrates the approach that I ended up with:

Your code works well for what it was intended to do: allow you to call the same function consecutively with the same input data, reusing the MIC memory. However, it still seems subject to the constraint that MIC memory can only be reused when the offload pragma is referenced using the same literal variable name. Is this an unavoidable constraint? Is there no way to refer to an address in the MIC memory?

What if I wanted to call your function with several different input arrays, multiple times each, interspersed with each other? What if I wanted to allocate the MIC memory in one function and then reuse it in another?

Below is another simple program to illustrate my question another way. If I have "#define USE_ARRAY array1", it works correctly. However, if I have "#define USE_ARRAY array2", it crashes on the offload, even though array1 and array2 are set to be exactly the same thing (at least in host memory). Thus, it seems that MIC memory is only associated with a variable name, rather than with a variable value. I understand what you were saying about how the variable needs to be global or static to maintain persistence (as opposed to a stack variable in a function), but I don't see why it does not seem possible to refer to MIC memory allocated using the name of one global variable with another global variable. I guess there must be a good reason for this, but it seems to make designing clean, efficient code almost impossible.

Based on the "Local Pointers Versus Pointers Used Across Offloads" section, I got the modified version of the code from my previous post to work by changing the offload that reuses previously allocated data to be an in with length 0 rather than a nocopy of length n. I don't fully understand this yet, but maybe I am getting on the right track. If I comment out the line "array2 = array1;", it crashes, so apparently maybe there is a connection between the value of a pointer and the memory with which it is associated on the MIC.

Anyway, I will continue reading that link and experimenting, but if anyone thinks they can clear up my confusion more quickly, please post.

I am sorry for all this confusion. The host pointer value is used as a key in a table of host-coprocessor pointer data associations. The association is not by name, but by pointer value. If you had two pointer variables on the host with the same value, then, yes, they will be mapped to the same allocated memory on the coprocessor.

This statement simply created the pointer 'array2' on the coprocessor but did not update it's value to point the previously allocated array. By using in with length(0), you told the compiler to create the pointer 'array2' but also to update (refresh) the pointer value to point to the previously allocated memory address. That is why your second code worked.

Hopefully the following code will further clarify the ultility of in with length(0):

void f()

{

int * p = malloc(…);

…

// 100 elements are transferred to MIC, from p on the CPU

// In subsequent offloads the MIC memory for the transferred data can be located on the CPU using the value of p

#pragma offload_transfer in(p[0:100] alloc_if(1) free_if(0))

g(p);

}

void g(int *q)

{

// This pragma does not do data transfer; previously sent data is reused

// That’s why the length is 0, meaning don’t send data

// However, we used the “in” clause instead of “nocopy”, to update the MIC pointer q with the previously allocated memory address

// Because p (at the time of MIC memory allocation) and q in this offload have the same value, the MIC memory is located

Chris, in my code I had to call the function with multiple arrays, all of which had to be persistent. In order to pull this off, I had a single pointer to hold all my data on the card in a single big chunk. I did memory management within that big chunk of data "manually". That is, when I had to use one of the data sets, I used a local pointer in the MIC code to point to the respective section of the big array.

Thanks, Sumedh and Kevin! It is good to know that this task can be done without the acrobatics that I described.

I'm trying to add MIC support to a large DFT code that does a lot of memory management between main memory and GRAM (currently in CUDA). I couldn't glean from the documentation so far whether this was possible to do in an organized manner in offload mode - I was almost considering doing it over MPI with native mode instead. This thread has been particularly useful for me to figure out how to achieve this correctly - thanks to all involved!

The code is possible assuming you specify that the offload runtime transfers the data pointed to by 't', but it is incorrect in a few ways.

(int *)malloc(3) is an unusual length

You need to specify something like inout(t:length(x)) but length(x) doesn't make much sense with respect to your malloc(3)

t=(int*)malloc(3) creates a memory leak on the card since the heaps on the host and the coprocessor are separate and unique

I believe what you want to do is to malloc and persist a specified amount of data on the card. To do that, first malloc the data on the host, and then pass the pointer and the amount of data to the offload statement. For example,

Outside of minor syntax errors, this will (1) allocate an array on the host, (2) allocate that same amount on the coprocessor and copy the data over, (3) persist the data to the next offload. You can find examples of this under the Samples directory for the C++ compiler.

What Taylor suggests seems to be only working when the offloads are being called inside one function. If the offloads are being called from different C function a segmentation error is returned. I experimented this with the following code:

This returns a segmentation fault. However, if I had the second offload (in MMult_mic) done inside alloc_mic function (after its offload is done) there would be no problem. So the question is what is the problem? I have included the source file as an attachment.

In the first case, when we specify alloc_if(0) free_if(0), why will the compiler assume that we are resizing (when we explicitly say no-alloc). In fact the second syntax suggests that we want the compiler to copy the array from CPU to Phi using the previously allocated size (since we dont specify a size but we do say in(...) ).

Intel please fix this messed up notation!! Its been over a year and there are still many such issues!!

@Dhairya - I posted in your other thread here: http://software.intel.com/forums/topic/499257. We apologize for the incorrect examples in the 13.1 documentation. I believe we have corrected those that appeared in the 13.1 C++ User Guide in our 14.0 (Composer XE 2013 SP1) release. Is it possible you might have referenced the 13.1 version of the guide?

I would highly appreciate it if you would clarify when/where we should use nocopy? According to Taylor's post above nocopy should have worked but it is not working. So are you suggesting that in Taylor's code above one should have used in with length(0)? Then what is the point of this nocopy clause?

I suspect that there is a bug in intel's nocopy implementation. The reason I am saying this, is that if two pragmas are in the same function nocopy works but when the second pragma is called, it fails to work.

Since your pointer is on the host stack (passed as argument to the function MMult_mic) you need to update the corresponding stack pointer variable on the card with memory which was allocated on the host in an earlier pragma directive.

In Taylor’s post, nocopy is correct because he shows all the pragmas appearing within the same local scope. In other words, the first pragma with in() updates the pointer on the target AND allocates memory AND transfers the current data; therefore, the second, third and any subsequent pragmas within the same scope need not update the target’s copy of the pointer; therefore, he can use nocopy as he did in the second pragma which was simply intended to reuse the data that was transferred to the card by the previous pragma. His nocopy does not require length() and in his context (with alloc_if(0)) it is ignored.

With respect to what Marco was asking, the key restriction to explicit allocation within offloaded code is that memory allocated by the user using malloc or some such API cannot participate in the data transfer pragmas. For the pragmas to be usable, the allocation must be done using the pragmas also. In other words, Marco would have to transfer the values of “p” to the card and memcpy into “t” himself. I discussed this recently related to another user’s post here: http://software.intel.com/en-us/forums/topic/499631

Your example uses pointers within different scopes; therefore, as Ravi indicated, when you change scope you must update the new scope’s corresponding stack pointer variable on the target; therefore the first reference to the pointer in the new scope must not be a nocopy. It must be an in() with an alloc_if(0) and either a non-zero or zero length. Within that same scope, after this in() pragma you would then use nocopy as in Taylor’s code.

So, the use of nocopy depends on the scope and what is needed in terms of allocation, no data transfer, and/or pointer refresh. The FAQ in post #5 contains a discussion about the Local Pointers Versus Pointers Used Across Offloads.

As a final note, all the focus/discussion here relates to local pointers. For global pointers, once created/updated, there is no need for in() with length(0) when used within different functions since the value is retained in global(static) memory on the target.

So Ravi, what you are saying is that for array we should ALWAYS use in(...) because the CPU pointer will always be sent from CPU to MIC and never the other way. We can use in(...) even when in the same scope because the CPU pointer value is the same.

So what do we do when we need to transfer data from a persistent MIC array back to the CPU? From what you said out(...) will just over write the CPU pointer variable. Or should we first do in(A: length(0) REUSE) and then out(A: length(N) REUSE)