Problem with parallelisation with Intel Cilk Plus

I am working on parallelising an MJPEG encoder using the Intel Cilk Plus framework. I am getting data races when using the "cilk_for" construct in this piece of code:

#include <stdio.h>
#include <stdlib.h>
#include <cilk/cilk.h>

TBlocks block; TPackets stream;

cilk_for (int t = 0; t < NumFrames; t++) {
    for (int p = 0; p < VNumBlocks; p++) {
        for (int k = 0; k < HNumBlocks; k++) {
            mainVideoIn(&block);
            mainDCT(&block, &block);
            mainQ(&block, &block);
            mainVLE(&block, &stream);
        }
    }
}

When I checked the code with "Memory inspector", I found that the data races are related to some global variables such as i, j, and isFirst in the mainVideoIn() function. So I tried using a "cilk holder" to give each thread its own view of each variable, but the problem is still there. Could somebody help me solve this issue, please? Any clue would be appreciated.

I have attached two files: one named "mainvideoIn", which contains the function implementation, and the other "videoin.h", which is the library used by this function.

i, j, and isFirst are global variables, along with compY, compU, and compV. Also, as you may notice in "Videoin", I have tried to replace these variables with cilk holders, but I don't know whether this is the correct way to use holders, especially array notation with holders.

In the code fragment above, the "stream" variable looks rather suspicious: are you sure it is a read-only parameter, as opposed to being modified by some code inside the cilk_for loop? Could the race be related to file I/O, i.e., either reading in the input or writing out the output?
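To illustrate the concern about "block" and "stream", here is a minimal sketch in plain C++ (no Cilk headers, so it compiles anywhere) in which each iteration owns its block and each frame writes only to its own output slot. The types Block and Packets and the functions fillBlock, emitBlock, and encodeFrames are hypothetical stand-ins, not the real TBlocks, TPackets, or main* functions; the sketch also assumes the per-frame outputs can be concatenated in frame order afterwards.

```cpp
#include <vector>

// Hypothetical stand-ins for the thread's TBlocks and TPackets types.
struct Block { int data[64]; };
struct Packets { std::vector<int> bytes; };

// Hypothetical input stage: fills the block from its coordinates only,
// so it reads and writes no global state.
void fillBlock(Block& block, int t, int p, int k) {
    for (int n = 0; n < 64; ++n) block.data[n] = t + p + k + n;
}

// Hypothetical output stage: appends one value per block to the given stream.
void emitBlock(const Block& block, Packets& out) {
    out.bytes.push_back(block.data[0]);
}

std::vector<Packets> encodeFrames(int numFrames, int vNumBlocks, int hNumBlocks) {
    std::vector<Packets> perFrame(numFrames);   // one output slot per frame
    // With Cilk Plus, this outer loop is the one that would become a cilk_for.
    for (int t = 0; t < numFrames; ++t) {
        for (int p = 0; p < vNumBlocks; ++p) {
            for (int k = 0; k < hNumBlocks; ++k) {
                Block block;                    // private to this iteration
                fillBlock(block, t, p, k);
                emitBlock(block, perFrame[t]);  // only frame t writes slot t
            }
        }
    }
    return perFrame;
}
```

Because no two frames touch the same Block or the same Packets object, the outer loop has no shared writes left to race on, provided the stages themselves avoid globals.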

I know that if you use Cilkscreen for race detection, and if your program is compiled with symbols, then the race detector will give you a stack trace and line information when it reports a race. I would assume, though, that Inspector has a similar feature.

If you are interested in reading more about Cilkscreen, Barry has written a good article on the topic.

I have posted a comment with two files attached: one with the "mainvideoIn" function implementation, and another with the library used by this function. I have also modified the top-level loops with holders in the "Main" file. Is that the right way to do it?

If you use a holder for isFirst and compY, then in theory, this reading of the file could happen once for every frame, if the runtime managed to process each frame in parallel. That behavior is not as efficient as you want, but I'm not sure whether that would cause a bug or not.

What I suspect is causing a bug, however, is the use of a holder for i and j. In the code, it looks like each call to mainVideoIn is incrementing i and j.

These increments introduce a serial dependence between different instances of mainVideoIn.

When you parallelize the code and use a holder for i and j, you now effectively get separate copies of i and j for each instance of mainVideoIn that is running in parallel. Therefore, you don't get a race when you update i and j.

However, a holder for i and j may get reset to the default value for its type (i.e., 0) when it starts executing on a new parallel strand. Since the values of i and j are sometimes getting reset in the parallel execution, you probably have parallel instances of mainVideoIn indexing into the wrong places in compY, compU, and compV. That could give you nondeterministic and incorrect output for parallel execution.

To fix the problem, you may need to figure out how to calculate the correct values for i and j and pass them as arguments into mainVideoIn, instead of relying on global variables.
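As a rough sketch of that idea, the position a call should read from can be derived from the loop indices alone, so no counter has to survive from one call to the next. Everything here is an assumption for illustration: I do not know what i and j actually mean in mainVideoIn, so the sketch supposes they are the pixel row and column of a block's top-left corner, with square blocks laid out in raster order. BlockOrigin, blockOrigin, and planeIndex are hypothetical names.

```cpp
// Hypothetical: pixel coordinates of a block's top-left corner.
struct BlockOrigin {
    int row;
    int col;
};

// Assumption: blocks are blockSize x blockSize pixels in raster order,
// where p is the vertical block index and k the horizontal block index.
BlockOrigin blockOrigin(int p, int k, int blockSize) {
    return { p * blockSize, k * blockSize };
}

// Flat index into a row-major plane (e.g. a compY-like array) that is
// planeWidth pixels wide.
int planeIndex(BlockOrigin origin, int planeWidth) {
    return origin.row * planeWidth + origin.col;
}
```

A function shaped like mainVideoIn could then take the origin (or p and k) as parameters, which removes the serial dependence on the global counters.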

Thanks, Jim, for your explanation. There are also some issues when I try to use an array of integers (e.g. compY) as cilk holders: it significantly slows down execution compared with the serial version. I wanted to ask you whether the following is the correct way of using arrays as holders.

The statement "cilk::holder<int> compY[100];" declares an array of 100 holder objects. In general, having a large number of reducer objects can slow down your program considerably compared to a serial execution.

What should be faster is to say something like "cilk::holder<int[100]> compY;" to declare a single holder object whose view is an array of 100 ints. This approach still has the problem, however, that creating new views of the holder object might be somewhat expensive. When a holder gets accessed inside a parallel loop, each iteration of the loop may potentially create a new view for the holder, for use inside that loop iteration. For example, in the following loop, up to n new views of x might be created.

cilk::holder<T> x;
cilk_for (int i = 0; i < n; ++i) {
    f(x);
}

In your code, if the type T is a large object (e.g., a large array), then creating new views can potentially be an expensive operation. You also still have the issue that I mentioned earlier --- if you meant for every iteration of the parallel loop to be sharing the same object x, rather than having every iteration have its own copy of x, then using a holder is incorrect.

Do you mean to create new copies of the arrays for each frame that are not shared between frames, or do you mean to have a single array that is shared between the frames?
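If the answer is "a separate array per frame", one option that avoids holders entirely is to make the array a local variable of the per-frame work, so each iteration gets its own copy by construction. The sketch below is plain C++ under that assumption; encodeFrame and encodeAll are hypothetical names, the checksum is a placeholder for the real encoding work, and the size of 100 is taken from the declaration quoted above.

```cpp
#include <array>
#include <vector>

// Hypothetical per-frame work: the buffer is a local, so it is never
// shared between frames and needs no holder.
long encodeFrame(int t) {
    std::array<int, 100> compY{};            // private to this frame
    for (int n = 0; n < 100; ++n) compY[n] = t * 100 + n;

    long checksum = 0;                        // placeholder for real encoding
    for (int value : compY) checksum += value;
    return checksum;
}

std::vector<long> encodeAll(int numFrames) {
    std::vector<long> result(numFrames);
    // With Cilk Plus, this loop is the one that would become a cilk_for;
    // each iteration writes only result[t].
    for (int t = 0; t < numFrames; ++t) {
        result[t] = encodeFrame(t);
    }
    return result;
}
```

If instead the array really is meant to be shared between frames, neither a local nor a holder is appropriate, and the accesses need to be arranged so that parallel iterations do not write the same elements.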