No parallelism in my cilk code.

I am very new to Cilk++ and I have read the manual a few times. So far I have had no problems converting C++ code into Cilk++ or compiling it into executables. However, I have not been able to achieve any parallelism yet, so I must be doing something wrong. If anyone could kindly provide any advice, I would really appreciate it.

The work below was done exclusively on a 64-bit Unix workstation. The workstation is equipped with two quad-core CPUs (AMD Opteron Processor 2354) and 64 GB of memory. Cilk++ was installed under /usr/local, so to call the compiler I use /usr/local/cilk/bin/cilk++, and for the header file I include /usr/local/cilk/include/cilk++/cilk.h.

This is practically the same as the data race example from pp. 87-88 of the Cilk++ manual. I am expecting a data race here. That means I want to see, in the second line of the result, n = 2 or n = 1 depending on whether strand2() or strand1() wins the race. However, the result above stays the same across many runs and recompilations. The output from cilkscreen, though, does report a data race.

I also wrote a program in Cilk++ to compute Fibonacci numbers and, oddly, the Cilk++ running time was slower than that of the C++ code. However, when I checked with cilkview, it does show a speedup. I guess this is also because I have not been able to achieve parallelism in that Cilk++ code either.

I am not sure what went wrong and will really appreciate any directions and advice.

The use of cilk_spawn does not command parallelism. It's a note to the runtime that there is an opportunity for parallelism. When you start your Cilk++ program, "worker threads" are started to run parallel sections of your code. Each of the threads is randomly choosing another worker to try to steal from.

Meanwhile, your main thread has continued executing. There is so little work being done in strand1(), I'm guessing that it's completing before it can be stolen. Try adding something more time consuming. But be careful if you add a loop that the compiler doesn't get clever and optimize it away.

As barry said, the spawned function is not doing enough work for you to see parallelism. Even if the parent is stolen, the child will almost certainly finish before the continuation has fully begun. Thus, strand2() will always be the last thing that executes before the sync.

If you add a busy loop to strand1() (e.g., increment a volatile counter 10000 times) before setting n to 1, you will likely end up with the reverse behavior. The child will always complete after the continuation, and you will see a final result of n = 1. With a trivial example like this one, it is almost impossible to see nondeterministic behavior manifest as random output.
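A busy loop of that kind could be sketched in plain C++ as follows. The helper name busy_work and the exact shape of strand1() are assumptions for illustration (the original strand code is not shown in this thread); the point is only that the counter is volatile, so the compiler has to keep every iteration.

```cpp
// Hypothetical helper: burns CPU time in a way the optimizer must keep.
// Every access to a volatile object is an observable side effect, so the
// loop cannot be collapsed or removed.
unsigned long busy_work(unsigned long iterations) {
    volatile unsigned long counter = 0;
    for (unsigned long i = 0; i < iterations; ++i) {
        counter = counter + 1;  // volatile read followed by volatile write
    }
    return counter;
}

int n = 0;  // stands in for the shared variable the strands race on

// Assumed shape of the spawned function: delay first, then write n.
void strand1() {
    busy_work(10000);
    n = 1;
}
```

With the delay in place, the write in strand1() is likely to land after the continuation's write, which is the reverse ordering described above. The iteration count is only a starting point and may need to be raised on a fast machine.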

If you want to see random output caused by a race, try a large cilk_for loop like the following:

    unsigned x = 0;
    cilk_for (int i = 1; i <= 1000000; ++i)
        x ^= x << 1 ^ i;
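To tell whether a run of that loop was affected by the race, it helps to have a deterministic value to compare against. The sketch below is the serial version of the same loop in plain C++ (cilk_for replaced by for), assuming x starts at 0. The function name serial_reference is made up for this example.

```cpp
// Serial version of the racy loop: same update, ordinary for loop.
// Assumes x starts at 0; the result is the same on every call.
unsigned serial_reference() {
    unsigned x = 0;
    for (int i = 1; i <= 1000000; ++i) {
        // Shift binds tighter than XOR, so this is x ^= ((x << 1) ^ i).
        x ^= (x << 1) ^ static_cast<unsigned>(i);
    }
    return x;
}
```

If the parallel loop prints values that differ from this one, or from each other across runs, the race is showing up in the output.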

I haven't tried this, but I suspect that you will get different results each time you run it on a multicore machine.

As for Fibonacci, you would need to post the code for us to be able to analyse it. I'm guessing that you've structured it such that the spawn overhead outweighs the natural parallelism.
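Since the poster's code is not shown, here is only a guess at what the overhead problem might look like, written as plain serial C++. In the usual recursive formulation each call does a single addition, so a spawn at every level pays scheduling overhead for almost no work. One common remedy is a serial cutoff: below some n, stop creating parallel work and recurse serially. The names fib_serial and fib_coarse and the cutoff parameter are hypothetical, and the comments mark where the Cilk++ keywords would go.

```cpp
#include <cstdint>

// Plain recursive Fibonacci: one addition of work per call.
std::uint64_t fib_serial(unsigned n) {
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

// Coarsened version: parallel work would only be created for n >= cutoff.
std::uint64_t fib_coarse(unsigned n, unsigned cutoff) {
    if (n < 2 || n < cutoff) {
        return fib_serial(n);  // small subproblem: no spawn, no overhead
    }
    // In Cilk++ the next call would be prefixed with cilk_spawn ...
    std::uint64_t a = fib_coarse(n - 1, cutoff);
    std::uint64_t b = fib_coarse(n - 2, cutoff);
    // ... and a cilk_sync would go here, before a and b are combined.
    return a + b;
}
```

A suitable cutoff depends on the machine and would have to be found by measurement.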

The results above indicate that the C++ version of this code is much faster, and there seems to be a linear relationship between the C++ and Cilk++ running times, with C++ about 8 times faster than Cilk++. However, cilkview shows the following and suggests that Cilk++ is faster.

This is strange, indeed. If you look at the real vs. user time for the last run, the results are telling:

Fib(50) = 12586269025 computed in C++
real 3m14.171s
user 3m14.170s

Fib(50) = 12586269025 computed in Cilk++
real 3m14.776s
user 21m7.700s

In the serial C++ case, the real (wall clock) time is almost identical to the user (cpu) time. In the Cilk++ case, the real time is only 1/6.5 as much as the user time. This suggests that the Cilk++ code is getting a real parallelism of about 6.5 vs. the single-threaded case. You can test this by setting the environment variable CILK_NPROC=1 and re-running the Cilk++ code. The real time will be the same or larger than the user time on one worker.
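The arithmetic behind the 6.5 figure can be checked with a few lines of C++. The helper names are made up for this sketch; the inputs are the real and user times quoted above.

```cpp
// Convert a time printed as "XmY.YYYs" into seconds.
constexpr double to_seconds(int minutes, double seconds) {
    return minutes * 60.0 + seconds;
}

// Ratio of CPU time to wall-clock time: roughly how many cores were
// busy on average during the run.
constexpr double cpu_to_wall_ratio(double user_s, double real_s) {
    return user_s / real_s;
}

// Cilk++ run: user 21m7.700s, real 3m14.776s
constexpr double cilk_ratio =
    cpu_to_wall_ratio(to_seconds(21, 7.700), to_seconds(3, 14.776));

// Serial C++ run: user 3m14.170s, real 3m14.171s
constexpr double serial_ratio =
    cpu_to_wall_ratio(to_seconds(3, 14.170), to_seconds(3, 14.171));
```

Here cilk_ratio comes out at about 6.5 and serial_ratio at about 1.0. Note that the ratio only says how many cores were kept busy; it does not say whether the extra CPU time went into useful work.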

The question, then, is why there is so much more CPU time being expended in the Cilk++ code than in the serial C++ code. Are you sure that you are using identical algorithms? (You might want to post your serial fib code.) Are you using the same compiler optimization switches?