Description

In Visual Studio 2010, the Parallel Computing team has delivered APIs and tools for developers wanting to build applications that take advantage of multiple cores. This video provides a glimpse of the managed APIs, debugging windows, and profiler support.

The Discussion

One thing though: when matrix multiplication was executed on a small data set, sequential was actually faster than ParallelFor (@20:48). Are there any insights on estimating the overhead of setting up the parallel execution machinery, so an application could decide whether to process a data set sequentially or in parallel (assuming the application knows the size of the data set)?

We haven't published any insights. Currently, you would have to measure in your app and decide at what kind of workload you start getting a speedup from parallelism. You could then have a conditional path that is serial for smaller workloads and a parallel path for larger workloads.
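A minimal sketch of that conditional path in C#; the threshold here is a made-up placeholder you would replace with whatever your own measurements suggest:

```csharp
using System;
using System.Threading.Tasks;

static class ConditionalSum
{
    // Hypothetical cutoff; measure your own workloads to find the real one.
    const int ParallelThreshold = 10000;

    public static long Sum(int[] data)
    {
        if (data.Length < ParallelThreshold)
        {
            // Serial path: small inputs don't repay the task setup cost.
            long total = 0;
            for (int i = 0; i < data.Length; i++) total += data[i];
            return total;
        }

        // Parallel path: per-task local sums, combined once per task.
        long result = 0;
        object gate = new object();
        Parallel.For(0, data.Length,
            () => 0L,
            (i, state, local) => local + data[i],
            local => { lock (gate) { result += local; } });
        return result;
    }
}
```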

As the narrator mentions, the Debug.WriteLine causes the threads to queue/block on writing to the log/IDE. If this operation is significantly more costly than the work in the task (he has some simple math), then you're running essentially synchronously and the parallel code just becomes a burden.

Also, I always test by running the methods thousands or even millions of times and taking an average. OS noise, amongst other things, can severely impact a single pass of a method, so you can sometimes get a misleading result.
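A sketch of that measurement style, with a warm-up pass so JIT compilation doesn't pollute the timing (the names here are illustrative, not from the video):

```csharp
using System;
using System.Diagnostics;

static class Bench
{
    // Average the cost of an action over many iterations; one warm-up
    // call first so JIT compilation doesn't skew the result.
    public static double AverageTicks(Action action, int iterations)
    {
        action(); // warm-up

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) action();
        sw.Stop();

        return (double)sw.ElapsedTicks / iterations;
    }
}
```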

Edit -- the tools far exceed all expectations. I'd like to see in-code-editor warnings about compiler optimisation and reordering pitfalls; for example, if I add an attribute [MultiThreaded] to a method, then (after compilation) the IDE highlights executions/reads/writes that have moved from the order in which they were coded.

Not significantly. The point of the video was to show the tools, make you aware of the rich Task API and the ease of use of PFor, and show the pitfall of getting no perf gains from directly using the ThreadPool. Even though I am microbenchmarking, if you run each approach separately from the others, multiple times and with varying workloads, I do not expect the results will vary significantly.

Thanks. Regarding your wish list, those items are not under what we consider debugging and performance tools. We think of those as Correctness tools (some say Analysis tools and others say Safety tools) - regardless, they are very important too and things we are thinking about for future (post-VS2010) releases.

Luke, for the sake of correctness, the Debug.WriteLine call was added to the MulTask() method, which was executed after the sequential and PFor multiplication. So the PFor loop ran without any blocking on screen output, and it was slower than sequential presumably because of all the extra work associated with priming the parallel execution environment.

If 90% of the input for my app on any given day happens to be small, it's better to process those 90% sequentially and use parallel execution only when appropriate; but for that it would be nice to know where (approximately) that cutoff point is.

I can run tests and collect some stats on what the overhead of firing up parallel execution is, but assuming this work might have already been done while developing the parallel framework, it would be preferable for me to look at the stats collected by the PF development team than to spend the time and effort myself.
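In the absence of published numbers, a rough probe is easy to write yourself: time an empty Parallel.For against an equivalent empty serial loop, and the gap approximates the fixed setup cost on your machine. A sketch only - results will vary wildly by hardware and runtime:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class OverheadProbe
{
    // Returns (serialTicks, parallelTicks) for `runs` repetitions of an
    // n-iteration loop with a trivial body. The difference between the
    // two is a crude estimate of Parallel.For's fixed setup overhead.
    public static (long Serial, long Parallel) Measure(int runs, int n)
    {
        long sink = 0; // keeps the serial loop from being optimised away

        var sw = Stopwatch.StartNew();
        for (int r = 0; r < runs; r++)
            for (int i = 0; i < n; i++) sink += i;
        long serial = sw.ElapsedTicks;

        sw.Restart();
        for (int r = 0; r < runs; r++)
            Parallel.For(0, n, i => { });
        long parallel = sw.ElapsedTicks;

        GC.KeepAlive(sink);
        return (serial, parallel);
    }
}
```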

I can't remember my exact wording but the precise statement is that "for this benchmark, on my runs the PFor beats my naive tasks implementation <insert more disclaimers here>".

To actually answer your question: PFor uses tasks in an intelligent manner (partitioning the range amongst a much smaller number of tasks) instead of using one task per iteration. So, of course, you could code an equally (or even more) performant version with tasks yourself, but look at the simplicity of the PFor API.
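To make the comparison concrete, here is a hand-rolled partitioning of a range into one task per core - roughly the strategy described above, and a sketch only, not the actual PFor implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class Partitioned
{
    // One chunk of the range per core instead of one task per iteration.
    public static void ChunkedFor(int from, int to, Action<int> body)
    {
        int workers = Environment.ProcessorCount;
        int chunk = Math.Max(1, (to - from + workers - 1) / workers);

        var tasks = new List<Task>();
        for (int start = from; start < to; start += chunk)
        {
            int s = start, e = Math.Min(start + chunk, to);
            tasks.Add(Task.Factory.StartNew(() =>
            {
                for (int i = s; i < e; i++) body(i);
            }));
        }
        Task.WaitAll(tasks.ToArray());
    }
}
```

The equivalent with the library API is simply `Parallel.For(from, to, body);` - the same partitioning idea, with none of the bookkeeping.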

TIP: To get insight into the PFor implementation, place a breakpoint in the body of the for loop and look at the Parallel Tasks window.

Seva. You are correct, my apologies. Although the problem with using PF dev team stats is that while they may know that cranking up the other threads takes, say, 2,000 cycles per thread (I've really no idea - could be 20,000), you'd still have to test your
task/code to see how much work it is in comparison.

Although if the profiler could try P.For and normal For and suggest when you're actually degrading performance, that would be excellent. That said, I did think that P.For uses some kind of ramp-up technology so that it runs synchronously for small loads - and thus what we saw in the video shouldn't happen. P.For is so easy to use, but also so easy to use in the wrong place.

My problem is that most of the code I write is, I think, too small to be parallelised - arrays of 12 items, say. By the time I've finished, I've got a huge chain of little bits of work adding up to a large amount of work, so then I start to look for pieces of logic that can run independently of each other, kick off some branch of execution on another thread, and re-join/wait further down. The problem being that in the future I won't be able to find enough simultaneous work to spread across all cores!