From C to LLVM Bitcode

Here we use the LLVM tools to read in the C code and emit the
bitcode file containing the LLVM IR. When there are multiple C files,
they are linked together into one bitcode file. Also, the code is
profiled to obtain basic block execution counts.

Automatic Parallelization

Here is where we actually run the parallelizing compiler. This
includes building the Program Dependence Graph, running both the DSWP
and PS-DSWP partitioners, and running the Multi-Threaded Code
Generator, which emits multi-threaded LLVM IR that implements parallel
execution. At the end, this output is linked with the queue library
and the thread pool library to create a native executable.

Program Dependence Graph

During the Automatic Parallelization step, a number of graphs are
generated to help the developer visualize the program. One of these is
the Program Dependence Graph (PDG), which shows data and control
dependences between instructions in the loop to be
parallelized. Different colored edges represent different types of
dependences.

Graph of Strongly Connected Components

In this graph, each node is a Strongly Connected Component (SCC)
and consists of one or more nodes from the original PDG. The
partitioner operates on these SCCs rather than on PDG nodes
directly. Some SCCs are marked DOALL, indicating that they have no
inter-iteration dependences and can be placed in a parallel stage by
the PS-DSWP partitioner.

Partitioning the Graph Into Stages

This graph shows the actual partition that was computed by
PS-DSWP. The red stage is the sequential stage, while the green stage
is parallel. PS-DSWP is successful in creating as large a parallel
stage as possible to maximize performance.

Sequential Execution

To see the baseline performance, the original C code is compiled
using gcc and executed. (At this point, you may start playing the next
video to visually compare the speeds of sequential vs. parallel
execution.)

Parallel Execution

The parallelized code runs in about 16.5 seconds, while the
sequential code took 1:14. This is a speedup of about 4.5x on an
8-core machine. Thus, we can achieve real, significant speedups on
unmodified source code.