The PARSEC Project

From SuperTech Wiki

We are interested in creating serial versions and Cilk parallelizations of the benchmarks in PARSEC, to understand the limitations and challenges that modern applications pose for parallel computing. The PARSEC benchmark suite comes with 13 benchmarks, most parallelized with Pthreads and a few with OpenMP and TBB. Their program patterns range from simple "embarrassingly parallel" loops to complicated pipeline parallelism and stencil-like graph computation. The goal of this project is to understand and isolate interesting problems in parallel computing. So far we have parallelized 12 of the benchmarks with Cilk, to varying degrees.

From those 12 benchmarks we have isolated a few interesting problems, namely pipeline parallelism and "chromatic scheduling" (an approach to solving graph problems with local updates in parallel), which have spun off into projects of their own. We believe the community will benefit from a complete treatment of PARSEC in Cilk, and possibly from an expanded suite with more benchmarks. This project is close to finished; what follows is a preview of the results. Please stay tuned for the final results.

Notes

We are interested in bringing Cilk parallelization to PARSEC, an industry-strength benchmark suite for emerging workloads. The suite contains 13 benchmarks: x264, an H.264 standard-compliant video stream encoder; facesim, a realistic simulation of human facial muscles; raytrace, a ray-tracing engine; vips, a large-image-oriented image processing engine; freqmine, a data mining engine that can operate online; fluidanimate, a fast 3D fluid animation module (faster than Jos Stam's algorithm); streamcluster, a clustering algorithm that can operate online; bodytrack, an algorithm that reconstructs the human pose across video sequences given prior knowledge of the skeleton; canneal, a program that improves the routing plan of a given circuit using simulated annealing; swaptions, an option pricing simulation engine; blackscholes, a stock market analysis algorithm; dedup, a compression engine that can operate online over a continuous data stream; ferret, an image retrieval engine. We are also interested in finding common patterns in these programs and making them standard features of the Cilk language.

This is a brief description of our understanding of, and progress on, each project. If you have questions, please email jzz@mit.edu.

x264: x264 is an encoder for the H.264 video standard. A video is a sequence of image frames played back continuously; each frame can be divided into blocks, and a block usually has a similar block somewhere else in the video stream. More than 90% of the time in x264 is spent searching for such similar blocks. In one line, the pseudocode is: for each frame f { for each block b in f { find a way to encode b; write encoded b to disk; }}. The "find a way to encode b" step searches for blocks similar to b. There are several ways to encode a block b: 1. as JPEG; 2. as an increment over a block from the same video frame; 3. as an increment over a block from a previous video frame; 4. as an increment over a linear combination of two blocks from potentially different previous video frames. The simplest parallelization is to parallelize the inner loop, although that might involve removing the second type of encoding, since it makes blocks of the same frame depend on each other. This is the parallelization we are trying to implement. [Project status]: code mostly read, parallelization strategy made, no implementation. [INS]: understand the "find a way to encode b" part of the code, try to remove the in-frame dependency and parallelize the blocks loop as it is, then compare the compression ratio.
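
As a rough illustration of the inner-loop strategy, here is a minimal Python sketch (Python threads stand in for cilk_for, so it shows the structure rather than any speedup). All names (sad, encode_block, encode_video, decode_video) and the toy lossless encoding are hypothetical and are not taken from the x264 source. Blocks refer only to earlier frames, which is what makes the blocks of one frame independent.

```python
from concurrent.futures import ThreadPoolExecutor


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for x, y in zip(a, b))


def encode_block(block, previous_frames):
    """Encode one block against earlier frames only (no in-frame references)."""
    best = None
    for fi, frame in enumerate(previous_frames):
        for bi, ref in enumerate(frame):
            cost = sad(block, ref)
            if best is None or cost < best[0]:
                best = (cost, fi, bi)
    if best is None:
        # First frame: nothing to refer to, store the block as-is.
        return ("intra", tuple(block))
    _, fi, bi = best
    ref = previous_frames[fi][bi]
    residual = tuple(x - y for x, y in zip(block, ref))
    return ("inter", fi, bi, residual)


def encode_video(frames, workers=4):
    """Frames are processed in order; blocks within a frame run in parallel."""
    encoded = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i, frame in enumerate(frames):
            prev = frames[:i]
            # pool.map keeps block order, so the output can be written in order.
            encoded.append(list(pool.map(lambda b: encode_block(b, prev), frame)))
    return encoded


def decode_video(encoded):
    """Inverse of encode_video, used here only to check the round trip."""
    frames = []
    for enc_frame in encoded:
        frame = []
        for item in enc_frame:
            if item[0] == "intra":
                frame.append(tuple(item[1]))
            else:
                _, fi, bi, residual = item
                ref = frames[fi][bi]
                frame.append(tuple(r + d for r, d in zip(ref, residual)))
        frames.append(frame)
    return frames
```

Comparing the size of the encoded output with and without in-frame references would be the analogue of the compression-ratio comparison mentioned above.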

facesim: This is a realistic simulation of the human face. The results are meant as a starting point for realistic face effects in movies, games, medicine, and other areas that require realistic face simulation. The face is modeled as a triple (bones, flesh, muscles). The bones are two triangular meshes representing the cranium and the jaw, totaling 370k triangles; their interaction is modeled as a simple hinge. The flesh consists of 850k tetrahedra governed by elastic mechanics. The 32 muscles are abstract strips whose ends are attached to bones and whose bodies pass through the flesh. During the simulation, each muscle is fed an activation level in [0,1], with 1 being the most tense. The contractions of the strips in turn affect the bones and the flesh, which are governed by solid and elastic physics. The actual simulation loop is quite complicated and we are still trying to understand it. It has four hot spots, three of them of the same nature. We have cilk_for-ed all of them and obtained moderate speedup (roughly 2-4x on 8 cores). Next step: read the source code to see what the simulation loop actually does. (Collaborate with Tim.) [Project status]: code not fully read, parallelization strategy made, first parallelization done, parallelism <10. [INS]: might need to ask other people to parallelize it

raytrace: Ray tracing is an algorithm for visualizing a geometric scene. It works by casting rays into the scene, recording the color each ray senses, and assembling the results into a single image. At a high level, this algorithm is trivially parallelizable, as all rays can be processed independently. One caveat is that neighboring rays tend to hit similar parts of the scene, so they can share cache contents and are better handled by the same processor. That's the parallelization we have in mind. One complication is that this program segfaults under ICC but not under GCC. This has to be solved, for example by switching to GCC or to a newer version of ICC. The single-threaded version just goes through LRT/render.cxx: renderFrame .... renderTile_With_StandardMesh. It seems perfectly fine to replace the outer for with a cilk_for. However, the scene might hold temporary variables that facilitate traversal, and these need to be duplicated per worker. We need to run the experiment to see whether there are segfaults, and we had better look at the output image as well. If things go really badly, we might need to figure out how to duplicate the whole scene P times. But for a first cut, let's just focus on duplicating whatever temporary variables are there. [Project status]: parallelized almost the same way as they do (trivial), but our cache misses are 1.2x theirs and our speedup is half of theirs. [INS]: find out bottleneck.
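
The tile-level parallelization with duplicated temporaries can be sketched as follows. This is a minimal Python sketch under stated assumptions: trace is a hypothetical stand-in for the real ray traversal, the "scene" is just a formula on pixel coordinates, and thread-local storage plays the role of the per-worker copies of traversal scratch state. Each worker takes whole tiles so that neighboring rays stay on the same worker.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Per-worker scratch space, so that traversal temporaries are never shared.
_scratch = threading.local()


def trace(x, y):
    """Hypothetical ray: uses a private traversal stack, returns a grey level."""
    stack = getattr(_scratch, "stack", None)
    if stack is None:
        stack = _scratch.stack = []
    stack.clear()
    stack.extend((x, y))  # stand-in for real traversal state
    return (stack[0] * 31 + stack[1] * 17) % 256


def render_tile(tile):
    """Render one rectangular tile; neighbouring rays stay on one worker."""
    x0, y0, x1, y1 = tile
    pixels = [[trace(x, y) for x in range(x0, x1)] for y in range(y0, y1)]
    return tile, pixels


def render(width, height, tile=4, workers=4):
    tiles = [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in range(0, height, tile)
        for x in range(0, width, tile)
    ]
    image = [[0] * width for _ in range(height)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The parallel loop over tiles; results are copied into the image here.
        for (x0, y0, x1, y1), pixels in pool.map(render_tile, tiles):
            for dy, row in enumerate(pixels):
                image[y0 + dy][x0:x1] = row
    return image
```

Checking that the parallel image equals the serial one is the cheap version of "look at the output image".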

vips: [vips configure error: the problem is solved by commenting out the following lines:

8815: ;;
8816: esac

8828: )

in the generated configure file.] vips is an image processing engine with simple operations such as scaling and skeletonization, designed with large images in mind. Execution starts in src/iofuncs/vips.c, which calls im_run_command in libsrc/iofuncs, which in turn calls dispatch_function in the same file. In dispatch_function we find the parsec_roi_pair. It calls im_benchmark in libsrc/other, which is what the PARSEC benchmark runs. dispatch_function actually uses a lookup table, built_in, which consists of many libraries; one of them, im__other, defined in libsrc/other/other_dispatch.c, contains im_benchmark, and this is how the im_benchmark function is found. The im_benchmark function calls the following functions: im_open_local_array, im_LabQ2Lab, im_extract_area, im_affine, im_extract_band, im_moreconst, im_lintra_vec, im_Lab2XYZ, im_recomb, im_lintra_vec, im_lintra, im_XYZ2Lab, im_lintra_vec, im_black, im_lintra_vec, im_ifthenelse, im_Lab2LabQ, im_sharpen. Each should be parallelized on its own, and they all share the same pattern. Breakdown:

im_open_local_array: include/vips/util.h, defined to be im_local_array in libsrc/iofuncs/util.c, which calls im_local. A serial function that allocates 18 copies of local storage images. I wonder how it works with very large images. No parallelism.

im_LabQ2Lab: libsrc/colour/im_LabQ2Lab.c. Calls im_wrapone, which dissects the image into chunks and processes each chunk in parallel using a function called imb_LabQ2Lab. im_wrapone calls im_wrapmany, which in turn seems to call im_generate to do the actual parallel processing. im_generate calls eval_to_memory, which breaks the image into stripes, each 1/100 of the image height, and calls eval_to_region for each. eval_to_region spawns small tasks to process smaller pieces of the stripes. The thread group is created and finalized in im_generate; the threads are triggered in eval_to_region. Strategy: first serialize the program, making a serial version that loops over the (small) chunks and processes them in turn. No speedup.

im_extract_area: ./libsrc/conversion/im_extract.c. im_extract_area calls im_extract_areaband, which is actually im_generate-ed. It "extract an area and band from an image": it seems to be a function that gives us an area of an image, and it can also isolate a single channel (which they call a band).

im_affine: ./libsrc/mosaicing/im_affine.c. This one is complicated. In terms of calls, im_affine calls im__affine, which calls affine, which uses im_generate to do "the thing" in parallel. The kernel is affine_gen; serialization should be simple.

im_extract_band: ./libsrc/conversion/im_extract.c. Built on im_extract_areaband.

im_moreconst: ./libsrc/relational/relationa.c. im_moreconst calls im_more_vec, which calls im_less_vec, which calls im_wrapmany (thread kernel lessvec_buffer). im_more_vec also calls im_eorconst (./libsrc/boolean/boolean.c), which is XOR with a constant image. It calls im_eor_vec, which im_wrapone-s eorconst_buffer, which has a two-level loop in it, defined in EORCONST.

im_lintra_vec: ./libsrc/arithmetic/im_lintra.c. Linear transformation of an image. double *a and *b are both parameters, with < 10 numbers in each; presumably each number corresponds to a band. As far as I can tell, im_lintra_vec wraps (via im_wrapone) lintra1_gen, lintrav_gen, and lintran_gen. The lintra*_gen functions have a loop inside.

im_Lab2XYZ: libsrc/colour/im_Lab2XYZ.c. Calls ~_temp, which wraps imb_~, which has a loop up to n inside. It just changes the colour space.

im_recomb: libsrc/conversion/im_recomb.c. Recombines bands linearly; wraps ~_buf.

im_lintra: linearly transforms only one band; uses im_lintra_vec.

im_XYZ2Lab: libsrc/colour/im_XYZ2Lab. Calls ~temp, which wraps ~b~, which loops up to n. Converts XYZ to the Lab colour space.

im_black: conversion/~. Directly gens black_gen, which calls mosaicing/im_lrmerge's im__black_region, a loop that blacks out a region. The whole function generates a black image of a particular size with one band.

im_ifthenelse: relational/im_ifthenelse. This one is iffy. It gens ~_gen, which is pretty simple: the core part loops over the image either to check whether all images are 0s or 1s (a reducer), or to multiplex the values from two images.

im_Lab2LabQ: colour/~. Wraps ~b~, which loops over pixels. It quantifies the image; I don't know what that means.

im_sharpen: convolution/~. This is the most complicated one. At the beginning it does a small recursion to transform the format if it is not supported. Then it builds a bunch of prerequisites to get the sharpening ready. Finally it wraps buf_difficult to do the work, which has one loop in it. It uses im_LabQ2LabS and im_LabS2LabQ. im_LabQ2LabS: colour/~, nothing tricky, unpacks the coded image into three shorts; the loop is in ~b~, which is wrapped by ~. im_LabS2LabQ: colour/~, converts back; wraps ~b~, which has a simple loop in it.

What needs to be done is simply to strip out the wraps and gens and plug in cilk_for, and in some cases reducers and the like. vips already comes fully parallelized: it cuts the image into regions and processes them independently. Most image operations can be parallelized this way. There might be operations in vips that cannot be parallelized in this manner, but I haven't read how they are parallelized, or whether any such operations exist. For the operations that can be parallelized this way, we can simply go in and cycle through the regions using a parallel for. That's the parallelization I have in mind, and I will implement this method. [Project status]: benchmarking code, scheduling code, and code of a few operations have been read, haven't read other operations code, parallelization strategy made, no implementation. [INS]: just parallelize it
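
The "cycle through the regions using a parallel for" idea can be sketched as below, in Python rather than Cilk. The operation is a toy per-pixel linear transform a*p + b in the spirit of im_lintra; the function names, the stripe height, and the list-of-rows image representation are assumptions made for illustration and do not come from the vips source.

```python
from concurrent.futures import ThreadPoolExecutor


def lintra_region(image, a, b, y0, y1):
    """Apply p -> a * p + b to rows y0..y1-1; regions never overlap."""
    return [[a * p + b for p in row] for row in image[y0:y1]]


def lintra(image, a, b, stripe_rows=2, workers=4):
    """Cut the image into horizontal stripes and process them independently."""
    bounds = [
        (y, min(y + stripe_rows, len(image)))
        for y in range(0, len(image), stripe_rows)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The parallel for over regions; pool.map returns stripes in order.
        stripes = pool.map(lambda yy: lintra_region(image, a, b, *yy), bounds)
        return [row for stripe in stripes for row in stripe]
```

Operations that need a reduction over the whole image (for example the all-zeros check in im_ifthenelse) would need a reducer on top of this loop rather than a plain map.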

freqmine: This project discovers patterns in data, i.e., it does data mining. It uses a clever data structure called the frequent-pattern tree (FP-tree) to reveal the patterns at low cost. I haven't fully understood the algorithm, which is quite complicated. The parallelization pattern is not quite clear yet, but the benchmark is parallelized with OpenMP using a few parallel for loops and critical regions. The OpenMP version has at least one bug: it segfaults under ICC and sometimes under GCC. Several fixes have been posted to the mailing list, but I haven't tried them. [Project status]: OpenMP version running. Cilk version waiting for GCC bug fix. Should work. [INS]: wait

fluidanimate: This is a 3D fluid simulation. The fluid is modeled as a set of particles whose interactions mimic those in the real world. The interactions between particles are local, but there seems to be a global computation step that removes extra energy accidentally generated during (floating-point) computation. This global computation is harder to parallelize than the local one, but it may take only a fraction of the time. This needs to be confirmed. We focus mainly on the local computation, which is a simple loop that calculates how each particle is affected by its neighbors. This procedure can be parallelized. We might need a double buffer, as used in many stencil computations. One caveat is that there are many particles and they move through space, so finding a particle's neighbors is not a trivial problem. In fact, they are stored in a kd-tree or octree (need to check which) to allow fast lookup. Our parallelization strategy is simply to parallelize the loop over particles. Finer tuning is needed concerning cache, grain size, and data dependencies. [Project status]: code slightly read, algorithm not very clear, strategy made, trivial implementation ready, parallelism ?. trivial parallelization runs fast but with barriers [INS]: remove barrier
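
A double-buffered particle loop might look like the following minimal Python sketch. It is one-dimensional, uses a made-up attraction rule, and finds neighbors by brute force instead of the spatial tree mentioned above; step_particle, step, radius, and k are hypothetical names and parameters. The point is only that every update reads the old buffer and writes a new one, so the loop over particles has no races.

```python
from concurrent.futures import ThreadPoolExecutor


def step_particle(i, pos, radius, k):
    """New position of particle i, computed from the OLD positions only."""
    xi = pos[i]
    # Brute-force neighbour search; a stand-in for the spatial data structure.
    pull = sum(
        xj - xi
        for j, xj in enumerate(pos)
        if j != i and abs(xj - xi) <= radius
    )
    return xi + k * pull


def step(pos, radius=1.5, k=0.5, workers=4):
    """One time step with a double buffer: reads pos, returns a fresh list."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(
            pool.map(lambda i: step_particle(i, pos, radius, k), range(len(pos)))
        )
```

Calling step repeatedly and swapping the returned list in for the old one gives the usual ping-pong between two buffers; the end of each step acts as a barrier between time steps.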

bodytrack: This program reconstructs human skeletal movement from video sequences captured by multiple cameras (all facing the subject). It uses a particle-based approach and gives acceptable results. Its major charm is that it does not require the subject to wear attachable markers or the like. The algorithm has not been fully understood, but the program comes with an OpenMP parallelization, and a Cilk parallelization has been made. Minor tuning and an investigation of better parallelizations remain to be done. [Project status]: algorithm not understood, code not fully understood, parallelization done, speedup ok. [INS].

canneal: This is an implementation of the classic simulated annealing algorithm, used to find good solutions to the circuit routing problem. At the beginning, the algorithm produces a random routing of the circuit. It then randomly swaps pairs of nodes, arriving at a different (and possibly better) routing, called a permuted routing. If the permuted routing is better, the algorithm accepts it; otherwise it rejects it. This process continues until a certain number of node swaps have been tried or the solution is good enough. Given the nature of the problem, I don't see a clear way to divide it into parallelizable modules. It is possible to run several simulated annealing instances simultaneously on the same problem but with different states, and at the end pick the best result among them. However, this seems not to be effective (there appear to be some results on this), and it can complicate the measurement of the parallelization methodology. So I picked the simplest parallelization technique: perform node swaps using a parallel for and use fine-grained locks to remove the data races. This parallelization has not been implemented. [Project status]: trivial parallelization implemented, speedup not good, algorithm understood, code understood, parallelism = TO_BE_FILLED. [INS]: get the numbers.
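
The swap loop with fine-grained locks could be sketched as follows. This is a toy Python version under several assumptions: nodes sit on a line, the cost is the total wire length of two-pin nets, and acceptance is greedy as described on this page (textbook simulated annealing would also accept some worse moves with a temperature-dependent probability). wire_length, anneal, and all parameters are hypothetical names. Each swap locks only the two nodes involved, always in index order so that two workers cannot deadlock; neighbor positions are read without locks, so a worker may act on slightly stale data.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor


def wire_length(pos, nets):
    """Total length of all two-pin nets for a placement on a line."""
    return sum(abs(pos[a] - pos[b]) for a, b in nets)


def anneal(pos, nets, swaps_per_worker=200, workers=4, seed=0):
    pos = list(pos)  # work on a copy; pos[i] is the location of node i
    n = len(pos)
    locks = [threading.Lock() for _ in range(n)]  # one lock per node
    neighbours = [[] for _ in range(n)]
    for a, b in nets:
        neighbours[a].append(b)
        neighbours[b].append(a)

    def local_cost(node, where, other, other_where):
        """Length of the nets touching node, with node/other at given spots."""
        total = 0
        for m in neighbours[node]:
            m_pos = other_where if m == other else pos[m]  # unlocked read
            total += abs(where - m_pos)
        return total

    def worker(w):
        rng = random.Random(seed + w)
        for _ in range(swaps_per_worker):
            a, b = sorted(rng.sample(range(n), 2))
            with locks[a], locks[b]:  # ascending order avoids deadlock
                pa, pb = pos[a], pos[b]
                before = local_cost(a, pa, b, pb) + local_cost(b, pb, a, pa)
                after = local_cost(a, pb, b, pa) + local_cost(b, pa, a, pb)
                if after < before:  # greedy acceptance
                    pos[a], pos[b] = pb, pa

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(worker, range(workers)))
    return pos
```

With one worker the total wire length can never increase; with several workers the locks only guarantee that the placement stays a valid permutation, which is the race this scheme is meant to remove.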

swaptions: This is a simple program that estimates option prices. It is trivially cilkified and works great. [Project status]: coding part done. [INS].

blackscholes: This is a simple program for stock market analysis. The parallelization is done and is trivial. The number of cache misses is high for reasons we do not yet understand. Once the cache-miss issue is resolved, the project is done. [Project status]: cilkified, no better cilkification needed, however cache miss needs debugging. [INS]: wait

dedup: This is an online compression package. The data stream is fragmented into packets, and repeated packets are encoded using a protocol. This is encoded using a special technique which will be introduced in the paper. However, the speedup is not good and we need to know why. The writing phase, which takes up to 30% of the running time in serial mode, could be the reason for the slowdown, but this still needs to be nailed down. [Project status]: understood, cilkified, parallelism good (>40), need performance debugging on the writing stage or in general. Bottlenecked on the writing stage, same as the given solution. After removing the writing, it is faster. [INS].
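
To make the shape of the computation concrete, here is a minimal Python sketch of fragment, hash, and ordered write. It is not the technique referred to above and not the PARSEC implementation: fixed-size packets, SHA-1 digests, and the ("raw", ...)/("ref", ...) record format are all assumptions chosen for illustration. Hashing is the part that runs in parallel; the write stage has to see packets in stream order, which is one way to picture why it can become the bottleneck.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def fragment(data, size):
    """Cut the stream into fixed-size packets (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def digest(packet):
    return hashlib.sha1(packet).digest()


def dedup(data, size=4, workers=4):
    packets = fragment(data, size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(digest, packets))  # parallel stage
    seen = {}      # digest -> index of the first occurrence among raw packets
    records = []
    for packet, d in zip(packets, digests):        # serial, ordered write stage
        if d in seen:
            records.append(("ref", seen[d]))       # repeated packet
        else:
            seen[d] = len(seen)
            records.append(("raw", packet))
    return records


def restore(records):
    """Rebuild the original stream from the records."""
    store, parts = [], []
    for kind, value in records:
        if kind == "raw":
            store.append(value)
            parts.append(value)
        else:
            parts.append(store[value])
    return b"".join(parts)
```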

ferret: This is a simple pipeline-style image retrieval engine. The engine consists of a reader, an abstractor, a matcher, a sorter, and a writer. It has been parallelized and achieves good speedup. However, a separate test on a different data set showed poor speedup. We need to retest and make sure the performance numbers are good. [Project status]: our speedup, already quite good with a simple circular buffer parallelization, is still not as good as the given reference. We want to figure out what's holding it back. [INS]: find out bottleneck
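
A pipeline of this shape, with a small bounded buffer between consecutive stages, can be sketched in Python as below. The stage functions and the three-entry DATABASE are invented for illustration and have nothing to do with ferret's real feature extraction; queue.Queue with a maxsize plays the role of the circular buffer, and the feeder thread plays the role of the reader.

```python
import queue
import threading

DONE = object()  # sentinel that tells each stage to shut down

# Hypothetical image database: name -> feature vector.
DATABASE = {"a": (0, 0), "b": (3, 4), "c": (1, 1)}


def abstractor(query):
    name, pixels = query
    return name, (min(pixels), max(pixels))  # toy feature vector


def matcher(item):
    name, feat = item
    scored = [
        (sum((f - g) ** 2 for f, g in zip(feat, vec)), key)
        for key, vec in DATABASE.items()
    ]
    return name, scored


def sorter(item):
    name, scored = item
    return name, [key for _, key in sorted(scored)]


def writer(item):
    name, ranked = item
    return f"{name}: {' '.join(ranked)}"


def stage(fn, inbox, outbox):
    """Run one pipeline stage until the sentinel arrives."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(fn(item))


def run_pipeline(items, stages, buffer_size=2):
    queues = [queue.Queue(maxsize=buffer_size) for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]

    def feed():  # the "reader" stage
        for item in items:
            queues[0].put(item)
        queues[0].put(DONE)

    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()
    results = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

With one thread per stage, throughput is limited by the slowest stage, so a comparison of per-stage times is a natural first step when looking for the bottleneck.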

Summarization

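Benchmarks (table columns, in order): x264 | facesim | raytrace | vips | freqmine | fluidanimate | streamcluster | bodytrack | canneal | swaptions | blackscholes | dedup | ferret

Rows with fewer than 13 entries had blank cells; their entries are listed in column order, and the column of each entry is not recorded.

Ready to write: pseudo code + speed up | pseudo code + speed up | √ | pseudo code | √ | tune like fluidanimate | pseudo code | √ | √ | √ speed up | √ | √ speed up

Data size: x264: 43M video | facesim: 370k vertices, 850k tetras | raytrace: 1Mx1M image, 3 frames | vips: 56M image | freqmine: 31M input | fluidanimate: 300K particles, 5 frames | streamcluster: 16K 128-dim points | bodytrack: 4 frames, 4k particles | canneal: 400k nodes, 128 steps | swaptions: 64, 20k | blackscholes: 64K stocks | dedup: 185M data stream | ferret: 256 queries

A (any) cilkification solution ready: √ | √ | √ | √ | √ | √ | √ | √ | √ | √ | √ | √

Parallelism: weird, it outputs 0 | error "Not a valid ELF binary" | same bug | 8.93 (cilk region 3792.82, burdened 730.48) | weird, 0.07 | 64 | 3.25 | 56 | 40

Their speedup at 8 cores: 6.95 | 7.44 (icc) | 6.58 | 5.51 | 7.21 | 4.06 | 8.24 | 6.97 | 7.89 | 1.92 | 7.35

Speedup at 8 cores GCC: 3.43 | will not work with gcc | 3.60

Speedup at 8 cores ICC: 5.74 | 7.38 | 7.28 | 3 | 7.27 | 6.42 | 2.16 | 2.01 | 4.44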

Notes

Our parallel version's cache misses are 1.2x theirs.

coloring, two scheme swaps. how few colors can you get in 3D. (2D 1/4 to 1/5 for tie breaking. instead of being 1/9. (use 3D package 1/14th.)

other charts:
1. Graphics (charts): work out what the graphics are | bar graph for speedup | graph comparing the effort put in | graph of the techniques used to parallelize the benchmarks
2. Think about how this paper is going to work: a baker's dozen. Think about how someone is going to read it. Many people will not read it from beginning to end; they will dive in and read the benchmark they are interested in. | Index: each benchmark can be read on its own; total summarization; chart showing data synchronization (none, locks, software pipelining, histogramming).
3. Start making some of these charts. We want charts that illustrate the story: charts just for the pipeline, pseudocode for the benchmarks, a vignette of each story. Describe the methodology of using tools like cilkprof, cilkview, perf, and cilkscreen (need to run cilkscreen), maybe in a little more than a couple of paragraphs in the introduction (e.g., how one quickly realizes it is this routine). Definitely do the cilkprof (Bradley). Automate with make: typing make goes in, runs all the programs, and dumps a line of LaTeX into a file automatically, instead of transcribing results by hand. Reason for including the pseudocode: it illustrates the algorithm.
4. Do some metaprogramming; it might help (part of the question of being organized).