Abstract

This paper presents work toward a technique that enhances the parallel execution of auto-generated OpenMP programs by taking the architecture of the on-chip cache memory into account, thereby achieving higher performance. The technique avoids false sharing in for-loops by generating OpenMP code that dynamically schedules chunks placed one data-cache-line apart per core. It has been found that most parallelization tools do not address significant multicore issues such as false sharing, which can degrade performance. An open-source parallelization tool called Par4All (Parallel for All), which internally uses the PIPS (Parallelization Infrastructure for Parallel Systems) - PoCC (Polyhedral Compiler Collection) integration, has been analyzed and extended to improve hardware utilization. The work focuses only on optimizing the parallelization of for-loops, since loops are the most time-consuming parts of code. The performance of the generated OpenMP programs has been analyzed on different architectures using the Intel® VTune™ Performance Analyzer. Several computationally intensive programs from PolyBench have been tested with different data sets, and the results show that the OpenMP code generated by the enhanced technique achieves considerable speedup. The deliverables include an automation tool, test cases, the corresponding OpenMP programs, and performance analysis reports.
