
2018-11-15

Acknowledgments
This work has been supported by the California Institute for Telecommunications and Information Technology (CALIT2) under Grant number 2014CSRO 136.
Data, experimental design, materials and methods
Tomographic reconstruction in Tomo3D 2.0 includes a new blocking mechanism that exploits the processor cache and reduces cache misses (Fig. 1). Slices and sinograms are divided into small blocks that fit into the different levels of cache memory. Blocks held in cache are reused as much as possible before moving on to others, which minimizes data exchange with main memory. Cache blocking is essential to maximize the performance of tomographic reconstruction with Advanced Vector eXtensions (AVX), where sets of eight slices of the volume are reconstructed simultaneously using these vector instructions [1].
As shown in Fig. 1, our new cache mechanism takes advantage of both the first (L1) and the last (LLC) level of cache. On the one hand, sinograms are divided into blocks of projections whose size is chosen to fit in the LLC. On the other hand, each row of a slice is broken into smaller parts that fit in the L1 cache, according to an integer split factor denoted by ‘splitf’. A part of a row is kept in L1 while it is processed with all projections in the current block of projections, which in turn is kept in the LLC, thereby maximizing the use of cache memory. This splitting of sinograms and slices is applied to the forward and backward projection steps of the SIRT iterative reconstruction algorithm [1]. To evaluate this cache blocking mechanism, we carried out a thorough performance study in which the block sizes for the LLC and L1 cache memories were varied. The results are reported in the following section.
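The loop ordering described above can be sketched as follows. This is a minimal illustration in Python, not the Tomo3D code: the function names, the toy `weight` coefficient and the `proj_block` parameter are hypothetical, each projection is assumed to have the same length as the row, and a real implementation would operate on AVX vectors of eight slices with proper interpolation. The sketch only shows that processing one row part against every projection of the current block, before moving to the next part, produces the same result as the unblocked loop.

```python
def weight(angle_idx, pixel_idx):
    """Toy stand-in for a projection coefficient (hypothetical)."""
    return ((angle_idx * 31 + pixel_idx * 17) % 7) / 7.0


def backproject_row(sinogram, row_len):
    """Unblocked reference: each projection sweeps the whole row."""
    row = [0.0] * row_len
    for a, proj in enumerate(sinogram):
        for x in range(row_len):
            row[x] += weight(a, x) * proj[x]
    return row


def backproject_row_blocked(sinogram, row_len, proj_block, splitf):
    """Blocked variant: projections are grouped into blocks (LLC-sized in
    the real code) and the row is split into `splitf` parts (L1-sized)."""
    row = [0.0] * row_len
    part_len = -(-row_len // splitf)  # ceiling division
    for a0 in range(0, len(sinogram), proj_block):
        block = range(a0, min(a0 + proj_block, len(sinogram)))
        for x0 in range(0, row_len, part_len):
            x1 = min(x0 + part_len, row_len)
            # This row part is reused for every projection in the block
            # before the next part is touched.
            for a in block:
                proj = sinogram[a]
                for x in range(x0, x1):
                    row[x] += weight(a, x) * proj[x]
    return row
```

Because each pixel still accumulates its contributions in the same projection order, the blocked and unblocked versions agree exactly; only the memory access pattern changes.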
For the evaluation, we used two platforms based on the Intel Sandy Bridge microarchitecture. The first, referred to as ‘Platform 1’, was a standard desktop computer with a quad-core Intel Core i7-2600 at 3.4 GHz, with 32 kB of L1 cache per core and 8 MB of LLC (the third level of cache, shared by the four cores). The second, ‘Platform 2’, was a cluster node with two eight-core Intel Xeon E5-2650 processors at 2 GHz, for a total of 16 cores, with 32 kB of L1 cache per core and 20 MB of LLC per CPU (i.e. shared by its eight cores). We used datasets whose sizes are representative of current structural studies by electron tomography: tilt-series of 140 images of 2048×2048 and 4096×4096 pixels, in the tilt range [−70°, 69°], which yield reconstructed volumes of 2048×2048×256 and 4096×4096×256 voxels, respectively. In the following, they are referred to as the 2K and 4K datasets.
The cache mechanism is included in Tomo3D 2.0 [1], which was compiled with the Intel C/C++ Compiler and run under Linux. The evaluation was based on 15 iterations of SIRT using AVX instructions (i.e. eight slices reconstructed simultaneously), and all combinations of platforms and datasets were covered. To make the analysis more general, we evaluated two situations: in the first, threads were created to use all cores available in a chip (denoted by 4T in Platform 1 and 8T in Platform 2); in the second, only half of the cores were used (2T and 4T, respectively). All experiments were carried out five times, and the average computation times were then calculated.
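The experimental grid just described can be enumerated as in the sketch below. It is only an illustration of the bookkeeping, under the assumption that each configuration is timed by some callable; the names `PLATFORM_THREADS`, `configurations`, `average_time` and `run_once` are hypothetical and are not part of Tomo3D.

```python
from itertools import product
from statistics import mean

# Thread counts per platform: all cores of a chip, then half of them.
PLATFORM_THREADS = {"Platform 1": (4, 2), "Platform 2": (8, 4)}
DATASETS = ("2K", "4K")
REPEATS = 5  # each experiment was carried out five times


def configurations():
    """Yield every (platform, dataset, threads) combination."""
    for platform, threads in PLATFORM_THREADS.items():
        for dataset, n_threads in product(DATASETS, threads):
            yield platform, dataset, n_threads


def average_time(run_once, repeats=REPEATS):
    """Average the time returned by `run_once` (a hypothetical callable
    that performs one timed reconstruction) over `repeats` runs."""
    return mean(run_once() for _ in range(repeats))
```

In the study, each of these combinations was additionally swept over the LLC and L1 block sizes, which this sketch leaves out.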
Tuning the cache memory usage
Figs. 2 and 3 show the processing times obtained with the 2K and 4K datasets, respectively. Times are expressed as a percentage of the slowest one in each plot. Results from Platform 1 are shown at the top and those from Platform 2 at the bottom; results using all or half of the cores in a chip appear in the left and right columns, respectively. Each plot includes a curve for the L1 block size corresponding to the native row size of a slice, which is equivalent to a ‘splitf’ of 1. This block size is 128 kB for the 2K dataset and 256 kB for the 4K one (i.e. as many vectors of eight components as the row size, plus those for the symmetric pixels, using 32-bit floating point numbers). The plots also include curves for L1 block sizes of 32, 16, 8 and 4 kB, which correspond to ‘splitf’ factors of 4, 8, 16 and 32 for the 2K dataset and 8, 16, 32 and 64 for the 4K dataset. Note that 32 kB is the size of the L1 cache available in each core.
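The block sizes quoted above can be checked with a short calculation. The sketch below assumes that the symmetric pixels double the number of vectors per row (a factor of 2 that reproduces the stated 128 kB and 256 kB figures) and that 1 kB is 1024 bytes; all function and constant names are hypothetical. It also shows the normalization to the slowest time used in the plots.

```python
AVX_LANES = 8        # eight slices reconstructed simultaneously
FLOAT_BYTES = 4      # 32-bit floating point numbers
SYMMETRY_FACTOR = 2  # assumption: symmetric pixels double the vectors


def native_row_bytes(row_pixels):
    """Bytes needed for one full slice row of AVX vectors."""
    return row_pixels * AVX_LANES * FLOAT_BYTES * SYMMETRY_FACTOR


def l1_block_kb(row_pixels, splitf):
    """L1 block size in kB when a row is split into `splitf` parts."""
    return native_row_bytes(row_pixels) / splitf / 1024


def relative_times(times):
    """Express each time as a percentage of the slowest one."""
    slowest = max(times)
    return [100.0 * t / slowest for t in times]
```

For example, `l1_block_kb(2048, 4)` and `l1_block_kb(4096, 8)` both give 32 kB, which matches the L1 cache size of each core.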