Intel now allows the use of non-destructive three-operand operations in its SSE and AVX instruction sets.

As shown in Figure \ref{insmix}, the total number of non-bitwise-logic SIMD operations, which involve many memory movements, is 32\% to 34\% lower.
Simply enabling the three-operand form on the existing 128-bit SSE instructions reduced the overall cycle count by 11.7\% to 13.5\%, as shown in Figure \ref{avx}.
While this is a one-time saving, it provides a significant performance improvement that traditional parsers cannot leverage: they do not benefit from the three-operand form, which is designed for the SIMD instruction set, and, as shown in Figure \ref{insmix}, their total number of non-vector instructions does not change.

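In addition to the introduction of 256-bit operations, AVX technology
also changes the structure of the base SSE instructions,
moving from the destructive 2-operand form long used with SSE technologies
to a nondestructive 3-operand form.
In the 2-operand form, the result of an operation overwrites one of its source operands.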
Thus, whenever the subsequent instructions used the values of both $a$ and $b$,
one of them had to be copied beforehand, or reconstituted or reloaded
afterwards, in order to recover the value.
With the 3-operand form, output may be directed to a third register independent
of the source operands, as reflected by the assignment $c = a~\texttt{[op]}~b$.
By avoiding the copying or reconstituting of operand values, a considerable
reduction in instruction count may be possible.
AVX technology makes the 3-operand form available both with the new 256-bit
operations and with the base 128-bit operations of SSE.
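
As an illustration of the difference, consider computing $c = a \wedge b$ while keeping both $a$ and $b$ live.
The listing below is a hand-written sketch in Intel syntax, not output taken from the Parabix2 build; the register assignment ($a$ in \texttt{xmm0}, $b$ in \texttt{xmm1}, $c$ in \texttt{xmm2}) is assumed for the example.
\begin{verbatim}
; Destructive 2-operand SSE form: a copy is needed to preserve a
movdqa xmm2, xmm0          ; c = a
pand   xmm2, xmm1          ; c = c AND b

; Nondestructive 3-operand AVX form on 128-bit registers
vpand  xmm2, xmm0, xmm1    ; c = a AND b
\end{verbatim}
In this small case the 3-operand form removes one register copy per operation whose source operands are both needed afterwards.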

\subsection{256-bit Operations}

The AVX instruction set provided on Sandy Bridge allows the use of 256-bit SIMD registers.
Ideally, only half as many SIMD instructions are needed as in the version that uses the SSE instruction set in three-operand form.
Therefore, Parabix2 should be able to achieve a 50\% performance improvement on SIMD operations, which corresponds to a 26\% to 38\% improvement in total processing time, simply by using the AVX instruction set instead of the SSE instruction set.
However, because Intel focused on implementing floating point operations rather than integer operations, we gain only from bitwise logic operations and SIMD load operations.
As shown in Figure \ref{insmix}, the total number of SIMD instructions executed with the AVX instruction set is 71\% to 79\% of the number executed with the SSE instruction set.
The number of bitwise logic operations, which was expected to drop by 50\%, goes down by only 33\% to 39\%, because bitwise logic operations are also used to simulate 256-bit versions of operations that exist in SSE but are not provided by the AVX instruction set.
As the total number of instructions goes down by 11\% to 23\%, we would expect shorter processing time and better performance.
However, as shown in Figure \ref{avx}, the processing time is longer in every case except the one with 23\% fewer instructions.
The reason is that AVX instructions have longer latency. (cite Agner Fog?)
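
The 26\% to 38\% estimate above rests on a simple proportionality argument, which can be sketched as follows.
The symbol $s$ is introduced here for illustration only: let $s$ be the fraction of total processing time spent in SIMD operations, and assume that time scales linearly with instruction count.
Halving the SIMD work then saves $s/2$ of the total time, so
\[
  \frac{s}{2} = 0.26 \;\Rightarrow\; s = 0.52,
  \qquad
  \frac{s}{2} = 0.38 \;\Rightarrow\; s = 0.76 .
\]
Under these assumptions, the estimate corresponds to SIMD operations accounting for roughly 52\% to 76\% of processing time.
The linear-scaling assumption is exactly what fails when individual instructions have longer latency.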

With the introduction of 256-bit SIMD registers with AVX technology,
one might ideally expect up to a 50\% reduction in the instruction
count for the SIMD workload of Parabix2. However, in the Sandy Bridge
implementation, Intel has focused on implementing floating point
operations as opposed to the integer based operations. That is,
256-bit SIMD is available for loads, stores, bitwise logic and
floating point operations, while SIMD integer operations and shifts are
only available in 128-bit form. Nevertheless, with loads, stores
and bitwise logic comprising a major portion of the Parabix2
SIMD instruction mix, a substantial reduction in instruction count
and consequent performance improvement was anticipated.
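
The idealized expectation can be made concrete with a simple count model.
This is a sketch only: the symbol $f$ is introduced here for illustration and is not a measured quantity.
Let $f$ be the fraction of the 128-bit SIMD instructions that are loads, stores or bitwise logic, and assume that each pair of such instructions is replaced by one 256-bit instruction while all other instructions are unchanged.
The SIMD instruction count relative to the 128-bit version is then
\[
  (1 - f) + \frac{f}{2} \;=\; 1 - \frac{f}{2},
\]
which reaches the ideal value of $0.5$ only when $f = 1$.
For a hypothetical $f = 0.6$, the count would fall to $0.7$ of the 128-bit version.
The model ignores any extra instructions needed to move data between 128-bit and 256-bit forms, so it should be read as an upper bound on the reduction.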

\subsection{Performance Results}

We implemented two versions of Parabix2 using AVX technology. The first
was simply the recompilation of the existing Parabix2 source code
to take advantage of the 3-operand form of AVX instructions while retaining
a uniform 128-bit SIMD processing width. The second involved rewriting