I have found a strange IPC behavior in a test program which benchmarks matrix multiplication using the MPFR library at 53 and 113 bits of precision. The 113-bit version was consistently faster (typically 20-30%), even though it performs more computation. After analysis, I have reduced the problem to the mpfr_mul function.

Here is the assembly extract of where I think the problem is: in the mpfr_mul function, more precisely in the section which performs the 1x1, 2x1 or 2x2 limb multiplication:

cmpq $2, %r9
jg .L21
movq 24(%r14), %rsi
leaq 8(%rbx), %rdi
movq 24(%r13), %rcx
movq (%rsi), %rax
#APP
# 324 "mul.c" 1
mulq (%rcx)
# 0 "" 2
#NO_APP
cmpq $1, %r9
movq %rdx, %r11
movq %rax, (%rbx)
movq %rdx, 8(%rbx)
je .L23
movq 8(%rsi), %rax
#APP
# 334 "mul.c" 1
mulq (%rcx)
# 0 "" 2
# 335 "mul.c" 1
addq %rax,%r11
adcq $0,%rdx
# 0 "" 2
#NO_APP
cmpq $1, -136(%rbp)
movq %rdx, 16(%rbx)
movq %r11, (%rdi)
# je .L189
movq 8(%rcx), %r9
movq (%rsi), %rcx
movq %rcx, %rax
#APP
# 346 "mul.c" 1
mulq %r9
# 0 "" 2
#NO_APP
movq %rdx, %r11
movq %rax, %rcx
movq 8(%rsi), %rax
#APP
# 347 "mul.c" 1
mulq %r9
# 0 "" 2
# 348 "mul.c" 1
addq %rax,%r11
adcq $0,%rdx
# 0 "" 2
#NO_APP
movq 8(%rbx), %rax
movq %rdx, 24(%rbx)
movq 16(%rbx), %rdx
#APP
# 350 "mul.c" 1
addq %rcx,%rax
adcq %r11,%rdx
# 0 "" 2
#NO_APP
movq %rdx, 16(%rbx)
movq %rax, (%rdi)
cmpq %r11, 16(%rbx)
setb %r11b
movzbl %r11b, %r11d
addq 24(%rbx), %r11
movq %r11, 24(%rbx)
.L23:
subq -144(%rbp), %r8
shrq $63, %r11

When I leave the asm as it is (which is what gcc produces, with one little change - the commented-out je .L189 - to better show the problem), I get this performance (using the Linux perf stat -B tool):

23431,087207 task-clock # 0,976 CPUs utilized
2 109 context-switches # 0,000 M/sec
4 CPU-migrations # 0,000 M/sec
11 888 page-faults # 0,001 M/sec
49 043 462 004 cycles # 2,093 GHz [50,06%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
30 713 070 462 instructions # 0,63 insns per cycle [75,02%]
4 492 657 867 branches # 191,739 M/sec [74,99%]
71 968 726 branch-misses # 1,60% of all branches [74,95%]

24,008123640 seconds time elapsed

If I comment out the je .L23 line in the assembly source (a jump which only skips 29 instructions), I get:

12919,383975 task-clock # 0,943 CPUs utilized
1 520 context-switches # 0,000 M/sec
15 CPU-migrations # 0,000 M/sec
11 887 page-faults # 0,001 M/sec
27 032 904 739 cycles # 2,092 GHz [50,04%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
31 976 622 505 instructions # 1,18 insns per cycle [75,04%]
4 734 392 898 branches # 366,457 M/sec [75,03%]
64 698 800 branch-misses # 1,37% of all branches [74,93%]

13,704240040 seconds time elapsed

It runs much faster even though it effectively executes more instructions (the IPC is nearly twice as high, and this is the IPC of the whole program).

I cannot explain such behavior. It has been seen on multiple Intel Core CPUs (not only mine, which is an Intel Core2 Duo T6500). Full benchmark code for Linux is available on demand.

If I replace the je .L23 with an unconditional jump, I get the slow behavior.

If I replace the je .L23 with a nop instruction (or 2, 3, or 4 nops), I get the fast behavior.

We do not recommend running benchmark software because it may show incorrect information. On our side, we have stress test software you can run, and it will diagnose all internal components of the processor.

I want to clarify one thing: I am not trying to test my CPU with stress test software or another benchmark in order to diagnose a possible CPU failure. I am trying to improve my code to get maximum performance from Intel CPUs, and I ran into this behavior, which I cannot explain.