Monday, June 16, 2014

Micro-benchmarking .NET Native and RyuJit

[Disclaimer] As both RyuJIT and .NET Native are improving a lot between updates/previews, the benchmark results in this post may no longer be relevant. The benchmarks were run with the following versions:

RyuJIT CTP4

.NET Native developer preview 2

While the .NET JIT performs quite well on Windows, it is still behind a fully optimized C++ program (though an efficient compiled program is not only about code generation, but also about memory management and data locality). The .NET team has recently introduced two new technologies that might help on the code-gen side: .NET Native, an offline .NET compiler (similar to ngen, but using the same backend optimizer as the C++ compiler), and the next generation of the .NET JIT, called "RyuJit". In this post I would like to present the results of some micro-benchmarks that roughly evaluate the performance benefits of these two new technologies.

First of all, you may have already read a few benchmark results about RyuJit and .NET Native; here is a non-exhaustive list I have found (if you have more pointers, let me know!):

The micro-benchmark protocol

Micro-benchmarking is not the best way to measure overall benefits, but it can help to dig into particular patterns. For this benchmark, I haven't developed a new suite, but instead built a "freak-benchmark" composed of some micro-benchmarks I found on the Internet, mainly:

Two custom benchmarks measuring the cost of interop, which matters in cases where you call lots of native methods (as is the case when using SharpDX, for example)

RyuJit also comes with SIMD support, but I will reserve a dedicated post to test this new feature.

I don't claim that these micro-benchmarks are exhaustive, nor that they are all correctly implemented (some of the JavaGrande benchmarks don't seem robust), but as we are measuring relative performance, that should be fine. In the end we just want to know how a program compiled with .NET Native or RyuJit performs compared to the same program running on the legacy JIT.

Also, as both .NET Native and RyuJit are still in development, we can't really draw any definitive conclusions.

.NET Native is only available for Windows Store apps, while RyuJit is only available on x64, hence the platforms tested in this bench are:

.NET 32 Desktop

.NET 32 AppStore

.NET 32 AppStore Native

.NET 64 Desktop

.NET 64 AppStore

.NET 64 AppStore Native

.NET 64 Desktop RyuJit

Both .NET 32 and .NET 64 use .NET Framework 4.5.1. The machine is an Intel(R) Core(TM) i7-4770 CPU @ 3.4GHz with 16GB of RAM.

Comparison .NET32 (x86)

Results are normalized relative to the desktop platform. Higher is better (2.0 means that a test runs 2 times faster than on desktop). I checked that the standard deviation stayed within a reasonable range.

In green, results above +10%

In red, results below -10%
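For reference, the normalization used in the tables is just a ratio of elapsed times. A minimal sketch (the method name and timings below are illustrative, not the actual benchmark code):

```csharp
using System;

static class Normalization
{
    // Score relative to the desktop baseline: 2.0 means the platform ran
    // the test twice as fast as desktop, 0.5 means twice as slow.
    public static double Score(double desktopTimeMs, double platformTimeMs)
        => desktopTimeMs / platformTimeMs;

    static void Main()
    {
        Console.WriteLine(Score(100.0, 50.0));  // twice as fast -> 2
        Console.WriteLine(Score(100.0, 125.0)); // 25% slower    -> 0.8
    }
}
```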

| Name | .NET 32 (Desktop) | .NET 32 (AppStore) | .NET 32 Native (AppStore) |
| --- | --- | --- | --- |
| 00-Big int Dictionary: 1 Adding items | 1.00 | 0.90 | 0.72 |
| 00-Big int Dictionary: 2 Running queries | 1.00 | 0.96 | 0.81 |
| 00-Big int Dictionary: 3 Removing items | 1.00 | 0.96 | 0.92 |
| 01-Big string Dictionary: 0 Ints to strings | 1.00 | 0.79 | 1.13 |
| 01-Big string Dictionary: 1 Adding/setting | 1.00 | 0.91 | 1.02 |
| 01-Big string Dictionary: 2 Running queries | 1.00 | 0.83 | 1.07 |
| 01-Big string Dictionary: 3 Removing items | 1.00 | 0.85 | 1.05 |
| 02-Big int sorted map: 1 Adding items | 1.00 | 1.01 | 0.89 |
| 02-Big int sorted map: 2 Running queries | 1.00 | 1.01 | 0.90 |
| 02-Big int sorted map: 3 Removing items | 1.00 | 1.01 | 0.77 |
| 03-Square root: double | 1.00 | 1.00 | 1.00 |
| 03-Square root: FPL16 | 1.00 | 1.02 | 0.97 |
| 03-Square root: uint | 1.00 | 1.03 | 0.97 |
| 03-Square root: ulong | 1.00 | 1.01 | 0.94 |
| 04-Simple arithmetic: double | 1.00 | 1.01 | 0.51 |
| 04-Simple arithmetic: float | 1.00 | 1.01 | 0.86 |
| 04-Simple arithmetic: FPI8 | 1.00 | 0.99 | 1.23 |
| 04-Simple arithmetic: FPL16 | 1.00 | 1.01 | 1.55 |
| 04-Simple arithmetic: int | 1.00 | 1.02 | 2.04 |
| 04-Simple arithmetic: long | 1.00 | 0.99 | 0.83 |
| 05-Generic sum: double | 1.00 | 1.01 | 3.34 |
| 05-Generic sum: FPI8 | 1.00 | 1.01 | 1.30 |
| 05-Generic sum: int | 1.00 | 1.01 | 0.93 |
| 05-Generic sum: int via IMath | 1.00 | 1.01 | 0.57 |
| 05-Generic sum: int without generics | 1.00 | 1.00 | 1.10 |
| 06-Simple parsing: 3 Parse (x1000000) | 1.00 | 1.00 | 1.00 |
| 06-Simple parsing: 4 Sort (x1000000) | 1.00 | 1.00 | 1.09 |
| 07-Trivial method calls: Interface NoOp | 1.00 | 1.00 | 0.71 |
| 07-Trivial method calls: No-inline NoOp | 1.00 | 1.00 | 1.03 |
| 07-Trivial method calls: Static NoOp | 1.00 | 1.00 | no-op |
| 07-Trivial method calls: Virtual NoOp | 1.00 | 1.11 | 1.07 |
| 08-Matrix multiply: [n][n] | 1.00 | 1.00 | 2.43 |
| 08-Matrix multiply: [n][n] | 1.00 | 1.00 | 2.68 |
| 08-Matrix multiply: [n][n] | 1.00 | 0.95 | 0.59 |
| 08-Matrix multiply: Array2D | 1.00 | 1.01 | 0.64 |
| 08-Matrix multiply: double[n*n] | 1.00 | 1.01 | 0.78 |
| 08-Matrix multiply: double[n][n] | 1.00 | 1.01 | 1.04 |
| 08-Matrix multiply: int[n][n] | 1.00 | 1.00 | 1.21 |
| 09-Sudoku | 1.00 | 1.00 | 1.04 |
| 10-Polynomials | 1.00 | 1.00 | 1.03 |
| 11-JGFArithBench | 1.00 | 1.00 | 23.81 |
| 12-JGFAssignBench | 1.00 | 0.80 | 1.20 |
| 13-JGFCastBench | 1.00 | 1.00 | 1.27 |
| 14-JGFCreateBench | 1.00 | 0.98 | 0.81 |
| 15-JGFFFTBench | 1.00 | 1.01 | 0.99 |
| 16-JGFHeapSortBench | 1.00 | 0.99 | 1.01 |
| 17-JGFLoopBench | 1.00 | 1.00 | 1.04 |
| 18-JGFRayTracerBench | 1.00 | 0.98 | 0.88 |
| 19-float4x4 matrix mul, Managed Standard | 1.00 | 1.00 | 0.63 |
| 20-float4x4 matrix mul, Managed unsafe | 1.00 | 1.01 | 0.96 |
| 21-float4x4 matrix mul, Interop Standard | 1.00 | 1.23 | 1.42 |
| 22-float4x4 matrix mul, Interop SSE2 | 1.00 | 1.36 | 1.82 |
| 23-managed add | 1.00 | 1.01 | 7.00 |
| 24-managed no-inline add | 1.00 | 1.00 | 1.10 |
| 25-interop add | 1.00 | 1.01 | 1.21 |
| 26-interop indirect add | 1.00 | 1.00 | 2.26 |

Quick analysis

We would probably expect a column full of green lights for .NET Native, but this is unfortunately not the case! Some notes:

.NET Native is as efficient as a C++ compiler at coalescing arithmetic instructions (tests 11 and 23). Basically, in test 23 it is able to reduce the addition chain x+=1, x+=2, x+=-3, x+=1, x+=2, x+=-3, x+=1 to a single x+=1, resulting in some impressive speedups. Coalescing of instructions is probably the factor helping in most of the tests here.
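A sketch of the kind of loop body test 23 measures (reconstructed from the description above, not the exact benchmark source). An optimizing backend can fold the seven additions into a single increment per iteration, which is exactly the speedup observed:

```csharp
using System;

static class CoalescingDemo
{
    public static int AddChain(int iterations)
    {
        int x = 0;
        for (int i = 0; i < iterations; i++)
        {
            // The net effect of these seven additions is x += 1, which an
            // aggressive optimizer can coalesce into a single instruction.
            x += 1; x += 2; x += -3;
            x += 1; x += 2; x += -3;
            x += 1;
        }
        return x;
    }

    static void Main() => Console.WriteLine(AddChain(1000)); // 1000
}
```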

Pure interop seems slightly more efficient, which is good whenever we are frequently calling native functions (like when using SharpDX/Direct3D11). Note that indirect interop (a DllImport wrapped by another function) is also faster, which is great: with the current interop, wrapped DllImport methods are not inlined by the JIT, resulting in lots of duplicated prologue/epilogue code for managed/unmanaged transitions (whereas when the wrapper is correctly inlined, consecutive interop calls can share the managed/unmanaged context switch).
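A minimal sketch of the direct vs wrapped interop patterns (the shape of tests 25/26). The native function here is the C runtime's `abs` from msvcrt.dll as a Windows-only stand-in; the real benchmark uses its own native library:

```csharp
using System;
using System.Runtime.InteropServices;

static class InteropDemo
{
    // Direct P/Invoke (test 25 style): one managed/unmanaged transition
    // per call, with the marshaling stub generated at the call site.
    [DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern int abs(int value);

    // Indirect interop (test 26 style): the DllImport is wrapped by a
    // managed method. If the JIT does not inline the wrapper, each call
    // pays an extra prologue/epilogue on top of the transition itself.
    public static int WrappedAbs(int value) => abs(value);

    static void Main() => Console.WriteLine(WrappedAbs(-42)); // 42
}
```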

Some tests are up to 2x slower with .NET Native, though I haven't looked at the generated x86 code.

Overall it is still promising: we can see significant boosts in some tests, while others perform a bit worse.

Comparison .NET64 (x64)

Comparison between:

.NET 64 Desktop

.NET 64 AppStore

.NET 64 AppStore Native

.NET 64 Desktop RyuJit

Results are normalized relative to the desktop platform. Higher is better (2.0 means that a test runs 2 times faster than on desktop).

In green, results above +10%

In red, results below -10%

| Name | .NET 64 (Desktop) | .NET 64 (AppStore) | .NET 64 Native (AppStore) | .NET 64 RyuJit (Desktop) |
| --- | --- | --- | --- | --- |
| 00-Big int Dictionary: 1 Adding items | 1.00 | 1.04 | 1.00 | 1.02 |
| 00-Big int Dictionary: 2 Running queries | 1.00 | 0.91 | 1.00 | 0.95 |
| 00-Big int Dictionary: 3 Removing items | 1.00 | 1.00 | 0.95 | 0.95 |
| 01-Big string Dictionary: 0 Ints to strings | 1.00 | 0.69 | 0.72 | 1.00 |
| 01-Big string Dictionary: 1 Adding/setting | 1.00 | 0.85 | 0.84 | 0.99 |
| 01-Big string Dictionary: 2 Running queries | 1.00 | 0.82 | 0.90 | 0.95 |
| 01-Big string Dictionary: 3 Removing items | 1.00 | 0.81 | 0.91 | 1.00 |
| 02-Big int sorted map: 1 Adding items | 1.00 | 0.98 | 1.10 | 1.04 |
| 02-Big int sorted map: 2 Running queries | 1.00 | 1.02 | 1.06 | 0.97 |
| 02-Big int sorted map: 3 Removing items | 1.00 | 1.01 | 1.02 | 1.16 |
| 03-Square root: double | 1.00 | 1.00 | 1.00 | 1.00 |
| 03-Square root: FPL16 | 1.00 | 1.01 | 1.15 | 1.03 |
| 03-Square root: uint | 1.00 | 1.00 | 0.97 | 0.94 |
| 03-Square root: ulong | 1.00 | 1.00 | 1.15 | 0.95 |
| 04-Simple arithmetic: double | 1.00 | 1.00 | 4.20 | 1.10 |
| 04-Simple arithmetic: float | 1.00 | 1.00 | 1.36 | 0.99 |
| 04-Simple arithmetic: FPI8 | 1.00 | 1.00 | 0.91 | 1.42 |
| 04-Simple arithmetic: FPL16 | 1.00 | 0.96 | 1.21 | 5.19 |
| 04-Simple arithmetic: int | 1.00 | 1.00 | 0.83 | 0.89 |
| 04-Simple arithmetic: long | 1.00 | 1.00 | 0.96 | 0.93 |
| 05-Generic sum: double | 1.00 | 1.00 | 1.34 | 1.33 |
| 05-Generic sum: FPI8 | 1.00 | 1.00 | 1.29 | 1.00 |
| 05-Generic sum: int | 1.00 | 0.98 | 1.28 | 0.99 |
| 05-Generic sum: int via IMath | 1.00 | 1.00 | 0.65 | 0.99 |
| 05-Generic sum: int without generics | 1.00 | 1.00 | 1.70 | 1.00 |
| 06-Simple parsing: 3 Parse (x1000000) | 1.00 | 1.00 | 0.50 | 1.00 |
| 06-Simple parsing: 4 Sort (x1000000) | 1.00 | 0.95 | 1.30 | 0.95 |
| 07-Trivial method calls: Interface NoOp | 1.00 | 1.00 | 0.69 | 0.85 |
| 07-Trivial method calls: No-inline NoOp | 1.00 | 0.92 | 0.96 | 0.96 |
| 07-Trivial method calls: Static NoOp | 1.00 | 1.00 | Not Applicable | 0.20 |
| 07-Trivial method calls: Virtual NoOp | 1.00 | 1.00 | 0.92 | 0.74 |
| 08-Matrix multiply: [n][n] | 1.00 | 0.99 | 1.14 | 1.17 |
| 08-Matrix multiply: [n][n] | 1.00 | 1.00 | 5.01 | 4.95 |
| 08-Matrix multiply: [n][n] | 1.00 | 1.00 | 1.34 | 1.16 |
| 08-Matrix multiply: Array2D | 1.00 | 1.00 | 3.83 | 2.75 |
| 08-Matrix multiply: double[n*n] | 1.00 | 1.00 | 1.00 | 1.00 |
| 08-Matrix multiply: double[n][n] | 1.00 | 0.99 | 0.96 | 0.98 |
| 08-Matrix multiply: int[n][n] | 1.00 | 1.00 | 1.19 | 1.12 |
| 09-Sudoku | 1.00 | 1.00 | 1.38 | 1.48 |
| 10-Polynomials | 1.00 | 1.00 | 0.94 | 0.99 |
| 11-JGFArithBench | 1.00 | 1.00 | 1.02 | 1.12 |
| 12-JGFAssignBench | 1.00 | 1.00 | 1.02 | 0.53 |
| 13-JGFCastBench | 1.00 | 1.00 | 0.99 | 1.39 |
| 14-JGFCreateBench | 1.00 | 0.96 | 0.81 | 0.99 |
| 15-JGFFFTBench | 1.00 | 1.16 | 1.18 | 1.16 |
| 16-JGFHeapSortBench | 1.00 | 1.00 | 1.01 | 0.99 |
| 17-JGFLoopBench | 1.00 | 1.00 | 1.08 | 1.01 |
| 18-JGFRayTracerBench | 1.00 | 1.00 | 0.87 | 1.13 |
| 19-float4x4 matrix mul, Managed Standard | 1.00 | 0.99 | 1.04 | 1.36 |
| 20-float4x4 matrix mul, Managed unsafe | 1.00 | 1.01 | 0.90 | 1.00 |
| 21-float4x4 matrix mul, Interop Standard | 1.00 | 1.20 | 1.46 | 1.03 |
| 22-float4x4 matrix mul, Interop SSE2 | 1.00 | 1.36 | 1.92 | 1.05 |
| 23-managed add | 1.00 | 1.00 | 1.00 | 4.05 |
| 24-managed no-inline add | 1.00 | 0.89 | 1.48 | 1.00 |
| 25-interop add | 1.00 | 0.99 | 1.17 | 1.11 |
| 26-interop indirect add | 1.00 | 1.00 | 1.28 | 0.38 |

Quick analysis

The x64 code gen fares slightly better than x86: .NET Native x64 and RyuJit on average perform better than their JIT counterpart. Some notes:

Unexpectedly, coalescing of arithmetic instructions (tests 11 and 23) is not happening for .NET Native, but it is for RyuJit.

Performance on float/double is better; most likely the SSE registers are put to better use.

The Sudoku test gets a nice 40-50% speedup with .NET Native and RyuJit.

Comparison .NET32 Native vs .NET64 Native

Here .NET 32 Native is used as the reference (1.0) and compared to .NET 64 Native. Results are normalized relative to .NET 32 Native. Higher is better (2.0 means that a test on x64 Native runs 2 times faster than on x86 Native).

In green, results above +10%

In red, results below -10%

| Name | .NET 32 Native (AppStore) | .NET 64 Native vs .NET 32 Native |
| --- | --- | --- |
| 00-Big int Dictionary: 1 Adding items | 1.00 | 1.31 |
| 00-Big int Dictionary: 2 Running queries | 1.00 | 1.35 |
| 00-Big int Dictionary: 3 Removing items | 1.00 | 1.19 |
| 01-Big string Dictionary: 0 Ints to strings | 1.00 | 1.13 |
| 01-Big string Dictionary: 1 Adding/setting | 1.00 | 1.22 |
| 01-Big string Dictionary: 2 Running queries | 1.00 | 1.17 |
| 01-Big string Dictionary: 3 Removing items | 1.00 | 1.16 |
| 02-Big int sorted map: 1 Adding items | 1.00 | 1.01 |
| 02-Big int sorted map: 2 Running queries | 1.00 | 0.99 |
| 02-Big int sorted map: 3 Removing items | 1.00 | 0.96 |
| 03-Square root: double | 1.00 | 1.00 |
| 03-Square root: FPL16 | 1.00 | 2.10 |
| 03-Square root: uint | 1.00 | 1.12 |
| 03-Square root: ulong | 1.00 | 2.12 |
| 04-Simple arithmetic: double | 1.00 | 2.07 |
| 04-Simple arithmetic: float | 1.00 | 1.40 |
| 04-Simple arithmetic: FPI8 | 1.00 | 1.00 |
| 04-Simple arithmetic: FPL16 | 1.00 | 1.49 |
| 04-Simple arithmetic: int | 1.00 | 0.97 |
| 04-Simple arithmetic: long | 1.00 | 7.65 |
| 05-Generic sum: double | 1.00 | 1.01 |
| 05-Generic sum: FPI8 | 1.00 | 1.00 |
| 05-Generic sum: int | 1.00 | 1.01 |
| 05-Generic sum: int via IMath | 1.00 | 1.07 |
| 05-Generic sum: int without generics | 1.00 | 1.12 |
| 06-Simple parsing: 3 Parse (x1000000) | 1.00 | 1.00 |
| 06-Simple parsing: 4 Sort (x1000000) | 1.00 | 1.96 |
| 07-Trivial method calls: Interface NoOp | 1.00 | 1.00 |
| 07-Trivial method calls: No-inline NoOp | 1.00 | 1.20 |
| 07-Trivial method calls: Static NoOp | 1.00 | not applicable |
| 07-Trivial method calls: Virtual NoOp | 1.00 | 1.16 |
| 08-Matrix multiply: [n][n] | 1.00 | 1.29 |
| 08-Matrix multiply: [n][n] | 1.00 | 1.24 |
| 08-Matrix multiply: [n][n] | 1.00 | 2.38 |
| 08-Matrix multiply: Array2D | 1.00 | 1.99 |
| 08-Matrix multiply: double[n*n] | 1.00 | 1.30 |
| 08-Matrix multiply: double[n][n] | 1.00 | 1.00 |
| 08-Matrix multiply: int[n][n] | 1.00 | 2.37 |
| 09-Sudoku | 1.00 | 1.13 |
| 10-Polynomials | 1.00 | 0.99 |
| 11-JGFArithBench | 1.00 | 0.61 |
| 12-JGFAssignBench | 1.00 | 0.86 |
| 13-JGFCastBench | 1.00 | 1.57 |
| 14-JGFCreateBench | 1.00 | 0.93 |
| 15-JGFFFTBench | 1.00 | 1.05 |
| 16-JGFHeapSortBench | 1.00 | 1.08 |
| 17-JGFLoopBench | 1.00 | 0.97 |
| 18-JGFRayTracerBench | 1.00 | 1.07 |
| 19-float4x4 matrix mul, Managed Standard | 1.00 | 1.34 |
| 20-float4x4 matrix mul, Managed unsafe | 1.00 | 1.02 |
| 21-float4x4 matrix mul, Interop Standard | 1.00 | 1.09 |
| 22-float4x4 matrix mul, Interop SSE2 | 1.00 | 1.15 |
| 23-managed add | 1.00 | 0.16 |
| 24-managed no-inline add | 1.00 | 1.35 |
| 25-interop add | 1.00 | 1.15 |
| 26-interop indirect add | 1.00 | 1.15 |

Quick analysis

.NET 64 Native code gen is better than .NET 32 Native code gen. I haven't dug into the generated code, but the extra registers on x64 probably help the optimizer, while x86 is still fighting with a limited register set (and the x86 code is not using SSE instructions, which doesn't help). It is also good to see that interop is better on x64, which is not the case for the x64 JIT, where it is usually much slower.

Comparison .NET64 Native vs .NET64 RyuJit

Here .NET 64 Native is used as the reference (1.0) and compared to .NET 64 RyuJit. Results are normalized relative to .NET 64 Native. Higher is better (2.0 means that a test on x64 RyuJit runs 2 times faster than on x64 Native).

In green, results above +10%

In red, results below -10%

| Name | .NET 64 Native (AppStore) | .NET 64 RyuJit vs .NET 64 Native |
| --- | --- | --- |
| 00-Big int Dictionary: 1 Adding items | 1.00 | 1.02 |
| 00-Big int Dictionary: 2 Running queries | 1.00 | 0.95 |
| 00-Big int Dictionary: 3 Removing items | 1.00 | 1.00 |
| 01-Big string Dictionary: 0 Ints to strings | 1.00 | 1.38 |
| 01-Big string Dictionary: 1 Adding/setting | 1.00 | 1.18 |
| 01-Big string Dictionary: 2 Running queries | 1.00 | 1.05 |
| 01-Big string Dictionary: 3 Removing items | 1.00 | 1.10 |
| 02-Big int sorted map: 1 Adding items | 1.00 | 0.94 |
| 02-Big int sorted map: 2 Running queries | 1.00 | 0.91 |
| 02-Big int sorted map: 3 Removing items | 1.00 | 1.14 |
| 03-Square root: double | 1.00 | 1.00 |
| 03-Square root: FPL16 | 1.00 | 0.89 |
| 03-Square root: uint | 1.00 | 0.97 |
| 03-Square root: ulong | 1.00 | 0.83 |
| 04-Simple arithmetic: double | 1.00 | 0.26 |
| 04-Simple arithmetic: float | 1.00 | 0.73 |
| 04-Simple arithmetic: FPI8 | 1.00 | 1.56 |
| 04-Simple arithmetic: FPL16 | 1.00 | 4.31 |
| 04-Simple arithmetic: int | 1.00 | 1.07 |
| 04-Simple arithmetic: long | 1.00 | 0.96 |
| 05-Generic sum: double | 1.00 | 0.99 |
| 05-Generic sum: FPI8 | 1.00 | 0.78 |
| 05-Generic sum: int | 1.00 | 0.78 |
| 05-Generic sum: int via IMath | 1.00 | 1.53 |
| 05-Generic sum: int without generics | 1.00 | 0.59 |
| 06-Simple parsing: 3 Parse (x1000000) | 1.00 | 2.00 |
| 06-Simple parsing: 4 Sort (x1000000) | 1.00 | 0.73 |
| 07-Trivial method calls: Interface NoOp | 1.00 | 1.24 |
| 07-Trivial method calls: No-inline NoOp | 1.00 | 1.00 |
| 07-Trivial method calls: Static NoOp | 1.00 | 0.00 |
| 07-Trivial method calls: Virtual NoOp | 1.00 | 0.81 |
| 08-Matrix multiply: [n][n] | 1.00 | 1.03 |
| 08-Matrix multiply: [n][n] | 1.00 | 0.99 |
| 08-Matrix multiply: [n][n] | 1.00 | 0.86 |
| 08-Matrix multiply: Array2D | 1.00 | 0.72 |
| 08-Matrix multiply: double[n*n] | 1.00 | 0.99 |
| 08-Matrix multiply: double[n][n] | 1.00 | 1.02 |
| 08-Matrix multiply: int[n][n] | 1.00 | 0.94 |
| 09-Sudoku | 1.00 | 1.07 |
| 10-Polynomials | 1.00 | 1.05 |
| 11-JGFArithBench | 1.00 | 1.10 |
| 12-JGFAssignBench | 1.00 | 0.52 |
| 13-JGFCastBench | 1.00 | 1.40 |
| 14-JGFCreateBench | 1.00 | 1.23 |
| 15-JGFFFTBench | 1.00 | 0.98 |
| 16-JGFHeapSortBench | 1.00 | 0.98 |
| 17-JGFLoopBench | 1.00 | 0.93 |
| 18-JGFRayTracerBench | 1.00 | 1.30 |
| 19-float4x4 matrix mul, Managed Standard | 1.00 | 1.31 |
| 20-float4x4 matrix mul, Managed unsafe | 1.00 | 1.12 |
| 21-float4x4 matrix mul, Interop Standard | 1.00 | 0.71 |
| 22-float4x4 matrix mul, Interop SSE2 | 1.00 | 0.55 |
| 23-managed add | 1.00 | 4.05 |
| 24-managed no-inline add | 1.00 | 0.68 |
| 25-interop add | 1.00 | 0.95 |
| 26-interop indirect add | 1.00 | 0.30 |

Quick analysis

Surprisingly, RyuJit performs quite well, sometimes even better than .NET 64 Native. It might be interesting to dig into this.

Summary

As both .NET Native and RyuJit are still in alpha/beta stages, we can't really draw any definitive conclusions here. We can see a trend of improvements in some specific areas, while some tests still perform a bit worse than on the legacy JIT. [Edit] The release of .NET Native Developer Preview 3 on June 30, 2014 shows some improvements in code gen, so .NET Native and RyuJIT are definitely being improved between updates, and it is great! [/Edit]

It is good to see .NET 64 getting better and performing well with .NET Native and RyuJit. Until now I have been a bit reluctant to use it, but it now looks more robust compared to the x86 code gen.

While code gen can undoubtedly be improved with offline compilers or a more modern JIT like RyuJit, we probably can't expect the moon. As I said in the introduction, code gen is only one part of the overall performance cake. The other part, most likely not yet covered by these new compiler architectures, is data locality: things like the ability to create fat objects (embedding object instances directly into another instance) or to create short-lived objects (not value types) on the stack instead of the heap are still areas where .NET could be improved. I will hopefully take more time in a future post to explain why this is an important area of improvement and what could be done.
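To make the data-locality point concrete, the closest today's C# gets to "fat objects" is embedding value types: a struct field lives inline in its container, while a class field is a separate heap allocation behind a reference. The types below are illustrative, not from any benchmark:

```csharp
using System;

struct PointStruct { public double X, Y; }  // value type: stored inline
class PointClass { public double X, Y; }    // reference type: own heap object

// One heap object: the PointStruct data is embedded in the node itself,
// so reading Position.X needs no extra pointer chase.
class NodeWithStruct { public PointStruct Position; }

// Two heap objects: the node plus a separately allocated PointClass,
// so reading Position.X dereferences a second object, hurting locality.
class NodeWithClass { public PointClass Position = new PointClass(); }

static class LocalityDemo
{
    static void Main()
    {
        var node = new NodeWithStruct();
        node.Position.X = 1.0;              // writes directly into the node
        Console.WriteLine(node.Position.X); // 1
    }
}
```

Generalizing this so that arbitrary object instances (not just value types) can be embedded or stack-allocated is the kind of improvement the paragraph above is hinting at.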

Anyway, it is great to see .NET performance back in the ring! I'm also eager to be able to use .NET Native on the desktop.

My name is Pooya Zandevakili and I am one of the developers from the .Net Native team at Microsoft. More specifically, I work on code generation and optimizations for both C++ and (now) C#.

Thank you for sharing this information. I would like to re-emphasize a point that you have mentioned as well: .Net Native is still in preview and there is still work to be done. We are actively working to make sure that it meets the high quality standards that our customers demand and we will definitely be drilling into the benchmark data you have reported. Community feedback like this is very helpful, so once again thank you. I would also like to encourage you to keep up with our (frequently-released) developer previews as we continually add new improvements. In fact, we just released our third Developer Preview today incorporating other community feedback regarding code quality, which you might find interesting:

Thanks Pooya for your feedback. Indeed, I'm glad to see that the new "Developer Preview 3" improves some of the benchmarks here. I have added a disclaimer at the beginning of this post to emphasize the preview cycle.