Tuesday, March 15, 2011

Benchmarking C#/.Net Direct3D 11 APIs vs native C++

[Update 2012/05/15: Note that the original code was fine-tuned to a particular configuration and may not give you the same results. I have rewritten this sample to give more accurate and predictable results. The comparison with XNA was also unfair and inaccurate: you should see something like x4 slower instead of x9. The latest SharpDX version, 2.1.0, is also now x1.35 slower than C++. An update of this article will follow on the new sharpdx.org website.]

[Update 2014/06/17: Removed the XNA comparison, as it was neither fair nor relevant.]

If you are working with a managed language like C# and you are concerned about performance, you probably know that, even if the Microsoft CLR JIT is quite efficient, it has a significant cost compared to a pure C++ implementation. If you don't know much about this cost, you have probably heard of an average overhead of around 15-20% for managed languages. If you are really concerned about this, you know that, depending on the case, a calculation-intensive managed application is more often around x2 or even x3 slower than its C++ counterpart. In this post, I'm going to present a micro-benchmark that measures the cost of calling a native Direct3D 11 API from a C# application, using various APIs: SharpDX, SlimDX and Windows API CodePack.

Why is this benchmark important? Well, if you intend, like me, to build some serious 3D games with a managed C# API (don't troll me on this! ;) ), you need to know exactly what it costs to call a native Direct3D API intensively from a managed language (mainly, the cost of the interop between the managed language and the native API). If your game is GPU bound, you are unlikely to see any difference here. But if you want to apply lots of effects, with various models, particles and materials, playing with several render targets and a heavy deferred rendering technique, you are likely to perform lots of draw calls to the Direct3D API. For a AAA game, those calls could be as high as 3000-7000 draw submissions in instancing scenarios (see the great recent DICE publication "DirectX 11 Rendering in Battlefield 3" by Johan Andersson). If you are running at 60fps (or at a lower 30fps), you have just under 17ms (or just over 33ms) per frame to perform your whole rendering. In this short time range, draw calls can take a significant amount of time, and this is a main reason why multi-threaded command batching was introduced in DirectX 11. We won't use such a technique here, as we want to evaluate raw calls.

As you are going to see, the results are pretty interesting for anyone who is concerned about performance and writes C# games (or even efficient tools for a 3D middleware).

The Managed (C#) to Native (C++) interop cost

When a managed application needs to call a native API, it needs to:

Marshal method/function arguments from the managed world to the unmanaged world

Switch the CLR from managed execution to an unmanaged environment (change exception handling, stack trace state, etc.)

Actually call the native method

Marshal output arguments and results back from the unmanaged world to the managed one

To perform a native call from a managed language, there are currently three solutions:

Using P/Invoke, the default interop mechanism provided by C#, which is in charge of performing all the previous steps. But P/Invoke comes at a huge cost when you have to pass structures, arrays by value, strings, etc.

Using a C++/CLI assembly that performs hand-written marshaling to the native C++ methods. This is used by SlimDX, Windows API CodePack and XNA.

Using the SharpDX technique, which generates all the marshaling and interop code at compile time, in a structured and consistent way, using some CLR bytecode instructions that are missing from C# and usually only available in C++/CLI.

The marshaling cost is in fact the most expensive one. Usually, calling a native function directly without performing any marshaling costs about 10%, which is fine. But if you take for example a slightly more complex function, like ID3D11DeviceContext::SetRenderTargets, you can see that marshaling takes a significant amount of code:

In the previous sample, there is no structure marshaling involved (which is even more costly than marshaling plain method arguments), and as you can see, the marshaling code is pretty heavy: it has to handle null parameters, transform an array of managed DirectX interfaces into the corresponding array of native COM pointers, etc.
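To make the description above concrete, here is a minimal sketch of that kind of marshaling, written in C++ as a stand-in for the generated C# code. It is not the SharpDX implementation: `ManagedView`, `NativeView`, `NativeCallLog`, `native_set_render_targets` and `set_render_targets` are hypothetical names invented for this illustration, and the "native API" is just a function that records what it received.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a native COM object (e.g. a render target view).
struct NativeView { int id; };

// Hypothetical stand-in for a managed wrapper that holds a native pointer.
struct ManagedView { NativeView* native_pointer; };

// Records what the fake native API received, so the sketch can be checked.
struct NativeCallLog {
    std::size_t count = 0;
    std::vector<NativeView*> views;
    NativeView* depth = nullptr;
};

// Hypothetical native entry point: it only understands raw pointers.
inline void native_set_render_targets(NativeCallLog& log, std::size_t count,
                                      NativeView* const* views, NativeView* depth) {
    assert(count == 0 || views != nullptr);
    log.count = count;
    log.views.clear();
    for (std::size_t i = 0; i < count; ++i) {
        log.views.push_back(views[i]);
    }
    log.depth = depth;
}

// The "marshaling" wrapper: handles null wrappers and converts the array of
// wrapper objects into a temporary array of raw native pointers.
inline void set_render_targets(NativeCallLog& log,
                               const std::vector<const ManagedView*>& views,
                               const ManagedView* depth) {
    std::vector<NativeView*> raw(views.size(), nullptr);
    for (std::size_t i = 0; i < views.size(); ++i) {
        if (views[i] != nullptr) {
            raw[i] = views[i]->native_pointer;  // null wrappers stay null
        }
    }
    NativeView* raw_depth = (depth != nullptr) ? depth->native_pointer : nullptr;
    native_set_render_targets(log, raw.size(),
                              raw.empty() ? nullptr : raw.data(), raw_depth);
}
```

Even in this toy version, every call pays for a temporary array, a loop and several null checks before the native function is reached, which is the overhead the benchmark below tries to measure.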

Fortunately, in SharpDX, unlike any other DirectX .NET API, this code has been written to be consistent over the whole generated code, and was carefully designed to be quite efficient... but still, it obviously has a cost, and we need to know it!

Protocol used for this micro-benchmark

Writing a benchmark is error prone, often subject to caution and relatively "narrow minded". Of course, this benchmark is not perfect; I just hope that it doesn't contain any mistake that would give a misleading trend in the results!

In order for this test to be closer to real 3D application usage, I chose to perform a very basic test on a sequence of calls that are usually involved in common drawing scenarios. This test consists of drawing triangles using 10 successive effects (vertex shaders/pixel shaders), each with its own vertex buffer, and setting the viewport and render target to the backbuffer. This loop is then run thousands of times in order to get a correct average.

The vertex and pixel shaders involved are basic (just passing the color from the VS to the PS, with no WorldProjectionTransform applied). context.Flush is used to avoid measuring the flush of commands to the GPU. The CommonBench.FlushLimit value was selected to avoid any stalls from the GPU.

I have ported this benchmark to:

C++, using raw native calls and Direct3D11 API

SharpDX, using Direct3D11 running under Microsoft .NET CLR 4.0 and with Mono 2.10 (both trying llvm on/off). SharpDX is the only managed API to be able to run under Mono.

SlimDX using Direct3D11 running under Microsoft .NET CLR 4.0. SlimDX is "NGENed" meaning that it is compiled to native code when you install it.

It has been tested on a Win7 64-bit machine with an i5-750 2.6GHz and an AMD HD6950 graphics card. All tests were done in both x86 and x64 mode, in order to measure the platform impact of the calling conventions. Tests were run 4 times for each API, taking the average of the 3 lowest results.
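The measurement protocol described above (time a loop, repeat the run several times, average the lowest results) can be sketched as follows. This is only an illustration under assumptions, not the original benchmark code: `time_per_iteration_ms` and `average_of_lowest` are hypothetical helper names, and the body being timed is whatever callable you pass in.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Runs `body` `iterations` times and returns the average time per iteration
// in milliseconds, measured with a monotonic clock.
template <typename Fn>
double time_per_iteration_ms(Fn&& body, std::size_t iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        body();
    }
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = end - start;
    return elapsed.count() / static_cast<double>(iterations);
}

// Averages the `keep` lowest samples, discarding the slowest runs
// (e.g. 4 runs, keep the 3 lowest, as in the protocol above).
inline double average_of_lowest(std::vector<double> samples, std::size_t keep) {
    assert(keep > 0 && keep <= samples.size());
    std::partial_sort(samples.begin(), samples.begin() + keep, samples.end());
    const double sum = std::accumulate(samples.begin(), samples.begin() + keep, 0.0);
    return sum / static_cast<double>(keep);
}
```

A run would collect four values from `time_per_iteration_ms` into a vector and pass it to `average_of_lowest(samples, 3)`. Dropping the slowest run is a simple way to reduce the influence of a one-off disturbance such as another process waking up.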

Results

You can see the raw results in the following table. Time is measured for the simple drawing sequence (inside the loop for(i) nbEffects). Lower is better. The ratios on the right indicate how much slower the tested API is compared to the C++ one. For example, SharpDX in x86 mode runs 1.52 times slower than its pure C++ counterpart.

| Direct3D11 Simple Bench | x86 (ms) | x64 (ms) | x86-ratio | x64-ratio |
|---|---|---|---|---|
| Native C++ (MSVC VS2010) | 0.000386 | 0.000262 | x1.00 | x1.00 |
| Managed SharpDX (1.3 MS .Net CLR) | 0.000585 | 0.000607 | x1.52 | x2.32 |
| Managed SlimDX (June 2010 - Ngen) | 0.000945 | 0.000886 | x2.45 | x3.38 |
| Managed SharpDX (1.3 Mono-2.10) | 0.002404 | 0.001872 | x6.23 | x7.15 |
| Managed Windows API CodePack 1.1 | 0.002551 | 0.003219 | x6.61 | x12.29 |
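For reference, the ratio columns are simply each API's time divided by the native C++ time for the same platform. A small sketch of that arithmetic, using hypothetical helper names (`slowdown_ratio`, `round_to_2_decimals`):

```cpp
#include <cassert>
#include <cmath>

// How many times slower an API is than the native baseline on the same platform.
inline double slowdown_ratio(double api_ms, double native_ms) {
    assert(native_ms > 0.0);
    return api_ms / native_ms;
}

// Rounds to two decimals, matching the precision used in the table.
inline double round_to_2_decimals(double value) {
    return std::round(value * 100.0) / 100.0;
}
```

For instance, SharpDX on x86 gives 0.000585 / 0.000386, which rounds to 1.52, and Windows API CodePack on x64 gives 0.003219 / 0.000262, which rounds to 12.29.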

And the associated comparison graphs for both the x86 and x64 platforms:

The results are pretty self-explanatory, although we can highlight some interesting facts:

Managed Direct3D API calls are much slower than native API calls, ranging from x1.52 to x12.29 depending on the API you are using.

SharpDX provides the fastest managed Direct3D API, ranging only from x1.52 to x2.32 slower than C++, at least 50% faster than any other managed API.

All other managed Direct3D APIs are significantly slower, ranging from x2.45 to x12.29.

Running this benchmark with SharpDX on Mono 2.10 is x6 to x7 slower than native C++, which is roughly x3 to x4 slower than SharpDX with the Microsoft JIT (!)

Ok, so if you are a .NET programmer and are not aware of the performance penalty of using a managed language, you are probably surprised by these results, which could be... scary! We can put things in perspective here, as your 3D engine is unlikely to be CPU bound on draw calls, but 3000-7000 calls could lead to a 4ms impact in the best case, which is something we need to know when we design a game.
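The 4ms figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes, as a simplification, that each draw submission costs about as much as one iteration of the benchmark sequence from the table; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Time available per frame, in milliseconds, at a given frame rate.
inline double frame_budget_ms(double frames_per_second) {
    assert(frames_per_second > 0.0);
    return 1000.0 / frames_per_second;
}

// Estimated CPU time spent submitting draws during one frame.
inline double draw_cost_ms(std::size_t draw_submissions, double ms_per_submission) {
    return static_cast<double>(draw_submissions) * ms_per_submission;
}

// Fraction of the frame budget consumed by draw submission alone.
inline double budget_fraction(std::size_t draw_submissions,
                              double ms_per_submission,
                              double frames_per_second) {
    return draw_cost_ms(draw_submissions, ms_per_submission) /
           frame_budget_ms(frames_per_second);
}
```

With the SharpDX x86 figure, `draw_cost_ms(7000, 0.000585)` is about 4.1ms, roughly a quarter of the budget of a 60fps frame, while the native C++ figure gives about 2.7ms for the same 7000 submissions.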

This test could also be extrapolated to other parts of a 3D engine, as it will probably be slower by a factor of x2 compared to a brute-force C++ engine. For a AAA game, this would of course be an unacceptable performance penalty, but if you are a small/independent studio, this cost is relatively low compared to the gain of developing a game efficiently in C#, and in the end, that's a trade-off.

If you are using the SharpDX API, you can still run at a reasonable performance. And if you really want to circumvent this interop cost for chatty API scenarios, you can design your engine to call a native function that will batch calls to the native Direct3D API.
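As an illustration of the batching idea, here is a minimal sketch of what the native side could look like: the managed code fills a flat array of plain commands and crosses the interop boundary once, and the native function replays them. Everything here is hypothetical (`DrawCommand`, `FakeDevice`, `native_draw`, `submit_batch`); a real engine would issue actual Direct3D calls instead of updating counters.

```cpp
#include <cassert>
#include <cstddef>

// Plain-data command that needs no per-field marshaling (hypothetical layout).
struct DrawCommand {
    unsigned effect_id;
    unsigned vertex_count;
};

// Hypothetical stand-in for the native device; it only counts what it is asked to do.
struct FakeDevice {
    std::size_t draw_calls = 0;
    std::size_t vertices = 0;
};

// Stand-in for one native draw submission.
inline void native_draw(FakeDevice& device, const DrawCommand& command) {
    ++device.draw_calls;
    device.vertices += command.vertex_count;
}

// The single entry point the managed side would call once per batch:
// one managed-to-native transition, then a tight native loop.
inline std::size_t submit_batch(FakeDevice& device,
                                const DrawCommand* commands,
                                std::size_t count) {
    assert(count == 0 || commands != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        native_draw(device, commands[i]);
    }
    return count;
}
```

The trade-off is that the command format becomes part of your engine's design: the managed side has to describe its work as data up front instead of calling the API as it goes.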

47 comments:

Very interesting benchmark. I actually felt somewhat surprised to see that managed calls are so cheap ! I expected something around 5x slower than native calls, so 1.52x is, to me, very affordable considering the huge amount of time saved by using C# instead of C++.

I have a question though. Why is x64 code nearly 50% slower than the x86 one ? Could the pointer size alone incur such a slowdown ?

yea I'd also like to know where the main slowdown in the 64bit world is coming from...surely the longer pointers possibly cannot have such a huge impact? Also, seeing as the ratios of the other Frameworks differ from that of sharpdx, do you think the 64bit degradation issue is 'fixable'?

Thanks! I'm out for 10 days so I will check 64bit difference as soon as I'm back.

I don't think that there is a solution to improve this in x64 mode, as the code is pretty straightforward. Although I have still an improvement that will be available in the v2 of SharpDX.

Transform from CLRCall calling convention to stdcall 64bit + difference in the 64bit jit code generation are more likely to make the difference here... but that could also come from tail call optim in 64bit that is probably not efficiently used in SharpDX (I will have to check this, as CLR4.0 changed this)

But... one thing I noticed is that it may be possible to further reduce the number of high frequency allocations and use of fixed.

For instance, I see a pattern in your API mappings where you will do a GC alloc followed by a fixed{} on a value type.

Have you checked what the IL looks like compared to a stackalloc with no fixed{} ?

Also, maybe I'm missing something, but I couldn't figure a convenient way to update resource buffers (DX11) without suffering a penalty of a couple of reference type allocations per update. The same with D2D triangle sink, where you allocate managed arrays on every call.

Maybe to avoid this, you could provide a templatized wrapper around a block of memory allocated with Marshal.AllocHGlobal(). Perhaps one for arrays and another for a singular struct (Perfect for constant buffers).

I'm pretty sure you'd get a decent perf win from removing the GC pressure, and passing void*'s as much as possible would mitigate marshaling costs in the managed/native proxies.

Yeah, I realize these kinds of optimizations may not be to everyone's taste (since it places a larger burden on the application code).

Hi Christian,

Thanks for your comment! "fixed" is only used for arrays and value types passed by pointer. Using fixed on a value type that was allocated on the stack doesn't have a significant impact. I agree that the current design can generate some unwanted allocations that would be difficult to avoid (though I'm going to review them). Some of them (for example in UpdateSubResource or Map) allocate on the stack, so it shouldn't be a huge issue... but others, like arrays, are allocated on the heap, even if they are sometimes transient objects...

I'm highly concerned by this as well so what I'm going to provide for v2 is an access to a very low level API version on the side of the current API. This low level API won't perform any marshaling both for parameter and return values and will be considered as RAW calls, almost as fast as their native counterpart. These kind of methods could be used in very specific scenarios, where for a small part of an application, we need to process as fast as possible.

I will also provide for the current API some of the hidden methods that could avoid the allocation of transient objects for which the client will be responsible to allocate them (for example, outside of an intensive loop)

Excellent, the roadmap sounds great, can't wait to get a look at V2 nearer the time.

I have to admit to being out of touch with the latest code-gen from the compiler/jitter. Last time I paid attention was a few years back, and mostly for compact framework issues. On that, I pretty much expected the worst at all times!

Just an FYI... The function I saw creating the most garbage for me was MapSubResource() in the DeviceContext.cs. It allocs two reference types... a DataBox and a DataStream.

Unfortunately, you need that DataBox for the UpdateSubresource() later so I couldn't see a way around it without modding the code.

Agree about the two allocations, I could provide other way to handle this... but It would be easier to follow this if you could log an issue about this and describe a little bit more your use case/workflow. thanks!

Really nice project! As soon as you make it possible to use the Effect framework under DirectX11, I will give SharpDX a try and adapt my project to it! (atm under SlimDX) :) Keep up the good work!

I have a question: do you think that a text-intensive (DirectWrite/Direct2D) application would benefit from using SharpDX over Windows API CodePack 1.01?

It's a WinForms datagrid control used in a very busy trading/market data application. The performance is really good as it is (much, much better than any of the commercial WinForms GDI/GDI++ based datagrid controls available) and DirectWrite draws beautifully. The grid draws a lot of small, differently formatted text strings. The existing solution is, to my understanding, not GPU bound; it actually spends quite a bit of CPU time, especially in hwndRenderTarget.DrawText(...) and hwndRenderTarget.EndDraw(...). If SharpDX performs 4x better than the existing Windows API CodePack, I would love to replace it. Of course, the only real answer is to implement something and then measure the difference. But before jumping on this, I would like to know your opinion.

@krogen, it depends. I suspect that your application is currently CPU bound because of DirectWrite itself, not the CodePack API overhead. So unless your program is very chatty with the CodePack API, you probably won't see any benefit from switching from CodePack to SharpDX (the cost of a draw is much higher than the cost of the API overhead).

That being said, using SharpDX over CodePack is still worthwhile if you want to have:
- full support for DirectWrite/Direct2D (from what I remember, CodePack doesn't support all the callback features of these APIs)
- AnyTarget assemblies, running transparently on x86/x64 without any GAC install

Hi HelloweenScot,

The calli instruction works with Mono, so it should work on Linux. However, I have found that calli is not entirely well implemented on Mono (bugs with struct arguments and possibly performance issues), which is something I will have to check carefully.

Hi Alex, thanks for the article. I also hope there aren't too many neutrons in Tokyo's water.

One question: how do you generate the C# code based on the C++ API headers? What is your metaprogramming method? Do you use some public template/macro solution like the Text Template Transformation Toolkit (C#) or Boost (C++)?

Hi Guillaume,

About the code generation process, I wrote an article about it: http://code4k.blogspot.com/2010/10/managed-netc-direct3d-11-api-generated.html though it is slightly outdated for some parts now.

The code generation process for the next version v2.0 (alpha available from https://github.com/sharpdx/SharpDX ) is:
0) Read the mapping.xml configuration files, which contain the mapping rules, from various directories.
1) Parse the C++ headers with gccxml.
2) Map the gccxml output to an internal C++ model (previously called XIDL).
3) Use the rules from the configuration files to generate C# code.
4) The template engine used is T4, though it now uses a simplified version of it from the Mono.TextTemplating code.

With gccxml, the SharpDX code generation tool is able to parse the whole Windows API (or even any third-party API) and is well suited to exposing COM interfaces. The code generation tools will probably be available as part of a SharpDX tool chain (for example, in order to generate wrappers for custom C++/COM-based code).

I have converted my little game project (aka Voxel landscape rendering : http://www.youtube.com/watch?v=9r93LyIJLjY) from slimDX to SharpDX, it's working fine.

While I didn't lose any FPS (and didn't gain any either, but that's because I'm not "draw call" limited at the moment), the game has gained some "smoothness". With SlimDX, I had a GC collection fired from time to time (from inside SlimDX) that made my rendering freeze for 0.01s. No more of those with SharpDX!

What I also like is that you match the DirectX functions more directly. As there are not a lot of tutorials for SlimDX/SharpDX, it's nice to find the DirectX functions nearly unchanged in SharpDX.

I have some questions:
- What are the requirements for my game to run under Mono? (which .NET framework? 3.5, 4.0?)
- I did try your alpha V2, but it crashes almost immediately in functions responsible for writing data into memory (like UpdateSubresource). Is that normal? (I know it's still in alpha!)

Good work so far, I'm sticking with your managed wrapper around DirectX11!

@bubu, For X3DAudio, It was planned and I'm working on it. It shouldn't take too long as the API is very small.

@Fabian, great to see that you were able to switch to SharpDX smoothly while getting some performance gain. For your other request, as it could take more lines to respond, and to keep a record of it, could you please log questions/issues at https://github.com/sharpdx/SharpDX/issues?sort=created&direction=desc&_pjax=true&state=open

@Fabian, concerning the mapping between DirectX types/functions/interfaces and SharpDX types, it will be fully integrated in the upcoming SharpDX documentation system, with the ability to search directly by unmanaged name.

The documentation will integrate probably a full listing of all the mappings as well.

@UltraHead: Still working on it:
- I have recently integrated X3DAudio and XACT3.
- I need to add RawInput and XInput mappings, which shouldn't take too long.
- Direct3D9 needs a little more love, at least on the API parts that are the most frequently used.
- I would also like to add WIC support for 2.0.

After this:
- I have to improve the documentation and better support importing from MSDN.
- Finalize the new website.

inside the loop ? These things should be done once per scene rendering, not per object rendering. And the triangle seems to fill 1/8 of the screen. Those things may be bottlenecked by the video card's fillrate, and as I see it, they may interfere with the number of draw calls possible. I see the same clearing of the screen in the other benchmarks as well. If they are intended, can you explain them please? Thanks.

@Rosen, indeed, those should have been moved out of the loop (and I don't remember why the SetRenderTarget ended up with a null), but the main goal of this benchmark was to measure the cost of the API calls. In fact, for Direct3D11, I should have set up a deferred context, though for XNA there is no equivalent. That's why I tried to tune the flush limits correctly on my machine for all APIs in order to avoid any stall from the GPU, but I don't think it's going to change a lot for XNA (though I could have done something wrong).

"This test could be also extrapolated to other parts of a 3D engine, as It will probably slower by a factor of x2 compare to a brute force C++ engine."

I seriously doubt that. Marshalling might put a huge burden on the CPU, but I've done benchmarks on basic operations for both C++ and C#, and they're very close (other than generics, which are considerably slower than templates).

@pball81, there is in fact no speculation here. I have been investigating JIT code generation for the last few years and implementing a new 3D engine in C# at work, and I can tell you that in lots of cases where you ask your CPU to give its maximum, the JIT is far from being efficient. There are lots of cases where the JIT won't inline your code, for example... Or the x86 JIT code generation is not able to use SSE/SSE2 instructions to vectorize things efficiently (only the x64 one uses them, because it is forced to), also losing all the benefits of having extra SSE/SSE2 registers... and the performance boost from using inline SSE/SSE2 and proper register usage is at least x2 in C++ compared to C#.

In order to achieve the best performance in C#, you really need to profile your code (avoid any kind of allocation, as you would do in C++, but it is more important here with the GC running in the background), check the JIT-generated code... and sometimes play with some dirty JIT hacks in order to get the best from it.

Although in some cases you won't notice a huge difference, the 10-20% overhead of JIT code compared to highly optimized C++ code is a myth, especially when you are entering heavy CPU computing...

"the JIT is far from being efficient. There are lots of cases where the JIT won't inline your code for example..."

Not only is the JITter inefficient in certain situations, there are also many Core and System library operations in .NET/C# with methods that are far from efficient (and for these cases, most of the time you end up writing your own code).

So, just for kicks and giggles, I decided to integrate SharpDX into my graphics library (which used SlimDX).

It was somewhat painless to do so, had to fix up some naming and re-order some parameters. I was a little thrown off by having to use the SharpDX.Direct3D namespace to use the FeatureLevel enumeration.

Anyway, that's not why I'm posting on this particular blog entry. I did some crude benchmarking by drawing a multisampled scene with a rotating square and simulating motion blur on that square by drawing it 8 times, and a single untransformed textured square. I also enabled the depth buffer. When measuring against the FPS (yeah I know, FPS is not a good metric, bear with me...)

FYI, this is on a Win7 x64 box, i7 2600, Radeon 6870:
For SlimDX -> At the highest, I was getting about 2200 FPS (about 0.45 ms), and at the lowest about 1900 FPS (about 0.52 ms).
For SharpDX -> At the highest, I was getting about 2200 FPS, and at the lowest, about 1700 FPS (about 0.56 ms).

This was consistent between x86 and x64. Since I knew I wasn't using a good metric (FPS) to measure draw time, and I really wasn't taxing the card in any way, I decided to draw that same scene, but this time I drew the single square 65536 times, transforming it a little on each iteration.

When I did the first test I was a little skeptical about your claims, but after the second... not so much.

I did find it interesting that SharpDX didn't appear to perform as well with so little on the screen (mind you, if we're looking at delta, it's very minor performance difference, approximately 0.04 ms, hardly worth worrying about in my opinion).

You've done an amazing job with this. I may consider switching Gorgon to use this as its API once I get some more testing out of the way.

Thanks TapeWorm for this valuable feedback and glad to see that you have found some interesting results! :)

Since this article was posted, I have worked a little bit more on some core Direct3D11 methods to improve performance, so I expect to check this and benchmark it again in the near future. There are also a couple of things I'm slowly working on that are not released yet but that will improve performance as well. I'm confident that SharpDX could go under the x1.5 performance penalty against the C++ version.

About the API differences you ran into, sorry for that: if you think that some of those changes are not relevant, drop an issue on SharpDX and I will have a look at it. For FeatureLevel, for example, as it is effectively shared between Direct3D10.1 and Direct3D11 in C++, I had to map the correct C++ behavior.

And concerning the small difference (0.04ms) in a simple case, it indeed doesn't seem critical enough to worry about.

Hello Alexandre! I wanted to ask if you've already developed 4k demos in C# with SharpDX? Is that possible at all, size-wise? I have been programming in C#, I haven't written an intro since the C64 days, and I would like to do a project again just for fun.

Hi Marco. I have already investigated small intros with C#, and this is not really feasible. The smallest assembly that you can get in .NET is 1536 bytes, but that already requires a bit of PE hacking. Then the major drawback is that the assembly stores long names to reference other assemblies or any API used (like System.IO.FileStream), so more than half of the executable would already be occupied by PE headers + metadata.

Hello Alexandre! I've thought about something. When I made my first experiments with Java and lwjgl, I couldn't even get under 11kb, and that was with just a cube! My experiments with C# and GDI+ came to 7kb for a text scroller and a copper bar. Pure Java would be enough even for an old-school intro (37kb with images), but one always wants more and more. ^^ Thank you for your answer! I will now learn C++ and see what comes of it. Once you've been in the demoscene, you can never get rid of it ^^. Not even after such a long time.