Tuesday, October 26, 2010

Ever wanted to implement a C++ interface callback in a managed C# application? It's not that hard, but it is a solution you will hardly find on the Internet. The most common answer you will get is that it's not possible, or that you should use C++/CLI to achieve it. In fact, in C#, you can only expose a C function pointer from a delegate through Marshal.GetFunctionPointerForDelegate; you won't find anything like Marshal.GetInterfacePointerFromInterface. You may wonder why I need such a thing.

In my previous post about implementing a new fully managed DirectX API, I forgot to mention the case of interface callbacks. There are not many cases in the Direct3D 11 API where you need to implement a callback. You are more likely to find use cases in audio APIs like XAudio2, but in Direct3D 11, as far as I know, only 3 interfaces are used for callbacks:

ID3DInclude, which is used by the D3DCompiler API to provide a callback for includes when using the preprocessor or compiler API (see for example D3DCompile).

ID3DX11DataLoader and ID3DX11DataProcessor, which are used by some D3DX functions to perform asynchronous loading/processing of texture resources. The nice thing about C# is that those interfaces are unnecessary, as it is much easier to implement asynchronous loading directly in C# instead.

So I'm going to take the example of ID3DInclude and show how it has been successfully implemented in SharpDX.
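To make the idea concrete, here is a minimal sketch of one way to fake a C++ interface pointer from C#. It is not SharpDX's actual code: the class name IncludeCallback, the delegate names and the stub bodies are hypothetical. It assumes the usual C++ object layout (the first pointer-sized field of the object points to a table of function pointers) and that ID3DInclude exposes Open then Close as stdcall methods taking the implicit "this" pointer first.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical sketch: exposes managed code to native code as a
// two-method C++ interface shaped like ID3DInclude (Open, Close).
public sealed class IncludeCallback : IDisposable
{
    // Native signatures, with the implicit C++ "this" as first argument.
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int OpenDelegate(IntPtr self, int includeType, IntPtr fileName,
                                      IntPtr parentData, out IntPtr data, out int bytes);

    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    private delegate int CloseDelegate(IntPtr self, IntPtr data);

    // Keep the delegates in fields so the GC cannot collect them
    // while native code still holds their function pointers.
    private readonly OpenDelegate open;
    private readonly CloseDelegate close;

    // Pointer that would be handed to native code as the interface pointer.
    public IntPtr NativePointer { get; private set; }

    public IncludeCallback()
    {
        open = OpenImpl;
        close = CloseImpl;

        // Memory layout: [vtable pointer][slot 0: Open][slot 1: Close]
        NativePointer = Marshal.AllocHGlobal(IntPtr.Size * 3);
        IntPtr vtable = IntPtr.Add(NativePointer, IntPtr.Size);
        Marshal.WriteIntPtr(NativePointer, vtable);
        Marshal.WriteIntPtr(vtable, 0, Marshal.GetFunctionPointerForDelegate(open));
        Marshal.WriteIntPtr(vtable, IntPtr.Size, Marshal.GetFunctionPointerForDelegate(close));
    }

    private static int OpenImpl(IntPtr self, int includeType, IntPtr fileName,
                                IntPtr parentData, out IntPtr data, out int bytes)
    {
        // Stub: a real implementation would load the requested include here.
        data = IntPtr.Zero;
        bytes = 0;
        return 0; // S_OK
    }

    private static int CloseImpl(IntPtr self, IntPtr data)
    {
        return 0; // S_OK
    }

    public void Dispose()
    {
        if (NativePointer != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(NativePointer);
            NativePointer = IntPtr.Zero;
        }
    }
}
```

The value of NativePointer is what a native function expecting the interface pointer would receive. The object must stay alive (and undisposed) for as long as native code may call back into it.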

Saturday, October 23, 2010

(Edit 8 Jan 2011: Updated the protocol test with Buffer.BlockCopy)
(Edit 11 Oct 2012: Please vote for the x86 cpblk deficiency on Microsoft Connect)
Following my last post about an interesting use of the "cpblk" IL instruction as an unmanaged memcpy replacement, I have to admit that I didn't take the time to carefully verify that performance is actually better. Well, I was probably too optimistic, so I ran some tests, and the results are very surprising.

The memcpy protocol test in C#

When dealing with 3D calculations, large texture buffers, audio synthesis, or anything else that requires a memcpy and interaction with the unmanaged world, you will most likely end up calling an unmanaged function like this one:

In this test, I'm going to compare this implementation with 5 challengers:
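As a sketch of what such a declaration typically looks like, assuming the C runtime memcpy exported by msvcrt.dll on Windows (the class name MsvcrtInterop is made up for illustration):

```csharp
using System;
using System.Runtime.InteropServices;

public static class MsvcrtInterop
{
    // Windows-only: binds to the C runtime memcpy exported by msvcrt.dll.
    // memcpy uses the cdecl calling convention and returns the dest pointer.
    [DllImport("msvcrt.dll", EntryPoint = "memcpy",
               CallingConvention = CallingConvention.Cdecl, SetLastError = false)]
    public static extern IntPtr memcpy(IntPtr dest, IntPtr src, UIntPtr count);
}
```

The source and destination can be unmanaged allocations or pinned managed arrays. The call fails with a DllNotFoundException on platforms that do not ship msvcrt.dll.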

The cpblk IL instruction

A handmade memcpy function

Array.Copy, although it's not really comparable because it doesn't have the same scope: Array.Copy works on managed arrays only, while memcpy is used to copy data between managed and unmanaged memory as well as between unmanaged buffers.

Marshal.Copy, with the same caveat as Array.Copy: one side of the copy must be a managed array.

Buffer.BlockCopy, which works on managed arrays but copies blocks measured in bytes.

The test performs a series of memcpy calls with different block sizes, from 4 bytes to 2 MB. The interesting part is to run this test in both x86 and x64 mode. Both tests run on the same Windows 7 x64 OS and the same machine, an Intel Core i5-750 (2.66 GHz). The CLR used is runtime v4.0.30319.

The naive handmade memcpy is nothing more than this code (not meant to be the best implementation ever, but at least safe for any buffer size):
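C# has no syntax for cpblk, so the instruction has to be emitted some other way. The following sketch shows one possible route, a DynamicMethod built with Reflection.Emit; the class name CpblkCopy is hypothetical and this is not necessarily how the test itself was wired. cpblk expects the destination address, the source address and the byte count (an unsigned 32-bit integer) on the evaluation stack, in that order.

```csharp
using System;
using System.Reflection.Emit;

public static class CpblkCopy
{
    // Delegate built once: Copy(dest, src, byteCount) executes a single cpblk.
    public static readonly Action<IntPtr, IntPtr, uint> Copy = Build();

    private static Action<IntPtr, IntPtr, uint> Build()
    {
        var method = new DynamicMethod(
            "CpblkCopyImpl", typeof(void),
            new[] { typeof(IntPtr), typeof(IntPtr), typeof(uint) },
            typeof(CpblkCopy).Module, true);

        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0); // destination address
        il.Emit(OpCodes.Ldarg_1); // source address
        il.Emit(OpCodes.Ldarg_2); // number of bytes to copy
        il.Emit(OpCodes.Cpblk);
        il.Emit(OpCodes.Ret);

        return (Action<IntPtr, IntPtr, uint>)method.CreateDelegate(
            typeof(Action<IntPtr, IntPtr, uint>));
    }
}
```

Going through a delegate adds a small call overhead, which matters most for the smallest block sizes.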
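A measurement loop of this kind could be sketched as follows. This is an illustration, not the harness used for the numbers in this post: the names CopyBenchmark, Measure and Run, the iteration count and the 64 MB per-size budget are all assumptions. Only Buffer.BlockCopy is wired in here; the other candidates would be plugged in the same way.

```csharp
using System;
using System.Diagnostics;

public static class CopyBenchmark
{
    // Returns the throughput in MB/s of copying blockSize bytes, iterations times.
    public static double Measure(Action<byte[], byte[], int> copy, int blockSize, int iterations)
    {
        var src = new byte[blockSize];
        var dst = new byte[blockSize];
        copy(dst, src, blockSize); // warm-up call so JIT time is not measured

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            copy(dst, src, blockSize);
        watch.Stop();

        // Guard against a zero elapsed time on very short runs.
        double seconds = Math.Max(watch.Elapsed.TotalSeconds, 1e-9);
        return (double)blockSize * iterations / (1024.0 * 1024.0) / seconds;
    }

    // Walks block sizes from 4 bytes to 2 MB, doubling each time.
    public static void Run()
    {
        for (int size = 4; size <= 2 * 1024 * 1024; size *= 2)
        {
            // Copy roughly 64 MB in total per block size.
            int iterations = Math.Max(1, (64 * 1024 * 1024) / size);
            double mbPerSecond = Measure(
                (d, s, n) => Buffer.BlockCopy(s, 0, d, 0, n), size, iterations);
            Console.WriteLine("{0,8} bytes: {1,8:F0} MB/s", size, mbPerSecond);
        }
    }
}
```

Running the same binary as x86 and as x64 is what makes the comparison meaningful, since the JIT differs between the two.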
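A naive copy in that spirit might look like the sketch below. It is a reconstruction under assumptions, not necessarily the exact code that was measured: it copies 8 bytes at a time, then finishes the remaining bytes one by one, so it works for any size. The class name CustomCopy is made up, and the code must be compiled with unsafe code enabled (/unsafe).

```csharp
using System;

public static class CustomCopy
{
    // Naive copy: 8 bytes at a time, then the trailing bytes one by one.
    // Assumes the CPU tolerates unaligned 64-bit accesses (true on x86/x64).
    public static unsafe void Copy(void* dest, void* src, int count)
    {
        long* d8 = (long*)dest;
        long* s8 = (long*)src;
        int blocks = count >> 3; // number of whole 8-byte blocks
        for (int i = 0; i < blocks; i++)
            *d8++ = *s8++;

        byte* d1 = (byte*)d8;
        byte* s1 = (byte*)s8;
        int tail = count & 7;    // 0 to 7 leftover bytes
        for (int i = 0; i < tail; i++)
            *d1++ = *s1++;
    }
}
```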

Results

For the x86 architecture, results are expressed as throughput in MB/s (higher is better); block size is in bytes:

| BlockSize | x86-cpblk | x86-memcpy | x86-CustomCopy | x86-Array.Copy | x86-Marshal.Copy | x86-BlockCopy |
|---|---|---|---|---|---|---|
| 4 | 146 | 458 | 470 | 85 | 81 | 150 |
| 8 | 294 | 843 | 1122 | 168 | 167 | 298 |
| 16 | 587 | 1628 | 1904 | 306 | 327 | 577 |
| 32 | 950 | 1876 | 3184 | 631 | 558 | 1079 |
| 64 | 1451 | 3316 | 4295 | 1205 | 1059 | 1981 |
| 128 | 2245 | 5161 | 4848 | 2176 | 1933 | 3386 |
| 256 | 4353 | 7032 | 5333 | 3699 | 3386 | 5333 |
| 512 | 8205 | 13617 | 5517 | 5663 | 6666 | 7441 |
| 1024 | 13617 | 20000 | 6666 | 7710 | 12075 | 9275 |
| 2048 | 18823 | 24615 | 7191 | 9142 | 16842 | 9552 |
| 4096 | 2922 | 7529 | 5663 | 10491 | 7032 | 11034 |
| 8192 | 2990 | 7804 | 5714 | 11228 | 7441 | 11636 |
| 16384 | 2857 | 7901 | 5614 | 9142 | 7619 | 10322 |
| 32768 | 2379 | 6736 | 5333 | 8101 | 6666 | 8205 |
| 65536 | 2379 | 6808 | 5470 | 8205 | 6808 | 8205 |
| 131072 | 2509 | 17777 | 5818 | 8101 | 17777 | 8101 |
| 262144 | 2500 | 11636 | 5423 | 7032 | 11428 | 7111 |
| 524288 | 2539 | 11428 | 5423 | 7111 | 11428 | 7111 |
| 1048576 | 2539 | 11428 | 5470 | 7032 | 11428 | 7111 |
| 2097152 | 2529 | 11428 | 5333 | 7032 | 11034 | 6881 |

For the x64 architecture:

| BlockSize | x64-cpblk | x64-memcpy | x64-CustomCopy | x64-Array.Copy | x64-Marshal.Copy | x64-BlockCopy |
|---|---|---|---|---|---|---|
| 4 | 583 | 346 | 599 | 99 | 111 | 219 |
| 8 | 1509 | 770 | 1876 | 212 | 224 | 469 |
| 16 | 2689 | 1451 | 3316 | 417 | 422 | 903 |
| 32 | 4705 | 2666 | 5000 | 802 | 864 | 1739 |
| 64 | 8205 | 4812 | 7272 | 1568 | 1748 | 3350 |
| 128 | 13333 | 8101 | 9014 | 3004 | 3184 | 6037 |
| 256 | 18823 | 11428 | 10000 | 5470 | 5245 | 8648 |
| 512 | 22068 | 16000 | 10491 | 9014 | 9552 | 13913 |
| 1024 | 22857 | 19393 | 7356 | 13333 | 13617 | 16842 |
| 2048 | 23703 | 21333 | 7710 | 17297 | 17777 | 20645 |
| 4096 | 23703 | 22068 | 7804 | 19393 | 20000 | 21333 |
| 8192 | 23703 | 22857 | 7619 | 22068 | 22068 | 22857 |
| 16384 | 23703 | 22857 | 7804 | 17297 | 21333 | 18285 |
| 32768 | 16410 | 16410 | 7710 | 12800 | 16000 | 12800 |
| 65536 | 13061 | 14883 | 7710 | 13061 | 14545 | 13061 |
| 131072 | 14222 | 13913 | 7710 | 12800 | 13617 | 12800 |
| 262144 | 5000 | 5039 | 7032 | 7901 | 5000 | 7804 |
| 524288 | 5079 | 5000 | 7356 | 8205 | 5079 | 7804 |
| 1048576 | 4885 | 4885 | 7272 | 7441 | 4671 | 7529 |
| 2097152 | 5039 | 5079 | 7272 | 7619 | 5000 | 7710 |

Graph comparison only for cpblk, memcpy and CustomCopy:

Don't be alarmed by the performance drop for most of the implementations. It's mostly due to cache misses and to copies spanning different 4 KB pages.

Conclusion

Don't trust your .NET VM; check your code on both x86 and x64. It's also interesting to see how differently the same task is implemented inside the CLR (see Marshal.Copy vs Array.Copy vs Buffer.BlockCopy).

The most surprising result here is the poor performance of the cpblk IL instruction in x86 mode compared to the best performer in x64, which is... cpblk. So to summarize:

On x86, you are better off using a memcpy function.

On x64, you are better off using cpblk, which performs better from small sizes (twice as fast as memcpy) to large sizes.

You may wonder why the x86 version is so unoptimized. This is because the x86 CLR generates an x86 instruction that performs the copy on a PER BYTE basis (rep movsb for x86 folks), even if you are moving a large memory chunk of 1 MB! In comparison, memcpy as implemented in MSVCRT is able to use SSE instructions that batch the copy through large 128-bit registers (with an optimized case to avoid polluting the CPU cache). The x64 CLR seems to use a correctly implemented memcpy of this kind, but the x86 CLR version is just poorly implemented. Please vote for this bug described on Microsoft Connect.

One important consequence of this shows up when you are developing in C++/CLI and calling memcpy from a managed function: it will end up as a cpblk copy, which is almost the worst case on x86 platforms. So be careful if you are dealing with this kind of issue. To avoid this, you have to force the compiler to use the function from MSVCRTxx.dll.

Of course, memcpy is platform dependent, so it will not be an option for everyone.

Also, I didn't run this test on the CLR 2 runtime; we could be surprised there as well. One more thing I should try is a comparison against a pure C++ memcpy using the optimized SSE2 version shipped with later versions of msvcrt.

Tuesday, October 19, 2010

I have been quite busy since the end of August, personally because I'm proud to announce the birth of my daughter (and her older brother has been asking for a lot more attention since ;) ), and also because I have been working hard on an exciting new project based on .NET and Direct3D.

What is it? Yet Another Triangle App? Nope, this is in fact an entirely new .NET API for Direct3D 11, DXGI and D3DCompiler that is fully managed, without using any mixed C++/CLI assemblies, but with performance similar to a true C++/CLI API (like SlimDX). The main characteristic and most exciting thing about this new wrapper is that all the marshalling/interop code is fully generated from the DirectX SDK headers, including the MSDN documentation.

The current key features and benefits of this approach are:

API is generated from DirectX SDK headers: the mapping is able to perform "complex transformations", extracting all relevant information like enumerations, structures, interfaces, functions, macro definitions and GUIDs from the C++ source headers. For example, the mapping process is able to generate properties for interfaces, or inner group interfaces like the ones you have in SlimDX: instead of having a "device.IASetInputLayout", you are able to write "device.InputAssembler.InputLayout = ...".

Full support of the Direct3D 11, DXGI 1.0/1.1 and D3DCompiler APIs: thanks to the fully auto-generated process, the actual coverage is 100%. I have limited the generated code to those libraries, but it could be extended to other APIs quite easily (like XAudio2, Direct2D, DirectWrite, etc.).

Pure managed .NET API: assemblies are compiled with the AnyCpu target. You can run your code on an x64 or an x86 machine with the same assemblies.

API extensibility: the generated code is in C#, and all the types are marked "partial" and are easily extensible with new helper methods. The code generator is able to make some methods/types internal in order to use them in helper methods while hiding them from the public API.

C++/CLI speed: the framework uses an original technique to avoid any C++/CLI while still achieving comparable performance.

Separate assemblies: a core assembly containing common classes, and an assembly for each API subgroup (Direct3D, DXGI, D3DCompiler).

Lightweight assemblies: generated assemblies are lightweight, 300 KB in total, 70 KB compressed in an archive (similar assemblies in C++/CLI would be closer to 1 MB, one for each architecture, and would depend on MSVCRT10).

API naming convention very close to the SlimDX API (making it 100% identical would just require specifying the correct mapping names while generating the code).

Raw DirectX object lifetime management: no overhead from an ObjectTable or RCW mechanism; the API uses direct native management with the classic COM method "Release". Currently, instead of calling Dispose, you should call Release (and call AddRef if you are duplicating references, like in C++). I might evaluate how to safely integrate a Dispose method call.

Easily obfuscatable: because the framework is not using any mixed assemblies.

DirectX SDK documentation integrated in the .NET XML comments: the whole API is also generated with the MSDN documentation, meaning that you have exactly the same documentation for DirectX and for this API (this works even for method parameters, remarks, enum items, etc.). References to other types inside the documentation are correctly linked to the .NET API.

Prototype for a partial support of the Effects11 API in full managed .NET.
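To illustrate the "partial" extensibility and the property mapping mentioned in the list above, here is a small self-contained sketch. It is not the generated code itself: all type names are hypothetical, and a plain field stands in for the native call that a real generated method would make.

```csharp
using System;

// Stand-in for a generated resource type.
public partial class InputLayout { }

// "Generated" half: raw methods named after the native API,
// kept internal so they stay hidden from the public surface.
public partial class InputAssemblerStage
{
    private InputLayout current; // stands in for native device state

    internal void IASetInputLayout(InputLayout layout) { current = layout; }
    internal InputLayout IAGetInputLayout() { return current; }
}

// Second half of the same partial class: a friendlier property
// layered on top of the raw getter/setter pair.
public partial class InputAssemblerStage
{
    public InputLayout InputLayout
    {
        get { return IAGetInputLayout(); }
        set { IASetInputLayout(value); }
    }
}

// The device groups related methods under an inner stage object.
public partial class Device
{
    private readonly InputAssemblerStage inputAssembler = new InputAssemblerStage();

    public InputAssemblerStage InputAssembler
    {
        get { return inputAssembler; }
    }
}
```

With this shape, user code reads `device.InputAssembler.InputLayout = layout;` while the raw IASetInputLayout method never appears in the public API.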

If you have been working with SlimDX, some of the features here may sound familiar, and you may wonder why I wrote another DirectX .NET API when there is a great project like SlimDX. Before going further into the details of this wrapper and how things work in the background, I'm going to explain why this wrapper could be interesting.

I'm also currently not in a position to release it, because I don't want to compete with SlimDX. I want to see if the SlimDX team would be interested in working together on this system, as a kind of joint venture. There are still lots of things to do, such as improving the mapping and making it more reliable (the whole code has been written in a rush over the past month), but I strongly believe that this could be a good starting point for SlimDX 2. I might be wrong, and the SlimDX team could also have another road map in mind. So this is a message to the SlimDX team: Promit, Josh, Mike, I would be glad to hear some comments from you about this wrapper (and if you want, I could send you the generated API so that you could look at it and test it!)

[Updated 30 November 2010]
This wrapper is now available from SharpDX. Check this post.[/Updated]

This post is going to be quite long, so if you are not interested in all the internals, you can jump to the sample code at the end.