In this article, I shall describe one of the methods that can be used to transform C/C++ code into C# code with the least amount of effort. The principles laid out in this article are also suitable for other pairs of languages, though. I want to warn you straight off that this method is not applicable to porting GUI-related code.

What is this useful for? For example, I have used this method to port libtiff, the well-known TIFF library, to C# (and libjpeg too). This allowed me to reuse the work of the many people who contributed to libtiff, along with the .NET Framework Class Library, in my program. The code examples in this article are taken mainly from the libtiff and libjpeg libraries.

The "one-click" build and test runs requirement is there to speed up the "change - compile - run tests" cycle as much as possible. The more time and effort goes into each such cycle, the fewer times it will be executed. This may lead to massive and complex roll-backs of erroneous changes.

You can use any version control system. I personally use Subversion; pick whatever you're comfortable with. Anything is better than a set of folders on a hard disk.

Tests are required to make sure that the code still retains all of its features at any given time. Being safe in the knowledge that no functional changes are introduced into the code is what sets my method apart from the "let's rewrite it from scratch in the new language" approach. Tests are not required to cover 100% of the code, but it's desirable to have tests for all of its main features. The tests shouldn't access the internals of the code; otherwise you'll have to rewrite them constantly.

Here's what I used to port LibTiff:

A set of images in TIFF format

tiffcp, the command-line utility that converts TIFF images between different compression schemes

A set of batch scripts that use tiffcp for conversion tasks

A set of reference output images

A program that performs binary comparison of output images with the set of reference images

To grasp refactoring concepts, you only need to read one book: Martin Fowler's Refactoring: Improving the Design of Existing Code. Be sure to read it if you haven't already. Any programmer can only gain from knowing refactoring principles. You don't have to read the entire book; the first 130 pages are enough. That is the first five chapters and the beginning of the sixth, up to "Inline Method".

It goes without saying that the better you know the languages used in your source and destination code, the easier the transformation will go. Please note that deep knowledge of the internals of the original code is not required when you begin. It's enough to understand what the original code does; a deeper understanding of how it does it will come in the process.

The essence of the method is that the original code is simplified through a series of small and simple refactorings. You shouldn't attempt to change a large chunk of code and optimize it all at once. You should progress in small steps, run the tests after every change, and make sure to save every successful modification. That is: make a small change, test it, and if all is well, save the change in the VCS repository.

The transfer process can be broken down into three big stages:

Replacing everything in the original code that relies on language-specific features with something simpler but functionally equivalent. This frequently leads to slower and less neat-looking code, but don't let that concern you at this stage.

Modifying the altered code so that it compiles in the new language.

Transferring the tests and making the functionality of the new code match that of the code in the source language.

Only after completing these stages should you look at the speed and the beauty of the code.

The first stage is the most complex. The goal is to refactor the C/C++ code into "pure C++" code whose syntax is as close to C# syntax as possible. This stage means getting rid of:

unused code and conditional compilation

preprocessor macros that emulate functions

goto operators and fall-through switch cases

free functions and structs (by combining them into classes)

multiple inheritance and inconvenient typedefs

pointer arithmetic and function pointers

code that depends on the C/C++ standard library or WinAPI

First of all, we should get rid of unused code. For instance, in the case of libtiff, I removed the files that were not used to build the Windows version of the library. Then, in the remaining files, I found all the conditional compilation directives ignored by the Visual Studio compiler and removed them as well. Some examples are given below:
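
For illustration, here is the kind of directive that can disappear once only one compiler matters (a hypothetical sketch, not the actual libtiff code):

#ifdef unix
#include <unistd.h>
#else
#include <io.h>
#endif

With only the Visual Studio build to support, the #ifdef, #else and #endif lines go away together with the unix branch, and a plain #include <io.h> remains.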

Frequently, conditional compilation is used to create specialized versions of the program. That is, some files contain a #define compiler directive, while code in other files is enclosed in #ifdef and #endif. Example:
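
A hypothetical sketch of the pattern (the BMP_SUPPORTED symbol anticipates the example discussed next):

/* in a configuration header */
#define BMP_SUPPORTED

/* in an implementation file */
#ifdef BMP_SUPPORTED
void ReadBmpImage(const char* filename)
{
    /* ... */
}
#endif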

I would suggest deciding straight away which version you need and getting rid of conditional compilation. For example, should you decide that BMP format support is necessary, you should remove the #ifdef BMP_SUPPORTED directives (keeping the code they guard) from the entire code base.

If you do have to keep the ability to create several versions of the program, you should have tests for every version. I suggest keeping the most complete version around and working with it. After the transition is complete, you may add the conditional compilation directives back in.

But we are not done working with the preprocessor yet. Next, it's necessary to find preprocessor macros that emulate functions and change them into real functions.

To make a proper signature for such a function, it is necessary to find out the types of all the arguments. Please note that BitAcc, BitsAvail, EOLcnt, cp and ep get assigned inside the macro. These variables will become arguments of the new function, and they should be passed by reference. That is, you should use uint32& for BitAcc in the function's signature.
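
Here is a simplified sketch of the transformation; the real fax-decoder macros in libtiff are much larger and also assign to EOLcnt, cp and ep, which would become reference arguments in the same way:

typedef unsigned int uint32;

/* before: a macro emulating a function; it assigns
   to BitAcc and BitsAvail in the calling code */
#define DropBits(n) do {    \
    BitAcc >>= (n);         \
    BitsAvail -= (n);       \
} while (0)

/* after: a real function; the variables the macro
   assigned to are now passed by reference */
static void DropBits(int n, uint32& BitAcc, int& BitsAvail)
{
    BitAcc >>= n;
    BitsAvail -= n;
}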

Programmers sometimes abuse the preprocessor. Check out an example of such misuse:
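
The following is simplified from libjpeg's jdhuff.h, but it keeps the structure of the original:

/* "functions" that silently use the surrounding
   variables get_buffer and bits_left */
#define PEEK_BITS(nbits) \
    ((int)(get_buffer >> (bits_left - (nbits))) & ((1 << (nbits)) - 1))

#define DROP_BITS(nbits) \
    (bits_left -= (nbits))

/* a "function" built out of other "functions" */
#define HUFF_DECODE(result, htbl) do {            \
    int nb, look = PEEK_BITS(HUFF_LOOKAHEAD);     \
    if ((nb = (htbl)->look_nbits[look]) != 0) {   \
        DROP_BITS(nb);                            \
        result = (htbl)->look_sym[look];          \
    }                                             \
} while (0)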

In the code above, PEEK_BITS and DROP_BITS are also "functions", created similarly to HUFF_DECODE. In this case, the most reasonable approach is probably to inline the code of the PEEK_BITS and DROP_BITS "functions" into HUFF_DECODE to ease the transformation.

You should go on to the next stage of refining the code only when nothing but the most harmless preprocessor directives (such as the simple constant defines discussed further below) are left.

You can get rid of goto operators by introducing boolean variables and/or restructuring the function. For example, if a function uses goto to break out of a loop, that construction can be changed into the setting of a boolean variable, a break statement, and a check of the variable's value after the loop.
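
A sketch with hypothetical names (items, target and process are invented for the example):

void process(int value);

/* before: goto is used to break out of the loop */
void findAndProcess(int items[], int count, int target)
{
    for (int i = 0; i < count; i++) {
        if (items[i] == target)
            goto found;
    }
    return;

found:
    process(target);
}

/* after: a boolean variable, a break, and a check after the loop */
void findAndProcess(int items[], int count, int target)
{
    bool found = false;
    for (int i = 0; i < count; i++) {
        if (items[i] == target) {
            found = true;
            break;
        }
    }

    if (found)
        process(target);
}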

My next step is to scan the code for all switch statements containing a case without a matching break. Implicit fall-through from one case to another is not allowed in C#, so such places have to be reworked.
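
One simple way is to duplicate the shared code so that every case ends with a break. A sketch with hypothetical names:

void prepare();
void encode();

/* before: the first case falls through into the second */
void compressBefore(int mode)
{
    switch (mode) {
    case 0: /* prepare and encode */
        prepare();
        /* fall through */
    case 1: /* encode only */
        encode();
        break;
    }
}

/* after: no fall-through; every case ends with a break */
void compressAfter(int mode)
{
    switch (mode) {
    case 0:
        prepare();
        encode();
        break;
    case 1:
        encode();
        break;
    }
}

C# also offers the goto case statement for intentional fall-through, but since we are getting rid of goto anyway, duplicating or extracting the shared code is usually cleaner.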

Everything I described until now is not supposed to take much time - not compared to what lies ahead. The first massive task we're facing is combining data and functions into classes. The aim is to make every function a method of some class.

If the code was initially written in C++, it will probably contain few free (non-member) functions. In this case, the relationships between the existing classes and the free functions should be found. Usually, it turns out that free functions play an ancillary role for the classes. If a function is only used by one class, it can be moved into that class as a static method. If a function is used by several classes, a new class can be created with this function as its static member.

If the code was created in C, there will be no classes in it. They'll have to be created from the ground up by grouping functions around the data they manipulate. Fortunately, this logical relationship is usually quite easy to figure out - especially if the C code was written with some OOP principles in mind.

In libtiff, for example, it's easy to see that the TIFF struct begs to become a class, and the functions that take a pointer to it beg to be changed into public methods of this class. So, we change the struct to a class and those functions to static methods of the class.
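
A much-simplified sketch of the idea (the real libtiff struct and functions have many more members and parameters):

/* before: a C struct plus free functions taking a pointer to it */
struct tiff
{
    char* tif_name;
    int   tif_fd;
};

extern struct tiff* TIFFOpen(const char* name, const char* mode);
extern int  TIFFReadScanline(struct tiff* tif, void* buf, unsigned row);
extern void TIFFClose(struct tiff* tif);

/* after: the struct becomes a class, the functions - its static methods */
class Tiff
{
public:
    static Tiff* Open(const char* name, const char* mode);
    static int   ReadScanline(Tiff* tif, void* buf, unsigned row);
    static void  Close(Tiff* tif);

private:
    char* m_name;
    int   m_fd;
};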

As most functions become methods of various classes, it'll become easier to see what to do with the remaining non-member functions. Don't forget that not all of the free functions should become public methods. There are usually a few ancillary functions not intended for use from the outside; these should become private methods.

After the free functions have been changed into static methods of classes, I suggest getting down to replacing calls to the malloc/free functions with new/delete operators and adding constructors and destructors. Then the static methods can gradually be turned into full-blown instance methods. As more and more static methods are converted to non-static ones, it'll become clear that at least one of their arguments is redundant: the pointer to the original struct that has become the class. It may also turn out that some arguments of private methods can become member variables. For example:
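
Sketched on the same simplified Tiff class as before:

#include <cstdlib>

/* malloc/free become new/delete */
void before(int size)
{
    unsigned char* buf = (unsigned char*)malloc(size);
    /* ... */
    free(buf);
}

void after(int size)
{
    unsigned char* buf = new unsigned char[size];
    /* ... */
    delete[] buf;
}

/* a static method loses its redundant argument when it
   becomes a full-blown instance method */
static int ReadScanline(Tiff* tif, void* buf, unsigned row);
/* becomes */
int ReadScanline(void* buf, unsigned row);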

Now that a set of classes has replaced the set of functions and structs, it's time to get back to the preprocessor - that is, to defines like the one below (no other kinds should remain by now):

#define STRIP_SIZE_DEFAULT 8192

Such defines should be turned into constants, and you should find or create an owner class for them. As with functions, the newly created constants may require a special new class (perhaps called Constants). And like methods, the constants may have to be public or private.
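
For instance, using the Constants name suggested above:

class Constants
{
public:
    static const int STRIP_SIZE_DEFAULT = 8192;
};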

If the original code was written in C++, it may rely on multiple inheritance. This is another thing to get rid of before converting the code to C#. One way to deal with it is to change the class hierarchy so that multiple inheritance is excluded. Another way is to make sure that all the base classes of a class that uses multiple inheritance contain only pure virtual methods and no member variables - such base classes map directly to C# interfaces. For example:
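
A sketch with hypothetical classes; both base classes contain only pure virtual methods and no data:

class Readable
{
public:
    virtual int Read(unsigned char* buffer, int count) = 0;
};

class Writable
{
public:
    virtual int Write(const unsigned char* buffer, int count) = 0;
};

/* multiple inheritance that C# can express as
   "class MemoryStream : IReadable, IWritable" */
class MemoryStream : public Readable, public Writable
{
public:
    virtual int Read(unsigned char* buffer, int count) { /* ... */ return 0; }
    virtual int Write(const unsigned char* buffer, int count) { /* ... */ return 0; }
};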

Before going over to the next big task (getting rid of pointer arithmetic), we should pay special attention to type synonym declarations (the typedef keyword). Sometimes these are used as mere shorthand for longer types. For instance:

typedef vector<Command*> Commands;

I prefer to inline such declarations - that is, locate the uses of Commands in the code, change them to vector<Command*>, and delete the typedef.

Mind the names of the types being created. It's obvious that typedef short int16 and typedef int int32 are more of a hindrance than a help, so it makes sense to change int16 to short and int32 to int throughout the code. Other typedefs, on the other hand, are quite useful. It's a good idea, however, to rename them so that they match type names in C#, like so:
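
For example (the exact set of synonyms depends on your code base):

typedef unsigned char  byte;    /* C# byte   */
typedef unsigned short ushort;  /* C# ushort */
typedef unsigned int   uint;    /* C# uint   */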

Special attention should be paid to declarations similar to the following one:

typedef unsigned char JBLOCK[64]; /* one block of coefficients */

This declaration defines JBLOCK as an array of 64 elements of type unsigned char. I prefer to convert such declarations into classes - in other words, to create a JBLOCK class that serves as a wrapper around the array and provides methods to access its individual elements. This makes it much easier to see how arrays of JBLOCKs (particularly 2- and 3-dimensional ones) are created, used and destroyed.
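
A minimal sketch of such a wrapper:

class JBLOCK
{
public:
    unsigned char& operator[](int index)
    {
        return m_data[index];
    }

private:
    unsigned char m_data[64];
};

When the code reaches C#, operator[] naturally becomes an indexer.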

Functions built on pointer arithmetic have to be rewritten, since pointer arithmetic is unavailable in C# by default. You may use such arithmetic in unsafe code, but unsafe code has its disadvantages. That's why I prefer to rewrite such code using "index arithmetic". It goes like this:
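
A simplified sketch in the spirit of the libtiff horizontal accumulators (the real horAcc8 takes different parameters):

typedef unsigned char uint8;

/* before: the pointer itself moves through the buffer */
static void horAcc8(uint8* cp, int n)
{
    while (--n > 0) {
        cp[1] = (uint8)(cp[0] + cp[1]);
        cp++;
    }
}

/* after: the pointer stays put; an index moves instead */
static void horAcc8(uint8 cp[], int n)
{
    int cpIndex = 0;
    while (--n > 0) {
        cp[cpIndex + 1] = (uint8)(cp[cpIndex] + cp[cpIndex + 1]);
        cpIndex++;
    }
}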

The resulting function does the same job but uses no pointer arithmetic, so it can be ported to C# directly. It may also be somewhat slower than the original but, again, this is not our priority for now.

Special attention should be paid to the functions that change pointers passed to them as arguments. Below is an example of such a function:

void horAcc32(int stride, uint* & wp, int wc)

In this case, changing wp inside horAcc32 changes the pointer in the calling function as well. Introducing an index is still a suitable approach here; you just need to define the index in the calling function and pass it to horAcc32 by reference, so that the caller sees the updated position:
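
A sketch of the changed signature and call site (the wpIndex name is invented):

typedef unsigned int uint;

void horAcc32(int stride, uint wp[], int& wpIndex, int wc);

void caller(uint wp[], int stride, int wc)
{
    int wpIndex = 0;
    horAcc32(stride, wp, wpIndex, wc);
    /* wpIndex now tells how far horAcc32 has advanced through wp */
}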

Function pointers are used in original code in several distinct ways, and each calls for its own treatment. In the first case, a function pointer stored in a class selects the behavior at creation time: the functionality of the Calc method varies depending on which of the CreateSummator and CreateMultiplicator methods was called to create an instance of the class. I prefer to create a private enum in the class that describes all possible choices of functionality, plus a field that keeps a value of that enum. Then, instead of a call through the function pointer, I create a method consisting of a switch statement (or several ifs) that selects the necessary behavior based on the field's value. The changed code:
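
A sketch of what the changed code might look like; everything except the Calc, CreateSummator and CreateMultiplicator names is hypothetical:

class Calculator
{
public:
    static Calculator* CreateSummator()
    {
        return new Calculator(UseSum);
    }

    static Calculator* CreateMultiplicator()
    {
        return new Calculator(UseMultiplication);
    }

    int Calc(int left, int right)
    {
        /* the switch replaces the call through a function pointer */
        switch (m_operation) {
        case UseSum:
            return left + right;
        case UseMultiplication:
            return left * right;
        }
        return 0;
    }

private:
    enum Operation { UseSum, UseMultiplication };

    Calculator(Operation operation) : m_operation(operation) {}

    Operation m_operation;
};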

The second case is function pointers serving as customization hooks, like the vsetfield/vgetfield/printdir pointers in libtiff. This situation is best resolved by turning vsetfield/vgetfield/printdir into virtual methods. Code that used to replace these pointers will have to define a class derived from TIFFTagMethods with the required implementations of the virtual methods.
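
A sketch with simplified signatures (the real libtiff methods take more arguments):

class Tiff;
typedef unsigned int uint32;

/* virtual methods instead of the vsetfield/vgetfield/printdir pointers */
class TIFFTagMethods
{
public:
    virtual int  vsetfield(Tiff* tif, uint32 tag) { return 0; }
    virtual int  vgetfield(Tiff* tif, uint32 tag) { return 0; }
    virtual void printdir(Tiff* tif) {}
};

/* code that used to replace the pointers now derives and overrides */
class MyTagMethods : public TIFFTagMethods
{
public:
    virtual int vsetfield(Tiff* tif, uint32 tag)
    {
        /* custom tag handling, then fall back to the base behavior */
        return TIFFTagMethods::vsetfield(tif, tag);
    }
};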

An example of the third case (function pointers created by users and passed into the program):
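
A hypothetical reconstruction around the PROC and DoUsingMyProc names discussed next (the exact signature is an assumption):

/* a callback type and a function that accepts a user-supplied callback */
typedef int (*PROC)(int value);

void DoUsingMyProc(PROC proc)
{
    /* ... the library calls proc(...) at the appropriate moments ... */
}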

Delegates are best suited here. At this stage, while the original code is still being polished, nothing else needs to be done. At the later stage, when the project is transferred to C#, a delegate should be created instead of PROC, and the DoUsingMyProc function should be changed to accept an instance of the delegate as an argument:
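
A sketch of the C# counterpart, mirroring the hypothetical PROC signature above:

class Library
{
    // the delegate replaces the PROC function-pointer type
    public delegate int Proc(int value);

    public static void DoUsingMyProc(Proc proc)
    {
        // the callback is invoked like a regular method
        proc(42);
    }
}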

The last change to the original code is the isolation of anything that may be a problem for the new compiler. This may be code that actively uses the standard C/C++ library (functions like fprintf, gets, atof and so on) or WinAPI. In C#, such code will have to be changed to use .NET Framework methods or, if need be, the P/Invoke mechanism. Take a look at the www.pinvoke.net site in the latter case.

"Problem code" should be localized as much as possible. To this end, you could create a wrapper class for the functions from C/C++ standard library or WinAPI. Only this wrapper will have to be changed later.

This is the moment of truth - the time to bring the changed code into a new project built with the C# compiler. It's quite straightforward, but labor-intensive. Create a new empty project, add the necessary classes to it, and copy the code from the corresponding original classes into them.

You'll have to remove the ballast at this stage (various #include directives, for instance) and make some cosmetic modifications. "Standard" modifications include:

combining code from .h and .cpp files

replacing obj->method() with obj.method()

replacing Class::StaticMethod with Class.StaticMethod

removing * in func(A* anInstance)

replacing func(int& x) with func(ref int x)

Most of the modifications are not particularly complex, but some of the code will have to be commented out - mostly the problem code discussed in part 2.9. The main goal here is to get C# code that compiles. It most probably won't work yet, but we'll come to that in due time.

After we've made the converted code compile, we need to adjust it until its functionality matches the original. For that, we create a second set of tests that exercises the converted code. The methods commented out earlier need to be carefully revised and rewritten using the .NET Framework. I think this part needs no further explanation; I just want to expand on a few fine points.

When creating strings from byte arrays (and vice versa), the encoding should be selected carefully. Encoding.ASCII should be avoided because of its 7-bit nature: bytes with values above 127 will become "?" instead of proper characters. It's best to use Encoding.Default or Encoding.GetEncoding("Latin1"). The actual choice depends on what happens next with the text or the bytes. If the text is to be displayed to the user, then Encoding.Default is the better choice; if the text is to be converted back to bytes and saved into a binary file, then Encoding.GetEncoding("Latin1") suits better.
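
A sketch of both directions, assuming the Latin1 choice discussed above:

using System.Text;

static class TextConversion
{
    // Latin1 maps every byte value 0..255 to the character with
    // the same code, so round-tripping loses no information
    public static string BytesToText(byte[] bytes)
    {
        return Encoding.GetEncoding("Latin1").GetString(bytes);
    }

    public static byte[] TextToBytes(string text)
    {
        return Encoding.GetEncoding("Latin1").GetBytes(text);
    }
}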

Output of formatted strings (code corresponding to the printf family of functions in C/C++) may present certain problems. The functionality of String.Format in the .NET Framework is both poorer and different in syntax. This problem can be solved in two ways (a sketch of the second approach follows the list):

Create a class that mimics the functionality of the printf functions

Change the format strings so that String.Format produces the same result (not always possible).
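
A small example of the second approach: the C specifier %8.2f has a close String.Format counterpart, although not every printf specifier does:

static class FormatExample
{
    public static string Describe(int row, double value)
    {
        // C version: sprintf(buffer, "row %d, value %8.2f", row, value);
        return string.Format("row {0}, value {1,8:F2}", row, value);
    }
}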

When all the tests that use the converted code pass successfully, the conversion can be considered complete. Now we can return to the fact that the code doesn't quite conform to C# conventions (for example, it's full of get/set methods instead of properties) and start refactoring the converted code. You may use a profiler to identify bottlenecks in the code and optimize them. But that's quite a different story.


About the Author

Vitaliy Shibaev is a developer and co-founder of Bit Miracle, the company developing the Docotic.Pdf and LibTiff.Net libraries.

The Docotic.Pdf library makes it easy to create, protect, and modify PDF documents in .NET. It can also be used for text extraction and PDF form filling. It has a clean and powerful PDF API that helps you create professional-quality documents.

LibTiff.Net is a fully managed port of the well-known TIFF library. It is fully open source and free to use. The API is almost the same as in the original libtiff. There is also a Silverlight version.

I must admit that porting some old C/C++ code to C# might be useful in some cases. However, the reason you've given for it is a bit odd: "This allowed me to reuse work of many people contributed to libtiff along with the .NET Framework Class Library in my program". I've used various C libraries in C# applications in the past and honestly never considered porting them. In my opinion, writing a managed C++ wrapper might be easier in many cases.

For example, let's suppose you want to use the FFMPEG library to access video from your .NET application. Would you port it as well? And when there is a new version of the library, will you always go and merge the changes into your ported version? That would be a support nightmare to me. Instead, you could just leave the C/C++ library as is, allowing it to evolve its own way, and introduce a limited wrapper over the functionality you need. If the library keeps backward compatibility of its public API, then with a new version of that library you would just need to rebuild your wrapper.

While doing the porting work, did you do any profiling to see whether those "new realities" provide the same or similar performance? Recently I compared the same code built in C# and in C (GCC), and it so happened that C# lost in performance. So it would be interesting to see some performance comparison in order to understand whether this porting effort is worth it (even with those new realities, there are still applications out there that require performance).

First, thank you for sharing your experience with us.
If I had to migrate a C++ library to C#/.NET, I would proceed in three steps:

1) Wrap the C++ library in C++/CLI and write non-regression test cases in C#:
[C# test code] -> [C++/CLI wrapper] -> [C++ library code]
At that stage, the final C# API along with the test cases is more or less defined.

The main advantages of this strategy are that:
- you define your C# library API as you want it early,
- you write test code early that you can keep and that serves you as a safety net all along the migration process,
- you could stop at the end of step #1 and already be able to call your library from C#.

Thanks for your response! Using C++/CLI is also a good way; we considered it before starting.
In any case, I think the most important parts of the migration process are a reliable test set and thorough refactoring.

Having gone through the process of converting C++ to C#, I concur with most of the cases you've mentioned, but I think you missed a lot of the more common issues that someone will run across. Here are the ones that cause untold issues during a conversion:

1) Bitfields - very common in C and not supported in C#. Unfortunately, it is not a simple matter of converting them to bools, since bitfields can be any size and code often relies on the fact that bitfields are packed together in the generated code. I worked around this by introducing a BitField class that basically acted like a bit array.
2) Unions - generally solvable by using LayoutKind.Explicit, but there are quite a few cases that it won't solve (such as the whole "explicit layout doesn't really carry over to managed code" problem), and there are other issues such as strings in the union.
3) Non-standard datatypes like time_t, which are implementation-specific and poorly documented.
4) Reference variables inside classes. Generally these are set in a ctor, and then any change to one impacts the other. Not easily reproducible for value types.
5) Any UI code, whether it be Win32, MFC or otherwise. This also includes the migration of any resources.
6) Strings - it is not always sufficient to convert to a C# string, because often the string size is a constant that the code relies on. Therefore the migrated code either has to remove this limit altogether or somehow enforce it. I originally created a FixedString type to wrap such strings with error handling, but I've since moved to standard strings with an attribute so that I can remove the limit down the road.
7) Memory functions - very common in C but not doable in C#. There is no alternative here but to do manual assignment, and this gets complicated when code does a memset.
8) Equality operators - in C# it is a bad idea to define these operators on reference types, but C++ uses them all the time. There is no easy way to handle this short of finding all references in the code. The compiler won't help here.
9) Copy constructors/assignment - in C++ there are many hidden calls to copy ctors and assignment operators that just won't work right in C#. In C++ we often rely on different instances being used, but in C# this doesn't happen for reference types. A common approach is to replace these calls with a CopyFrom call, but that only works for base types; derived types are still broken.

For my conversion process I used Instant C++ from Tangible Software to do the bulk of the conversion. It covered most of these issues, but there were still lots of places where I had to make manual changes. Overall, I don't know that a direct conversion from C to C# is the best approach.

You're right, I didn't cover all possible cases, but that was not my goal. I think an overview of the general method with some examples is more fundamental than a list of tips & tricks like "how to convert unions to C#".
The point of the method is that we don't do a direct conversion from C to C#; we first convert the C code to C++ free of any problem constructs, and then convert this C#-like C++ code to pure managed C#. The main thing is to have all tests passing at all times.

Great post, Vitaliy.
I'm developing a JPEG codec in C# from scratch. I hadn't seen your libjpeg.net while searching Google for "jpeg" or "jpeg C#". If I had known about libjpeg.net earlier, it would have saved me a lot of effort.

Could you please post libjpeg.net and libtiff.net on CodePlex, SourceForge, or Google Code? That would make your code more visible to other developers.

I enjoyed reading your article and gave it a positive vote. It seems very useful for someone who wishes to port C/C++ libraries to managed code.

Just a quick comment/opinion regarding old C preprocessor macros. In your article you give some examples of what you call preprocessor abuse. However, the reason many of these old C libraries use preprocessor macros is speed optimization. In some cases, inlined functions can speed up an algorithm by many factors.

You might be thinking that the author could have just used the inline keyword. However, the inline keyword is only a hint that the compiler can completely ignore. The only way to guarantee that these C99 functions are actually inlined is to use a preprocessor macro.

Congratulations, it looks like you guys did a great job converting those libraries to managed code. Keep up the good work.

Yes, sure, everything you said about the usage of macros is correct. But I'm also sure that:
1) often, such complex macros were implemented without any profiling (no speed measurements comparing them with functions), so they are a kind of premature optimization;
2) such macros could be better structured, well commented, and much more human-readable.