4 Answers

A lot depends on how you write your code in Mathematica. In my experience, the rule of thumb is that the generated code will be efficient if the code inside Compile more or less resembles the code I would write in plain C (and it is clear why). Idiomatic (high-level) Mathematica code tends to be immutable. At the same time, Compile can handle a number of higher-level functions, such as Transpose, Partition, Map, MapThread, etc. Most of these functions return new expressions, and even though these expressions are probably just passed on to the calling function, they must still be created. For example, a call to ReplacePart that replaces a single part in a large array will necessarily lead to copying of that array. Thus, immutability generally implies creating copies.

So, if you write your code in this style and hand it to Compile, you have to keep in mind that many small (or large) memory allocations on the heap, and many copies of lists (tensors), will be happening.
Since this is not apparent to someone used to high-level Mathematica programming, the resulting slowdown may come as a surprise. See this and this answer for examples of problems caused by many small memory allocations and copying, as well as the speed-up one can get by switching from copying to in-place modifications.
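As a sketch of the point about immutability implying copies (an illustrative example, not from the original answer; the timings will depend on your machine): growing a result with Append copies the growing list on every step, while preallocating and assigning in place does not.

```mathematica
(* immutable style: Append copies the accumulated list each iteration, roughly O(n^2) work *)
grow = Compile[{{n, _Integer}},
   Module[{res = Table[0., {0}]},
     Do[res = Append[res, N[i]], {i, n}];
     res]];

(* mutable style: preallocate once, assign in place, O(n) work *)
fill = Compile[{{n, _Integer}},
   Module[{res = Table[0., {n}]},
     Do[res[[i]] = N[i], {i, n}];
     res]];
```

Comparing `grow[100000] // AbsoluteTiming` against `fill[100000] // AbsoluteTiming` should make the allocation cost visible.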

As noted by @acl, one thing worth doing is to set SetSystemOptions["CompileOptions" -> "CompileReportExternal" -> True], in which case you will get warnings for calls to external functions, etc.

A good tool for getting a "high-level" but precise view of the generated code is the CompilePrint function from the CompiledFunctionTools` package. It prints a pseudocode version of the byte-code instructions generated by Compile. Things to watch for in the CompilePrint output:

Calls to CopyTensor

Calls to MainEvaluate (callbacks to Mathematica, meaning that something could not be compiled down to C)
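A minimal sketch of such an inspection (the function being compiled here is made up for illustration; the Module-local copy of the input is exactly the kind of thing that shows up as a CopyTensor instruction):

```mathematica
Needs["CompiledFunctionTools`"]

(* w = v forces a copy of the input tensor, visible as CopyTensor in the printout *)
cf = Compile[{{v, _Real, 1}},
   Module[{w = v}, w[[1]] = 0.; w]];

CompilePrint[cf]
```

Scan the resulting printout for CopyTensor and MainEvaluate lines; the latter indicate callbacks into the main evaluator.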

One not very widely known technique for writing even large Compile-d functions, assembling them from pieces without a performance penalty, is based on inlining. I consider this answer very illustrative in this respect; I actually posted it to showcase the technique. You can also see this answer, and the discussion in the comments below it, for another example of how this technique may be applied.
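The basic idea of the inlining technique can be sketched as follows (a simplified example of my own, not the code from the linked answers): splice the helper's definition into the body with With before Compile ever sees it, so that Compile compiles one self-contained expression rather than calling out to an external definition.

```mathematica
(* the "piece": a pure function we want reused across several compiled functions *)
With[{helper = Function[x, x*x + 1.]},
  (* With substitutes the literal Function expression into the body,
     so Compile sees no external symbol and nothing falls back to MainEvaluate *)
  cfInlined = Compile[{{v, _Real, 1}}, Map[helper, v]]
]
```

Checking with CompilePrint that the result contains no MainEvaluate call is a good way to confirm the inlining actually happened.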

In summary: if you want your code to be as fast as possible, identify the "critical" places and write those in a "low-level" style (loops, assignments, etc.). The more the code resembles C, the better your chances of a speed-up (for an example of a function written in such a style, and consequently very fast, see the seqposC function from this answer). You will have to go against Mathematica ideology and use a lot of in-place modifications; then your code can be just as fast as hand-written C. Usually there are just a few places in a program where this matters (inner loops, etc.); in the rest of it you can use higher-level functions as well.
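For a flavor of what this "low-level" style looks like (my own illustrative example, not the seqposC function referenced above), here is a loop-and-assignment version of finding the first position where a list exceeds a threshold:

```mathematica
(* written the way one would write it in C: explicit loop, early exit, scalar state *)
firstAbove = Compile[{{lst, _Real, 1}, {thresh, _Real}},
   Module[{i = 1, n = Length[lst], pos = 0},
     While[i <= n && pos == 0,
       If[lst[[i]] > thresh, pos = i];
       i++];
     pos]];   (* returns 0 if no element exceeds thresh *)
```

Nothing here allocates or copies inside the loop, which is precisely what makes this style fast once compiled.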

It may be worth mentioning that one may do things like SetSystemOptions["CompileOptions" -> "CompileReportExternal" -> True] to see warnings about calls to external functions, as well as explore the other options in SystemOptions["CompileOptions"].
– acl Jan 27 '12 at 20:53

Also, Compile does not work for highly optimized commands such as SparseArray or DiagonalMatrix. I don't think that rewriting SparseArray or DiagonalMatrix in C style and compiling to C would be more efficient than the built-in SparseArray. So sometimes it is better to skip Compile and use the highly optimized built-in command.
– DaoTRINH Feb 14 '14 at 18:01

@DaoTRINH Yes, sure, I agree. I think this discussion was (intentionally) restricted to the subset of Mathematica code / structures which Compile can handle.
– Leonid Shifrin Feb 14 '14 at 18:06

@LeonidShifrin :) Weekend and SE. Do you think that rewriting Compile could solve these problems, so as to generate more optimized C-like code and compile it with a C compiler? Have you tried the MathCode add-on? wolfram.com/products/applications/mathcode
– DaoTRINH Feb 14 '14 at 19:34

In addition to the answers given, you may tweak specific commands for better performance. For example, Part[] is a candidate for this: Part has to do bounds checks. In time-critical inner loops you can switch those off by using Compiler`GetElement[] instead. Be very cautious with this one.
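A minimal sketch of this substitution (my own example; the function name is made up for illustration):

```mathematica
(* Compiler`GetElement skips the bounds check that Part performs;
   if an index ever goes out of range, the result is undefined and can crash the kernel *)
totalFast = Compile[{{v, _Real, 1}},
   Module[{s = 0.},
     Do[s += Compiler`GetElement[v, i], {i, Length[v]}];
     s]];
```

Here the loop bounds guarantee the index is always valid, which is the only situation in which dropping the check is safe.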

Another thing you might want to try (I have never needed this myself) is to give platform-specific compile optimization options that your CPU supports.
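One way to pass such flags is through the CCompilerDriver` package, whose CreateLibrary function accepts a "CompileOptions" option (a hedged sketch; the flags shown are gcc-specific examples, and the source string and library name are made up for illustration):

```mathematica
Needs["CCompilerDriver`"]

(* -O3 and -march=native are gcc flags; other compilers need their own equivalents *)
lib = CreateLibrary["double fsquare(double x){ return x*x; }", "fsquare",
   "CompileOptions" -> "-O3 -march=native"];
```

The -march=native flag in particular is discussed in the comments below: it lets gcc use everything your CPU offers, at the cost of portability of the compiled binary.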

You showed many interesting things! Is the expression optimizer used by Compile, or is it meant for manual use? Does the Experimental`OptimizedExpression head have any special function, or could it be a simple Hold instead?
– Szabolcs Jan 27 '12 at 22:03

Yes, the expression optimizer is used inside of Compile (and other functions in Mathematica, if I am not mistaken). Presumably OptimizedExpression is used during parsing, but I do not know.
– user21 Jan 27 '12 at 22:18


Another option you might want to pass to gcc is -march=native: that way you allow gcc to use everything the processor on your machine has to offer, at the cost of portability of the compiled code (which in this case you probably won't care about).
– celtschk Mar 9 '12 at 12:01

I'm not certain about the inlining, and there may be other options worth tweaking to get the best speed. Also, I would imagine that to some extent the speed will depend on the optimization capabilities of the C compiler.
