**WARNING**: Due to CPU limitations on Binder, not all the benefits of SIMD instructions will be visible. You will see better SIMD performance if you download this notebook and run on Intel Haswell / AMD Zen or later.

Most modern CPUs have support for instructions that apply the same operation to multiple data elements simultaneously. These are called "Single Instruction, Multiple Data" (SIMD) operations, and the LLVM backend used by Numba can generate them in some cases to execute loops more quickly. (This process is called "autovectorization.")

For example, Intel processors have support for SIMD instruction sets like:

SSE (128-bit inputs)

AVX (256-bit inputs)

AVX-512 (512-bit inputs, Skylake-X and later or Xeon Phi)

These wide instructions typically operate on as many values as will fit into an input register. For AVX instructions, this means that either 8 float32 values or 4 float64 values can be processed as a single input. As a result, the NumPy dtype that you use can potentially impact performance to a greater degree than when SIMD is not in use.

In [ ]:

importnumpyasnpfromnumbaimportjit

It can be somewhat tricky to determine when LLVM has successfully autovectorized a loop. The Numba team is working on exporting diagnostic information to show where the autovectorizer has generated SIMD code. For now, we can use a fairly crude approach of searching the assembly language generated by LLVM for SIMD instructions.

It is also interesting to note what kind of SIMD is used on your system. On x86_64, the name of the registers used indicates which level of SIMD is in use:

SSE: xmmX

AVX/AVX2: ymmX

AVX-512: zmmX

where X is an integer.

Note: The method we use below to find SIMD instructions will only work on Intel/AMD CPUs. Other platforms have entirely different assembly language syntax for SIMD instructions.

Numba has created two different implementations of the function, one for float32 1-D arrays, and one for float64 1-D arrays:

In [ ]:

sqdiff.signatures

This allows Numba (and LLVM) to specialize the use of the SIMD instructions for each situation. In particular, using lower precision floating point allows twice as many values to fit into a SIMD register. We will see that for the same number of elements, the float32 calculation goes twice as fast:

In [ ]:

%timeit sqdiff(x32, y32)
%timeit sqdiff(x64, y64)

We can check for SIMD instructions in both cases. (Due to the order of compilation above, signature 0 is the float32 implementation and signature 1 is the float64 implementation.)

In general, the autovectorizer cannot deal with branches inside loops, although this is an area where LLVM is likely to improve in the future. Your best bet for SIMD acceleration is to only have pure math operations in the loop.

As a result, you would naturally assume a function like this would be OK:

Notice that the constant 2 has been typed as a float64 value. Later, this causes the multiplication 2 * (x[i] - y[i] to promote up to float64, and then the rest of the calculation becomes float64. This is a situation where Numba is being overly conservative (and should be fixed at some point), but we can tweak this behavior by casting the constant to the type we want:

In [ ]:

@jit(nopython=True,error_model='numpy')deffrac_diff3(x,y):out=np.empty_like(x)dt=x.dtype# Cast the constant using the dtype of the inputforiinrange(x.shape[0]):# Could also use np.float32(2) to always use same type, regardless of inputout[i]=dt.type(2)*(x[i]-y[i])/(x[i]+y[i])returnout

In [ ]:

frac_diff3(x32,y32)

In [ ]:

%timeit frac_diff3(x32, y32)
%timeit frac_diff3(x64, y64)

Now our float32 version is nice and speedy (and 6x faster than what we started with, if we only care about float32).

The autovectorizer can also optimize reduction loops, but only with permission. Normally, compilers are very careful not to reorder floating point instructions because floating point arithmetic is approximate, so mathematically allowed transformations do not always give the same result. For example, it is not generally true for floating point numbers that:

(a + (b + c)) == ((a + b) + c)

For many situations, the round-off error that causes the difference between the left and the right is not important, so changing the order of additions is acceptable for a performance increase.

To allow reordering of operations, we need to tell Numba to enable fastmath optimizations:

In [ ]:

@jit(nopython=True)defdo_sum(A):acc=0.# without fastmath, this loop must accumulate in strict orderforxinA:acc+=x**2returnacc@jit(nopython=True,fastmath=True)defdo_sum_fast(A):acc=0.# with fastmath, the reduction can be vectorized as floating point# reassociation is permitted.forxinA:acc+=x**2returnacc

If you follow the above guidelines, SIMD autovectorization will work for all basic math operations (+,-,*,\), but generally will not work for function calls in the loop, unless LLVM can inline the function and there is only basic math in the function body.

However, we build Numba (if you get conda packages from Anaconda or wheels from PyPI) using a patched version of LLVM that supports vectorization of special math functions when Intel SVML ("Short Vector Math Library") is present. This library comes with the Intel compiler, and is also freely redistributable. We've installed it in the current conda environment using conda install -c numba icc_rt, as we can verify here:

In [ ]:

! conda list icc_rt

Thanks to this library, we can still get SIMD vectorization in a function like this:

# The distribution we are approximating is flat between -1 and 1, so we expect a KDE value of ~0.5 everywheremeans=np.random.uniform(-1,1,size=10000)# These widths are not selected in any reasonable way. Consult your local statistician before approximating a PDF.widths=np.random.uniform(0.1,0.3,size=10000)kde(0.4,means,widths)

We can see that SIMD instructions were generated:

In [ ]:

find_instr(kde,'subp')

We can also see that calls to the special Intel SVML functions for exp were generated:

In [ ]:

find_instr(kde,keyword='svml')

In [ ]:

%timeit kde(0.4, means, widths)

If we recompile the function (which is possible since the .py_func attribute holds the original Python function) with the extra flags to allow division and reductions to work, this stops all autovectorization of the loop:

In [ ]:

slow_kde=jit(nopython=True)(kde.py_func)slow_kde(0.4,means,widths)

Note that we get a slightly different answer, both due to the different order of operations, and the small differences in SVML exp compared to the default exp. We also see that there is no SIMD or calls to SVML:

Why is NumPy as fast as it is? In this case, it is because the Anaconda build of NumPy uses MKL to accelerate (with SIMD and threads) many of the individial ufuncs, so it is only when Numba can combine all the operations together that the speed boost emerges.

Incidentally, although we wrote out the iteration for kde as a for loop to highlight what was going on, you still get the benefit of SIMD in Numba when compiling array expressions. We could have compiled numpy_kde directly: