Previous topic

Next topic

This Page

Quick search

Functions compiled using jit() or vectorize()
are open-ended: they can be called with many different input types and
have to select (possibly compile on-the-fly) the right low-level
specialization. We hereby explain how this mechanism is implemented.

JIT-compiled functions can take several arguments and each of them is
taken into account when selecting a specialization. Thus it is a
form of multiple dispatch, more complex than single dispatch.

Each argument weighs in the selection based on its Numba type. Numba types are often more granular than Python types:
for example, Numba types Numpy arrays differently depending on their
dimensionality and their layout (C-contiguous, etc.).

Once a Numba type is inferred for each argument, a specialization must
be chosen amongst the available ones; or, if not suitable specialization
is found, a new one must be compiled. This is not a trivial decision:
there can be multiple specializations compatible with a given concrete
signature (for example, say a two-argument function has compiled
specializations for (float64,float64) and (complex64,complex64),
and it is called with (float32,float32)).

Therefore, there are two crucial steps in the dispatch mechanism:

infer the Numba types of the concrete arguments

select the best available specialization (or choose to compile a new one)
for the inferred Numba types

This document discusses dispatching when it is done at runtime, i.e.
when a JIT-compiled function is called from pure Python. In that context,
performance is important. To stay in the realm of normal function call
overhead in Python, the overhead of dispatching should stay under a
microsecond. Of course, the faster the better...

When a JIT-compiled function is called from another JIT-compiled
function (in nopython mode), the polymorphism is resolved at
compile-time, using a non-performance critical mechanism, bearing zero
runtime performance overhead.

Note

In practice, the performance-critical parts described here are coded in C.

The first step is therefore to infer, at call-time, a Numba type for each
of the function’s concrete arguments. Given the finer granularity of
Numba types compared to Python types, one cannot simply lookup an object’s
class and key a dictionary with it to obtain the corresponding Numba type.

Instead, there is a machinery to inspect the object and, based on its
Python type, query various properties to infer the appropriate Numba
type. This can be more or less complex: for example, a Python int
argument will always infer to a Numba intp (a pointer-sized integer),
but a Python tuple argument can infer to multiple Numba types (depending
on the tuple’s size and the concrete type of each of its elements).

The Numba type system is high-level and written in pure Python; there is
a pure Python machinery, based on a generic function, to do said inference
(in numba.typing.typeof). That machinery is used for compile-time
inference, e.g. on constants. Unfortunately, it is too slow for run-time
value-based dispatching. It is only used as a fallback for rarely used
(or difficult to infer) types, and exhibits multiple-microsecond overhead.

The Numba type system is really too high-level to be manipulated efficiently
from C code. Therefore, the C dispatching layer uses another representation
based on integer typecodes. Each Numba type gets a unique integer typecode
when constructed; also, an interning system ensure no two instances of same
type are created. The dispatching layer is therefore able to eschew
the overhead of the Numba type system by working with simple integer
typecodes, amenable to well-known optimizations (fast hash tables, etc.).

The goal of the type resolution step becomes: infer a Numba typecode
for each of the function’s concrete arguments. Ideally, it doesn’t deal
with Numba types anymore...

While eschewing the abstraction and object-orientation overhead of the type
system, the integer typecodes still have the same conceptual complexity.
Therefore, an important technique to speed up inference is to first go
through checks for the most important types, and hard-code a fast resolution
for each of them.

Each of those fast paths ideally uses a hard-coded result value or a direct
table lookup after a few simple checks.

However, we can’t apply that technique to all argument types; there would
be an explosion of ad-hoc internal caches, and it would become difficult to
maintain. Besides, the recursive application of hard-coded fast paths
would not necessarily combine into a low overhead (in the nested tuple
case, for example).

For non-so-trivial types (imagine a tuple, or a Numpy datetime64 array,
for example), the hard-coded fast paths don’t match. Another mechanism
then kicks in, more generic.

The principle here is to examine each argument value, as the pure Python
machinery would do, and to describe its Numba type unambiguously. The
difference is that we don’t actually compute a Numba type. Instead, we
compute a simple bytestring, a low-level possible denotation of that
Numba type: a fingerprint. The fingerprint format is designed to be
short and extremely simple to compute from C code (in practice, it has
a bytecode-like format).

Once the fingerprint is computed, it is looked up in a cache mapping
fingerprints to typecodes. The cache is a hash table, and the lookup
is fast thanks to the fingerprints being generally very short (rarely
more than 20 bytes).

If the cache lookup fails, the typecode must first be computed using the
slow pure Python machinery. Luckily, this would only happen once: on
subsequent calls, the cached typecode would be returned for the given
fingerprint.

In rare cases, a fingerprint cannot be computed efficiently. This is
the case for some types which cannot be easily inspected from C: for
example cffi function pointers. Then, the slow Pure Python machinery
is invoked at each function call with such an argument.

Note

Two fingerprints may denote a single Numba type. This does not make
the mechanism incorrect; it only creates more cache entries.

At the previous step, an integer typecode has been determined for each
concrete argument to the JIT-compiled function. Now it remains to match
that concrete signature against each of the available specializations for
the function. There can be three outcomes:

There is a satisfying best match: the corresponding specialization
is then invoked (it will handle argument unboxing and other details).

There is a tie between two or more “best matches”: an exception is raised,
refusing to solve the ambiguity.

There is no satisfying match: a new specialization is compiled tailored
for the concrete argument types that were inferred.

The selection works by looping over all available specializations, and
computing the compatibility of each concrete argument type with the
corresponding type in the specialization’s intended signature. Specifically,
we are interested in:

Whether the concrete argument type is allowed to convert implicitly to
the specialization’s argument type;

There are five possible kinds of implicit conversion from a source type
to a destination type (note this is an asymmetric relationship):

exact match: the two types are identical; this is the ideal case,
since the specialization would behave exactly as intended;

same-kind promotion: the two types belong to the same “kind” (for
example int32 and int64 are two integer types), and the source
type can be converted losslessly to the destination type (e.g. from
int32 to int64, but not the reverse);

safe conversion: the two types belong to different kinds, but the
source type can be reasonably converted to the destination type
(e.g. from int32 to float64, but not the reverse);

unsafe conversion: a conversion is available from the source type
to the destination type, but it may lose precision, magnitude, or
another desirable quality.

no conversion: there is no correct or reasonably efficient way to
convert between the two types (for example between an int64 and a
datetime64, or a C-contiguous array and a Fortran-contiguous array).

When a specialization is examined, the latter two cases eliminate it from
the final choice: i.e. when at least one argument has no conversion or
only an unsafe conversion to the signature’s argument type.

Note

However, if the function is compiled with explicit signatures
in the jit() call (and therefore it is not allowed to compile
new specializations), unsafe conversion is allowed.

If a specialization is not eliminated by the rule above, it enters the
list of candidates for the final choice. Those candidates are ranked
by an ordered 4-uple of integers: (numberofunsafeconversions,numberofsafeconversions,numberofsame-kindpromotions,numberofexactmatches) (note the sum of the tuple’s elements is equal to the
number of arguments). The best match is then the #1 result in sorted
ascending order, thereby preferring exact matches over promotions,
promotions over safe conversions, safe conversions over unsafe conversions.

The above-described mechanism works on integer typecodes, not on Numba
types. It uses an internal hash table storing the possible conversion
kind for each pair of compatible types. The internal hash table is in part
built at startup (for built-in trivial types such as int32, int64
etc.), in part filled dynamically (for arbitrarily complex types such
as array types: for example to allow using a C-contiguous 2D array where
a function expects a non-contiguous 2D array).