On Sat, Apr 14, 2012 at 2:00 PM, mark florisson
<markflorisson88 at gmail.com> wrote:
> On 14 April 2012 20:08, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
>> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
>>>>>> Travis Oliphant recently raised the issue on the NumPy list of what
>>> mechanisms to use to box native functions produced by his Numba so that
>>> SciPy functions can call them, e.g. (I'm making the numba part up):
>>>> <snip>
>>>> This thread is turning into one of those big ones...
>>>> But I think it is really worth it in the end; I'm getting excited about the
>> possibility down the road of importing functions using normal Python
>> mechanisms and still getting fast calls.
>>>> Anyway, to organize discussion I've tried to revamp the CEP and describe
>> both the intern-way and the strcmp-way.
>>>> The wiki appears to be down, so I'll post it below...
>>>> Dag
>>>> = CEP 1000: Convention for native dispatches through Python callables =
>>>> Many callable objects are simply wrappers around native code. This
>> holds for any Cython function, f2py functions, manually
>> written CPython extensions, Numba, etc.
>>>> Obviously, when native code calls other native code, it would be
>> nice to skip the significant cost of boxing and unboxing all the arguments.
>> Early binding at compile-time is only possible
>> between different Cython modules, not between all the tools
>> listed above.
>>>> [[enhancements/nativecall|CEP 523]] deals with Cython-specific aspects
>> (and is out-of-date w.r.t. this CEP); this CEP is intended to be about
>> a cross-project convention only. If successful, this CEP may be
>> proposed as a PEP in modified form.
>>>> Motivating example (looking a year or two into the future):
>>>> {{{
>> @numba
>> def f(x): return 2 * x
>>>> @cython.inline
>> def g(x : cython.double): return 3 * x
>>>> from fortranmod import h
>>>> print f(3)
>> print g(3)
>> print h(3)
>> print scipy.integrate.quad(f, 0.2, 3) # fast callback!
>> print scipy.integrate.quad(g, 0.2, 3) # fast callback!
>> print scipy.integrate.quad(h, 0.2, 3) # fast callback!
>>>> }}}
>>>> == The native-call slot ==
>>>> We need a ''fast'' way to probe whether a callable object supports
>> this CEP. Other mechanisms, such as an attribute in a dict, are too
>> slow for many purposes (quoting robertwb: "We're trying to get a 300ns
>> dispatch down to 10ns; you do not want a 50ns dict lookup"). (Obviously,
>> if you call a callable in a loop you can fetch the pointer outside
>> of the loop. But in particular if this becomes a language feature
>> in Cython it will be used in all sorts of places.)
>>>> So we hack another type slot into existing and future CPython
>> implementations in the following way: This CEP provides a C header
>> that for all Python versions defines a macro
>> {{{Py_TPFLAGS_UNOFFICIAL_EXTRAS}}} for a free bit in
>> {{{tp_flags}}} in the {{{PyTypeObject}}}.
>>>> If present, then we extend {{{PyTypeObject}}}
>> as follows:
>> {{{
>> typedef struct {
>> PyTypeObject tp_main;
>> size_t tp_unofficial_flags;
>> size_t tp_nativecall_offset;
>> } PyUnofficialTypeObject;
>> }}}
>>>> {{{tp_unofficial_flags}}} is unused and should be all 0 for the time
>> being, but can be used later to indicate features beyond this CEP.
>>>> If {{{tp_nativecall_offset != 0}}}, this CEP is supported, and
>> the information for doing a native dispatch on a callable {{{obj}}}
>> is located at
>> {{{
>> (char*)obj + ((PyUnofficialTypeObject*)obj->ob_type)->tp_nativecall_offset;
>> }}}
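To make the probing step concrete, here is a rough sketch of what a caller would do. Note this uses mock stand-ins for the CPython structs (real code would use Python.h plus the CEP's header), and the flag bit value is a placeholder:

```c
#include <stddef.h>

/* Mock stand-ins for the CPython structs -- illustration only. */
#define Py_TPFLAGS_UNOFFICIAL_EXTRAS (1UL << 22)  /* hypothetical free bit */

typedef struct {
    unsigned long tp_flags;           /* stands in for PyTypeObject */
} PyTypeObject_;

typedef struct {
    PyTypeObject_ tp_main;
    size_t tp_unofficial_flags;       /* all 0 for the time being */
    size_t tp_nativecall_offset;      /* 0 means "CEP not supported" */
} PyUnofficialTypeObject_;

typedef struct {
    PyTypeObject_ *ob_type;           /* stands in for PyObject */
} PyObject_;

/* Return the native-dispatch descriptor of obj, or NULL if the
 * callable's type does not support this CEP. */
static void *get_nativecall_table(PyObject_ *obj)
{
    PyTypeObject_ *tp = obj->ob_type;
    if (!(tp->tp_flags & Py_TPFLAGS_UNOFFICIAL_EXTRAS))
        return NULL;
    size_t off = ((PyUnofficialTypeObject_ *)tp)->tp_nativecall_offset;
    if (off == 0)
        return NULL;
    return (char *)obj + off;
}
```

So the probe is two cheap checks (flag bit, nonzero offset) plus one addition; no dict lookup anywhere.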
>>>> === GIL-less access ===
>>>> It is OK to access the native-call table without holding the GIL. This
>> should of course only be used to call functions that state in their
>> signature that they don't need the GIL.
>>>> This is important for JITted callables that would like to rewrite their
>> table as more specializations get added; if one needs to reallocate
>> the table, the old table must linger long enough that all
>> threads that are currently accessing it are done with it.
>>>> == Native dispatch descriptor ==
>>>> The final format for the descriptor is not agreed upon yet; this sums
>> up the major alternatives.
>>>> The descriptor should be a list of specializations/overloads, each
>> described by a function pointer and a signature specification
>> string, such as "id)i" for {{{int f(int, double)}}}.
>>>> The way it is stored must cater to two cases: first, when the caller
>> expects one or more hard-coded signatures:
>> {{{
>> if (obj has signature "id)i") {
>> call;
>> } else if (obj has signature "if)i") {
>> call with promoted second argument;
>> } else {
>> box all arguments;
>> PyObject_Call;
>> }
>> }}}
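In C that hard-coded case might look like the sketch below. The table layout (a NULL-terminated entry array) is only an assumption for illustration, since the descriptor format is exactly what is being decided here:

```c
#include <string.h>

/* Hypothetical descriptor layout: NULL-terminated (signature, funcptr)
 * entries.  The real format is the subject of this CEP. */
typedef struct {
    const char *signature;
    void *func_ptr;
} NativeCallEntry;

typedef int (*int_double_func)(int, double);   /* "id)i" */
typedef int (*int_float_func)(int, float);     /* "if)i" */

/* Example specialization a callee might export (illustrative). */
static int add_id(int a, double b) { return a + (int)b; }

/* Try the caller's preferred signatures in order; *matched reports
 * whether a native call happened (otherwise the caller would box the
 * arguments and fall back to PyObject_Call). */
static int call_with_int_double(NativeCallEntry *table,
                                int a, double b, int *matched)
{
    NativeCallEntry *e;
    *matched = 1;
    for (e = table; e->signature != NULL; e++)
        if (strcmp(e->signature, "id)i") == 0)
            return ((int_double_func)e->func_ptr)(a, b);
    for (e = table; e->signature != NULL; e++)
        if (strcmp(e->signature, "if)i") == 0)
            return ((int_float_func)e->func_ptr)(a, (float)b);  /* convert */
    *matched = 0;  /* caller boxes arguments here */
    return 0;
}
```

(The `strcmp` here is what both approaches below try to replace with something cheaper: a pointer/id comparison, or chunked 64-bit compares.)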
> There may be a lot of promotion/demotion combinations (you likely
> only want the former), especially for multiple arguments, so perhaps
> it makes sense to limit ourselves a bit. For instance, for numeric
> scalar argument types we could limit ourselves to long (and the
> unsigned counterparts), double and double complex. So char, short and
> int scalars will be promoted to long, float to double and float
> complex to double complex. Anything bigger, like long long, will be
> matched specifically. Promotions, and associated demotions if
> necessary in the callee, should be fairly cheap compared to checking
> all combinations or going through the Python layer.
True, though this could be a convention rather than a requirement of
the spec. Long vs. < long seems natural, but are there any systems
where (scalar) float still has an advantage over double?
Of course pointers like float* vs double* can't be promoted, so we
would still need this kind of type declaration.
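To sketch what that convention would mean in practice: the callee only exports specializations in terms of long/double (and complex), and the caller converts value-preservingly at the call site. The function names below are made up:

```c
/* Under the proposed convention a callee exports e.g. only "ld)d" and
 * the caller promotes narrower scalars before the call:
 * char/short/int -> long, float -> double.  Both conversions are
 * value-preserving, so no information is lost. */
typedef double (*ld_func)(long, double);

/* Hypothetical call site whose static types are (short, float). */
static double call_via_ld(ld_func f, short x, float y)
{
    return f((long)x, (double)y);  /* promote at the call site */
}

/* Example callee specialization (illustrative). */
static double scale_add(long a, double b) { return 2.0 * a + b; }
```

This keeps the number of exported specializations (and the number of hard-coded signature probes) small.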
>> The second is when a call stack is built dynamically while parsing the
>> string. Since this has higher overhead anyway, optimizing for the first
>> case makes sense.
>>>> === Approach 1: Interning/run-time allocated IDs ===
>>>>>> 1A: Let each overload have a struct
>> {{{
>> struct {
>> size_t signature_id;
>> char *signature;
>> void *func_ptr;
>> };
>> }}}
>> Within each process run, there is a 1:1 correspondence between
>> {{{signature}}} and {{{signature_id}}}. {{{signature_id}}} is allocated
>> by some central registry.
>>>> 1B: Intern the string instead:
>> {{{
>> struct {
>> char *signature; /* pointer must come from the central registry */
>> void *func_ptr;
>> };
>> }}}
>> However this is '''not''' trivial, since signature strings can
>> be allocated on the heap (e.g., a JIT would do this), so interned strings
>> must be memory managed and reference counted. This could be done by
>> having each object pass in the signature to the interning machinery
>> '''both''' when incref-ing and when decref-ing it.
>> Using Python {{{bytes}}} objects is another option.
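The payoff of interning (in either variant) is that matching degenerates to a pointer or id comparison. A sketch for the 1B layout, where the lookup loop and the registry's uniqueness guarantee are assumptions of the sketch:

```c
#include <stddef.h>

/* With interned signatures, the central registry guarantees exactly one
 * canonical char* per distinct signature string within a process, so a
 * match is pointer equality rather than strcmp. */
typedef struct {
    const char *signature;  /* pointer must come from the central registry */
    void *func_ptr;
} InternedEntry;

static void *find_specialization(const InternedEntry *table, size_t n,
                                 const char *interned_sig)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].signature == interned_sig)  /* pointer comparison */
            return table[i].func_ptr;
    return NULL;
}
```

A signature string with equal contents but a different (non-registry) pointer deliberately does not match; everything hinges on the registry handing out one pointer per signature.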
>>>> ==== Discussion ====
>>>> '''The cost of comparing a signature''': Comparing a global variable
>> (needle)
>> to a value that is guaranteed to already be in cache (candidate match).
>>>> '''Pros:'''
>>>> * Conceptually simple struct format.
>>>> '''Cons:'''
>>>> * Requires a registry for interning strings. This must be
>> "handshaked" between the implementors of this CEP (probably by
>> letting the first to get at {{{sys.modules["_nativecall"]}}} stick it
>> there), as we can't ship a common dependency library for this CEP.
>>>> === Approach 2: Efficient strcmp of verbatim signatures ===
>>>> The idea is to store the full signatures and the function pointers together
>> in the same memory area, but still have some structure to allow for quick
>> scanning through the list.
>>>> Each entry has the structure {{{[signature_string, funcptr]}}}
>> where:
>>>> * The signature string has variable length, but the length is
>> divisible by 8 bytes on all platforms. The {{{funcptr}}} is always
>> 8 bytes (it is padded on 32-bit systems).
>>>> * The total size of the entry should be divisible by 16 bytes (= the
>> signature data should be 8 bytes, or 24 bytes, or...)
>>>> * All but the first chunk of signature data should start with a
>> continuation character "-", i.e. a really long signature string
>> could be {{{"iiiidddd-iiidddd-iiidddd-)d"}}}. That is, a "-" is
>> inserted on all positions in the string divisible by 8, except the
>> first.
>>>> The point is that if you know a signature, you can quickly scan
>> through the binary blob for the signature in 128 bit increments,
>> without worrying about the variable size nature of each entry. The
>> rules above protect against spurious matches.
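A sketch of that scan, with one assumption beyond the CEP text: the table ends with an all-zero 8-byte chunk at a 16-byte-aligned position, and signatures are zero-padded to a multiple of 8 bytes (the padding scheme is also an assumption). Since entry sizes are multiples of 16 and signature data is 8, 24, ... bytes, every funcptr slot sits at an offset of 8 mod 16; stepping 16 bytes at a time therefore only ever lands on signature chunks, and a continuation chunk starts with "-" while a needle's first chunk does not:

```c
#include <stdint.h>
#include <string.h>

/* Scan the descriptor blob for sig (already in padded/chunked form,
 * sig_len a multiple of 8); return the matching funcptr or NULL. */
static void *scan_for_signature(const char *blob,
                                const char *sig, size_t sig_len)
{
    const char *p;
    for (p = blob; ; p += 16) {           /* 128-bit increments */
        uint64_t chunk;
        memcpy(&chunk, p, 8);
        if (chunk == 0)
            return NULL;                  /* end-of-table marker */
        if (memcmp(p, sig, sig_len) == 0) {
            void *fp;
            memcpy(&fp, p + sig_len, sizeof fp);  /* funcptr follows sig */
            return fp;
        }
    }
}
```

A caller with a hard-coded signature would typically inline this, so the needle chunks live in the instruction stream as 64-bit immediates.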
Note that these two approaches need not be mutually exclusive; a
cutoff could be established giving the best (and worst) of both.
>> ==== Optional: Encoding ====
>>>> The strcmp approach can be made efficient for larger signatures by
>> using a more efficient encoding than ASCII. E.g., an encoding could
>> use 4 bits for the 12 most common symbols and 8 bits
>> for 64 symbols (for a total of 76 symbols), of which some could be
>> letter combinations ("Zd", "T{"). This should be reasonably simple
>> to encode and decode.
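As an illustration only (the symbol tables below are invented; the CEP fixes neither the common-symbol set nor the page layout), such a scheme could assign 4-bit codes 0-11 to the most common symbols and use codes 12-15 as page selectors for a second nibble, giving 4 * 16 = 64 more symbols at 8 bits each, i.e. 12 + 64 = 76 in total:

```c
#include <stddef.h>

/* Invented symbol tables for illustration. */
static const char COMMON[12] = {
    'i', 'd', 'f', 'l', 'h', 'b', 'O', ')', '-', 'c', 's', 'q'
};
static const char PAGES[4][16] = {
    "ZTQLBHI{}[]&^#!",    /* pages hold 15 symbols each here; a real  */
    "0123456789ABCDE",    /* scheme would use all 16 slots per page   */
    "abcdefgjkmnoprt",    /* and reserve entries for letter combos    */
    "uvwxyz.,:;*/+=?"     /* like "Zd" or "T{"                        */
};

/* Read the i-th 4-bit code, high nibble first within each byte. */
static unsigned nibble(const unsigned char *enc, size_t i)
{
    return (i % 2 == 0) ? (enc[i / 2] >> 4) : (enc[i / 2] & 0xF);
}

/* Decode nnibbles of encoded signature into out (NUL-terminated);
 * returns the decoded length. */
static size_t decode_sig(const unsigned char *enc, size_t nnibbles, char *out)
{
    size_t i = 0, n = 0;
    while (i < nnibbles) {
        unsigned v = nibble(enc, i++);
        if (v < 12)
            out[n++] = COMMON[v];                        /* 4-bit symbol */
        else
            out[n++] = PAGES[v - 12][nibble(enc, i++)];  /* 8-bit symbol */
    }
    out[n] = '\0';
    return n;
}
```

With such an encoding a short signature like "id)i" fits in 2 bytes instead of 4, so more signatures fit in a single 8-byte chunk.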
>>>> The CEP should provide C routines in a header file to work with the
>> signatures. Callers that wish to parse the format string and build a
>> call stack on the fly should probably work with the encoded
>> representation.
>>>> ==== Discussion ====
>>>> '''The cost of comparing a signature''': For the vast majority of
>> functions, the cost is comparing a 64-bit number stored in the CPU
>> instruction stream (needle) to a value that is guaranteed to already
>> be in cache (candidate match).
>>>> '''Pros:'''
>>>> * Readability-wise, one can use the C switch statement to dispatch
>>>> * "Stateless data", for compiled code it does not require any
>> run-time initialization like interning does
>>>> * One less pointer-dereference in the common case of a short
>> signature
>>>> '''Cons:'''
>>>> * Long signatures will require more than 8 bytes to store and could
>> thus be more expensive than interned strings
>>>> * Format looks uglier in the form of literals in C source code
>>>>>> == Signature strings ==
>>>> Example: The function
>> {{{
>> int f(double x, float y);
>> }}}
>> would have the signature string {{{"df)i"}}} (or, to save space,
>> {{{"idf"}}}).
>>>> Fields would follow the PEP3118 extensions of the struct-module format
>> string, but with some modifications:
>>>> * The format should be canonical and fit for {{{strcmp}}}-like
>> comparison: No whitespace, no field names (TBD: what else?)
> I think alignment is also a troublemaker. Maybe we should allow '@'
> (which cannot appear in the canonical string but will be the default,
> that is, native size, alignment and byte order) and '^', unaligned
> native size and byte order (to be used for packed structs).
>>> * TBD: Information about GIL requirements (nogil, with gil?), how
>> exceptions are reported
>> Maybe that could be a separate list, to be consulted mostly for
> explicit casts (I think PyErr_Occurred() would be the default for
> non-object return types).
>>> * TBD: Support for Cython-specific constructs like memoryview slices
>> (so that arrays with strides and shape can be passed faster than
>> passing an {{{"O"}}}).
> Definitely, maybe something simple like M{1f}, for a 1D memoryview
> slice of floats.
It would certainly be useful to have special syntax for memory views
(after nailing down a well-defined ABI for them) and builtin types.
Being able to declare something as taking a
"sage.rings.integer.Integer" could also prove useful, but could result
in long (and prefix-sharing) signatures, favoring the
runtime-allocated ids.