On Sat, Apr 14, 2012 at 2:02 PM, Stefan Behnel <stefan_ml at behnel.de> wrote:
> Hi,
>> thanks for writing this up. Comments inline as I read through it.
>> Dag Sverre Seljebotn, 14.04.2012 21:08:
>> === GIL-less accesss ===
>>>> It is OK to access the native-call table without holding the GIL. This
>> should of course only be used to call functions that state in their
>> signature that they don't need the GIL.
>>>> This is important for JITted callables who would like to rewrite their
>> table as more specializations gets added; if one needs to reallocate
>> the table, the old table must linger along long enough that all
>> threads that are currently accessing it are done with it.
>> The problem here is that changing the table in the face of threaded access
> is very likely to introduce race conditions, and the average library out
> there won't know when all threads are done with it. I don't think later
> modification is a good idea.
I agree; a JIT that wants to do this can over-allocate.
>> == Native dispatch descriptor ==
>>>> The final format for the descriptor is not agreed upon yet; this sums
>> up the major alternatives.
>>>> The descriptor should be a list of specializations/overload
>> While overloaded signatures are great for the callee, they make things much
> more complicated for the caller. It's no longer just one signature that
> either matches or not. Especially when we allow more than one expected
> signature, then each of them has to be compared against all exported
> signatures.
>> We'll have to see what the runtime impact and the impact on the code
> complexity is, I guess.
The caller could choose to only check the first signature to avoid
complexity. I think, however, that overloaded signatures are
important, and even checking a dozen is cheaper than going through the
Python call. Fused types naturally lead to overloads as well.
>> each described by a function pointer and a signature specification
>> string, such as "id)i" for {{{int f(int, double)}}}.
>> How do we deal with object argument types? Do we care on the caller side?
> Functions might have alternative signatures that differ in the type of
> their object parameters. Or should we handle this inside of the caller and
> expect that it's something like a fused function with internal dispatch in
> that case?
>> Personally, I think there is not enough to gain from object parameters that
> we should handle it on the caller side. The callee can dispatch those if
> necessary.
I don't think we should prohibit the signature from being able to
declare arbitrary Cython types. Whether it proves useful is dependent
on the library, and is the library writer's choice.
> What about signatures that require an object when we have a C typed value?
>> What about signatures that require a C typed argument when we have an
> arbitrary object value in our call parameters?
When considering conversion, one gets into the sticky question of
finding the "best" overload. I'd be inclined to not do any conversion
in the caller. One can (should) export the object version of the
signature as well to avoid the slow Python call.
> We should also strip the "self" argument from the parameter list of
> methods. That's handled by the attribute lookup before even getting at the
> callable.
>>>> === Approach 1: Interning/run-time allocated IDs ===
>>>>>> 1A: Let each overload have a struct
>> {{{
>> struct {
>> size_t signature_id;
>> char *signature;
>> void *func_ptr;
>> };
>> }}}
>> Within each process run, there is a 1:1
>> mapping/relation
>>> between {{{signature}}} and
>> {{{signature_id}}}. {{{signature_id}}} is allocated by some central
>> registry.
>>>> 1B: Intern the string instead:
>> {{{
>> struct {
>> char *signature; /* pointer must come from the central registry */
>> void *func_ptr;
>> };
>> }}}
>> However this is '''not'' trivial, since signature strings can
>> be allocated on the heap (e.g., a JIT would do this), so interned strings
>> must be memory managed and reference counted.
>> Not necessarily, they are really short strings that could just live
> forever, stored efficiently by the registry in a series of larger memory
> blocks. It would take a while to fill up enough memory with those to become
> problematic. Finding an efficiently lookup scheme for them might become
> interesting at some point, but that would also take a while.
>> I don't expect real-world systems to have to deal with thousands of
> different runtime(!) discovered signatures during one interpreter lifetime.
>>>> ==== Discussion ====
>>>> '''The cost of comparing a signature''': Comparing a global variable (needle)
>> to a value that is guaranteed to already be in cache (candidate match)
>>>> '''Pros:'''
>>>> * Conceptually simple struct format.
>>>> '''Cons:'''
>>>> * Requires a registry for interning strings. This must be
>> "handshaked" between the implementors of this CEP (probably by
>> "first to get at {{{sys.modules["_nativecall"}}} sticks it there),
>> as we can't ship a common dependency library for this CEP.
>> ... which would eventually end up in the stdlib, but could equally well
> come from PyPI for now. I don't see a problem with that.
>> Using sys.modules (or another global store) instead of an explicit import
> allows for dependency injection, that's good.
It excludes (or makes it difficult) for non-Python libraries to participate.
>> === Approach 2: Efficient strcmp of verbatim signatures ===
>>>> The idea is to store the full signatures and the function pointers together
>> in the same memory area, but still have some structure to allow for quick
>> scanning through the list.
>>>> Each entry has the structure {{{[signature_string, funcptr]}}}
>> where:
>>>> * The signature string has variable length, but the length is
>> divisible by 8 bytes on all platforms. The {{{funcptr}}} is always
>> 8 bytes (it is padded on 32-bit systems).
>>>> * The total size of the entry should be divisible by 16 bytes (= the
>> signature data should be 8 bytes, or 24 bytes, or...)
>>>> * All but the first chunk of signature data should start with a
>> continuation character "-", i.e. a really long signature string
>> could be {{{"iiiidddd-iiidddd-iiidddd-)d"}}}. That is, a "-" is
>> inserted on all positions in the string divisible by 8, except the
>> first.
>>>> The point is that if you know a signature, you can quickly scan
>> through the binary blob for the signature in 128 bit increments,
>> without worrying about the variable size nature of each entry. The
>> rules above protects against spurious matches.
>> Sounds pretty fast to me. Absolutely worth trying. And if we store the
> signature we compare against in the same format, we won't have to parse the
> signature string as such, we can really just compare the numeric values.
> Assuming that's really fast, that would allow the callee to optimistically
> export additional signatures, e.g. with compatible subtypes or easily
> coercible types, ordered by the expected overhead of processing the
> arguments (and the expected probability of being called), so that the
> caller would automatically hit the fastest call path first when traversing
> the list from start to end. The number of possible signatures would
> obviously explode at some point...
>> Note that JITs could still be smart enough to avoid the traversal after a
> few loop iterations.
>> One problem: if any of the call parameters is a plain object type, identity
> matches may not work anymore because we won't know what signature to expect.
>>>> ==== Optional: Encoding ====
>>>> The strcmp approach can be made efficient for larger signatures by
>> using a more efficient encoding than ASCII. E.g., an encoding could
>> use 4 bits for the 12 most common symbols and 8 bits
>> for 64 symbols (for a total of 78 symbols), of which some could be
>> letter combinations ("Zd", "T{"). This should be reasonably simple
>> to encode and decode.
>>>> The CEP should provide C routines in a header file to work with the
>> signatures. Callers that wish to parse the format string and build a
>> call stack on the fly should probably work with the encoded
>> representation.
>> Huffman codes can be processed bitwise from start to end, that would work.
>> However, this would quickly die when we start adding arbitrary object
> types. That would require a global registry for user types again. A reason
> not to care about object types at the caller.
>> Also, how do we encode struct/union argument types?
>>>> ==== Discussion ====
>>>> '''The cost of comparing a signature''': For the vast majority of
>> functions, the cost is comparing a 64-bit number stored in the CPU
>> instruction stream (needle) to a value that is guaranteed to already
>> be in cache (candidate match).
>>>> '''Pros:'''
>>>> * Readability-wise, one can use the C switch statement to dispatch
>>>> * "Stateless data", for compiled code it does not require any
>> run-time initialization like interning does
>>>> * One less pointer-dereference in the common case of a short
>> signature
>>>> '''Cons:'''
>>>> * Long signatures will require more than 8 bytes to store and could
>> thus be more expensive than interned strings
>> We could also ignore trailing arguments and only dispatch based on a fixed
> number of first arguments. Callees with more arguments would then simply
> not export native signatures.
>>>> * Format looks uglier in the form of literals in C source code
>> They are not meant for reading, and we can always generate a comment with a
> spelled-out readable signature next to it.
>>>> == Signature strings ==
>>>> Example: The function
>> {{{
>> int f(double x, float y);
>> }}}
>> would have the signature string {{{"df)i"}}} (or, to save space, {{{"idf"}}}).
>>>> Fields would follow the PEP3118 extensions of the struct-module format
>> string, but with some modifications:
>>>> * The format should be canonical and fit for {{{strcmp}}}-like
>> comparison: No whitespace, no field names (TBD: what else?)
>>>> * TBD: Information about GIL requirements (nogil, with gil?), how
>> exceptions are reported
>> What about C++, including C++ exceptions?
>>>> * TBD: Support for Cython-specific constructs like memoryview slices
>> (so that arrays with strides and shape can be passed faster than
>> passing an {{{"O"}}}).
>> Is this really Cython specific or would a generic Py_buffer struct work?
>> Stefan
> _______________________________________________
> cython-devel mailing list
>cython-devel at python.org>http://mail.python.org/mailman/listinfo/cython-devel