Scope<br>-----<br><br>This idea affects the CPython ABI for extension modules. It has no impact on Python language syntax or on other Python implementations.<br><br>The Problem<br>-----------<br><br>Currently, CPython can be built with either UCS2 or UCS4 as its internal Unicode representation. The two builds are binary incompatible, but the distinction is not part of the platform name. Consequently, if one installs a binary egg (e.g., with easy_install), there's a good chance it was built for the other representation, and trying to use it produces an error such as the following:<br>
<br> undefined symbol: PyUnicodeUCS2_FromString<br><br>In Python 2, some extension modules can blissfully link to either ABI, as the problem only arises for modules that call a PyUnicode_* macro (which expands to calling either a PyUnicodeUCS2_* or PyUnicodeUCS4_* function). For Python 3, every extension type will need to call a PyUnicode_* macro, since __repr__ must return a Unicode object.<br>
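The renaming behind that error can be imitated without the real headers. The following is a minimal sketch, not CPython's actual source: DEMO_UNICODE_WIDE and every Demo*/demo_* name are hypothetical stand-ins. It shows how one spelling in the extension source turns into a build-specific symbol name, and why a module compiled for one build cannot be resolved against the other.

```c
#include <string.h>

/* Hypothetical build-time switch standing in for the interpreter's
 * UCS2/UCS4 configuration; 1 selects the wide (UCS4) build. */
#define DEMO_UNICODE_WIDE 1

/* The source-level name is a macro that expands to a build-specific
 * symbol, mirroring the PyUnicode_* renaming described above. */
#if DEMO_UNICODE_WIDE
#define DemoUnicode_FromString DemoUnicodeUCS4_FromString
#else
#define DemoUnicode_FromString DemoUnicodeUCS2_FromString
#endif

/* Two-level stringification so the macro is expanded before it is
 * turned into a string literal. */
#define DEMO_STR2(x) #x
#define DEMO_STR(x) DEMO_STR2(x)

/* Name of the symbol this translation unit would ask the linker for. */
const char *demo_linked_symbol(void)
{
    return DEMO_STR(DemoUnicode_FromString);
}

/* Returns 1 if an interpreter exporting `exported` would satisfy this
 * module's reference, 0 if the lookup would fail. */
int demo_resolves_against(const char *exported)
{
    return strcmp(demo_linked_symbol(), exported) == 0;
}
```

With the switch set to 1, the module asks for the UCS4 name, so demo_resolves_against("DemoUnicodeUCS2_FromString") returns 0. That is the analogue of the undefined-symbol error above.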
<br>This problem has been known since at least 2006, as seen in this thread from the distutils-sig:<br> <a href="http://markmail.org/message/bla5vrwlv3kn3n7e?q=thread:bla5vrwlv3kn3n7e">http://markmail.org/message/bla5vrwlv3kn3n7e?q=thread:bla5vrwlv3kn3n7e</a><br>
<br>In that thread, it was suggested that the Unicode representation become part of the platform name. That would require a change to distutils and/or setuptools, which has not happened and does not appear likely to happen in the near future. It would also mean that anyone who wants to provide binary eggs for common platforms would need to provide twice as many eggs.<br>
<br>Solution<br>--------<br><br>Get rid of the ABI difference for the 99% of extension modules that don't care about the internal representation of Unicode strings. From such a module's point of view, PyObject is opaque: the module manipulates Unicode strings entirely through PyUnicode_* function calls and never depends on the internal representation.<br><br>For example, PyUnicode_FromString has the following signature in the documentation:<br>
PyObject *PyUnicode_FromString(const char *u)<br>Currently, it's #ifdef'ed to either PyUnicodeUCS2_FromString or PyUnicodeUCS4_FromString. <br><br>Remove the macro and name the function PyUnicode_FromString regardless of which internal representation is being used. The vast majority of binary eggs will then work correctly on both UCS2 and UCS4 Pythons.<br>
<br>Functions that explicitly use Py_UNICODE or PyUnicodeObject as part of their signature will continue to be #ifdef'ed, so extension modules that *do* care about the internal representation will still generate a link error.<br>
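The proposal can be sketched as a self-contained mock. All names here (DemoObject, DemoUnicode_*, demo_unit, DEMO_UNICODE_WIDE, ext_text_length) are hypothetical and stand in for the real API. The conversion handles ASCII input only, which is enough to show that the extension-side code never sees the code-unit width.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* ---- Public header (hypothetical): all an extension module sees ---- */
typedef struct DemoObject DemoObject;               /* opaque handle */
DemoObject *DemoUnicode_FromString(const char *u);  /* same name on every build */
size_t DemoUnicode_GetLength(const DemoObject *s);
void DemoUnicode_Free(DemoObject *s);

/* ---- Extension code: compiles without knowing the code-unit width ---- */
size_t ext_text_length(const char *text)
{
    DemoObject *s = DemoUnicode_FromString(text);
    if (s == NULL)
        return 0;
    size_t n = DemoUnicode_GetLength(s);
    DemoUnicode_Free(s);
    return n;
}

/* ---- Interpreter side: width chosen at build time (assumed flag) ---- */
#define DEMO_UNICODE_WIDE 1
#if DEMO_UNICODE_WIDE
typedef uint32_t demo_unit;   /* UCS4-like storage */
#else
typedef uint16_t demo_unit;   /* UCS2-like storage */
#endif

struct DemoObject {
    size_t length;
    demo_unit *units;
};

DemoObject *DemoUnicode_FromString(const char *u)
{
    size_t n = strlen(u);
    DemoObject *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    /* Allocate at least one unit so an empty string still gets a buffer. */
    s->units = malloc((n ? n : 1) * sizeof *s->units);
    if (s->units == NULL) {
        free(s);
        return NULL;
    }
    for (size_t i = 0; i < n; i++)
        s->units[i] = (unsigned char)u[i];  /* ASCII-only widening */
    s->length = n;
    return s;
}

size_t DemoUnicode_GetLength(const DemoObject *s)
{
    return s->length;
}

void DemoUnicode_Free(DemoObject *s)
{
    if (s != NULL) {
        free(s->units);
        free(s);
    }
}
```

Flipping DEMO_UNICODE_WIDE changes only the interpreter half. ext_text_length and the symbol names it references stay the same, which is the property the proposal relies on. A function whose signature exposed demo_unit directly would be the analogue of the Py_UNICODE functions that keep their per-build names.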
<blockquote style="margin: 1.5em 0pt;">--<br>
Daniel Stutzbach, Ph.D.<br>
President, <a href="http://stutzbachenterprises.com">Stutzbach Enterprises, LLC</a>
</blockquote>