Although I mentioned in my previous post that PyFriBidi isn’t the Swiss Army knife of Python language support, it’s still rather useful as far as RTL language support goes given the alternatives (i.e., none). Using it, though, requires that little step called “compiling”, which turns out to be far from a bundle of joy on Windows.

I ended up compiling it using Visual Studio 2003, in keeping with the cardinal rule of using the same compiler that built Python itself (2.5 in my case). Given the number of bug reports for MinGW/Cygwin, I don’t think either of those would have made this any easier. Instead of having you follow 500 steps, I’m just going to refer you to a build script that handles all of this. I need to compile this frequently enough that I didn’t want to follow a bunch of steps myself either.

The README file covers how to configure the various settings in the script. You are more than welcome to rewrite this as a Python script should a VBScript somehow insult your OSS vibe. If you rewrite it in Perl, I’ll sacrifice a kitten.

For those curious, the script does the following:

Cleans up any output files

Downloads FriBidi2 from CVS if not found

Sets up the output directories

Copies in the config.h and fribidi-config.h configuration files

Copies in the custom PyFriBidi source files

Fixes the pow references in packtab.c

Reverts to the 0.10.9 behavior of exporting the UTF-8 functions

Excludes the toupper definition for Windows (it’s part of the standard library)

Fixes the formatting of the export definition file to work with Visual Studio’s linker

Fixes the benchmark file to correctly obtain user time on Windows

Builds and runs the Unicode data generators

Builds the FriBidi library

Builds the FriBidi utility applications

Builds PyFriBidi

Runs the FriBidi tests

All of these changes to the source code should be forward-compatible, since the script checks the original file before applying them. I had to make enough changes to the PyFriBidi2 source code that I ended up including a patched version with the build script (see their tracker for the history here). Whenever you use this, I’d recommend running a diff between this copy and the latest trunk just to make sure the two haven’t diverged.
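For anyone taking up the Python rewrite offer, the check-before-patching idea could be sketched roughly like this. It is not the VBScript's actual logic; the function name and the sample strings are hypothetical, and it assumes each fix can be expressed as a plain text replacement.

```python
from pathlib import Path


def patch_if_present(path, old, new):
    """Replace `old` with `new` only if the file still contains `old`.

    Returns True when the file was rewritten, False when it was left alone.
    The `new in text` guard is a crude way to avoid patching twice when the
    replacement text happens to contain the original text.
    """
    source = Path(path)
    text = source.read_text(encoding="utf-8")
    if old not in text or new in text:
        return False
    source.write_text(text.replace(old, new), encoding="utf-8")
    return True
```

Running it a second time against an already-patched file is a no-op, which is what lets the same script run against a fresh checkout or a previously built tree.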

Part of working for a language services company means you get very familiar with the Unicode standard. Intimately so, to the point of awkward night-after calls. Case in point: Arabic shaping. The basic gist is that, because Arabic script is cursive, the appearance of any specific character depends on how it joins to its neighbors. For the eye-dilating details, refer to Section 8.2 of the Unicode Standard 5.1.0.

One aspect of our system uses ReportLab to create PDFs in any of the languages we support. Given that Middle Eastern languages are all the rage right now (I’ll let you figure out why that might be), it didn’t take long for us to hit right-to-left (RTL) languages and, in particular, Arabic. Python’s support for RTL languages, outside of handling the Unicode, is essentially nonexistent. Since we know each passage’s language, a very crude approach is simply reversing the string with text[::-1]. That, however, doesn’t get you anywhere with Arabic shaping. It also makes for interesting words whenever one of these RTL passages includes an English proper noun.
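To make the proper-noun problem concrete, here is a minimal sketch of the crude approach. The function name and the sample string (a Hebrew word followed by an English name) are just for illustration.

```python
def naive_visual_order(text):
    """Reverse the whole string, ignoring each character's bidi class."""
    return text[::-1]


# Hebrew "shalom" followed by an English proper noun, in logical order.
logical = "\u05e9\u05dc\u05d5\u05dd Python"
visual = naive_visual_order(logical)

# The Latin word gets reversed along with everything else.
print(visual.split()[0])  # nohtyP
```

A real bidi implementation reverses only the RTL runs and leaves the embedded left-to-right run intact.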

In comes FriBidi and its Python offspring, PyFriBidi. Along with providing UAX #9-compliant RTL handling, the latest version of FriBidi also includes legacy Arabic shaping. “Legacy” is key here, though. If you’re going to be dealing with any Arabic script-based languages, it’s worth understanding why that is.

When displaying Arabic text, a rendering engine essentially goes through two steps. The first is to correctly order the letters based on the RTL properties of each Unicode character. This can be done without knowledge of the target font since it’s purely an interpretation of Unicode data. The next step is “shaping”, which, as previously mentioned, involves selecting the appropriate visual representation of a character based on its joining properties, neighboring characters, and ligatures. This is where the font matters since there is no separate Unicode character for each combination. The only information a Unicode character provides is what types of joining it supports (right, left, dual, and none). Beyond that, glyph selection is based on OpenType information provided by the font. Per the Unicode Standard, a font must provide a minimum number of glyph combinations if it supports an Arabic code point.
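The font-independent first step can be poked at from Python's standard library, which exposes each character's bidirectional class. This is only a small illustration (the helper name is mine); note that the standard unicodedata module does not expose the joining types used by the shaping step.

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# BEH, ALEF, a space, Latin A, and the digit 1.
sample = "\u0628\u0627 A1"
print(bidi_classes(sample))  # ['AL', 'AL', 'WS', 'L', 'EN']
```

These classes are the input to the reordering rules in UAX #9; nothing here depends on a font.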

FriBidi, and PyFriBidi by extension, performs Arabic shaping with no knowledge of the target font. What it’s actually doing is replacing the original Arabic characters with code points from the legacy Arabic Presentation Forms A & B Unicode blocks, which contain a set of these glyph combinations to support older systems and applications that can’t select them during rendering. The Unicode data files provide the information necessary to map from a base Arabic character to one of these presentation forms based on the desired joins and ligatures.
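As a rough illustration of that mapping (not FriBidi's actual implementation), the same information can be recovered from the compatibility decompositions in Python's unicodedata module. The function name and table layout are my own, and ligatures are deliberately skipped to keep the sketch short.

```python
import unicodedata
from collections import defaultdict

FORM_TAGS = ("<isolated>", "<final>", "<initial>", "<medial>")
# Arabic Presentation Forms-A and Forms-B.
BLOCKS = (range(0xFB50, 0xFE00), range(0xFE70, 0xFF00))


def build_presentation_map():
    """Map base letter -> {form name: presentation-form character}."""
    table = defaultdict(dict)
    for block in BLOCKS:
        for cp in block:
            parts = unicodedata.decomposition(chr(cp)).split()
            # Keep single-letter forms; ligatures decompose to 2+ code points.
            if len(parts) == 2 and parts[0] in FORM_TAGS:
                base = chr(int(parts[1], 16))
                table[base][parts[0].strip("<>")] = chr(cp)
    return table


forms = build_presentation_map()
# BEH joins on both sides, so it has all four forms.
print({name: f"U+{ord(ch):04X}" for name, ch in forms["\u0628"].items()})
```

Picking which of the four forms to use for a given letter still requires looking at the joining types of its neighbors, which this sketch does not attempt.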

This works pretty well for Arabic, the language. Problems arise with the numerous other languages that use the Arabic script but have joins that aren’t covered by these presentation form blocks. As a result, FriBidi leaves the original characters as is with these, and the final text doesn’t have the appropriate joins.
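One way to spot text that will hit this gap is to check which letters have no single-letter presentation form at all. The sketch below does that with the Unicode data bundled with Python; the function names are hypothetical, it only looks at the Arabic and Arabic Supplement blocks, and the Pashto letter in the example is my own choice of test case.

```python
import unicodedata

FORM_TAGS = ("<isolated>", "<final>", "<initial>", "<medial>")


def covered_bases():
    """Collect base letters that have at least one presentation form."""
    bases = set()
    for cp in list(range(0xFB50, 0xFE00)) + list(range(0xFE70, 0xFF00)):
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[0] in FORM_TAGS:
            bases.add(chr(int(parts[1], 16)))
    return bases


def letters_without_forms(text):
    """List Arabic-script letters in text with no presentation form."""
    covered = covered_bases()
    missing = []
    for ch in text:
        in_arabic = 0x0600 <= ord(ch) <= 0x06FF or 0x0750 <= ord(ch) <= 0x077F
        if in_arabic and unicodedata.category(ch) == "Lo" and ch not in covered:
            missing.append(ch)
    return missing


# BEH (Arabic) followed by TEH WITH RING (Pashto).
print([f"U+{ord(ch):04X}" for ch in letters_without_forms("\u0628\u067c")])
```

Any letter this reports would be left in its base form by a presentation-form approach, so the joins have to come from the font at render time.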

There are a few libraries with Python bindings that can handle this properly when it comes to rendering to the screen. But for integrating into a ReportLab workflow, I’m coming up empty-handed on the Python end. The most promising lead so far is IBM’s ICU Project, partly available via PyICU. We’re already using this to provide Unicode-compliant line wrapping (ever tried line-wrapping space-deficient Thai?), but its glyph selection-related portions aren’t yet available via the Python bindings.

I’d be interested in hearing how others have dealt with this. Given Python’s age, I can’t imagine we’re the first to try using it with these trickier scripts.