A real gotcha! (well, it got me)

I recently built Cairo 1.12.16 as a Windows .lib file using Visual Studio. Somewhat of a painful process but it seems to work fine. One detail that caught me out (and took hours to track down) was that I did not set a critically important preprocessor setting: HAVE_FT_LOAD_SFNT_TABLE. This is important if you are using FreeType: Without setting HAVE_FT_LOAD_SFNT_TABLE Cairo uses a "fallback" process for embedding fonts, which is not ideal.

The C source files for a Windows build

Through trial-and-error I eventually reduced the C files I needed to the list below. So, the .lib file I built is a slightly cut-down build of Cairo but so far it seems to work OK, at least for what I need. You will also need to manually create a header file called cairo-features.h.

Introduction: From WEB to C, a bit of history/background

For some time I'd wanted to build TeX (the original Knuth version) from the WEB source code, but the relatively complex process to generate C from WEB meant it was one of those "tasks" I kept putting off. Well, back in early 2013 I finally decided to have a go and, eventually, I managed to create a Windows port/build of the Web2C executable and associated tools. Using those tools I was finally able to generate TeX.C from TeX.WEB and compile a working TeX executable. As part of that exercise I decided remove the kpathsea path-searching library from my build of TeX, replacing it with a simple recursive directory search – based, at the moment, on compile-time options (which I plan to make fully configurable – probably with a Lua-based config file).

Why am I doing this... ?

I ask myself this on many occasions... Having "ported" LuaTeX to a native Windows build, I already have a TeX-based system to explore via Visual Studio (and LuaTeX is written in clean C, no need of Web2C). I guess it's mainly curiosity but there is also the fact I can "tweak + explore" some parts of Knuthian TeX and rapidly and easily re-compile it – the C code base of Knuthian TeX is tiny fraction of LuaTeX and is thus far, far quicker to compile. I also don't want to risk doing something dumb and somehow wrecking my port/build of LuaTeX.

Poking around inside TeX.C

Although I have quite a collection of books on TeX, I've always found it really, really hard to understand how TeX – the language and program – actually works. So, for me, I find it much more instructive to watch how some bits of TeX actually work by stepping through the C code as TeX is executing – single-stepping via the Visual Studio interface. However, before attempting to do that I spent some time using regular expressions to "tidy up" the machine generated C code produced by Web2C – the raw C code (produced by Web2C) is almost impossible to read/follow. At present, the "tidied C code" is still far from "easily legible code", but it's gradually improving, especially as I copy/paste explanatory text from TeX.WEB into TeX.C. Many parts of TeX (algorithms) are truly fiendishly complex (line-breaking, hyphenation, math typesetting, etc...) so I doubt I'll spend too much time probing those inner depths. Whilst being in awe at the sophistication and complexity of the algorithms inside TeX, I do confess that, at times, the C code is, in places, somewhat spaghetti-like. For example, there is a significant number of global variables and some individual globals are used for more than 1 purpose. Additionally, there is extensive use of "goto" statements, causing the code to jump all over the place.

Some confusion starts to ease

Despite the difficulty in following the execution of TeX.C, it is nevertheless fascinating to watch TeX actually run: Parsing the input file, acting on catcode values, creating tokens, defining macros, building boxes, running the page-builder and shipping out pages. Although I'm only just starting to explore TeX via C code, it has, for me, started to lift some of the confusion surrounding the TeX language – even if I have barely scratched the surface of this truly extraordinary program.

A new series of posts...?

My plan is to write a series of short, but fairly frequent, posts based on some aspects of TeX's internals: To relate/use those internals to explain, with examples, some parts of the TeX language semantics. At least, in areas that I found tricky to understand and ones that, I hope, might be instructive/useful for others.

Regular expressions are part of many programmer's toolkit but they can be quite fiddly to get right. At the moment, I'm trying to "sanitize" the C code generated for TeX (via Web2C) by post-processing the TeX.c file to make the C source code far more readable. To do that I'm using the original definitions in TeX.WEB to generate C #define statements that I can use in TeX.c. For example, in TeX.WEB you see the following "WEB macros" related to entries in TeX's "equivalence table":

Not very readable but, of course, it is machine-generated C code so what would you expect. Through regular expressions I'm (slowly/carefully) replacing many raw C statements using #defines, such as the following:

As part of this work, I use two very useful tools for building and testing regular expressions: RegexBuddy and RegexMagic (the tools are compared/explained here). They help you build, test/develop regular expressions and support the syntax and options of many regular expression engines. Once you have a working regex, RegexBuddy and RegexMagic will generate code that allows you to use the regex in a language of your choice (many languages are supported), including C code to use the regex with PCRE – which is my favourite regex library. Again, this is not an advert for these tools, just some notes from someone who has found them to be extremely useful – and have saved me considerable amounts of time in building, testing/using powerful regular expressions with PCRE.

TeX uses the concept of "badness" as a measure of how much the glue in a box has to stretch or shrink. In the following C function, t is the difference between the total of the natural sizes (N) of the components in the box and the desired size of the box (d). So, t = N-d. If the total amount of glue available for stretching or shrinking is s, then the badness, according to the TeXbook, is $100(t/s)^3$ – note that t/s is also known as the glue-set-ratio (often denoted r). In reality, TeX uses an approximation to this calculation, as shown below – the C code is from the C output by Web2C.

Just a brief post, partly to record this for my own use. If you read the source code of TeX you will see references to a data structure called a memoryword. It is very carefully defined in the source file texmfmem.h, using various #ifdef blocks to account for endian-type and the "flavour" of TeX you are compiling. So, here is the memoryword, stripped to the very basics for my Windows-only build of TeX. On my machine, sizeof(memoryword) = 8 bytes – glueratio is defined as the type double (8 bytes) – TeX does use the type double for glue calculations. From section 109 of the TeX source code:

When TEX "packages" a list into a box, it needs to calculate the proportionality ratio by which the glue inside the box should stretch or shrink. This calculation does not affect TEX's decision making, so the precise details of rounding, etc., in the glue calculation are not of critical importance for the consistency of results on different computers.

Building on the work of porting LuaTeX to build on Windows I decided to explore adding HarfBuzz to provide Arabic shaping. The excellent HarfBuzz API lends itself to some interesting solutions so here's a quick post to show some early results.