Summary

This is not a technical article on using TeX (i.e., TeX installation or programming). Instead, it offers some background information for people who work in STM (scientific, technical and medical) publishing and aims to provide an easy-to-follow explanation by addressing the question “what is TeX?” and, hopefully, to demystify some confusing terminology. My objective is, quite simply, to offer an introduction to TeX-based software for new, or early-career, STM publishing staff, especially those working in production (print or digital). Just by way of a very brief bio, as in “am I qualified to write this”: I’m writing this piece based on my 20+ years of experience of STM publishing, having worked in senior editorial positions through to technical production and programming roles. In addition, over the last few years I have spent a great deal of time building and compiling practically every TeX engine from its original source code, together with creating my own custom TeX installation to explore the potential of production automation through modern TeX-based software.

Introduction

If you work in STM (scientific, technical and medical) publishing, especially within mathematics and physics, chances are that you’ve heard of something called “TeX” (usually pronounced “tech”). You might also have encountered, or read about, authors using tools called LaTeX, pdfTeX, pdfLaTeX, XeTeX, XeLaTeX, LuaTeX, LuaLaTeX etc. Unless you are a TeX user, or familiar with the peculiarities of the TeX ecosystem, you may be forgiven for feeling somewhat confused as to what those terms actually mean. If you are considering working in STM publishing and have never heard of TeX, then I should just note that it is software which excels at typesetting advanced mathematics and is widely used by mathematicians, physicists and computer scientists to write and prepare their journal articles, books, PhD theses and so forth. TeX’s roots date back to the late 1970s but over the intervening decades new versions have evolved to provide considerable enhancements and additional functionality. Those new to STM publishing, or considering it as a career, may be surprised to learn that a piece of software dating back to the late 1970s is still in widespread use by technical authors and in publishing workflows.

NOTE: TeX is not just for mathematics. It is a common misconception that the use of TeX is restricted to scientific and technical disciplines—typesetting of complex mathematics. Whilst it finds most users in those domains, TeX is widely used for the production of non-mathematical content. In addition to typesetting mathematics, modern TeX engines (XeTeX and LuaTeX) provide exquisite handling of typeset text, support for OpenType font technologies, Unicode support, OpenType math fonts (as pioneered by Microsoft Word), multilingual typesetting (including Arabic and other complex scripts) and output directly to PDF. LuaTeX, in particular, is incredibly powerful because it also has the Lua scripting language built into its typesetting engine, offering (for example) almost unlimited scope for the automated production/typesetting of highly complex or bespoke documentation, books and so forth. LuaTeX also provides you with the ability to write plugins to extend its capabilities. Those plugins are usually written in C/C++ to perform specialist tasks—for example: graphics processing, parsing XML, specialist text manipulation, on-the-fly database queries or, indeed, pretty much anything you might need to do as part of your document production processes. If you don’t want the complexities of writing plugins, chances are you can simply use the Lua scripting language to perform many of your more complex processing tasks.

Irrespective of the tools used by authors to write/prepare their work, the lingua franca of today’s digital publishing workflows—especially journals—is XML, which is generated from the collection of text and graphics files submitted by authors. Most publishers now outsource the generation of XML to offshore companies usually based in countries such as India, China or the Philippines. Many production staff usually do not have to worry (too much) about the messy details of conversion—provided the XML passes quality control procedures and is a correct and faithful representation of the authors’ work. The future is, of course, online authorship platforms which remove the need for this expensive conversion of authors’ work into XML—but we’re still some way from that being standard practice: old habits die hard, so Microsoft Word and TeX will be around for some time, as will the need for conversion into XML.

And so to TeX: A brief history in time

My all-time favourite quote comes from the American historian Daniel J. Boorstin who once noted that:

“Trying to plan for the future without a sense of the past is like trying to plant cut flowers.”

In keeping with the ethos of that quote I’ll start with a very brief history of TeX.

On 30 March 1977 the diary of Professor Donald Knuth, a computer scientist at Stanford University, recorded the following note:

“Galley proofs for vol. 2 finally arrive, they look awful (typographically)… I decide I have to solve the problem myself”.

That small entry in Professor Knuth’s diary was the catalyst for a programming journey which lasted several years and the outcome of that epic project was a piece of typesetting software capable of producing exquisitely typeset mathematics and, of course, text: that program was called TeX. Along the way, Knuth, and his colleagues, designed new and sophisticated algorithms to solve some very complex typesetting problems, including automatic line breaking, hyphenation and, of course, mathematical typesetting. As part of the development, Knuth needed fonts to use with his typesetting software so he also developed his own font technology called METAFONT, although we won’t discuss that in any detail here.

To cut short a very long story, TeX proved to be a huge success—in no small part because Knuth took the decision to make TeX’s source code (i.e., program code) freely available, meaning that it could be built/ported, for free, to work on a wide range of computer systems. TeX enabled mathematicians, physicists, computer scientists and authors from many other technical disciplines to have exquisite control over typesetting their own work, producing beautifully typeset material containing highly complex mathematical content. Authors could use TeX to write and prepare their books and papers, and submit their “TeX code” to publishers—usually assured of a greater degree of certainty that their final proofs would not suffer the same fate as Knuth’s.

TeX: Knuth maintains his version, but others have evolved

Even today, nearly 4 decades after that fateful genesis of TeX, Professor Knuth continues to make periodic bug fixes to the master source code of his version of TeX, which is archived at ftp://ftp.cs.stanford.edu/tex/ and available from other sources, such as CTAN (Comprehensive TeX Archive Network). Those updates take place every few years with the latest being “The TeX tuneup of 2014” as reported in the journal TUGboat 35:1, 2014. During those “tuneups” Knuth does not add any new features to TeX: they really are just bug fixes. In the 1980s Knuth decided that in the interest of achieving long-term stability he would freeze the development of TeX; i.e., that no new features would be added to his version of TeX. I specifically mentioned “his version of TeX” because Knuth did not exclude or prevent others from using his code to create “new versions of TeX” which have additional features and functionality. Those “new versions” are usually given names to indicate that whilst they are based on Knuth’s original they have additional functionality, hence the addition of prefixes to give program names such as pdfTeX, XeTeX and LuaTeX.

Huh, what about LaTeX? At this point you might be wondering why I have not mentioned LaTeX, and it is a good question. Just to jump ahead slightly, the reason I am not mentioning LaTeX (at this point) is because LaTeX is not a version of the executable TeX typesetting program: it is a collection of TeX macros, a topic which I will discuss in more detail below.

At this point, I’ll just use the term “TeX” (in quotes) to refer to Knuth’s original version and all its later descendants (pdfTeX, XeTeX, LuaTeX).

So, what does “TeX” actually do?

As noted, “TeX” is a typesetting program—but if you have formed a mental image of a graphical user interface (GUI), such as Adobe InDesign, then think again. At the time of TeX’s genesis, in the late 1970s, today’s sophisticated graphical interfaces and operating systems were still some way into the future and TeX’s modus operandi still reflects its heritage—even for the new modern variants of TeX. Those accustomed to using modern page layout applications, such as Adobe InDesign, may be surprised to see how TeX works. Suppose someone gives you a copy of a “TeX” executable program and you want to use it to do something, how do you do that? “TeX” uses a so-called command-line interface: it has no fancy graphical screen into which you type your text to be typeset or point, click, tap to set options or configurations. If you run the “TeX” program you see a simple screen with a blinking cursor. Just by way of example, here’s the screen I see when I run LuaTeX (luatex.exe on Windows):

Clearly, if you want a piece of software to typeset something, you will need to provide some form of input (material to typeset) in order to get some form of output (your typeset material). Your input to the typesetting program will not only need to contain the material to be typeset but will also require some instructions to tell a typesetting program which fonts to use, the page size and a myriad of other details controlling the appearance of the typeset results. To typeset anything with “TeX” you provide it with an input text file containing your to-be-typeset material interspersed with “typesetting instructions” telling “TeX” how to format/typeset the material you have provided: i.e., what you want it to achieve. And here is where “TeX” achieves its legendary power and flexibility. The “typesetting instructions” that control “TeX’s” typesetting process are written using a very powerful programming language, one that Professor Knuth designed specifically to provide users with enormous flexibility and detailed control of “TeX’s” typesetting capabilities. So we can now start to see that “TeX” is, in fact, a piece of typesetting software that users can direct and control by providing it with instructions written in a programming language. You should think of “TeX” as an executable program (“typesetting engine”) which understands the TeX typesetting language.

A tiny example

Just to make it clear, here is a tiny example of some input to “TeX”—please do not worry about the meaning of the strange-looking markup (“TeX” commands that start with a “\”). The purpose here is simply to show you what input to “TeX” looks like:
$$\left| 4 x^3 + \left( x + {42 \over 1+x^4} \right) \right|.$$

And here is the output (as displayed in this WordPress blog using the MathJax-LaTeX plugin):

\[\left| 4 x^3 + \left( x + {42 \over 1+x^4} \right) \right|.\]

So, in order to produce your magnum opus you would write and prepare a text file containing your material interspersed with “TeX” commands and save that to a file called, say, myopus.tex and then tell your “TeX” engine to process that file. If all goes well, and there are no bugs in your “TeX” code (i.e., “TeX” programming instructions) then you should get an output myopus.pdf containing a beautifully typeset version of your work. I have, of course, omitted quite some detail here because, as I said at the start, this is not an article about running/using “TeX”.

“TeX” the program (typesetting “engine”) and “TeX” the typesetting language

So, the word “TeX” refers both to an executable program (the “TeX” typesetting engine) and the set of typesetting instructions that the engine can process: instructions written in the “TeX” language. Understanding that the executable “TeX” engine is programmable is central to truly appreciating the differences between LaTeX, pdfTeX, pdfLaTeX, XeTeX, LuaTeX and so forth.

Each “TeX” engine (program) understands hundreds of so-called primitive commands. Primitive in this sense does not mean “simple” or “unsophisticated”, it means that they are the fundamental building blocks of the TeX language. A simple, though not wholly accurate, analogy is the alphabet of a particular language: the individual characters of the alphabet cannot be reduced to simpler entities; they are the fundamental building blocks from which words, sentences etc are constructed.

And finally: from TeX to pdfTeX, XeTeX and LuaTeX

Just to recap. When Knuth wrote the original version of “TeX” he defined it to have the features and capabilities that he thought were sufficient to meet the needs of sophisticated text and mathematical typesetting based, of course, on the technology environment of that time—including processing and memory of available computers, font technologies and output devices. Knuth’s specification of “TeX” included its internal/programming design (“TeX’s” typesetting algorithms) and, of course, defining the “TeX” language that people can use to “mark up” the material to be typeset. What I mean by “defining the TeX language” is defining the set of several hundred primitive commands that the “TeX” engine can understand, and the action taken by the “TeX” engine whenever it encounters one of those primitives during the processing of your input text.

Naturally, technology environments evolve: computers become faster and have more storage/memory, new font technologies are released (Type 1, TrueType, OpenType), file output formats evolve (e.g., the move from PostScript to PDF) and Unicode became the dominant way to encode text. Naturally, “TeX” users wanted those new technologies to be supported by “TeX”—in addition to incorporating ideas for, and improvements to, the existing features and capabilities of Knuth’s original TeX program. As noted earlier, in the 1980s Knuth decided to freeze his development of TeX: no more new features in his version—bug fixes only. With the genuine need to update/modernize Knuth’s original software, TeX programming experts have taken Knuth’s original source code and enhanced it to add new features and provide support for modern typesetting technologies. The four-decade history of TeX’s evolution is quite complex but if you really want the full story then read this article by Frank Mittelbach: TUGboat, Volume 34 (2013), No. 1.

These new versions of TeX not only provide additional features (e.g., outputting direct to PDF, supporting OpenType fonts) they also extend and adapt the TeX language too: by adding new primitives to Knuth’s original set, thus providing users with greater programming power and flexibility to control the actions of the typesetting engine. Each new TeX engine is given its own name to distinguish it from Knuth’s original software: hence you now have pdfTeX, XeTeX and LuaTeX. These three TeX engines are not 100% compatible with each other and it is quite easy to prepare input that can be processed with one TeX engine but fail to work with others—simply because a particular TeX engine may support primitive commands that the others do not. But all is not lost: enter the world of TeX macros!

Primitives are not the whole story: macros and LaTeX

I have mentioned that each TeX engine supports a particular set of low-level commands called primitives, but this is not the full story. Of course, many of the same primitives are supported by all engines but some are specific to a particular engine. “TeX” achieves its true power and sophistication through so-called TeX macros. The primitive commands of an engine’s TeX language can be combined together to define new commands, or macros, built from low-level primitive instructions and/or other macros. TeX macros allow you to define new commands that are capable of performing complex typesetting operations, saving a great deal of time and typing, and avoiding many programming errors. In addition, TeX engines provide primitives that you can use to detect which TeX engine is being used to typeset a document, so that a TeX engine can, on-the-fly, adapt its behaviour depending on whether or not it supports a particular primitive it might encounter. If a particular primitive is not supported directly but can be “mimicked” (using combinations of other primitives) then all is usually well, but if the chosen TeX engine really cannot cope with a particular primitive then typesetting will fail and an error will be reported.

The TeX language is, after all, a programming language—albeit one designed to solve typesetting problems; but as a programming language TeX is extremely arcane and works very differently to most programming languages you are likely to encounter today.

So, finally, what is LaTeX?

We’ve talked about various versions of the TeX engine, from Knuth's original TeX to its descendants pdfTeX, XeTeX and LuaTeX, and briefly discussed TeX as a typesetting language: primitives, programming and the ability to write macros. Finally, we are in a position to discuss LaTeX. The logical extension to writing individual TeX macros for some specific task you want to solve, as an individual, is to prepare a collection of macros that others can also use: a package of macros that collectively provide some useful tools and commands that others can benefit from. And that is precisely what LaTeX is: it is a very large collection of complex and sophisticated macros designed to help you typeset books, journal papers and so forth. It provides a wealth of features to control things like page layout, fonts and a myriad of other typesetting details. Not only that but LaTeX was designed to be extensible: you can plug in additional, more specialist, macro packages written to solve specific typesetting problems, e.g., producing nicely typeset tables or typesetting particularly complex forms of mathematics. If you visit the Comprehensive TeX Archive Network you can choose from hundreds, if not thousands, of macro packages that have been written and contributed by users worldwide.

So, if someone says they are typesetting their work with LaTeX then they are only telling you part of the story. What they really mean is that they are using the LaTeX macro package with a particular TeX engine—usually pdfTeX but maybe XeTeX (for multilingual work) or LuaTeX (perhaps for advanced customized document production). Sometimes you will see terms such as pdfLaTeX, XeLaTeX or even LuaLaTeX: but these are not actually the names of TeX engines, all they signify is which TeX engine is being used to run LaTeX. For example, if someone says I am “using pdfLaTeX” what that really means is “I am preparing my typeset documents using the LaTeX macro package and processing it with the pdfTeX engine”. Equally, if anyone says to you that they are “using TeX” then, I hope, you now see that statement does not actually tell you the whole story.

Introduction: From WEB to C, a bit of history/background

For some time I'd wanted to build TeX (the original Knuth version) from the WEB source code, but the relatively complex process to generate C from WEB meant it was one of those "tasks" I kept putting off. Well, back in early 2013 I finally decided to have a go and, eventually, I managed to create a Windows port/build of the Web2C executable and associated tools. Using those tools I was finally able to generate TeX.C from TeX.WEB and compile a working TeX executable. As part of that exercise I decided to remove the kpathsea path-searching library from my build of TeX, replacing it with a simple recursive directory search – based, at the moment, on compile-time options (which I plan to make fully configurable – probably with a Lua-based config file).
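To give a flavour of what "a simple recursive directory search" can mean, here is a minimal sketch in C. It is not the code from my build: it assumes a POSIX system (opendir/readdir/stat) rather than the Windows API, the function name find_file is hypothetical, and it makes no attempt to guard against symbolic-link cycles.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Search dir and all of its subdirectories for a file called name.
   On success, copy the full path into out (capacity outlen) and return 1;
   otherwise return 0. Hypothetical helper, for illustration only. */
int find_file(const char *dir, const char *name, char *out, size_t outlen)
{
    DIR *d = opendir(dir);
    struct dirent *e;
    int found = 0;

    if (d == NULL)
        return 0;                       /* unreadable or missing directory */

    while (!found && (e = readdir(d)) != NULL) {
        char path[4096];
        struct stat st;

        /* Skip the "." and ".." entries to avoid infinite recursion. */
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        /* Build dir/entry; skip the entry if the path would be truncated. */
        if (snprintf(path, sizeof path, "%s/%s", dir, e->d_name) >= (int) sizeof path)
            continue;
        if (stat(path, &st) != 0)
            continue;

        if (S_ISDIR(st.st_mode)) {
            found = find_file(path, name, out, outlen);   /* descend */
        } else if (strcmp(e->d_name, name) == 0 && strlen(path) < outlen) {
            strcpy(out, path);
            found = 1;
        }
    }
    closedir(d);
    return found;
}
```

A call such as find_file(root, "plain.tex", buffer, sizeof buffer) walks the tree below root and stops at the first match. A real replacement for kpathsea would also need to deal with search order and file types, which this sketch ignores.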

Why am I doing this... ?

I ask myself this on many occasions... Having "ported" LuaTeX to a native Windows build, I already have a TeX-based system to explore via Visual Studio (and LuaTeX is written in clean C, no need of Web2C). I guess it's mainly curiosity but there is also the fact I can "tweak + explore" some parts of Knuthian TeX and rapidly and easily re-compile it – the C code base of Knuthian TeX is a tiny fraction of the size of LuaTeX's and is thus far, far quicker to compile. I also don't want to risk doing something dumb and somehow wrecking my port/build of LuaTeX.

Poking around inside TeX.C

Although I have quite a collection of books on TeX, I've always found it really, really hard to understand how TeX – the language and program – actually works. So, for me, I find it much more instructive to watch how some bits of TeX actually work by stepping through the C code as TeX is executing – single-stepping via the Visual Studio interface. However, before attempting to do that I spent some time using regular expressions to "tidy up" the machine generated C code produced by Web2C – the raw C code (produced by Web2C) is almost impossible to read/follow. At present, the "tidied C code" is still far from "easily legible code", but it's gradually improving, especially as I copy/paste explanatory text from TeX.WEB into TeX.C. Many parts of TeX (algorithms) are truly fiendishly complex (line-breaking, hyphenation, math typesetting, etc...) so I doubt I'll spend too much time probing those inner depths. Whilst being in awe at the sophistication and complexity of the algorithms inside TeX, I do confess that, at times, the C code is, in places, somewhat spaghetti-like. For example, there is a significant number of global variables and some individual globals are used for more than 1 purpose. Additionally, there is extensive use of "goto" statements, causing the code to jump all over the place.

Some confusion starts to ease

Despite the difficulty in following the execution of TeX.C, it is nevertheless fascinating to watch TeX actually run: Parsing the input file, acting on catcode values, creating tokens, defining macros, building boxes, running the page-builder and shipping out pages. Although I'm only just starting to explore TeX via C code, it has, for me, started to lift some of the confusion surrounding the TeX language – even if I have barely scratched the surface of this truly extraordinary program.

A new series of posts...?

My plan is to write a series of short, but fairly frequent, posts based on some aspects of TeX's internals: To relate/use those internals to explain, with examples, some parts of the TeX language semantics. At least, in areas that I found tricky to understand and ones that, I hope, might be instructive/useful for others.

TeX uses the concept of "badness" as a measure of how much the glue in a box has to stretch or shrink. In the following C function, t is the difference between the total of the natural sizes (N) of the components in the box and the desired size of the box (d). So, t = N-d. If the total amount of glue available for stretching or shrinking is s, then the badness, according to the TeXbook, is $100(t/s)^3$ – note that t/s is also known as the glue-set-ratio (often denoted r). In reality, TeX uses an approximation to this calculation, as shown below – the C code is from the C output by Web2C.
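A minimal sketch of that approximation follows. It is hand-written in C to follow the logic of the badness routine in TeX.WEB (section 108), and is not Web2C's actual output. The type name scaled, the constant name INF_BAD and the use of int32_t are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t scaled;      /* assumed: TeX's fixed-point type, 2^16 units per point */

#define INF_BAD 10000        /* TeX's "infinitely bad" value */

/* Approximate 100*(t/s)^3 using integer arithmetic only, given t >= 0. */
int32_t badness(scaled t, scaled s)
{
    int32_t r;               /* approximation to a*t/s, where a^3 is about 100*2^18 */

    assert(t >= 0);
    if (t == 0)
        return 0;            /* nothing to stretch or shrink: a perfect fit */
    if (s <= 0)
        return INF_BAD;      /* no glue available to absorb the difference */

    if (t <= 7230584)
        r = (t * 297) / s;   /* 297^3 is roughly 100*2^18; t*297 still fits in 31 bits */
    else if (s >= 1663497)
        r = t / (s / 297);   /* same ratio, rearranged to avoid overflow */
    else
        r = t;               /* t is huge relative to s: certain to exceed 1290 */

    if (r > 1290)
        return INF_BAD;      /* 1290^3 < 2^31 < 1291^3, so r*r*r below cannot overflow */
    return (r * r * r + 0400000) / 01000000;   /* r^3 / 2^18, rounded to nearest */
}
```

With this sketch, badness(65536, 65536) gives 100 (the glue is stretched by exactly its available stretch, so t/s = 1), and badness(32768, 65536) gives 12, close to the exact value 100 × 0.5³ = 12.5.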

Just a brief post, partly to record this for my own use. If you read the source code of TeX you will see references to a data structure called a memoryword. It is very carefully defined in the source file texmfmem.h, using various #ifdef blocks to account for endian-type and the "flavour" of TeX you are compiling. So, here is the memoryword, stripped to the very basics for my Windows-only build of TeX. On my machine, sizeof(memoryword) = 8 bytes – glueratio is defined as the type double (8 bytes) – TeX does use the type double for glue calculations. From section 109 of the TeX source code:

When TeX "packages" a list into a box, it needs to calculate the proportionality ratio by which the glue inside the box should stretch or shrink. This calculation does not affect TeX's decision making, so the precise details of rounding, etc., in the glue calculation are not of critical importance for the consistency of results on different computers.
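As a rough illustration of the idea only, the sketch below shows a word-sized union that offers several "views" of the same 8 bytes, one of which is a glue ratio stored as a double. This is not the definition from texmfmem.h: every type and field name here is hypothetical, and the widths chosen for the halfword and quarterword views are assumptions.

```c
#include <assert.h>
#include <stdint.h>

typedef double  glue_ratio_sketch;    /* glue ratio stored as a double, as noted above */
typedef int32_t halfword_sketch;      /* assumed width for this sketch */
typedef uint8_t quarterword_sketch;   /* assumed width for this sketch */

/* Hypothetical, simplified memory word: four views of the same storage. */
typedef union {
    glue_ratio_sketch gr;                                 /* a glue set ratio */
    struct { halfword_sketch lh, rh; } hh;                /* two halfwords */
    struct { quarterword_sketch b0, b1, b2, b3; } qqqq;   /* four quarterwords */
    struct { halfword_sketch junk; int32_t cint; } i;     /* one integer */
} memory_word_sketch;

/* Compute the ratio t/s in floating point and store it in a word. */
memory_word_sketch store_glue_ratio(int32_t t, int32_t s)
{
    memory_word_sketch w;

    assert(s != 0);
    w.gr = (double) t / (double) s;
    return w;
}
```

On a typical platform with 8-byte doubles, sizeof(memory_word_sketch) is 8, matching the size quoted above. The point of the union is that one array of such words can hold integers, pairs of pointers, small fields or glue ratios, depending on how each word is interpreted.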

Summary

This is a lengthy post which covers numerous topics on using fonts with TeX and DVIPS. It was fun to write and program but it certainly absorbed many hours of my evenings and weekends. In some areas I've had to omit some finer details because it would make the article way too long and I'd probably run out of steam and never finish it: think of it as a "getting started" tutorial. I hope it is useful and interesting. Now to get on with some of those household tasks I've put off whilst writing this – and thanks to my partner, Alison Tovey, who has waited patiently (well, almost :-)) whilst I was glued to WordPress!

Introduction

Modern TeX(-based) engines, such as XeTeX and LuaTeX, provide direct access to using OpenType fonts, albeit using different philosophies/methods. This post looks at just one way to use TrueType-flavoured OpenType fonts with the traditional TeX–DVIPS–PostScript–PDF workflow which is usually associated with the 8-bit world of Type 1 PostScript fonts. The idea is that we'll convert TrueType-flavoured OpenType fonts to Type 42 PostScript fonts and include the Type 42 font data into DVIPS's PostScript output stream using the DVIPS -h filename mechanism. In addition, we'll look at using font encoding and the creation of TeX Font Metrics to enable access to the rich set of glyphs in a modern TrueType-flavour OpenType font.

Many TrueType-flavoured OpenType fonts (and thus the resulting Type 42 PostScript fonts) contain hundreds, if not thousands, of glyphs – making the 8-bit world of the traditional PostScript Encoding Vector little more than a tiny window into the rich array of available glyphs. By re-encoding the base Type 42 font we can generate a range of 256-character fonts for TeX and DVIPS to exploit the full range of glyphs in the original TrueType font – such as a true small caps font if the TrueType font contains small caps glyphs.

We will also need to create the TeX Font Metrics (TFMs) so that TeX can access the metric data describing our fonts – the width, height, depth plus any kerning and ligatures we care to add. Of course, the virtual font mechanism is also a valid approach – see Virtual Fonts: More Fun for Grand Wizards for more details. Much of what we're doing here uses a number of freely available software tools to extract key data from the actual OpenType font files for onward processing into a form suitable for TeX.

Context of these experiments

Over the past few weeks I've spent some evenings and weekends building TeX and friends from WEB source code, using Microsoft's Visual Studio. At the moment, this all resides in a large Visual Studio project containing all the various applications and is a little "Heath Robinson", although it does work. Within each of my builds of TeX and friends I've replaced the venerable Kpathsea path/file-searching library with one of my own creation – which does a direct search using recursive directory traversal. I'm also toying with using a database-lookup approach, hence the appearance of SQLite in the list of C libraries within the screenshot.

Turning to Eddie Kohler's marvellous LCDF Typetools collection, I used MinGW/MSYS to build this. LCDF Typetools contains some incredibly useful tools for working with fonts via TeX/DVIPS – including ttftotype42 which can generate a Type 42 PostScript font from TrueType-flavoured OpenType fonts. You can think of a Type 42 font as a PostScript "wrapper" around the native TrueType font data, allowing you to insert TrueType fonts into PostScript code.

Characters, glyphs, glyph names, encodings and glyph IDs

Firstly, we need to review several interrelated topics: characters, glyphs, glyph names, encodings and glyph IDs (contained in OpenType fonts). Let's begin by thinking about characters. A character can be considered as the fundamental building block of a language: it is, if you like, an "atomic unit of communication" (spoken or not) which has a defined role and purpose: the character's meaning (semantics). Most characters usually need some form of visual representation; however, that visual representation may not be fixed: most characters of a human spoken/written language can be represented in different forms. For example, the character 'capital H' (H) can take on different visual appearances depending on the font you use to display it. Fonts come in different designs and each design of our 'capital H' is called a glyph: a specific visual design which is particular to the font used to represent the 'capital H'. Each character that a font is capable of displaying will have a glyph designed to represent it – not only that but you may have a fancy font that contains multiple representations for a particular character: small caps, italic, bold and so forth. Each of these variants uses a different glyph to represent the same character: they still represent the same fundamental "unit of meaning" (a character) just using different visual forms of expression (glyphs).

If we look around us we see, of course, that there are hundreds of languages in our world and if we break these languages down into their core units of expression/meaning we soon find that many thousands of characters are needed to "define" or encompass these languages. So, how do we go about listing these characters and, more to the point, communicating in these languages through e-mails, text files, printed documents and so forth? As humans we refer to characters by a name (e.g., 'capital H') but computers, obviously, deal with numbers. To communicate our characters by computer we need a way to allocate an agreed set of numbers to those characters so that we can store or transmit them electronically. And that's called the encoding. An encoding is simply an agreed set of numbers assigned to an agreed set of characters – so that we can store those numbers and know that our software will eventually display the correct glyphs to provide visual expression of our characters. To communicate using numbers to represent characters both sides have to agree on the encoding (mapping of numbers to characters) being used. If I save my text file (a bunch of numbers) and you open it up then your software must interpret those numbers in the same way I did when I wrote the text. Clearly, it's essential for encoding standards to exist and perhaps the best known is, of course, the Unicode standard which allocates a unique number to well over 100,000 characters (at present), with new characters being added from time to time as the Unicode standard is updated.

Let's take a closer look at fonts. We've seen that the job of a font is to provide the glyphs which represent a certain set of characters. Naturally, any particular font will only contain glyphs to represent a small subset of the world's characters: there are just too many for any single font to contain them all. We've also said that some fonts may contain multiple glyphs to represent the same character. Considering OpenType fonts for the moment, within each font the individual glyphs (designs representing a specific character) are each given a name and a numeric identifier, called the glyph identifier (also called the index or glyph ID). Each glyph is thus described by a (name, glyph ID) pair. It's really important to realise that the glyph ID has nothing to do with encoding of characters: it is just an internal bookkeeping number used within the font and assigned to each glyph by the font's creator. The numeric IDs assigned to a particular glyph are not defined by a global standard. Furthermore, the names given to glyphs also show a great deal of variation, although there are some attempts at standardizing them: see the Adobe Glyph List which aims to provide a standard naming convention.

Let's recap. We've seen that the fundamental "unit of communication" is the character and that characters are encoded by assigning each one to a number. We've also seen that fonts contain the designs, called glyphs, which represent the characters supported by the font. Internally, each (OpenType) font assigns every glyph an identifier (glyph ID) and a glyph name which may, or may not, be "standard".

So, the next question we need to think about is: given a text file containing characters represented (stored) according to a specific encoding (a set of numbers), how does any font actually know how to map from a certain character in the text file to the correct glyph to represent it? After all, the encoding in the text file is usually based on a standard but the data in our font (glyph IDs and glyph names) are not. Not surprisingly, there is indeed an extra bit of data inside the font which provides the glue: it is called the Encoding Vector in older PostScript fonts, or the character map (the cmap table) in the modern world of Unicode and OpenType fonts. The job of the Encoding Vector (or cmap table) is to provide the link between the standard world of encoded characters and the (relatively) non-standard inner font world of glyph IDs and glyph names.
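As a rough illustration of why this glue is needed, the Python sketch below models the character maps of two imaginary fonts as simple dictionaries. All of the glyph IDs are invented for the example; the only real-world convention I rely on is that glyph ID 0 is the "missing glyph" (.notdef) in TrueType and OpenType fonts.

```python
# Hypothetical character maps for two imaginary fonts: Unicode code point -> glyph ID.
# The glyph IDs are invented; real fonts each choose their own internal numbering.
font_a_cmap = {0x0048: 43, 0x0065: 72}   # H, e
font_b_cmap = {0x0048: 11, 0x0065: 40}   # same characters, different glyph IDs

def glyph_id_for(cmap, character):
    """Return the font-internal glyph ID for a character, or 0 (.notdef) if unmapped."""
    return cmap.get(ord(character), 0)

for name, cmap in (("font A", font_a_cmap), ("font B", font_b_cmap)):
    print(name, [glyph_id_for(cmap, ch) for ch in "He?"])
```

The input (the character codes) is the same for both fonts because it follows a standard; the glyph IDs that come out are private to each font.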

A sneak peek at GentiumPlus-R: 5586 glyphs in a single font

For the remainder of this post I'll use the free Gentium OpenType font (GentiumPlus-R) as an example because I do not want to inadvertently infringe any commercial licence conditions in the work below. To help solidify the ideas described above I generated a table of all the glyphs (plus glyph ID and glyph name) contained within the GentiumPlus-R TrueType-flavour OpenType font.

GentiumPlus-R glyph chart

Technical details: To generate these glyph tables I wrote a command-line utility (in C) which used the FreeType library to extract the low-level data from inside the OpenType font. This data was written out as a PostScript program which loops over all the glyphs: drawing each glyph together with its glyph ID and name. This PostScript program was combined with the GentiumPlus (TrueType) font after converting it to a Type 42 PostScript font using ttftotype42 compiled from the source code distributed as part of the wonderful LCDF Typetools collection.

PostScript Encoding Vectors

Let's recap on our objectives. We've explored the idea of glyphs, characters and encodings and seen that OpenType fonts can contain many thousands of glyphs to display thousands of characters. However, OpenType fonts can't easily be used within the traditional TeX–DVIPS–PostScript–PDF workflow: most traditional TeX workflows use 8-bit characters and Type 1 PostScript fonts. As yet, we've still not explained exactly how a character code is "mapped" to a specific glyph in a font. So, it's time to look at this, focussing on Type 1 and Type 42 PostScript fonts and ignoring OpenType fonts. The "magic glue" we need to explore is the so-called Encoding Vector present in Type 1 and Type 42 fonts. The job of the Encoding Vector is to map from character codes in the input to glyphs contained in the font. Let's look at an example to make this clearer. I'll assume that you have access to the ttftotype42 utility from the LCDF Typetools collection. If you don't have it, or can't compile it, contact me and I'll e-mail my compiled version to you.

Using ttftotype42

If you run ttftotype42 on a TrueType-flavour OpenType font it will generate a fairly large plain text file which you can inspect with any text editor, so let's do that. In these examples I'll use the free Gentium OpenType font.

If you download GentiumPlus and place the GentiumPlus-R.ttf file in the same directory as ttftotype42 and run

ttftotype42 GentiumPlus-R.ttf GentiumPlus.t42

you should generate a file GentiumPlus.t42 which is a little over 2MB in size – remember, the GentiumPlus font contains over 5,500 glyphs! Loosely speaking you can think of the Type 42 font generated by ttftotype42 as being made up from the following sections:

PostScript header

Encoding Vector

/sfnts glyph data array

/CharStrings dictionary

PostScript trailer

Download GentiumPlus.t42: I uploaded the Type 42 font file GentiumPlus.t42 created by ttftotype42 onto this site: you can download it here.

Here's an extract from the Type 42 font version of GentiumPlus-R.ttf with vast amounts of data snipped out for brevity:

The Encoding Vector is an array indexed by a number which runs from 0 to 255 and the value stored at each index position is the name of a glyph contained in the font. You have probably guessed that the index (0 to 255) is the numeric value of an input character. So, via the Encoding Vector with 256 potential character values as input, we can reach up to 256 individual glyphs contained in the font. (Note: I'm ignoring the PostScript glyphshow operator which allows access to any glyph if you know its name).

The full story (quoting from the Type 42 font specification): "The PostScript interpreter uses the /Encoding array to look up the character name, which is then used to access the /Charstrings entry with that name. The value of that entry is the glyph index, which is then used to retrieve the glyph description."

However, there are 5586 glyphs in GentiumPlus, so does this mean the remaining 5330 glyphs are wasted and unreachable? Of course not, but we can only reach 256 glyphs via each individual Encoding Vector: the trick we need is font re-encoding. The Encoding Vector is not a fixed entity: you can amend it or replace it entirely with a new one to map character codes 0 to 255 to different glyphs within the font. I won't give the full details here, although it's quite simple to understand. What you do, in effect, is a bit of PostScript programming to "clone" some of the font data structures, then give this "clone" a new PostScript font name and a new Encoding Vector which maps the 256 character codes to totally different glyphs. For tutorials on PostScript programming, including font re-encoding, I highly recommend the excellent Acumen Training Journal, which is completely free; specifically, the November 2001 and December 2001 issues.
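A quick way to see how far re-encoding can take you is to count how many Encoding Vectors would be needed to reach every glyph. The Python sketch below slices a list of glyph names into vectors of 256 entries, padding the last one with .notdef. The glyph names are placeholders I generate for the example (glyph0, glyph1, ...), not the real names in GentiumPlus.

```python
def build_encoding_vectors(glyph_names, size=256):
    """Split a list of glyph names into Encoding Vectors of `size` entries each."""
    vectors = []
    for start in range(0, len(glyph_names), size):
        chunk = glyph_names[start:start + size]
        # Pad the final vector so that every character code 0..255 has an entry.
        chunk = chunk + [".notdef"] * (size - len(chunk))
        vectors.append(chunk)
    return vectors

# Placeholder names standing in for the 5586 glyphs of GentiumPlus.
names = ["glyph%d" % i for i in range(5586)]
vectors = build_encoding_vectors(names)

print(len(vectors))                      # number of re-encoded fonts needed
print(vectors[-1].count(".notdef"))      # unused slots in the final vector
```

Each of these vectors would become the Encoding Vector of one re-encoded "clone" of the font, and each clone would need its own TFM file on the TeX side.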

If you want a simple example to explore the ideas behind Encoding Vectors you can download this code example (with PDF) to see the results of re-encoding Times-Roman.

Hooking this up to TeX and DVIPS

Having discussed fonts, encoding and glyphs at some length we now move to the next task: how do we use these ideas with TeX and DVIPS? Let's start with TeX. Here, I'm referring to the traditional TeX workflows that use TeX Font Metric (TFM) files. So what is a TFM? To do its typesetting work TeX's algorithms need only some basic information about the font you want to use: the metrics. TeX does not care about the actual glyphs in your font or what they look like; it needs a set of data that describes how big each glyph is. To TeX, your glyphs are boxes with a certain width, depth and height. That's not the whole story, of course, because TeX also needs some additional data called fontdimens: a set of additional parameters that describe some overall characteristics of the font. For pure text fonts there are 7 of these fontdimens; for math fonts there are 13 or 22, depending on the type/role of the math font. These fontdimens are, of course, built into the TFM file.
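The "glyphs are just boxes" idea can be sketched as follows. This Python fragment is a deliberately simplified model: the metrics are invented numbers in points, and it ignores kerning, ligatures and glue, all of which real TeX takes into account.

```python
from dataclasses import dataclass

@dataclass
class GlyphBox:
    width: float    # horizontal advance
    height: float   # extent above the baseline
    depth: float    # extent below the baseline

# Invented metrics (in points) for two characters; a real TFM supplies these values.
metrics = {
    "H": GlyphBox(width=7.5, height=6.83, depth=0.0),
    "y": GlyphBox(width=5.28, height=4.31, depth=1.94),
}

def hbox_dimensions(text):
    """Size of a row of glyph boxes: widths add up, height and depth are the maxima."""
    boxes = [metrics[ch] for ch in text]
    return (
        sum(box.width for box in boxes),
        max(box.height for box in boxes),
        max(box.depth for box in boxes),
    )

print(hbox_dimensions("Hy"))
```

Nothing in this calculation needs the glyph outlines, which is why TeX can do its work from a TFM file alone.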

Looking inside TFMs

TFM files use a highly compact binary file format and are quite unsuitable for viewing or editing. However, you can convert a TFM file to a readable/editable text representation using a program called tftopl, which is part of most TeX distributions. The editable text version of a TFM is referred to as a property list file. Near the start of the property list for a text font (e.g., cmr10.tfm) you should see the 7 fontdimens displayed like this:

The role of these fontdimens within math fonts is extremely complex. If you want to read about this in depth you can find a list of excellent articles in this post. In addition to the glyph metrics (height, width, depth) and fontdimens TFM files contain constructs for kerning and ligatures. There's a lot of information already available on the inner details of TFMs so there's no point repeating it here.

The bulk of a TFM file is concerned with providing the height, width and depth of the characters encoded into the TFM. And that brings up a very important point: individual TFM files are tied to a particular encoding. For example, right at the start of a cmr10.tfm file you should see something like this:

It contains the line (CODINGSCHEME TEX TEXT) telling you that the TFM is encoded using the TeX Text encoding scheme. Let's examine this. Referring back to our discussion of PostScript Encoding Vectors, let's take a look at the first few lines of the Encoding Vector sitting inside the Type 1 font file for cmr10 – i.e., cmr10.pfb. The first 10 positions are encoded like this:

And this is the key point: the character encoding in your TFM file has to match the encoding of your PostScript font (or a re-encoded version of it). If we look at the metric data for the corresponding characters encoded in the cmr10.tfm file we find:

Statements such as CHARACTER O 0 describe the metrics (just width and height in these examples) for the character with octal value 0; CHARACTER O 12 describes the character with octal value 12 (i.e., 10 in denary (base 10)). Note that the values are relative to the (DESIGNSIZE R 10.0) which means, for example, that CHARACTER O 12 has a width of 0.722224 × 10 = 7.22224 points, because the DESIGNSIZE is 10 points. So, it is clearly vital that the encoding of your TFM matches the encoding of your PostScript font, otherwise you'll get the wrong glyphs on output and the wrong widths, heights and depths used by TeX's typesetting calculations!
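The arithmetic is easy to check with a couple of lines of Python. The helper names below are my own, invented for this sketch; the radix letters (O for octal, D for decimal, H for hexadecimal, C for a literal character) are the ones used in property list files, and the width 0.722224 is the value quoted above for CHARACTER O 12.

```python
# Radix letters used for numbers in property list files.
RADIX = {"O": 8, "D": 10, "H": 16}

def character_code(radix, digits):
    """Convert the code from a CHARACTER line, e.g. ('O', '12'), to an integer."""
    if radix == "C":                 # 'C' gives the code as a literal ASCII character
        return ord(digits)
    return int(digits, RADIX[radix])

def to_points(relative_value, font_size_pt):
    """Metrics are stored relative to the font size, so scaling is one multiplication."""
    return relative_value * font_size_pt

code = character_code("O", "12")
width_at_10pt = to_points(0.722224, 10.0)   # font loaded at its design size
width_at_12pt = to_points(0.722224, 12.0)   # same font loaded "at 12pt"

print(code, round(width_at_10pt, 5), round(width_at_12pt, 5))
```

The second call shows why a single TFM serves every size: loading the font at 12pt simply scales the same relative values by 12 instead of 10.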

Using FreeType to generate raw metric data

FreeType is a superb C library which provides a rich set of functions to access many internals of a font, together, of course, with functions to rasterize fonts for screen display. Just to note, FreeType does not provide an OpenType shaping engine; for that you'll need to use the equally superb libotf C library (which also uses FreeType). However, I digress. Using FreeType you can create some extremely useful and simple utilities to extract a wide range of data from font files, generating the raw data for the TFM files and Encoding Vectors you'll need to hook up a Type 42 font to DVIPS and TeX. Let's look at this in a little detail. The task at hand is: given an OpenType (TrueType) font, how do you obtain details of the glyphs it contains: the names and metrics (width, height, depth)?

Simple examples of using the FreeType API

Here are some ultra-basic examples, without any proper error checking etc., to show how you might use FreeType. You start by initializing the FreeType library (FT_Init_FreeType(...)), then create a new face object (FT_New_Face(...)) and use this to access the font and glyph details you need. The first example writes metric data to STDOUT; the second example processes the font data to create an Encoding Vector and a skeleton property list file for creating a TFM. Note that this is a "bare bones" TFM: the example does not generate any ligature or kerning data. To generate a binary TFM from a property list file you need another utility called pltotf, which is also part of most TeX distributions.

Creating an Encoding Vector and property list file

The following simple-minded function shows how you might use FreeType to generate an Encoding Vector and property list file. Reflecting the unusual glyphs we're using, the output files are called weirdo.pl and weirdo.enc.

Here is a small extract from weirdo.vec and weirdo.pl – if you wish to explore the output you can download them (and weirdo.tfm) in this zip file. (In the data below I followed the neat idea from LCDF Typetools and put the glyph name in as a comment).

pltotf weirdo.pl weirdo.tfm
I had to round some heights by 0.0002451 units.

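I got a warning from pltotf, but it isn't serious: a TFM file can store only a small table of distinct height values (16, including zero), so pltotf nudges heights together when a font has more than that. To use the TFM you'll need to put it in a suitable location within your TEXMF tree.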
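To make the conversion step more concrete, here is a small Python sketch of how raw glyph metrics in font units might be turned into a property list entry. It is not the C function used to produce weirdo.pl: the function name and the sample numbers are invented, and I assume the raw metrics are an advance width plus the top and bottom of the glyph's bounding box, all in font units.

```python
def pl_character(code, glyph_name, advance, y_max, y_min, units_per_em):
    """Format one CHARACTER entry of a property list file.

    advance, y_max and y_min are in font units; the property list wants values
    relative to the design size, so each is divided by units_per_em.
    """
    width = advance / units_per_em
    height = max(y_max, 0) / units_per_em     # part above the baseline
    depth = max(-y_min, 0) / units_per_em     # part below the baseline, as a positive number
    return (
        "(CHARACTER O %o\n" % code
        + "   (COMMENT %s)\n" % glyph_name
        + "   (CHARWD R %.6f)\n" % width
        + "   (CHARHT R %.6f)\n" % height
        + "   (CHARDP R %.6f)\n" % depth
        + "   )\n"
    )

# Invented example: a glyph in a font with 1000 units per em, assigned to code 72.
entry = pl_character(72, "exampleglyph", advance=500, y_max=700, y_min=-200,
                     units_per_em=1000)
print(entry)
```

Writing one such entry for each of the character codes in the Encoding Vector, plus a short header giving the design size, yields a skeleton property list that pltotf can convert to a TFM.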

Hooking up to DVIPS

We've covered a huge range of topics so it is time to recap. So far, we've generated an Encoding Vector (weirdo.vec) based on the names of glyphs (in the GentiumPlus font) whose glyph IDs span the range 5000–5523. Within our Encoding Vector we mapped those glyph names to the character codes 32–255. We have also created a property list file, based on the same encoding, which simply contains the width, height and depth of the GentiumPlus glyphs in the range 5000–5523. Our next task is to pull together the following items and convince DVIPS to use them.

Re-encode GentiumPlus.t42: We need to create a re-encoded font that uses our new Encoding Vector (weirdo.vec).

Update config.ps: We need to tell DVIPS how to use our new font by creating a .map file and making sure DVIPS can find that map file.

Command-line switches: We'll need to use some command-line switches to give DVIPS the info it needs to do its job.

Our Type 42 font: GentiumPlus.t42: We must tell DVIPS to embed that font into its PostScript output.

Our goal is to tell TeX to load a font (TFM) called weirdo and for DVIPS to know how to use and find the weirdo font data to generate the correct PostScript code to render our glyphs. We'll use our strange new weirdo font like this (in plain TeX):

\font\weird=weirdo {\weird HELLO}

Note that the displayed output will not be the English word "HELLO" because we've chosen some rather strange glyphs from GentiumPlus. The key observation is that the input character codes are the ASCII values of the string HELLO; i.e. (in base 10):

H = 72
E = 69
L = 76
L = 76
O = 79

and our weirdo.enc Encoding Vector maps these character codes to the following glyphs:

So, we can expect some strange output in the final PostScript or PDF file!

How do we do the re-encoding?

The basic idea is that we tell DVIPS to embed the GentiumPlus.t42 PostScript Type 42 font data into its PostScript output stream. We will then write some short PostScript headers that will do the re-encoding to generate our newly re-encoded font: which we're calling weirdo. By using the DVIPS -h command-line switch we can get DVIPS to embed GentiumPlus.t42 and the header PostScript file to perform the re-encoding. For example:

DVIPS -h GentiumPlus.t42 -h weirdo.ps sometexfile.dvi

The actual re-encoding, and "creation", of our weirdo font will be taken care of by the file weirdo.ps, which will also need to contain the weirdo.enc data. If you wish, you can download weirdo.ps. Here is the tiny fragment of PostScript required within weirdo.ps to "create" the weirdo font by re-encoding our Type 42 font whose PostScript name is GentiumPlus.

Note, of course, you could create a header PostScript file to generate multiple new fonts each with their own unique Encoding Vectors containing a range of glyphs from the Type 42 font.
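For readers who have not seen PostScript re-encoding before, the Python sketch below prints the widely used "copy the font dictionary, swap the Encoding, define under a new name" idiom. It is a generic illustration and not necessarily the code in weirdo.ps. The array name WeirdoEncoding is hypothetical: it stands for a 256-element array of glyph names that must already have been defined earlier in the header file.

```python
def reencode_header(base_font, new_font, encoding_name):
    """Return PostScript that clones base_font under new_font with a new Encoding."""
    lines = [
        "/%s findfont" % base_font,
        "dup length dict begin",
        # Copy every entry except FID, which definefont must create afresh.
        "  { 1 index /FID ne { def } { pop pop } ifelse } forall",
        "  /Encoding %s def" % encoding_name,
        "  currentdict",
        "end",
        "/%s exch definefont pop" % new_font,
    ]
    return "\n".join(lines) + "\n"

header = reencode_header("GentiumPlus", "weirdo", "WeirdoEncoding")
print(header)
```

Calling the function several times with different names and encoding arrays would give the multiple-font header mentioned above.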

Telling DVIPS how to use our new font

So far we've built the TFM file for TeX, so now we need to tell DVIPS how to use it, so that it can process the weirdo font name as it parses the DVI file. DVIPS uses .map files to associate TFM file names with PostScript font names, together with the actions DVIPS needs to take in order to process the font files and get the right PostScript font data into its output. These actions include processing/parsing Type 1 font files (.pfb, .pfa) and re-encoding Type 1 fonts. For our weirdo font the .map file is very simple: all we need to do is create a file called weirdo.map with a single line:

weirdo weirdo

This super-simple .map file says that the TeX font name (TFM file) weirdo is mapped to a PostScript font called weirdo (as defined by the code in weirdo.ps). It also tells DVIPS that no other actions are needed: we're not doing the re-encoding here, nor are we asking DVIPS to process a Type 1 font file (.pfb) associated with weirdo, because there isn't one! After you have created weirdo.map you'll need to edit DVIPS's configuration file, config.ps, to tell DVIPS to use weirdo.map. Again, this is easy: all you need to do is add the following instruction to config.ps:

p +weirdo.map
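To show what "no other actions" means, here is a simplified Python sketch that splits a map-file line into its parts. It is not how DVIPS itself parses map files and it handles only the common layout: TFM name, PostScript name, an optional quoted string of PostScript instructions, and optional files introduced by "<". The second sample line is typical of the entries found in standard map files for re-encoded Type 1 fonts and is included only for contrast.

```python
import shlex

def parse_map_line(line):
    """Split a (simple) DVIPS map-file line into its fields."""
    tokens = shlex.split(line)            # keeps the quoted PostScript code as one token
    entry = {"tfm": tokens[0], "ps_name": tokens[1], "ps_code": None, "files": []}
    for token in tokens[2:]:
        if token.startswith("<"):
            entry["files"].append(token.lstrip("<"))   # font or encoding file to load
        else:
            entry["ps_code"] = token                   # instructions such as re-encoding
    return entry

ours = parse_map_line("weirdo weirdo")
typical = parse_map_line('ptmr8r Times-Roman "TeXBase1Encoding ReEncodeFont" <8r.enc')

print(ours)
print(typical)
```

For our line, the instruction and file fields come back empty: DVIPS only needs to know the PostScript font name, because the -h header files have already done everything else.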

Does it work?

Well, I'd have wasted many hours if it didn't :-). I used the following simple plain TeX example (test.tex) which I processed using my personal build of TeX for Windows (which does not use Kpathsea).

\hsize=300pt
\vsize=300pt
\font\smallweird=weirdo at 12pt
Dear \TeX\ I would like to say HELLO in weirdo so {\smallweird HELLO}. I would also like to see
a lot of strange glyphs so I'll input a text file containing some of them: {\smallweird \input weirdchars }.
\bye

The resulting DVI file was processed to PostScript using a standard build of DVIPS with the following command line:

DVIPS -h GentiumPlus.t42 -h weirdo.ps test.dvi

The resulting PostScript file is large because the GentiumPlus.t42 file is over 2MB. However, the PDF file produced by Acrobat Distiller was about 35KB because the Type 42 font (GentiumPlus.t42) was subsetted.

The C source code of the once-commercial Y&Y TeX distribution (for Windows) was donated to the TeX Users Group after Y&Y TeX Inc ceased trading. I bought a copy of the Y&Y TeX system in the late 1990s and certainly found it to be an excellent distribution. The C source code is free to download from http://code.google.com/p/yytex/. If you are interested in exploring the inner workings of TeX and want a Windows code base that is easy to build/compile, then this should be of real interest. Note that the DVI viewer, DVIWindo, makes use of Adobe Type Manager libraries which are not included in the download; in addition, binaries are not provided so you'll need to compile them.