%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %
% A SMALL TUTORIAL ON THE MULTILINGUAL FEATURES OF PATGEN2 %
% %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentstyle{article}
\title{A small tutorial on the multilingual features of PatGen2}
\author{Yannis Haralambous}
\date{}
\begin{document}
\maketitle
\def\XeT{X\kern-.1667em\lower.5ex\hbox{E}\kern-.125emT}
(This document is released under the GNU General Public License,
version~2 or any later version.)

I will briefly discuss, and illustrate with an example, the features
of PatGen2\footnote{The extension to PatGen by Peter Breitenlohner,
author of many beautiful and extremely useful pieces of \TeX ware, such as
\TeX--\XeT, {\tt DVIcopy}, etc.} that relate to the {\em translation}
file.
\section{Syntax of the translation file}
\subsection{What is the problem}
The problem is that in our (non-English, non-Latin,
non-Indonesian\footnote{As far as I know, English, Latin and
Indonesian are the only languages without diacritics or special
characters...}) languages, we have more characters than just the~26
letters of the Latin alphabet. Not to mention that alphabetical orders
can be quite different (remember: in Spanish, cucaracha comes before
chacal), and that there can be many ways to express these additional
characters: for example, the French \oe{} (as in << Histoire de l'\oe
il >> by Georges Bataille) can be given to \TeX\ as an 8-bit
character, or as \verb*=\oe =, or as \verb=\oe{}=, or as \verb=^^f7=
(since we are {\em all} now working with DC fonts), and so forth.

Now let's make (or enhance) the hyphenation patterns for our language(s).
Frank Liang did not give PatGen an extensibility feature: before
PatGen2, if your language had fewer than~26 letters you could hack patterns by
substituting letters; if it had more\ldots\ then the odds were against you.
So the problem is to express additional characters in the patterns in some
understandable way (and, if possible, to get the results in the appropriate
alphabetical order).
\subsection{The solution}
PatGen2 allows you to include more characters than the~26 letters of
the alphabet. You can specify arbitrarily many {\em input forms} for
each of them (internally, these forms are treated as identical), and you can
specify the alphabetical order of your language. The latter feature is
only of interest when you want to study the patterns file to correct
bugs; it doesn't affect the result.

This information is passed to PatGen2 by means of a {\em
translation} file (usually with the extension \verb=.tra=). The syntax
of this file is not very user-friendly, but if you deal with PatGen
you are supposed to be a hacker anyway, and hackers just {\bf love}
weird-but-efficient syntaxes [the proof: we all like
\TeX!\footnote{\ldots just joking}].

On the first line you specify the \verb=\lefthyphenmin= and
\verb=\righthyphenmin= values. The first seven positions of this line
are used as follows: two positions for \verb=\lefthyphenmin=, two
positions for \verb=\righthyphenmin=, and one position each for the
symbols that replace \verb=.=, \verb=-= and \verb=*= in the output file.
This can be useful if you are going to use the dot, the asterisk or the
hyphen to describe your additional characters (for example, one may use
the dot to specify the dot accent on a letter).
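
As an illustration only, the column layout of this first line can be
mimicked by a few lines of Python. This is a sketch and not PatGen2's
own code: the function name \verb=parse_header= and the second sample
line are made up for the occasion, and I assume that a blank in one of
the positions 5 to~7 keeps the default symbol.

\begin{verbatim}
```python
def parse_header(line):
    """Split the first line of a translation file into its fixed-width fields."""
    line = line.ljust(7)        # pad so that missing columns read as blanks
    left = int(line[0:2])       # columns 1-2: \lefthyphenmin
    right = int(line[2:4])      # columns 3-4: \righthyphenmin
    # columns 5-7: substitutes for '.', '-' and '*'
    # (assumption: a blank keeps the default symbol)
    marks = [sub if sub != " " else default
             for sub, default in zip(line[4:7], ".-*")]
    return left, right, marks


print(parse_header(" 1 1"))      # both values 1, default symbols
print(parse_header(" 2 3 =+"))   # hypothetical: '=' replaces '-', '+' replaces '*'
```
\end{verbatim}

The second header is purely hypothetical; it would free the hyphen and
the asterisk for use inside the representations of your letters.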

The next lines concern the letters of the alphabet of our language. The
first position of each line is the field delimiter. You know
from languages like C or {\tt awk} that it is very useful to define
our own delimiter (instead of the blank space) if for any reason we
want to {\em include} a blank space in our fields. This will be the
case in the example: one usually leaves a blank space after a \TeX\
macro without argument.
If you don't need this, just leave the first position blank, and the
blank space will serve as delimiter.

Next comes the
standard representation of our character (preferably lowercase). This
is how the character will appear in the patterns and in the hyphenated
output file.
Then comes another delimiter, followed by all the {\em equivalent}
representations of our character: its uppercase form, and as many
other input forms as we wish, each followed by a delimiter.

The order of the lines specifies the alphabetical order of our characters.
When PatGen2 sees two consecutive delimiters,
it stops reading the line, so we can include
comments after them.
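
The reading of one letter line, as just described, might be sketched
in Python as follows. The function name \verb=parse_letter_line= is
hypothetical, and the real program certainly proceeds differently
internally; the sketch only follows the rules given above (the first
position is the delimiter, and two delimiters in a row end the line).

\begin{verbatim}
```python
def parse_letter_line(line):
    """Return (standard form, [equivalent input forms]) for one letter line."""
    delim = line[0]              # first position: the field delimiter
    fields = []
    for field in line[1:].split(delim):
        if field == "":          # empty field = two delimiters in a row
            break                # whatever follows is a comment
        fields.append(field)
    return fields[0], fields[1:]


# blank as delimiter
print(parse_letter_line(" a A"))
# '#' as delimiter, so that an input form may end with a blank
print(parse_letter_line(r"#p#P#\varpi ## pi, also given as a macro"))
```
\end{verbatim}

In the second call the trailing blank of the macro form survives,
which is the whole point of choosing a delimiter other than the blank
space.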
\section{An example}
To illustrate this I have taken some
Greek words (people who know $X\acute\alpha%
\rho\rho\upsilon$ $K\lambda\acute\upsilon\nu\nu$
and his album $\Pi\alpha\tau%
\acute\alpha\tau\epsilon\varsigma$ will recognize some of these
words\ldots) and included an accented $\alpha$ and a variant
representation of $\pi$, in the form of a macro \verb*=\varpi =. I also
followed the Greek alphabetical order in the translation file.
Here is my list of words:
\begin{verbatim}
A-NU-PO-TA-QTOS
A-KA-TA-M'A-QH-TOS
A-EI-MNH-STOS
MA-NA
TOUR-KO-GU-FTIS-SA
NU-QO-KO-PTHS
LI-ME-NO-FU-LA-KAS
MPRE-LOK
A-GA-\varpi A-EI
PEI-RAI-'AS
\end{verbatim}
(file {\tt greek.dic}); and here is my translation file ({\tt greek.tra}):
\begin{verbatim}
 1 1
 a A
 'a 'A
 b B
 g G
 d D
 e E
 z Z
 h H
 j J
 i I
 k K
 l L
 m M
 n N
 o O
#p#P#\varpi ##
 r R
 s S
 t T
 u U
 f F
 q Q
 y Y
 w W
\end{verbatim}
(Yes, yes, this is not a joke: Greek really uses the value~1
for both the minimal left
and right hyphenation!) As you see, I have chosen \verb=#= as delimiter
in the case of $\pi$, because \verb=\varpi= is supposed to be followed
by a blank space. And then I wrote two delimiters (\verb=##=) to tell PatGen2
that I had finished talking about $\pi$.
Here is what I got on my console:
\begin{verbatim}
PatGen -t greek.tra -o greek.out greek.dic
This is PatGen, C Version 2.1 / Macintosh Version 2.0
Copyright (c) 1991-93 by Wilfried Ricken. All rights reserved.
left_hyphen_min = 1, right_hyphen_min = 1, 24 letters
0 patterns read in
pattern trie has 266 nodes, trie_max = 290, 0 outputs
hyph_start: 1
hyph_finish: 2
pat_start: 2
pat_finish: 4
good weight: 1
bad weight: 1
threshold: 1
processing dictionary with pat_len = 2, pat_dot = 1
0 good, 0 bad, 31 missed
0.00 %, 0.00 %, 100.00 %
69 patterns, 325 nodes in count trie, triec_max = 440
25 good and 42 bad patterns added (more to come)
finding 29 good and 0 bad hyphens, efficiency = 1.16
pattern trie has 333 nodes, trie_max = 349, 2 outputs
processing dictionary with pat_len = 2, pat_dot = 0
29 good, 0 bad, 2 missed
93.55 %, 0.00 %, 6.45 %
45 patterns, 301 nodes in count trie, triec_max = 378
2 good and 43 bad patterns added
finding 2 good and 0 bad hyphens, efficiency = 1.00
pattern trie has 339 nodes, trie_max = 386, 6 outputs
processing dictionary with pat_len = 2, pat_dot = 2
31 good, 0 bad, 0 missed
100.00 %, 0.00 %, 0.00 %
47 patterns, 303 nodes in count trie, triec_max = 369
0 good and 47 bad patterns added
finding 0 good and 0 bad hyphens
pattern trie has 344 nodes, trie_max = 386, 13 outputs
51 nodes and 11 outputs deleted
total of 27 patterns at hyph_level 1
pat_start: 2
pat_finish: 4
good weight: 1
bad weight: 1
threshold: 1
processing dictionary with pat_len = 2, pat_dot = 1
31 good, 0 bad, 0 missed
100.00 %, 0.00 %, 0.00 %
27 patterns, 283 nodes in count trie, triec_max = 315
0 good and 27 bad patterns added
finding 0 good and 0 bad hyphens
pattern trie has 295 nodes, trie_max = 386, 4 outputs
processing dictionary with pat_len = 2, pat_dot = 0
31 good, 0 bad, 0 missed
100.00 %, 0.00 %, 0.00 %
27 patterns, 283 nodes in count trie, triec_max = 303
0 good and 27 bad patterns added
finding 0 good and 0 bad hyphens
pattern trie has 320 nodes, trie_max = 386, 6 outputs
processing dictionary with pat_len = 2, pat_dot = 2
31 good, 0 bad, 0 missed
100.00 %, 0.00 %, 0.00 %
24 patterns, 280 nodes in count trie, triec_max = 286
0 good and 24 bad patterns added
finding 0 good and 0 bad hyphens
pattern trie has 328 nodes, trie_max = 386, 9 outputs
35 nodes and 7 outputs deleted
total of 0 patterns at hyph_level 2
hyphenate word list? y
writing PatTmp.2
31 good, 0 bad, 0 missed
100.00 %, 0.00 %, 0.00 %
Time elapsed: 0:37:70 minutes.
\end{verbatim}
and here are the results: first of all the patterns
(file {\tt greek.out})
\begin{verbatim}
a1g
a1e
a1k
a1m
a1n
a1p
a1t
a1q
'a1q
e1l
e1n
h1t
i1'a
i1m
i1r
1ko
o1g
o1p
o1t
o1f
r1k
s1s
1st
u1l
u1p
u1f
u1q
\end{verbatim}
As you see, {\tt o1t} comes before {\tt o1f}: (smile) simply
because we are talking about $o1\tau$ and $o1\phi$. So the
alphabetical order is properly respected. You also see that PatGen2
has read the word \verb*=A-GA-\varpi A-EI= exactly as if it
were \verb*=A-GA-PA-EI= and has derived from it the pattern
{\tt a1p} (if you look at the input words there is no other
reason for this pattern to exist).
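
How an input word is mapped to its standard form can be pictured with
the following Python sketch. It is not taken from PatGen2: the names
\verb=normalise= and \verb=table= are mine, only the letters of this
example are covered, and trying the longest input form first is an
assumption.

\begin{verbatim}
```python
def normalise(word, table):
    """Replace every input form found in word by its standard representation."""
    # try longer forms first, so that "'A" wins over a plain "A"
    forms = sorted(table, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        for form in forms:
            if word.startswith(form, i):
                out.append(table[form])
                i += len(form)
                break
        else:
            out.append(word[i])  # e.g. the hyphen: passed through unchanged
            i += 1
    return "".join(out)


# input form -> standard form, following the Greek translation file
table = {c: c.lower() for c in "ABGDEZHJIKLMNOPRSTUFQYW"}
table["'A"] = "'a"
table["\\varpi "] = "p"          # the macro form, trailing blank included

print(normalise("A-GA-\\varpi A-EI", table))
print(normalise("PEI-RAI-'AS", table))
```
\end{verbatim}

Both spellings of $\pi$ end up as the same letter {\tt p}, which is
why the pattern {\tt a1p} can come out of a word that contains no
{\tt P} at all.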
And here is our result (file {\tt PatTmp.2}):
\begin{verbatim}
a*nu*po*ta*qtos
a*ka*ta*m'a*qh*tos
a*ei*mnh*stos
ma*na
tour*ko*gu*ftis*sa
nu*qo*ko*pths
li*me*no*fu*la*kas
mpre*lok
a*ga*pa*ei
pei*rai*'as
\end{verbatim}
(Of course it is correct; do you think I would have shown it if it
weren't?) As you see, there is no \verb*=\varpi = anymore:
PatGen2 has really replaced it with \verb=p=, and so
\verb*=A-GA-\varpi A-EI= has become \verb=a*ga*pa*ei=.
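
For the curious, here is a small Python sketch of how patterns like
those of {\tt greek.out} are applied to a word, following the usual
scheme of the \TeX book (appendix~H): all matching patterns are
superposed, the largest digit wins at each position, and an odd digit
allows a hyphen. The helper names are hypothetical, treating an
apostrophe plus the following character as one letter is special to
this example, and word-boundary dots are left out because none of the
27~patterns uses them.

\begin{verbatim}
```python
PATTERNS = """a1g a1e a1k a1m a1n a1p a1t a1q 'a1q e1l e1n h1t i1'a i1m
i1r 1ko o1g o1p o1t o1f r1k s1s 1st u1l u1p u1f u1q""".split()


def letters(text):
    """Split text into letters; an apostrophe plus the next character is one letter."""
    out, i = [], 0
    while i < len(text):
        step = 2 if text[i] == "'" else 1
        out.append(text[i:i + step])
        i += step
    return out


def parse(pattern):
    """Return (keys, values); values[k] is the digit standing before letter k."""
    keys, values = [], [0]
    for tok in letters(pattern):
        if tok.isdigit():
            values[-1] = int(tok)
        else:
            keys.append(tok)
            values.append(0)
    return keys, values


def hyphenate(word, patterns=PATTERNS, left=1, right=1):
    """Mark the allowed hyphens of word with '*', as in PatTmp.2."""
    w = letters(word)
    scores = [0] * (len(w) + 1)          # scores[k]: the gap before letter k
    for keys, values in map(parse, patterns):
        for start in range(len(w) - len(keys) + 1):
            if w[start:start + len(keys)] == keys:
                for k, v in enumerate(values):
                    scores[start + k] = max(scores[start + k], v)
    parts = []
    for k, letter in enumerate(w):
        # an odd score allows a hyphen, within the left/right minima
        if left <= k <= len(w) - right and scores[k] % 2 == 1:
            parts.append("*")
        parts.append(letter)
    return "".join(parts)


print(hyphenate("agapaei"))
print(hyphenate("akatam'aqhtos"))
```
\end{verbatim}

With only one level of patterns there is nothing to inhibit, so every
digit is~1; the test for an odd score only matters once even-level
patterns enter the game.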
\section{Where do I find more information?}
I have deliberately not discussed the various parameters used for
pattern generation. For these there is very good literature:
\begin{itemize}
\item the \TeX book, by the Grand Wizard of \TeX\ arcana,
appendix~H;
\item \LaTeX\ Erweiterungsm\"oglichkeiten, by Helmut Kopka,
Addison-Wesley, pages 482--489;
\item Swedish Hyphenation for \TeX, by Jan Michael Rynning
[sorry, I don't know where this paper is published];
\item Word Hy-phen-a-tion by Com-put-er, by Franklin Mark Liang,
Stanford University Report {\tt STAN-CS-83-977};
\item a forthcoming paper by Dominik Wujastyk, on British
hyphenation;
\item Hyphenation Patterns for Ancient Greek and Latin,
TUGboat~13~(4), pages 457--469;
\item and many others\ldots
\end{itemize}

Where can you get PatGen2? Probably everywhere, but certainly in
Stuttgart ({\tt IP 129.69.1.12}), as
\verb=soft/tex/systems/pc/utilities/patgen.zip= (take a look also at
\verb=soft/tex/systems/knuth/texware/patgen.version2.1/patgen.README=).
\section{Go forth, etc etc}
OK, once again {\sc Go Forth} and make masterpieces of hyphenation
patterns\ldots
{\bf but} please get in touch with the TWGMLC (Technical Working
Group on Multiple Language Coordination) first, since people there
are working on many languages: maybe they have already done what
you need and are still testing it; or maybe they haven't and in
that case you could help us a lot.
\end{document}