Conversion scripts

The conversion scripts are located in gt/script. There are two
types: perl scripts (*.pl) and xfst scripts. The xfst scripts are
compiled: the source files are named filename.regex and the binary
files filename.fst.

The scripts have different functions. Some scripts convert input text
to the internal format used by the program, whereas other scripts convert
the output of the program from the internal format to the target encoding.

Note that the unix utility iconv contains ready-made
conversion routines for many code tables. The syntax is as follows:

$ iconv --from-code=ISO-8859-1 --to-code=UTF-8 < old_file > new_file

The available code tables are listed with iconv --list.
This does not help in converting text to our internal format,
but it may be used in the future for conversion to UTF-8.

Naming the scripts

The scripts are named "sourceform-targetform.scripttype".
The perl script converting Latin 6 input to the internal 7-bit digraph
system á, c1, d1, n1, s1, t1, z1, is called
latin6-7bit.pl.
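
The naming convention can be taken apart mechanically. The following
is a minimal shell sketch, assuming only that a name has the shape
described above. The function name split_script_name is hypothetical
and not part of gt/script.

```shell
# Split a script name of the form sourceform-targetform.scripttype
# into its three parts, using POSIX parameter expansion only.
# split_script_name is a hypothetical helper, not part of gt/script.
split_script_name() {
    name=$1
    scripttype=${name##*.}     # text after the last dot
    stem=${name%.*}            # name without the script type
    sourceform=${stem%%-*}     # text before the first hyphen
    targetform=${stem#*-}      # text after the first hyphen
    printf '%s %s %s\n' "$sourceform" "$targetform" "$scripttype"
}

split_script_name latin6-7bit.pl    # prints: latin6 7bit pl
```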

At the moment there are scripts for converting from ws2, Latin 6 and
mac. The mac format is here called "linmac" (mac as observed on
Linux), since mac files are translated to something else when they
are moved to Linux. This "something else" is taken as the starting
point for the conversion script.

Scripts converting input text to and from internal digraphs ("7bit")

Perl scripts

The perl scripts contain conversion lines of the format
s/\273/t1/g. This line converts a t-stroke to t1.
The code position of t-stroke in the Latin 6 code table (used by,
among others, Statens Kartverk) is hexadecimal BB. The scripts use
Perl's octal escape notation, and the octal value of BB is 273.

Note that there are two different scripts, utf8-7bit.pl and
utf8.pl. The former converts from utf8 to 7bit. The latter is an
all-in-one script that converts from several formats (mac saved as
utf8, text written on Win9x saved as utf8, etc.) to 7bit. Testing
is needed to see whether this division is useful. In any case,
utf8-7bit.pl works when the input has not been corrupted, i.e.
when it is given real utf8 as input.

xfst scripts

The <encoding>-7bit.regex files convert from the given encoding
to the internal format.

The 7bit-<encoding>.regex files convert from the internal format
to the given encoding.

Compiling .regex files

To make use of the .regex files you may have to compile them to
.fst files. Go to the script directory and compare the .regex and
.fst files. If the .regex file is older than the .fst file with
the same name, the .fst file is up to date and can be used as it
is. If the .fst file is older or does not exist, you must compile
it. To do so, type the following command while in the script
directory:

make all
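
The timestamp comparison described above can also be left to the
shell. A minimal sketch follows: needs_compile is a hypothetical
helper, and the -nt ("newer than") test is assumed to be available,
as it is in bash and most other current shells. It only reports
which files are out of date; the compilation itself is still done
with make.

```shell
# needs_compile is a hypothetical helper: succeed (exit status 0) when
# the .fst file is missing or older than its .regex source.
needs_compile() {
    regex=$1
    fst=${regex%.regex}.fst
    [ ! -e "$fst" ] || [ "$regex" -nt "$fst" ]
}

# Report every .regex file in the current directory that needs compiling.
for regex in *.regex
do
    [ -e "$regex" ] || continue    # no .regex files here at all
    if needs_compile "$regex"
    then
        echo "$regex: compile needed"
    fi
done
```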

Using the resulting .fst files for North Sámi

To convert from encoding X to the internal format, go to
the script directory and type the following command:

This file converts the input from the ws2 encoding to the
internal format. The input will then be analyzed with the sme.fst
file and the result is converted back to the ws2 format.

The other <encoding>-sme files follow the same pattern.

The case conversion scripts

Initial capital letter

The most important case conversion scripts are case.regex
(caseconv.fst). They differ from language to language, and are
located in the language-specific directories. They form an integrated
part of the Makefiles, and the resulting parsers are able to
recognise initial capital letters.

Letters in all caps

There are also scripts, called allcaps.regex, that allow for words
written in all caps. With the help of such scripts,
"Duodji" is accepted, as is "DUODJI",
but "DuoDji" is not. These are also located in the src
directories (so far only for sme), and are integrated in the
Makefile. The resulting allcaps.fst is not compiled together
with sme.fst into a single transducer, however, as this would have
resulted in too large a network. Instead, it is kept separate in the
sme/bin directory, and when needed, it may be invoked by the following
command (assuming you are in gt/sme):

... | lookup -flags mbTT -f src/cap-sme | ...

Note that the lookup script file is located in sme/src, whereas the
binary allcaps.fst that the cap-sme file refers to is located in
sme/bin.
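
The acceptance pattern can be illustrated outside xfst. The sketch
below is not the allcaps.regex script. It only mimics the rule for
plain ASCII words with a regular expression, and accepted_case is a
hypothetical name. It also assumes that all-lowercase words are
accepted. Real Sámi text contains non-ASCII letters, which this
sketch does not handle.

```shell
# accepted_case is a hypothetical helper: succeed for words that are
# all lower case, have an initial capital only, or are all caps.
# ASCII letters only; LC_ALL=C keeps the bracket ranges predictable.
accepted_case() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^([A-Z]?[a-z]+|[A-Z]+)$'
}

for word in duodji Duodji DUODJI DuoDji
do
    if accepted_case "$word"
    then
        echo "$word: accepted"
    else
        echo "$word: rejected"
    fi
done
```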

The spellrelax scripts

South and Lule Saami have scripts to allow for different
practices for writing ï (as ï or i) and for the mixing of
Norwegian and Swedish letter variants. These are xfst scripts,
integrated in the makefiles of sma and smj.

The scripts converting 7bit to html

Børre? Or should this be documented on the web interface page?

Scripts converting from "alien" fileformats to 7bit

pdf to 7bit converters

The script pdfto7bit.pl converts pdf files to 7bit. It is used
like this:

pdfto7bit.pl [option] <filename>

The options allowed are:

-e: output the even pages

-o: output the odd pages

To use it, the gt/script directory must be in your path. Type
this at the command prompt:

PATH="$HOME/[path to the gt directory]/gt/script:$PATH"

After this you can type "pdfto7bit.pl" at the command
prompt to use it. Typical uses are shown below.

To analyze a pdf file, go to the gt/sme directory and type:

pdfto7bit.pl <filename.pdf> | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT sme.fst | less

The more advanced uses documented in the sme-manual can also be
applied.

for pdffile in [directory of pdf files]/*.pdf
do
    pdfto7bit.pl "$pdffile" > [directory of text files]/"$(basename "$pdffile" .pdf)".txt
done

This command takes a batch of pdf files, converts them to text files and saves them in a given directory. The expression "$(basename "$pdffile" .pdf)".txt ensures that a pdf file named foo.pdf is saved as foo.txt.
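
The renaming step can be tried on its own, without any pdf files.
This is a small sketch: txt_name is a hypothetical helper, and the
path is only an example.

```shell
# txt_name is a hypothetical helper: derive the text file name that
# the loop above would use for a given pdf file.
txt_name() {
    printf '%s.txt\n' "$(basename "$1" .pdf)"
}

txt_name /some/directory/foo.pdf    # prints: foo.txt
```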

Some pdf documents have Sámi and Norwegian text on alternating pages. The options -e and -o are meant to handle this. If the Sámi text is on the even pages of such a document, type the following at the command prompt: