Mapping introduction

The object of this tutorial is to introduce you to the tools we'll be using for conversion of data encoded in legacy fonts to Unicode. We'll use a trivial example to see an overview of the steps in the process.

First we'll look at the source file

Open Explorer and navigate to wherever you unzipped the file Demo.txt. Double-click on this file. The Notepad program comes up displaying the three characters: SIp in the Times New Roman font (or whatever you have as your default font).

Note

If your computer is set to use a different program than Notepad to open .txt files, you will have to manually launch Notepad and open this file, or figure out the equivalent process with your program.

Select Format / Font, then choose SILDoulos IPA93 from the list and click OK.

The Notepad program now displays the three characters as IPA in the SILDoulos IPA93 font.
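Switching fonts changes only how the stored bytes are drawn; the file itself is untouched. A minimal Python sketch of that idea, assuming Demo.txt holds the three characters as single bytes (the cp1252 codec is an assumption chosen for illustration, not something the tutorial specifies):

```python
# The three characters of Demo.txt as they would be stored on disk,
# assuming a single-byte legacy encoding (cp1252 is an assumption).
legacy_bytes = "SIp".encode("cp1252")

# These byte values are what a legacy font reinterprets: Times New Roman
# draws them as "SIp", while an IPA font draws phonetic symbols instead.
print(" ".join(f"{b:02X}" for b in legacy_bytes))  # 53 49 70
```

The same three byte values are the input to every conversion later in this tutorial.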

Select Format / Font, then choose Times New Roman (or whatever the font was before you changed it!) from the list and click OK. This is to reverse the font change done above.

Select File / Close to close Notepad.

Now we'll look at the mapping file

In order to keep this demonstration simple, we have only included mappings for the three characters that occur in the source file. This is not at all realistic!

From Windows Explorer, right-click on Demo.map, select Open With, and then choose the TECkit Mapping Unicode Editor.

At the Select Conversion Type dialog box, make sure Legacy to Unicode and Bidirectional are selected.

Click on OK.

Where it says "Select the font for the left-hand side encoding", click on OK.

Select "SILDoulos IPA93" for the font and click on OK.

Where it says "Select the font for the right-hand side encoding", click on OK.

Select "Doulos SIL" for the font and click on OK.

The TECkit Mapping Editor program will start with the Demo.map file loaded. The display should have:

If you do not already see a message saying "Compiled successfully!" then select File / Compile. The TECkit Mapping Editor creates the compiled mapping file C:\UT\Tutorial\1-Intro\Demo.tec.

Near the bottom, in the "Left-side Sample:" box, carefully type "SIp". Below it you'll see what the characters are converted to, followed by the round-trip conversion. All three should look the same.

Delete the "SIp" in the top box and replace it with "RST". You'll see "FFFD 0283 FFFD" displayed in the middle box. U+FFFD is the Unicode replacement character, which is "...used to replace an incoming character whose value is unknown or unrepresentable in Unicode." Since the Demo.map file contains no mapping for 0x52 ("R") or 0x54 ("T"), these characters are replaced by U+FFFD. In the conversion back to bytes, U+FFFD is in turn converted to a glottal character.
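The replacement behaviour can be sketched in a few lines of Python. This is only an illustration of the idea, not how TECkit is implemented. The table below is a hypothetical stand-in for Demo.map: only the mapping of "S" to U+0283 is confirmed by the text above, the other two entries are assumed, and the fallback byte 0x3F ("?") is likewise an assumption about which byte the legacy font draws as the glottal character.

```python
# Hypothetical stand-in for Demo.map: legacy byte -> Unicode code point.
LEGACY_TO_UNICODE = {
    0x53: 0x0283,  # "S" -> U+0283 (confirmed by the "RST" result above)
    0x49: 0x026A,  # "I" -> U+026A (assumed)
    0x70: 0x0070,  # "p" -> U+0070 (assumed)
}
# Reverse table for the Unicode-to-legacy direction.
UNICODE_TO_LEGACY = {cp: b for b, cp in LEGACY_TO_UNICODE.items()}

REPLACEMENT_CHAR = 0xFFFD  # used for legacy bytes with no mapping
REPLACEMENT_BYTE = 0x3F    # assumed fallback byte for unmapped code points


def to_unicode(data: bytes) -> str:
    """Map each legacy byte, substituting U+FFFD when no mapping exists."""
    return "".join(chr(LEGACY_TO_UNICODE.get(b, REPLACEMENT_CHAR)) for b in data)


def to_legacy(text: str) -> bytes:
    """Map each code point back, substituting the fallback byte if unmapped."""
    return bytes(UNICODE_TO_LEGACY.get(ord(ch), REPLACEMENT_BYTE) for ch in text)


def code_points(text: str) -> str:
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


print(code_points(to_unicode(b"RST")))  # FFFD 0283 FFFD
print(to_legacy(to_unicode(b"SIp")))    # b'SIp'  (round trip is lossless)
print(to_legacy(to_unicode(b"RST")))    # b'?S?'  (round trip loses R and T)
```

The last line shows why an incomplete mapping is dangerous: once a byte has become U+FFFD, the original value cannot be recovered.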

Click on File / Exit to close the TECkit Mapping Editor.

Using the compiled mapping

Now we will use the compiled mapping Demo.tec to convert the text in our source file Demo.txt using the DropTEC program.

Click on All Programs / SIL Converters / TECkit / DropTEC to open the program.

From Explorer, drag the compiled mapping file Demo.tec to the box at the top labeled "Mapping file:". (Alternatively, you can use the adjacent Browse button and navigate to the file.)

From Explorer, drag the file Demo.txt to the box on the left labeled "Legacy text file:" (alternatively, you can use the adjacent Browse button and navigate to the file). When you have loaded the file, DropTEC prompts you for a destination file. Click Save and accept the default name of Demo-U.txt.

Once you supply the name, the file is created, so if you want to select a different "Unicode output form", you'll need to do it before specifying the input file (and giving the name for the output file).

Click on File / Exit of the "DropTEC" window to close the program.

Next, open Demo-U.txt in Notepad. Change the font to "Doulos SIL" and check that the characters display properly. They should.
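Conceptually, the DropTEC step reads the legacy bytes, applies the mapping, and writes the result in the chosen Unicode form. A rough Python sketch of that flow follows. The function name convert_file, the mapping table (only "S" to U+0283 is confirmed by this tutorial), and the default output form of UTF-8 with a Byte Order Mark are all assumptions made for illustration, not DropTEC's actual behaviour.

```python
from pathlib import Path

# Hypothetical stand-in for the compiled Demo.tec mapping.
# Only 0x53 -> U+0283 is confirmed; the other entries are assumed.
LEGACY_TO_UNICODE = {0x53: "\u0283", 0x49: "\u026A", 0x70: "\u0070"}


def convert_file(source: Path, target: Path, output_form: str = "utf-8-sig") -> None:
    """Convert a legacy-encoded file to Unicode text.

    output_form is a Python codec name; "utf-8-sig" writes UTF-8 with a
    leading Byte Order Mark (an assumed default).
    """
    legacy = source.read_bytes()
    # Unmapped bytes become U+FFFD, the Unicode replacement character.
    text = "".join(LEGACY_TO_UNICODE.get(b, "\uFFFD") for b in legacy)
    target.write_bytes(text.encode(output_form))
```

With Demo.txt in the current folder, calling convert_file(Path("Demo.txt"), Path("Demo-U.txt")) would produce the output file in one step. As in DropTEC, the output form has to be chosen before the conversion runs.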

Encoding Forms

Note

The object of this tutorial is to look at several different Unicode encoding forms. We'll use the example file from the Introduction tutorial.

Click on All Programs / SIL Converters / TECkit / DropTEC to open the program.

With Explorer, navigate to the C:\UT\Tutorials\2-Encoding Forms folder and drag the compiled mapping file Demo.tec to the box at the top labeled "Mapping file:". (Alternatively, you can use the adjacent Browse... button in DropTEC and navigate to the file.)

From Explorer, drag the file Demo.txt to the box on the left labeled "Legacy text file:". (Alternatively, you can use the adjacent Browse... button and navigate to the file.) When you have loaded the file, DropTEC prompts you for a destination file. Replace the default name of Demo-U.txt with Demo-UTF8.txt. Make sure that UTF8 is selected in the "Unicode output form" box. Click on Save.

Note

Once you supply the target file name, the file is created, so if you want to select a different "Unicode output form", you'll need to do it before specifying the input file (and giving the name for the output file).

Click on UTF16BE in the "Unicode output form" box. Drag the Demo.txt file to the "Legacy text file:" box. Replace the default name of Demo-U.txt with Demo-UTF16BE.txt. Click on Save.

Click on UTF16LE in the "Unicode output form" box. Drag the Demo.txt file to the "Legacy text file:" box. Replace the default name of Demo-U.txt with Demo-UTF16LE.txt. Click on Save.

Click on UTF32BE in the "Unicode output form" box. Drag the Demo.txt file to the "Legacy text file:" box. Replace the default name of Demo-U.txt with Demo-UTF32BE.txt. Click on Save.

Click on UTF32LE in the "Unicode output form" box. Drag the Demo.txt file to the "Legacy text file:" box. Replace the default name of Demo-U.txt with Demo-UTF32LE.txt. Click on Save.

Close the "DropTEC" window.

Use Explorer to double-click on the DumpDemo.bat file in the C:\UT\Tutorials\2-Encoding Forms folder. (This will not work on Windows 95.) This will generate a hex dump of the source file and the five files you just created. When you're done looking at the hexadecimal dump, press a key to close the box. The display should look like:

The table below shows how the three characters (plus the initial Byte Order Mark) appear in each of the different encodings.
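The byte sequences for each encoding form can also be worked out with a short Python sketch. It assumes the converted text is U+0283 U+026A U+0070 (only U+0283 is confirmed by the Introduction tutorial; the other two code points are illustrative guesses), and it uses Python codec names as stand-ins for DropTEC's "Unicode output form" choices.

```python
# Byte Order Mark (U+FEFF) followed by the three converted characters.
# U+026A and U+0070 are assumed values for illustration.
text = "\uFEFF\u0283\u026A\u0070"

# DropTEC output form -> Python codec name (the pairing is an assumption).
FORMS = {
    "UTF8": "utf-8",
    "UTF16BE": "utf-16-be",
    "UTF16LE": "utf-16-le",
    "UTF32BE": "utf-32-be",
    "UTF32LE": "utf-32-le",
}


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02X}" for b in data)


# One row per encoding form, one column per character (BOM first).
for name, codec in FORMS.items():
    columns = [hex_bytes(ch.encode(codec)) for ch in text]
    print(f"{name:<8}" + " | ".join(columns))
```

Under these assumptions the UTF8 row reads EF BB BF | CA 83 | C9 AA | 70, and the UTF16LE row reads FF FE | 83 02 | 6A 02 | 70 00. Comparing the BE and LE rows shows that the same code units are stored with their bytes in opposite order.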

Note

Although, strictly speaking, the Byte Order Mark is not needed for UTF-8, some programs include it as an indication that the text is encoded in UTF-8.
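A program that uses the Byte Order Mark as such an indication might check the first few bytes of a file along these lines. This is a sketch only; the function name sniff_bom and the returned labels are hypothetical, while the BOM constants come from Python's standard codecs module.

```python
import codecs

# Longer signatures are checked first, because the UTF-32LE mark
# (FF FE 00 00) begins with the UTF-16LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF32BE"),
    (codecs.BOM_UTF32_LE, "UTF32LE"),
    (codecs.BOM_UTF8, "UTF8"),
    (codecs.BOM_UTF16_BE, "UTF16BE"),
    (codecs.BOM_UTF16_LE, "UTF16LE"),
]


def sniff_bom(data: bytes):
    """Return the encoding form suggested by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xEF\xBB\xBF\xCA\x83"))  # UTF8
print(sniff_bom(b"SIp"))                   # None (no BOM, e.g. legacy text)
```

A file without a BOM, such as the legacy Demo.txt, gives no such hint, so its encoding has to be known some other way.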