Introduction

In version 6 of Paratext,[1] users will need to choose between different options for the character encoding of their data. In this article, I will try to explain these options and to provide some information that will help users choose between them.

Background on character encoding

Any text data stored in a computer uses some character encoding — a set of rules that specify how each character is stored in terms of bytes or byte sequences. Computer systems such as Microsoft Windows are designed to work with certain standard encodings. For instance, English Windows systems use an encoding that Microsoft calls “Western” or “code page 1252”.
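As a small illustration of what "rules that specify how each character is stored" means in practice, the following Python sketch encodes one character under code page 1252 and then reads the same byte back under two different Windows code pages. It assumes that Python's codec names `cp1252` and `cp1251` correspond to the Windows "Western" and "Cyrillic" code pages; it is not tied to anything in Paratext.

```python
# Encode one character under the Windows "Western" code page (cp1252).
e_acute_bytes = "\u00e9".encode("cp1252")  # U+00E9 is e-acute
print(list(e_acute_bytes))

# Decode the same single byte under two different Windows code pages.
raw = bytes([233])
as_western = raw.decode("cp1252")   # Western: e-acute
as_cyrillic = raw.decode("cp1251")  # Cyrillic: short i
print(as_western, as_cyrillic)
```

The byte itself carries no indication of which encoding was intended, which is why software has to be told, or has to assume, the encoding of the data it reads.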

Windows systems are able to use one of several different encodings.[2] Many languages around the world, however, use characters that are not supported by any of the standard encodings recognised by Windows. In the past, users have often resorted to creating fonts that use a custom encoding that includes the particular characters they need. This causes two problems: software tools do not know how they should handle the characters (for instance, where they are allowed to break lines of text), and different users may use different custom encodings for the same language, making it difficult to exchange data.

Recently, the computer industry has begun to adopt a new standard for character encoding known as Unicode. Unicode is a single character encoding standard that is designed to support all of the world’s languages. It is being used by more and more software programs. Thus, with Unicode it will become possible for all kinds of programs to know how to handle text in any language properly, and for users to be able to exchange data without worrying about incompatibilities.

Character encoding options in Paratext 6

When you start a new project in Paratext 6 or import a project created in an earlier version of Paratext, you need to let Paratext 6 know what encoding you want it to use. Paratext 6 is capable of working with three different kinds of character encoding: standard encodings defined within Windows, custom-defined encodings, or Unicode.

If you have existing data, you will first need to determine which of these three kinds of encoding you have used. Also, whether you have existing data or are starting a new project, you will need to decide which kind of encoding you will want to use with Paratext 6. So, let’s look at how to determine which kind of encoding might be used in existing data.

If you created your existing data using Shoebox or an earlier version of Paratext, and you are able to work with your data using any of the fonts that come with Windows, such as Arial or Times New Roman, then that data probably uses a standard Windows encoding. If, however, you need to use special fonts, for instance, fonts created using the SIL Encore Fonts system, or perhaps a specially-modified version of Times New Roman, then your data likely uses a custom encoding. The only scenario in which existing SFM[3] data is likely to use Unicode is if that data was created using the Toolbox program.[4] If you are at all uncertain about which kind of encoding your existing data uses, ask for help from a qualified computer-support consultant.

Now, let’s look at the other question: which kind of encoding should you use in Paratext 6? Internally, Paratext 6 always uses Unicode, so the issue is really one of how data is stored on the disk, though it also affects the choice of fonts and keyboards that you use. In Paratext 6, there are four choices:

1. Standard encodings defined in Windows: the data gets converted to Unicode internally using standard mapping tables in Windows, and Unicode-conformant fonts and keyboards (such as those provided with Windows) must be used.

2. Custom-defined encodings posing as Windows-standard encodings: when setting up your project in Paratext 6, you tell it that you are using a Windows-standard encoding, and the data gets converted to “Unicode” as though it were in that Windows-standard encoding. What results is data encoded internally using a non-conformant variant of Unicode. This is the easiest choice since you can simply continue working with your existing proprietary fonts and keyboards, but this approach involves some serious risks and potential problems.

3. Custom-defined encodings supported by encoding-conversion mapping tables: the data gets converted to conformant Unicode encoding using mapping tables that must be created, and Unicode-conformant fonts and keyboards (such as those provided with Windows) must be used.

4. Unicode (“UTF-8”): the data is stored directly in Unicode, so no conversion takes place in reading from or writing to the disk. Unicode-conformant fonts and keyboards (such as those provided with Windows) must be used.
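The difference between the second, third and fourth choices can be sketched in Python. Everything specific here is an assumption made for illustration: the custom encoding, its byte assignments, the sample word, and the helper names `decode_custom` and `encode_custom` are all hypothetical, and the dictionary only stands in for a real mapping table; it is not the format used by Paratext or any conversion tool.

```python
# Hypothetical mapping table for a custom 8-bit encoding.
# Bytes 0-127 are treated as ASCII; selected upper bytes are reassigned.
CUSTOM_TO_UNICODE = {
    0xE9: "\u014B",  # byte 233 -> LATIN SMALL LETTER ENG
    0xC9: "\u014A",  # byte 201 -> LATIN CAPITAL LETTER ENG
}
UNICODE_TO_CUSTOM = {char: byte for byte, char in CUSTOM_TO_UNICODE.items()}


def decode_custom(data: bytes) -> str:
    """Convert custom-encoded bytes to conformant Unicode text."""
    chars = []
    for byte in data:
        if byte in CUSTOM_TO_UNICODE:
            chars.append(CUSTOM_TO_UNICODE[byte])
        elif byte < 128:
            chars.append(chr(byte))
        else:
            raise ValueError(f"byte {byte} is not in the mapping table")
    return "".join(chars)


def encode_custom(text: str) -> bytes:
    """Convert Unicode text back to the custom encoding for storage."""
    out = bytearray()
    for char in text:
        if char in UNICODE_TO_CUSTOM:
            out.append(UNICODE_TO_CUSTOM[char])
        elif ord(char) < 128:
            out.append(ord(char))
        else:
            raise ValueError(f"{char!r} is not in the mapping table")
    return bytes(out)


stored = bytes([0x73, 0x69, 0xE9, 0x61])  # a made-up word, custom-encoded

posing_text = stored.decode("cp1252")    # custom data posing as "Western"
mapped_text = decode_custom(stored)      # custom data converted via the table
utf8_bytes = mapped_text.encode("utf-8")  # what direct Unicode storage holds

print(repr(posing_text), repr(mapped_text), utf8_bytes)
```

With the posing approach the eng is held internally as e-acute; with the mapping table it is held as a true eng and can be written back to the custom encoding unchanged; with UTF-8 storage no table is needed at all.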

Choosing a character encoding option in Paratext 6

Pros and cons of the different options

Overall, the fourth option, Unicode, is preferred for several reasons. First of all, Unicode is designed specifically to support languages from all over the world, including the language you are working with. Where multiple languages are being used, or multiple orthographies are used for a single language, Unicode allows all of the data to be maintained in a single encoding. The other options all force you to work with multiple encodings.

Data encoded in Unicode will be able to work with a wide variety of software applications on any recent computer system, even computers that use other operating systems, such as the Apple Macintosh. In contrast, Windows-standard encodings may not be recognised on other platforms, and transfer of data between platforms has long been recognised as a serious problem when working with custom-encoded data.

Also, when using the second option, no software, including Paratext 6, will know what rules to apply when processing your text. For instance, it may not always break lines where you expect, and when you double-click to select a word, it might not recognise the correct word boundaries. Such problems are avoided using Unicode.
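The word-boundary problem can be seen in a short Python sketch. It assumes a hypothetical custom encoding that places eng at byte 164, a position that code page 1252 assigns to the currency sign; the sample word is invented. Python's `\w+` pattern is used only as a simple stand-in for the more elaborate word-boundary rules that real software applies.

```python
import re

# A made-up word stored in a hypothetical custom encoding where byte 0xA4
# (164) is eng. In cp1252 that byte is the currency sign, a symbol.
stored = bytes([0x73, 0x69, 0xA4, 0x61])

# Custom data posing as Windows "Western": the eng is seen as a symbol.
posing_text = stored.decode("cp1252")
words_posing = re.findall(r"\w+", posing_text)

# The same word in conformant Unicode: the eng is seen as a letter.
unicode_text = "si\u014Ba"
words_unicode = re.findall(r"\w+", unicode_text)

print(words_posing)   # the word is split at the misidentified character
print(words_unicode)  # the word stays whole
```

Because the software decides what is a letter from the character it believes it has, the misidentified character splits the word in two.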

Moreover, very serious data-loss problems can occur if you attempt to use custom-encoded data with software designed to work with Unicode. Such problems can arise specifically if the custom-encoded data is presented to such software as though it were in a Windows-standard encoding (option 2): for certain accented characters (e.g., “é”), Unicode supports different encoded representations that are considered equivalent. Some Unicode-based software may change the data to use these other equivalent representations.

For instance, suppose your custom encoding uses the code 233 (decimal) for the eng character, “ŋ”, and suppose your data is interpreted by a Unicode-based program as though it were encoded using the Windows “Western” encoding (code page 1252). In that Windows-standard encoding, that code is used for e-acute, “é”. When that Unicode-based software reads in your data and converts it to Unicode, it will think that all your engs (“ŋ”) are e-acute (“é”), and it might decide to change them to another equivalent representation in Unicode (a sequence of two characters, “e” followed by the combining acute diacritic).[5] At this point, your data has been seriously damaged with no easy means of recovery. Such data loss will not occur within Paratext 6, but it may occur in other software that you also use for your language data, such as FieldWorks.
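The scenario just described can be reproduced with Python's standard `unicodedata` module. This is only a sketch of the mechanism: whether a given program decomposes characters, and how it writes data back to an 8-bit encoding, varies from program to program, so the write-back step shown here (dropping what cannot be encoded) is an assumption.

```python
import unicodedata

stored = bytes([233])  # one eng in the custom encoding described above

# A Unicode-based program reads the byte as code page 1252: e-acute.
misread = stored.decode("cp1252")

# The program switches to the canonically equivalent decomposed form:
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", misread)
code_points = [hex(ord(char)) for char in decomposed]

# Code page 1252 has no combining acute accent, so the decomposed text
# cannot be written back as it stands.
try:
    decomposed.encode("cp1252")
    writes_back_cleanly = True
except UnicodeEncodeError:
    writes_back_cleanly = False

# Assumed write-back policy: silently drop what cannot be encoded.
written = decomposed.encode("cp1252", errors="ignore")

print(code_points, writes_back_cleanly, list(written))
```

Under these assumptions the byte that was stored as 233 comes back as 101, a plain “e”, and nothing in the file records that it was ever an eng.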

Because Unicode is a widely-adopted industry standard, there is likely to be a large selection of high-quality fonts that can be used with data encoded in Unicode. The same fonts will work with standard Windows encodings; they will also work with custom-encoded data that gets converted in Paratext 6 into conformant Unicode. But custom-encoded data that gets converted into a non-conformant variation of Unicode in Paratext 6 (option 2) or that is used in other applications will require proprietary fonts. It should be noted that such fonts aren’t even guaranteed to work on all Windows systems!

As new versions of software have appeared, people working with custom encodings have often needed to revise those encodings because the original encodings no longer worked with the newer software. Indeed, as more and more software is written to work with Unicode, it is becoming increasingly difficult to work at all with custom encodings. For instance, in moving from Windows 98 to Windows 2000, many users have found that some of the characters in their custom encoding no longer display on the newer version of Windows. In contrast, Unicode is expected to be the software industry’s final solution for character encoding, and so data encoded using Unicode should never have to be fixed in order to work with new versions of software.

Many who have used custom encodings have used fonts that are categorised by Windows as “symbol” fonts.[6] If you try to use symbol fonts with Paratext 6, you are likely to encounter serious problems. Thus, Paratext 6 makes no claim to work with symbol fonts. This would be a potential issue only for the second option, custom encodings posing as Windows-standard encodings. Of course, if you choose to use Unicode, this issue does not arise.

Unicode data can be exchanged with others all over the world without worrying about incompatibilities or the need for data conversion or proprietary fonts. This includes being able to publish data directly on the World Wide Web without needing to convert to images or PDF format, or forcing readers to download proprietary fonts.

Unicode is the best choice for archiving of data since the encoding is very well documented. In contrast, archiving data that uses custom encodings is particularly problematic since one must also archive documentation explaining the encoding, as well as the proprietary fonts designed to work with that encoding.

If you choose to have your data stored using Unicode, then it will be stored the same way that it is processed internally. This has several important secondary effects.

Potential data loss: The data does not undergo any changes when it is saved to or read from the disk. In contrast, with the other encoding options, data is constantly being transformed, and such transformations create potential for data loss. For instance, since Unicode is used internal to Paratext 6, it becomes possible to include characters that are not supported in the 8-bit encoding used for storage; if such characters get entered in Paratext 6 (for instance, suppose you enter an em dash, but the em dash isn’t supported in your custom encoding), they would be lost once the file is closed or you quit that session. Even if the data in Paratext only ever includes characters supported by your 8-bit encoding, there can be situations in which it is difficult to ensure “round-trip integrity” such that the data can be converted from Unicode (inside Paratext) to the other encoding (on the disk) and back again without the data being changed adversely in any way.

Different encoded representations in copied text: If you use Unicode for your data and you copy data from Paratext 6 and paste it into another Unicode-capable program such as Word 2000 and then save it to disk from that program, that second copy of the data will use the same encoding as you are using for Paratext 6. In contrast, if you use one of the other encoding options, it’s easier to end up with data in different encodings.[7]

Different fonts needed in different working contexts: If Unicode is used for data storage, just as it is used internally in Paratext 6, then the same fonts can be used whether you are working with your data in Paratext 6 or have opened it in another Unicode-capable application. This will not be the case if you use option 3 (data is maintained in a custom encoding but converted into conformant Unicode within Paratext 6): in this situation, the fonts needed for working in other applications will likely differ from those used in Paratext 6.
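The point made above under “Potential data loss” can be sketched in Python. Here `latin-1` merely stands in for an 8-bit storage encoding that has no em dash, and replacing the unsupported character with a question mark is an assumed policy; the text above does not say exactly how Paratext 6 treats such characters on saving.

```python
# Text entered in a Unicode-based editor, containing an em dash (U+2014).
text = "wait \u2014 here"

# Saving to an 8-bit encoding that lacks the em dash. The stand-in
# encoding and the "replace" error policy are both assumptions.
saved_8bit = text.encode("latin-1", errors="replace")
reloaded_8bit = saved_8bit.decode("latin-1")

# Saving to UTF-8 involves no lossy conversion.
saved_utf8 = text.encode("utf-8")
reloaded_utf8 = saved_utf8.decode("utf-8")

print(repr(reloaded_8bit), reloaded_8bit == text)
print(repr(reloaded_utf8), reloaded_utf8 == text)
```

A simple check of this kind (convert, convert back, compare) is one way to monitor round-trip integrity when an 8-bit storage encoding cannot be avoided.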

Other factors to consider

I have listed many benefits of using Unicode, and in the process pointed out drawbacks of other approaches. The approach that involves the greatest risks, option 2, happens to be the one that will be the easiest to adopt for users upgrading from earlier versions of Paratext who have been using a custom encoding. It is important that you consider the issues carefully, and not choose the easiest path without weighing the long-term risks.

While there are many reasons in favour of choosing Unicode for your Paratext 6 data, there are two situations in which it might make sense to hold off from using Unicode. First, if you regularly need to process your Paratext 6 data using some software program that has not yet been updated to work with Unicode, it may be easier for you to continue using a custom or Windows-standard encoding until your other software has been updated to support Unicode.

Secondly, while Paratext 6 uses Unicode and so can process text containing thousands of characters from many different scripts, many of these characters require advanced fonts and font technologies for correct display. Paratext 6 relies on certain software components provided by Microsoft to handle these aspects of text display, and those components have not yet been developed to the point that they support all of the characters in Unicode. If you work with a language that uses characters not yet fully supported by these software components, having Paratext 6 work with your data internally in Unicode may mean that your data does not display entirely correctly (for instance, diacritics might not be positioned correctly).

Also, while Unicode is designed to support all of the world’s languages in a single encoding, it is still in development. In a relatively small number of instances, a language may be written using characters that are not yet supported in Unicode. Unicode provides mechanisms for working with characters not yet supported by the standard, and these mechanisms can be used in Paratext 6. If such characters require advanced font support, however, they will not be supported by Microsoft’s components for advanced font support, and so it may not be feasible to work with them using Unicode within Paratext 6.

In these two situations (on-going need to use older non-Unicode software, and lack of advanced font support for your particular characters), your only option is to continue to use your custom encoding and proprietary fonts (option 2). Keep in mind, though, that this option involves several risks and potential problems. If you are forced to work this way, you will need to be especially careful to monitor the integrity of your data.

Fortunately, adoption of Unicode in the computer industry is advancing very well in general. Therefore, there is good reason to hope that only a small number of users will fall into one of the two scenarios just described. Everyone else should be able to benefit from the support for Unicode built into Paratext 6.

Final notes

A few final observations should be made in relation to the use of Unicode by Paratext 6: While use of Unicode means that you are not limited to using proprietary fonts but can use any of a variety of Unicode-conformant fonts, you will still need to have Unicode-conformant fonts that support the particular selection of characters needed for the language you are working with. This applies to the first, third and fourth approaches to encoding in Paratext 6 described above. Similarly, you will need to have keyboard input methods that generate the appropriate codes used by the Unicode Standard.

Also, if you choose to use the third option (custom encoding supported by encoding-conversion mapping tables), it will be necessary to have a mapping table that works with the TECkit conversion processor in order to convert between the custom encoding used for storage on the disk and Unicode for processing internally within Paratext 6.

You may find it necessary to get assistance from a qualified computer-support consultant to obtain the fonts, keyboards and mapping tables you will need.

[5] In Unicode, these alternate representations are said to be “canonically equivalent.” The two-character sequence “e” followed by the combining acute diacritic is said to be a “decomposed” representation, while the single character “é” is referred to as a “composed” representation, or a “precomposed” character.

[6] The SIL Encore Fonts system allows you to create fonts using normal Windows “ANSI” encoding, or “symbol” encoding. Symbol-encoded fonts are handled differently in Windows and some Windows applications, and text that is formatted with a symbol-encoded font would likewise be handled differently. For instance, Microsoft Word does various character transformations, such as changing straight quotation marks into true curly quotation marks or capitalising the first character of a sentence, but it will not do these things if text is formatted with a symbol-encoded font. Since these kinds of changes were not always desirable when using custom-built fonts, some people created their custom fonts using symbol encoding so that applications like Word would not perform these character transformations.

[7] It should be noted that if you copy data from Paratext 6 and then go to paste it into another program that does not support Unicode, such as Shoebox, then the data will be converted by Windows as though it were encoded in a default Windows-standard encoding that is assumed by the system. This is true no matter which approach you take to encoding of your Paratext 6 data, and there is currently no easy way to control the encoding that is assumed by Windows.