4 Answers

Input encoding is on the input side, i.e. getting the characters from your input file into (La)TeX correctly.
Font encoding is on the output side, i.e. "I want to print an 'A'; where in my font do I find it?".

Should I always specify both in any document I create? And what about the relationship between font encoding and font? For example: I choose T1 font encoding and bera font. I am a newbie in this field. :) Thank you.
–
xport Dec 3 '10 at 7:46


@xport: Yes, you should specify both in your document (because you don't know the defaults, nor how stable/portable they are). The font package you use should support the selected encoding. I guess that T1 is a sufficiently common one - for non-deprecated font packages, at least.
–
maxschlepzig Dec 3 '10 at 8:38

@maxschlepzig the default is ascii input (with the usual tex weirdnesses), ot1 output. it won't change unless latex is redone all over. (my guess is "redone all over" will happen when we get latex 3 as a real format, rather than a series of latex 2e files.)
–
wasteofspace Jul 10 '13 at 21:27

A file stored in your computer is just a sequence of numbers in the range 0–255. When you give a file as input to a program, this program has to know how to interpret this sequence of numbers. A text editor will interpret the file by associating a certain glyph to each number (or subsequence of numbers) based on well defined rules, which I'll call "code pages".

In the olden times the allowed numbers were in the range 0–127 and there was only one code page, ASCII (actually some systems used another method called EBCDIC, but this is not very relevant). Of course such a limited range made it impossible to represent the necessary characters for languages using accents, diacritics or alphabets altogether different from the basic Latin alphabet. Therefore many code pages were developed, filling the range 128–255 in different ways.

TeX uses a different interpretation than text editors: the numbers it reads are subject to the tokenization process, so that control sequences can be distinguished from normal characters. But in the end, TeX typesets characters, and when TeX is told to "typeset character number x", it outputs the glyph it finds in position x of the current font.

Original TeX (version 2) could really understand only input in the range 0–127, but could manage fonts with 256 glyphs, since some sequences of numbers could be mapped to glyphs via information contained in the current font metric file (this is still how --- is mapped to an em-dash). This approach proved impractical for managing different languages, as it required different metric files for each code page. With TeX 3 the situation changed a bit, because all input in the range 0–255 became legal, but this still didn't solve the problem of different metric files for different code pages.

The LaTeX development team therefore devised a new strategy. You can tell LaTeX what code page you're going to use for input (say latin1 or koi-8r) and what code page you need for output. Some standard output code pages were developed: T1, T2A, T2B, T2C, T3, T4; the first one for the Latin alphabet, the T2x encodings for the Cyrillic alphabet, T3 for IPA, T4 for the African-Latin alphabet.

The method works by providing an intermediate layer, called LaTeX Internal Character Representation (LICR). Each input character is changed, based on the information given at the start with

\usepackage[<codepage>]{inputenc}

to its LICR form; for instance, the number corresponding to à in the Latin-1 code page is mapped to \`a. In turn \`a is mapped to a single glyph when the current output encoding is T1, or to a set of lower-level TeX instructions for printing a grave accent over the a when the current output encoding is, say, T2A.
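
As a hedged illustration (not part of the original answer), here is a minimal document showing the two layers together; it assumes the source file really is saved in the Latin-1 code page:

\documentclass{article}
\usepackage[latin1]{inputenc} % input side: byte 224 in the file becomes the LICR \`a
\usepackage[T1]{fontenc}      % output side: \`a becomes a single glyph of a T1-encoded font
\begin{document}
città % typeset with the precomposed à found in slot 224 of the T1 font
\end{document}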

Having a limited number of output code pages is necessary for avoiding the need to develop font files for each input code page, while maintaining the possibility of correctly hyphenating words: TeX is able to hyphenate only words formed by characters in the same font and not containing "built up" glyphs, such as accents over base characters.

The situation changed a bit with the advent of Unicode and UTF-8, which is a method for representing Unicode with sequences of one to four bytes read from a computer file. When you say

\usepackage[utf8]{inputenc}

you're essentially free from the "only 256 characters" limitation for the input encoding, but the limitation is still present for output encodings. Thus you can't expect LaTeX to correctly interpret mixed input using, say, Latin, Cyrillic, Devanagari and Chinese without properly segregating each part so that the right output encoding is used.

For example, a document written both in Italian and Russian will use the babel environments and commands to switch between languages and, implicitly, between output code pages. So, if one announces

\usepackage[italian,russian]{babel}

the base language will be Russian and the default output code page will be T2A (automatically selected by babel). When \foreignlanguage{italian}{parola} is found, the code page is temporarily changed to the one automatically selected for Italian. Unfortunately, this opens a problem: the default output code page for Latin text is still, for compatibility reasons, the "original TeX" one with only 128 glyphs, so in this case it's best to also say

\usepackage[T1]{fontenc}

so that the Italian parts will use a 256-glyph font that also contains accented characters.

If one says only \usepackage[utf8]{inputenc} and doesn't announce the Russian language to babel, many "Undefined character" errors would be raised when trying to use Cyrillic characters, because the characters wouldn't correspond to anything in the current font.
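
As a hedged sketch (the exact preamble is not quoted in the answer), such a mixed Italian/Russian document could start like this; loading both output encodings explicitly does no harm, even though babel selects T2A by itself:

\usepackage[T1,T2A]{fontenc}        % T1 for the Italian parts, T2A for the Russian ones
\usepackage[utf8]{inputenc}         % Latin and Cyrillic letters in one UTF-8 source file
\usepackage[italian,russian]{babel} % the last-named language, Russian, is the main one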

Important notice

The editor you use for writing LaTeX documents is not TeX or LaTeX. You have to take care that the input mapping used by the editor corresponds to the input mapping used by LaTeX. There is no general advice for how to ensure this equivalence, because each editor has its own ideas about it.

The best answer - but I have one question: Could you elaborate on the following text, "Having a limited number of output code pages is necessary [...] accents over base characters". (emphasis mine) To the user, only input encodings "ought" to be relevant, because LaTeX is an algorithm/function from input encodings to a visual representation in a pdf-file. Thus, the concept of an output encoding is really algorithm-internal. I feel that a user "should" not need to worry about these. I understand there are historical reasons for the limitations. Which? (Maybe you said so already.)
–
Lover of Structure Sep 24 '12 at 1:42


@user14996 In most instances the output encoding doesn't bother (much) the end user. However, multiplying them poses some problems: in order to support an output encoding, a font must be accompanied by .fd files on the LaTeX level and by .tfm files (and, probably, also .vf files) at the typesetting level. The TeX Gyre Termes font already has eight .tfm files, corresponding to the supported output encodings; and notice that it doesn't support Greek or Cyrillic. So the necessity is not "theoretical", but mainly practical.
–
egreg Sep 24 '12 at 6:35

You need an input encoding to tell TeX how to interpret the contents of your text file, and you need a font encoding for proper hyphenation. Old TeX can only hyphenate words from one font, and therefore you need to squeeze all the characters you use (including all accented ones) into one font. If you need more than 256 different characters in one word, you're out of luck.

So the input encoding is very important, as a wrong input encoding makes it impossible for TeX to interpret the text correctly. The font encoding is not that important, as long as all your characters are represented. There is, for example, the T1 (== ec, tex256) encoding, which is widely used, but there are others, such as the original OT1 font encoding.
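
As a hedged one-line sketch, selecting T1 instead of the OT1 default is done in the preamble:

\usepackage[T1]{fontenc} % 256-glyph encoding: accented letters are single glyphs, so words containing them can be hyphenated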

More details

When you write a German text, you'll get to know input encoding and output encoding really fast. Assume you want to write the word draußen (which means 'outside'). The ß is not in the ASCII code, and thus with 8-bit encodings such as Windows ANSI or Latin-1 the code for the ß differs from encoding to encoding. Why is this a problem? If you want a ß to appear in your PDF, TeX needs to put a distinct piece of information into the output.

Input encoding

Let's assume TeX wants to use the byte "255" for the ß. Therefore you need a mapping

my input code ---> \ss ---> 255

Assume you use latin1 input (as it used to be standard on Linux). The word "draußen" is saved on your hard drive as bytes 100, 114, 97, 117, 223, 101, 110, and latin1.def defines \DeclareInputText{223}{\ss}, so when it sees the byte 223 it transforms it into \ss (that is the input encoding). If you use a Windows machine with codepage 850 encoding, the byte sequence is 100, 114, 97, 117, 225, 101, 110, and you need a different mapping for that (the file cp850.def, which uses \DeclareInputText{225}{\ss}).
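
As a hedged sketch, the practical consequence is that the inputenc option must match the code page your editor used when saving the file:

\usepackage[latin1]{inputenc} % file saved as Latin-1: byte 223 is mapped to \ss
%\usepackage[cp850]{inputenc} % same text saved with codepage 850: byte 225 is mapped to \ss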

Font encoding

But how does TeX know what to do with the \ss? When you load the T1 font encoding, it defines:

\DeclareTextSymbol{\ss}{T1}{255}

so this is the magic part that leads TeX to choose the byte number 255 when it encounters a \ss in its input. But is that all? No, of course not. The PDF format needs to map the byte 255 to a glyph! This is done in the encoding file, for example tex256.enc: there the font gets re-mapped so that the byte 255 is matched to the glyph named /germandbls.

It gets even more ugly

You can even make things worse by re-mapping the encodings internally using virtual fonts. This is a common practice with the psnfss package. But I won't give details here, because they don't help in understanding the inputenc/fontenc subject.

It can't be only the hyphenation that's affected. The times I curse about fontenc are when I change fonts for a document and then suddenly get bitmap fonts instead of scalable fonts because I chose the wrong font encoding.
–
Christian Apr 23 '12 at 6:36


@Christian Then your font is not correctly installed or the TeX system is misconfigured. TeX never uses a bitmap version of a TrueType font (except for special cases). It seems that it can't find a font and therefore tries to load other fonts.
–
topskip Apr 23 '12 at 7:00

Yeah, you are right. Now that I think of it these problems usually occurred during font installation or when changing systems. It is still weird that changing fontenc made errors go away though :/
–
Christian Apr 23 '12 at 9:51

Which input encoding option do I have to specify when my TeX input file is saved in Unicode encoding? Note that there are four file encoding options available in Notepad's Save As dialog box: ANSI, Unicode, Unicode Big Endian, and UTF-8.
–
kiss my armpit Apr 24 '12 at 18:33

In my answer on Greek text I gave an explanation which I'll basically repeat and extend here. I'll be talking only about text encoding, as it is rather obvious that math requires special treatment.

Please also look at encguide where all this is explained in more detail.

First of all, all this refers to TeX engines like pdftex which only support 8-bit encodings on the input as well as on the font side.

This means the engine can only understand 256 different characters in input text and output at most 256 different characters in a given font.

For Unicode-aware engines like LuaTeX and XeTeX, all this is irrelevant.

It shouldn't be surprising that 256 slots are too few to express all the characters there are, even without considering Chinese and the like. encguide has some code tables, and looking at the table for T1 (which is the "standard" in some sense) it is obvious that it's completely stuffed and still some characters are missing. T1 represents the effort to squeeze in every character needed for European languages, but it still misses some special characters for Polish or Lithuanian, for instance, not to speak of Greek or Russian.

Now, what is the strategy for producing documents which can contain more than 256 different characters? The answer, of course, is font encodings. Basically, this means

There is a dedicated font for every encoding, containing the 256 characters specified by the encoding in the right places.

Some LaTeX internals are redefined to accommodate the encoding.

Note that when using a PostScript or TTF or OTF font which can contain more than 256 characters, it is not strictly necessary to have different fonts for different encodings; TeX offers "virtual fonts" as an intermediary concept. But I don't know a thing about those ;-)

Back to encodings. What happens internally when I say, for instance, \fontencoding{T2A}\selectfont?

Internally, the font is switched to one which accommodates this encoding. That is, inside the font the characters will be in the places specified by this encoding.

Some LaTeX internals will switch their meaning such that text output will map to the correct font characters.
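
As a hedged usage sketch (assuming T2A has been loaded with \usepackage[T2A]{fontenc} and a T2A-encoded version of the current font is available), such a switch can be made locally in the document:

{\fontencoding{T2A}\selectfont \CYRZH} % prints Ж, taken from the T2A-encoded font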

How is the latter achieved?

Every font encoding is associated with a file <encname>enc.def containing definitions like the following. The file is loaded automatically when the declaration \usepackage[<encname>]{fontenc} is issued.

\DeclareTextSymbol{\AE}{T1}{198}

This means that, for instance, the control sequence \AE will be defined such that when the font encoding T1 is active, the character number 198 is produced. Several \DeclareTextSymbol declarations for the same control sequence with different encodings are possible; for instance, the LY1 encoding defines

\DeclareTextSymbol{\L}{LY1} {128}

so the Polish Ł is in a different place in fonts having the respective encodings, but LaTeX will sort that out internally.
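
For comparison, a hedged sketch of the corresponding T1 declaration (the slot number is taken from the standard T1 layout; treat it as illustrative):

\DeclareTextSymbol{\L}{T1}{138}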

Basically, this means that LaTeX can internally handle any number of different text characters, but can output only those which are available in the currently active font encoding.

But where do the control sequences like \AE or \CYRZH which are defined here come from?

Of course, you can always type them as control sequences, but when you are writing a text in a specific language, you are of course expecting to be able to type the characters available on your keyboard directly.

This is handled by input encodings. For convenience, input encodings are organised based on the "code pages" which are (or were) usual on computer systems with an 8-bit input scheme anyway, that is, ISO 8859-<something> (called latin<something> in LaTeX), applemac or ansinew.

Typing text into a text editor with a certain code page active means that by typing a certain character, its associated code for that code page is written to the text file.

To declare a certain code page to LaTeX, there exists a file <code page name>.def containing declarations like the following. The file is loaded automatically when the declaration \usepackage[<code page name>]{inputenc} is issued.

\DeclareInputText{198}{\AE}

Internally, this means the input character 198 will produce the control sequence \AE when the latin1 (meaning ISO 8859-1) input encoding is active, and \CYRC when the iso88595 (meaning ISO 8859-5) input encoding is active (whose .def file contains \DeclareInputText{198}{\CYRC} instead).