Update: Sudbury
So, it took us 8 hours to reach Sudbury. Sudbury is only about a 3 and half hour drive by car on the highway. You could call this a leisurely pace.

We are now on tracks that were originally owned by the Canadian Northern Railway (not to be confused with the Northern Railway of Canada, whose tracks we were on before). These tracks stretch across the country from Vancouver to Quebec City. Canadian Nothern Railway was the second railway to provide transcontinental service across Canada (the first being Canadian Pacific Railway, who still dominate Canada’s rail industry today).

Now these tracks are owned by CNR (Canadian National Railway).

Update: Hornepayne

If you look at a population map of Canada, you’ll see there’s a mostly unpopulated gap between Sault Ste Marie and Kenora, separating western Canada from eastern Canada. I’m currently in that gap.

I didn’t realize it would be so hard to get internet here. I haven’t managed to get an internet connection since leaving Sudbury, and I suspect I won’t be able to until we’re near Thunder Bay.

I’m supposed to be working today, but lack of internet is really hindering what I can do. I spent a while glued to my phone, hoping to glimpse a bar of service or two, but eventually gave up. It’s frustrating, because I’m trying to submit a patch for review, and I was hoping to get it approved before this weekend. I don’t like my chances of getting it reviewed next week when everyone’s busy at Whistler.

After a while, I went to the observation car and stared out the window, which I found relaxing. There seem to be countless beautiful lakes and rivers up here. We never seem to be far from a watercourse. I wonder if these rivers were a route used by fur traders in the early days of Nouvelle France.

The train stopped a couple times to drop off canoeists. I suspect we’re in a part of the country only accessible by rail.

I saw a beaver! You know, I’ve been in Canada all these years, and I think that’s the first time I’ve actually seen a beaver.

Most of the signs on this train are embossed with Braille. The signs use uncontracted Braille. I notice that the English is prefixed with dot-6 (⠠) while the French is prefixed with dot-46 (⠨). I wonder if that’s a standard convention.

Update: Sioux Lookout

We’re near Thunder Bay. For about 15 minutes we had cell phone and internet coverage, and everyone was staring to their phones.

We’re now on tracks that were part of the National Transcontinental Railway, built in 1885. This route spanned from Winnipeg to Moncton. The section from Winnipeg to Quebec City is almost a straight line stretching across northern Ontario and northern Quebec. The idea was that it would provide the quickest route from the Prairies to the Maritimes by bypassing cities like Toronto, Ottawa, Montreal. The railway was never successful.

Most of the railway is now abandoned and owned by no one. Parts of it are owned by CNR, and we’re currently on one of those parts.

Many francophone towns in Northern Ontario, like Kapuskasing and Hearst, lie along the abandoned portion of this railway. Those towns are now serviced by Highway 17.

But there are no highways along this section. There’s barely a road in sight.

I’m currently sitting aboard Via Rail’s The Canadian, waiting for the train to depart from Toronto’s Union Station. I’m on my way to a company-wide Mozilla event in Whistler.

This isn’t the first time I’ve travelled on Via Rail, but this is the first time I’ll be travelling all the way to Vancouver.

I’m thinking I might live blog it.

Update: Vaughan

The train is currently stationary, and I figure we’re probably in Vaughan, north of Toronto. We passed a sign saying “Parc Downsview Park” about 10 minutes ago, so that must mean we’re on Metrolinx’s Barrie Line (formerly the CNR Newmarket Sub).

According to Wikipedia, this is the oldest trackage in Ontario, dating back to 1835. This track was laid by the Northern Railway of Canada, which was acquired by Grand Trunk Railway in 1888, which was in turn absorbed into CNR in 1923. CNR sold this trackage to Metrolinx in 2009.

I guess we’re waiting for a freight train to pass. (They said that would happen a bit.)

Oh, we’re moving again.

Update: Richmond Hill

We just passed Richmond Hill Centre. I recognize it from the giant Cineplex.

I remember when there was a proposal to build a casino here. But that didn’t happen.

Occasionally you hear people complain that Richmond Hill (and nearby Markham) is too dominated by Chinese people. That’s obviously bullshit. If this place was dominated by Chinese people, there would have been a casino here years ago.

Update: ???

I have no idea where I am. It is completely dark outside and I can’t see a thing.

The Unicode committee is very clear that U+2019 (RIGHT SINGLE QUOTATION MARK) should represent the English apostrophe.

Section 6.2 of the Unicode Standard 7.0.0 states:

U+2019 […] is preferred where the character is to represent a punctuation mark, as for contractions: “We’ve been here before.”

This is very, very wrong. The character you should use to represent the English apostrophe is U+02BC (MODIFIER LETTER APOSTROPHE). I’m here to tell you why why.

Using U+2019 is inconsistent with the rest of the standard

Earlier in section 6.2, the standard explains the difference between punctuation marks and modifier letters:

Punctuation marks generally break words; modifier letters generally are considered part of a word.

Consider any English word with an apostrophe, e.g. “don’t”. The word “don’t” is a single word. It is not the word “don” juxtaposed against the word “t”. The apostrophe is part of the word, which, in Unicode-speak, means it’s a modifier letter, not a punctuation mark, regardless of what colloquial English calls it.

According to the Unicode character database, U+2019 is a punctuation mark (General Category = Pf), while U+02BC is a modifier letter (General Category = Lm). Since English apostrophes are part of the words they’re in, they are modifier letters, and hence should be represented by U+02BC, not U+2019.

(It would be different if we were talking about French. In French, I think it makes more sense to consider «L’Homme» as two words, or «jusqu’ici» as two words. But that’s a conversation for another time. Right now I’m talking about English.)

Using U+2019 breaks regular expressions

When doing word matching on Unicode text, programmers might reasonably assume they can detect “words” with the regex /\w+/ (which, in a Unicode context, matches characters with General Category L*, M*, N*, or Pc). This won’t actually work with English words that contain apostrophes if the apostrophes are represented as U+2019, but it will work if the apostrophes are represented as U+02BC.

To be fair, this problem exists in ASCII right now, where /\w+/ fails to match \x27 (the ASCII apostrophe). This leads to common bugs where users named O’Brien get told they can’t enter their name on a form, or where blog titles get auto-formatted as “Don’T Stop The Music”. Programmers soon learn they need to include the ASCII apostrophe in their regex as an exception.

But we shouldn’t be perpetuating this problem. When a programmer is writing a regex that can match text in Chinese, Arabic, or any other human language supported by Unicode, they shouldn’t have to add an exception for English. Furthermore, if apostrophes are represented as U+2019, the programmer would have to add both \x27 and \u2019 to their regex as exceptions.

The solution is to represent apostrophes as U+02BC, and let programmers simply write /\w+/ to match words like O’Brien and don’t.

[Edit: If you’re about to tell me that word segmentation should be done using UAX #29, guess what: Using U+2019 for the apostrophe breaks UAX #29! Using U+02BC would fix it. See comments. –Ed]

Using U+2019 means that Word Processors can’t distinguish between apostrophes and actual quotation marks, leading to a heap of problems.

How many times have you seen things like ‘Tis the Season or Up and at ‘em with the apostrophe curled the wrong way, because someone’s word processor mistook the apostrophe for an opening single quotation mark? Or, have you ever cut-and-pasted a block of text to use as a quote, put quotation marks around it, and then had to manually change all the nested quotation marks from double to single? Or maybe you’ve received text from the UK to use in your American presentation, but first you had to change all the quotation marks, because the UK prefers single-quotes while the US prefers double-quotes.

These are all things your word processor should be able to handle automatically and properly, but it can’t due to the ambiguity of whether a U+2019 character represents a single quotation mark or an apostrophe. We wouldn’t have these problems if apostrophes were represented by U+02BC. Allow me to explain.

1) In my perfect world, Word processors would automatically ensure that quotes (and nested quotes) were properly formatted according to your locale’s conventions, whether it be US, UK, or one of the myriad crazy quote conventions found across Europe. All quote marks would be reformatted on the fly according to their position in the paragraph, e.g. changing ‘ to “ (and vice versa) or ’ to ” (and vice versa).

2) Right now, Microsoft word users use the " key to type a double-quote (either opening or closing) and the ' key to type either a single-quote (either opening or closing) or an apostrophe. In my perfect world, since the word processor could automatically convert between single- and double-quotes as needed, you’d only need one key (the " key) to type all quotation marks, and the ' key would be reserved exclusively for typing apostrophes. Therefore, the word processor would know that '-t-i-l means ʼtil, not ‘til.

But because U+2019 can represent either an apostrophe or single quote, it’s hard for Word Processors to do (1), which means they also can’t do (2).

So whenever you see ‘Til we meet again with the apostrophe curled the wrong way, remember that’s because of the Unicode committee telling you to use U+2019 for English apostrophes.

Common bloody sense

For godsake, apostrophes are not closing quotation marks!

U+2019 (RIGHT SINGLE QUOTATION MARK) is classed as a closing punctuation mark. Its general category is Pf, which is defined in section 4.5 of the standard as meaning:

When you use U+2019 to represent an apostrophe, it’s behaving as neither a Ps or Pe. (Or, if it is, it’s an unbalanced one. As of Unicode 6.3, the Unicode bidi algorithm attempts to detect bracket pairs for bidi processing. They would be unable to do the same for quotation marks, due to all these unbalanced “quotation marks”.)

Compare that to U+02BC (MODIFIER LETTER APOSTROPHE) which has “apostrophe” right in its name. LOOK AT IT. RIGHT THERE, IT SAYS APOSTROPHE.

Earlier this year I had the pleasure of implementing for Firefox OS an input method for Vietnamese (a language I have some familiarity with). After being dissatisfied with the Vietnamese input methods on other smartphones, I was eager to do something better.

I believe Firefox OS is now the easiest smartphone on the market for out-of-the-box typing of Vietnamese.

The Challenge of Vietnamese

Vietnamese uses the Latin alphabet, much like English, but it has an additional 7 letters with diacritics (Ă, Â, Đ, Ê, Ô, Ơ, Ư). In addition, each word can carry one of five tone marks. The combination of diacritics and tone marks means that the character set required for Vietnamese gets quite large. For example, there are 18 different Os (O, Ô, Ơ, Ò, Ồ, Ờ, Ỏ, Ổ, Ở, Õ, Ỗ, Ỡ, Ó, Ố, Ớ, Ọ, Ộ, Ợ). The letters F, J, W, and Z are unused. The language is (orthographically, at least) monosyllabic, so each syllable is written as a separate word.

This makes entering Vietnamese a little more difficult than most other Latin-based languages. Whereas languages like French benefit from dictionary lookup, where the user can type A-R-R-E-T-E and the system can from prompt for the options ARRÊTE or ARRÊTÉ, that is much less useful for Vietnamese, where the letters D-O can correspond to one of 25 different Vietnamese words (do, dò, dó, dô, dỗ, dơ, dở, dỡ, dợ, đo, đò, đỏ, đó, đọ, đô, đồ, đổ, đỗ, đố, độ, đơ, đờ, đỡ, đớ, or đợ).

Other smartphone platforms have not dealt with this situation well. If you’ve tried to enter Vietnamese text on an iPhone, you’ll know how difficult it is. The user has two options. One is to use the Telex input method, which involves memorizing an arbitrary mapping of letters to tone marks. (It was originally designed as an encoding for sending Vietnamese messages over the Telex telegraph network.) It is user-unfriendly in the extreme, and not discoverable. The other option is to hold down a letter key to see variants with diacritics and tone marks. For example, you can hold down A for a second and then scroll through the 18 different As that appear. You do that every time you need to type a vowel, which is painfully slow.

Fortunately, this is not an intractable problem. In fact, it’s an opportunity to do better. (I can only assume that the sorry state of Vietnamese input on the iPhone speaks to a lack of concern about Vietnamese inside Apple’s hallowed walls, which is unfortunate because it’s not like there’s a shortage of Vietnamese people in San José.)

Crafting a Solution

To some degree, this was already a solved problem. Back in the days of typewriters, there was a Vietnamese layout called AĐERTY. It was based on the French AZERTY, but it moved the F, J, W, and Z keys to the periphery and added keys for Ă, Đ, Ơ, and Ư. It also had five dead keys. The dead keys contained:

a circumflex diacritic for typing the remaining letters (Â, Ê, and Ô);

the five tone marks; and

four glyphs each representing the kerned combination of the circumflex diacritic with a tone mark, needed where the two marks would otherwise overlap

My plan was to make a smartphone version of this typewriter. Already it would be an improvement over the iPhone. But since this is the computer age, there were more improvements I could make.

Firstly, I omitted F, J, W, and Z completely. If the user needs to type them — for a foreign word, perhaps — they can switch layouts to French. (Gaia will automatically switch to a different keyboard if you need to type a web address.) And obviously I could omit the glyphs that represent kerned pairs of diacritic & tone marks, since kerning is no longer a mechanical process.

The biggest change I made is that, rather than having keys for the five tone marks, words with tones appear as candidates after typing the letters. This has numerous benefits. It eliminates five weird-looking keys from the keyboard. It eliminates confusion about when to type the tone mark. (Tone marks are visually positioned in the middle of the word, but when writing Vietnamese by hand, tone marks are usually added last after writing the rest of the word.) It also saves a keystroke too, since we can automatically insert a space after the user selects the candidate. (For a word without a tone mark, the user can just hit the space bar. Think of the space bar as meaning “no tone”.)

This left just 26 letter keys plus one key for the circumflex diacritic. Firefox OS’s existing AZERTY layout had 26 letter keys plus one key for the apostrophe, so I put the circumflex where the apostrophe was. (The apostrophe is unused in Vietnamese.)

In order to generate the tone candidates, I had to detect when the user had typed a valid Vietnamese syllable, because I didn’t want to display bizarre-looking nonsense as a candidate. Vietnamese has rules for what constitutes a valid syllable, based on phonotactics. And although the spelling isn’t purely phonetic (in particular, it inherits some peculiarities from Portuguese), it follows strict rules. This was the hardest part of writing the input method. I had to do some research about Vietnamese phonotactics and orthography. A good chunk of my code is dedicated to encoding these rules.

Knowing about the limited set of valid Vietnamese syllables, I was able to add some convenience to the input method. For example, if the user types V-I-E, a circumflex automatically appears on E because VIÊ is a valid sequence of letters in Vietnamese while VIE is not. If the user types T to complete the partial word VIÊT, only two tone candidates appear (VIẾT and VIỆT), because the other three tone marks can’t appear on a word ending with T.