2012-08-30

I was upgrading my LibreOffice installation to version 3.6.1 when I noticed question marks in the installer UI:

How can this be? The Ș and ț (s and t comma below) characters have been present in the Microsoft Sans Serif and Tahoma fonts since Windows 2000.

The only explanation is: ANSI installation program. No way! You would think that Windows Installer would be Unicode, but it's not! Michael Kaplan wrote about this weirdness seven years ago: MSI Databases and Unicode?

WiX also states on their help page: „Top-level elements like Product, Module, Patch, and PatchCreation support a Codepage attribute. You can set this to a valid Windows code page by integer like 1252, or by web name like Windows-1252. UTF-7 and UTF-8 are not officially supported because of user interface issues. Unicode is not supported.”
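The question marks can be reproduced outside of any installer. Here is a minimal sketch that uses Python's `cp1250` codec as a stand-in for an ANSI (Central European) code page; the helper name `to_ansi` is made up for illustration, and Windows' own conversion routines may behave differently in detail:

```python
# Comma-below letters (correct Romanian) and cedilla letters (legacy),
# written as escapes so the exact code points are unambiguous.
COMMA_BELOW = "\u0219\u0218\u021b\u021a"  # s, S, t, T with comma below
CEDILLA = "\u015f\u015e\u0163\u0162"      # s, S, t, T with cedilla


def to_ansi(text: str, codepage: str = "cp1250") -> bytes:
    """Encode text to a legacy code page; unmappable characters become '?'."""
    return text.encode(codepage, errors="replace")


print(to_ansi(COMMA_BELOW))  # the comma-below letters have no slot in cp1250
print(to_ansi(CEDILLA))      # the cedilla letters each map to a single byte
```

The comma-below letters simply do not exist in the code page, so a non-Unicode program has nothing to store except a replacement character.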

Programs like 7-Zip use MSI for their x64 builds because the NSIS installer did not support x64. NSIS officially supports neither Unicode nor x64, but there are forks which do (Unicode fork, x64 fork).

I've filed a bug (#54232) for LibreOffice. I guess they will use the old ş and ţ (s and t cedilla) characters to fix this problem.

Windows 95 End of Life was 1st of January 2003, Windows 98, Windows 98 Second Edition, and Windows ME End of Life was 1st of April 2007.

They haven't fixed this problem even though it has been five years since any ANSI operating system was last supported.

2012-08-05

One important "screen" is the TV screen. In Romania, movies are not dubbed; they are subtitled. All dialogue is presented as text that the viewer reads in order to follow the movie.

I remember I wanted to learn how to read because I wanted to read the movie subtitles.

Television

By using www.cool-itv.net, which uses P2P SopCast technology to distribute cable TV stations, I was able to analyze which Romanian diacritics were used in movie subtitles.

Below you will see some screenshots of some TV stations:

Almost all of the TV stations used the old diacritics - S and T cedilla (şŞţŢ) - with the exception of the last one which uses the correct diacritics - S and T comma below (șȘțȚ).

The state-owned public TV broadcasters (TVR1, TVR2 and so on) did their homework and their software can handle Unicode characters and they are using the correct Romanian diacritics. Chapeau!

At least some of them are consistent: Discovery, Animal Planet, ProTV, and Pro Cinema stick to a single diacritic style and do not mix s cedilla with t comma below like HBO, Antena1, and Kanal D do.

One interesting case was TCM which used A Caron (ǎ) instead of A Breve (ă). Also Prima used A Tilde (ã) in their promotional clips.

The usage of the old diacritics is due to the fact that the specialized TV software used is some old software written before Microsoft started promoting Unicode.

Old software and a subtitle standard from 1991 are responsible for the widespread use of incorrect Romanian diacritics. The 1991 standard is the EBU (European Broadcasting Union) TECH. 3264-E, which doesn't support Unicode characters.

Hopefully all this will change with the arrival of the new EBU-TT Subtitling Specification which was announced on 31st of July 2012. In a couple of years all TV stations will be using the correct Romanian diacritics.

Nobody uses Unicode to encode subtitle files, even though doing so would spare users from having to configure their media player of choice to use Windows-1250 / Central European / ISO-8859-2 as the default code page for subtitles.
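To see why the player's default code page matters, here is a small sketch that decodes the same subtitle bytes under two different code pages. The sample word is just an illustration, and Python's codecs stand in for whatever a media player does internally:

```python
# "stiinta" (science) written with the legacy cedilla letters and a breve,
# using escapes so the code points are unambiguous.
WORD = "\u015ftiin\u0163\u0103"

# Bytes as they would sit in a legacy single-byte subtitle file.
raw = WORD.encode("cp1250")

# A player configured for Central European text recovers the word.
print(raw.decode("cp1250"))

# A player left on the Western European default shows the wrong letters.
print(raw.decode("cp1252"))

# A UTF-8 file carries its letters unambiguously; no code page guess needed.
utf8 = WORD.encode("utf-8")
print(utf8.decode("utf-8"))
```

The bytes never change; only the player's guess about their meaning does, which is exactly the guess a Unicode subtitle file makes unnecessary.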

I've created a small Windows tool (133 KB) which automatically converts subtitles from the old diacritics to the correct ones. The tool can be downloaded from here.
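The core of such a conversion fits in a few lines. The following is only a sketch of the idea, not the code of the tool itself; the function names and the assumption that the input file is Windows-1250 are mine:

```python
from pathlib import Path

# Map the legacy cedilla letters to the correct comma-below letters.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s cedilla -> s comma below
    "\u015e": "\u0218",  # S cedilla -> S comma below
    "\u0163": "\u021b",  # t cedilla -> t comma below
    "\u0162": "\u021a",  # T cedilla -> T comma below
})


def fix_diacritics(text: str) -> str:
    """Replace cedilla letters with their comma-below counterparts."""
    return text.translate(CEDILLA_TO_COMMA)


def convert_subtitle(src: str, dst: str, source_encoding: str = "cp1250") -> None:
    """Read a legacy subtitle file and write it back as UTF-8 (assumed encodings)."""
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(fix_diacritics(text), encoding="utf-8")
```

Writing the result as UTF-8 is a deliberate choice here, because the comma-below letters cannot be stored in Windows-1250 at all.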

Below is a screenshot of the tool:

I hope Unicode subtitles will be more popular in the future. There is no need to stick to ANSI code pages!