Recode Windows-1252 characters as UTF-8

You are dealing with "\u008a" or "\u009a" strings in your database. They get rendered correctly as "Š" or "š" in a browser, but you cannot display them in a command-line console. Ruby nevertheless reports them as valid UTF-8.
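A quick way to see the symptom in Ruby is to inspect such a string. This is only an illustrative sketch; the variable name `sample` is made up for the example.

```ruby
# "sample" is a hypothetical name; it holds the code point U+008A.
sample = "\u008a"

sample.encoding         # => #<Encoding:UTF-8>
sample.valid_encoding?  # => true, Ruby sees nothing wrong with it
sample.bytes            # => [194, 138], i.e. 0xC2 0x8A
```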

Luckily, code points U+0080 to U+009F, which are exactly the range where Windows-1252 differs from ISO-8859-1, are non-printable control characters in Unicode. It is therefore safe to assume that any such character is a misinterpreted Windows-1252 byte, which lets us match and recode them.

Here we strip the first byte (0xc2) of the (wrong) UTF-8 encoding, create a new single-character string from the second byte, tell Ruby it is Windows-1252, and let Ruby itself do the encoding to UTF-8.
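The steps described above could be sketched roughly as follows. This is a minimal illustration, not necessarily the exact code this tip was written around: the method name `recode_windows_1252` is hypothetical, and the `rescue` branch is an added assumption, because a few bytes in this range (0x81, for example) have no Windows-1252 character assigned and are left untouched here.

```ruby
# Hypothetical helper; the name is not from the original tip.
def recode_windows_1252(str)
  # Match the code points U+0080..U+009F (UTF-8 bytes 0xC2 0x80..0x9F).
  str.gsub(/[\u0080-\u009f]/) do |ch|
    begin
      # Drop the leading 0xC2 byte and keep only the second byte.
      byte = ch.bytes.last
      # Build a one-byte string, label it Windows-1252, and let Ruby
      # transcode it to UTF-8.
      byte.chr.force_encoding("Windows-1252").encode("UTF-8")
    rescue EncodingError
      # Assumption: bytes with no Windows-1252 mapping stay as they were.
      ch
    end
  end
end

recode_windows_1252("\u008akoda")  # => "Škoda"
```

Everything outside the U+0080..U+009F range passes through unchanged, so the method can be applied to whole strings that mix correct and wrongly interpreted characters.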

1 Response

It helps me a lot in reading the German Postbank CSV. The first conversion gives the umlauts, and the second (the method above) gives the Euro sign. What I do not know is why I have to convert twice. Here is the code: