Unicode -- Uphill both ways (Ruby Programming pt. 7)

I found this cool article on Unicode (it gets UTF-16 wrong, but that's OK). However, I'm running into a large wall dealing with Unicode in my program, so I'll put the problem out there in the hope that a solution presents itself.

So far my program checks each line of the file to see if it's ASCII-only text. If so, it reverses the line with Ruby's built-in reverse method.
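The ASCII-only step can be sketched roughly like this. The method name `reverse_if_ascii` is made up for illustration, and the sketch assumes Ruby 1.9 or later, where `String#ascii_only?` is available:

```ruby
# Hypothetical helper: reverse a line only when it is pure ASCII.
# Returns nil for lines that contain non-ASCII characters, so the
# caller can send those down a different path.
def reverse_if_ascii(line)
  return nil unless line.ascii_only?

  # chomp drops the trailing "\n" so it does not end up at the front
  # of the reversed line; it is added back afterwards.
  line.chomp.reverse + "\n"
end

puts reverse_if_ascii("hello\n")
```

Returning nil for the non-ASCII case is just one possible design; raising an error or passing the line through untouched would work too.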

If not, what I want is for it to read each hex pair (or foursome) and decide how wide the character is. If the value is U+007F or below, treat it as plain ASCII and pass the character as one element to an array. If it's between U+0080 and U+FFFF, take a two-byte chunk and pass it as one element to the array. And finally, if it's between U+010000 and U+10FFFF, take a three-byte chunk and pass it as one element to the array. Then read the elements of the array first in, last out (FILO), remove the end-of-line marker (\n), and put the elements into another array. Join that array, add an end-of-line element, and write the result to the file.
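Here is a minimal sketch of that chunk-and-reverse idea, assuming the file is saved as UTF-8. One caveat: in UTF-8 the byte widths differ from the ranges in the plan above. U+0000 to U+007F take one byte, U+0080 to U+07FF take two, U+0800 to U+FFFF take three, and U+10000 to U+10FFFF take four. The width can be read off the leading byte of each sequence. The method names `utf8_width` and `reverse_utf8_line` are invented for this example.

```ruby
# Width in bytes of a UTF-8 sequence, judged from its leading byte.
# Assumes well-formed UTF-8; anything that cannot start a sequence
# (for example a stray continuation byte 0x80..0xBF) raises an error.
def utf8_width(lead)
  case lead
  when 0x00..0x7F then 1
  when 0xC2..0xDF then 2
  when 0xE0..0xEF then 3
  when 0xF0..0xF4 then 4
  else raise ArgumentError, format("unexpected leading byte 0x%02X", lead)
  end
end

# Split a line into per-character byte chunks, reverse the chunks,
# and glue them back together with a trailing newline.
def reverse_utf8_line(line)
  bytes = line.chomp.bytes.to_a # drop the "\n", work on raw bytes
  chunks = []
  i = 0
  while i < bytes.length
    width = utf8_width(bytes[i])
    chunks << bytes[i, width]   # one element per character
    i += width
  end

  # Read the chunks last to first and rebuild a UTF-8 string.
  reversed = chunks.reverse.flatten.pack("C*").force_encoding("UTF-8")
  reversed + "\n"
end

puts reverse_utf8_line("ma\u00f1ana\n")
```

On Ruby 1.9 and later, `line.chomp.chars.reverse.join` should give the same result for valid UTF-8 input, so the manual byte walk is mostly useful for seeing what is going on underneath.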

So the first thing I need to do is find a way of reading the hexadecimal values of the characters. After a lot of looking I found a hex editor plugin for Notepad++, and though it doesn't do exactly what I want, I figured something out. The last ASCII character, U+007F, appears as 7F in the hex view of the file. Apparently Notepad++ hides the 00 of endianness. So anything up to 7F is what I want to move as one element to another array. And at least for now I can assume that everything above 80 is a two-byte element, till I figure out a way of reading the three-byte ones. It won't be perfect, but if it works it will be a step.
Now to try it out.
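The hex values can also be read from inside Ruby, without a hex editor. This is a small sketch; `hex_bytes` is a made-up helper name, and the examples assume the strings are UTF-8 encoded. In UTF-8, U+007F is stored as the single byte 7F, so there is no hidden 00 byte in that encoding (UTF-16 is the one that would show a 00 next to it).

```ruby
# Hypothetical helper: list the bytes of a string as two-digit hex values.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }
end

p hex_bytes("\u007F") # => ["7F"]
p hex_bytes("\u00e9") # => ["C3", "A9"]
p hex_bytes("\u20ac") # => ["E2", "82", "AC"]
```

The second and third examples show why "everything above 80 is two bytes" only holds for part of the range: é takes two bytes, while € already takes three.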

