This text assumes that the current locale uses UTF-8 encoding. Behavior might differ for other encodings. I used Ruby 2.0 to evaluate the examples.

I was recently doing some text parsing in my native language, Polish. Polish uses the Latin alphabet with a few additional accented letters, such as “ąęśćłóźżń”. Some unexpected issues occurred when I tried to use them with regular expressions in Ruby.

Accent characters

Let’s first take a look at how Polish characters are represented in bytes:
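A minimal sketch of such an inspection, assuming a UTF-8 source file (the default since Ruby 2.0). The helper name `utf8_bytes` is made up for illustration and is not part of Ruby:

```ruby
# Hypothetical helper: returns the UTF-8 bytes of a string as integers.
def utf8_bytes(text)
  text.encode("UTF-8").bytes.to_a
end

puts utf8_bytes("test").inspect  # [116, 101, 115, 116]
puts utf8_bytes("teść").inspect  # [116, 101, 197, 155, 196, 135]

# Both words have four characters, but "ś" and "ć" take two bytes each.
puts "teść".length    # 4
puts "teść".bytesize  # 6
```

Plain ASCII letters occupy one byte each, while the accented letters are encoded as two-byte sequences.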

In Ruby, “\w” matches only ASCII letters, digits and the underscore, so “test” is matched in full, while “\w+” applied to “teść” matches only the first two characters. The solution is to use POSIX bracket expressions, which Ruby treats as Unicode-aware, so that the regexp uses, for example, “[[:alpha:]]+” in place of “\w+”.
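The difference can be sketched as follows. The helper names `ascii_word` and `unicode_word` are hypothetical, and `[[:alpha:]]` is only one possible choice; `[[:word:]]` would be closer to `\w` if digits and underscores should match as well:

```ruby
# Hypothetical helper: first run of ASCII word characters ([a-zA-Z0-9_]).
def ascii_word(text)
  text[/\w+/]
end

# Hypothetical helper: first run of letters, using a POSIX bracket
# expression, which is Unicode-aware in Ruby.
def unicode_word(text)
  text[/[[:alpha:]]+/]
end

["test", "teść"].each do |word|
  puts "#{word}: #{ascii_word(word).inspect} vs #{unicode_word(word).inspect}"
end
```

For “teść”, `ascii_word` stops before “ś” and returns only “te”, while `unicode_word` returns the whole word.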

(Note that in the second example there is a non-breaking space between the words. On Windows you can insert one by pressing Alt+0160 (digits on the numeric keypad), in Vim by pressing <C-k> <space> <space>.)
So to make it work with non-breaking spaces, you can either convert the text first, replacing non-breaking spaces with regular spaces, or, if you want to preserve the information:
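A minimal sketch of both options, assuming that the information-preserving variant relies on the POSIX class `[[:space:]]`, which matches U+00A0 in Ruby, whereas `\s` does not. The constant `NBSP` and the helpers `normalize_spaces` and `split_words` are hypothetical names used for illustration:

```ruby
NBSP = "\u00A0"  # non-breaking space, code point 160

# Option 1 (hypothetical helper): replace non-breaking spaces with regular ones.
def normalize_spaces(text)
  text.gsub(NBSP, " ")
end

# Option 2 (hypothetical helper): leave the text untouched and split on any
# Unicode whitespace, including the non-breaking space.
def split_words(text)
  text.split(/[[:space:]]+/)
end

text = "zażółć#{NBSP}gęślą jaźń"

puts text.split(/\s+/).length                    # \s+ does not split at the NBSP
puts normalize_spaces(text).split(/\s+/).length  # splits after conversion
puts split_words(text).length                    # splits without conversion
```

The second option keeps the original string unchanged, so the distinction between regular and non-breaking spaces is still available to later processing.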