Foxpond Hollow has asked for the
wisdom of the Perl Monks concerning the following question:

Hi monks,

I come seeking a means of preserving diacritical marks in a string. The situation is that I am using LWP to access a website and copy certain parts of it into various strings. It's all bibliographic information for various books. The titles sometimes contain diacritical marks, ranging from your run-of-the-mill umlaut and grave accent to your more bizarre Russian characters that I don't know the names of.

I'm not looking for a way of stripping the diacritics out. In fact, that's the problem. When I copy the text into the string, it copies as basic ASCII. I need it preserved as-is, because I'm turning it right around and searching a database with it, and that database expects it to still have the diacritics, and finds no results if it doesn't.

I'm not too familiar with encoding schemes, so I'm not really sure what I should be looking for in terms of modules and approaches. Any help would be appreciated. Thanks.

Upon closer inspection, I realized it is not actually converting the characters to basic ASCII. It is just removing them entirely. So "Das europäische Volksmärchen" becomes "Das europische Volksmrchen", which is why the searches weren't working. It turns out the database doesn't actually care about the accent marks, but I do kinda still need the letters.

The weird thing is that according to the source for the page, it's UTF-8 and there is no encoding on the characters themselves (i.e., no &#xxxx; character entities), but I thought UTF-8 could be converted back to basic ASCII as needed? Is this something I need to actually implement in the code to make happen?

So $MARC_page is the actual link (provided above) to the page I need, LWP fetches it and after a couple steps passes all of the content into the $HTML scalar. The code that fetches the title from $HTML is the following:

if ($HTML =~ m{
    245\d{0,2}   # MARC tag 245 followed by 0-2 indicators
    .*?          # followed by anything, ungreedy
    \|a\s        # followed by a pipe and the subfield
    (.*?)        # followed by the title,
                 # which can be anything, ungreedy
    (?:\||<)     # followed by a pipe and the next subfield
                 # or, if no subfield, an opening HTML tag bracket
}xmgs) {
    $title = MARC::Field->new('245', '', '', 'a' => $1);
}
else {
    $title = MARC::Field->new('245', '', '', 'a' => 'field does not exist');
}

I'm sure that didn't format nearly as well as I'd've liked, but hopefully it's still readable.

I'm using Perl 5.8.9. Sorry for not providing more info earlier, like I said, I wasn't even sure what info would be needed. Hopefully this will be more helpful. Thanks for any assistance.

UPDATE 2: Okay so the link I gave above doesn't work because that record has actually been deleted as part of routine maintenance. It's irrelevant to this, so don't worry about that. Here's a link to the same info that should still work:

I don't know if it alerts you when a post you've commented on is updated, so in case it doesn't, I've updated the post with the info you asked for. Note that the second update has the correct URL and you should ignore the URL in the first update.

I can't see an obvious source for the problem. I think you need to dump out the result of the request, before any processing, and pin down exactly where the special characters are being lost: i.e., is it coming correctly out of LWP, is it the regex, could it be the MARC:: module, etc.

As graff said it shouldn't be losing these characters, but there are a number of places where things can go wrong.

It's all a bit complicated and I can't think of a good guide to it at the moment. On the other hand, I've never heard of Perl completely stripping special characters because of an encoding problem - normally, you would get a multi-byte utf-8 character treated as 2 or 3 characters if the encoding is not set correctly. So I suspect an error in some code somewhere - could it be that something is validating input and stripping out characters it doesn't think are "safe"...?

Sorry I can't be of more help. Try to narrow it down to where they disappear and it will be solved eventually.
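A minimal sketch of that kind of instrumentation (the URL is a placeholder for your $MARC_page link; decoded_content is the call I'd check first, since ->content hands back raw undecoded bytes):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua   = LWP::UserAgent->new;
my $resp = $ua->get('http://catalog.example.org/record');   # placeholder URL
die $resp->status_line unless $resp->is_success;

# decoded_content() honors the charset declared by the server/page,
# so $HTML holds Perl characters; ->content() would give raw UTF-8 bytes
my $HTML = $resp->decoded_content;

# hex-dump a suspect region so you can see exactly which code points
# survived this stage; repeat after each processing step
my $i = index $HTML, 'europ';
if ($i >= 0) {
    print join(' ', map { sprintf 'U+%04X', ord }
                    split //, substr $HTML, $i, 12), "\n";
}
```

If the dump already shows the accented letters missing right after the fetch, the problem is in the decoding; if they're intact there, move the dump later until they disappear.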

Some relevant sample data (or the web site url, if that's appropriate) would really help here, along with an actual code snippet that shows us what you are doing with the data.

It matters what sort of character encoding the web site is using (some sort of latin-1? utf-8? something else?), and it also matters what your script is doing when opening file handles for input or output, making database connections, and using LWP methods. Oh, and it also matters what character encoding is being used in the database. (Is it the same or different compared to what is being used at the web site?)

Lacking all those details, I don't think there's much we can say about your problem -- except that it sounds a bit implausible: if the web site content includes accented characters, I wouldn't expect a quiet conversion to "basic ASCII", unless your script is explicitly applying this sort of behavior somehow. I might expect warnings or errors or some sort of character-entity-reference stuff, if the data is ending up different from its original form.
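For what it's worth, here's a sketch of making every boundary agree on UTF-8 (the filename, DSN, and credentials are placeholders, and mysql_enable_utf8 applies only if the database is MySQL via DBD::mysql -- other drivers have their own attribute):

```perl
use strict;
use warnings;
use DBI;

# output: tell Perl the terminal/pipe expects UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# input files: decode on read instead of getting raw bytes
open my $fh, '<:encoding(UTF-8)', 'titles.txt'    # placeholder filename
    or die "open: $!";

# database: ask the driver to exchange UTF-8 characters (MySQL shown)
my $dbh = DBI->connect(
    'dbi:mysql:database=catalog', 'user', 'pass',   # placeholders
    { RaiseError => 1, mysql_enable_utf8 => 1 },
);
```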

Thanks for the update. I had no luck with the urls you posted, but I was able to go to the web page, put in a request that yielded accented characters in the output, and use the resulting url to push that request through LWP.

Since I got different content from what you were getting, your regex didn't really apply for my data (and I guess your regex isn't related to the problem anyway, since it has nothing to do with accented letters). Anyway, here's what I found when I experimented with the non-ASCII content:

Having tried it myself, I learned that non-spacing diacritic marks (presented as separate characters rather than being an intrinsic part of a letter -- e.g. the second character in "U+0061 U+0301" for á, rather than the single precomposed U+00E1) all fall into the category of things that match "\w".
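For example, a decomposed á behaves like two word characters -- a small sketch; both the base letter and the combining mark land inside Perl's Unicode-aware \w:

```perl
use strict;
use warnings;

my $precomposed = "\x{00E1}";    # á as one code point (LATIN SMALL LETTER A WITH ACUTE)
my $decomposed  = "a\x{0301}";   # 'a' followed by COMBINING ACUTE ACCENT

# both forms match \w character-by-character
print "precomposed matches \\w\n"        if $precomposed =~ /\A\w\z/;
print "combining mark matches \\w too\n" if $decomposed  =~ /\A\w\w\z/;
```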

You might want to check out this little command-line tool I posted a while back -- it can really help with getting a handle on what kinds of unicode data you are really dealing with: tlu -- TransLiterate Unicode; check my home page for a few other unicode tools.

(UPDATE: Forgot to mention -- I also noticed that the source data from the web site tended to use both the single-character "accented_letter" and the two-character "letter accent_mark" for the same thing -- that is, their unicode usage is inconsistent, and somewhat non-standard.)
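Unicode::Normalize (core since 5.8) can paper over that inconsistency: normalize both the scraped data and your query to NFC and the two spellings compare equal. A sketch:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

binmode STDOUT, ':encoding(UTF-8)';

# the site mixes precomposed letters with letter + combining mark;
# NFC folds "a" + U+0308 into the single code point U+00E4 (ä)
my $decomposed = "Das europa\x{0308}ische Volksma\x{0308}rchen";
my $canonical  = NFC($decomposed);
print "$canonical\n";   # now uses precomposed U+00E4 in both places
```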
