Synopsis

Description

Have you ever processed a Web form submission or feed, assuming that the incoming text was encoded as specified in the Content-Type header or in the XML declaration, only to end up with a bunch of junk because someone pasted in content from Microsoft Word? This happens because Microsoft uses a superset of the Latin-1 encoding called "Windows Western" or "CP1252". If the specified encoding is Latin-1, most things will come out right, but a few characters (curly quotes, em dashes, ellipses, and the like) may not. The differences are well known; Wikipedia has a nice chart documenting them.

Of course, that won't really help you. What will help you is to quit using Latin-1 and switch to UTF-8. Then you can just convert from CP1252 to UTF-8 without losing a thing, just like this:

use Encode;
$text = decode 'cp1252', $text, 1;
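
To make the round trip concrete, here is a minimal sketch using only the core Encode module. The sample bytes and variable names are made up for illustration. Note that decode returns Perl characters, so a separate encode step is what actually produces UTF-8 bytes for output or storage.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: raw CP1252 bytes as they might arrive from a form post.
# 0x93/0x94 are curly double quotes and 0x85 is an ellipsis in CP1252.
my $bytes = "\x93Hello\x94\x85";

# Decode the bytes into Perl characters. FB_CROAK makes Encode die on
# malformed input; LEAVE_SRC keeps $bytes from being modified by the check.
my $chars = decode('cp1252', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

# Encode the characters as UTF-8 bytes for output or storage.
my $utf8 = encode('UTF-8', $chars);

printf "%d characters, %d UTF-8 bytes\n", length($chars), length($utf8);
```

Each of the three CP1252 characters in this sample takes three bytes in UTF-8, which is why the byte count grows.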

But I know that there are those of you out there stuck with Latin-1 and who don't want any junk characters from Word users. That's where this module comes in. Its zap_cp1252 function will zap those CP1252 gremlins for you, turning them into their appropriate ASCII approximations.

Another case that occasionally comes up is when you're reading in text that claims to be UTF-8 but still ends up with some CP1252 gremlins mixed in with properly encoded characters. I've seen examples of just this sort of thing when processing GMail messages and attempting to insert them into a UTF-8 database, as well as in some feeds processed by, say, Yahoo! Pipes. Doesn't work so well. For such cases, there's fix_cp1252, which converts those CP1252 gremlins into their UTF-8 equivalents.
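
To see why such mixed input "doesn't work so well", here is a small sketch using only the core Encode module. The sample string is invented for illustration. It shows a strict UTF-8 decode rejecting text that contains stray CP1252 bytes, which is the situation fix_cp1252 is meant to repair.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: valid UTF-8 bytes for "cafe" with an acute accent,
# followed by raw CP1252 curly quotes (0x93, 0x94) that were never converted.
my $mixed = encode('UTF-8', "caf\x{e9} ") . "\x93quoted\x94";

# A strict UTF-8 decode rejects the stray bytes, since 0x93 and 0x94
# are continuation bytes and cannot start a UTF-8 sequence.
my $ok = eval {
    decode('UTF-8', $mixed, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
};

print $ok ? "decoded cleanly\n" : "strict decode failed\n";
```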

Usage

This module exports two subroutines, zap_cp1252() and fix_cp1252(), each of which accepts a single argument:

zap_cp1252 $text;
fix_cp1252 $text;

When called in a void context, as in these examples, the zap_cp1252() and fix_cp1252() subroutines perform in-place conversions of any CP1252 gremlins into their appropriate ASCII approximations or UTF-8 equivalents, respectively. Note that because the conversion happens in place, the data to be converted cannot be a string constant; it must be a scalar variable.

When called in a scalar or list context, on the other hand, a copy will be modified and returned. The original string will be unchanged.

In this case, even constant values can be processed. Either way, undefs will be ignored.
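
The difference between the two calling contexts can be sketched as follows. This assumes Encode::ZapCP1252 is installed; the sample string and variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode::ZapCP1252;

# Hypothetical Latin-1 byte string containing CP1252 curly quotes.
my $text     = "\x93Hello\x94";
my $original = $text;

# Scalar context: a cleaned copy comes back and $text is left alone.
my $copy      = zap_cp1252 $text;
my $untouched = $text eq $original;

# Void context: $text itself is modified in place.
zap_cp1252 $text;

print "copy and in-place results match\n" if $copy eq $text;
```

fix_cp1252 is described as following the same calling convention, so the same pattern should apply to it.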

In Perl 5.8.8 and higher, the conversion will work even when the string is decoded to Perl's internal form (usually via decode 'ISO-8859-1', $text) or the string is encoded (and thus simply processed by Perl as a series of bytes). The conversion will even work on a string that has not been decoded but has had its utf8 flag flipped anyway (usually by an injudicious use of Encode::_utf8_on()). This is to enable the highest possible likelihood of removing those CP1252 gremlins no matter what kind of processing has already been executed on the string.

In Perl 5.10 and higher, the functions may optionally be called with no arguments, in which case $_ will be converted instead.
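
A minimal sketch of the no-argument form, assuming Perl 5.10 or later and an installed Encode::ZapCP1252. The @lines array is a made-up example; the loop aliases $_ to each element, so the void-context call cleans the array in place.

```perl
use strict;
use warnings;
use 5.010;
use Encode::ZapCP1252;

# Hypothetical byte strings carrying CP1252 curly quotes and an ellipsis.
my @lines = ("\x93one\x94", "two\x85");

for (@lines) {
    zap_cp1252;    # no argument: operates on $_, in place
}

print "$_\n" for @lines;
```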

Conversion Table

Here's how the characters are converted to ASCII and UTF-8. The ASCII conversions are not perfect, but they should be good enough for general cleanup. If you want perfect, switch to UTF-8 and be done with it!
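
The Unicode side of the mapping can be regenerated with the core Encode module. The sketch below is only an illustration and covers just a few of the bytes mentioned in this document (euro sign, ellipsis, curly quotes, bullet, em dash); it does not show the module's ASCII approximations, which live in the module's own tables.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A handful of CP1252 bytes: euro sign, ellipsis, single and double
# curly quotes, bullet, and em dash. The selection is illustrative.
my @bytes = (0x80, 0x85, 0x91, 0x92, 0x93, 0x94, 0x95, 0x97);

my %code_point_for;
for my $byte (@bytes) {
    # Decode the single byte as CP1252 and record its Unicode code point.
    my $char = decode('cp1252', chr($byte));
    $code_point_for{$byte} = ord($char);
    printf "0x%02x => U+%04X\n", $byte, ord($char);
}
```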

Changing the Tables

Don't like these conversions? You can modify them to your heart's content by accessing this module's internal conversion tables. For example, if you wanted zap_cp1252() to use an uppercase "E" for the euro sign, just do this:

local $Encode::ZapCP1252::ascii_for{"\x80"} = 'E';

Or if, for some bizarre reason, you wanted the UTF-8 equivalent for a bullet converted by fix_cp1252() to really be an asterisk (why would you? Just use zap_cp1252 for that!), you can do this:

local $Encode::ZapCP1252::utf8_for{"\x95"} = '*';

Just remember, without local this would be a global change. In that case, be careful if your code zaps CP1252 elsewhere. Of course, it shouldn't really be doing that. These functions are just for cleaning up messes in one spot in your code, not for making a fundamental part of your text handling. For that, use Encode.
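
The scoping behavior of local can be sketched like this, assuming Encode::ZapCP1252 is installed. The variable names are hypothetical; the override applies only inside the enclosing block, and the module's default approximation for the euro sign is back in effect afterward.

```perl
use strict;
use warnings;
use Encode::ZapCP1252;

my $euro = "\x80";    # euro sign as a CP1252 byte

my $inside;
{
    # The override lasts only until the end of this block.
    local $Encode::ZapCP1252::ascii_for{"\x80"} = 'E';
    $inside = zap_cp1252 $euro;
}

# Back outside, the module's default approximation applies again.
my $outside = zap_cp1252 $euro;

print "inside: $inside, outside: $outside\n";
```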
