Have you ever processed a Web form submission or feed, assuming that the incoming text was encoded as specified in the Content-Type header or in the XML declaration, only to end up with a bunch of junk because someone pasted in content from Microsoft Word? Well, this is because Microsoft uses a superset of the Latin-1 encoding called "Windows Western" or "CP1252". If the specified encoding is Latin-1, most things will come out right, but a few things (curly quotes, em dashes, ellipses, and the like) may not. The differences are well known; Wikipedia has a nice chart documenting them.
Of course, that won't really help you. What will help you is to quit using Latin-1 and switch to UTF-8. Then you can just convert from CP1252 to UTF-8 without losing a thing, just like this:
use Encode;
$text = decode 'cp1252', $text, 1;
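For a slightly fuller picture of that conversion, here is a minimal sketch using only the core Encode module. The variable names and the sample bytes are made up for illustration; the check flags are one reasonable choice, not the only one.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Sample input (an assumption): bytes as pasted from Word, where the
# curly double quotes are 0x93 and 0x94 in CP1252.
my $bytes = "\x93Hello\x94";

# Decode the bytes into Perl characters. FB_CROAK dies on malformed
# input; LEAVE_SRC keeps $bytes from being modified by the check.
my $chars = decode('cp1252', $bytes, FB_CROAK | LEAVE_SRC);

# Encode the characters as UTF-8 bytes for storage or output.
my $utf8 = encode('UTF-8', $chars);
```

After this runs, $chars holds U+201C and U+201D around the word, and $utf8 holds the three-byte UTF-8 sequence for each quote.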
But we know that some of you are stuck with Latin-1 and don't want any junk characters from Word users. That's where this module comes in. Its zap_cp1252 function will zap those CP1252 gremlins for you, turning them into their appropriate ASCII approximations.
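To make the idea of "ASCII approximations" concrete, here is a rough sketch of the technique in plain Perl. It is not the module's implementation: the helper name zap_sketch, the %ascii_for table, and the particular replacements chosen are all assumptions for illustration, and the table covers only a handful of the CP1252-specific bytes.

```perl
use strict;
use warnings;

# Hypothetical subset of CP1252-only bytes and ASCII stand-ins for them.
my %ascii_for = (
    "\x85" => '...',   # horizontal ellipsis
    "\x91" => "'",     # left single quotation mark
    "\x92" => "'",     # right single quotation mark
    "\x93" => '"',     # left double quotation mark
    "\x94" => '"',     # right double quotation mark
    "\x96" => '-',     # en dash
    "\x97" => '--',    # em dash
);

# Hypothetical helper: returns a copy of the text with each mapped
# byte replaced by its ASCII stand-in. Other bytes pass through.
sub zap_sketch {
    my ($text) = @_;
    $text =~ s/([\x85\x91-\x94\x96\x97])/$ascii_for{$1}/g;
    return $text;
}

my $zapped = zap_sketch("\x93It\x92s done\x94 \x97 almost\x85");
```

Here $zapped ends up as plain ASCII, with straight quotes, a double hyphen, and three periods in place of the Word characters.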
Another case that can occasionally come up is when you're reading in text that claims to be UTF-8, but it still ends up with some CP1252 gremlins mixed in with properly encoded characters. We've seen examples of just this sort of thing when processing Gmail messages and attempting to insert them into a UTF-8 database, as well as in some feeds processed by, say, Yahoo! Pipes. Doesn't work so well. For such cases, there's fix_cp1252, which converts those CP1252 gremlins into their UTF-8 equivalents.
use Encode::ZapCP1252;
# Zap or fix in-place.
zap_cp1252 $latin1_text;
fix_cp1252 $utf8_text;
# Zap or fix copy.
my $clean_latin1 = zap_cp1252 $latin1_text;
my $fixed_utf8   = fix_cp1252 $utf8_text;
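The mixed-encoding case can also be sketched with core Perl alone. The following is a rough illustration of the idea, not the module's implementation: the helper name fix_sketch is hypothetical, and it assumes the text has already been decoded into a Perl character string in which the stray CP1252 bytes survive as code points 0x80 through 0x9F. The sample only uses bytes that CP1252 defines.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: treat each code point in 0x80-0x9F as a stray
# CP1252 byte and decode it on its own, leaving everything else alone.
sub fix_sketch {
    my ($chars) = @_;
    # pack('C', ...) rebuilds the single byte so decode() sees raw octets.
    $chars =~ s/([\x80-\x9F])/decode('cp1252', pack('C', ord $1))/ge;
    return $chars;
}

# Sample input (an assumption): properly decoded curly quotes and an
# e-acute, followed by stray CP1252 quote bytes 0x93 and 0x94.
my $mixed = "\x{201C}caf\x{E9}\x{201D} and \x93bar\x94";
my $fixed = fix_sketch($mixed);
```

The properly decoded characters are untouched, while the two stray bytes become U+201C and U+201D.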