HTML Forum

I'm having an odd problem with some of my pages concerning apostrophes. The apostrophes in words such as I've, wouldn't, and other contractions show up when the page is opened in the browser (Firefox) but not when it's opened in the editor (Notepad++). In the editor they're missing, and when I put them in with the editor I get two apostrophes when viewing in the browser. And when viewed in the browser it looks like the two apostrophes are in different fonts. Something like this: I''ve wasn''t, with the second apostrophe looking like a different font and not visible in the editor (Notepad++).

The best way to avoid this problem is to make sure that your browser isn't misreading your page's charset, and that the character set the page declares really is the one it's encoded in. In Notepad++ you can convert a page encoded in iso-8859-1 (for example) to UTF-8. Part of the problem is that some character sets use "curly quotes", which look like a superscript comma, a dot with a tail. Down in the lower right corner Notepad++ shows the page encoding (if I recall correctly). I haven't used Notepad++ for a few years now, but I used to appreciate its ability to convert a page's character encoding. If any parts of your pages were copied from Windows documents, they are most likely encoded in neither UTF-8 nor iso-8859-1; Windows' default encoding has characters that look wonky in most browsers. Also be careful with what Windows calls UTF-8: in its Notepad program a page can look fine but actually be UTF-8 with a BOM (byte order mark) that you won't notice until you view the page in a browser.
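If you'd rather script the conversion than click through Notepad++'s menus, a couple of lines of Python do the same thing: decode with the old encoding, re-encode as UTF-8. This is just a sketch; the filename and sample content are made up for illustration.

```python
# Fake up a page saved in ISO-8859-1 (the copyright sign is one byte, 0xA9):
with open("page.html", "w", encoding="iso-8859-1") as f:
    f.write("<p>\u00a9 2024</p>")

# The scripted equivalent of Notepad++'s "Convert to UTF-8":
# read using the old encoding, write back out as UTF-8.
with open("page.html", "r", encoding="iso-8859-1") as f:
    text = f.read()
with open("page.html", "w", encoding="utf-8") as f:
    f.write(text)

# The same character is now two bytes (0xC2 0xA9) instead of one.
with open("page.html", "rb") as f:
    print(f.read())  # b'<p>\xc2\xa9 2024</p>'
```

The key point: conversion means decoding and re-encoding, not just changing what the page claims to be.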

Yes, that definitely sounds like something that was originally created in Windows-Latin-1 being viewed as UTF-8. A number of common characters, including all curly quotes (single and double), the oe ligature and the em dash, sit in a range of Windows-Latin-1 that UTF-8 doesn't use. Canonically you will get the Unicode Replacement Character, which looks like a black diamond with a question mark in the middle, but some applications will simply not display the character at all.
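Here's the symptom in miniature, sketched in Python (assuming Windows-1252, i.e. Windows-Latin-1, on the saving side): the curly apostrophe is a single byte there, and that byte is not valid UTF-8 on its own.

```python
# A curly apostrophe (U+2019) stored as Windows-1252 is the single byte 0x92.
win_bytes = "I\u2019ve".encode("cp1252")          # b'I\x92ve'

# Decoding those bytes as UTF-8 fails on 0x92, so tolerant tools
# substitute U+FFFD, the black-diamond replacement character.
as_utf8 = win_bytes.decode("utf-8", errors="replace")
print(as_utf8)  # I�ve
```

Strict tools refuse the byte outright (`errors="strict"` raises an exception), which matches the "some applications simply don't display it" behaviour.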

Don't be misled by the term "charset" though. All browsers can display all characters; the only difference is what they look like in the page source. The proper word is "file encoding". (There is an arcane historical reason for the term "charset" but I have long since forgotten it :()

Check the preferences for all text editors. Wherever there is an option for saving UTF-8 documents either with or without BOM, say without.

Finally: If you use a text editor to work on html, make sure options such as "smart quotes" are OFF. You don't want any fancy characters sneaking in unless you deliberately put them there.

The problem is that some character sets use "curly quotes" which look like a superscript comma, a dot with a tail.

Careful. Curly quotes are in addition to, not instead of, the "typewriter" characters that live down in the ASCII range. So you can use curly quotes for displayed text without breaking your internal HTML, which requires straight quotes (single or double).

If you are a coward you can use entities like &ldquo; and &rsquo; or their numeric equivalents ;) But that makes your code unreadable, and adds several bytes per character.
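A rough Python illustration of that byte-count point (the strings here are invented examples):

```python
# A quoted word, once as literal curly-quote characters, once as entities.
literal = "\u201cHi\u201d"        # “Hi” — each curly quote is 3 bytes in UTF-8
entities = "&ldquo;Hi&rdquo;"     # each entity is 7 ASCII bytes

print(len(literal.encode("utf-8")))   # 8  (3 + 2 + 3)
print(len(entities.encode("utf-8")))  # 16 (7 + 2 + 7)
```

So each entity costs roughly twice what the literal UTF-8 character does, on top of the readability hit.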

The apostrophe and the single closing quote are the same character. Most of the good stuff lives in the 2000-206F (hex) range.
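You can verify both claims in any Python shell: the typographic apostrophe is U+2019 RIGHT SINGLE QUOTATION MARK, and the common "good stuff" (curly quotes, dashes, the ellipsis) all lives in the General Punctuation block, U+2000 through U+206F.

```python
import unicodedata

apostrophe = "\u2019"  # ’
print(hex(ord(apostrophe)), unicodedata.name(apostrophe))
# 0x2019 RIGHT SINGLE QUOTATION MARK

# ‘ ’ “ ” – — … all fall inside the 2000–206F range.
for ch in "\u2018\u2019\u201c\u201d\u2013\u2014\u2026":
    assert 0x2000 <= ord(ch) <= 0x206f
```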

Once you switch to an end-to-end UTF-8 system, everything becomes much easier.

End-to-end means an editor in UTF-8, pages in UTF-8, a database in UTF-8 ... everything.

Just say NO to the BOM (Byte Order Mark): it's "hidden" at the beginning of the file and will be a PITA, as it can be output by e.g. a PHP script before you get a chance to set headers ...
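If a file already has the BOM baked in, a few lines of Python can strip it. The filename here is a made-up stand-in, and the first block just simulates a file that some Windows editor saved "with BOM":

```python
import codecs

# Simulate a PHP file saved as "UTF-8 with BOM" (three hidden bytes: EF BB BF).
with open("script.php", "wb") as f:
    f.write(codecs.BOM_UTF8 + b"<?php header('X: y'); ?>")

# Strip the mark: those bytes would otherwise be sent to the browser
# before the script gets a chance to set any headers.
data = open("script.php", "rb").read()
if data.startswith(codecs.BOM_UTF8):
    data = data[len(codecs.BOM_UTF8):]
open("script.php", "wb").write(data)

print(open("script.php", "rb").read()[:5])  # b'<?php'  — BOM gone
```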

For the rest, as Lucy24 also said, take care with Word and co., as they like to convert too many things.

In the end, once you go UTF-8 all the way, you need to worry much less about HTML entities. In fact, polyglot (x)html5 only allows five of the named ones anymore: &amp; &lt; &gt; &apos; &quot;. For all the rest you just type the character, or use the numeric reference if you have to. Lately the first three are all I'm using anymore.
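Python's standard library makes the same point, as a small sketch: escape only the markup-significant characters and let everything else through as-is (the sample string is invented).

```python
import html

# html.escape converts &, <, > (and quote characters) to entities;
# accented letters and curly quotes pass through untouched as UTF-8 text.
print(html.escape("caf\u00e9 & \u201cbar\u201d < baz"))
# café &amp; “bar” &lt; baz
```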

I recently had to deal with some crap on an IIS installation (normally I don't touch that with a yardstick, but I kinda had to for other reasons): it was set to iso-latin-1 all the way, but every so often it threw in a UTF-8 character nonetheless. Seems this is a "known" issue in the Microsoft world. Again, the solution, while not easy in that instance, is to go UTF-8 all the way and forget about iso-latin-1 (or worse).

Yes it does. Although you can't see it, you copied all the Windows formatting overhead. The way to get around that is to use an interim text editor like Windows Notepad: paste into Notepad from the Windows app, then copy and paste into Notepad++ and convert the page encoding to UTF-8 there. Another useful free editor for Windows users is PSPad. It helps you convert old HTML to XHTML and can help with character encoding too. It does far more complicated things as well, but for the problem you're having it can help.

I've obviously not used Notepad++ (as I don't use Windows). But the trick is to paste from Word using the paste-as-text option, not the regular paste that tries to keep all formatting (if your editor is RTF or HTML/CSS aware, that will dump in a truckload of crap, sometimes hidden from plain sight).

It's a PITA to remember to paste from Word using the paste as ... then select plain text method, but for some targets it's critical that you do! I consider it the price to pay for using Word.

Thanks a bunch not, lucy, and swa. I used Notepad++ to convert the charset on my pages to UTF-8 (formerly iso-8859-1 and whatever MS Word put in), and after a small amount of rekeying, the apostrophes and quotation marks are where they should be now. Hats off and applause to you.