

Problems with javascript & utf-8 encoding

Sorry in advance for the length of this... you can skip to "the problem" if you want to avoid extraneous backstory.

The background...

I'm a designer with some moderate knowledge of programming working with my client's ASP/javascript programmer. The client's site was created back in 1999 or so, when he used FrontPage 98 to develop all the pages. When I was brought on board I was also forced to use FP, reluctantly, because that was basically the only way to edit / upload his pages.

Fast forward to a few months ago, when I got a new computer with Vista. This meant I had to switch to ExpressionWeb, since everything I'd read about FP & Vista compatibility was not very good. Anyway, xWeb was changing everything to UTF-8 by default, despite there being inconsistent charset declarations throughout the site; most were missing, some were ISO 8859-1, others were something else. This inconsistency was revealed once the pages went live, thanks to a bunch of odd characters, primarily curly quotes, trademark signs, and other symbols that had been carried over when the client copied/pasted content from MS Word.

The problem...

I suggested that we go through page by page and switch everything to UTF-8. I argued that we've been lucky to have gotten away with crappy haphazard coding as long as we have; we need to standardize already. Fortunately a consensus was reached and the project began.

I was in charge of converting the static pages (i.e. not our shopping cart or other script-laden pages), which I did by opening the pages in xWeb, adding the proper charset declaration, and resaving/encoding as UTF-8. The pages I did this with ended up working fine, except for one or two places where old MS Word code was still used. Once the extra stuff was removed it worked fine.

Meanwhile the ASP programmer understandably preferred to do the conversion of the vital ASP- and Javascript-laden pages herself. She uses Microsoft Script Editor rather than FP, specifically because MSE doesn't add extra bloated code.

But when she tested these pages last night, they were broken. There was a new line of code at the top (which I believe was something like <%@ CodePage=65001 %>) and boxes (square characters) in the middle of her javascript where there should have been blank spaces. She's at a loss to understand what happened.
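To illustrate what's probably behind those boxes (a Python sketch, not her actual files): a byte that is perfectly legal in CP-1252, such as a non-breaking space, is an invalid lead byte under UTF-8, so a lenient decoder substitutes the replacement character U+FFFD, which most fonts draw as a hollow square.

```python
# A non-breaking space saved as CP-1252 is the single byte 0xA0.
cp1252_bytes = "\u00a0indent".encode("cp1252")

# Read back under a UTF-8 label, 0xA0 is an invalid lead byte,
# so a lenient decoder substitutes U+FFFD (the "box" glyph).
decoded = cp1252_bytes.decode("utf-8", errors="replace")
print(decoded)  # '\ufffdindent'
```

The same mismatch, in either direction, is how mislabeled files sprout boxes exactly where whitespace or punctuation used to be.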

Now again, I'm no programmer. I completely defer to her knowledge of ASP and javascript. Nevertheless, it struck me that these errors implied that the pages weren't correctly saved as UTF-8. When we were trying to figure out what caused the problem, I asked if, in addition to adding the meta charset tag, she had actually encoded the files as UTF-8.

She said she opened up the pages, added the meta charset definition, and closed them again. This concerned me, since I didn't hear anything about 'encoding' in there. So I asked if they were saved as UTF-8 encoded pages, and she said "MS Script Editor doesn't do Save As, it just saves the file."

I thought the problem was that simply adding the charset declaration isn't enough; the pages have to be encoded to match. You usually have to tell your editor how you want pages to be encoded (i.e. which character set). The programmer seemed irritated by my suggestion -- of course she was frustrated, understandably -- and said that MS Script Editor is more advanced than FP, that that's why she uses it, and basically implied that it knows what to do.

Since I'm not a programmer and I'm not nearly as knowledgeable about ASP or Javascript as she is (and I have no experience whatsoever with Script Editor), I really couldn't argue with that, or offer any other suggestions. Also I think she resents me for the whole encoding mess anyway. Maybe she's right.

So my question to you gurus is: anyone have ideas about what might have gone wrong? Does anyone have experience in changing files w/javascript & ASP coding to unicode? Is MSE able to encode files in utf-8?

In my defense, I didn't "standardize" the site on my own decision. (And by the way, all these changes were made to a test/backup site, not the live site. So nothing is permanently broken or anything.)

If the programmer had said flatly that "no, the asp/javascript will be screwed up if we do that," we'd of course have stuck with the old files and I'd have to either resign or use an old computer or something.

But she never said that the asp/javascript code would be screwed up, and indeed the pages on the test site that I converted that do have javascript & ASP are working without a problem. So I don't think it was "standardizing" the site that caused the issue -- it seems to be something to do with Microsoft Script Editor or the method the programmer used to make the change that caused the problem.

First, regardless of any prejudice toward or against a particular editor, ASP, Javascript, and HTML should all be saved as plain text. The "square boxes" you described are a flag that the MS editor saved some characters beyond the limited 128-character ASCII set. {isn't that encoding?}
Can't you open those files in notepad and resave the changes?
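If you want to find exactly which characters those are, you can scan the file for any byte above the 7-bit ASCII range. A minimal sketch (the sample data here is made up for illustration):

```python
def find_non_ascii(data: bytes):
    """Return (offset, byte) pairs for every byte outside 7-bit ASCII."""
    return [(i, b) for i, b in enumerate(data) if b > 0x7F]

# Example: a curly apostrophe pasted from Word, saved as CP-1252.
sample = "it\u2019s".encode("cp1252")
print(find_non_ascii(sample))  # [(2, 146)]
```

Run over a whole file (opened in binary mode), this points straight at the offending bytes instead of leaving you to eyeball the source.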

That's probably what she'll be doing. (I ain't going near those files, myself!) Actually she'll probably just use the backup files. I believe that even Notepad gives you the choice to save things as ANSI or other encodings.

I have no idea what those boxes represent; there shouldn't be ANYthing except that blank space. They seem to have been added instead of the usual indenting one finds in scripts (be they php, asp or javascript).

Turns out that I was right -- the programmer did just add the meta tag without saving the files using utf-8 encoding. The important thing is that she was able to open up the files, save them as ASCII and that cleared up the formatting issues with the javascript.

For now we've decided to go back to Western European ISO. Which means I go back in and save/re-encode all of the site's pages. Lesson learned: sometimes you have to go backward in order to go forward!

The important thing is that she was able to open up the files, save them as ASCII and that cleared up the formatting issues with the javascript.

There is no such thing as plain ASCII. No such thing. It's a myth. What she probably saved the files as is CP-1252 (which, coincidentally, is almost the same as ISO-8859-1). Judging from your description so far, you're probably better off using ISO-8859-1 for the charset, since it tends to be the default in most systems (no guarantees, though).
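The "almost the same" part is easy to demonstrate (a Python sketch): the Word-style curly apostrophe exists in CP-1252 but simply has no slot in ISO-8859-1, where the 0x80–0x9F range is reserved for control codes.

```python
curly = "\u2019"  # RIGHT SINGLE QUOTATION MARK, the Word-style apostrophe

print(curly.encode("cp1252"))  # b'\x92' -- CP-1252 puts it at 0x92

try:
    curly.encode("iso-8859-1")
except UnicodeEncodeError:
    print("ISO-8859-1 has no curly quote")  # 0x80-0x9F are control codes there
```

That gap is exactly where Word-pasted punctuation goes wrong when a CP-1252 file is labeled as ISO-8859-1 or UTF-8.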

Oh, and just to save you the grief later on: meta tags are only relevant when the page isn't served from a web server that sends an HTTP header. In that case, the header takes precedence.

That was slightly separate -- we had that, too, at least in some pages. Not all of them. The BOM was mostly at the top of the page and just caused a few extra characters at the top; these blank/tab characters were breaking things altogether. It was like finding a needle in a haystack!
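For reference, the UTF-8 BOM is the three bytes EF BB BF. Editors that write it, and decoders that strip it, can be mimicked with Python's utf-8-sig codec (the snippet below is illustrative, not from the actual site):

```python
text = "var x = 1;"

with_bom = text.encode("utf-8-sig")  # prepends EF BB BF
print(with_bom[:3])                  # b'\xef\xbb\xbf'

# A BOM-aware decoder strips it again...
print(with_bom.decode("utf-8-sig") == text)  # True

# ...but a plain UTF-8 decode leaves a stray U+FEFF at the top of the file,
# which browsers may render as junk characters before the markup.
print(with_bom.decode("utf-8")[0])   # '\ufeff'
```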

Originally Posted by kyberfabrikken

There is no such thing as plain ASCII. No such thing. It's a myth. What she probably saved the files as is CP-1252 (which, coincidentally, is almost the same as ISO-8859-1). Judging from your description so far, you're probably better off using ISO-8859-1 for the charset, since it tends to be the default in most systems (no guarantees, though).

Truthfully I *think* she might be talking about ANSI instead. Notepad offers that as an option instead of unicode, I know that. Not sure. Honestly, I dunno what the story is, there's kind of a political situation here and the less I question her at this point, the better.

Oh, and just to save you the grief later on: meta tags are only relevant when the page isn't served from a web server that sends an HTTP header. In that case, the header takes precedence.

Unless I'm mistaken (and God knows, I could be!), isn't that only as long as the server is sending a charset along with the HTTP header? Our server isn't; I've checked, believe me! All it's sending is:

Content-Type: text/html

Therefore, the meta tag is important, at least in our situation. Heck, what started us off on this merry adventure in the first place was my discovery that in the pages without any charset declaration (a majority), things were getting royally screwed up. Sigh. We were so innocent back then!
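You can check that distinction mechanically. Here's a sketch using Python's standard library to pull the charset parameter out of a Content-Type value (the header values below are made up for the example):

```python
from email.message import Message

def charset_of(content_type: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

print(charset_of("text/html"))                      # None -> the meta tag applies
print(charset_of("text/html; charset=ISO-8859-1"))  # 'iso-8859-1' -> header wins
```

When the function returns None, as with our server, the browser falls back to the meta declaration (or guesses), which is why the meta tag matters in our situation.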

Unless I'm mistaken (and God knows, I could be!), isn't that only as long as the server is sending a charset along with the HTTP header? Our server isn't; I've checked, believe me! All it's sending is:

Yes, but most servers would send a charset as part of the header. You can configure it not to (as in your case), but that's an odd choice. To prevent any ambiguity, I'd always send the proper charset as part of the Content-Type header.