I have a big sitemap generated dynamically with PHP: a sitemap index pointing to some 230 separate sitemaps, each containing between 3,000 and 15,000 URLs.

In most of those 230 sitemaps everything is OK, but in some of them certain URLs contain special characters, and Google returns an error and rejects the sitemap. An example of a normal, accepted URL:

My question is: how do I encode those special characters (and other similar ones) so the sitemap still checks out? Which PHP function should I use, if that's the solution? Or is the only option to str_replace those characters with plain ones? That wouldn't break anything, since the URL works no matter what the first part contains (that part is for SEO only), but it would be time-consuming. I'd prefer to be able to write those special characters in a way that doesn't break the sitemap for Google.

Everything else about my sitemaps is fine; they're encoded in UTF-8, or at least they should be, given this line:

1 Answer

Are the %C5 and %F8 sequences meant to represent the characters U+00C5 (Å) and U+00F8 (ø)? If so, you need to use their UTF-8 encodings, not their raw Unicode codepoint numbers. 'Å' should be %C3%85, and 'ø' should be %C3%B8.
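For example, calling rawurlencode() on the UTF-8 form of each character produces exactly those sequences (a minimal sketch; the \u{} escape syntax requires PHP 7+):

```php
<?php
// "\u{C5}" produces the UTF-8 bytes of U+00C5 (Å): 0xC3 0x85.
echo rawurlencode("\u{C5}"), "\n"; // %C3%85
echo rawurlencode("\u{F8}"), "\n"; // %C3%B8
```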

Doing this in PHP is complicated by the fact that PHP strings are really byte strings, not Unicode character strings. They can't store abstract Unicode characters; they can only store the encoded representation of those characters, in a particular encoding such as UTF-8 or UTF-16. You can use the mbstring extension to work with encoded Unicode strings, but doing this correctly will probably mean using the mbstring functions for all handling of Unicode text throughout your application.
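A small sketch of the byte-string vs character-string distinction described above:

```php
<?php
// PHP strings are byte strings: strlen() counts bytes, while the
// mbstring functions operate on characters in a declared encoding.
$s = "\u{C5}\u{F8}";                         // "Åø": 2 characters, 4 UTF-8 bytes
echo strlen($s), "\n";                       // 4 (bytes)
echo mb_strlen($s, 'UTF-8'), "\n";           // 2 (characters)
echo mb_strtoupper("\u{F8}", 'UTF-8'), "\n"; // Ø (byte-oriented strtoupper can't do this)
```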

You should be looking to fix this encoding problem at the source: how did your program get a string that contains the byte 0xC5 to represent the character U+00C5? Something, somewhere, must've assumed that Unicode codepoint numbers translate directly into bytes, which is wrong. Find and fix that, so that your data is read into the PHP string in UTF-8 form to begin with, and then use the mbstring functions for any manipulation of the string afterward.
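One hedged sketch of such a repair: if the broken data stores each codepoint number as a single byte, that byte layout is exactly ISO-8859-1 for U+0000 through U+00FF, so a one-off conversion fixes it (the sample value below is hypothetical):

```php
<?php
// Hypothetical broken value: raw codepoint numbers stored as single bytes.
$bad = "\xC5sh\xF8j";
// For codepoints up to U+00FF, "one byte per codepoint" is ISO-8859-1,
// so converting from that encoding yields proper UTF-8.
$good = mb_convert_encoding($bad, 'UTF-8', 'ISO-8859-1');
echo rawurlencode($good), "\n"; // %C3%85sh%C3%B8j
```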

Once you have a string that contains the UTF-8 representation of your URL, rawurlencode() should give you the correct percent-escaped result.
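A sketch of that final step, assuming a URL layout like the asker's (the domain, slug, and ID here are made up). Note that rawurlencode() should be applied only to the path segment, since it would also escape '/' and ':':

```php
<?php
// Hypothetical SEO slug containing Danish characters, already valid UTF-8.
$slug = "\u{C5}rhus-s\u{F8}vej";
$url  = "https://example.com/" . rawurlencode($slug) . "/12345";
echo $url, "\n"; // https://example.com/%C3%85rhus-s%C3%B8vej/12345
```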

What you're saying is that since the document is encoded in UTF-8, the special characters should be displayed with their UTF-8 encodings? Can you link to a list of special characters with UTF-8 encodings?
– Dan Horvat, Aug 9 '12 at 12:09

Well, it doesn't really matter that the document is encoded in UTF-8, because the URLs are all ASCII. It's RFC 3986 that says you have to convert the original Unicode string into a byte sequence using UTF-8, then percent-encode any bytes that aren't valid ASCII in order to get a string that's entirely ASCII.
– Wyzard, Aug 9 '12 at 12:15

I added some info about how to do this in PHP.
– Wyzard, Aug 9 '12 at 12:27

rawurlencode() didn't work when I tried it earlier (Google still didn't recognize the URL), but I wasn't using mbstring. I suppose the easiest way to deal with this is to str_replace the strange characters and avoid the issue entirely. Like I said, it's not crucial for the URL to show those characters; it works even if each is replaced with the most similar plain one. So although I'll go with that approach, your answer was correct. Thanks.
– Dan Horvat, Aug 9 '12 at 12:35
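As a footnote on the replacement approach the asker settled on: rather than a hand-maintained list of str_replace() calls, iconv() with the //TRANSLIT flag can map accented characters to their nearest ASCII equivalents. This is a sketch only; transliteration output varies by iconv implementation and locale, so the result shown is not guaranteed:

```php
<?php
// glibc's transliteration tables are locale-dependent; this call may be
// a no-op on systems without the locale installed.
setlocale(LC_CTYPE, 'en_US.UTF-8');
// //IGNORE drops any character that cannot be transliterated at all.
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', "\u{C5}rhus-s\u{F8}vej");
echo $ascii, "\n"; // typically "Arhus-sovej", but implementation-dependent
```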