If you've ever gotten a number of weird-looking characters in your database or on your website, like "?", and didn't know why, then this episode is for you. Those bizarre characters, called "mojibake", rear their ugly heads when we don't account for a consistent character encoding. Today we discuss what character encoding is, how to account for it in HTML, PHP & your database, and how we can ensure we'll never encounter an unexpected alien character in our web apps again.
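As a rough illustration of how mojibake arises, the Python sketch below encodes a string as UTF-8 and then decodes the bytes as if they were ISO-8859-1. The sample word and variable names are assumptions made for the example; the episode itself discusses HTML, PHP and databases rather than Python.

```python
# Text that is correctly encoded as UTF-8 bytes...
original = "café"
utf8_bytes = original.encode("utf-8")

# ...but decoded by a layer that wrongly assumes ISO-8859-1 (Latin-1).
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # cafÃ©

# Reversing the wrong step recovers the text, as long as no bytes were lost.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)  # café
```

The same mismatch can occur between any two layers (editor, HTTP header, database connection) that disagree about the encoding.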

David Sklar has a post on his site showing how to fix broken UTF-8 characters in content that has been passed through PHP's normal string functions.

When working on the i18n bits of Learning PHP 7, I had a problem. My example showing how plain string functions such as strtolower() and strtoupper() mangle multibyte UTF-8 characters was making the book formatting/rendering pipeline barf. The processing tools are expecting nicely formatted, valid, UTF-8 encoded HTMLBook files. They didn’t like the mangled invalid UTF-8 characters in my example output.

To fix this, I wrote the following function to replace invalid UTF-8 sequences with the Unicode Replacement Character (U+FFFD).

He includes the code for this function, which walks through the string character by character and checks the bytes it contains to see how they need to be translated. There are plenty of comments in it too, explaining what it's doing as it goes along.
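The idea of substituting U+FFFD for invalid sequences can be sketched in Python, where the built-in decoder already offers this behaviour. This is not the PHP function from the post; the helper name replace_invalid_utf8 and the sample bytes are assumptions made for illustration.

```python
def replace_invalid_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for invalid sequences."""
    return raw.decode("utf-8", errors="replace")


valid = "é".encode("utf-8")  # two bytes: 0xC3 0xA9
truncated = valid[:1]        # a lone lead byte, which is invalid on its own

print(replace_invalid_utf8(b"caf" + valid))            # café
print(ascii(replace_invalid_utf8(b"caf" + truncated)))  # 'caf\ufffd'
```

Well-formed input passes through unchanged, so a helper like this can be applied to output that may or may not have been mangled.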

The Three Devs & A Maybe podcast (with hosts Michael Budd, Fraser Hart, Lewis Cains and Edd Mann) has posted their latest episode (#24) talking about character sets and encodings.

Having only just recently been bitten by the character encoding issue again, we thought it would be a good time to bring it up on the podcast. Starting from the beginning with ASCII, we move on to discuss how 8-bit compatible machines gave rise to the ISO-8859-* standards. This leads us on to Unicode, with the goal to develop a single character-set encoding standard that could support all of the world's scripts. Finally, we discuss the de facto character encoding used on the web today, UTF-8, and the reasons why this is the case.
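To make the progression from ASCII to ISO-8859-1 to UTF-8 concrete, here is a small Python sketch that tries to encode three characters in each encoding. The helper name try_encode and the choice of sample characters are assumptions for the example, not something taken from the episode.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


for ch in ["A", "é", "€"]:
    for encoding in ["ascii", "iso-8859-1", "utf-8"]:
        encoded = try_encode(ch, encoding)
        size = "not representable" if encoded is None else f"{len(encoded)} byte(s)"
        print(f"{ch!r} in {encoding}: {size}")
```

Only UTF-8 can represent all three characters, and it does so while keeping plain ASCII text at one byte per character.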

Some days ago I saw the following fatal error for the first time in my life:

Fatal error: Cannot access property started with '\0' in file.php

After some debugging, I found out that the source of the error was not some strange BOM or UTF-8 characters in PHP source code files. No, it was a combination of protected class properties, object-to-array casting and automatic template property mappings.

As it turns out, there was a change in how object-to-array casting was done in PHP 5.3 that made this break (related to the markers prepended to private and protected property names). He includes a bit of sample code to illustrate the problem: a simple class converted from object to array with direct casting. He does point out that it doesn't happen with get_object_vars, though, as that doesn't do the casting, just extraction.
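When PHP casts an object to an array, protected property names are prefixed with a "*" and private ones with the class name, each surrounded by NUL bytes. The Python sketch below only models the resulting keys and shows one way to recover the bare names; the dictionary contents, the class name Example and the helper strip_visibility_prefix are hypothetical and are not taken from the post.

```python
def strip_visibility_prefix(key: str) -> str:
    """Return the bare property name from a key shaped like a PHP array-cast key."""
    if key.startswith("\0"):
        # Keep only the text after the last NUL byte.
        return key.rsplit("\0", 1)[1]
    return key


# Hypothetical result of casting an object with one property of each visibility.
cast_result = {
    "public_name": 1,
    "\0*\0protected_name": 2,       # protected: NUL, "*", NUL, name
    "\0Example\0private_name": 3,   # private: NUL, class name, NUL, name
}

cleaned = {strip_visibility_prefix(key): value for key, value in cast_result.items()}
print(cleaned)
```

A template layer that maps such keys back onto properties without cleaning them would try to access a property whose name starts with a NUL byte, which is the situation the error message describes.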

On PHPMaster.com today David Shirey has written up a new tutorial introducing the ctype functions in PHP. This set of functions provides a handy way to more correctly check values to ensure they're valid (and contain what they should).

If you have a background in C, then you’re probably already familiar with the character type functions because that is where they come from (don’t forget that PHP is actually written in C). But if you’re into Python, then it’s only fair to point out that the PHP Ctype functions have absolutely nothing to do with Python’s ctypes library. It’s just one of those tragic and totally unavoidable naming similarities.

He briefly explains how the functions work and at least one "gotcha" to watch out for if you're using them for input validation. He then goes through the list of the eleven ctype functions and briefly describes what they do. Some example code is also included showing how you can use them to validate a value based on the true/false return from the function call.
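The true/false validation style described above can be approximated in Python. This is an ASCII-only sketch and not PHP's implementation; the helper names are assumptions, and rejecting the empty string is a choice made for this example.

```python
import string


def all_chars_in(text: str, allowed: str) -> bool:
    """True only if text is non-empty and every character is in the allowed set."""
    return len(text) > 0 and all(ch in allowed for ch in text)


def is_digits(text: str) -> bool:
    # Roughly what a digits-only check does for ASCII input.
    return all_chars_in(text, string.digits)


def is_alpha(text: str) -> bool:
    # Roughly what a letters-only check does for ASCII input.
    return all_chars_in(text, string.ascii_letters)


for candidate in ["12345", "12.5", "abcXYZ", ""]:
    print(repr(candidate), is_digits(candidate), is_alpha(candidate))
```

Note that these checks operate on strings. Passing other types to character-class functions is a common source of surprises, so converting input to a string first keeps the behaviour predictable.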

Gareth Heyes has another post to his site on the topic of "non-alpha PHP code", this time getting a bit more into the process and how his examples are parsed by PHP into more familiar functionality.

My first post on PHP non-alphanumeric code was a bit brief; in the excitement of the discovery I failed to detail the process in depth. I’ve decided to follow up with a tutorial and hopefully explain the process better for anyone wanting to learn or improve the technique. The basis of PHP non-alphanumeric code is to take advantage of the fact that PHP automatically converts Arrays into a string “Array” when used in a string context.

He includes some basic examples showing how, with just a combination of things like "+", "_" and "[" or "]" you can reproduce similar output to echoing out an array and use that "Array" output string to get to other strings (like the letter "B"). There's also a more lengthy example showing how to build up the string "print 1+1" and have it execute using this technique.

On the Refulz.com site they've posted the next part of a series about the basics of using special characters in regular expressions (both in PHP and outside of it).

With this post, we continue to explore regular expressions. The first post of the Learning Regular Expression series introduced regular expressions and covered the regular expression delimiters and the “i” pattern modifier. In the language of regular expressions, certain characters have a special meaning.

In this article they show the use of characters like the caret, asterisk, dot and dollar symbol to modify your expressions to handle special cases, matching for more than one character and the start and end of strings.
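The four special characters mentioned here behave the same way in most regex flavours, so a short Python sketch can illustrate them. The patterns and sample strings are assumptions chosen for the example, and Python does not use the pattern delimiters that PHP's preg functions require.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


# Caret: anchor the match to the start of the string.
print(matches(r"^PHP", "PHP rocks"), matches(r"^PHP", "I like PHP"))

# Dollar: anchor the match to the end of the string.
print(matches(r"PHP$", "I like PHP"), matches(r"PHP$", "PHP rocks"))

# Dot: match exactly one character of any kind (except a newline).
print(matches(r"^c.t$", "cat"), matches(r"^c.t$", "ct"))

# Asterisk: match the preceding item zero or more times.
print(matches(r"^ab*c$", "ac"), matches(r"^ab*c$", "abbbc"))
```

To match one of these characters literally, escape it with a backslash, for example `\.` for a real dot.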

On Reddit.com there's a recent post with a growing discussion about character encodings in PHP applications (with various recommendations).

I would rather not have to convert these weird characters to the HTML character entities, if possible. I'd rather be able to use these characters directly on the web page. If this is for some reason a bad idea, let me know. This might be more of a general web design question (i already posted it there), but I figured it is still appropriate to post here as well since PHP is used to pull an entry from the database, and I figured a lot of you here would know the answer to the question.

The general consensus is to use UTF-8 in this case, but there are a few reminders for the poster too.

Character sets can be confusing at the best of times. This post aims to explain the potential problems and suggest solutions. Although this is applied to PHP and a typical LAMP stack you can apply the same principles to any multi-tier stack.

He includes a "boring history" section (and recommends skipping it if you just want the good stuff) that talks a bit about character sets and how computer systems have handled them over time. All that said, he recommends using UTF-8 to ease your character encoding woes. He talks about configuring your editor to support it, making sure your browsers understand it and setting up your MySQL database connection to use it.

On WebReference.com there's a new tutorial posted about localizing your website by defining a character set to use for your content.

The process of making your applications/websites usable in many different locales is called internationalization, while customizing your code for different locales is called localization. Localization is the process of making your applications or websites local to where they are being viewed. For example, you can make a website more local to a particular place by converting its text to the predominant language of that location and by displaying the local time (e.g. German for people living in Germany or French for people living in France).

They show how to define constants that can be used in your application for the character set and language. They use two major encodings, UTF-8 and ISO-8859-1, in their examples, which show a sample "welcome" message in different languages. There's also a simple page to show you how to switch between languages if you'd like to give your visitors the option.
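The general shape of that approach, constants for the character set and language plus a per-language welcome message, might look like the following Python sketch. It is not the tutorial's code; every constant, function name and translation here is an assumption made for illustration.

```python
# Hypothetical application-wide defaults.
DEFAULT_CHARSET = "UTF-8"
DEFAULT_LANGUAGE = "en"

# Hypothetical welcome messages keyed by language code.
WELCOME = {
    "en": "Welcome",
    "de": "Willkommen",
    "fr": "Bienvenue",
}


def content_type_header(charset: str = DEFAULT_CHARSET) -> str:
    """Build the HTTP header that tells the browser which charset the page uses."""
    return f"Content-Type: text/html; charset={charset}"


def welcome_message(language: str) -> str:
    """Return the welcome text for a language, falling back to the default."""
    return WELCOME.get(language, WELCOME[DEFAULT_LANGUAGE])


print(content_type_header())
print(welcome_message("de"))
print(welcome_message("xx"))  # unknown language falls back to English
```

Keeping the charset in one constant makes it easier to ensure that the header, the markup and the stored text all agree on the same encoding.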