MySQL and storing emoji: proper utf8mb4 with Django

Let me guess: you just bumped into the Incorrect string value: '\x01\x9D\x8C\x86' for column 'column_name' at row 1 error. Fantastic! Looks like you want to store some emoji in that MySQL database of yours. You’ll need to make sure your columns are utf8mb4 encoded. Read on.

Why do I want utf8mb4?

Because you want to store emoji. That’s it, most of the time.

What’s the deal here?

Current posts tell you to just switch your whole database/table to utf8mb4 as default encoding. You most likely don’t want this. It’s a crappy, lazy approach, honestly. We wanted to provide emoji support in the app that powers /docs - which happens to be a modest Django app - and all resources we found online insisted on shooting the just-go-utf8mb4-everywhere cannon. But you can do better.

About utf8mb4

If you want to store face with OK gesture 🙆, face with look of triumph 😤, information desk person 💁 or others, you have to use utf8mb4. The other encoding you may be tempted to use - utf8 - only stores code points up to 3 bytes, and Emoji fall into the 4-byte family.

Bonus: Which collation to use?

You’ll want utf8mb4_unicode_ci on these columns and NOT the bare, 3-byte utf8_unicode_ci.

Creating a custom, empty migration

Django migrations don’t support encoding changes as part of their automatic change detection/generation, and you can’t specify an encoding when defining a model’s field, so we’re going to have to generate an empty migration ourselves and do some raw SQL action on it.

Indexing utf8mb4 columns

See here to understand how this affects your columns maximum index size.

Bonus: why not just escape?

You may have come up with working around this by encoding the emojis somehow - HTML entities most likely. No one will stop you, and it will work - but should you ever need to present that data in a format that doesn’t support HTML character entities, you’re in for more hacks. Storing the raw data as-is will let you sidestep building hack upon hack in some Android app along the way.

Resources

For all things utf8mb4, this fantastic post by Mathias Bynens should be your go-to and launchpad into other related topics. And make sure you dig into the comments, which are another fantastic adventure entirely.

Check out this SO question for a second opinion on the differences between UTF-8 collations.