Day 60: Emoji Detector Library for PHP #100DaysOfIndieWeb

I wanted to find all emoji in a string, including info about them, for my next #100Days project. However, I couldn't find a library that does this. The closest I found were iamcal's Emoji conversion library, which can replace emoji in a string with HTML tags, and the EmojiOne library, which can replace emoji in a string with shortcodes.

If you aren't familiar with the details of emoji, Unicode and UTF-8 encoding, then what you probably don't realize is that an emoji such as 👨‍👩‍👦‍👦 is actually composed of seven Unicode code points. Each person is a separate character, and they are all connected with the "Zero Width Joiner" (ZWJ) character. This ends up being seven code points in total: 👨 [ZWJ] 👩 [ZWJ] 👦 [ZWJ] 👦. There are also skin tone modifiers, which are their own characters. So an emoji like 👍🏼 is actually two code points: the 👍 plus the skin-tone-3 modifier.
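To make that concrete, here is a minimal sketch that splits a string into its code points and prints each one in hex. It assumes PHP 7.2+ with the mbstring extension; `code_points_hex()` is a hypothetical helper name for illustration, not a function from any of the libraries mentioned here.

```php
<?php
// Hypothetical helper: return the Unicode code points of a UTF-8 string
// as uppercase hex strings, padded to at least four digits.
function code_points_hex(string $text): array {
    // The "u" flag makes PCRE treat the subject as UTF-8, so splitting on
    // an empty pattern yields one array element per code point.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_map(function ($ch) {
        return sprintf('%04X', mb_ord($ch, 'UTF-8'));
    }, $chars);
}

// Family emoji: man, ZWJ, woman, ZWJ, boy, ZWJ, boy.
$family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F466}\u{200D}\u{1F466}";
echo implode(' ', code_points_hex($family)), "\n";
// prints: 1F468 200D 1F469 200D 1F466 200D 1F466

// Thumbs up followed by a skin tone modifier.
$thumbs = "\u{1F44D}\u{1F3FC}";
echo implode(' ', code_points_hex($thumbs)), "\n";
// prints: 1F44D 1F3FC
```

The `200D` entries are the ZWJ characters, which is why what looks like one family emoji counts as seven code points.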

To further complicate things, I've been talking about Unicode code points, but it turns out these code points can be represented in several different ways in a string depending on the string encoding. Typically we only need to worry about handling UTF-8 encoded strings now, so that's where I started. The UTF-8 encoding of a character like "A" is the same as the ASCII encoding of the character, using only one byte. However, a character such as 👍 requires four bytes to represent. This means actually finding meaningful emoji in a string is not as simple as reading byte by byte, and is not even as simple as reading one UTF-8 character at a time.
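A quick way to see the difference between bytes and code points is to compare `strlen()` (which counts bytes) with `mb_strlen()` (which counts code points). This is only an illustrative sketch, assuming PHP 7.0+ with mbstring; the `$samples` array and its labels are made up for the example.

```php
<?php
// Sample strings, written with \u{...} escapes so the source file's own
// encoding doesn't matter.
$samples = [
    'A'         => 'A',
    'thumbs up' => "\u{1F44D}",
    'ZWJ'       => "\u{200D}",
    'family'    => "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F466}\u{200D}\u{1F466}",
];

foreach ($samples as $label => $s) {
    printf(
        "%-10s bytes=%2d code_points=%d hex=%s\n",
        $label,
        strlen($s),              // number of bytes in the UTF-8 encoding
        mb_strlen($s, 'UTF-8'),  // number of Unicode code points
        bin2hex($s)              // raw bytes, as hex
    );
}
// "A" is 1 byte (41), the thumbs up is 4 bytes (f09f918d),
// the ZWJ is 3 bytes (e2808d), and the family emoji is
// 25 bytes spread over 7 code points.
```

Neither count equals "one emoji" for the family example, which is why detecting emoji needs knowledge of the valid sequences rather than just a byte or character loop.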

Thankfully, EmojiOne has done the hard work of finding the emoji characters in a string. However, their library doesn't have a way to return the emoji it finds; it can only be used to replace them. I also didn't like the list of short names they use, and prefer the Slack names instead.

num_points - The number of Unicode code points that this emoji is composed of.

points_hex - An array of each Unicode code point that makes up this emoji. These are returned as hex strings. This will also include "invisible" characters such as the ZWJ character and skin tone modifiers.

hex_str - A list of all Unicode code points in their hex form, separated by hyphens. This string is present in the Slack emoji data array.

skin_tone - If a skin tone modifier was used in the emoji, this field indicates which skin tone, since the short_name will not include the skin tone.
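
As a rough illustration of how these fields relate to each other, here is a sketch that derives them for a single emoji that has already been isolated from the surrounding text. This is not the library's actual code: `describe_emoji()` is a hypothetical function, and the skin tone labels are an assumed mapping based on the Slack-style names (`skin-tone-2` through `skin-tone-6`). It assumes PHP 7.2+ with mbstring.

```php
<?php
// Hypothetical helper: build the fields described above for ONE emoji.
// Finding where each emoji starts and ends in a longer string is the hard
// part, and is not covered by this sketch.
function describe_emoji(string $emoji): array {
    // Split into code points ("u" flag = treat the subject as UTF-8).
    $chars = preg_split('//u', $emoji, -1, PREG_SPLIT_NO_EMPTY);
    $points = array_map(function ($ch) {
        return sprintf('%04X', mb_ord($ch, 'UTF-8'));
    }, $chars);

    // Assumed mapping from modifier code point to a Slack-style label.
    $tones = [
        '1F3FB' => 'skin-tone-2',
        '1F3FC' => 'skin-tone-3',
        '1F3FD' => 'skin-tone-4',
        '1F3FE' => 'skin-tone-5',
        '1F3FF' => 'skin-tone-6',
    ];
    $skinTone = null;
    foreach ($points as $point) {
        if (isset($tones[$point])) {
            $skinTone = $tones[$point];
            break;
        }
    }

    return [
        'num_points' => count($points),
        'points_hex' => $points,
        'hex_str'    => implode('-', $points),
        'skin_tone'  => $skinTone, // null when no modifier is present
    ];
}

$info = describe_emoji("\u{1F44D}\u{1F3FC}");
echo $info['num_points'], ' ', $info['hex_str'], ' ', $info['skin_tone'], "\n";
// prints: 2 1F44D-1F3FC skin-tone-3
```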