Anyways, the Unicode codepoint range goes from U+0000 to U+10FFFF, which is over 1.1 million codepoints, and these are divided into 17 groups called planes. Each plane holds 65,536 codepoints (16^4). The first plane is the Basic Multilingual Plane (U+0000 through U+FFFF) and contains all the common symbols we use every day and then some. The rest of the planes require more than 4 hexadecimal digits and are called supplementary planes or astral planes. I have no idea if there’s a good reason for the name “astral plane.” Sometimes, I think people come up with these names just to add excitement to their lives.
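To make the plane arithmetic concrete, here’s a tiny sketch. The planeOf helper is just a name I made up for illustration; it isn’t a built-in.

```javascript
// Hypothetical helper: which plane does a codepoint live in?
// Each plane holds 0x10000 (65,536) codepoints.
function planeOf(codePoint) {
  return Math.floor(codePoint / 0x10000);
}

console.log(planeOf(0x0041));   // 0  ("A" is in the Basic Multilingual Plane)
console.log(planeOf(0x1F4A9));  // 1  (the poo emoji is in an astral plane)
console.log(planeOf(0x10FFFF)); // 16 (the very last codepoint)

// Total number of planes: 0x110000 codepoints / 0x10000 per plane.
console.log((0x10FFFF + 1) / 0x10000); // 17
```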

We can express characters in a couple of different ways: "A" === "\u0041" === "\x41" === "\u{41}". These are escape sequences. The \x form takes exactly two hexadecimal digits, so it only covers a small slice of the Basic Multilingual Plane, specifically U+0000 to U+00FF. The \u form can be used for any Unicode character. Without curly braces, \u must be followed by exactly 4 hexadecimal digits; with curly braces it accepts one or more digits, up to a value of 10FFFF, so the braces are required for anything above U+FFFF. This is for JavaScript string literals, by the way (HTML uses character references like &#x41; instead). Other languages have their own sets of rules.
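Here’s a quick way to convince yourself of those equalities. It only uses plain string literals and the built-in String.fromCodePoint; the variable name forms is just for this example.

```javascript
// All four literals produce the same one-character string.
const forms = ["A", "\u0041", "\x41", "\u{41}"];
console.log(forms.every((s) => s === "A")); // true

// \x takes exactly two hex digits, so it tops out at U+00FF.
console.log("\xE9" === "\u00E9"); // true (both are "é")

// Above U+FFFF, only the braced \u{...} form can name the codepoint directly.
console.log("\u{1F4A9}" === String.fromCodePoint(0x1F4A9)); // true
```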

And "💩" === "\u{1F4A9}". Unfortunately, this is also true: "💩" === "\uD83D\uDCA9". What is this nonsense? All astral codepoints can also be represented by “surrogate pairs”, and this is used for backwards compatibility reasons. JavaScript strings are sequences of 16-bit code units, and .length counts those units rather than codepoints, which is why "💩".length === 2. There’s a formula to calculate surrogates from astral codepoints, and vice versa.

A codepoint C greater than 0xFFFF corresponds to a surrogate pair <H, L>.
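As a sketch of that conversion, here is the standard UTF-16 arithmetic written out in both directions. The function names toSurrogates and fromSurrogates are my own placeholders, and the code assumes the input really is an astral codepoint (or a valid pair).

```javascript
// Codepoint -> surrogate pair. Assumes codePoint is above 0xFFFF.
function toSurrogates(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + Math.floor(offset / 0x400); // top 10 bits
  const low = 0xDC00 + (offset % 0x400);            // bottom 10 bits
  return [high, low];
}

// Surrogate pair -> codepoint. Assumes a valid high/low pair.
function fromSurrogates(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const [h, l] = toSurrogates(0x1F4A9);
console.log(h.toString(16), l.toString(16));              // d83d dca9
console.log(fromSurrogates(0xD83D, 0xDCA9).toString(16)); // 1f4a9
```

That matches the "\uD83D\uDCA9" pair from earlier.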

So, is there a solution that counts symbols correctly? Bynens lists a couple possibilities. Array.from shows some promise. It’s succinct, works in Node, and is generally well supported across browsers, except IE11 and below, I think.

Array.from("💩").length === 1; //hooray!

However:

Array.from("❤️").length === 2; //boooo!

From what I understand, ❤️ is made up of two codepoints: U+2764 and U+FE0F. The first is Heavy Black Heart. The second is Variation Selector 16, which changes the appearance of the preceding character! Ugh. In fact, U+FE00 through U+FE0F can all change the appearance of the previous character. In the case of this heart, only U+FE0F does. Other hearts (💙, 💚, 💛, 💜) each have their own codepoint, but the red heart requires two. I’m not sure why, but I asked on Stack Overflow.
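If you want to see this for yourself, a small sketch like the following lists the codepoints inside a string. The codePointsOf helper is a made-up name; it leans on Array.from iterating by codepoint and on the built-in codePointAt.

```javascript
// Hypothetical helper: list a string's codepoints as U+XXXX labels.
function codePointsOf(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// The red heart, written with escapes so the selector is visible.
console.log(codePointsOf("\u2764\uFE0F").join(" ")); // U+2764 U+FE0F

// The blue heart is a single astral codepoint.
console.log(codePointsOf("\u{1F499}").join(" "));    // U+1F499
```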

To accommodate this I can change my code to something horrible like this:
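A rough sketch of that idea: strip out the variation selectors before handing the string to Array.from. The name countWithoutVariationSelectors is only a placeholder, and this assumes every codepoint in U+FE00 through U+FE0F should simply be ignored when counting.

```javascript
// Placeholder name; a sketch under the assumption stated above.
function countWithoutVariationSelectors(str) {
  // Drop U+FE00 through U+FE0F, then count the remaining codepoints.
  const stripped = str.replace(/[\uFE00-\uFE0F]/g, "");
  return Array.from(stripped).length;
}

console.log(countWithoutVariationSelectors("\u2764\uFE0F")); // 1 (red heart)
console.log(countWithoutVariationSelectors("\u{1F4A9}"));    // 1 (poo)
```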

👩‍❤️‍💋‍👩 is created by placing a Zero Width Joiner (\u{200D}) character between the component emojis. So we could do something like this:

    function fancyCount2(str) {
      const joiner = "\u{200D}";
      const split = str.split(joiner);
      let count = 0;
      for (const s of split) {
        // removing the variation selectors
        const num = Array.from(s.split(/[\ufe00-\ufe0f]/).join("")).length;
        count += num;
      }
      // assuming the joiners are used appropriately
      return count / split.length;
    }

Honestly, you should probably never use this. Not all browsers, UIs, etc. even render 👩‍❤️‍💋‍👩 as a single symbol. The code assumes the joiners are used between characters appropriately, which could be very problematic. During my research I noticed that U+200C (Zero Width Non-Joiner) also affects how neighboring characters combine, by preventing ligatures, and fancyCount2 isn’t accounting for that. That’s an easy adjustment, but I’m sure there are even more modifiers and joiners that I’ve never heard of. I know you’re on the edge of your seat waiting for the ultimate solution, but this rabbit hole is too deep! Sorry for the disappointment, but if you know of a more robust, comprehensive character counter, I’d love to hear from you!