Using .encode('latin1').decode('utf8) I've been able to deal with most characters such as "é" or "à" and display them correctly. But I'm having a hard time with emojis, as they seem to be encoded in a different way.

Example of a problematic emoji : \u00f3\u00be\u008c\u00ba

The encoding/decoding does not yield any errors, but Tkinter is not willing to display what the function outputs and gives "_tkinter.TclError: character U+fe33a is above the range (U+0000-U+FFFF) allowed by Tcl". Tkinter is not yet this issue thought because trying to display the same emoji in the consol yields "ó¾º" which clearly isn't what's supposed to be displayed ( it's supposed to be a crying face)

I've tried using the emoji library but it doesn't seem to help any

>>> print(emoji.emojize("\u00f3\u00be\u008c\u00ba"))
'ó¾º'

How can I retrieve the proper emoji and display it?
If it's not possible, how can I detect problematic emojis to maybe sanitize and remove them from the JSON in the first place?

1 Answer
1

.encode('latin1').decode('utf8) is correct - it results in the codepoint U+fe33a("󾌺"). This codepoint is in a Private Use Area (PUA) (specifically Supplemental Private Use Area-A), so everyone can assign his own meaning to that codepoint (Maybe facebook wanted to use a crying face, when there wasn't yet one in Unicode, so they used PUA?).