Category Archives: python

Before I get too far I don’t actually analysis taco emojis. At least not yet. I, however, give you the tools to start parsing them from tweets, text or anything you can get into Python.

This past month Apple released their iOS 9.1 and their latest OS X 10.11.1 El Capitan update. That updated included a bunch of new emojis. I’ve made a quick primer on how to handle emoji analysis in Python. Then when Apple released an update to their emojis to include the diversity, I updated my small Python class for emoji counting to include to the newest emojis. I also looked at what is actually happening with the unicode when diversity modifier patches are used.

With this latest update, Apple and the Unicode Consortium didn’t really introduce any new concepts, but I did update the Python class to include the newest emojis. In my GitHub the data folder includes a text file with all the emojis delimitated by ‘\n’. The class uses this file to find any emoji’s in a unicode string which has been passed to the add_emoji_count() method.

Building off of the diversity emoji update, I added a skin_tone_dict property of the EmojiDict class. This property returns a dictionary with the number of unique human emojis per tweet and their skin tones. This property will not catch multiple human emojis written if they in the same execution of the add_emoji_count() method

Python

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

importsocialmediaparse assmp#loads the package

counter=smp.EmojiDict()#initializes the EmojiDict class

#goes through list of unicode objects calling the add_emoji_count method for each string

#the method keeps track of the emoji count in the attributes of the instance

forunicode_string incollection:

counter.add_emoji_count(unicode_string)

#output of the instance

printcounter.dict_total#dict of the absolute total count of the emojis in corpus

printcounter.dict#dict of the count of strings with the emoji in corpus

printcounter.baskets#list of lists, emoji in each string. one list for each string.

printcounter.skin_tones_dict#dictionary of unique emoji emojis aggregated by the counter.

Above is an example of how to use the new attribute. It is a dictionary so you can work that into your analysis however you like. I will eventually create better methods and outputs to make this feature more robust and useful.

I haven’t been updating this site often since I’ve started to perform a similar job over at FanGraphs. All non-baseball stat work that I do will continued to be housed here.

Over the past week, Apple has implemented new emojis with a focus on diversity in their iOS 8.3 and the OS X 10.10.3 update. I’ve written quite a bit about the underpinnings of emojis and how to get Python to run text analytics on them. The new emojis provide another opportunity to gain insights on how people interact, feel, or use them. Like always, I prefer to use Python for any web scraping or data processing, and emoji processing is no exception. I already wrote a basic primer on how to get Python to find emoji in your text. If you combine the tutorials I have for tweet scraping, MongoDB and emoji analysis, you have yourself a really nice suite of data analysis.

Modifier Patch

These new emojis are a product of the Unicode Consortium’s plan for how to incorporate racial diversity into the previously all-white human emoji line up. (And yes, there’s a consortium for emoji planning.) The method used to produce new emojis isn’t quite as simple as just making a new character/emoji. Instead, they decided to include a modifier patch at the end of human emojis to indicate skin color. As a end-user, this won’t affect you if you have all the software updates and your device can render the new emojis. However, if you don’t have the updates, you’ll get something that looks like this:

That box at the end of the emoji is the modifier patch. Essentially what is happening here is that there is a default emoji (in this case it’s the old man) and the modifier patch (the box). For older systems it doesn’t display, because the old system doesn’t know how to interpret this new data. This method actually allows the emojis to be backwards compatible, since it still conveys at least part of the meaning of the emoji. If you have the new updates, you will see the top row of emoji.

Using a little manipulation (copying and pasting) using my newly updated iPhone we can figure out this is what really is going on for emojis. There are five skin color patches available to be added to each emoji, which is demonstrated on the bottom row of emoji. Now you might notice there are a lot of yellow emoji. Yellow emojis (Simpsons) are now the default. This is so that no single real skin tone is the default. The yellow emojis have no modifier patch attached to them, so if you simply upgrade your phone and computer and then go back and look at old texts, all the emojis with people in them are now yellow.

New Families

The new emoji update also includes new families. These are also a little different, since they are essentially combinations of other emoji. The original family emoji is one single emoji, but the new families with multiple children and various combinations of children and partners contain multiple emojis. The graphic below demonstrates this.

The man, woman, girl and boy emoji are combined to form that specific family emoji. I’ve seen criticisms about the families not being multiracial. I’d have to believe the limitation here is a technical one, since I don’t believe the Unicode consortium has an effective method to apply modifier patches and combine multiple emojis at once. That would result in a unmanageable number of glyphs in the font set to represent the characters. (625 different combinations for just one given family of 4, and there are many different families with different gender iterations.)

New Analysis

So now that we have the background on the how the new emojis work, we can update how we’ve searched and analyzed them. I have updated my emoji .csv file, so that anyone can download that and run a basic search within your text corpus. I have also updated my github to have this file as well for the socialmediaparse library I built.

The modifier patches are searchable, so now you can search for certain swatches (or lack there of). Below I have written out the unicode escape output for the default (yellow) man emoji and its light-skinned variation. The emoji with a human skin color has that extra piece of code at the end.

Python

1

2

3

#unicode escape

\U0001f468#unmodified man

\U0001f468\U0001f3fb#light-skinned man

Here are all the modifier patches as unicode escape.

Python

1

2

3

4

5

6

#modifier patch unicode escape

\U0001f3fb#skin tone 1 (lightest)

\U0001f3fc#skin tone 2

\U0001f3fd#skin tone 3

\U0001f3fe#skin tone 4

\U0001f3ff#skin tone 5 (darkest)

The easiest way to search for these is to use the following snippet of code:

Python

1

2

3

4

5

#searches for any emoji with skin tone 5

unicode_object=u'Some text with emoji in it as a unicode object not str!'

if'\U0001f3ff'inunicode_object.encode('unicode_escape'):

#do something

You can throw that snippet into a for loop for a Pandas data frame or a MongoDB cursor. I’m planning on updating my socialmediaparse library with patch searching, and I’ll update this post when I do that.

On Twitter, about 10% of general-topic tweets contain emoji characters, the tiny icons and emoticons, which are starting to get more attention when analyzing tweets, Facebook messages, or text messages. An emoji [] can capture an emotion or completely change the meaning of the written text. Before exploring how different emojis are used and what they mean to people, I wanted to get an idea of how prevalent they are and which ones are the most popular on Twitter.

I collected tweets using a sampled stream from Twitter. In order to get a general representative sample of tweets, I tracked five popular, basic words: ‘the’, ‘and’, ‘to’, ‘you’, and ‘it’. These words are good search words, since there aren’t many sentences or thoughts that don’t use them. A Python script was used to find and count all the the emojis present in a collection of over 100,000 tweets. To avoid skewing due to a popular celebrity or viral tweet, I removed any retweets which were obvious retweets, and not retweets which function more like mentions.

Results

In the general collection of tweets, I found that 10.23% of tweets contained at least one emoji. So there isn’t an overwhelming number of tweets which contain an emoji, but 10% of Twitter content is a significant portion. The ‘Emoji Selection’ graph shows the percentage of tweets containing that particular emoji out of the tweets that HAD an emoji in it. The most popular emoji by and far was the ‘tears of joy’ emoji followed by the ‘loudly crying’ emoji . Heart-related emoji [the ones I thought would prove most popular] was third and fourth.

Since I only collected these over the course of a day and not over several weeks or months, I would be hesitant to think these results would hold up over time. An event or seasonality can trigger a cascade of people using a certain emoji. For example, the Christmas tree emoji was popular being present in 2.16% of tweets that included emojis; this would be expected to get larger as we get closer to Christmas and smaller after Christmas. Another interesting find is that the emoji ranks high. My pure conjecture is that this emoji’s high use rate is due to protests in Ferguson and around the country. To confirm this I would need a sample of tweets from before the grand jury announcement or track the use as time passes.

Further analysis could utilize emoji groups or clusters. Emojis with similar meanings would not necessarily produce a high number if people spread their selection over 5 emoji instead of one. I plan to update this and expand on this as time passes and I’m able to collect more data.

Technical

In order to avoid any conflicts with ASCII conversions that some Python or R packages do on Twitter data, I stored tweets from the Twitter Streaming API directly into a MongoDB database, which encodes strings in UTF-8. Since tweets come from the API as a JSON object, they can be naturally stored in the document-orientated database with each metadata field in the tweet being accessible without parsing the entire tweet into a data frame or SQL database. Retweets were removed by finding any tweets with ‘RT’ in the first two characters of the text entry. This is how Twitter represents automatic retweets in JSON format.

Also since I collected 103,416 tweets the margin of error for any of the proportions given are well below 1%. Events within the social network would definitely outweigh any margin of error.

I have updated [better] code that allows for easy counting of emoji’s in string objects in Python, it can be found on my GitHub. I have a two counting classes in a mini-package loaded there.

Emoji [], those ubiquitous emoticons that popped up when iPhone users found them in 2011 with iOS 5 are a different set of characters aside from the traditional alphanumeric and punctuation characters. These are essentially another alphabet, and this concept will be useful when using the emoji in Python. Emoji are NOT a font like Wingdings from Windows95, they are unique characters with no corresponding letter or symbol representation. If you have a document or webpage that has the Wingding font, you can simply change the font to a typical Latin font to see the normal characters the Wingding font represents.

Technical Background

Without getting into the technical encoding problems, emoji are defined in Unicode and UTF-8, which can represent just about a million characters. A lot of applications or software packages default to ASCII, which only encodes the typical 128 characters. Some Python IDEs, csv writing packages, or parsing software default to or translate to ASCII, so they don’t necessarily handle the emoji characters properly.

I wrote a Python script [or this Python ‘package’] that takes tweets that are stored in a MongoDB database (more on that later) and counts the number of different emoji in the tweet corpus. To make sure Python plays nice with the emojis, first I loaded in the data by making sure I had UTF-8 encoding specified otherwise you’ll get this encoding error:

I loaded an emoji key I made using all the emoji’s in Apple’s implementation by loading this code into a Panda’s data frame:

Python

1

emoji_key=pd.read_csv('emoji_table.txt',encoding='utf-8',index_col=0)

If Python loads you data in correctly with UTF-8 encoding, each emoji will be treated as separate unique character, so string function and regular expressions can be used to find the emoji’s in other strings such as Twitter text. In some IDEs emoji’s don’t display [Canopy] or don’t display well [PyCharm]. I remedied the invisible/messy emoji’s by running the script in Mac OS X’s terminal application, which displays emoji . Python can also produce an ASCII compliant string by using a unicode escape encoding:

Python

1

unicode_object.encode('unicode_escape')

The escape encoded string will display something like this:

Python

1

\U0001f604

All IDEs will display the ASCII string. You would need to decode it from the unicode escape to get it back into a unicode object. Ultimately I had a Pandas data frame containing unicode objects. To make sure the correct encoding was used on the output text file, I used the following code:

Python

1

2

withopen('emoji_out.csv','w')asf:

emoji_count.to_csv(f,sep=',',index=False,encoding='utf-8')

Emoji Counter Class

I made an emoji counter class in Python to simplify the process of counting and aggregating emoji counts. The code [socialmediaparse] is on my GitHub along with the necessary emoji data file, so it can load the key when the instance is created. Using the package, you can repeatedly call the add_emoji_count() method to change the internal count for each emoji. The results can be retrieved using the .dict, dict_total, and .baskets attributes of the instance. I wrote this because it organizes and simplifies the analysis for any social media or emoji application. Separate emoji dictionary counter objects can be created for different sets of tweets that someone would want to analyze.

Python

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

importsocialmediaparse assmp#loads the package

counter=smp.EmojiDict()#initializes the EmojiDict class

#goes through list of unicode objects calling the add_emoji_count method for each string

#the method keeps track of the emoji count in the attributes of the instance

forunicode_string incollection:

counter.add_emoji_count(unicode_string)

#output of the instance

printcounter.dict_total#dict of the absolute total count of the emojis in corpus

printcounter.dict#dict of the count of strings with the emoji in corpus

printcounter.baskets#list of lists, emoji in each string. one list for each string.

counter.create_csv(file='emoji_out.csv')#method for creating csv

Project

MongoDB was used for this project because the data stores the JSON files very well, not needing a parser or a csv writer. It also has the advantage of natively storing strings in UTF-8. If I used R’s StreamR csv parser, there would be many encoding errors and virtually no emoji’s present in the data. There might be possible work arounds, but MongoDB was the easiest way I’ve found to work with Twitter JSON, UTF-8 encoded data.