Sentiment analysis on Trump's tweets using Python 🐍

Extract twitter data using tweepy and learn how to handle it using pandas.

Do some basic statistics and visualizations with numpy, matplotlib and seaborn.

Do sentiment analysis of extracted (Trump's) tweets using textblob.

Phew! It's been a while since I wrote something kinda nice. I hope you find this a bit useful and/or interesting. This is based on a workshop I taught in Mexico City. I'll explain the whole post along with code, in the most simple way possible. Anyway, all the code can be found in the repo I used for this workshop.

What will we need?

First of all, we need to have Python installed.
I'm almost sure that all the code will run in Python 2.7, but I'll use Python 3.6. I highly recommend installing Anaconda, a very useful Python distribution for managing packages that includes a lot of handy tools, such as Jupyter Notebooks. I'll explain the code assuming we're using a Jupyter Notebook, but it will also run if you are writing a simple script from your text editor. You'll just need to adapt it (it's not hard).

The requirements that we'll need to install are:

NumPy: This is the fundamental package for scientific computing with Python. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data.

Pandas: This is an open source library providing high-performance, easy-to-use data structures and data analysis tools.

Tweepy: This is an easy-to-use Python library for accessing the Twitter API.

Matplotlib: This is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

Seaborn: This is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

Textblob: This is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks.

All of them are "pip installable". At the end of this article you'll be able to find more references about these Python libraries.
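For instance, assuming you have a working pip, the whole set can be installed in one go (these are the standard PyPI package names):

```shell
# Install every dependency used in this post from PyPI:
pip install numpy pandas tweepy matplotlib seaborn textblob
```

If you use Anaconda, most of these (except tweepy and textblob) come preinstalled.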

Now that we have all the requirements, let's get started!

1. Extracting twitter data (tweepy + pandas)

1.1. Importing our libraries

This will be the most difficult part of all the post... 😥
Just kidding, obviously it won't. It'll be just as easy as copying and pasting the following code in your notebook:

```python
# General:
import tweepy           # To consume Twitter's API
import pandas as pd     # To handle data
import numpy as np      # For number computing

# For plotting and visualization:
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

Excellent! We can now just run this cell of code and go to the next subsection.

1.2. Creating a Twitter App

In order to extract tweets for a posterior analysis, we need to log in to our Twitter account and create an app. The website to do this is https://apps.twitter.com/. (If you don't know how to do this, you can follow this tutorial video to create an account and an application.)

From the app we're creating, we will save the following information in a script called credentials.py:
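A minimal sketch of what credentials.py might look like (the placeholder strings are obviously not real keys; replace them with the values from your own app on apps.twitter.com):

```python
# Twitter App access keys

# Consume:
CONSUMER_KEY    = "YOUR CONSUMER KEY"
CONSUMER_SECRET = "YOUR CONSUMER SECRET"

# Access:
ACCESS_TOKEN  = "YOUR ACCESS TOKEN"
ACCESS_SECRET = "YOUR ACCESS SECRET"
```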

The reason for creating this extra file is that we want to export only the values of these variables while keeping them unseen in our main code (our notebook). We are now able to consume Twitter's API. In order to do this, we will create a function that handles our key authentication. We will add this function in another code cell and run it:

```python
# We import our access keys:
from credentials import *    # This will allow us to use the keys as variables

# API's setup:
def twitter_setup():
    """
    Utility function to setup the Twitter's API
    with our access keys provided.
    """
    # Authentication and access using keys:
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

    # Return API with authentication:
    api = tweepy.API(auth)
    return api
```

So far, so easy right? We're good to extract tweets in the next section.

1.3. Tweets extraction

Now that we've created a function to set up the Twitter API, we can use it to create an "extractor" object. After this, we will use Tweepy's function extractor.user_timeline(screen_name, count) to extract count tweets from the user screen_name.

As it is mentioned in the title, I've chosen @realDonaldTrump as the user to extract data for a posterior analysis. Yeah, we wanna keep it interesting, LOL.

The way to extract Twitter's data is as follows:

```python
# We create an extractor object:
extractor = twitter_setup()

# We create a tweet list as follows:
tweets = extractor.user_timeline(screen_name="realDonaldTrump", count=200)
print("Number of tweets extracted: {}.\n".format(len(tweets)))

# We print the most recent 5 tweets:
print("5 recent tweets:\n")
for tweet in tweets[:5]:
    print(tweet.text)
    print()
```

With this we will have an output similar to this one, and we can compare it against the Twitter account (to check that we're being consistent):

```
Number of tweets extracted: 200.

5 recent tweets:

On behalf of @FLOTUS Melania & myself, THANK YOU for today's update & GREAT WORK! #SouthernBaptist @SendRelief,… https://t.co/4yZCeXCt6n

I will be going to Texas and Louisiana tomorrow with First Lady. Great progress being made! Spending weekend working at White House.

Stock Market up 5 months in a row!

'President Donald J. Trump Proclaims September 3, 2017, as a National Day of Prayer' #HurricaneHarvey #PrayForTexas… https://t.co/tOMfFWwEsN

Texas is healing fast thanks to all of the great men & women who have been working so hard. But still so much to do. Will be back tomorrow!
```

We now have an extractor and extracted data, which is listed in the tweets variable. I must mention at this point that each element in that list is a tweet object from Tweepy, and we will learn how to handle this data in the next subsection.

1.4. Creating a (pandas) DataFrame

We now have initial information to construct a pandas DataFrame, in order to manipulate the info in a very easy way.

IPython's display function plots an output in a friendly way, and the head method of a dataframe allows us to visualize the first 5 elements of the dataframe (or however many elements are passed as an argument).

So, using Python's list comprehension:

```python
# We create a pandas dataframe as follows:
data = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweets'])

# We display the first 10 elements of the dataframe:
display(data.head(10))
```

This will create an output similar to this:

|   | Tweets |
|---|--------|
| 0 | On behalf of @FLOTUS Melania & myself, THA... |
| 1 | I will be going to Texas and Louisiana tomorro... |
| 2 | Stock Market up 5 months in a row! |
| 3 | 'President Donald J. Trump Proclaims September... |
| 4 | Texas is healing fast thanks to all of the gre... |
| 5 | ...get things done at a record clip. Many big ... |
| 6 | General John Kelly is doing a great job as Chi... |
| 7 | Wow, looks like James Comey exonerated Hillary... |
| 8 | THANK YOU to all of the incredible HEROES in T... |
| 9 | RT @FoxNews: .@KellyannePolls on Harvey recove... |

So we now have a nice table with ordered data.

An interesting thing is the number of internal methods and attributes that the tweet structure has in Tweepy; we can list all of them with dir(tweets[0]).

The interesting part here is the quantity of metadata contained in a single tweet. If we want to obtain data such as the creation date or the source of creation, we can access the info with these attributes. An example is the following:

```python
# We print info from the first tweet:
print(tweets[0].id)
print(tweets[0].created_at)
print(tweets[0].source)
print(tweets[0].favorite_count)
print(tweets[0].retweet_count)
print(tweets[0].geo)
print(tweets[0].coordinates)
print(tweets[0].entities)
```

We're now able to order the relevant data and add it to our dataframe.

1.5. Adding relevant info to our dataframe

As we can see, we can obtain a lot of data from a single tweet, but not all of it is useful for every task. In our case, we will just add some of it to our dataframe. For this we'll use Python's list comprehensions; a new column is added to the dataframe by simply writing the column name between square brackets and assigning the content. The code goes as follows:
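A sketch of how those columns can be added, based on the attributes printed above. The SimpleNamespace object here is only a stand-in mimicking a Tweepy tweet so the cell is self-contained; in the notebook, tweets is the list returned in section 1.3:

```python
import pandas as pd
import numpy as np
from types import SimpleNamespace
from datetime import datetime

# Stand-in for the Tweepy objects from section 1.3; in the notebook,
# tweets is the list returned by extractor.user_timeline(...):
tweets = [SimpleNamespace(text="Stock Market up 5 months in a row!",
                          id=903766326631698432,
                          created_at=datetime(2017, 9, 1, 23, 47, 38),
                          source="Twitter for iPhone",
                          favorite_count=44518,
                          retweet_count=9134)]

# We create the dataframe and add the relevant data as new columns:
data = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweets'])
data['len']    = np.array([len(tweet.text) for tweet in tweets])
data['ID']     = np.array([tweet.id for tweet in tweets])
data['Date']   = np.array([tweet.created_at for tweet in tweets])
data['Source'] = np.array([tweet.source for tweet in tweets])
data['Likes']  = np.array([tweet.favorite_count for tweet in tweets])
data['RTs']    = np.array([tweet.retweet_count for tweet in tweets])
```

These are exactly the columns (len, ID, Date, Source, Likes, RTs) that show up in the dataframes of the following sections.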

Now that we have extracted the data and have it in an easy-to-handle, ordered way, we're ready to do a bit more manipulation to visualize some plots and gather some statistical data. The first part of the post is done.

2. Visualization and basic statistics

2.1. Averages and popularity

We first want to calculate some basic statistical data, such as the mean tweet length in characters, the tweets with the most likes and retweets, etc.

From now on, I'll just show the input code with the output right below it.

To obtain the mean, using numpy:

```python
# We extract the mean of lengths:
mean = np.mean(data['len'])

print("The length's average in tweets: {}".format(mean))
```

```
The length's average in tweets: 125.925
```

To extract more data, we will use some pandas' functionalities:

```python
# We extract the tweet with more FAVs and more RTs:
fav_max = np.max(data['Likes'])
rt_max = np.max(data['RTs'])

fav = data[data.Likes == fav_max].index[0]
rt = data[data.RTs == rt_max].index[0]

# Max FAVs:
print("The tweet with more likes is: \n{}".format(data['Tweets'][fav]))
print("Number of likes: {}".format(fav_max))
print("{} characters.\n".format(data['len'][fav]))

# Max RTs:
print("The tweet with more retweets is: \n{}".format(data['Tweets'][rt]))
print("Number of retweets: {}".format(rt_max))
print("{} characters.\n".format(data['len'][rt]))
```

```
The tweet with more likes is:
The United States condemns the terror attack in Barcelona, Spain, and will do whatever is necessary to help. Be tough & strong, we love you!
Number of likes: 222205
144 characters.

The tweet with more retweets is:
The United States condemns the terror attack in Barcelona, Spain, and will do whatever is necessary to help. Be tough & strong, we love you!
Number of retweets: 66099
144 characters.
```

This is common, but it won't necessarily happen: here, the tweet with the most likes is also the tweet with the most retweets. What we're doing is finding the maximum number of likes in the 'Likes' column and the maximum number of retweets in the 'RTs' column using numpy's max function. Then we look for the index in each column that attains that maximum. Since more than one tweet could have the same (maximum) number of likes/retweets, we just take the first one found, which is why we use .index[0] to assign the indices to the variables fav and rt. To print the corresponding tweet, we access the data the same way we would access a matrix or any indexed object.

We're now ready to plot some stuff. :)

2.2. Time series

Pandas has its own object for time series. Since we have a whole vector of creation dates, we can construct time series for tweet lengths, likes and retweets.

The way we do it is:

```python
# We create time series for data:
tlen = pd.Series(data=data['len'].values, index=data['Date'])
tfav = pd.Series(data=data['Likes'].values, index=data['Date'])
tret = pd.Series(data=data['RTs'].values, index=data['Date'])
```

And if we want to plot the time series, pandas already has its own method in the object. We can plot a time series as follows:
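A sketch of that plotting call; the tiny hand-built tlen series here only stands in for the real one created above, so the cell runs on its own:

```python
import pandas as pd

# Stand-in for the tlen series built above (tweet lengths indexed by date);
# in the notebook, the real tlen/tfav/tret series are used directly:
tlen = pd.Series(data=[34, 132, 144],
                 index=pd.to_datetime(["2017-09-01 23:47:38",
                                       "2017-09-02 00:03:00",
                                       "2017-09-02 00:34:32"]))

# Lengths along time; pandas draws the series through matplotlib:
ax = tlen.plot(figsize=(16, 4), color='r')
```

The same call works for tfav and tret; with %matplotlib inline, the figure renders right below the cell.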

2.3. Pie charts of sources

We're almost done with this second section of the post. Now we will plot the sources in a pie chart, since we realized that not every tweet is tweeted from the same source (😱🤔). We first clean all the sources:
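A sketch of that cleaning step (the stand-in dataframe replaces the notebook's data, which already has a 'Source' column):

```python
import pandas as pd

# Stand-in for the notebook's dataframe with a 'Source' column:
data = pd.DataFrame({'Source': ["Twitter for iPhone", "Media Studio",
                                "Twitter for iPhone"]})

# We obtain all possible sources, keeping each one only once:
sources = []
for source in data['Source']:
    if source not in sources:
        sources.append(source)

# We print the sources list:
print("Creation of content sources:")
for source in sources:
    print("* {}".format(source))
```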

With the following output, we realize that this Twitter account basically has two sources:

```
Creation of content sources:
* Twitter for iPhone
* Media Studio
```

We now count the number of tweets per source and create a pie chart. You'll notice that this code cell is not the most optimized one... Please keep in mind that it was 4 in the morning when I was designing this workshop. 😅
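That cell may have looked roughly like this; the counts below are made-up stand-ins for the real 200 tweets, and the loop-based counting mirrors the "not optimized" style mentioned above:

```python
import numpy as np
import pandas as pd

# Stand-ins for the notebook's dataframe and the sources list from the previous cell:
data = pd.DataFrame({'Source': ["Twitter for iPhone"] * 170 + ["Media Studio"] * 30})
sources = ["Twitter for iPhone", "Media Studio"]

# Count the tweets per source as a percentage of all tweets:
percent = np.zeros(len(sources))
for source in data['Source']:
    for index in range(len(sources)):
        if source == sources[index]:
            percent[index] += 1
percent = percent * 100 / len(data['Source'])

# Build a Series and let pandas draw the pie chart:
pie_chart = pd.Series(percent, index=sources, name='Sources')
ax = pie_chart.plot.pie(figsize=(6, 6))
```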

3. Sentiment analysis

3.1. Importing textblob

As we mentioned at the beginning of this post, textblob will allow us to do sentiment analysis in a very simple way. We will also use the re library from Python, which is used to work with regular expressions. For this, I'll provide you two utility functions: a) one to clean the text (meaning that any symbol other than an alphanumeric value will be remapped to one that is), and b) a classifier to analyze the polarity of each tweet after its text has been cleaned. I won't explain the specifics of how the cleaning function works, since that would take a while and it is better understood from the official re documentation.

The code that I'm providing is:

```python
from textblob import TextBlob
import re

def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing
    links and special characters using regex.
    '''
    return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

def analize_sentiment(tweet):
    '''
    Utility function to classify the polarity of a tweet
    using textblob.
    '''
    analysis = TextBlob(clean_tweet(tweet))
    if analysis.sentiment.polarity > 0:
        return 1
    elif analysis.sentiment.polarity == 0:
        return 0
    else:
        return -1
```

The way it works is that textblob already provides a trained analyzer (cool, right?). Textblob can work with different machine learning models used in natural language processing. If you want to train your own classifier (or at least check how it works) feel free to check the following link. It might be relevant since we're working with a pre-trained model (for which we don't know the data that was used).

Anyway, getting back to the code we will just add an extra column to our data. This column will contain the sentiment analysis and we can plot the dataframe to see the update:

```python
# We create a column with the result of the analysis:
data['SA'] = np.array([analize_sentiment(tweet) for tweet in data['Tweets']])

# We display the updated dataframe with the new column:
display(data.head(10))
```

Obtaining the new output:

|   | Tweets | len | ID | Date | Source | Likes | RTs | SA |
|---|--------|-----|----|------|--------|-------|-----|----|
| 0 | On behalf of @FLOTUS Melania & myself, THA... | 144 | 903778130850131970 | 2017-09-02 00:34:32 | Twitter for iPhone | 24572 | 5585 | 1 |
| 1 | I will be going to Texas and Louisiana tomorro... | 132 | 903770196388831233 | 2017-09-02 00:03:00 | Twitter for iPhone | 44748 | 8825 | 1 |
| 2 | Stock Market up 5 months in a row! | 34 | 903766326631698432 | 2017-09-01 23:47:38 | Twitter for iPhone | 44518 | 9134 | 0 |
| 3 | 'President Donald J. Trump Proclaims September... | 140 | 903705867891204096 | 2017-09-01 19:47:23 | Media Studio | 47009 | 15127 | 0 |
| 4 | Texas is healing fast thanks to all of the gre... | 143 | 903603043714957312 | 2017-09-01 12:58:48 | Twitter for iPhone | 77680 | 15398 | 1 |
| 5 | ...get things done at a record clip. Many big ... | 113 | 903600265420578819 | 2017-09-01 12:47:46 | Twitter for iPhone | 54664 | 11424 | 1 |
| 6 | General John Kelly is doing a great job as Chi... | 140 | 903597166249246720 | 2017-09-01 12:35:27 | Twitter for iPhone | 59840 | 11678 | 1 |
| 7 | Wow, looks like James Comey exonerated Hillary... | 130 | 903587428488839170 | 2017-09-01 11:56:45 | Twitter for iPhone | 110667 | 35936 | 1 |
| 8 | THANK YOU to all of the incredible HEROES in T... | 110 | 903348312421670912 | 2017-08-31 20:06:35 | Twitter for iPhone | 112012 | 29064 | 1 |
| 9 | RT @FoxNews: .@KellyannePolls on Harvey recove... | 140 | 903234878124249090 | 2017-08-31 12:35:50 | Twitter for iPhone | 0 | 6638 | 0 |

As we can see, the last column contains the sentiment analysis (SA). We now just need to check the results.

3.2. Analyzing the results

To have a simple way to verify the results, we will count the number of neutral, positive and negative tweets and extract the percentages.
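A sketch of that counting step; the stand-in 'SA' column here replaces the notebook's dataframe from section 3.1:

```python
import pandas as pd

# Stand-in for the notebook's dataframe, which has the 'SA' column:
data = pd.DataFrame({'SA': [1, 1, 0, -1, 1, 0, 1, -1, 1, 1]})

# We classify the tweets by polarity:
pos_tweets = [sa for sa in data['SA'] if sa > 0]
neu_tweets = [sa for sa in data['SA'] if sa == 0]
neg_tweets = [sa for sa in data['SA'] if sa < 0]

# We print the percentages:
total = len(data['SA'])
print("Percentage of positive tweets: {}%".format(len(pos_tweets) * 100 / total))
print("Percentage of neutral tweets: {}%".format(len(neu_tweets) * 100 / total))
print("Percentage of negative tweets: {}%".format(len(neg_tweets) * 100 / total))
```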

Obtaining the following result:

```
Percentage of positive tweets: 51.0%
Percentage of neutral tweets: 27.0%
Percentage of negative tweets: 22.0%
```

We have to consider that we're working only with the 200 most recent tweets from D. Trump (last updated: September 2nd). For more accurate results we could consider more tweets. An interesting exercise (an invitation to the readers) is to analyze the polarity of the tweets from each source separately; it may well be that considering only the tweets from one source skews the polarity more positive/negative. Anyway, I hope you found this interesting.

As we saw, we can extract, manipulate, visualize and analyze data in a very simple way with Python. I hope this leaves the reader curious for further exploration using these tools.

It might be possible to find little mistakes in the translation of the material (I designed the workshop in Spanish, originally 😅). Please feel free to comment or suggest all that comes up to your mind. That would complement some ideas that I already have in mind for further work. 😀

I'll now leave some references for documentation and tutorials on the used libraries. Hope to hear from you!

Nice, very interesting! He seems to tweet a surprisingly high proportion of positive tweets (51%). But how many of these tweets are fake news and lies is another question... nytimes.com/interactive/2017/06/23...

Yeah, that surprised me too! I've heard that he's not the only one tweeting from his account; he has a team for this. That might be a possible reason. That's why it's interesting to analyze the polarity of tweets that come from different sources.

Hi there, I was having some trouble with the "visualizing the statistics" section as detailed in sections 2.1 and 2.2; if you take a look at my GitHub repo, you'll notice I had to comment out # %matplotlib inline and replace that requirement with plt.ion() within the script-running file (trumpet.py) in order to run the scripts without failure (e.g. python3 trumpet.py). Can you please explain how to generate the visualizations as detailed in those sections? For some reason, I'm unable to render those visuals within my Jupyter Notebook env/config. I'm only 10 days new to Python, so I'd appreciate any guidance. Great tutorial,
thanks!

Instead of adding plt.ion() at the beginning, you can add the following code each time you're generating a plot, in order to visualize it: plt.show(). This will open an external window and display the last plot generated.

Would it be possible to check/detect how many likes come from a VIP's staff? It is said that many politicians manage likes and retweets by asking their supporters to like and retweet their messages (not sure I'm being clear). Across 200 tweets, it would be possible to look at the Twitter accounts that like systematically and quickly (as soon as a tweet is published, like bots do) and then subtract (or down-weight) them from the final evaluation.

If you want to count something like this in real time, you would need to modify the way you're consuming the API (REST) and create a listener (you can still do that with Tweepy). That's what I would do: I'd create a specific listener for Trump's tweets and use threads to count, over a certain time window, the likes and retweets of a new tweet.

Nicely done. I had installed Anaconda before but didn't really get past Hello World in the Jupyter notebook. This was an excellent idea to get people like me off their proverbial rear-end and use it for a very fun idea! I was able to follow it right through and get everything to work after dusting off the cobwebs of my Anaconda environment.

Consume it as a REST API. In that case, the deployment on Heroku (or any other deployment service) would have to process the new tweets and add the new data to the previous data.

Create a stream listener to continuously detect a new tweet and process it.

In 1., the simplest way would be to just schedule a task (a simple script) to be executed at certain times (pythonanywhere also works for this; I have a Twitter bot that runs every 24 hours). Anyway, one can create a service using Tweepy; in fact, there's a Flask-Tweepy integration: flask-tweepy.readthedocs.io/en/lat...

@Rodolfo @ben This is really great. Could we do a healthcare analysis for the USA? (Not making anything political, but it took the CBO seven days to calculate the score for the recent healthcare bill, which is ridiculous.)

One of my ideas for this post is to give tools to implement solutions in different areas. As you say, this could help in healthcare analysis. For that you might need a specific classifier (not textblob's default, which I used), and you can learn how to build one in the last reference I provide in the post.

If you begin working on that, please let us know if there's a thing on which we may help.

Thanks for the awesome tutorial! I'm new to python and had a quick question though. You mentioned that textblob provides a trained analyzer, and you use that in your tutorial to assess the polarity of Trump's tweets. Can you tell me where I can access the list of words that's associated with positive/negative/neutral? I've been looking on textblob documentation but haven't found it yet.

I was looking for a tutorial to recommend to an acquaintance who is moving into digital journalism, and I came across your post. It is very well-written. Thanks for sharing!
This is just a short remark, since you seem to be using Pandas, but not to its fullest potential.
When you observe a possible relationship between RTs and Likes in subsection 2.1, you can quantify this by computing the (Pearson) correlation

```python
data['RTs'].corr(data['Likes'])
```

(It is close to 0.7.)

When finding the sources of tweets in subsection 2.3, instead of using loops, you can let Pandas do the counting for you.
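A value_counts-based sketch (with stand-in data) of the Pandas-native alternative the comment hints at:

```python
import pandas as pd

# Stand-in for the post's dataframe with a 'Source' column:
data = pd.DataFrame({'Source': ["Twitter for iPhone"] * 170 + ["Media Studio"] * 30})

# One call replaces the manual counting loop:
counts = data['Source'].value_counts()
print(counts)

# normalize=True yields proportions directly, ready to feed a pie chart:
proportions = data['Source'].value_counts(normalize=True)
```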

I must say that it was for an introductory workshop, and I finished all the material at dawn some three days before. :P
It might be that most of the code in the last part is not optimized. :(

Thanks for your observations! :D
They simplify the data handling using the potential of Pandas. :)