Common Statistical Operations

SQL Text Analysis with Donald Trump’s Tweets

Sean Cook

|

March 2, 2017

There were many unique aspects of the 2016 U.S. presidential election, with one of the most interesting being Donald Trump’s use of Twitter. As luck would have it, tweets happen to be very easy to extract and put into relational database format. Being the data junkies that we are, we couldn’t resist downloading President Trump’s entire tweet history and loading it into Periscope Data. It’s an excellent opportunity to slice, analyze, and quantify presidential communications while also exploring methods for analyzing text in SQL at the same time.

So let’s get started.

The Data Set

First, we formatted Trump’s tweets into 5 columns of data:

Tweet Text

Tweet Date

Favorites

Likes

Tweet ID

With these columns it was easy to do some quick exploration. @realDonaldTrump tweeted for the first time on May 4, 2009 with a tweet he almost certainly didn’t write.

select

tweet_text as first_tweet

from

trump_tweets

orderby tweet_date

limit 1

Around July of 2011 Trump’s tweet volume quickly escalated. It’s not clear what triggered this increase in activity, but based on his first tweet a reasonable guess is that Trump began tweeting himself, either in addition to or lieu of staff. In total, Donald Trump has tweeted over 34,000 times. At his peak he was tweeting over 30 times per day. A remarkable number by any measure.

A look at Trump’s tweet frequency over time looks like this:

select

month(tweet_date) as mnth

, count(*)

from

trump_tweets

groupby 1

Keyword Frequency: A History of MAGA

Trump first used the phrase ‘Make America Great Again’ long before his announcement to run for President in the 2016 election as a Republican candidate. He filed papers in Iowa in 2012 for the “Make America Great Again” Party to begin a run as an independent candidate. His first use of the phrase on Twitter:

His use of the phrase has obviously increased since. If we sum the number of tweets containing the words ‘Make America Great Again’ or ‘MAGA’, and divide that number by the count of all tweets, we can easily define Trump’s MAGA rate. With a simple query we can look at the MAGA rate over time. Trump’s MAGA rate has a slightly bimodal distribution centered first around the announcement of his candidacy in 2015, then more strongly after his victory in in the presidential election. Trump held an astounding peak monthly MAGA rate of 14% in September 2016.

-- Note the use of ilike and like. Ilike compares all strings

-- as lowercase, while like is case-sensitive. I don’t want to

-- count magazines.

-- Here we will use a float conversion to avoid integer division.

-- This is Redshift specific. Adjust accordingly.

select

month(tweet_date) mnth

, sum(

casewhen tweet_text ilike '%make america great again%' or

tweet_text like '%MAGA%' then 1

else 0

end

)

/count(*)::float as maga_rate

from

trump_tweets

groupby

1

Sentiment: The Anger and Joy of Donald J. Trump

Given Trump’s prolific use of Twitter leading up to and throughout the campaign, we can apply a quantitative approach to measuring the tone and sentiment of his campaign with text analysis.

SQL is not the first tool people think of when doing text analytics, but it has some interesting advantages. It’s fast, well integrated with existing ETL processes, and easy to build into regular reports. We will use a simple but effective approach, by relating individual words to sentiment scores to get a quick read on the emotional state of DJT.

This technique we’re using was explored in the paper “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs.” Using Twitter as a foundation, Finn Årup Nielsen from the Technical University of Denmark (DTU) scored 2,477 words from -5 to 5 based on their perceived sentiment, with -5 being a word linked to a strongly negative emotional content. By joining the words in Trump’s tweets to this list, we can quickly get a per-tweet sentiment score. The data set for this topic is published by the department of Informatics and Mathematical Modelling at DTU.

The data is in the form:

In order to begin this analysis, the tweets must be broken down into individuals words to join to our sentiment table. The method to do this is outlined in our blog post about NPS and survey analytics.

Once we have a resulting table of individual words, we can begin joining our tweets to our sentiment scores.

With a few simple queries we can find Trump’s most celebratory and antagonistic tweets. First we sum the sentiment scores by tweet.

His most positive:

with

sentiment as (

select

tweet_text

, tweet_date

, sum(sentiment_score)

as sentiment_sum

, count(*)::float as words

from trump_words

groupby 1,2

)

select

tweet_text

, tweet_date

, sentiment_sum

from

sentiment

where

sentiment_sum isnotnull

orderby sentiment_sum desc

Limit 10

And his most negative:

These are Trump’s most emotionally charged tweets, but since we have summed the sentiment scores we are biased towards longer phrases. Simple declarations such as “Great!” or “Sad!” convey a lot of emotional content without summing to large numbers. As we dig a little deeper we will divide the summed scores by the number of words in the Tweet to gauge the net sentiment.

Let’s look at Trump’s sentiment over time. This has been normalized to his average sentiment over the data history so that it is centered on 0. This is done because the sentiment data has little absolute meaning without a reference point.

with

monthly_sentiment as (

select

month(tweet_date) as mnth

, tweet_text

, sum(sentiment_score) /max(num_words)::float

as sentiment_score

from

trump_words

groupby 1, 2

)

, avg_sentiment as (

select

avg(sentiment_score) as sentiment_avg

from

trump_words

)

select

mnth

, avg(sentiment_score) - sentiment_avg as sentiment

from

monthly_sentiment

-- I want the average sentiment available for every row.

-- You could use a window function as well.

left join avg_sentiment ontrue

As can be seen, Trump had some serious highs and lows before 2013. The highs are often associated with the Miss USA and Miss Universe pageants, or were tweets written by people (we suspect) other than Trump. The very low scores in 2011 - 2012 are mostly political in nature. Let’s take a closer look at some of these:

Most worrying though, Trump has only become more negative over the past 4 years, especially from 2015 and onwards. Zooming in on the period of the election we can see that Trump generally trended downwards, exhibiting increasing negativity over the course of his Presidential campaign. There were some high points, such as the week of his victory in November, but the overall sentiment of Trump’s tweets continues to decline. We can easily summarize Trump’s tweets with a single word: Sad!

Sentiment by Time of Day: Angry, Early Morning Tweets

Another interesting perspective is Trump’s sentiment as a function of the hour of day. Much has been made of his proclivity for early morning tweeting and what the ramifications of this might be. With a bit of SQL, we can look at this quickly and easily. We use the same query as above, continuing to limit our timeframe to the presidential campaign, but we use a 3 hour rolling average and use an extract, like so:

with

monthly_sentiment as (

select

extract(hour from (tweet_date)::timestamp) as hr

, tweet_text

, sum(sentiment_score) /max(num_words)::float

as sentiment_score

from

Trump_words

where date >= '2015-06-16'

groupby 1, 2

)

, avg_sentiment as (

select

avg(sentiment_score) as sentiment_avg

from

trump_words

)

select

hr

, count(*) as tweets

, avg(avg(sentiment_score)) over(orderby hr rowsbetween 1

preceding and 1 following) - sentiment_avg as sentiment

from

monthly_sentiment

leftjoin

avg_sentiment ontrue

groupby

1, sentiment_avg

He’s at his most negative in the morning. On average, between the hours of 4 am and 7 am, Trump unleashes a high volume of negative tweets. Our data does reveal a counterbalance to this finding. Trump’s tweets become more positive throughout the day and this positive sentiment peaks at night, along with his overall tweet volume.

Wrap Up

Many companies have an abundance of text data, but don’t have a strategy for fully making use of it. After a few simple joins in SQL, keyword frequency and sentiment analysis are two excellent methods for gaining context and insight into text data. Both are broadly applicable beyond tweets. Survey data, comments, competitive information, or web copy can all be explored with the methods discussed above.

We hope our advice conducting text analysis in SQL was helpful. If you’d like a CSV file containing the tweet or sentiment dataset we used for this post, please email us at [email protected] and we’ll send it your way.

Sean Cook

Sean has chased the best tech from San Diego to NYC to SF working in GIS, data science and marketing. You can find him alternatively creating dazzling new analyses and driving content strategy at Periscope Data.