Sentiment Analysis and Natural Language Processing (NLP) have always fascinated me, yet I never really understood the inner workings of this type of analysis and never made the time to dig into the science. Until recently, I didn’t even know that you could collect tweets for free using Twitter’s Search and Streaming APIs. A few days and several blogs later, I’ve now set up R to work with both the Search and Streaming APIs. Since much of the information was scattered across disparate websites, I thought I’d give a general recap here. This first post deals with using the Twitter Search API and R to collect tweets. Before I dig into the code, there are some notes I want to touch on (which I learned from Twitter’s documentation).

What is the Twitter Search API?

The Twitter Search API, one of three such APIs (Search, Streaming, "Firehose"), allows access to a subset of popular or recent tweets (from roughly the last 4-6 days). That is, it allows querying past tweets, though only a small fraction of all of them. To me, this is a great way to get one’s feet wet collecting and cleaning tweet datasets; however, it doesn’t provide much utility for research, as the fraction of tweets received may not be representative of the entire tweet stream.

Who can access the Twitter Search API?

Anyone! That’s right…if you have an account, you can create an authorization token and get started with Big Brotheresque collection of people’s thoughts and locations (that’s right….locations).

How do I get started with Twitter Search API and R?

To be able to query the Twitter Search API and import the data into R, we’ll need to accomplish the following tasks:

Sign up for Twitter & create an application.

Install R and required R packages

Understand the Twitter Search API query structure

Run our first query and save the results to a data frame

Sign up for Twitter & Create an Application

If you don’t yet have a Twitter account, head on over to twitter.com and grab yourself an account; the process is pretty self-explanatory. If you already have an account with Twitter, we’ll need to set up an application (this will allow us to connect R to the Twitter stream).

The first step is to head on over to dev.twitter.com. After logging in (yes, you might be prompted for another login), click on your Twitter thumbnail (upper right-hand corner of the screen) and click on “My Applications.” In the following screen, click on “Create New App.” You’ll need a name, description, and website. I’ve used my blog address as the website, though I’d imagine that anything works.

Once created, click on “modify app permissions” and allow the application to read, write and access direct messages (this might come in handy later on). Lastly, click on the API Keys tab and scroll to the bottom of the page. Under token actions, click on “Create my access token.” We’ll need these access tokens when we fire up R. That’s it! We’re done with the Twitter part of this setup.

Install R and Required R Packages

If you have not yet installed R, head on over to r-project.org and install the version appropriate for your platform. I also highly suggest installing RStudio, an integrated development environment and GUI for R. I am running RStudio throughout the tutorial.

To use the Twitter Search API, we need the following packages installed:

twitteR

RCurl

RJSONIO

stringr

Some of these packages have dependencies on other packages, so make sure you install all required packages before moving on. To install all of these packages in one run, just copy the following code and run it in R (or R-studio):

R Libraries for Twitter Search API

# Install and activate packages
# (the package names must be wrapped in c() so that all of them are installed)
install.packages(c("twitteR", "RCurl", "RJSONIO", "stringr"))
library(twitteR)
library(RCurl)
library(RJSONIO)
library(stringr)

Once these packages are installed, it’s time to set up our connection to the Twitter Search API. To do so, we’ll need to copy and paste our API credentials into either a text file or, preferably, an R script. Just copy the following code into your script, replacing the placeholder text in quotation marks with your own credentials:

# Declare Twitter API credentials
api_key <- "API KEY"                    # From dev.twitter.com
api_secret <- "API SECRET"              # From dev.twitter.com
token <- "ACCESS TOKEN"                 # From dev.twitter.com
token_secret <- "ACCESS TOKEN SECRET"   # From dev.twitter.com

# Create Twitter connection
setup_twitter_oauth(api_key, api_secret, token, token_secret)

Keep the API keys in quotation marks, but remember to replace the placeholder text with your actual API and token keys. The setup_twitter_oauth function will create a connection to Twitter’s Search API. If you are successful, the following message should show up in your console:

[1] "Using direct authentication"

Understanding the Twitter Search API Structure

Our R instance is now ready to receive tweets from Twitter. However, before we can receive any information, we’ll need to understand the format of a Twitter Search query. Per Twitter, the best way to build a query and test whether it’s valid and will return matched tweets is to first try it at twitter.com/search, which in essence uses the same API that we are calling. Once your search returns the results you expect, we can load that search string into R.

The query has multiple operators and will behave in the following way:

Obamacare ACA

will find tweets containing both "Obamacare" and "ACA"; not case sensitive

Obamacare OR ACA

will find tweets containing either "Obamacare" or "ACA" or both; the terms are not case sensitive, but the OR operator itself IS case sensitive (it must be uppercase)

Obamacare -ACA

will find tweets containing "Obamacare" but not "ACA"

#Obamacare

will find tweets containing the hashtag "#Obamacare"

from:BarackObama

will find tweets sent from Barack Obama

to:BarackObama

will find tweets sent to Barack Obama

@BarackObama

will find tweets referencing Barack Obama's account

Obamacare since:2014-08-25

will find tweets containing "Obamacare" and sent since 2014-08-25 (year-month-day)

ACA until:2014-08-22

will find tweets containing "ACA" and sent before 2014-08-22

There are a few other query operators that you can review on the Twitter Search API documentation page, though the ones in the table above will suffice for this tutorial.
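Since these operators are plain text, longer queries can also be assembled programmatically in R rather than typed by hand. A minimal sketch using base R's paste() (the search terms here are just examples):

```r
# Build an OR query from a vector of example search terms
terms <- c("Obamacare", "ACA", "'Affordable Care Act'", "#ACA")
query <- paste(terms, collapse = " OR ")

query  # "Obamacare OR ACA OR 'Affordable Care Act' OR #ACA"
```

The resulting string can then be passed as the first argument of searchTwitter.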

So now that we know the basic structure of a Twitter Search API query, let’s build one in R and run it. Remember, if you are building more sophisticated queries, it’s worth running them through twitter.com/search first. Let’s say I’d like to run a query on tweets that have mentioned “Obamacare,” “ACA,” “Affordable Care Act,” or “#ACA.” I’d also like to restrict it to tweets sent since 2014-08-20, and I’d like the query to run until it returns 100 tweets. The following code should do the trick:

# Run a Twitter search. searchTwitter() also accepts since, until, and a
# geocode argument of the form "lat,lng,radius".
tweets <- searchTwitter("Obamacare OR ACA OR 'Affordable Care Act' OR #ACA",
                        n = 100, lang = "en", since = "2014-08-20")

Running the code results in a list called “tweets” containing 100 elements. To make this list easier to read, let’s transform it into a data frame by running the following:

# Transform the tweets list into a data frame
tweets.df <- twListToDF(tweets)

We now have a data frame (tweets.df) with 100 tweets. You’ll notice that the data frame contains 16 columns. While you can figure out on your own what the different columns mean, the following are likely the most important ones:

text: the text of the actual tweet

created: the date and timestamp of creation

id: the ID of the tweet (useful when needing to remove duplicates)

longitude and latitude: if available, the coordinates of the tweet
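Since the ID column uniquely identifies each tweet, it offers a simple way to drop duplicates before analysis. A minimal sketch on an invented sample data frame (only the relevant columns are mimicked; real results from twListToDF have many more):

```r
# Invented sample mimicking the text and id columns of a tweets data frame
tweets_sample <- data.frame(
  text = c("Obamacare enrollment opens",
           "ACA premiums announced",
           "Obamacare enrollment opens"),
  id   = c("101", "102", "101"),  # note the duplicated id
  stringsAsFactors = FALSE
)

# Keep only the first occurrence of each tweet id
tweets_unique <- tweets_sample[!duplicated(tweets_sample$id), ]

nrow(tweets_unique)  # 2
```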

It’s important to note that the searchTwitter function also accepts a geocode argument consisting of a latitude/longitude pair and a radius. A call that includes a geocode is of the following format:

# Use the geocode argument to get only tweets within 50 miles of Los Angeles
tweets_geolocated <- searchTwitter("Obamacare OR ACA OR 'Affordable Care Act' OR #ACA",
                                   n = 100, lang = "en",
                                   geocode = "34.049933,-118.240843,50mi",
                                   since = "2014-08-20")
tweets_geolocated.df <- twListToDF(tweets_geolocated)

When the geocode argument is used, the tweets returned will be of two types:

Tweets that have a designated latitude/longitude specified.

Tweets whose users specified a location in their profile within the specified radius.

The call above will provide up to 100 English-language tweets matching our search terms, sent since 8/20 within 50 miles of Los Angeles.
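Because many geocode matches come from profile locations rather than actual GPS coordinates, the longitude and latitude columns are often NA. One way to keep only genuinely geotagged tweets is to drop the NA rows; a sketch on an invented sample data frame (the column names mirror those produced by twListToDF):

```r
# Invented sample mimicking the coordinate columns of a tweets data frame
geo_sample <- data.frame(
  text      = c("tweet A", "tweet B", "tweet C"),
  longitude = c(-118.24, NA, -118.30),
  latitude  = c(34.05, NA, 34.10),
  stringsAsFactors = FALSE
)

# Keep only rows where both coordinates are present
geotagged <- geo_sample[!is.na(geo_sample$longitude) & !is.na(geo_sample$latitude), ]

nrow(geotagged)  # 2
```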

That’s it! We’ve now saved our first set of tweets to a data frame within R. Much still needs to be done before we can actually do some interesting analysis on the text (removing retweets, readying the text for sentiment analysis, assigning sentiment, etc.). My next blog post will focus on using the Twitter Streaming API to capture tweets (different from the Search API in that we capture tweets in real time, rather than doing a historical search).
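As a small preview of that cleanup, here is a hedged sketch of two common first steps, dropping retweets and stripping URLs, using the stringr package installed earlier. The regular expressions shown are one common approach, not the only one, and the sample tweets are invented:

```r
library(stringr)

# Invented sample tweets
tweet_text <- c("RT @user: Obamacare signups rise",
                "Read about the ACA here: http://t.co/abc123",
                "Affordable Care Act deadline approaching")

# Drop retweets (tweets that start with "RT ")
originals <- tweet_text[!str_detect(tweet_text, "^RT ")]

# Strip URLs from the remaining text and trim leftover whitespace
cleaned <- str_trim(str_replace_all(originals, "http[^[:space:]]+", ""))

cleaned  # "Read about the ACA here:" "Affordable Care Act deadline approaching"
```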

For your reference, see the full code below:

Full R Script to Capture Tweets using Twitter Search API

# Install and activate packages
# (the package names must be wrapped in c() so that all of them are installed)
install.packages(c("twitteR", "RCurl", "RJSONIO", "stringr"))
library(twitteR)
library(RCurl)
library(RJSONIO)
library(stringr)

# Declare Twitter API credentials
api_key <- "API KEY"                    # From dev.twitter.com
api_secret <- "API SECRET"              # From dev.twitter.com
token <- "ACCESS TOKEN"                 # From dev.twitter.com
token_secret <- "ACCESS TOKEN SECRET"   # From dev.twitter.com

# Create Twitter connection
setup_twitter_oauth(api_key, api_secret, token, token_secret)

# Run a Twitter search. searchTwitter() also accepts since, until, and a
# geocode argument of the form "lat,lng,radius".
tweets <- searchTwitter("Obamacare OR ACA OR 'Affordable Care Act' OR #ACA",
                        n = 100, lang = "en", since = "2014-08-20")

# Transform the tweets list into a data frame
tweets.df <- twListToDF(tweets)

# Use the geocode argument to get only tweets within 50 miles of Los Angeles
tweets_geolocated <- searchTwitter("Obamacare OR ACA OR 'Affordable Care Act' OR #ACA",
                                   n = 100, lang = "en",
                                   geocode = "34.049933,-118.240843,50mi",
                                   since = "2014-08-20")
tweets_geolocated.df <- twListToDF(tweets_geolocated)

59 Comments

JKC

Very good article. I followed all the steps and it is working fine. But I have noticed one strange thing about the tweets data frame: while I am able to extract tweets with 19 variables on my PC, the same syntax and procedure yields just 16 variables on my friend’s PC, even though both PCs are connected to the same network. The three extra variables/columns that I get are location, language and profileImageURL. Looks weird. Any idea about this? I want those 19 variables on all the PCs connected to my office network in order to complete the work.

Sudheer

Hi, I am Sudheer from India. I have two questions:
1. When I try to extract tweets from Twitter based on date, I get only today’s tweets, but I want 2 years of data. How is that possible?

2. When I try to extract tweets from Twitter based on geocode, I get only “NA” in the longitude/latitude columns; I am not getting geocode values. How is that possible without using the userTimeline function?

Mohsin Raza

Hello. Thanks for the tutorial.
I have a question.
Kindly tell me if you know: is there any way to download the conversation/comments on specific tweets using R? And how can I download replies with a hashtag?
Thanks

Shahd

bogdanrau

You can. And Twitter indeed indicates that extra processing should be done when collecting data. You could remove them by ID (to remove exact duplicates of the same tweet in the data), or remove them by text (but then you’re not technically removing duplicate records, just duplicate sentences). You could also look at the retweet column and remove by that.

bogdanrau

I believe the search API will only give you results for about the last 2 weeks, so if your search term is not as frequent, you won’t get very many results. Try searching for a popular term, like cat, or dog, and I think you might get a lot more than just 100 tweets.

raffaele

bogdanrau

Hi Raffaele. Unfortunately, the search API does not go back further than about 2 weeks. You can try the streaming API going forward for that. The only other solution is to actually buy that data from Twitter, which could get pricey.

SHUBHAM UPADHYAYA

This write-up has been very useful for me to understand the Search API of Twitter. I want to extract only the tweet ID from Twitter rather than the whole tweet text and all the other stuff that comes with it, since when I work with a large number of tweets this API crashes very frequently due to the amount of data involved. So is it possible to extract only the tweet ID directly from Twitter?

bogdanrau

I don’t believe it’s easy to just extract the tweet ID, however, you could just run your filterStream in a timed for loop (or a cron job if running on linux), then drop all columns with the exception of tweet id. Running filterStream on a timed loop or cron job would ensure that the overall dataset never gets big enough to crash your computer.

Alishba

Hey Sir. Thanks for the tutorial. Can you tell me please how to get tweets of a specific user. I need to fetch tweets from other person’s account and i have no idea how to fetch that account’s tweets. Please help

Sami

Hey Bogdan,
Thanks for this post its very helpful
I’m having problems with running searchTwitter on tweets with Arabic hashtag or Arabic keyword . It doesn’t return tweets with the searched word. I can search for Arabic tweets but only by searching for English keyword or a hashtag in English.

Shahd

Hello, Mr. Sami,
I work on a project concerned with sentiment analysis on some trending Arabic hashtags, and I’d like to say that you can use the UTF-8 encoding of the Arabic words in searchTwitter(), and it should work for you because it worked very well for us.
Hope that helps you, even though it’s too late!

Adrian F

Hi Bogdan! Firstly, thanks for your tutorial, it’s been really helpful and I almost got the task done, but I’m stuck at the end because the API doesn’t return any tweets. I get an error (if I query 3 tweets, for example): “3 tweets were requested but the API can only return 0”

bogdanrau

Twitter only has about 10% of tweets that come actually geotagged (per some article posted a while ago…might have changed recently). This means that most of your tweets won’t come with a latitude/longitude. As far as requesting 50,000 and only getting 20,000, I’m not sure how to answer that. Seeing your code might help. It could be due to the fact that you are searching for a very specific term and there are only 20,000 reports total. I believe the search API only goes back about 2 weeks.

Heather Evans

Thanks for posting this information and walking us through how to do this. I’m wondering — I’ve seen a couple different posts online about getting someone’s entire tweet history. Have you had any luck with that?
I guess I could have it running all the time…..

bogdanrau

The best I can come up with is the userStream function in the streamR package. You’d still need to have it running all the time, but you’d only be collecting from this one specific user vs. collecting all.

sharanya

Hi
This information was very much helpful.
But, I need the tweets with the location mentioned, instead of the latitude and longitude. so that I can do some health care data analysis.
Please help me in this regard

bogdanrau

The resulting data frame often has location specified in the location column, though often times it is unreliable. Your best bet is still to filter by lat/lon and then use a service like Google’s geocoder or others to get the actual location of the user. Sorry I can’t be more specific but this answer is not paragraphs long, but rather pages long.

This write-up has been very useful for me to understand the Search API of Twitter. I am facing an issue while running this script: it accepts all the code lines, however, after I enter the last line, tweets.df <- twListToDF(tweets), nothing really happens. Do I need to enter another command to view the data frame?

bogdanrau

Divye

Hi
First of all, this is a really helpful blog. It works perfectly, like a rocket. However, I have two questions.
1. I am not able to search less popular words like tagbin (it shows 0 results whereas there are posts containing tagbin).
2. Why are we getting only 16 variables? When I run the API from an Ubuntu terminal and get the JSON, it contains 147 variables embedded in 25 elements. Some of the useful variables are information about the user who tweeted the post, whose details contain 41 variables like id, name, location, etc. How do I access that information?

bogdanrau

Thanks for the note & the kind words! With regards to your questions:
1. Neither the search nor the streaming API provide the entirety of tweets that Twitter collects. It is only a fraction (I believe streaming API is rumored to be ~2%). Less popular words like tagbin have a higher likelihood of not showing up. Have you tried the streaming API? I have a blog post about that as well.
2. I’ve not played around with parsing the twitter json myself, though I’m sure it can be done using RJSON, or RJSONIO, or many of the other JSON-related packages out there. I’ve seen name and location show up in my streaming data. I’ve not used search for quite a while so Twitter may have changed the way they send that information in. Have you had a look at the API docs?

Mark

bogdanrau

Based on my experience, it’s not 100% possible as many times, you’ll need additional data management steps to weed out reports. Look into the streaming API as they have some information and suggestions on your question.

Shovon

Hi,
Thanks for the walk-through. This approach worked for me while it was on my local machine. However, it just does not work when I publish it on shinyapps.io server.
Diagnosis 1) On local server it waits for input to save session. Which could be solved by
options(httr_oauth_cache=T) # Adding this line
setup_twitter_oauth(your_cons … … …. )

But, still it does not work on the shinyapp.io server.

Probably I need to authenticate differently while the R-prog runs on shiny… any idea how to resolve this? Thank you.

bogdanrau

Hmmm….are you able to see any of the console output? When you host it on an actual shiny-server, it outputs all of the console output to a log file. Does shinyapps output a log file? That would be very helpful in figuring this out.

I am following the instructions above but am getting the error:
“Error in check_twitter_oauth() : OAuth authentication error:
This most likely means that you have incorrectly called setup_twitter_oauth()”

I do get the message [1] “Using direct authentication”
Use a local file to cache OAuth access credentials between R session?
1: Yes
2: No

I have search around on google and it looks like it’s a problem with the version of the httr package but I can’t seem to find a solution that will work. Any ideas?

Wendy

Amanda

This is helping me tremendously! However, before I even apply the geo parameters R tells me that “1000 tweets were requested but the API can only return 245.” I want all the tweets since my date, but just put in 1,000 to start the process. I’m using a very popular hashtag- so I don’t see why only 245 would be returned. I worry that the sample will shrink even more once I add geo parameters. Any help is most appreciated! Thanks.

bogdanrau

I have a feeling this might be due to the fact that Twitter only gives you a sample of tweets, not ALL of them, so it’s basically saying that out of 1000 available, it was only able to give you access to 245. This is total speculation, however.

bogdanrau

Can you try naming the dfs uniquely? maybe tweets_geolocated_US and tweets_geolocated_Dublin? I’m thinking R might just overwrite dfs since you’re technically writing to the same file. Let me know how that goes!

Rachit

Twitter only gives lat/lon pairs, so I don’t believe you can search by city though I believe there was an option to search by the city in the user’s profile, not the actual location from where the user sent the tweet. What you CAN do however is take those lat/lon pairs and geocode them back to cities or any other administrative region for that matter.

Roberto

Dust

bogdanrau

Interesting question. A quick search reveals that the search API has had some issues with Arabic and other languages. Has that been fixed? If so, you should be able to pass the characters in your search string the same way as you’d pass the characters in twitter’s API request. Let us know how that goes!