Although this post has been a very long time coming, I’ve gotten many good questions and requests over comment/email to create a streaming API tutorial, so here it is! This second post is a follow-up to my initial Collecting Tweets Using R and the Twitter Search API post. As always, before I dig into the code, below are some notes I want to touch on.

What is the difference between the different types of streams?

Twitter explains it best, though I’ll give a short recap here. Public streams are streams of public data flowing through Twitter. You can use these to follow specific users or topics, and for data mining (cha-ching!). User streams contain data corresponding to a single user’s stream. Lastly, site streams are a multi-user version of user streams.

Who can access the Twitter Streaming API?

Like with the search API, anyone who creates a dev account can access live streaming data. See my search API blog post for how to sign up for a dev account.

What R packages will we need?

To use the Twitter Streaming API, we will need to install the following packages: streamR, RCurl, ROAuth, RJSONIO, and stringr.

Some of these packages have dependencies on other packages, so make sure you install all required packages before moving on. To install all of these packages in one run, just copy the following code and run it in R (or RStudio):

R Libraries for Twitter Streaming API

# Install and Activate Packages
install.packages(c("streamR", "RCurl", "ROAuth", "RJSONIO", "stringr"))
library(streamR)
library(RCurl)
library(ROAuth)
library(RJSONIO)
library(stringr)

The next step varies slightly from using the twitteR package, in that we need to set up an OAuth handshake. This is accomplished using the ROAuth package. See the code below. NOTE: you will only need to do this once, so long as you save the credentials to an .Rdata file (also included in the code). Also, please make sure that you run parts 1 and 2 separately (don’t run the entire script at once). After you run part 1, a browser window will open asking you to authorize the application. Once you click Authorize, copy the PIN provided by Twitter into the R console and hit Enter. Then run part 2.
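Here is a sketch of what that handshake typically looks like with ROAuth (the consumer key/secret placeholders and the my_oauth.Rdata file name are just examples; substitute your own app’s credentials):

```r
# Part 1: build the OAuth object and run the handshake
library(ROAuth)

requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL  <- "https://api.twitter.com/oauth/access_token"
authURL    <- "https://api.twitter.com/oauth/authorize"
consumerKey    <- "YOUR_CONSUMER_KEY"     # from your Twitter dev app
consumerSecret <- "YOUR_CONSUMER_SECRET"  # from your Twitter dev app

my_oauth <- OAuthFactory$new(consumerKey = consumerKey,
                             consumerSecret = consumerSecret,
                             requestURL = requestURL,
                             accessURL = accessURL,
                             authURL = authURL)
# Opens a browser window; paste the PIN Twitter gives you into the console
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

# Part 2: save the credentials so the handshake is a one-time step
save(my_oauth, file = "my_oauth.Rdata")
```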

Now that we’re done with setting up the handshake, we can start with a clean (empty) R script, and start collecting data. This is done by using the filterStream function within the streamR package. The filterStream function takes the following parameters:

file.name = the name of the file to which tweets will be written.

track = a string containing keywords to track.

follow = a string or vector containing Twitter user IDs, if we only want to track tweets from specific users.

locations = a vector of latitude and longitude pairs (southwest corner coming first) specifying a set of bounding boxes to filter incoming tweets.

language = a list of BCP 47 language identifiers.

timeout = the maximum length of time, in seconds, to stay connected to the stream. Setting timeout = 10 will end the connection after 10 seconds. The default is 0, which keeps the connection always on.

tweets = the number of tweets to collect. For example, if you only want to collect 100 tweets, you could leave timeout = 0 and specify tweets = 100. The connection would end after the 100th tweet is collected.

oauth = the object holding our OAuth credentials (my_oauth).

verbose = TRUE or FALSE; when TRUE, generates output to the R console with information about the capturing process.
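Putting those parameters together, a minimal call might look like this (the keyword and timeout are just examples, and it assumes you saved your credentials to my_oauth.Rdata during the handshake step):

```r
library(streamR)
load("my_oauth.Rdata")  # restores the my_oauth object saved earlier

# Collect tweets containing the keyword "obama" for 120 seconds,
# writing them to tweets.json as they arrive
filterStream(file.name = "tweets.json", track = "obama",
             timeout = 120, oauth = my_oauth)
```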

tweets.df <- parseTweets("tweets.json", simplify = FALSE)  # parse the JSON file into a data frame called tweets.df; simplify = FALSE ensures that lat/lon information is included

You’ll notice that I also used parseTweets in the code above. This goes through the JSON file and converts all of the information into a data frame. The simplify argument, when set to FALSE, includes geolocation information (latitude/longitude) in the data frame.

Although not feasible for a small timeout value, we could also specify where exactly we want to pull tweets from. The code below requests tweets from an area that roughly estimates Los Angeles County. Note that we likely won’t get any tweets using small timeouts.
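A sketch of such a call is below; the coordinates are my rough approximation of Los Angeles County’s bounding box, passed as c(sw_longitude, sw_latitude, ne_longitude, ne_latitude):

```r
# Collect tweets geotagged inside a box roughly covering Los Angeles County
# (coordinates are approximate and just for illustration)
filterStream(file.name = "tweets.json",
             locations = c(-118.94, 33.70, -117.65, 34.82),
             timeout = 300, oauth = my_oauth)
```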

tweets.df <- parseTweets("tweets.json", simplify = FALSE)  # parse the JSON file into a data frame called tweets.df; simplify = FALSE ensures that lat/lon information is included

Some things to keep in mind:

The tweets.json file will NOT be overwritten; instead, new tweets are appended to it. This is important, as the file can get quite large over time. For new searches, I suggest using a separate file, or deleting the old one before running the filterStream function again.

Not all tweets will have lat/lon information in them. It’s up to us to filter out the non-geocoded tweets.

49 Comments

rehna


congrats!
Very great job!
I get the following message on my R console:
Capturing tweets…
Connection to Twitter stream was closed after 99 seconds with up to 10 tweets downloaded.
after running the code below:
filterStream(file.name = "datatweet.json", track = c("@chennai floods", "@helpline@chennai", "@chennaiRains"), language = "en", tweets = 10, oauth = my_oauth)
But I receive an empty file every time, with different file names.

Khandis

Vidya

Hi! Thanks for this blog. You mentioned in the above comment that we can try and store the tweets as they come in into MySQL tables. I am doing something similar but I want to write the tweets into elasticsearch as and when they come in. My problem is, if I give timeout =0 and file.name=””, I am not able to see/print/access/process any of the tweets as they come in. My code just says “Capturing tweets …”.
Here is the line of code I am using.
filterStream(file.name = "", track = c("Obama"), timeout = 0, oauth = my_oauth, verbose = TRUE)

Is there any way for me to store the tweets into an elasticsearch/mysql table as soon as each tweet comes in?

bogdanrau

There are a few options I think. One would be to actually save the tweets to a file (so use the file.name option). You could then parse what’s in the file on a regular basis (every x minutes?) and import the tweets into elasticsearch/mysql. The other option would be to use the timeout argument: turn on streaming with a timeout of 10 minutes, import, turn it on for another 10 minutes, import, and so forth. This would require either a cronjob or a scheduled task.

utsav

Thank you for this post. I am a newb here and this really helped me. I have a couple of questions though.
1) Is there a solution where I can just continuously download, append and parse tweets real-time?
2) How can I download tweets from Brexit day? I’m doing a rookie analysis on it. Or do you know any place where I can just download the REST data for that day?

bogdanrau

1. I think you’d need to save them to a db somewhere. You can use the streaming API to continuously stream tweets to a file, which then can be parsed and stored by a separate process.
2. You could use the search API instead of streaming. This post discusses the search API, though it might be outdated. I also know that there are a variety of new packages out there that might help in that process. Keep in mind that, to my knowledge, the search API only goes about 2 weeks in the past. Anything more than that and you’d need to go to one of Twitter’s data services (and pay $$$).

bogdanrau

CarlosFra

Hi
I am having an issue.
After running the code, I get
“Capturing tweets…
Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded.”
The API was set up correctly and I made no modifications to your code.
Any idea?
Thanks for your help

bogdanrau

Hello, first, thanks for the example. I have used twitteR before, and recently I’m digging into the streaming API. So my question is: when we say the “search API goes back in time” and the “stream API is for real-time data”, how does this historical data work? I mean, if I run my script with the search API now, I’m able to get tweets posted 20 minutes ago, right? So how can a tweet be in the streaming API and not in the search API? Why can’t we just use the search API instead?

bogdanrau

The difference between streaming and search is that streaming is a real-time, always-on API service. This means that if you start listening to tweets between 10:00 and 11:00, you will get ~1% of Twitter’s stream during that 1-hour interval, and nothing else. Nothing before 10:00 or after 11:00. The search API searches Twitter’s historical tweets within the last ~2 weeks, meaning that if you run the search API at 10:00, you’ll get search results from ~2 weeks ago up until 10:00. I’m not sure if the search API returns ~1% or not, and I don’t think the documentation mentions this either. I hope this clarified the difference.

Katrin

Hi Bogdan,
Thanks so much for your explanations. They were very helpful and the code runs perfectly.
I have a question regarding data collection and storage. I would like to start collecting tweets and am unsure about the right way to do it. As I understand it the streamR package collects data in realtime, so if I set a timeperiod –say an hour- I would have to restart the code every hour to get all tweets? Is that correct or is there another (automated) way to do it? I read that you suggest the streamR package and filterStream() function to collect data. What would be the advantage over the twitter package and the searchTwitter() function if you want to collect longer (and complete) timeseries.?
Thank you!

bogdanrau

To do it purely in R, you’d need to run it on a server and run the R data collection script on a cron job. There are other tools out there made for ingesting data from streaming APIs (Apache Hive), but I am not familiar with that implementation. The difference between twitteR and streamR is that the former uses Twitter’s SEARCH API, which only goes back about 2 weeks or so. StreamR uses Twitter’s STREAMING API, which only collects data in real time (no historical data). You could use both at the same time, though you’d need to de-duplicate.

jeremy

bogdanrau

What would be the use case here? Collect and display, or allow users to specify what to collect? You can store the credentials after the handshake locally and load it each time you make a call to the API.

Francesco Piccinelli

bogdanrau

You can use the search API to get tweets from about 2-weeks back. I don’t believe they have an API where you can historically mine the data. So your options are to start collecting now (with streaming) until you have enough sample, or use streaming and add on top the search API data. There are a few data vendors that are authorized by Twitter to sell historic data. Depending on your budget, you might want to reach out to them. Have a look at gnip: https://gnip.com/historical/

Amar

I am not able to proceed after this line.
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
After running this line, browser opens for login and then it gets disconnected.

According to the Twitter streaming API webpage, this won’t work as you want, because with the streaming API you can only choose EITHER location OR words (track); you can’t search for both at the same time. I.e. this should give you all tweets in SF rather than those with the terms “Affordable Care” etc.
Or did I misunderstand the streamingAPI documentation?

If you were to use twitteR instead to look into the last 2 weeks, how would you set up the authentication? The searchTwitter() function doesn’t take the oauth = my_oauth argument. I have been trying for a while. Any help would be appreciated.

bogdanrau

Thank you for pointing that out, and you are correct. Looks like the streaming API matches on EITHER a location bounding box OR a search term, but not both. This might have been a change in their API. They do mention that additional filtering steps need to be taken, so you could just do the filtering in R (i.e. collect all tweets in a location, then filter those again using grep or something else). With regards to twitteR, see this: http://bogdanrau.com/blog/collecting-tweets-using-r-and-the-twitter-search-api/. It was written a while ago so it might not be fully relevant, though it should hopefully help.

Sruteesh

Hi,
Very Helpful Article.
I have a small query. I tried running the above code for 120 secs and was able to stream only 5 tweets, which I feel is very slow, whereas searchTwitter from the twitteR package is much faster. Can you explain how streaming tweets using the above-mentioned method is useful?

bogdanrau

The twitteR package I believe uses the SEARCH API, which can only look into the past about 2 weeks or so. The streamR package allows you to take in “real-time” tweets as opposed to searching for them in the past. Depending on what your search terms are, you may get few records if any.

bogdanrau

Some search results for that error reveal that some IP addresses might have been blacklisted. Given that you were able to search the archive, I’m not certain that might be the case. Can you try switching from https to http in your url?

bogdanrau

julka

Thank You very much for this, it’s really great! However, when I get to this line I get an error message (please see below)
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
Error in function (type, msg, asError = TRUE) :
SSL certificate problem: unable to get local issuer certificate

I have been looking on Stackoverflow, but can’t find a reply (granted I’m quite a noob at R, so I might be not looking for the right thing).

bogdanrau

Thanks for your question! See the link below for a possible solution. Sounds like you might just need to download the cacert.pem file again. I run all my code on a server, so I can’t say 100% that this would work, but I remember running into something similar and that did the trick!

bogdanrau

I’ve not found any tutorials online regarding your question, however, I’ve dabbled a bit in doing just that. You’ll want to write a script that collects the data and writes it to a database (or file), and run that either on a cron job, or continuously (which might cause problems since the API sometimes goes down).

Alina

Hey!
Your problem is that after running the code line my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")) it opens the website, you click the allow application button and then get blocked with the message “No page available”? I had this problem, and the solution is to remove the callback URL from the Twitter application. Set the callback URL only when you want to download historical data.
Best,
Alina

bogdanrau

I’m streaming tweets by location using the locations argument and it works fine. The structure of my code is like yours: I first stream, writing a file to disk, and then parse and write the file again. I would like to run the code on a Windows server and leave the connection open all the time by setting the timeout argument equal to 0.

Any idea how to write the code so that it’d write a file every hour while leaving the connection open? I’ve been trying with a second R script that loads and parses the JSON file, but I didn’t manage to get it to work. Any hint is highly appreciated!

bogdanrau

That’s an excellent question, and one that I’ve thought about myself, though haven’t had time to investigate. This is the way I would try and develop a solution (there’s never a single solution so this is probably one of many):

1. The code to start collecting should stay the same as long as timeout = 0. I’m not sure if there are any rules on Twitter’s end that you’d need to be aware of, so definitely check their API docs for that.

2. You’ll need to create a function that uses the Sys.time() function in R. You basically set your start time at hh:mm:ss; that function would then copy your “collection” file into a separate file (you could use a loop to name it accordingly, maybe including some date/time info), and empty the “collection” file after copying, so that R can continue filling it.
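A minimal sketch of that rotation idea (the function and file names are hypothetical, and the scheduling line is just illustrative):

```r
# Copy the live collection file to a timestamped archive, then truncate it
# so the streaming process can keep appending to the same file.
rotate_tweets <- function(collection = "tweets.json") {
  stamp   <- format(Sys.time(), "%Y%m%d_%H%M%S")
  archive <- paste0("tweets_", stamp, ".json")
  file.copy(collection, archive)       # snapshot the hour's tweets
  close(file(collection, open = "w"))  # opening in "w" mode empties the file
  archive                              # return the archive file name
}

# Run once an hour, e.g. from a scheduled task or a simple loop:
# while (TRUE) { Sys.sleep(3600); rotate_tweets() }
```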

I’m not sure if this makes sense, or if there’s a better way to do it. The other approach that comes to mind is to use RMySQL to write to SQL tables that you have set up (you’d need a database server for that). So basically, each time a new tweet comes in, write it directly to the database. This would keep updating your tables indefinitely, so long as nothing breaks.

Hope this helps! Please do let me know if you find a solution. Very intriguing question!