The main drawback of the ASCII CSV parser and the csv library is that they can’t handle unicode characters or objects. I want to be able to make a CSV file that is encoded in UTF-8, so that will have to be done from scratch. The basic structure follows the previous ASCII post, so the description of the json Python object can be found in the previous tutorial.

io.open

First, to handle the UTF-8 encoding, I used the io.open() function. For the sake of consistency, I used it for both reading the JSON file and writing the CSV file. This doesn’t require much change to the structure of the program, but it’s an important change. The json.loads() call parses the JSON data into an object you can access like a Python dictionary.

Python

import json
import csv
import io

data_json = io.open('raw_tweets.json', mode='r', encoding='utf-8').read()  # reads in the JSON file

Unicode Object Instead of List

This program uses the write() method instead of the csv.writerow() method, and write() requires a string (in this case a unicode object) instead of a list, so commas have to be inserted manually for the row to be formatted properly. For the field names, I just rewrote the line of code to be a unicode string instead of the list used for the ASCII parser. The u'*string*' syntax creates a unicode string, which behaves much like a normal string but is a different type, and using the wrong type of string can cause compatibility issues. The line of code that uses u'\n' creates a new line in the CSV. Because nothing adds line endings for us, the parser has to write the newline character itself at the end of each row.
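As a rough illustration of the header row, here is a minimal sketch written for Python 3, where every str is already unicode and the u prefix is optional. The column names and the output path are assumptions made for the example, not the original script's values.

```python
import io
import os
import tempfile

# Column names mirror the fields discussed in these posts; the exact
# header text is an assumption, not the original script's wording.
header = u'created_at,text,screen_name,followers,friends,rt,fav'

# Hypothetical output path in a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), 'tweets_utf8.csv')

csv_out = io.open(out_path, mode='w', encoding='utf-8')
csv_out.write(header)   # write() takes one string, commas already in place
csv_out.write(u'\n')    # the row ends only because we write the newline
csv_out.close()

# Read the first line back to confirm what landed in the file.
with io.open(out_path, mode='r', encoding='utf-8') as check:
    first_line = check.readline()
```

Since write() does no formatting of its own, forgetting the u'\n' would run the header straight into the first data row.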

The for loop and Delimiters

This might be the biggest change relative to the ASCII program. Since this is a CSV parser made from scratch, the delimiters have to be programmed in. In this flavor of CSV, the text field is entirely enclosed by quotation marks (") and commas (,) separate the different fields. To account for the possibility of quotation marks in the actual text content, any real quotation mark is written as a doubled quotation mark (""). This can give rise to triple quotes, which happens when a quotation mark starts or ends a tweet’s text field.

Python

for line in data_python:
    # writes a row and gets the fields from the json object
    # screen_name and followers/friends are found on the second level hence two get methods
    row = [line.get('created_at'),
           '"' + line.get('text').replace('"', '""') + '"',  # creates double quotes
           line.get('user').get('screen_name'),
           unicode(line.get('user').get('followers_count')),
           unicode(line.get('user').get('friends_count')),
           unicode(line.get('retweet_count')),
           unicode(line.get('favorite_count'))]

    row_joined = u','.join(row)
    csv_out.write(row_joined)
    csv_out.write(u'\n')

csv_out.close()

This parser implements the delimiter requirements of the text field by:

Replacing all quotation marks in the text with double quotes.

Adding quotation marks to the beginning and end of the unicode string.

Python

'"' + line.get('text').replace('"', '""') + '"',  # creates double quotes

Joining the row list using a comma as a separator is a quick way to write the unicode string for the line of the CSV file.
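To see the quoting rule and the join working together, here is a small Python 3 sketch. The helper name quote_field and the sample values are made up for illustration. The line is read back with the csv library only to check that the hand-built quoting parses as intended.

```python
import csv
import io


def quote_field(text):
    """Wrap text in quotation marks, doubling any quotation marks inside it."""
    return u'"' + text.replace(u'"', u'""') + u'"'


# Hypothetical tweet values, chosen to exercise commas, quotes and non-ASCII text.
row = [u'Mon Nov 09 12:00:00 +0000 2015',
       quote_field(u'"Fries, please" \U0001f35f'),
       u'example_user',
       str(10)]

row_joined = u','.join(row)

# Read the line back with the csv library to confirm the quoting round-trips.
parsed = next(csv.reader(io.StringIO(row_joined)))
```

The text field here starts with a quotation mark, so the joined line contains the triple quote described earlier, and the comma inside the tweet does not split the field when the line is parsed.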

I outlined some of the potential hurdles that you have to overcome when converting Twitter JSON data to a CSV file in the previous section. Here I outline a quick Python script that parses your Twitter JSON file with the csv library. The obvious drawback is that it can’t handle the UTF-8 encoded characters that can be present in tweets, but the program will produce a CSV file that works well in Excel or other programs that are limited to ASCII characters.

The JSON File

The first requirement is to have a valid JSON file. This file should contain an array of Twitter JSON objects, or in analogous Python terms a list of Twitter dictionaries. The tutorial for the Python Stream Listener has been updated to produce a correctly formatted file that works in Python.

Python

[{Twitter JSON Object}, {Twitter JSON Object}, {Twitter JSON Object}]

The JSON file is loaded into Python and parsed into a Python-friendly object by the json library’s json.loads() method. The open() line opens the file and reads it in as a string; json.loads() then decodes the string into a json Python object, which behaves like a list of Python dictionaries, one dictionary for each tweet.

Python

import json
import csv

data_json = open('raw_tweets.json', mode='r').read()  # reads in the JSON file into Python as a string
data_python = json.loads(data_json)  # turns the string into a json Python object
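To make the shape of the parsed object concrete, here is a short Python 3 sketch that skips the file and parses an inline string instead. The two tweets are invented and heavily trimmed; real tweet objects carry far more fields.

```python
import json

# A tiny stand-in for raw_tweets.json: an array of two heavily trimmed,
# made-up tweet objects (real tweets carry far more fields).
data_json = u'''[
  {"created_at": "Mon Nov 09 12:00:00 +0000 2015", "text": "first tweet",
   "user": {"screen_name": "example_user", "followers_count": 10}},
  {"created_at": "Mon Nov 09 12:01:00 +0000 2015", "text": "second tweet",
   "user": {"screen_name": "other_user", "followers_count": 25}}
]'''

data_python = json.loads(data_json)  # list of dictionaries, one per tweet

texts = [tweet['text'] for tweet in data_python]
first_screen_name = data_python[0]['user']['screen_name']
```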

The CSV Writer

Before getting too far ahead, a CSV writer should create a file and write the first row to label the data columns. The open() line creates a file and allows Python to write to it. This is a generic file, so anything could be written to it. The csv.writer() line creates an object which will write CSV-formatted text to the file we just opened. There are some other parameters you are able to specify, but it defaults to Excel specifications, so those options can be omitted.

The purpose of this parser is to get some really basic information from the tweets, so it will only get the date and time, text, screen name and the number of followers, friends, retweets and favorites [which are called likes now]. If you wanted to retrieve other information, you’d create the column names accordingly. The writerow() method writes a list, with each element becoming a value that is separated from the next by a comma in the CSV file.
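A minimal sketch of the writer setup and header row, written for Python 3. The column labels and the output path are assumptions for the example, not the original script's values.

```python
import csv
import os
import tempfile

# Column labels are illustrative; pick names that match the fields you keep.
fields = ['created_at', 'text', 'screen_name', 'followers', 'friends', 'rt', 'fav']

# Hypothetical output path in a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), 'tweets_ascii.csv')

# newline='' is what the Python 3 csv documentation recommends for csv files.
with open(out_path, mode='w', newline='') as csv_file:
    writer = csv.writer(csv_file)  # defaults to the Excel dialect
    writer.writerow(fields)        # one list in, one comma-separated row out

with open(out_path, mode='r', newline='') as csv_file:
    header_line = csv_file.readline()
```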

The json Python object can be used in a for loop to access the individual tweets. From there each line can be accessed to get the different variables we are interested in. I’ve condensed the code so that it is all in one statement. Breaking it down, line.get('*attribute*') retrieves the relevant information from the tweet. The line represents an individual tweet.

Python

for line in data_python:
    # writes a row and gets the fields from the json object
    # screen_name and followers/friends are found on the second level hence two get methods

If the encode() method isn’t included, unicode characters (like emojis) are passed along in their native encoding. They are sent to the csv.writer object, which can’t handle those characters and fails. The encode() call is necessary for any field that could possibly contain a unicode character. I know the other fields I chose cannot have non-ASCII characters, but if you were to add name or description, you’d have to make sure they do not have incompatible characters.

The unicode escape rewrites each unicode character as a string of letters and numbers, such as \U0001f35f. These escapes represent the characters and can be decoded later.
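Here is a rough Python 3 sketch of the escape step and its reversal. The sample text and screen name are made up. In Python 3, encode() returns bytes, so the sketch decodes them back to an ASCII-only str before handing the value to csv.writer; the Python 2 script would not need that extra step.

```python
import csv
import io

text = u'fries \U0001f35f'  # hypothetical tweet text containing an emoji

# unicode_escape yields ASCII-only bytes; decode them to get an ASCII str.
escaped = text.encode('unicode_escape').decode('ascii')

buffer = io.StringIO()
csv.writer(buffer).writerow(['example_user', escaped])
ascii_row = buffer.getvalue()

# The escape is reversible, so the original character can be recovered later.
restored = escaped.encode('ascii').decode('unicode_escape')
```

Note that unicode_escape also escapes backslashes and newlines in the text, which is part of why the result can be decoded without ambiguity.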

Before diving into the problem of how to save tweets in a CSV file, let me say there are 1,000 ways to do this and about 100 complications that arise depending on which way you choose. I will devote two posts to this, covering both ASCII and UTF-8 encoding, because many tweets contain characters beyond the normal Latin alphabet.

Let’s look at some of the issues with writing CSV from tweets.

Tweets are JSON and contain a massive amount of metadata. More than you probably want.

The JSON isn’t a flat structure; it has levels. [Direct contrast to a CSV file.]

The JSON files don’t all have the same elements.

There are many foreign languages and emoji used in tweets.

Tweets contain many different grammatical marks such as commas and quotation marks.

These issues aren’t incredibly daunting, but those unfamiliar will encounter frustrating errors.

Tweets are JSON and contain a massive amount of metadata. More than you probably want.

I’m always in favor of keeping as much data as possible, but tweets contain a massive amount of different metadata attributes. All of these are designed for the Twitter platform and for the associated client apps. Some items like the profile_background_image_url_https really don’t have much of an impact on any analysis. Choosing which attributes you want to keep will be critical before embarking on a process to parse the data into a CSV. There’s a lot to choose from: timestamp data, user data, retweet data, geocoding data, hashtag data and link data.

The JSON isn’t a flat structure; it has levels.

This issue is an extension of the previous issue, since tweet JSON data isn’t organized into a flat, spreadsheet-like structure. The created_at and text elements are located on the top level and are easy to access, but something as simple as the tweeter’s name and screen_name are located in the user nested object. Like everything else mentioned in this post, this isn’t a huge issue, but the structure of a tweet JSON file has to be considered when coding your program.

The JSON files don’t all have the same elements.

The final problem with JSON files is that the fields aren’t necessarily present in every object. Many geo-related attributes do not appear unless geotagging is enabled. This means that if you write your program to look for geotagging data, it can raise a KeyError if those keys don’t exist in that specific tweet. To avoid this you have to handle the exception or use a method that already does that. I use the get() method to avoid these key errors in the CSV parser.
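A small sketch of nested access and of get() on a missing key, using an invented and heavily trimmed tweet dictionary. The choice of coordinates and place as the absent fields is only for illustration.

```python
# A made-up, heavily trimmed tweet with no geotagging data.
tweet = {
    'created_at': 'Mon Nov 09 12:00:00 +0000 2015',
    'text': 'no location on this one',
    'user': {'name': 'Example User', 'screen_name': 'example_user'},
}

# Top-level fields are one lookup away; user fields are one level down.
screen_name = tweet.get('user').get('screen_name')

# tweet['coordinates'] would raise KeyError here; get() returns None instead.
coordinates = tweet.get('coordinates')

# Passing an empty dict as the default keeps a chained lookup safe even
# when the parent object itself is missing.
place_name = tweet.get('place', {}).get('full_name')
```

If a key is present but holds a null value, get() returns None and a chained call on it would still fail, so (tweet.get('place') or {}).get('full_name') is the more defensive form.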

There are many foreign languages and emoji used in tweets.

I quickly addressed this issue in a few posts, and it’s one of the reasons why I like to store tweets in MongoDB. Tweets contain a lot of [read: important] unicode characters. These are typically foreign language characters and the ubiquitous emojis. This is important because the presence of UTF-8 unicode characters can and will cause encoding errors when parsing a file or loading a file into Excel. Excel (at least the version on my computer) can’t handle these characters. Other tools, like the built-in CSV writer in Python, can’t handle unicode out of the box. Being able to deal with these characters is critical to compatibility with other software as well as the integrity of your data.

This issue forces me to write two different parsers for examples. I have a CSV parser that outputs ASCII that imports well into Excel along with a UTF-8 version which allows you to natively save the characters and emojis in a human-readable CSV file.

Tweets contain many different grammatical marks such as commas and quotation marks.

This is a problem that I had when I first started working with Twitter data and tried to write my own parser: characters that are part of your text content sometimes get confused with the delimiters. In this case I’m talking about quotation marks (") and commas (,). Commas separate the values for each ‘cell’, hence the acronym CSV [comma-separated values]. If you tweet, you’ve probably used one of these characters. I’ve stripped them out of the text previously to solve this problem, but that’s not a great solution. The way Excel handles this is to enclose any element that contains commas in quotation marks, then to use doubled quotation marks to signify an actual quotation mark rather than the end of the enclosed text. This will be demonstrated in the UTF-8 parser since I made that from scratch.

In the first three sections of the Twitter data collection tutorial, I demonstrated how to collect tweets using both R and Python and how to store these tweets first as JSON files then having R parse them into a .csv file. The .csv file works well, but tweets don’t always make good flat .csv files, since not every tweet contains the same fields or the same structure. Some of the data is deeply nested in the JSON object. It is possible to write a parser that has a field for each possible subfield, but this might take a while to write and will create a rather large .csv file or SQL database.

MongoDB

Fortunately, NoSQL databases like MongoDB exist, and they greatly simplify tweet storage, search, and recall, eliminating the need for a tweet parser. Installation and setup of MongoDB and the pymongo library is beyond the scope of this tutorial, but I can quickly explain what MongoDB does. It is a document-based database that uses documents instead of tuples in tables to store data. These documents look just like JSON objects using key-value pairs, but they are called BSON [since they’re stored as binary]. From a programming perspective, they have similar properties to both JS objects and Python dictionaries.

Since JSON and BSON are so similar, storing a tweet in a MongoDB database is as easy as putting the entire content of the tweet’s JSON string into an insert statement. Recalling or searching the tweets is rather simple as well; it does require an OOP mindset over the traditional SQL command structure.

[I’m writing this from the perspective of ad hoc small-scale research. There might be performance issues that make other storage options much more desirable. Knowing the specific metadata from a tweet you want to keep will make any analysis faster or require less storage space. MongoDB allows you to store all the information the API returns to you.]

Storing Tweets in MongoDB

I am going to assume that you have MongoDB running on your local computer for all the code examples.

Storing tweets is rather simple if you already have the Python stream listener built from Part III of the tutorial, since there are only a few changes to be made to the code. The first change will be calling the libraries: pymongo and json. The json library is available by default in Python, but you’ll have to install pymongo using pip install pymongo if you have the pip installer. The bulk of the changes will be in the listener child class.

Python

from pymongo import MongoClient
import json

# time and StreamListener are imported as in the stream listener script

class listener(StreamListener):

    def __init__(self, start_time, time_limit=60):
        self.time = start_time
        self.limit = time_limit

    def on_data(self, data):
        while (time.time() - self.time) < self.limit:
            try:
                client = MongoClient('localhost', 27017)
                db = client['twitter_db']
                collection = db['twitter_collection']
                tweet = json.loads(data)
                collection.insert(tweet)
                return True
            except BaseException, e:
                print 'failed ondata,', str(e)
                time.sleep(5)
                pass
        exit()

    def on_error(self, status):
        print status

The major changes in the code are:

Python

client = MongoClient('localhost', 27017)
db = client['twitter_db']
collection = db['twitter_collection']
tweet = json.loads(data)
collection.insert(tweet)

MongoClient() creates the MongoClient instance which will be used to interface with the database. The client['twitter_db'] call designates the database that is going to be used, and the db['twitter_collection'] call selects the collection where the documents will be stored. The json.loads() call converts the string returned from the Twitter API into a json object in Python. Finally, the collection.insert() call inserts the json object into the MongoDB database. [Newer PyMongo releases deprecate and then remove insert() in favor of insert_one().] From this rather simple change to the Python stream listener all the tweets can be saved into a MongoDB database.
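The parse-then-insert step can be sketched on its own in Python 3. The function name store_tweet and the FakeCollection class are hypothetical; the fake stands in for a real PyMongo collection so the example runs without a MongoDB server, and it uses insert_one(), the call current PyMongo versions provide.

```python
import json


def store_tweet(collection, data):
    """Parse one raw tweet string and insert it as a single document."""
    tweet = json.loads(data)
    collection.insert_one(tweet)  # insert_one() is the current PyMongo call
    return tweet


class FakeCollection(object):
    """Hypothetical in-memory stand-in so the sketch runs without a server."""

    def __init__(self):
        self.documents = []

    def insert_one(self, document):
        self.documents.append(document)


collection = FakeCollection()
raw = u'{"text": "hello", "user": {"screen_name": "example_user"}}'
stored = store_tweet(collection, raw)
```

With a real database, the collection argument would be MongoClient('localhost', 27017)['twitter_db']['twitter_collection']. Creating that client once and reusing it, rather than once per tweet, is a design choice worth considering for longer collection runs.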

Recalling Tweets from MongoDB

Recalling the tweets from MongoDB database is not too difficult if you understand the basics of Python for loops and dictionaries. The function to retrieve any documents from the database is collection.find(). You are able to specify what you want to find or leave it blank and get all the documents (tweets) returned. For this example, I’ll first just leave it blank to get all the tweets.

After calling the .find() method, Python will return a MongoDB cursor, which can be iterated through in a for loop. The for loop runs the loop for each object in the iterator. If you wanted to print the text from every tweet you would write:

Python

tweets_iterator = collection.find()
for tweet in tweets_iterator:
    print tweet['text']

tweet contains one document [or in this case a tweet JSON object] from the sequence that tweets_iterator produces. On each pass the loop moves on to the next document, until it has run through every document in the iterator.

Since tweets in JSON format contain many subdocuments, it’s important to know what data you are looking for and where to find it. The following code snippet is an example of different fields available to examine.
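A sketch of that kind of field access, using a plain dictionary in place of a document from the cursor. The values are invented and the document is trimmed to the handful of fields discussed here.

```python
# Made-up, heavily trimmed retweet document; a real one has many more fields.
tweet = {
    'text': 'RT @example_author: original words',
    'user': {'name': 'Example Retweeter', 'screen_name': 'example_retweeter'},
    'retweeted_status': {
        'text': 'original words',
        'user': {'name': 'Example Author', 'screen_name': 'example_author'},
    },
}

text = tweet['text']                                 # top-level property
screen_name = tweet['user']['screen_name']           # property in the user subdocument
name = tweet['user']['name']
retweeted_text = tweet['retweeted_status']['text']   # content of the retweeted tweet
retweeted_author = tweet['retweeted_status']['user']['screen_name']
```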

The last ['field'] represents a property and any ['fields'] before the last represent subdocuments. The text field is on the top level of any tweet document; this is the text that is written in the tweet. There is a user subdocument with a lot of information in it. The code above pulls the screen_name and the user’s given name, as well as content from retweeted tweets. If I were to retweet Barack Obama, you’d be able to pull this data about Obama’s tweet from my retweet. I’ve used this to analyze retweet behavior.

Since MongoDB is a database, you are able to query it; you just can’t use SQL. The collection.find() is the method used for querying. Until now I’ve only used empty parameters in the .find() method to return the entire collection. Querying is done in a style similar to JSON.

To find an exact match to a string:

Python

collection.find({'text': 'This will return tweets with only this exact string.'})

The previous command will find only that exact string in a top level attribute. This isn’t helpful in a practical sense since exact searches aren’t very useful, but it’s the most basic find command. Having MongoDB pull tweets by a given user’s screen_name has some uses, but screen_name is in a subdocument, so it requires some new syntax, "document.subdocument":

Python

tweets = collection.find({'user.screen_name': 'exactScreenName'})

The above code will search the screen_name property in the user subdocument. Beyond exact searches, you can search for particular words using the regular expression operator. This will search the text property to see if it can find ‘word’ anywhere and return the entire tweet.

Python

tweets = collection.find({'text': {'$regex': 'word'}})

Since the tweets in MongoDB might not all have the same fields or properties, sometimes just checking whether a property exists is useful. For example, if you wanted to find all the native retweets in your collection, the following snippet will return any tweet with a retweeted_status property. [The retweeted_status is typically a subdocument containing all the information about the retweeted tweet.]

Python

collection.find({"retweeted_status": {"$exists": True}})
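To get a feel for what the subdocument, regex and exists queries select, here is a plain-Python emulation over two invented documents. It is purely illustrative: the helper get_path is hypothetical, MongoDB evaluates queries itself on the server, and its real matching rules (for arrays, for instance) are richer than this sketch.

```python
import re


def get_path(document, dotted_key):
    """Follow a 'document.subdocument' style key; returns (found, value)."""
    current = document
    for part in dotted_key.split('.'):
        if not isinstance(current, dict) or part not in current:
            return False, None
        current = current[part]
    return True, current


# Made-up documents standing in for a tiny collection.
tweets = [
    {'text': 'a word about fries', 'user': {'screen_name': 'exactScreenName'}},
    {'text': 'RT something else', 'user': {'screen_name': 'someone_else'},
     'retweeted_status': {'text': 'something else'}},
]

# Roughly {'user.screen_name': 'exactScreenName'}
by_user = [t for t in tweets
           if get_path(t, 'user.screen_name') == (True, 'exactScreenName')]

# Roughly {'text': {'$regex': 'word'}}
by_word = [t for t in tweets if re.search('word', t['text'])]

# Roughly {'retweeted_status': {'$exists': True}}
retweets = [t for t in tweets if get_path(t, 'retweeted_status')[0]]
```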

Conclusion

While using MongoDB has a learning curve, it can be rather useful to store data like tweets. It eliminates the need to write a parser since you effectively parse the data when you retrieve it. Knowing the subdocument structure of the documents in your database and thinking like a programmer rather than a SQL database user will help you successfully execute analyses in Python using MongoDB for Twitter data.

I use the term stream listener [2 words] to refer to the program built with this code and StreamListener [1 word] to refer to the specific class from the tweepy package. The two are related but not the same. The StreamListener class makes the stream listener program what it is, but the program entails more than the class.

While using R and its streamR package to scrape Twitter data works well, Python allows more customization than R does. It also has a steeper learning curve, because the coding is more involved. Before using Python to scrape Twitter data, a software package like tweepy must be installed. If you have the pip installer on your system, the installation procedure is rather easy and is executed in the Terminal.

Call Tweepy Library

Terminal:

Shell

$ pip install tweepy

After the software package is installed, you can start writing a stream listener script. First, the libraries have to be imported.

Python

import time
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import os

The three tweepy class imports will be used to construct the stream listener, the time library will be used to create a time-out feature for the script, and the os library will be used to set your working directory.

Set Variables Values

Before diving into constructing the stream listener, let’s set some variables. These variables will be used in the stream listener by being fed into the tweepy objects. I code them as variables instead of directly into the functions so that they can be easily changed.

Python

ckey = '**CONSUMER KEY**'
consumer_secret = '**CONSUMER SECRET KEY***'
access_token_key = '**ACCESS TOKEN**'
access_token_secret = '**ACCESS TOKEN SECRET**'

start_time = time.time()  # grabs the system time
keyword_list = ['twitter']  # track list

Using and Modifying the Tweepy Classes

I believe that tweet scraping with Python has a steeper learning curve than with R, because Python is dependent on combining instances of different classes. If you don’t understand the basics of object-oriented programming, it might be difficult to comprehend what the code is accomplishing or how to manipulate the code. The code I show in this post does the following:

Creates an OAuthHandler instance to handle OAuth credentials

Creates a listener instance with a start time and time limit parameters passed to it

Creates a Stream instance with the OAuthHandler instance and the listener instance

Before these instances are created, we have to “modify” the StreamListener class by creating a child class to output the data into a .json file.

Python

import io  # needed for io.open() below

# Listener Class Override
class listener(StreamListener):

    def __init__(self, start_time, time_limit=60):
        self.time = start_time
        self.limit = time_limit
        self.tweet_data = []

    def on_data(self, data):
        # collect tweets until the time limit is reached
        while (time.time() - self.time) < self.limit:
            try:
                self.tweet_data.append(data)
                return True
            except BaseException, e:
                print 'failed ondata,', str(e)
                time.sleep(5)
                pass

        # time limit reached: write everything out as one JSON array
        saveFile = io.open('raw_tweets.json', 'w', encoding='utf-8')
        saveFile.write(u'[\n')
        saveFile.write(','.join(self.tweet_data))
        saveFile.write(u'\n]')
        saveFile.close()
        exit()

    def on_error(self, status):
        print status

This is the most complicated section of the code. It rewrites the actions taken when the StreamListener instance receives data [the tweet JSON].

Python

saveFile = io.open('raw_tweets.json', 'w', encoding='utf-8')
saveFile.write(u'[\n')
saveFile.write(','.join(self.tweet_data))
saveFile.write(u'\n]')
saveFile.close()

This block of code opens an output file, writes the opening square bracket, writes the JSON data as text separated by commas, then inserts a closing square bracket, and closes the document. This is the standard JSON format, with each Twitter object acting as an element in a JavaScript array. If you bring this into R or Python, a JSON parser such as Python’s json library can properly handle it.
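The bracket-and-comma format can be checked without a live stream. This Python 3 sketch writes two invented raw tweet strings the same way and parses the file back; the temporary output location is an assumption for the example.

```python
import io
import json
import os
import tempfile

# Two made-up raw tweet strings, standing in for what the API delivers.
tweet_data = [u'{"text": "first", "user": {"screen_name": "a"}}',
              u'{"text": "second \U0001f35f", "user": {"screen_name": "b"}}']

out_path = os.path.join(tempfile.mkdtemp(), 'raw_tweets.json')  # hypothetical location

with io.open(out_path, 'w', encoding='utf-8') as save_file:
    save_file.write(u'[\n')
    save_file.write(u','.join(tweet_data))
    save_file.write(u'\n]')

# Parsing the file back confirms it is a valid JSON array of tweet objects.
with io.open(out_path, 'r', encoding='utf-8') as check:
    parsed = json.loads(check.read())
```

Using join() for the commas means there is no trailing comma after the last tweet, which is what keeps the array valid.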

This section can be modified to change the JSON file. For example, you can place other properties/fields like a UNIX time stamp or a random variable into the JSON. You can also modify the output file, or eliminate the need for an output file altogether and insert the tweet directly into a MongoDB database. As it is written, this will produce a file that can be parsed by Python’s json library.
After the child class is created we can create the instances and start the stream listener.

Python

auth = OAuthHandler(ckey, consumer_secret)  # OAuth object
auth.set_access_token(access_token_key, access_token_secret)

twitterStream = Stream(auth, listener(start_time, time_limit=20))  # initialize Stream object with a time out limit
twitterStream.filter(track=keyword_list, languages=['en'])  # call the filter method to run the Stream Object

Here the OAuthHandler uses your API keys [consumer key & consumer secret key] to create the auth object. The access token, which is unique to an individual user [not an application], is set in the following line. Unlike the filterStream() function in R, this will take all four of your credentials from the Twitter Dev site. The modified StreamListener class, simply called listener, is used to create a listener instance. This contains the information about what to do with the data once it comes back from the Twitter API call. Both the listener and auth instances are used to create the Stream instance, which combines the authentication credentials with the instructions on what to do with the retrieved data. The Stream class also contains a method for filtering the Twitter Stream. This method works just like the R filterStream() function, taking similar parameters, because the parameters are passed to the Stream API call.

Python vs R

At this stage in the tutorial, I would recommend parsing this data using the parser in R from the last section of the Twitter tutorial or creating your own. Since it’s easier to customize the StreamListener methods in Python, I prefer to use it over R. Generally, I think Python works better for collecting and processing data, but isn’t as easy to use for most statistical analysis. Since tweet scraping would fall into the data collection category, I like Python. It becomes easier to access databases and to manipulate the data when you are already working in Python.

11-10-2015 — I’ve updated the StreamListener to output properly formatted JSON. The old script which works well with R’s tweetParse is still available on my GitHub.