Tag: Course Material

In our second session, we focus on setting up our data collection and getting acquainted with Twitter’s APIs. We start by discussing the research process with digital trace data. Following this, we will prepare your machine for working with Python. We will then get you access to Twitter’s API. Finally, we will use a couple of example scripts provided in our tutorial to get some first practice in collecting data on Twitter.

To prepare for this session, or to build on the issues we discussed, have a look at Twitter’s API documentation. Make sure to take an extended look at the data fields provided in Twitter’s API objects. These show you which information Twitter provides through its API and, therefore, which information is available to you for your analyses.

Also, keep in mind that the command line might give you trouble when accessing a file on your machine that has spaces in its name or its path. In these cases, either rename the file or put quotation marks around the complete file path (e.g. cd “/Users/(…)/twitterresearch”).

Also, please prepare your machine for the following sessions. First, make sure you have a Python distribution up and running. For the purposes of this course, I recommend Continuum Analytics’ Anaconda. No matter which installer you choose in the end, make sure to get an installer for Python 3.x, not 2.x! This is also important to keep in mind when choosing tutorials or example code online. The syntax of Python 2.x diverges slightly from that of 3.x, so examples and recipes developed for one version will not necessarily work with the other. So mind the versions!
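If you are unsure which Python version your machine currently runs, a quick check from Python itself settles it. The following is a minimal sketch; the version string in the comment is purely illustrative.

    # Check which Python version you are running before following examples online.
    import sys

    print(sys.version)  # e.g. "3.5.2 |Anaconda custom (x86_64)| ..."

    # The examples in this course assume Python 3; fail early otherwise.
    if sys.version_info[0] < 3:
        raise RuntimeError("These examples require Python 3.x, not 2.x")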

Now, make sure your system is prepared to work with Python. If you are using a Mac, make sure you have Apple’s Xcode installed. If you are using a PC, please install a current version of Microsoft’s Visual Studio. Make sure your version includes both “Visual C++” and “Common Tools for Visual C++ 2015”, as both are needed to run specific Python modules. In case you run into any trouble, have a look at this comment thread.

Welcome to the course! You have an interesting four days to look forward to, at the end of which I hope you are at least as excited about working with digital trace data as you are now, but of course much more able to translate that excitement into actual scientific projects.

We then will focus on two prominent fallacies in the work with digital trace data:

The n = all fallacy;

The mirror hypothesis.

Both fallacies can be found, explicitly or implicitly, in prominent works based on digital trace data. They limit the value of research based on digital trace data and raise false expectations about which types of insight these data can actually deliver.

Central to avoiding these fallacies are three often neglected steps:

Start by clearly thinking about research design in working with digital trace data.

Keep in mind the data-generating process that led to the production of a specific data set. Doing so will help you in deciding and justifying for which social or political phenomena specific sets of digital trace data might hold promising insights.

After this, we will close by discussing a series of interesting questions in political science closely related to the data-generating process leading to the publication of tweets and, therefore, closely connected with digital trace data.

After getting the hang of collecting data on Twitter and preparing them for analysis, it is time for you to design your own research project. As always, the best research questions are not purely data-driven or motivated solely by the opportunities provided by access to specific data sets. Instead, make sure to anchor your research design within larger questions in political or social science.

One approach to finding a promising question might be: “What aspect of social or political life is closely connected with political Twitter activity and might, therefore, be illustrated by data collected on Twitter?”

In the beginning, this might seem a little awkward or challenging, but stick with it. If you do not ask these questions, someone else will. This way, you will make sure you have a good answer once you are asked in front of a room full of people. Also, choosing projects based on answers to these questions will make for rewarding projects and ultimately better chances of publication.

Before you settle on a question, make sure to read up on what has already been done with Twitter data. A helpful overview of Twitter-based research on electoral campaigns can be found in the required readings for this session.

After getting a first overview of some of the work that has been done on and with Twitter in political contexts, it might pay off to read up on some of the conceptual issues in the use of digital trace data for social science research. Here is a short list of papers offering a good window into current methodological debates.

Literature reviews and conceptual debates are all well and good, but nothing stimulates your intuition like reading primary research directly. For this purpose, I have provided you with a slightly longer list of innovative studies focusing on political uses of Twitter or using Twitter data in research. Of course, this can only be a small selection and is by no means exhaustive. Still, it should provide you with a running start.

In case you are interested in the Twitter activities of German politicians, there are potentially some shortcuts available to you. GESIS has published a dataset documenting all publicly available tweets published by candidates running in the 2013 German Federal Election. The dataset also contains mentions of these politicians by other Twitter users and a set of tweets containing topically relevant hashtags.

In this session, we focus on querying the data contained in our database to extract information for a series of typical analyses often performed with Twitter data: counts, time series, and network analysis. In the tutorial, we describe these analytical approaches in detail and list exemplary studies illustrating them (see Jürgens & Jungherr, pp. 42-79).

We will query the database from Python using a series of predefined commands. As before, we use peewee to communicate with our SQLite database from Python. Make sure to examine the workings of these commands in detail. You will find them listed in our script “database.py”.
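To give you a first flavor of what such queries look like, here is a minimal sketch of two count queries with peewee. The model and field names (Tweet, user, text, date) are illustrative assumptions, not the exact schema defined in “database.py”.

    # A minimal sketch of counting entities with peewee; the Tweet model is
    # an illustrative stand-in for the schema defined in "database.py".
    import peewee

    db = peewee.SqliteDatabase("tweets.db")

    class Tweet(peewee.Model):
        id = peewee.BigIntegerField(primary_key=True)
        user = peewee.CharField()
        text = peewee.TextField()
        date = peewee.DateTimeField()

        class Meta:
            database = db

    db.connect()

    # Total number of tweets in the database.
    print(Tweet.select().count())

    # The ten most active accounts, by number of tweets.
    top_users = (Tweet
                 .select(Tweet.user, peewee.fn.COUNT(Tweet.id).alias("n"))
                 .group_by(Tweet.user)
                 .order_by(peewee.fn.COUNT(Tweet.id).desc())
                 .limit(10))
    for row in top_users:
        print(row.user, row.n)

Queries like these underlie the count-based analyses discussed in the tutorial; time series and network extractions follow the same pattern with different groupings.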

The example scripts provided are set up to work with an example dataset we collected during the Republican Primary debates in the autumn of 2015. You can download a replication dataset through Twitter’s “hydrate” function following the instructions in Jürgens & Jungherr, p. 42. Of course, you can adapt the commands provided in the files “example.py” and “database.py” according to your interests. Still, at present they are optimized for working with our sample dataset.

During this session, we will learn how to load data downloaded from Twitter into a database. Loading tweets into a database creates a little overhead at the beginning, but trust me, this will more than pay off further down the line when you are working with large data sets.

In this session, we will discuss the workings of the file “database.py” provided in the script set accompanying Jürgens & Jungherr (2016). The script loads downloaded tweets into a SQLite database and establishes a predefined database structure that supports a set of typical analytical approaches to Twitter data. The script uses peewee to interact with SQLite from Python.
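As a rough illustration of what this loading step involves, consider the following sketch. It assumes tweets were saved as one JSON object per line and uses a deliberately minimal model; the actual schema established by “database.py” is richer.

    # A minimal sketch of loading line-delimited tweet JSON into SQLite via
    # peewee. File name, model, and fields are illustrative assumptions.
    import json
    from datetime import datetime

    import peewee

    db = peewee.SqliteDatabase("tweets.db")

    class Tweet(peewee.Model):
        id = peewee.BigIntegerField(primary_key=True)
        user = peewee.CharField()
        text = peewee.TextField()
        date = peewee.DateTimeField()

        class Meta:
            database = db

    db.connect()
    db.create_tables([Tweet])

    with open("tweets.jsonl") as infile:
        for line in infile:
            raw = json.loads(line)
            Tweet.create(
                id=raw["id"],
                user=raw["user"]["screen_name"],
                text=raw["text"],
                # Twitter's created_at format, e.g. "Mon Sep 24 03:35:21 +0000 2012"
                date=datetime.strptime(raw["created_at"],
                                       "%a %b %d %H:%M:%S %z %Y"),
            )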

To retrace your steps after the session, have a look at Jürgens & Jungherr (2016), pp. 29-41. Also, have a look at the example code provided at the end of the post.

In this session, we will focus on different approaches to getting data from Twitter’s various APIs. We will be using the example scripts provided in our tutorial.

We will collect data through Twitter’s Streaming APIs using keywords, hashtags, and user names as selectors. We will also use Twitter’s REST APIs to collect messages containing specific keywords or hashtags posted during the last seven days and to collect messages from users’ tweet archives.
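To illustrate what keyword-based collection from the Streaming API looks like in practice, here is a minimal sketch using the third-party tweepy library rather than our course scripts. The credential placeholders and tracked terms are assumptions you would replace with your own.

    # A minimal sketch of keyword-based collection from Twitter's Streaming
    # API, using the third-party tweepy library (not our example scripts).
    import tweepy

    # Replace the placeholders with the credentials of your own Twitter app.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

    class PrintListener(tweepy.StreamListener):
        def on_status(self, status):
            # Each status carries the metadata fields documented in
            # Twitter's API objects.
            print(status.created_at, status.user.screen_name, status.text)

    stream = tweepy.Stream(auth=auth, listener=PrintListener())
    # Track tweets containing a hashtag or a keyword (illustrative selectors).
    stream.filter(track=["#examplehashtag", "example keyword"])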

In preparation for the session, have a look at the mandatory readings and Twitter’s documentation of its Streaming and REST APIs, as well as the API objects found in the metadata of specific tweets.

In this session, we will learn some fundamentals of working with Python, so make sure you have a working copy of Python running on your machine.

In this session, we will concentrate on very basic Python functionality so as to allow you to read and modify some of the example scripts provided in the Python tutorial underlying the course. We will have a look at some basic commands in Python, how to write and run Python scripts from the command line, flow control in scripts, and the definition and loading of functions.
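As a small taste of these topics, the purely illustrative script below combines them: it defines a function, uses flow control inside it, and calls the function from a loop.

    # A purely illustrative example combining the session's topics:
    # defining a function, flow control, and calling the function in a loop.
    def label_count(count):
        """Return a human-readable label for a tweet count."""
        if count == 0:
            return "no tweets"
        elif count == 1:
            return "one tweet"
        else:
            return "{} tweets".format(count)

    # Save this as, say, basics.py and run it with: python basics.py
    for n in [0, 1, 5]:
        print(label_count(n))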

The examples given in the session will largely follow those given in two excellent introductory tutorials.

We will only have time to discuss a small selection of the content covered in these tutorials, so make sure to spend some time after the course working through them. This will help you massively in becoming more self-sufficient in the use of Python and ultimately allow you much more flexibility in collecting and analyzing digital trace data.

Another option for teaching yourself the basics of Python is the free interactive introductory Python course offered by Codecademy.

As you have probably gathered by now, this session can only offer you the most preliminary of introductions to the use of Python. But do not worry: if you have caught the bug, there are excellent guides available that will take you further down the rabbit hole.

Over the course, students will learn fundamental techniques of data collection, preparation, and analysis with digital trace data in the social sciences. In this, we will focus on working with the microblogging service Twitter. Students are expected to become proficient in the use of two programming languages, Python and R. The course will be offered as a block seminar on two weekends in October and November.

The course is designed for students without prior training in programming or exploratory data analysis. Still, by the end of the course students are expected to independently perform theory-driven data collections on the microblogging service Twitter and use these data in a series of specified prototypical analyses.

We will start the course by focusing on conceptual issues associated with work with digital trace data. Students will then learn fundamental practices in the use of the programming language Python. Following this, we will collect data from Twitter’s APIs through a set of example scripts written in Python. After downloading data from Twitter through Python, we will load these data into a SQLite database for ease of access and flexibility in data processing tasks. Finally, we will discuss a series of typical analytical procedures with Twitter data. Here, we will focus on counting entities and establishing their relative prominence, time series analysis, and basic approaches to network analysis. For these analyses, we will predominantly rely on R.