Since presidential campaigns have incorporated social media into their strategic messaging, it has become more challenging for journalists to cover the election in depth because of the sheer volume of data generated by candidates and the public every day. Journalists tend to focus on single quotes or tweets rather than reporting on and analyzing social media messages in the aggregate. But single messages may not give people a full appreciation of a candidate's campaigning style or of the substance of the rest of the messages.

To predict presidential campaign message types, we used gubernatorial campaign data from 2014 to build initial categories and to train machine-learning models. We then tested the reliability of the best models built from the gubernatorial data and applied them to classify messages from the 2016 presidential campaign in real time. We've been collecting all of the announced major-party candidates' Twitter and Facebook posts since they declared their presidential bids. In all, we have filled six servers with 24 presidential candidates' social media messages, and of course we're still collecting. The diagram below demonstrates how we use machine learning to train the models.

Diagram 1: Model training

To understand candidates’ social media message strategy, we collected the Facebook and Twitter messages produced by the campaign accounts of 79 viable candidates who ran for governor. The collection started September 15th, when all states had completed their primaries and shifted into the general election phase, and continued through November 7th, three days after the election. We ended up with a total of 34,275 tweets and 9,128 Facebook posts. We categorized these messages into performative speech categories: strategic message, informative, call-to-action, and ceremonial. We also added non-English and conversational categories (conversational applies only to Twitter). These categories allow us to understand when candidates advocate for themselves, go on the attack, urge supporters to act, and use the affordances of social media to interact with the public.

These categories were developed deductively and were revised based on inductive analysis. We trained annotators and refined the codebook over several rounds until two or more annotators could look at the same message and agree on the category. We generated an inter-coder agreement score to determine how easy or hard it is for humans to categorize the messages, and also to make sure our categories are clearly defined and as mutually exclusive as possible. Our agreement score is a Krippendorff’s Alpha of .70 or greater on all categories. After annotating data independently, annotators developed gold standard annotations: two coders categorized the same messages, and where they disagreed on a category, they talked it through and decided which was the “best” category for that message. This process yielded 4,147 tweets and 2,493 Facebook messages by the candidates as gold standard data.
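As an illustration of the agreement measure, Krippendorff’s Alpha for nominal categories can be computed directly from the coders’ labels using the standard coincidence-matrix formula. This is a generic sketch, not the tool our project used, and the function name is ours:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal data.

    `units` is a list of lists; each inner list holds the labels that
    the coders assigned to one message (at least two labels per unit).
    """
    # Build the coincidence matrix: each ordered pair of labels within a
    # unit contributes 1 / (m - 1), where m is the number of coders.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit coded by one annotator carries no pair information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per category.
    n_c = Counter()
    for (a, _b), count in coincidences.items():
        n_c[a] += count
    n = sum(n_c.values())

    # Observed disagreement: off-diagonal coincidences.
    d_o = sum(count for (a, b), count in coincidences.items() if a != b)
    # Expected disagreement under chance pairing of labels.
    d_e = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n - 1)
    return 1.0 - d_o / d_e
```

Perfect agreement yields an Alpha of 1.0; values of .70 or higher, as in our codebook rounds, are commonly treated as acceptable reliability.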

We used these gold standard data as training data to build models, and then applied the best models to unlabeled candidate messages on Facebook and Twitter. Before building models, we represented the text and added relevant language features and political characteristics for training purposes. For example:

We replaced instances of user tagging (e.g., @abc) and URLs (e.g., http://abc) with the generic tokens USERNAME and URL;

We used part-of-speech tagging, e.g., whether a tweet or Facebook post starts with a verb;

Given the characteristics of election data, we also added relevant political features, e.g., political party (Republican, Democrat, Third Party), to help model training.
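The normalization step above can be sketched in a few lines of Python. This is our own illustration (the function name and regular expressions are assumptions, and the project's actual preprocessing code may differ); part-of-speech tagging for the verb-first feature would additionally require a tagger such as NLTK's:

```python
import re

# Patterns for the two token types we normalize away.
USER_RE = re.compile(r"@\w+")          # user tags like @abc
URL_RE = re.compile(r"https?://\S+")   # links like http://abc

def extract_features(text, party):
    """Replace user tags and URLs with generic tokens and attach simple features."""
    normalized = USER_RE.sub("USERNAME", URL_RE.sub("URL", text))
    return {
        "text": normalized,
        "starts_with_username": text.startswith("@"),
        "party": party,  # Republican, Democrat, or Third Party
    }
```

Normalizing mentions and links this way lets the model learn from the fact that a message tags a user or shares a link, rather than from the specific handle or address.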

For model building, we used Scikit-Learn to run several experiments with the following multi-class classification algorithms: Support Vector Machine (SVM), Naïve Bayes (NB), MaxEnt/Logistic Regression, and Stochastic Gradient Descent (SGD). All classification tasks were evaluated with 10-fold cross-validation. We use a micro-averaged F1 score to measure prediction accuracy (the F1 score reaches its best value at 1 and its worst at 0). For Twitter data, the best micro-averaged F1 score is 0.72, as shown in Table 1, using an SGD classifier with Boolean features plus the starts-with-@username, verb-first, and party features. The F1 value for strategic messages reaches 0.75. For Facebook data, the best micro-averaged F1 value is 0.73, using a Linear SVC classifier with Boolean features and the party feature. The F1 value for call-to-action reaches 0.80. By comparison, the majority-class baseline is 37.6% (1,559/4,147) for Twitter data and 40.1% (999/2,493) for Facebook. It should be noted that the F1 score for ceremonial messages is low. The reason is that there are far fewer of these messages, and they often express a wider range of features, making them harder to classify.
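To make the setup concrete, here is a toy Scikit-Learn pipeline in the same spirit: an SGD classifier over Boolean features (which we take to mean binary word presence/absence), evaluated with 10-fold cross-validation and micro-averaged F1. The training texts below are invented for illustration only; they are not our gold standard data, and the real models use additional features:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny invented stand-in for labeled campaign messages, with the
# USERNAME/URL normalization already applied.
advocacy = ["Proud to share our jobs plan URL",
            "Join me as we fight for working families"] * 10
attack = ["USERNAME voted against funding our schools",
          "My opponent has failed this state"] * 10
texts = advocacy + attack
labels = ["advocacy"] * len(advocacy) + ["attack"] * len(attack)

model = make_pipeline(
    CountVectorizer(binary=True),   # Boolean features: word presence/absence
    SGDClassifier(random_state=0),  # linear model trained with SGD
)
# 10-fold cross-validation, scored with micro-averaged F1.
scores = cross_val_score(model, texts, labels, cv=10, scoring="f1_micro")
print(round(scores.mean(), 2))
```

Because the toy texts repeat, the score here is trivially high; on real campaign messages, scores like the 0.72 and 0.73 reported above are what this evaluation produces.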

Table 1: Machine prediction performance for main categories on Twitter and Facebook

For Strategic Message type prediction, we trained the classifiers with the training data labeled as Strategic Message: 1,559 tweets and 860 Facebook posts. Each message is classified as either Advocacy or Attack. As shown in Table 2, the micro-averaged F1 scores for Twitter and Facebook data are 0.80 and 0.84. By comparison, the majority-class baseline is 69.4% (1,082/1,559) for Twitter data and 62.8% (540/860) for Facebook. Similarly, our Strategic Message focus classifiers were trained on the messages labeled as Strategic Message. Each message can be Image, Issue, or Endorsement. As shown in Table 3, the micro-averaged F1 scores for the Strategic Message focus category are 0.77 on both Twitter and Facebook. By comparison, the majority-class baseline is 48.2% (751/1,559) for Twitter data and 50.6% (435/860) for Facebook.
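This two-stage setup can be sketched as a simple cascade: the main-category model runs first, and the Advocacy/Attack and Image/Issue/Endorsement models run only on messages predicted as Strategic Message. The function and label names below are our own illustration, and any classifier with a Scikit-Learn-style `predict` method would fit:

```python
def classify_message(message, main_model, type_model, focus_model):
    """Cascade: predict the main category, then refine Strategic Messages."""
    result = {"category": main_model.predict([message])[0]}
    if result["category"] == "strategic":
        result["type"] = type_model.predict([message])[0]    # advocacy / attack
        result["focus"] = focus_model.predict([message])[0]  # image / issue / endorsement
    return result
```

Training the sub-classifiers only on Strategic Messages, as described above, means each stage solves a narrower problem than one flat classifier over all category combinations would.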

All the micro-averaged F1 scores reported above are well above the baseline scores, which suggests that the models have been trained to predict the categories of candidate-produced messages well.

We are still testing the reliability of the current best models on presidential campaign data. When applying the best models reported above to 2,989 human-corrected presidential tweets and 2,638 Facebook posts, we found that the models generally still worked well, as shown in Table 4. However, the F1 score for the Conversational category dropped 20%. We suspect there are differences between gubernatorial and presidential campaign data in this category, and we are currently investigating the possible reasons.

We also ran experiments using only presidential data as training data to test model performance. For Facebook, we found that the model performs well at predicting strategic messages (F1 = 0.77) and call-to-action (F1 = 0.86), as shown in Table 5.

Our next step is to run more experiments to improve the models, e.g., experimenting with binary classification and adding opinion and sentiment classification. We are now applying the best models to predict categories for messages generated by candidates in the 2016 presidential campaign. On our Illuminating 2016 website, reporters and the public can see presidential campaign message types in real time. We are also pulling public commentary on the election from social media and categorizing it now. You will see public commentary analysis on our website in August.

Acknowledgements:

Thanks to Sikana Tanupambrungsun and Yatish Hegde at the School of Information Studies at Syracuse University for data collection and model training.


Illuminating 2016 is supported by the Tow Center for Digital Journalism at Columbia University and the Center for Computational and Data Sciences at Syracuse University's School of Information Studies.