Text Analytics

This is a blog post to accompany a lecture I gave to the Altervilles Masters Programme in September 2014. In my talk I laid out the following argument: urban policy actors are engaging in practices of hashtag politics. That is, they are coining memorable and uniquely named policies, and these names give such policies a life and ultimately a death. I argue that social media is aiding this process but also playing a role in the policies’ demise. As public policy academics I think we are well placed to explain these transitions, but we must continue to adapt our repertoire of analytical tools to include new forms of data.

My interest in this comes from ten years of encountering policy ideas – something I first saw when exploring Birmingham’s renaissance, where those I interviewed talked without hesitation of how Birmingham had, during the 1990s, returned to its former glory. I argued at the time that such a discourse had to be continually reiterated to maintain its efficacy. In the early 2000s one such mechanism was a policy idea called “Flourishing Neighbourhoods” – here I encountered tens of different ways of expressing the idea, with associations made with everything from civic pride to feeling safe, involvement and sustainable communities. Such ideas often have vague beginnings and few can remember where they came from. Many will claim it was their idea – name times and places where it was first adopted, stolen, adapted, co-opted, corrupted. Few, however, can remember when it was forgotten. Such is the way with policy ideas – policy actors prefer to discuss the latest and loudest.

In my first project as a post-doc I visited Rotterdam and Copenhagen. There too policy actors framed their work in terms of policy ideas, such as city citizenship and integrated Copenhagen. I am sure if I returned there some 7 years on, many of the practices would be familiar but the branded policy ideas different. Every project I have been involved in has followed similar patterns – Transforming Birmingham, Total Place, the Big Society. They operate at different scales and degrees of scope but they share common properties – they are branded ideas and visions with instrumental intent and container-like elasticity.

Around the time of studying the Big Society – 2010 or thereabouts – something else was happening. Watch the Tour de France and you will see it on the roadside – compare audiences of the Pope in 2005 and 2013 and you will see something different. Every month websites will inform you that 1.19 billion people are active on Facebook, 1.3 billion on YouTube and Google+, 232 million on Twitter, with growing numbers on the likes of Instagram, Pinterest and Tumblr. As smartphone ownership grows so do these numbers. Particular social networks will come and go, but mobile internet is here to stay.

Not every country is the same – in the UK Twitter is extremely popular in policy circles; in France, for now, less so, but it is growing. Facebook is extremely strong in France, and DJs and footballers have followerships of over 15 million. French President François Hollande has almost 700k followers and has tweeted over 4,000 times – a similar number to David Cameron, although his office has 2.8 million. It is where news is announced – and it is where people react. Some call this the biggest focus group in the world, with people retweeting messages and replying to authors or mentioning users.

Cities too are now on Twitter: although accounts mainly remain in the communications department, cities are working out how to communicate messages and interact with residents. Deciding who has control of the account involves a steep learning curve, just as airlines, train companies and other brands have found when trying to promote themselves and defend their reputations. Government departments launch hashtag campaigns – like the Home Office in the UK promoting the police commissioner elections with #MyPCC and photos to make it seem personal. As the NYPD found, these campaigns can easily backfire, with users hijacking the #MyNYPD hashtag to reveal photos of police brutality. Similarly the highly effective #BringBackOurGirls, spearheaded by the likes of Michelle Obama, clearly put pressure on governments to respond to the plight of kidnapped Nigerian schoolgirls – and yet opened up those who got involved to harsh scrutiny from the Twitterati.

So thinking about the big society back in 2010 led me down the route of asking two questions: What does all this mean for policy practice, and what does this mean for policy research?

Starting with practice, it leads to questions like how is Twitter changing how policy is made, or named, or implemented; how is it helping or hindering those who promote policy? When you delve into the literature it is somewhat fragmented. The biggest chunk is unsurprisingly focused on the profit motive – how can brands launch successful campaigns, how much does dialogue matter, how much does sentiment matter, can we predict sales or market performance or measure return on investment, and understand the needs of our customers? These questions have a policy relevance because businesses are the early adopters of Twitter – they have realised that two-way communication is advantageous yet difficult, but possible. They have realised that automated sentiment meters are unreliable, that there are many kinds of Twitter user, that the mass media continue to exercise a hold on whether brands sink or swim. Furthermore they have sophisticated metrics that show which tweets get more shares and our motivations for sharing them. But the literature is not all about profits and bottom lines. Much is made of the democratic potential of Twitter – the ability to mobilise and organise mass movements for or against a cause. Here authors are interested in contagion, the virality of a tweet, and how these can redress democratic inequality. These literatures give many examples of countries where revolutions were fuelled by Twitter, where governments tried to switch it off or mount astroturf movements. It is changing so many ways in which professions operate – journalists sourcing stories, broadcasters creating viewertariats, emergency planners maintaining lines of communication, intelligence agencies listening in, health officials dissuading smoking, charities promoting donations.
Political actors are also discussed in this literature – cultivating their micro-celebrity, self-promoting, interacting, hoping to improve election outcomes, while others use Twitter data to predict the outcome, or exercise a Trial by Twitter, pointing out scandals and malpractice.

But Twitter has implications beyond just the practice of psephologists: it has wider implications for policy research. It raises questions like – what are these data – what is the role they play in the rise and demise of contemporary policy ideas – the modern day Flourishing Neighbourhoods, Total Place or indeed the Big Society (an idea we do not mention any more)? What can we do beyond our existing methods of counting newspaper articles and such? One of the main implications is a clamour for bigger and bigger datasets. This is comfort-data. Researchers have a classic fear of missing out – so they try to collect everything, try to capture the full transaction of searchable talk. All of this brings with it issues of ethics and challenges of data management. Some choose to subscribe to companies that index social data, others essentially record it from Twitter through the APIs, and increasing numbers are buying packages of historic tweets from resellers.

Beyond the challenges there is some obvious and exciting potential – alternatives to costly case studies, alternatives to interview and document analysis, an opportunity to get to where debate is taking place, and a chance to introduce innovative methods and automated qualitative processes.

In the talk I set out a range of techniques that might be used, ranging from the most hands-off to the most hands-on. At the most macro level, it is possible to count frequencies with relative ease – to track the fluctuation of mentions of a given policy and draw conclusions as to its vitality. We can drill down and explore the influencers, including a new generation of Twitter celebrities and mega-users who have tweeted over 120,000 times. We can apply algorithms to divide messages or parse parts of messages into positive and negative sentiment without the need for human input. If you are dissatisfied or worried about the reliability of such automation then you can capture the tweets and sift through them manually. If desired you can work in small teams coding and classifying items in order to isolate just those tweets expressing an opinion on the policy idea in question and then explore how the debate is changing shape over time. Perhaps the most hands-on approach is to take samples of tweets and feed them back into Twitter itself, asking users to rank the tweets in order of preference and then subjecting these rankings to factor analysis. This Q-methodology reveals shared viewpoints that structure the data and in turn can inform further research – be it qualitative or quantitative by design.
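The most hands-off of these techniques, counting mentions over time, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `daily_mentions` and the sample tweets are invented here, and each tweet is assumed to be a (timestamp, text) pair however it was collected.

```python
from collections import Counter
from datetime import datetime


def daily_mentions(tweets, term):
    """Count, per calendar day, the tweets whose text mentions `term`.

    `tweets` is an iterable of (datetime, text) pairs; matching is a
    simple case-insensitive substring test.
    """
    needle = term.lower()
    counts = Counter()
    for created_at, text in tweets:
        if needle in text.lower():
            counts[created_at.date()] += 1
    # Return the days in chronological order.
    return dict(sorted(counts.items()))


# Invented sample data, for illustration only.
sample = [
    (datetime(2012, 11, 14, 9, 0), "Will you vote tomorrow? #MyPCC"),
    (datetime(2012, 11, 14, 18, 30), "No idea who my candidates are #mypcc"),
    (datetime(2012, 11, 15, 8, 5), "Polling station empty this morning"),
    (datetime(2012, 11, 15, 12, 0), "Voted. #MyPCC"),
]

mentions = daily_mentions(sample, "#MyPCC")
for day, count in mentions.items():
    print(day, count)
```

Plotting these daily counts over weeks or months is one simple way to look at the rise and fall of a policy idea, although a substring match will miss misspellings and variant names.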

What I conclude is that Twitter data offers a wealth of research possibilities. As user numbers continue to grow and the ways Twitter is used continue to evolve, policy research must evolve too in order to keep up. It might be early days for some countries, but all signs suggest more and more policy actors across the world are taking to Twitter to launch and foster policy ideas. I am confident we are now well placed to be able to explain what is happening using a variety of hands-on and automated tools. Returning to where we began – although social media has a relatively recent history, the technique of policy making using memorable branded policy ideas has a much longer past; what is different are the tools that make and break them.

This talk draws on my most recent book: Interpreting Hashtag Politics: Policy Ideas in an Era of Social Media, published by Palgrave and available in a range of formats on Amazon.

Next week I am off to Grenoble to present a new paper at Session 3, New Ideational Turns, as part of panel 84, New directions in the study of public policy, convened by Peter John, Hellmut Wollmann and Daniel A. Mazmanian, at the 1st International Conference on Public Policy, Grenoble, France, June 26-28. Friday 28 June, 8.30-10.30, Sciences Library Auditorium.

This paper argues that the discussion of public policy online is offering new and exciting opportunities for public policy research exploring the role of policy ideas. Although considerable work focuses on political ideas at the macro or mid-range, specific policy ideas and initiatives are overlooked, thought to be “too narrow to be interesting” (Berman, 2009, p. 21). This paper argues that the prolific use of social media among policy communities means it is now possible to systematically study the micro-dynamics of how policy ideas are coined and fostered. Policy ideas are purposive, branded initiatives that are launched with gusto; flourish for around a thousand days; and then disappear with little trace as attention shifts to the latest and loudest. At best, media reports will document that Birmingham’s Flourishing Neighbourhoods initiative has been “scrapped”, Labour’s Total Place programme has been “torn up”, or the Coalition’s big society policy is “dead”. Save for a return to the policy termination literatures of the late 1980s, our impotence in conceptualising such death-notices reveals how little effort has been invested in understanding and theorising the lifecycle of policy ideas. In response, this paper conceptualises policy ideas, their life, death and succession. The paper draws on a case of the recent Police and Crime Commissioner elections held across England and Wales in November 2012, and the attempts of the Home Office to coin and foster the hashtag #MyPCC.

You hear rumblings that we miss up to 99pc of the conversation using the Twitter search API rather than PowerTrack but we also hear it depends on the velocity of the topic.

In the post below I conclude that the proportion of the conversation you miss out on using the API rather than PowerTrack depends on the velocity of the topic, and based on the example of Mrs Thatcher’s funeral the answer is either a tenth or two-thirds.

At 8:15 on 17th April 2013 we set up a GNIP fetch and an API fetch for the following terms:
Thatcher funeral
Thatcher’s
funeral cost
#thatcherfuneral

How did they compare?
By 11:28 GNIP was at 65,869 and the API at 24,354 – around 37pc of the “full set”.
By 12:25 it was 96,636 for GNIP and 31,345 for the API – around 32.4pc.
By 12:49 it was 105,022 for GNIP and 35,316 for the API – 33.6pc.
By 13:24 it was 113,519 for GNIP and 41,842 for the API – 36.9pc.

By 8:15 the following morning (18th April 2013) it was 190,240 for GNIP and 109,212 for the API – 42.6pc missing.

(Note – the figures quoted for the API include 10,168 historical tweets captured between 21:20 on 16th April 2013 and 8:14 on 17th April 2013.)

It shows that the API is not suitable for major issues or event tweeting, where the rate is at or beyond 10,000 tweets per hour.

However, where the rates are closer to 1,000 per hour, the API stands up fine. For example, during the same period we pulled in feeds for the words ‘state & funeral’. At 12:56 there were 3,914 for GNIP and 3,498 for the API (minus 835 historic tweets) – 89%.

So it depends on the velocity. Some will add that you get a better quality of metadata with the PowerTrack – but if you are researching a low velocity topic over an extended period of time – you might just be fine with the API.
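The percentages above can be recomputed directly from the counts. The sketch below is only illustrative: the function name `api_share` is made up here, and it uses the counts exactly as quoted, without adjusting for the historical tweets mentioned in the note. Small differences in the last digit from any quoted figure would come down to rounding.

```python
def api_share(api_count, full_count):
    """Return the API count as a percentage of the PowerTrack ("full set") count."""
    if full_count <= 0:
        raise ValueError("full_count must be positive")
    return 100.0 * api_count / full_count


# (label, GNIP count, API count), copied from the figures quoted in the post.
snapshots = [
    ("Thatcher, 17 Apr 11:28", 65_869, 24_354),
    ("Thatcher, 17 Apr 12:25", 96_636, 31_345),
    ("Thatcher, 17 Apr 12:49", 105_022, 35_316),
    ("Thatcher, 17 Apr 13:24", 113_519, 41_842),
    ("Thatcher, 18 Apr 08:15", 190_240, 109_212),
    ("state & funeral, 17 Apr 12:56", 3_914, 3_498),
]

for label, gnip, api in snapshots:
    share = api_share(api, gnip)
    print(f"{label}: API has {share:.1f}% of GNIP, {100 - share:.1f}% missing")
```

Running the same calculation at intervals during a collection gives a rough sense of when a topic has become too fast for the API to keep up with.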

For the last few months we have been collecting discussion of public policy on social media using DiscoverText. We are trying to understand how public policy is discussed online. To date we have collected just under half a million Facebook posts and YouTube comments on 26 different policies and issues.

The work to understand the shape of the debate starts by de-duplicating exact and near duplicates, then we check the tweets are on-topic, and not just opportunist hashtag spam. We then identify those that express opinion about the topic and divide them up by theme. We draw on a dispersed team of real life human coders who code portions of the datasets. We check for inter-coder agreement and validity. We use the human coding to train custom machine classifiers to classify large portions of the datasets, reducing the need for human coding. One further way of getting a sense of the emerging shape of the discussion is to ask a group of people to Q sort a diverse sample of items using crowdsortq.com. The analysis identifies shared viewpoints and informs further rounds of coding.
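As an illustration of the inter-coder agreement check, here is a minimal sketch of Cohen's kappa for two coders. Kappa is one common agreement statistic, not necessarily the one we report; the function name, labels and sample codes below are invented for the example.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labelled the same items in the same order."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same, non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: based on how often each coder used each label.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented codes for six items, for illustration only.
coder_a = ["opinion", "opinion", "spam", "off-topic", "opinion", "spam"]
coder_b = ["opinion", "spam", "spam", "off-topic", "opinion", "opinion"]

kappa = cohens_kappa(coder_a, coder_b)
print(f"kappa = {kappa:.2f}")
```

A value near 1 suggests the coders are applying the scheme consistently; a low value suggests the code definitions need another look before the human coding is used to train a classifier.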

“We’re pulling in every tweet, every post, every outward link, every update… It is a huge challenge – the platforms make it difficult for us, they keep changing how we can get hold of it… The APIs don’t give us enough – especially when things trend – but there are ways around it. We find a way around it… We’ve got big data”.

There is a new kind of social scientist: the big data social scientist. Top of their Desert Island Discs choices might be Queen’s “I Want It All” (you know the one – “I want it all, I want it all, I want it all, and I want it now!”)

There is something in the “I want it all” aspiration that reminds me of when I travel on the train, and see the man at the end of the platform, notebook in hand and SLR camera around his neck, sandwiches and flask of coffee packed carefully in his knapsack. He has in his hand a book with the numbers of every item of rolling stock currently in service. He has photographed and recorded 50% of them, and he knows he has another 50% to go. He also wants it all. So that’s him, the new generation of big data social scientists, and Freddie Mercury: they all want it all.

When I recently started to capture (or harvest, or some say ‘scrape’) tweets about the police and crime commissioner elections, I found myself with a spreadsheet of 100,000 rows and ten fields of metadata – that’s 1m datapoints. For a first timer to this world, I had myself big data. It was exciting. I had it all.

Then I started to learn more about the mechanism I was using to pull in these tweets. Blogs and websites were warning me that using the Twitter API to do this sometimes gives you as little as 1% of the actual tweets. The limit of 1,500 an hour means that you can’t get everything. For people who like to collect tweets about Obama or Occupy, there are times of the day when you could easily end up with just a tiny sample of a huge volume of tweets. But there are solutions, these people tell you – pay us a few hundred dollars and we will get it all for you. Yes, all of the tweets. No restrictions. You can have it all.

Meanwhile, imagine the big data social scientist, iPod strapped to his arm, out for a 5K run to relieve some pressure, singing along, mulling the proposition over…

“Not a man for compromise and where’s and why’s and living lies
So I’m living it all, yes I’m living it all,
And I’m giving it all, and I’m giving it all,
It ain’t much I’m asking, if you want the truth,
Here’s to the future, hear the cry of youth,
I want it all, I want it all, I want it all, and I want it now”,

And you can have it now my friend. Just enter your card details and you can have it all.

Looking back at my spreadsheet of big data, it doesn’t seem as big any more. I’ve just got an unknown sample of tweets. And to add to that I don’t have their Facebook activity, or their LinkedIn or what they wrote on the Guardian article or BBC news site, or their blog post. I really don’t have very much. I really only have a little bit.

The fantasy of having “it all” seems like a possibility because we have the technology, or at least we have come close to it. The new generation of big data social scientists will tell you it was easier a couple of years ago – the platforms were less protective, whereas now they are becoming risk averse or enlightened to how they can monetize and exploit their big data. But they battle on. “If you don’t know how to hack, code or have the means to pay”, they will tell you, “then you need to think carefully before getting involved with the world of big data. You might be better suited to just regular ‘data’”.

But hang on. Let’s look at that spreadsheet again – the one with the 1m data points. There’s quite a bit in there. We should stop ourselves from judging our data by what we don’t have and instead think what can we learn from what we’ve got. It is a simple point, but the quality of your data depends on the questions you are asking and the claims you want to make. There are as many unanswered questions in this spreadsheet as there are tweets. The key is not to try and answer them all, nor is it to be led completely by the availability of data, but rather we need to be creative with our questions and to exploit what we have.

It’s time to be happy with our lot – time to change the playlist – what’s that tune by Bobby McFerrin?

I remember a bit of a row on the Q method.org listserv a while back. There was a discussion about Q Assessor as a tool that allowed both the initiated and the unwashed to do Q studies faster. It clearly riled some members, who argued that Q was not something to be rushed. And in part I agree. But in the last year I have been starting to look at Twitter data as a source of statements for my concourse, and it revealed to me reasons why we need to do Q faster. Let me explain.

Although I have used Q in a number of ways in the past, my main reason for using Q is to understand the subjectivity that surrounds policy ideas. Anybody remember that slogan ‘war on terror’, or the reframing of global warming and associated concerns as ‘climate change’? In the UK an example would be something like the "big society", which seems to have had a three-year life expectancy despite being trailed as the Prime Minister’s ‘big idea for politics’. The current media consensus is that the idea is now dead and defunct. My hunch about all of this is that the formative stages of a policy idea’s life in the spotlight matter. Usually once the launch is over, the report published and the press release issued, the policy communities take to Twitter to express their views. They often do so with humour and irony, and the popular tweets are cascaded through networks of followerships, propelling the message in what some call going viral. Policy ideas live and die by the web.

Q methodologists go to great lengths to draw on multiple sources of concourse: newspaper archives, documents, observations, interviews and literature reviews. They bring them together, sample them down and then administer the Q sorts. Fine, and long may this continue. But whilst I was collecting Twitter data surrounding a recent policy idea, to have elected police commissioners in England and Wales, I noticed something interesting. If you take the tweets running up to the election as a whole, 50,000 over a couple of weeks, you can see which words and terms occur most commonly. They give you a sense of the common descriptors for how the policy was viewed. Focus on the data a day at a time and we find ourselves, as Steve Brown himself would say, turning up the microscope. The language varies day to day; certain new phrases come in and stick.

Let me give you a few examples – "spoil" emerged as a frequently mentioned term as the campaign mobilised to encourage people to "spoil" their ballots. "Shambles" came in to describe how the election was being administered. #MyPCC, the hashtag of the Home Office campaign, almost disappeared. What I am trying to say is a simple point – that concourse, the volume of debate surrounding a topic, is not static. Perhaps we could grandly call these daily concourses, or micro-concourses, I don’t mind, but you get my point.
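The day-at-a-time view described above could be sketched along these lines. It is a rough illustration, not the tooling I used: the function name `new_terms_by_day`, the tiny stopword list and the sample tweets are all invented for the example, and each tweet is assumed to be a (day, text) pair with the day as an ISO date string.

```python
import re
from collections import Counter, defaultdict

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "to", "of", "and", "in", "is", "for", "on",
             "your", "this", "it", "i", "my"}


def new_terms_by_day(tweets, min_count=2):
    """For each day, list terms used at least `min_count` times that day
    that did not appear on any earlier day."""
    per_day = defaultdict(Counter)
    for day, text in tweets:
        # Keep hashtags whole; everything else is split into word tokens.
        tokens = re.findall(r"#?\w+", text.lower())
        per_day[day].update(t for t in tokens if t not in STOPWORDS)

    seen = set()
    result = {}
    for day in sorted(per_day):  # ISO date strings sort chronologically
        counts = per_day[day]
        result[day] = sorted(
            term for term, count in counts.items()
            if count >= min_count and term not in seen
        )
        seen.update(counts)
    return result


# Invented sample tweets, for illustration only.
sample = [
    ("2012-11-13", "Find out about your candidates #MyPCC"),
    ("2012-11-13", "Who is standing in my area? #MyPCC"),
    ("2012-11-14", "I will spoil my ballot tomorrow"),
    ("2012-11-14", "Thinking I might spoil my paper"),
    ("2012-11-15", "What a shambles this election is"),
    ("2012-11-15", "Total shambles at the polling station"),
]

emerging = new_terms_by_day(sample)
for day, terms in emerging.items():
    print(day, terms)
```

On a real dataset the threshold would need to be far higher than two, and the emerging terms would only be candidates to read around when drawing up statements.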

If we are to understand the formation of concourse around emergent ideas, policies, whatever, then we need the capability of capturing voluminous discussion. We need tools that can take datasets of 20, 30 or 100 thousand tweets or Facebook posts and pull out potential statements. So maybe we do need to do Q faster. I’ll think on.