Qualify your audience on Twitter (part 1/2)

Have you ever tried to qualify your audience on Twitter? Several tools already exist, but I was curious to do it with the ones I already have:

R to extract data, and for social network analysis,

and Gephi to visualise my followers’ communities.

In this article, I propose to explore your Twitter audience, from data extraction with R to identifying your followers’ centers of interest. I’ll use two exploratory methods: social network analysis (SNA) and natural language processing (NLP). This first post is dedicated to data extraction from Twitter and to community detection among your followers.

Authentication Process on Twitter

This short paragraph is for Twitter API newbies. You can already find plenty of resources about Twitter data extraction on the Internet, so this step will be short.

What do we extract?

I propose to identify the communities among your followers. Who are they? And how do they interact? We first have to define the data we want to extract: our followers, plus the follow and friendship links that exist between them, that is, only the links whose two ends are both in our own follower community.
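To make the "links among our own followers" idea concrete, here is a minimal sketch in base R. The account names and the `follows` list are made up for illustration, and `build_edgelist` is a hypothetical helper, not part of any package; in practice the lists would come from the Twitter API.

```r
# Hypothetical input: for each of my followers, the accounts they follow.
followers <- c("alice", "bob", "carol", "dave")
follows <- list(
  alice = c("bob", "carol", "someone_else"),
  bob   = c("alice"),
  carol = c("dave", "another_account"),
  dave  = character(0)
)

# Keep only the links whose target is also one of my followers.
build_edgelist <- function(followers, follows) {
  edges <- lapply(followers, function(src) {
    targets <- intersect(follows[[src]], followers)
    data.frame(source = rep(src, length(targets)),
               target = targets,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, edges)
}

edgelist <- build_edgelist(followers, follows)
print(edgelist)
```

Accounts outside my follower list (`someone_else`, `another_account`) are dropped, so the resulting graph only describes relationships inside my audience.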

Detect communities with igraph (R)

To detect communities among my followers, I used the spinglass clustering from the R igraph package (Reichardt and Bornholdt, 2008; Traag and Bruggeman, 2009). The originality of this algorithm is that it treats the network as an energetic system, taking into account not only the links between nodes but also the missing links between them, according to an attraction/repulsion principle. The communities are built according to the cut that minimizes the energy of the system.
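As an illustration, here is a minimal sketch of a spinglass run with igraph. It is not my follower network: I use Zachary's karate club graph, which ships with igraph, as a stand-in, and the seed value is arbitrary.

```r
library(igraph)

# Stand-in network: Zachary's karate club (34 nodes), bundled with igraph.
g <- make_graph("Zachary")

# Spinglass requires a connected graph, so check before clustering.
stopifnot(is_connected(g))

set.seed(42)                     # the algorithm is stochastic
cluster <- cluster_spinglass(g)  # default number of spins

print(sizes(cluster))                       # nodes per community
print(modularity(g, membership(cluster)))   # quality of the partition
```

On a real follower graph you would first keep the largest connected component (for example with `components()` and `induced_subgraph()`), since isolated followers would otherwise make the call fail.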

Optimize the clustering

Like a classical clustering algorithm (k-means, for example), the spinglass algorithm needs the number of groups (called spins) to be defined a priori. The default number of spins in igraph is 25. This initial number of spins is only an upper limit: it does not fix the final number of communities.

As a first step, it’s this number of spins that I’ll try to optimise, looking for the best possible clustering between 10 and 50 spins. Out of these 40 or so clusterings, I’ll choose the best one, that is to say, the one that maximizes my network’s modularity.

The result: for my network of 215 followers (and 2074 links), 11 initial spins works best. (Note that the difference between the clusterings is not really huge here, but on another network it could matter much more.)
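The search over the number of spins could be sketched as follows. Again, this runs on the karate club graph as a stand-in for my follower network, and variable names such as `spins_grid` and `best_spins` are my own choices for the example.

```r
library(igraph)

g <- make_graph("Zachary")  # stand-in for the follower network

spins_grid <- 10:50         # candidate numbers of spins
set.seed(42)

# One spinglass run per candidate number of spins.
clusterings <- lapply(spins_grid, function(s) cluster_spinglass(g, spins = s))

# Modularity of each resulting partition.
modularities <- sapply(clusterings, function(cl) modularity(g, membership(cl)))

# Keep the clustering with the highest modularity.
best <- which.max(modularities)
best_spins <- spins_grid[best]
cluster <- clusterings[[best]]

cat("best spins:", best_spins, "\n")
cat("modularity:", round(modularities[best], 3), "\n")
```

Because spinglass is stochastic, two runs with the same number of spins can give slightly different modularities, so the "best" value is only best for a given seed.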

Verify that your communities are balanced:

    library("pander")
    pandoc.table(table(cluster$membership))

    ------------------------
     1    2    3    4    5
    ---- ---- ---- ---- ----
     41   40   33   80   22
    ------------------------

Export your data for a Gephi visualisation

This is the last part of our code: it exports our data so that we can visualize it in Gephi. We have to prepare 2 files:

an edge list that describes the follow and friendship links on Twitter,

a node file describing each of my followers with: an Id, a Label (screenName), a Weight (the number of followers they have), and the community index we computed with the spinglass clustering.
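Here is a minimal sketch of how these two files could be written from base R. All the data below is made up, the `community` vector stands in for the spinglass membership (in practice `membership(cluster)`), and the file names are my own choice. The column names `Source`/`Target` and `Id`/`Label` are the ones Gephi's spreadsheet import recognises.

```r
# Hypothetical data; real values would come from Twitter and the clustering.
edgelist <- data.frame(
  Source = c("alice", "alice", "bob", "carol", "dave"),
  Target = c("bob", "carol", "carol", "dave", "alice"),
  stringsAsFactors = FALSE
)
screen_names    <- c("alice", "bob", "carol", "dave")
followers_count <- c(alice = 120, bob = 45, carol = 300, dave = 12)
community       <- c(alice = 1, bob = 1, carol = 2, dave = 2)

# Node file: one row per follower.
nodes <- data.frame(
  Id        = screen_names,
  Label     = screen_names,
  Weight    = unname(followers_count[screen_names]),
  Community = unname(community[screen_names]),
  stringsAsFactors = FALSE
)

# Write both files to a temporary directory for the example.
out_dir    <- tempdir()
edges_path <- file.path(out_dir, "edges.csv")
nodes_path <- file.path(out_dir, "nodes.csv")
write.csv(edgelist, edges_path, row.names = FALSE)
write.csv(nodes, nodes_path, row.names = FALSE)
```

In Gephi, the Community column can then be used to colour the nodes and the Weight column to size them.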