Online Social Networks (OSNs) are offering an experience that goes beyond communication,
news or entertainment. With a total user base that reaches one third of the world's
population and an average daily engagement of three hours, OSNs have become a major
phenomenon that affects our society in a variety of ways. OSNs also have a history of
almost 30 years of constant growth, creating a sizable market that attracts considerable
funding and innovation. In line with this growth, there is increasing interest from
the scientific community, which attempts to study OSNs from various perspectives. Without
being exhaustive, these perspectives can be delineated according to the way the community
treats an OSN as a research object.
First of all, an OSN can be perceived as a complex system represented by a social graph that
is continuously changing. A second perspective is as a social phenomenon that hides many
dangers about which the public should be informed and from which it should be protected. A final view of OSNs is as a
tool, through which we can focus on some interesting trends and tendencies inherent in the
public sphere. This dissertation presents some fundamental contributions in these areas and
uses Twitter as a testbed for experimentation and validation. Initially, we present an effort to
model the temporal evolution of the growth of the social graph. Towards this goal, we collect
two datasets containing daily snapshots of the social graph, one for the early and another for
the later period of Twitter. By fitting these datasets to a well-known but previously untested
model, we are able to chart the evolution of Twitter over a period of 8 years. Additionally, we
annotate the observed fluctuations of this growth with real events and demonstrate how
efficient spam control and service robustness can affect the growth of an OSN.
We proceed to study one of the most common strategies for spam propagation in OSNs: the
deliberate mixing of popular topics with spam content. By using Machine Learning
methods, we show that the use of trending topics has the highest discriminatory power
between spam and legitimate content. We also uncover a spam masquerading technique and
show how spam can be mitigated with simple graph analysis and computationally modest
machine learning models. Finally, we delve into content analysis.
Specifically, we apply a combination of Natural Language Processing techniques to infer how
users express themselves during a real and turbulent electoral event. To this end, we apply
Named Entity Recognition, Volume analysis, Sarcasm detection, Sentiment analysis and Topic
analysis in order to extract, among other things, the semantic proximities of different political parties
and the temporal sentiment variation of different groups of voters.