Who Will Follow You Back? Reciprocal Relationship Prediction

John Hopcroft, Department of Computer Science, Cornell University, Ithaca, NY 14853
Tiancheng Lou, Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
Jie Tang, Department of Computer Science, Tsinghua University, Beijing 100084, China

Authors are in alphabetic order. The work was done when the last two authors were visiting Cornell University.

ABSTRACT

We study the extent to which the formation of a two-way relationship can be predicted in a dynamic social network. A two-way (called reciprocal) relationship, usually developed from a one-way (parasocial) relationship, represents a more trustful relationship between people. Understanding the formation of two-way relationships can provide us insights into the micro-level dynamics of the social network, such as what the underlying community structure is and how users influence each other. Employing Twitter as a source for our experimental data, we propose a learning framework to formulate the problem of reciprocal relationship prediction into a graphical model. The framework incorporates social theories into a machine learning model. We demonstrate that it is possible to accurately infer 90% of reciprocal relationships in a dynamic network. Our study provides strong evidence of the existence of structural balance among reciprocal relationships. In addition, we have some interesting findings, e.g., the likelihood of two "elite" users creating a reciprocal relationship is nearly 8 times higher than the likelihood of two ordinary users. More importantly, our findings have potential implications, such as how social structures can be inferred from individual behaviors.

Categories and Subject Descriptors: H.2.8 [Database Management]: Data Mining; J.4 [Social and Behavioral Sciences]: Miscellaneous; H.4.m [Information Systems]: Miscellaneous

General Terms: Algorithms, Experimentation

Keywords: social network, reciprocal relationship, social influence, predictive model, link prediction, Twitter
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 2X ACM X-XXXXX-XX-X/XX/XX...$5...

1. INTRODUCTION

Online social networks (e.g., Twitter, Facebook, Myspace) significantly enlarge our social circles. One can follow any "elite" users (celebrities), e.g., politicians, models, actors, and athletes, or close friends in her physical social network. An interesting question here is: when you follow a number of users, who will follow you back? A more specific question is: if you follow those celebrities (elite users) on Twitter, do you think they will follow you back? The answer is often "No", but sometimes "Yes". There are a number of top users with tens of thousands of followers who will follow everyone back. Some even use tools to follow back automatically, while others go through the following list and add their new followers manually. Awareness of how these relationships are created can benefit many applications such as friend suggestion, community detection, and word-of-mouth product promotion.

In social science, relationships between individuals are classified into two categories: one-way (called parasocial) relationships and two-way (called reciprocal) relationships [9]. The most common form of the former is the one-way relationship between celebrities and their audience or fans, while the most common form of the latter is the two-way relationship between close friends. Twitter and Facebook are respectively typical examples of the two types of social relationships. Social relationships form the basis of the social structure. Indeed, social relationships are always the basic object of analysis for social scientists, for instance, in Max Weber's theory of social action [29].
Understanding the formation of social relationships can give us insights into the micro-level dynamics of the social network, such as how an individual user influences her/his friends through different types of social relationships [26], and how the underlying social structure changes with the dynamics of relationship formation [23].

Employing Twitter as the basis of our analysis, we study how a two-way (reciprocal) relationship develops from a one-way (parasocial) relationship. Specifically, we try to answer: when you follow a particular user (either an elite user or an ordinary user), how likely is she/he to follow you back? This problem also implicitly exists in other social networks such as Facebook and LinkedIn: when you send a friend request to somebody, how likely is she/he to confirm your request?

Previous research on social relationships can be classified into three categories: link prediction [2, 6, 7, 23], relationship type inferring [4, 5, 27], and social behavior prediction [, 25, 33]. Backstrom and Leskovec [2] proposed an approach called supervised random walk to predict and recommend links in social networks. Crandall et al. [4] investigated the problem of inferring social ties between people from co-occurrence in time and space.

Wang et al. [27] proposed an unsupervised algorithm to infer advisor-advisee relationships from a publication network. However, little research systematically studies how two-way relationships can be developed from one-way relationships. More fundamentally, what are the underlying factors that essentially influence the formation of two-way relationships, and how can existing social theories (e.g., structural balance theory and homophily) be connected to the formation process?

In this paper, we conduct a systematic investigation of the problem of two-way (reciprocal) relationship prediction. We precisely define the problem and propose a Triad Factor Graph (TriFG) model. The TriFG model incorporates social theories into a semi-supervised learning model, where we have some labeled training data (two-way relationships) but with low reciprocity [3]. Given a historic log of users' following actions from time 1 to $t$, we try to learn a predictive model to infer whether user A will add a follow-back link to user B at time $(t+1)$ if user B creates a new follow link to user A at time $t$. We evaluate the proposed model on a Twitter data set consisting of 3,442,659 users and their profiles, tweets, and following behaviors (new following or follow-back links) for nearly two months.

Results. We show that incorporating social theories into the proposed factor graph model can significantly improve the performance (+22% to +27% by F1-Measure) of two-way (reciprocal) relationship prediction compared with several alternative methods. Our study also reveals several interesting phenomena:

1. "Elite" users tend to follow each other. The likelihood of an elite user following back another elite user is nearly 8 times higher than that of two ordinary users and 3 times that of an elite user and an ordinary user.
2. Two-way relationships on Twitter are balanced, but one-way relationships are not. More than 88% of social triads (groups of three people) with two-way relationships satisfy the social balance theory, while one-way relationships are unbalanced (merely 25% of them satisfy the balance theory).
3. Social networks are going global, but also stay local. No matter how far a user is from you, the likelihood that she/he follows you back is almost the same. On the other hand, the number of two-way relationships between users within the same time zone is 2 times higher than the number between users from different time zones.

Organization. Section 2 formulates the problem. Section 3 introduces the data set and our analyses on the data set. Section 4 explains the proposed model and describes the algorithm for learning the model. Section 5 presents experimental results that validate the effectiveness of our methodology. Finally, Section 6 reviews the related work and Section 7 concludes this work.

2. PROBLEM DEFINITION

In this section, after presenting several definitions, we formally define the targeted problem of this work. We formulate the problem in the context of Twitter to keep things concrete, though adaptation of this framework to other social-network settings is straightforward.

The Twitter network can be modeled as a directed graph $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of users, and $E \subseteq V \times V$ is the set of directed links between users. Each directed link $e_{ij} = (v_i, v_j) \in E$ indicates that user $v_i$ follows user $v_j$. The Twitter network is dynamic in nature, with links added and removed over time. However, our preliminary statistics on a large Twitter data set show that users tend to add new links much more frequently than they remove existing links (e.g., 97% of changes to links are additions of new links). Therefore, adding new links is what forms the structure of the Twitter network. A new link results when a user follows another user (possibly following back) on Twitter. In particular, we define two types of link behavior:

Definition 1. New-follow and Follow-back: Suppose that at time $t$, user $v_i$ creates a link to $v_j$, who has no previous link to $v_i$; then we say $v_i$ performs a new-follow behavior on $v_j$. When user $v_i$ creates a link to $v_j$ at time $t$, and $v_j$ already has a link to $v_i$ before time $t$, we say $v_i$ performs a follow-back behavior on $v_j$.
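The two behaviors in Definition 1 can be illustrated with a minimal Python sketch. The event format `(t, follower, followee)` and the function names are assumptions made here for illustration; they are not part of the paper.

```python
def classify_link(link_times, t, follower, followee):
    """Label a link created at time t according to Definition 1.

    link_times maps (u, v) to the time at which u started following v.
    """
    reverse_time = link_times.get((followee, follower))
    # Follow-back requires that the reverse link existed strictly before t.
    if reverse_time is not None and reverse_time < t:
        return "follow-back"
    return "new-follow"


def label_events(events):
    """Label a time-ordered list of (t, follower, followee) link creations."""
    link_times = {}
    labeled = []
    for t, u, v in events:
        labeled.append((t, u, v, classify_link(link_times, t, u, v)))
        link_times.setdefault((u, v), t)  # keep the first creation time
    return labeled


# Hypothetical toy log: a follows b, c follows b, then b follows a back.
events = [(1, "a", "b"), (2, "c", "b"), (3, "b", "a")]
labeled = label_events(events)
```

In this toy log only the third event is a follow-back, since it is the only link whose reverse direction already existed.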
The new-follow and follow-back behaviors respectively correspond to the one-way (parasocial) relationship and the two-way (reciprocal) relationship in sociology. In this work, we focus on investigating the formation of follow-back behaviors. For simplicity, let $y^t_{ij} = 1$ denote that user $v_i$ follows back $v_j$ at time $t$ and $y^t_{ij} = 0$ denote that user $v_i$ does not follow back. We are concerned with the following prediction problem:

Problem 1. Follow-back prediction. Let $\langle 1, \ldots, t \rangle$ be a sequence of time stamps with a particular time granularity (e.g., day, week, etc.). Given the Twitter networks from time 1 to $t$, $\{G^t = (V^t, E^t, Y^t)\}$, where $Y^t$ is the set of follow-back behaviors at time $t$, the task is to find a predictive function

$$f : (\{G^1, \ldots, G^t\}) \rightarrow Y^{(t+1)}$$

such that we can infer the follow-back behaviors at time $(t+1)$.

It bears pointing out that our problem is very different from existing link prediction [2, 7, 23] and social action prediction problems [25, 33]. First, as the Twitter network is evolving over time, it is infeasible to collect a complete network at time $t$. Thus it is important to design a method that can take the unlabeled data into consideration as well. Second, it is unclear what the fundamental factors are that cause the formation of follow-back relationships. Finally, one needs to incorporate the different factors (e.g., social theories, statistics, and our intuitions) into a unified model to better predict the follow-back relationships.

3. DATA AND OBSERVATIONS

3.1 Data Collection

We aim to find a large set of users and a continuously updated network among these users, so that we can use the data set as the gold standard to evaluate different approaches for our prediction. To begin the collection process, we selected the most popular user on Twitter, i.e., Lady Gaga, and randomly collected a sample of her followers. We took these users as seed users and used a crawler to collect all followers of these users by traversing following edges.
We continued the traversing process, which produced in total 3,442,659 users and 56,893,234 following links, with an average of 728,59 new links per day. The crawler monitored the change of the network structure from /2/2 to 2/23/2. We also extracted all tweets posted by these users; in total there are 35,746,366 tweets.

In our analysis, we also consider the geographic location of each user. Specifically, we first extracted the location from the profile of each user (footnote 2), and then fed the location information to the Google Maps API to fetch the corresponding longitude and latitude values. In this way, we obtained the longitude and latitude of about 59% of the users in our data set. More detailed analysis and an online demonstration are publicly available.

Footnote 2: For example, Lady Gaga's location information is "Location: New York, NY".

Figure 1: Geographic distance correlation. Panels: (a) Global, (b) Local. X-axis: time zone difference (0 indicates that users are located in the same time zone); Y-axis: (a) probability that one user follows back another user, conditioned on the time zone difference of the two users; (b) number of two-way relationships among users from the same time zone or different time zones.

3.2 Observations

We first engage in some high-level investigation of how different factors influence the formation of follow-back (reciprocal) relationships, since a major motivation of our work is to find the underlying factors and their influence on this task. In particular, we study the interplay of the following factors with the formation of follow-back: (1) Geographic distance: Do users have a higher probability to follow each other when they are located in the same region? (2) Homophily: Do similar users tend to follow each other? (3) Implicit network: How does the following network on Twitter correlate with other implicit networks, e.g., retweet and reply networks? (4) Social balance: Does the two-way relationship network on Twitter satisfy the social balance theory [6]? To what extent?

Geographic distance. Figure 1 shows the correlation between geographic distance and the probability that two users create a two-way relationship (i.e., follow back each other). Interestingly, it seems that online social networks indeed go global: Figure 1(a) shows the likelihood of a user following another user back when they are from the same time zone or from different time zones. Clearly, geographic distance is no longer a factor that stops users from developing a trustful (reciprocal) relationship. Figure 1(b) shows another statistic, which indicates from a different perspective that the Twitter network (in some sense) still stays local: the average number of two-way (reciprocal) relationships between users from the same time zone is about 5 times higher than the number between users with a distance of three time zones.

Homophily. The principle of homophily [5] suggests that users with similar characteristics (e.g., social status, age) tend to associate with each other. In particular, we study two kinds of homophily on the Twitter network: link homophily and status homophily. For link homophily, we test whether users who share common links (followers or followees) have a tendency to associate with each other. Figure 2 clearly shows that the probability of two users following back each other when they share common neighbors is much higher than usual. When the number of common neighbors with two-way relationships increases to 3, the likelihood of two users following back each other also triples. The effect is more pronounced when the number increases further. But it is worth noting that this only works for two-way (reciprocal) relationships and does not hold for one-way (parasocial) relationships (as indicated in Figure 2).

Figure 2: Link homophily. X-axis: number of common neighbors; Y-axis: probability that two users follow back each other, conditioned on the number of common neighbors with two-way relationships (or one-way relationships).

For status homophily, we test whether two users with similar social status are more likely to associate with each other. We categorize users into two groups (elite users and ordinary users) by three different algorithms: PageRank [22] (footnote 3), #degree, and the $(\alpha, \beta)$ algorithm [8] (footnote 4). Specifically, with PageRank, we estimate the importance of each user according to the network structure, and then select as elite users the top 1% of users (footnote 5) who have the highest PageRank scores and the rest as ordinary users; with #degree, we select the top 1% of users with the highest in-degree as elite users and the rest as ordinary users.
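As a rough illustration of the #degree variant, the following Python sketch ranks users by in-degree (number of followers) and labels the top fraction as elite. The function name, the edge-list format, and the default `top_fraction=0.01` are assumptions of this sketch rather than details confirmed by the paper; ties are broken by user name only to keep the output deterministic.

```python
from collections import Counter


def elite_by_indegree(edges, top_fraction=0.01):
    """Return the set of users labeled elite by in-degree.

    edges is an iterable of (follower, followee) pairs; in-degree counts
    how many distinct followers each user has.
    """
    edges = set(edges)  # ignore duplicate links
    indegree = Counter(followee for _, followee in edges)
    users = {u for edge in edges for u in edge}
    ranked = sorted(users, key=lambda u: (-indegree[u], str(u)))
    k = max(1, int(len(users) * top_fraction))  # always keep at least one
    return set(ranked[:k])


# Hypothetical toy network with four users; "star" has three followers.
toy_edges = [("a", "star"), ("b", "star"), ("c", "star"), ("a", "b")]
elite = elite_by_indegree(toy_edges, top_fraction=0.25)
ordinary = {u for e in toy_edges for u in e} - elite
```

The same split could be produced from PageRank scores by swapping the ranking key, which is why the selection step is kept separate from the scoring step here.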
For $(\alpha, \beta)$, we input the size of the core community as 2, and after running the algorithm, we use the users selected in the core community as elite users and the rest as ordinary users. Then, we examine the difference in follow-back behaviors among the two groups of users. Figure 3 clearly shows that, though the three algorithms present different statistics, elite users have a much stronger tendency to follow each other: the likelihood of two elite users following back each other is nearly 8 times higher than that of ordinary users (by the $(\alpha, \beta)$ algorithm). The $(\alpha, \beta)$ algorithm seems better able to distinguish elite users from ordinary users in our problem setting. This is because, besides the global network structure, the $(\alpha, \beta)$ algorithm also considers the community structure among elite users.

Footnote 3: PageRank is an algorithm to estimate the importance of each node in a network.
Footnote 4: The $(\alpha, \beta)$ algorithm is designed to find core members (elite users) in a social network.
Footnote 5: Statistics have shown that less than 1% of the Twitter users produce 50% of its content [3].

Figure 3: Status homophily by different algorithms. Y-axis: probability that two users follow back each other, conditioned on whether the two users are from the same group of elite/ordinary users or from different groups (ordinary users; ordinary and elite users; elite users). #Degree, PageRank, and $(\alpha, \beta)$ are three algorithms to distinguish elite users from ordinary users.

Implicit structure. On Twitter, besides the explicit network with following links, there are also some implicit network structures that can be induced from the textual information. For example, user A may mention user B in her tweet, which is called a reply link; user A may forward user B's tweet, which results in a retweet link. We study how the implicit links correlate with the formation of follow-back relationships on Twitter. Figure 4 clearly shows that when users A and B retweet or reply to each other's tweets, the likelihood of their following back each other is higher (3 times higher than chance). Another interesting phenomenon is that, compared with replying to someone's tweet, retweeting (forwarding) her tweet seems to be more helpful (5% vs. 9%) to win her follow-back.

Figure 4: Implicit network correlation. Y-axis: probability that user B follows user A back, conditioned on one user (A or B) retweeting or replying to the other user's tweet (cases: no retweet (reply); A retweets (replies to) B; B retweets (replies to) A; both reply).

Structural balance. Now, we connect our work to a basic social psychological theory: structural balance theory [6]. Let us first explain the structural balance property. For every group of three users (called a triad), the balance property implies that either all three of these users are friends or only one pair of them are friends. Figure 5 shows such an example. To adapt the theory to our problem, we can map either the two-way relationship or the one-way relationship onto the friendship. Then we examine how well the Twitter network (with only two-way relationships or only one-way relationships) satisfies the structural balance property. More precisely, we compare the probabilities that the resultant triads satisfy the balance theory based on two-way relationships and on one-way relationships on Twitter. Figure 6 clearly shows that it is much more likely (88%) for users to be connected in a balanced structure of two-way relationships, while with one-way relationships the resultant structure is very unbalanced. This is because two users are very likely to follow the same movie star without knowing each other, which results in an unbalanced triad (Figure 5 (C)).

Figure 5: Illustration of structural balance theory. (A) and (B) are balanced, while (C) and (D) are not balanced.

Figure 6: Structural balance correlation. Y-axis: probability that a triad creates two-way (reciprocal) relationships, conditioned on whether the resultant structure is balanced or not (shown for two-way and one-way relationships).

In summary, according to the statistics above, we have the following observations:

1. Geographic distance has a pronounced effect on the number of two-way relationships created between users, but little effect on the likelihood of users following back each other.
2. Users with common friends of two-way relationships have a tendency (link homophily) to follow each other.
3. Elite users have a much stronger tendency (status homophily) to follow each other than ordinary users.
4. The implicit networks of retweet or reply links have a strong correlation with the formation of two-way (reciprocal) relationships.
5. The network of two-way relationships on Twitter is balanced (88% of triads satisfy the structural balance property), while the network of one-way relationships is unbalanced (7% are unbalanced).

4. MODEL FRAMEWORK

In this section, we propose a novel Triad Factor Graph (TriFG) model that incorporates all the information within a single entity for better modeling and predicting the formation of two-way (follow-back) relationships.

For an edge $e_{ij} \in E$, if user $v_j$ follows $v_i$ at time $t$, our task is to predict whether user $v_i$ will follow $v_j$ back, i.e., $y_{ij} = 1$ or $0$. For ease of explanation, we introduce a slight change of notation. We write each edge as $e_i$, with its two end users as $v_i$ and $v^u_i$. For the follow-back prediction task, we assume that $v_i$ follows $v^u_i$ at time $t$, and our task is to predict whether $v^u_i$ will follow $v_i$ back at time $(t+1)$. Based on the observations in Section 3, we define a number of attributes for each edge, denoted as $x_i$. The $|E| \times d$ attribute matrix $X$ describes edge-specific characteristics, where $d$ is the number of attributes. For example, on Twitter, an attribute can be defined as whether the two end users are from the same time zone. An element $x_{ij}$ in the matrix $X$ indicates the $j$-th attribute value of edge $e_i$.

4.1 The Proposed Model

We propose a Triad Factor Graph (TriFG) model.
The name is derived from the idea that we incorporate social theories (structural balance and homophily) over triads into the factor graph model. Figure 7 shows the graphical structure of the TriFG model. The left figure shows the following network of six users at time $t$. Blue arrows indicate new follow actions, black arrows indicate follow actions performed before time $t$, and a blue marker indicates that user $v^u_i$ does not follow user $v_i$ back at time $t$. The right figure is the factor graph model derived from the left input network. Each gray ellipse indicates a relationship $(v^u_i, v_i)$ between users and each white circle indicates the hidden variable $y_i$, with $y_i = 1$ representing that $v^u_i$ performs a follow-back action, $y_i = 0$ that it does not, and $y_i = ?$ unknown, which is the variable we need to predict. Factor $h(\cdot)$ represents a balance factor function defined on a triad, and $f(v_i, v^u_i, y_i)$ (or $f(x_i, y_i)$) represents a factor that captures the information associated with edge $e_i$.

Figure 7: Graphical representation of the TriFG model. The left figure shows the follow network at time $t$. Blue arrows indicate new follow actions, black arrows indicate previously existing follow links, and a blue marker indicates that user $v^u_i$ does not follow user $v_i$ back. The right figure is the TriFG model derived from the following graph. Each gray ellipse indicates a relationship $(v^u_i, v_i)$ between users and each white circle indicates the hidden variable $y_i$. $f(v_i, v^u_i, y_i)$ represents an attribute factor function and $h(\cdot)$ represents a triad factor function.

Given a network at time $t$, i.e., $G^t = (V^t, E^t, X^t)$ with some known variables $y = 1$ or $0$ and some unknown variables $y = ?$, our goal is to infer the values of those unknown variables. For simplicity, we remove the superscript $t$ from all variables if there is no ambiguity. We begin with the posterior probability $P(Y \mid X, G)$. According to Bayes' theorem, we have

$$P(Y \mid X, G) = \frac{P(X, G \mid Y)\, P(Y)}{P(X, G)} \propto P(X \mid Y) \cdot P(Y \mid G) \qquad (1)$$

where $P(Y \mid G)$ denotes the probability of the labels given the structure of the network and $P(X \mid Y)$ denotes the probability of generating the attributes $X$ associated with each edge given their labels $Y$. Assuming that the generative probability of attributes given the label of each edge is conditionally independent, we get

$$P(Y \mid X, G) \propto P(Y \mid G) \prod_i P(x_i \mid y_i) \qquad (2)$$

where $P(x_i \mid y_i)$ is the probability of generating attributes $x_i$ given the label $y_i$. Now, the problem is how to instantiate the probabilities $P(Y \mid G)$ and $P(x_i \mid y_i)$. In principle, they can be instantiated in different ways.
In this work, we model them in a Markov random field, and thus by the Hammersley-Clifford theorem [7], the two probabilities can be instantiated as:

$$P(x_i \mid y_i) = \frac{1}{Z_1} \exp\Big\{ \sum_{j=1}^{d} \alpha_j f_j(x_{ij}, y_i) \Big\} \qquad (3)$$

$$P(Y \mid G) = \frac{1}{Z_2} \exp\Big\{ \sum_{c} \sum_{k} \mu_k h_k(Y_c) \Big\} \qquad (4)$$

where $Z_1$ and $Z_2$ are normalization factors. Eq. 3 indicates that we define a feature function $f_j(x_{ij}, y_i)$ for each attribute $x_{ij}$ associated with edge $e_i$, and $\alpha_j$ is the weight of the $j$-th attribute; while Eq. 4 represents that we define a set of correlation feature functions $\{h_k(Y_c)\}_k$ over each triad $Y_c$ in the network. Here $\mu_k$ is the weight of the $k$-th correlation feature function.

Algorithm 1: Learning algorithm for the TriFG model.
Input: network $G^t$, learning rate $\eta$
Output: estimated parameters $\theta$
Initialize $\theta \leftarrow 0$;
repeat
    Perform LBP to calculate the marginal distribution of the unknown variables, $P(y_i \mid x_i, G)$;
    Perform LBP to calculate the marginal distribution of each triad $c$, i.e., $P(y_c \mid X_c, G)$;
    Calculate the gradient of $\mu_k$ according to Eq. 7 (for $\alpha_j$ with a similar formula):
        $\frac{\partial \mathcal{O}(\theta)}{\partial \mu_k} = E[h_k(Y_c)] - E_{P_{\mu_k}(Y_c \mid X, G)}[h_k(Y_c)]$
    Update parameters $\theta$ with the learning rate $\eta$:
        $\theta_{new} = \theta_{old} + \eta \cdot \frac{\partial \mathcal{O}(\theta)}{\partial \theta}$
until convergence;

Based on Eqs. 2-4, we define the following log-likelihood objective function $\mathcal{O}(\theta) = \log P_\theta(Y \mid X, G)$:

$$\mathcal{O}(\theta) = \sum_{i=1}^{|E|} \sum_{j=1}^{d} \alpha_j f_j(x_{ij}, y_i) + \sum_{c} \sum_{k} \mu_k h_k(Y_c) - \log Z \qquad (5)$$

where $Y_c$ is a triad derived from the input network, $Z = Z_1 Z_2$ is a normalization factor, and $\theta = (\{\alpha\}, \{\mu\})$ indicates a parameter configuration.

One example of factor decomposition is shown in Figure 7. There are six edges, three with known labels ($y = 1$ or $y = 0$) and three with unknown values ($y = ?$). We have four triads (e.g., $Y_c = (y_1, y_2, y_3)$) based on the structure of the input network. For each edge, we define a set of factor functions $f(v_i, v^u_i, y_i)$ (also written as $f(x_i, y_i)$).

We now briefly introduce possible ways to define the factor functions $f_j(x_{ij}, y_i)$ and $h_k(Y_c)$. $f_j(x_{ij}, y_i)$ is an attribute factor function. It can be defined as either a binary function or a real-valued function.
For example, for the implicit network feature, we simply define it as a binary feature: if user $v_i$ forwarded (retweeted) $v^u_i$'s tweet before time $t$ and user $v^u_i$ follows user $v_i$ back, then a feature $f_j(x_{ij} = 1, y_i = 1)$ is defined and its value is 1; otherwise 0. (Such a feature definition is often used in graphical models such as Conditional Random Fields [4].) For the triad factor function $h(Y_c)$, we define four features, two balanced and two unbalanced factor functions, as depicted in Figure 5. The triad function is defined as a binary function: if a triad satisfies the structural balance property, then the value of the corresponding triad factor function is 1, otherwise 0. More details of the factor function definitions are given in the Appendix.

4.2 Model Learning and Prediction

We now address the problem of estimating the free parameters and inferring users' follow-back behaviors. Learning the TriFG model is to estimate a parameter configuration $\theta = (\{\alpha\}, \{\mu\})$ that maximizes the log-likelihood objective function $\mathcal{O}(\theta) = \log P_\theta(Y \mid X, G)$, i.e.,

$$\theta^{*} = \arg\max_{\theta} \mathcal{O}(\theta) \qquad (6)$$
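To make the factor definitions of Section 4.1 concrete, the following Python sketch evaluates the unnormalized part of the objective (Eq. 5 without the $\log Z$ term) for one label configuration. It is a simplification under stated assumptions: all attributes are binary, and the four triad features are collapsed into a single balance indicator with one weight `mu`, following the description that a triad is balanced when all three pairs or exactly one pair are friends. All function and variable names are hypothetical.

```python
def attribute_feature(x_ij, y_i):
    """Binary attribute factor: 1 only if the attribute holds and y_i = 1."""
    return 1.0 if x_ij == 1 and y_i == 1 else 0.0


def triad_feature(labels):
    """Binary triad factor: 1 if the triad is balanced, i.e. all three
    relationships are present or exactly one of them is."""
    return 1.0 if sum(labels) in (1, 3) else 0.0


def unnormalized_log_score(X, y, triads, alpha, mu):
    """Sum of weighted attribute and triad factors for labeling y.

    X: list of attribute rows (one per edge), y: list of 0/1 labels,
    triads: list of index triples into y, alpha: per-attribute weights,
    mu: single weight shared by all triads (a simplifying assumption).
    """
    score = 0.0
    for x_i, y_i in zip(X, y):
        for a_j, x_ij in zip(alpha, x_i):
            score += a_j * attribute_feature(x_ij, y_i)
    for c in triads:
        score += mu * triad_feature([y[i] for i in c])
    return score


# Toy example: three edges, two attributes each, one triad over all edges.
X = [[1, 0], [1, 1], [0, 0]]
triads = [(0, 1, 2)]
balanced = unnormalized_log_score(X, [1, 1, 1], triads, [0.5, 2.0], 1.5)
unbalanced = unnormalized_log_score(X, [1, 1, 0], triads, [0.5, 2.0], 1.5)
```

With a positive `mu`, the balanced labeling scores higher than the unbalanced one even though the third edge has no active attributes, which is the mechanism by which triad factors can pull in otherwise hard cases.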

To solve the objective function, we adopt a gradient descent method (or a Newton-Raphson method). We use $\mu$ as the example to explain how we learn the parameters. Specifically, we first write the gradient of each $\mu_k$ with regard to the objective function (Eq. 5):

$$\frac{\partial \mathcal{O}(\theta)}{\partial \mu_k} = E[h_k(Y_c)] - E_{P_{\mu_k}(Y_c \mid X, G)}[h_k(Y_c)] \qquad (7)$$

where $E[h_k(Y_c)]$ is the expectation of the factor function $h_k(Y_c)$ given the data distribution (essentially it can be considered as the average value of the factor function $h_k(Y_c)$ over all triads in the training data), and $E_{P_{\mu_k}(Y_c \mid X, G)}[h_k(Y_c)]$ is the expectation of the factor function $h_k(Y_c)$ under the distribution $P_{\mu_k}(Y_c \mid X, G)$ given by the estimated model. A similar gradient can be derived for the parameter $\alpha_j$.

One challenge here is that the graphical structure in the TriFG model can be arbitrary and may contain cycles, which makes it intractable to directly calculate the marginal distribution $P_{\mu_k}(Y_c \mid X, G)$. A number of approximate algorithms can be considered, such as Loopy Belief Propagation (LBP) [2] and Mean-field [32]. We chose Loopy Belief Propagation due to its ease of implementation and effectiveness. Specifically, we approximate the marginal distribution $P_{\mu_k}(Y_c \mid X, G)$ using LBP. With the marginal probabilities, the gradient can be obtained by summing over all triads. It is worth noting that we need to perform the LBP process twice in each iteration, once for estimating the marginal distribution of the unknown variables $y_i = ?$ and once for the marginal distribution over all triads. Finally, with the gradient, we update each parameter with a learning rate $\eta$. The learning algorithm is summarized in Algorithm 1.

Predicting Follow-back. With the estimated parameters $\theta$, we can predict the labels of the unknown variables $\{y_i = ?\}$ by finding a label configuration which maximizes the objective function, i.e., $Y^{*} = \arg\max \mathcal{O}(Y \mid X, G, \theta)$. It is still intractable to obtain the exact solution.
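To see why exact inference does not scale, consider what it requires: the marginal of each unknown label is a sum over every joint completion of the unknown labels, which is $2^n$ terms for $n$ unknown edges. The Python sketch below computes these exact marginals by brute-force enumeration on a toy graph. It is not the LBP procedure used in the paper, only the quantity that LBP approximates, and it assumes binary attributes and a single shared triad weight `mu`; all names are hypothetical.

```python
from itertools import product
from math import exp, log


def log_score(y, X, triads, alpha, mu):
    """Unnormalized log-probability of labeling y (binary attributes)."""
    s = sum(a * x * y_i for x_i, y_i in zip(X, y) for a, x in zip(alpha, x_i))
    s += sum(mu for c in triads if sum(y[i] for i in c) in (1, 3))
    return s


def exact_marginals(X, known, triads, alpha, mu):
    """Return P(y_i = 1 | known labels) for every unknown edge i.

    known maps edge index to its observed 0/1 label. The loop visits all
    2**n completions of the n unknown labels, so it is only usable on
    very small graphs.
    """
    n = len(X)
    unknown = [i for i in range(n) if i not in known]
    z = 0.0
    mass = {i: 0.0 for i in unknown}
    for values in product((0, 1), repeat=len(unknown)):
        y = [known.get(i, 0) for i in range(n)]
        for i, v in zip(unknown, values):
            y[i] = v
        w = exp(log_score(y, X, triads, alpha, mu))
        z += w
        for i in unknown:
            if y[i] == 1:
                mass[i] += w
    return {i: mass[i] / z for i in unknown}


# Toy triad: two known follow-backs and one unknown edge, no attribute signal.
marginals = exact_marginals(
    X=[[0], [0], [0]], known={0: 1, 1: 1}, triads=[(0, 1, 2)],
    alpha=[0.0], mu=log(2.0),
)
predicted = {i: int(p >= 0.5) for i, p in marginals.items()}
```

Here the only evidence for the unknown edge is that labeling it 1 completes a balanced triad, so its marginal rises above one half and the predicted label is 1.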
Again, we utilize loopy belief propagation to approximate the solution, i.e., to calculate the marginal distribution of each relationship with an unknown variable, $P(y_i \mid x_i, G)$, and finally assign each relationship the label with the maximal probability.

5. EXPERIMENTS

In this section, we first describe our experimental setup. We then present the performance results for different approaches in different settings. Next, we present several analyses and discussions. Finally, we use a case study to further demonstrate the advantage of the proposed model.

5.1 Experimental Setup

Prediction Setting. We use the data set described in Section 3 in our experiments. To quantitatively evaluate the effectiveness of the proposed model and compare it with other alternative methods, we carefully select a sub network from the data set which has a complete historic log of link formation among all users, i.e., each user is associated with a complete list of followers and of users they are following at each time stamp. The sub network is comprised of 2,44 users, 468,238 following links among them, and 2,49,768 tweets. On average there are 3,337 new follow-back links per day. We divide the sub network into 13 time stamps by viewing every four days as a time stamp. Our general task is to predict whether a user will follow another user back at the next time stamp when she has received a new following link from the other user. On a more careful study, however, we found that it is very challenging if we restrict the prediction to just the next time stamp. Figure 8 shows the distribution of the time span in which a user performs the follow-back action, which indicates that although 60% of follow-backs are performed in the next time stamp, 37% of follow-backs are still performed in the following three time stamps.

Figure 8: Follow-back probability for different time stamps. X-axis: time stamp; Y-axis: follow-back probability.
A further data analysis shows that active users often either perform an immediate follow-back (at the next time stamp) or decline to follow back, while some other (inactive) users may not log in to Twitter frequently, so the time span of their follow-backs varies a lot. According to this observation, in our first experiment we use the network of the first 8 time stamps for training and predict follow-back actions in the following 4 (9th-12th) time stamps (Test Case 1). Then we incrementally add the network of the 9th time stamp into the training data and again use the following 4 (10th-13th) time stamps for prediction (Test Case 2). We report the prediction performance of the different approaches for the two test cases respectively.

Comparison Methods. We compare the proposed TriFG model with the following methods:

SVM: it uses the same attributes associated with each edge as features to train a classification model and then employs the classification model to predict edge labels in the test data. For SVM, we employ SVM-light.
LRC: it uses the same attributes associated with each edge as features to train a logistic regression classification model [6] and then predicts edge labels in the test data.
CRF-balance: it trains a Conditional Random Field [4] model with the attributes associated with each edge. The difference between this method and our model is that it does not consider structural balance factors.
CRF: it trains a Conditional Random Field model with all factors (including attributes and structural balance factors) and predicts edge labels in the test data.
TriFG: the proposed model, which trains a factor graph model with unlabeled data and all factors we defined in Section 4.
Weak TriFG (wTriFG): the difference between wTriFG and TriFG is that wTriFG does not consider status homophily and structural balance. We use this method to evaluate how social theories can help this task.

Among these methods, SVM and CRF-balance only consider attribute factors; wTriFG further considers unlabeled data. CRF considers all factors we defined, but does not consider unlabeled data.
Our proposed TriFG model considers all factors as well as the unlabeled data.

Evaluation Measures. We evaluate the performance of different approaches in terms of Precision (Prec.), Recall (Rec.), F-Measure (F), and Accuracy (Accu.).
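For reference, the four measures can be computed from binary follow-back labels as in the sketch below. It assumes that label 1 means "follows back" and 0 means "does not", and that F-Measure denotes the balanced F1 score; the paper does not spell out either convention in this section, so both are assumptions, as are the function names.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = follow-back)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Return precision, recall, F1, and accuracy; undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels for illustration only (not data from the paper).
scores = evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Because follow-back prediction is typically imbalanced, recall and F-Measure say more about how many true follow-backs a method finds than accuracy alone does.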

Table 1: Follow-back prediction performance of different methods in the two test cases. Test Case 1: predicting follow-back actions in the 9th-12th time stamps; Test Case 2: in the 10th-13th time stamps.
Columns: Data, Algorithm, Prec., Rec., F, Accu.
Rows (for each of Test Case 1 and Test Case 2): SVM, LRC, CRF-balance, CRF, wTriFG, TriFG.

Table 2: Follow-back prediction performance of TriFG with three different algorithms (#degree, PageRank, and (α, β)) for distinguishing elite users from ordinary users.
Columns: Data, Algorithm, Prec., Rec., F, Accu.
Rows (for each of Test Case 1 and Test Case 2): (α, β), #degree, PageRank.

All algorithms are implemented in C++, and all experiments are performed on a PC running Windows 7 with an Intel(R) Core(TM) 2 CPU 66 (2.4GHz and 2.39GHz) and 4GB memory. All algorithms are efficient: the CPU time needed for training and prediction by all methods on the Twitter network ranges from 2 to 5 minutes.

5.2 Prediction Performance

We now describe the performance results for the different methods we considered. Table 1 shows the results in the two test cases (prediction performance for the 9th-12th time stamps and for the 10th-13th time stamps). It can be clearly seen that our proposed TriFG model significantly outperforms the four comparison methods. In terms of F-Measure, TriFG achieves a +27% improvement compared with SVM. Compared with the other three graph-based methods, TriFG also results in an improvement of 22-25%. The advantage of TriFG mainly comes from the improvement in recall. One important reason is that TriFG can detect some difficult cases by leveraging the structural balance correlation and the homophily correlation. For example, without considering these two kinds of social correlation, the performance of wTriFG decreases to 7-72% in terms of F-Measure in the two test cases. Another advantage of TriFG is that it makes use of the unlabeled data. Essentially, it further considers some latent correlations in the data set, which cannot be leveraged with only the labeled training data.
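The two test cases reported above follow a simple sliding scheme: train on all time stamps up to some index, predict over the next four, then extend the training data by one time stamp and repeat. The sketch below reproduces that scheme; the function name and its parameters are hypothetical and chosen only for illustration.

```python
def split_windows(first_train_end, horizon, num_cases):
    """Build (train_stamps, test_stamps) pairs for incrementally growing training data.

    first_train_end: last time stamp used for training in the first test case.
    horizon: number of time stamps predicted after the training period.
    num_cases: how many test cases to generate, each adding one training stamp.
    """
    cases = []
    for offset in range(num_cases):
        train_end = first_train_end + offset
        train_stamps = list(range(1, train_end + 1))
        test_stamps = list(range(train_end + 1, train_end + horizon + 1))
        cases.append((train_stamps, test_stamps))
    return cases


# First 8 time stamps for training and a 4-stamp prediction window, as in the paper.
cases = split_windows(first_train_end=8, horizon=4, num_cases=2)
```

With these settings the first case predicts over time stamps 9 to 12 and the second over 10 to 13, so the two test windows overlap in three time stamps.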
5.3 Analysis and Discussion

Now, we perform several analyses to examine the following aspects of the TriFG model: (1) contribution of different factors in the TriFG model; (2) convergence property of the learning algorithm; (3) effect of different settings for the time span; and (4) effect of different algorithms for elite user finding.

Figure 9: Factor contribution analysis. TriFG-B stands for ignoring structural balance correlation. TriFG-BI stands for ignoring both structural balance correlation and implicit network correlation. TriFG-BIS stands for further ignoring status homophily, and TriFG-BISL stands for further ignoring link homophily. (Y-axis: F-Measure; series: Test Case 1, Test Case 2.)

Factor Contribution Analysis. In TriFG, we consider five different factor functions: Geographic distance (G), Link homophily (L), Status homophily (S), Implicit network correlation (I), and structural Balance correlation (B). Here we examine the contribution of the different factors defined in our model. We first rank the individual factors by their predictive power (see footnote 6), then remove them one by one in reverse order of their predictive power. In particular, we first remove structural balance correlation, denoted as TriFG-B, followed by further removing the implicit network correlation, denoted as TriFG-BI, then status homophily, denoted as TriFG-BIS, and finally link homophily, denoted as TriFG-BISL. We train and evaluate the prediction performance of the different versions of TriFG. Figure 9 shows the average F-Measure scores of the different versions of the TriFG model. We observe a clear drop in performance each time a factor is ignored. This indicates that our method works well by combining the different factor functions and that each factor contributes to the performance.

Convergence Property. We conduct an experiment to see the effect of the number of loopy belief propagation iterations. Figure 10 illustrates the convergence analysis results of the learning algorithm.
We see that on both test cases the LBP-based learning algorithm converges quickly: after only seven learning iterations, the prediction performance of TriFG on both test cases becomes stable. This suggests that the learning algorithm is very efficient and has a good convergence property.

Figure 10: Convergence analysis of the learning algorithm. (X-axis: #iterations; Y-axis: F-Measure; series: Test Case 1, Test Case 2.)

Effect of Time Span. Figure 8 already shows the distribution of follow-backs in different time stamps. Now, we quantitatively examine how different settings for the time span affect the prediction performance. Figure 11 lists the average prediction performance of TriFG in the two test cases with different settings of the time span. It shows that when the time span is set to two or fewer time stamps, the prediction performance of TriFG drops sharply, while when it is set to three time stamps, the performance is acceptable. The results are consistent with the statistics in Figure 8: more than 9% of follow-back actions are performed in the first three time stamps, and only about 8% of the follow-back actions are in the first two time stamps.

Figure 11: Follow-back prediction for different time stamps. (Y-axis: F-Measure; series: Test Case 1, Test Case 2; settings: t = 1, t = 2, t = 3, t = 4.)

Footnote 6: We did this by removing each particular factor from our model in turn and evaluating the decrease in the prediction performance of the TriFG model. A larger decrease means a higher predictive power.

Effect of different algorithms for elite user finding. The status homophily factor depends on the results of elite user finding. We use three different algorithms, i.e., PageRank, #degree, and the (α, β) algorithm, to find elite users. Now we examine how the different algorithms affect the prediction performance. Table 2 shows the prediction performance of TriFG with the different elite user finding algorithms in the two test cases. Interestingly, though TriFG with the (α, β) algorithm achieves the best performance, the difference in performance among the three algorithms, especially in the second test case, is not that pronounced (with a difference of %-4% in terms of F-measure score). This confirms the effectiveness and generality of incorporating the status homophily factor into our TriFG model.

5.4 Qualitative Case Study

Now we present a case study to demonstrate the effectiveness of the proposed model. Figure 12 shows an example generated from our experiments. It represents a portion of the Twitter network from the 10th-13th time stamps. Black arrows indicate following links created 4 time stamps before (we use 4 time stamps as the time span for prediction). Blue arrows indicate new following links in the past 4 time stamps. Dashed arrows indicate follow-back links in our data set (a), predicted by SVM (b), and predicted by our model TriFG (c), with green denoting a correct prediction and red denoting a mistaken one. Red also indicates a follow-back link that should exist but that the approach did not detect.
We look at specific examples to study why the proposed model can outperform the comparison methods. A, B, and C are three elite users identified using the (α, β) algorithm [8]. SVM correctly predicts that there is a follow-back link from C to B, but misses the follow-back link from C to A. Our model TriFG correctly predicts both follow-back links. This is because TriFG leverages the structural balance factor: the structure among the three users that results from the SVM predictions is unbalanced, whereas TriFG tends to produce a balanced structure. It is also worth looking at the pair of users that includes user 9. TriFG made a mistake here: it does not predict the follow-back link, while the link was correctly predicted by SVM. The two users have a similar social status (similar in-degree) and are from the same time zone, thus SVM successfully predicted the follow-back link. However, as the resulting structure is unbalanced, TriFG made a compromise and finally produced a mistaken prediction.

6. RELATED WORK

In this section, we review related work on link prediction and Twitter study in social networks.

Our work is related to link prediction, which is one of the core tasks in social networks. Existing work on link prediction can be broadly grouped into two categories based on the learning methods employed: unsupervised link prediction and supervised link prediction. Unsupervised link prediction usually assigns scores to potential links based on the intuition that the more similar a pair of users is, the more likely they are to be linked. Various similarity measures of users are considered, such as preferential attachment [2] and the Katz measure [2]. A survey of unsupervised link prediction can be found in [7]. Recently, [8] designed a flow-based method for link prediction. There are also a number of works which employ supervised approaches to predict links in social networks, such as [28, 8, 2, 6]. Backstrom et al. [2] propose a supervised random walk algorithm to estimate the strength of social links. Leskovec et al. [6] employ a logistic regression model to predict positive and negative links in online social networks.

The main differences between existing work on link prediction and our work concern two aspects. First, existing work handles undirected social networks, while we address the directed nature of the Twitter network and predict a directed link between a pair of users given an existing link in the other direction. Second, most existing models for link prediction are static. In contrast, our model is dynamic and learned from the evolution of the Twitter network. Moreover, we combine social theories (such as homophily and structural balance theory) into a semi-supervised learning model.

Another type of related work is social behavior analysis. Tang et al. [26] study the difference of social influence on different topics and propose Topical Affinity Propagation (TAP) to model topic-level social influence in social networks, and develop a parallel model learning algorithm based on the map-reduce programming model. Tan et al. [25] investigate how social actions evolve in a dynamic social network and propose a time-varying factor graph model for modeling and predicting users' social behaviors. The methods proposed in these works can be utilized for the problem defined in this work, but the problem is fundamentally different.

There is little doubt that Twitter has intrigued worldwide netizens and the research community alike. Existing Twitter studies mainly center around the following three aspects: 1) the Twitter network. Java et al. [] study the topological and geographical properties of the Twitter network. Their findings verify the homophily phenomenon that users with similar intentions connect with each other. Kwak et al. [3] conduct a similar study on the entire Twittersphere and observe some notable properties of Twitter, such as a non-power-law follower distribution, a short ef-
