Team Chinchillas, Telenor SNA case

Business Understanding

Telenor wants to identify social network leaders from a list of A and B nodes, their connection counts, and their connection strengths. A second goal is to characterize a node's likelihood of turning "bad" based on its relationships to a list of known bad nodes.

Data Understanding

We have 118 690 unique nodes with 1 452 755 connections between them.

Some nodes have as many as 2 460 outgoing connections, which suggests that the data contains call centers and taxi services. This is particularly important because such nodes can form "rank sinks" when running a PageRank-style algorithm.
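Such nodes can be flagged directly from the edge list. A minimal sketch, assuming a pandas DataFrame with the Subscriber_A/Subsciber_B columns used later in this report (the toy data and the threshold are illustrative; the real cut-off would be in the hundreds):

```python
import pandas as pd

# Toy edge list; the real data has columns Subscriber_A and Subsciber_B.
data = pd.DataFrame({
    'Subscriber_A': ['a', 'a', 'a', 'b', 'c'],
    'Subsciber_B':  ['b', 'c', 'd', 'a', 'a'],
})

# Count distinct outgoing connections per caller.
out_degree = data.groupby('Subscriber_A')['Subsciber_B'].nunique()

# Flag suspected call centers / taxi services above a chosen threshold.
THRESHOLD = 2
suspects = out_degree[out_degree > THRESHOLD].index.tolist()
print(suspects)  # ['a'] in this toy example
```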

Data Preparation

Dealing with the 0/1 call flag and NULLs

We omitted the zero-call rows and the rows with NULL values from the dataset.

Approaches to the Low-Medium-High connection strengths

One approach is to reserve part of the data as a test set. On the remaining training data we use the log of the contribution metric, i.e. the connection's percentage share of the entire network's traffic. We expect these percentages to be very small, so taking their log makes the numbers easier to work with. We can then use cross-validation to fit the best Low, Medium and High values (on a 0-1 scale) and evaluate the model on the held-out test set.
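The contribution metric can be sketched as follows. The edge list and the calls column are illustrative stand-ins, since the exact strength field is not shown in this report:

```python
import numpy as np
import pandas as pd

# Toy weighted edge list; 'calls' stands in for the raw connection strength.
edges = pd.DataFrame({
    'Subscriber_A': ['a', 'a', 'b', 'c'],
    'Subsciber_B':  ['b', 'c', 'a', 'a'],
    'calls':        [10, 1, 5, 4],
})

# Contribution metric: each connection's share of all traffic in the network.
edges['contribution'] = edges['calls'] / edges['calls'].sum()

# The shares are tiny on the full network, so we work on a log scale.
edges['log_contribution'] = np.log(edges['contribution'])
print(edges[['contribution', 'log_contribution']])
```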

Joint data preparation

Our team agreed to produce, in parallel, a number of datasets with unique node IDs in the first column and different algorithm outputs in the remaining columns. Since different algorithms describe the network as a whole, node pairs, node rankings, or node clusters, and we are primarily interested in comparing nodes, we agreed to prefix columns that represent node rankings with R_ and columns that hold clustering algorithm outputs with C_.
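The agreed joint format could look like the following sketch; the R_/C_ column names here are hypothetical examples of the convention:

```python
import pandas as pd

# Each team member produces a frame keyed by node ID; ranking outputs are
# prefixed R_, clustering outputs C_ (these column names are illustrative).
ranks = pd.DataFrame({'node_id': ['a', 'b'], 'R_page_rank': [0.7, 0.3]})
clusters = pd.DataFrame({'node_id': ['a', 'b'], 'C_community': [0, 1]})

# Merge the parallel outputs into one combined table per node.
combined = ranks.merge(clusters, on='node_id', how='outer')
print(combined.columns.tolist())  # ['node_id', 'R_page_rank', 'C_community']
```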

## Constructing a network graph to see which networks are most active
graph_data = data.copy()
graph_data['weights'] = graph_data.groupby(['Subscriber_A', 'Subsciber_B'])['Subscriber_A'].transform('count')
G = nx.from_pandas_dataframe(graph_data, 'Subscriber_A', 'Subsciber_B', 'weights')

## Constructing a table that determines the networks that performed best
## with receiving calls. (Assumes count > 30 for representative data)
raw_data = data.copy()
raw_data = raw_data.drop_duplicates().dropna()
raw_data.loc[:, 'Count'] = pd.Series(len(raw_data) * [1], index=raw_data.index)
hello = raw_data.groupby(['Subsciber_B', 'Label', 'Real_Event_Flag'], as_index=False).sum()
hello['Product'] = hello['Real_Event_Flag'] * hello['Count'].astype(float)
hello = hello.groupby(['Subsciber_B', 'Label'], as_index=False).sum()
hello['Receive Response'] = hello['Product'] / hello['Count']
A = hello[hello['Count'] > 30].sort_values(by='Receive Response', ascending=False)


## Constructing a table that determines the networks that performed best
## with making calls. (Assumes count > 30 for representative data)
raw_data = data.copy()
raw_data = raw_data.drop_duplicates().dropna()
raw_data.loc[:, 'Count'] = pd.Series(len(raw_data) * [1], index=raw_data.index)
hello2 = raw_data.groupby(['Subscriber_A', 'Label', 'Real_Event_Flag'], as_index=False).sum()
hello2['Product'] = hello2['Real_Event_Flag'] * hello2['Count'].astype(float)
hello2 = hello2.groupby(['Subscriber_A', 'Label'], as_index=False).sum()
hello2['Get Response'] = hello2['Product'] / hello2['Count']
B = hello2[hello2['Count'] > 30].sort_values(by='Get Response', ascending=False)


# Shown below are the nodes that performed best with receiving calls.
# Some nodes are able to maintain a 100% response rate given signal strength Medium.
A.head(30)

|       | Subsciber_B                        | Label  | Real_Event_Flag | Count | Product | Receive Response |
|-------|------------------------------------|--------|-----------------|-------|---------|------------------|
| 57    | 0x00159E0014337E04501CF282E95C4F4F | High   | 1               | 54    | 54.0    | 1.0              |
| 79043 | 0x8BA2042BEE28E4E59D4B6A12CA0B0FA1 | High   | 1               | 31    | 31.0    | 1.0              |
| 78543 | 0x8AC713E5E7C59B01D0AAB6D59B323AF3 | Medium | 1               | 65    | 65.0    | 1.0              |
| 78541 | 0x8AC713E5E7C59B01D0AAB6D59B323AF3 | High   | 1               | 64    | 64.0    | 1.0              |
| 78287 | 0x8A4EF70753E97E82413277A846F4C0CB | High   | 1               | 31    | 31.0    | 1.0              |
| 78227 | 0x8A33B5BAA9DC0C44D27AA2ACFDAFC51B | Medium | 1               | 73    | 73.0    | 1.0              |
| 78225 | 0x8A33B5BAA9DC0C44D27AA2ACFDAFC51B | High   | 1               | 42    | 42.0    | 1.0              |
| 78013 | 0x89CE7493854A25CA3D6810B6818D5A9D | Medium | 1               | 67    | 67.0    | 1.0              |
| 78011 | 0x89CE7493854A25CA3D6810B6818D5A9D | High   | 1               | 42    | 42.0    | 1.0              |
| 78002 | 0x89C8366F1CD7EB482E6A613D148AEF98 | High   | 1               | 32    | 32.0    | 1.0              |
| 78000 | 0x89C689F854724B5A69802B20D241385E | Medium | 1               | 47    | 47.0    | 1.0              |
| 77970 | 0x89BA1F806FCBBEA1C4698F8DB986129E | Medium | 1               | 37    | 37.0    | 1.0              |
| 77932 | 0x89AEBF7BE20BBFDA3B835D0684B131C0 | Medium | 1               | 41    | 41.0    | 1.0              |
| 77930 | 0x89AEBF7BE20BBFDA3B835D0684B131C0 | High   | 1               | 49    | 49.0    | 1.0              |
| 77733 | 0x895700E906F861318EEE2957FE51ABD8 | Medium | 1               | 53    | 53.0    | 1.0              |
| 77731 | 0x895700E906F861318EEE2957FE51ABD8 | High   | 1               | 48    | 48.0    | 1.0              |
| 77690 | 0x894AC69DD662B0BA9C8C417A330C3126 | Medium | 1               | 36    | 36.0    | 1.0              |
| 77386 | 0x88C7B2A8EA609E97222402B82258D071 | Medium | 1               | 52    | 52.0    | 1.0              |
| 77384 | 0x88C7B2A8EA609E97222402B82258D071 | High   | 1               | 42    | 42.0    | 1.0              |
| 78903 | 0x8B660CE1AFEF0B19A24E15565A8B678E | High   | 1               | 31    | 31.0    | 1.0              |
| 80180 | 0x8DA0D5664BD34CE38BD534D2F8193A15 | Medium | 1               | 32    | 32.0    | 1.0              |
| 71761 | 0x7EFC22C3E042401F068CE51EC78FBE0F | High   | 1               | 84    | 84.0    | 1.0              |
| 80509 | 0x8E2DBA3830F728338138E2461638296C | High   | 1               | 31    | 31.0    | 1.0              |
| 82245 | 0x912232FD1F7BD8747FBCF05C2D8542D4 | Medium | 1               | 33    | 33.0    | 1.0              |
| 81687 | 0x902811B88F11C8F12947240A5A88614C | Medium | 1               | 96    | 96.0    | 1.0              |
| 81685 | 0x902811B88F11C8F12947240A5A88614C | High   | 1               | 79    | 79.0    | 1.0              |
| 81611 | 0x90098090E10D308185D2E69B6A2B2A91 | Medium | 1               | 58    | 58.0    | 1.0              |
| 81609 | 0x90098090E10D308185D2E69B6A2B2A91 | High   | 1               | 34    | 34.0    | 1.0              |
| 81512 | 0x8FE57B9D26D0A710D23ED9FF81F8DA3C | Medium | 1               | 53    | 53.0    | 1.0              |
| 81510 | 0x8FE57B9D26D0A710D23ED9FF81F8DA3C | High   | 1               | 45    | 45.0    | 1.0              |


# Shown below are the nodes that performed best with making calls.
# Some nodes are able to maintain a 100% response rate given signal strength Medium.
B.head(30)

|       | Subscriber_A                       | Label  | Real_Event_Flag | Count | Product | Get Response |
|-------|------------------------------------|--------|-----------------|-------|---------|--------------|
| 36    | 0x00159E0014337E04501CF282E95C4F4F | High   | 1               | 73    | 73.0    | 1.0          |
| 48387 | 0x8B42C630D468330C5B39F7286FA81F2C | Medium | 1               | 74    | 74.0    | 1.0          |
| 49383 | 0x8E2981F1C9C84692C8EFE7D3F23AB63F | Medium | 1               | 50    | 50.0    | 1.0          |
| 49374 | 0x8E252CEF8BC8F9F0D5C5E9825ABCCF98 | Medium | 1               | 88    | 88.0    | 1.0          |
| 49372 | 0x8E252CEF8BC8F9F0D5C5E9825ABCCF98 | High   | 1               | 32    | 32.0    | 1.0          |
| 49286 | 0x8DF077608BB390B16A9D211BA5BCFAA3 | Medium | 1               | 31    | 31.0    | 1.0          |
| 49179 | 0x8DA0D5664BD34CE38BD534D2F8193A15 | Medium | 1               | 146   | 146.0   | 1.0          |
| 49177 | 0x8DA0D5664BD34CE38BD534D2F8193A15 | High   | 1               | 48    | 48.0    | 1.0          |
| 49037 | 0x8D3DC15563174A007454CF61DEA91A39 | Medium | 1               | 59    | 59.0    | 1.0          |
| 48740 | 0x8C4E98F01F02BD933027D4040E9FE31B | High   | 1               | 47    | 47.0    | 1.0          |
| 48615 | 0x8BFA10676B225B30D3EE84CFB3D0D073 | Medium | 1               | 89    | 89.0    | 1.0          |
| 48613 | 0x8BFA10676B225B30D3EE84CFB3D0D073 | High   | 1               | 40    | 40.0    | 1.0          |
| 48497 | 0x8BA2042BEE28E4E59D4B6A12CA0B0FA1 | Medium | 1               | 33    | 33.0    | 1.0          |
| 48430 | 0x8B660CE1AFEF0B19A24E15565A8B678E | Medium | 1               | 47    | 47.0    | 1.0          |
| 48428 | 0x8B660CE1AFEF0B19A24E15565A8B678E | High   | 1               | 72    | 72.0    | 1.0          |
| 48398 | 0x8B490B78906664B274347BFB7036AA02 | Medium | 1               | 74    | 74.0    | 1.0          |
| 48385 | 0x8B42C630D468330C5B39F7286FA81F2C | High   | 1               | 33    | 33.0    | 1.0          |
| 49392 | 0x8E2DBA3830F728338138E2461638296C | Medium | 1               | 65    | 65.0    | 1.0          |
| 48227 | 0x8AC713E5E7C59B01D0AAB6D59B323AF3 | Medium | 1               | 70    | 70.0    | 1.0          |
| 48225 | 0x8AC713E5E7C59B01D0AAB6D59B323AF3 | High   | 1               | 58    | 58.0    | 1.0          |
| 48121 | 0x8A80E40DE83C1F45A8288C578CE97A3F | High   | 1               | 48    | 48.0    | 1.0          |
| 48064 | 0x8A4EF70753E97E82413277A846F4C0CB | Medium | 1               | 33    | 33.0    | 1.0          |
| 48022 | 0x8A33B5BAA9DC0C44D27AA2ACFDAFC51B | Medium | 1               | 60    | 60.0    | 1.0          |
| 48020 | 0x8A33B5BAA9DC0C44D27AA2ACFDAFC51B | High   | 1               | 66    | 66.0    | 1.0          |
| 47948 | 0x8A013D5CA246709598388300032C31CE | Medium | 1               | 96    | 96.0    | 1.0          |
| 47891 | 0x89DAFDCAC7321C84E4EBC479E50FF7BF | Medium | 1               | 58    | 58.0    | 1.0          |
| 47878 | 0x89CE7493854A25CA3D6810B6818D5A9D | Medium | 1               | 64    | 64.0    | 1.0          |
| 47876 | 0x89CE7493854A25CA3D6810B6818D5A9D | High   | 1               | 40    | 40.0    | 1.0          |
| 47866 | 0x89C8366F1CD7EB482E6A613D148AEF98 | Medium | 1               | 45    | 45.0    | 1.0          |
| 47864 | 0x89C8366F1CD7EB482E6A613D148AEF98 | High   | 1               | 43    | 43.0    | 1.0          |

Modeling

We used the combined table from the data preparation step to identify bad nodes and leaders with several different approaches.

Several statistics were calculated using the igraph package in R. We had 118k observations, of which only 140 were labeled bad; the rest were unclassified. We therefore generated a new set of 600 nodes randomly selected from the unknowns and split it, together with the known bads, into training and test sets. With this proportion it is inevitable that some bad nodes in the unclassified set are labeled good, but the high proportion of known bads suppresses their influence.
Logistic regression was then performed. After correlation analysis, the following variables were selected: R_degree_out, R_degree_in, R_centr_eigen_no_names, R_coreness_in, R_page_rank and R_components_strong, with a classification threshold of 0.17.

With these coefficients, 5 bad nodes are misclassified and another 4 unknown nodes are labeled bad. The accuracy is 92.62%.
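The R/igraph model itself is not reproduced here; the following is an analogous Python sketch on synthetic features, only to illustrate the workflow (sampled unknowns plus known bads, a train/test split, logistic regression, and the 0.17 probability threshold). The feature values are random stand-ins, not real graph statistics:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for the six selected graph features (in the report:
# R_degree_out, R_degree_in, R_centr_eigen_no_names, R_coreness_in,
# R_page_rank, R_components_strong, computed with igraph in R).
X = rng.normal(size=(740, 6))            # 140 bads + 600 sampled unknowns
y = np.r_[np.ones(140), np.zeros(600)]   # 1 = bad, 0 = treated as good

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# Classify with the 0.17 probability threshold used in the report.
pred = (model.predict_proba(X_test)[:, 1] > 0.17).astype(int)
print('accuracy:', (pred == y_test).mean())
```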

Another approach finds leaders with an augmented PageRank algorithm. Unfortunately, the run over the whole dataset has not yet completed.

LeaderRank [1] (LR) is a modified PageRank [2] algorithm aimed at identifying influential users. It can find the leaders who cause quick opinion spreading. In this project we use LeaderRank to rank the nodes in the network, and we define leaders as the nodes with the highest LeaderRank scores.

The algorithm is implemented in Python 2.7. The input is an edge list with nodes renamed from 0 to N (N is the total number of unique nodes; here 118,691). For pairs of nodes that share more than one interaction, we only keep one edge. The output of the algorithm is the ranking score for each node. We use a list to store the mapping from node ID (0, 1, …, N) to node name (0xFC…). The result is shown below …
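The project's implementation is not included in this export; below is a minimal Python sketch of the standard LeaderRank procedure (add a ground node linked bidirectionally to every node, iterate the parameter-free random walk, then fold the ground node's score back into the real nodes). The toy graph at the end is illustrative:

```python
import networkx as nx

def leader_rank(G, tol=1e-8, max_iter=1000):
    """Minimal LeaderRank sketch on a directed graph."""
    H = G.copy() if G.is_directed() else G.to_directed()
    nodes = list(H.nodes())
    ground = object()  # a guaranteed-fresh node label for the ground node
    for v in nodes:
        H.add_edge(ground, v)
        H.add_edge(v, ground)

    # Every real node starts with score 1, the ground node with 0.
    score = dict.fromkeys(nodes, 1.0)
    score[ground] = 0.0
    for _ in range(max_iter):
        # Each node evenly redistributes its score along outgoing edges.
        new = {v: sum(score[u] / H.out_degree(u) for u in H.predecessors(v))
               for v in H}
        diff = sum(abs(new[v] - score[v]) for v in new)
        score = new
        if diff < tol:
            break

    # Spread the ground node's score evenly over the real nodes.
    share = score.pop(ground) / len(nodes)
    return {v: s + share for v, s in score.items()}

# Toy usage: node 0 receives calls from everyone, so it ranks highest.
G = nx.DiGraph([(1, 0), (2, 0), (3, 0), (0, 1)])
scores = leader_rank(G)
leader = max(scores, key=scores.get)
print(leader)  # 0
```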

Evaluation

One approach we used to define a "leader" is to set rules on the R_cores and look into each core's R_triangles to identify the leaders. See the Gephi demonstration (please note that the largest part of the dataset is spread around the first graph, on which we see only the cores).
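A minimal networkx sketch of this idea, assuming R_cores and R_triangles correspond to per-node core numbers and triangle counts; the exact leader rules are not specified in this report, so the rule below (most triangles within the highest core) is illustrative:

```python
import networkx as nx

# Toy undirected call graph; in the real pipeline these statistics come
# from the combined table as the R_cores and R_triangles columns.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (0, 3), (1, 3)])

cores = nx.core_number(G)    # k-core index per node (R_cores analogue)
triangles = nx.triangles(G)  # triangle count per node (R_triangles analogue)

# Illustrative leader rule: within the highest core, rank nodes by
# how many triangles they participate in.
k_max = max(cores.values())
top_core = [v for v, k in cores.items() if k == k_max]
leaders = sorted(top_core, key=lambda v: -triangles[v])
print(leaders[:3])
```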