Tutorial

We’d like to share a simple model for predicting whether players will leave (churn) or not. We used R as the analysis tool, and the final model has an F1-score of 0.4832.

We carried out the modeling in the following order.

Data Processing

Data Visualization

Data Modeling and Performance Comparison

Measurement of final test results

Data Processing

This is the stage where features for the model are created and integrated with the game log data of each account. The data provided were as follows.

Data Set   Sample Size                        Period (8 weeks)
           churn = 1   churn = 0   total
Train      1,200       2,800       4,000      2016-03-16 06 ~ 2016-05-11 06
Test1      900         2,100       3,000      2016-07-27 06 ~ 2016-09-21 06
Test2      900         2,100       3,000      2016-12-14 06 ~ 2017-02-28 06

First, we created features that might affect whether a player leaves. There are four types of features.

√ raw.val : The total count or amount over the period. That is, it shows the amount of usage per feature over the 8 weeks.
√ last.val : The count or amount in the final week of the same period. It is based on the assumption that the most recent play has the strongest effect on churn.
√ slope.val : The slope of a simple linear regression fitted to the weekly usage. It is based on the assumption that usage decreases as a player gets closer to leaving.
√ cv.val : The coefficient of variation of the weekly usage. It is based on the assumption that players who play irregularly are more likely to leave than players who play steadily.
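
To make the four feature types concrete, here is a minimal sketch in Python (the original analysis was done in R, so this is an illustration, not the authors' code). It assumes one feature's usage has already been aggregated into a list of weekly values, oldest week first. The function name, the use of the sample standard deviation for the coefficient of variation, and the choice to return 0 when there is no usage at all are assumptions of this sketch.

```python
from statistics import mean, stdev


def weekly_features(weekly):
    """Summarise one feature's weekly usage (oldest week first) for one account."""
    weeks = range(1, len(weekly) + 1)
    avg_week = mean(weeks)
    avg_use = mean(weekly)
    # Ordinary least-squares slope of usage regressed on the week index.
    sxy = sum((w - avg_week) * (u - avg_use) for w, u in zip(weeks, weekly))
    sxx = sum((w - avg_week) ** 2 for w in weeks)
    return {
        "raw.val": sum(weekly),      # total over the whole period
        "last.val": weekly[-1],      # final week only
        "slope.val": sxy / sxx,      # negative when usage is declining
        # Coefficient of variation; set to 0 here when there is no usage (assumption).
        "cv.val": stdev(weekly) / avg_use if avg_use else 0.0,
    }


# Hypothetical account whose usage drops by one unit every week.
features = weekly_features([8, 7, 6, 5, 4, 3, 2, 1])
```

For this hypothetical account, raw.val is 36, last.val is 1, and slope.val is -1, which is the kind of steadily declining pattern the slope feature is meant to capture.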

The generated variable list is as follows.

no.    feature               type
                             raw.val   last.val   slope.val   cv.val
(1)    dt_cnt                O         O          O           O
(2)    con_cnt               O         O          O           O
(3)    log_cnt               O         O          O           O
(4)    char_cnt              O         O          O           O
(5)    exp_amt               O         O          O           O
(6)    exp_mastery_amt       O         O          O           O
(7)    exhaust_cnt           O         O          O           O
(8)    exhaust_pvp_cnt       O         O          O           O
(9)    die_cnt               O         O          O           O
(10)   die_pvp_cnt           O         O          O           O
(11)   quest_cnt             O         O          O           O
(12)   party_join_cnt        O         O          O           O
(13)   party_kick_cnt        O         O          O           O
(14)   teleport_cnt          O         O          O           O
(15)   pve_cnt               O         O          O           O
(16)   pvp_cnt               O         O          O           O
(17)   arena_cnt             O         O          O           O
(18)   guild_join_cnt        O         O          O           O
(19)   guild_withdraw_cnt    O         O          O           O
(20)   char_lev_max          O (one type only)
(21)   char_lev_job          O (one type only)
(22)   guild_char_cnt        O (one type only)

(1) dt_cnt : the number of logged-in days
(2) con_cnt : the number of times a player has entered the world (Logid = 1003)
(3) log_cnt : the number of generated logs
(4) char_cnt : the number of characters played
(5) exp_amt : the total amount of experience acquired while playing
(6) exp_mastery_amt : the total amount of mastery-experience acquired while playing
(7) exhaust_cnt : the number of times a player has been exhausted (Logid = 1201)
(8) exhaust_pvp_cnt : the number of times a player has been exhausted by another player (Logid = 1201 and target_code = 10)
(9) die_cnt : the number of times a player has fainted or died (Logid = 1202)
(10) die_pvp_cnt : the number of times a player has fainted or been killed by another player (Logid = 1202 and target_code = 10)
(11) quest_cnt : the number of completed quests (Logid = 5004)
(12) party_join_cnt : the number of times a player joined a party (Logid = 1102)
(13) party_kick_cnt : the number of times a player was kicked from a party (Logid = 1106)
(14) teleport_cnt : the number of teleports (Logid = 1010)
(15) pve_cnt : the number of PvE kills (Logid = 1208) cf. PvE means a player killing an NPC.
(16) pvp_cnt : the number of PvP kills (Logid = 1209) cf. PvP means a player killing a PC.
(17) arena_cnt : the number of times a team or individual duel has ended (Logid = 1404, 1406)
(18) guild_join_cnt : the number of times a player joined a clan (Logid = 6005)
(19) guild_withdraw_cnt : the number of times a player left a clan (Logid = 6009)
(20) char_lev_max : the highest character level, i.e. the maximum value of the ‘actor_level’ field.
(21) char_lev_job : the job of the character with the highest level and the highest log volume among the characters played.
(22) guild_char_cnt : the number of characters that have joined a clan.
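
Most of the count features above are simply the number of log lines with a given Logid. The Python sketch below shows that idea for one account. The Logid values come from the definitions above; the function name, the (logid, target_code) record layout, and the sample data (including the unmapped id 9999) are assumptions made for illustration and do not reflect the real log schema.

```python
from collections import Counter

# Logid -> feature name, taken from the feature definitions.
LOGID_FEATURES = {
    1003: "con_cnt", 1201: "exhaust_cnt", 1202: "die_cnt", 5004: "quest_cnt",
    1102: "party_join_cnt", 1106: "party_kick_cnt", 1010: "teleport_cnt",
    1208: "pve_cnt", 1209: "pvp_cnt", 1404: "arena_cnt", 1406: "arena_cnt",
    6005: "guild_join_cnt", 6009: "guild_withdraw_cnt",
}


def count_features(logs):
    """Count events per feature for one account.

    `logs` is an iterable of (logid, target_code) pairs; this record layout
    is a simplification assumed for the example.
    """
    counts = Counter()
    for logid, target_code in logs:
        counts["log_cnt"] += 1                    # every log line counts
        name = LOGID_FEATURES.get(logid)
        if name:
            counts[name] += 1
        if logid == 1201 and target_code == 10:   # exhausted by another player
            counts["exhaust_pvp_cnt"] += 1
    return counts


# Hypothetical log lines; 9999 stands for any Logid that maps to no feature.
sample = [(1003, None), (1201, 10), (1201, 3), (5004, None), (1404, None), (9999, None)]
counts = count_features(sample)
```

Running the same count once per week instead of once per account would give the weekly series from which last.val, slope.val, and cv.val are derived.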

The query that extracts variables of type raw.val using R is as follows.

The following time table was created on the basis that a week runs from Wednesday, 6:00 a.m. to the next Wednesday, 6:00 a.m. Weekly and daily counts were calculated per week (bs_wk) and per day (bs_dt).

bs_wk     bs_dt       time_from          time_to
201612    20160316    2016-03-16 6:00    2016-03-17 6:00
201612    20160317    2016-03-17 6:00    2016-03-18 6:00
201612    20160318    2016-03-18 6:00    2016-03-19 6:00
201612    20160319    2016-03-19 6:00    2016-03-20 6:00
201612    20160320    2016-03-20 6:00    2016-03-21 6:00
201612    20160321    2016-03-21 6:00    2016-03-22 6:00
201612    20160322    2016-03-22 6:00    2016-03-23 6:00
201613    20160323    2016-03-23 6:00    2016-03-24 6:00
…         …           …                  …
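
A time table of this shape can be generated programmatically. The Python sketch below is one way to do it, under stated assumptions: the function names are hypothetical, and the week label is simply incremented from a supplied starting value (201612 for the Train period) because the rule used to derive bs_wk is not stated here. It also ignores year boundaries.

```python
from datetime import datetime, timedelta


def build_time_table(start, n_weeks, first_week_label):
    """Return (bs_wk, bs_dt, time_from, time_to) rows, one per game day.

    `start` should be a Wednesday at 06:00. The week label is incremented
    from `first_week_label`; that rule is an assumption of this sketch.
    """
    rows = []
    for day in range(n_weeks * 7):
        time_from = start + timedelta(days=day)
        time_to = time_from + timedelta(days=1)
        rows.append((
            first_week_label + day // 7,
            time_from.strftime("%Y%m%d"),
            time_from.strftime("%Y-%m-%d %H:%M"),
            time_to.strftime("%Y-%m-%d %H:%M"),
        ))
    return rows


def game_day(timestamp):
    """Map a log timestamp to its bs_dt; a game day runs from 06:00 to 06:00."""
    return (timestamp - timedelta(hours=6)).strftime("%Y%m%d")


table = build_time_table(datetime(2016, 3, 16, 6, 0), n_weeks=8, first_week_label=201612)
```

With these inputs the table has 56 rows and its last interval ends at 2016-05-11 06:00, which matches the end of the Train period. A log written at 05:59 on 2016-03-17 still belongs to bs_dt 20160316.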

Data Visualization

We visualized the features and looked for differences between players who leave and players who stay. The graph below shows the cumulative distribution function of each feature. As you can see, some features differ between the leaving and non-leaving groups.
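
For readers who want to reproduce this kind of comparison, the following Python sketch computes an empirical cumulative distribution function for each group. The function name and the sample values are hypothetical and only illustrate the mechanics; they are not taken from the competition data.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function x -> share of observations that are <= x."""
    ordered = sorted(values)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


# Hypothetical dt_cnt values (logged-in days) for a handful of accounts.
churned = [1, 2, 2, 5, 10]
retained = [20, 35, 40, 48, 56]

cdf_churned = ecdf(churned)
cdf_retained = ecdf(retained)
# Difference between the groups in the share of accounts with at most 10 logged-in days.
gap_at_10 = cdf_churned(10) - cdf_retained(10)
```

A large vertical gap between the two curves at some value of a feature suggests that the feature helps separate the groups.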

Data Modeling and Performance Comparison

We built models with Random Forest, SVM, and Lasso Regression. To select the final model, we split the Train set 70:30: 70% was used to train the models, and 30% was used to measure their performance.
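
A 70:30 split can be sketched as follows in Python. This assumes a plain random split with a fixed seed; whether the original split was stratified by churn label is not stated, and the function and variable names are hypothetical.

```python
import random


def split_train_validation(rows, train_share=0.7, seed=0):
    """Shuffle rows reproducibly and split them into two parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


account_ids = range(4000)   # stand-ins for the 4,000 Train accounts
learn, validate = split_train_validation(account_ids)
```

With 4,000 accounts this yields 2,800 accounts for training and 1,200 for performance measurement.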

AUC

[Figures: AUC plots for Random Forest, Support Vector Machine, and Lasso Regression]
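
AUC, the area under the ROC curve, equals the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner. The Python sketch below computes it directly from that definition. The labels and scores are hypothetical, and the quadratic pairwise loop is chosen for clarity, not speed.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical churn labels and predicted churn probabilities.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
value = auc(labels, scores)
```

In this example 11 of the 12 churner/non-churner pairs are ranked correctly, so the AUC is 11/12, or about 0.917.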

Confusion matrix

b-1. Random Forest

F1-score : 0.6079
Accuracy : 0.7084

                        Actual value
                        1        0
Predicted value    1    262      240
                   0    98       559

b-2. Support Vector Machine

F1-score : 0.5798
Accuracy : 0.7606

                        Actual value
                        1        0
Predicted value    1    198      125
                   0    162      714

b-3. Lasso Regression

F1-score : 0.6094
Accuracy : 0.7306

                        Actual value
                        1        0
Predicted value    1    252      215
                   0    108      624
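
The reported F1-scores and accuracies can be recomputed from the confusion matrices. The Python sketch below does so, treating churn = 1 as the positive class. The function name is hypothetical; the cell counts are read from the three matrices above, with rows as predicted values and columns as actual values.

```python
def f1_and_accuracy(tp, fp, fn, tn):
    """F1-score and accuracy with churn (1) as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return f1, accuracy


# (TP, FP, FN, TN) read from the confusion matrices.
matrices = {
    "Random Forest": (262, 240, 98, 559),
    "Support Vector Machine": (198, 125, 162, 714),
    "Lasso Regression": (252, 215, 108, 624),
}
results = {name: f1_and_accuracy(*cells) for name, cells in matrices.items()}
for name, (f1, acc) in results.items():
    print(f"{name}: F1 = {f1:.4f}, accuracy = {acc:.4f}")
```

The recomputed values agree with the figures reported for each model. Note that SVM has the highest accuracy while Lasso Regression has the highest F1-score, which is why the choice of metric matters when classes are imbalanced.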

Measurement of final test results

After evaluating Random Forest, SVM, and Lasso Regression, we selected the lasso model, which had the highest F1-score on the 30% hold-out data, as the final model and used it to predict the test data. The comparison between the predicted and actual values of the test data is as follows.