Explaining clusterings of process instances

Abstract

This paper presents a technique that aims to increase human understanding of trace clustering solutions. The clustering techniques under scrutiny stem from the process mining domain, where clustering process instances is considered a useful way to analyse process data that exhibits a large variety of behaviour. Until now, the most commonly used method for assessing clustering solutions in this domain has been visual inspection of the clustering results. This paper proposes a more thorough approach based on the post hoc application of supervised learning with support vector machines to cluster results. Our approach learns concise rules that describe why a specific instance is included in a certain cluster, based on control-flow-based feature variables. An extensive experimental evaluation shows that our technique outperforms alternatives. Likewise, we are able to identify features that lead to shorter and more accurate explanations.

Appendix

Results of the experimental evaluation comparing SECPI with C4.5 and RIPPER averaged over clustering techniques and datasets, replicated with a cluster number of 6

| Technique | Feature | Accuracy | Explanation length | Runtime (s) | NrLeaves | TreeSize | NrRules | Explainable |
|---|---|---|---|---|---|---|---|---|
| C4.5 (default) | Exi | 0.78 | 4.68 | 5.22 | 8.72 | 16.45 | – | – |
| C4.5 (default) | Adf | 0.88 | 6.83 | 31.34 | 19.95 | 38.9 | – | – |
| C4.5 (default) | Sdf | 0.88 | 8.10 | 33.88 | 24.7 | 48.4 | – | – |
| C4.5 (default) | Awf | 0.85 | 6.30 | 33.81 | 15.28 | 29.55 | – | – |
| C4.5 (default) | Swf | 0.89 | 6.45 | 33.69 | 18.88 | 36.75 | – | – |
| C4.5 (default) | All | 0.90 | 7.07 | 113.09 | 21.4 | 41.8 | – | – |
| C4.5 (min. leaf size) | Exi | 0.75 | 2.98 | 2.40 | 4 | 7 | – | – |
| C4.5 (min. leaf size) | Adf | 0.80 | 3.68 | 13.26 | 5.6 | 10.2 | – | – |
| C4.5 (min. leaf size) | Sdf | 0.78 | 3.47 | 11.93 | 5.03 | 9.05 | – | – |
| C4.5 (min. leaf size) | Awf | 0.79 | 3.61 | 15.50 | 5.33 | 9.65 | – | – |
| C4.5 (min. leaf size) | Swf | 0.83 | 3.51 | 15.61 | 5.35 | 9.7 | – | – |
| C4.5 (min. leaf size) | All | 0.84 | 3.72 | 55.13 | 5.38 | 9.75 | – | – |
| Ripper (default) | Exi | 0.76 | 5.75 | 11.82 | – | – | 6.15 | – |
| Ripper (default) | Adf | 0.87 | 11.68 | 76.01 | – | – | 11.45 | – |
| Ripper (default) | Sdf | 0.88 | 12.81 | 76.35 | – | – | 14.2 | – |
| Ripper (default) | Awf | 0.84 | 9.76 | 92.34 | – | – | 9.4 | – |
| Ripper (default) | Swf | 0.88 | 11.13 | 93.27 | – | – | 10.8 | – |
| Ripper (default) | All | 0.89 | 12.56 | 311.92 | – | – | 11.6 | – |
| Ripper (min. rule weight) | Exi | 0.68 | 2.10 | 4.98 | – | – | 2.52 | – |
| Ripper (min. rule weight) | Adf | 0.72 | 3.32 | 25.26 | – | – | 3.15 | – |
| Ripper (min. rule weight) | Sdf | 0.70 | 3.71 | 22.21 | – | – | 2.75 | – |
| Ripper (min. rule weight) | Awf | 0.72 | 3.30 | 31.19 | – | – | 2.85 | – |
| Ripper (min. rule weight) | Swf | 0.74 | 3.87 | 32.02 | – | – | 3.1 | – |
| Ripper (min. rule weight) | All | 0.75 | 3.72 | 107.92 | – | – | 3.1 | – |
| SECPI | Exi | 0.93 | 1.66 | 3.35 | – | – | – | 0.75 |
| SECPI | Adf | 0.97 | 3.20 | 2.27 | – | – | – | 0.94 |
| SECPI | Sdf | 0.96 | 2.18 | 1.88 | – | – | – | 0.9 |
| SECPI | Awf | 0.95 | 3.50 | 3.57 | – | – | – | 0.88 |
| SECPI | Swf | 0.96 | 3.59 | 2.93 | – | – | – | 0.92 |
| SECPI | All | 0.97 | 8.46 | 8.55 | – | – | – | 0.96 |

The classification accuracy and explanation length of Pareto-optimal results are presented in boldface, as are values which are not significantly different from the Pareto-optimal combinations at a 5% level

Table 10

Results of the experimental evaluation comparing SECPI with C4.5 and RIPPER averaged over clustering techniques and datasets, replicated with a cluster number of 8

| Technique | Feature | Accuracy | Explanation length | Runtime (s) | NrLeaves | TreeSize | NrRules | Explainable |
|---|---|---|---|---|---|---|---|---|
| C4.5 (default) | Exi | 0.80 | 4.80 | 3.04 | 12.69 | 24.39 | – | – |
| C4.5 (default) | Adf | 0.91 | 6.92 | 19.90 | 25.17 | 49.33 | – | – |
| C4.5 (default) | Sdf | 0.91 | 7.68 | 20.91 | 26.07 | 51.14 | – | – |
| C4.5 (default) | Awf | 0.90 | 6.35 | 21.98 | 19.19 | 37.39 | – | – |
| C4.5 (default) | Swf | 0.92 | 6.69 | 22.16 | 22.72 | 44.44 | – | – |
| C4.5 (default) | All | 0.93 | 6.99 | 72.83 | 23.28 | 45.56 | – | – |
| C4.5 (min. leaf size) | Exi | 0.76 | 2.99 | 1.53 | 4.15 | 7.31 | – | – |
| C4.5 (min. leaf size) | Adf | 0.83 | 3.49 | 8.72 | 5.28 | 9.56 | – | – |
| C4.5 (min. leaf size) | Sdf | 0.81 | 3.51 | 7.33 | 4.97 | 8.94 | – | – |
| C4.5 (min. leaf size) | Awf | 0.82 | 3.49 | 10.39 | 5.06 | 9.11 | – | – |
| C4.5 (min. leaf size) | Swf | 0.84 | 3.46 | 10.81 | 5.11 | 9.22 | – | – |
| C4.5 (min. leaf size) | All | 0.86 | 3.59 | 33.03 | 5.33 | 9.67 | – | – |
| Ripper (default) | Exi | 0.77 | 7.13 | 9.51 | – | – | 7.6 | – |
| Ripper (default) | Adf | 0.90 | 16.44 | 60.20 | – | – | 14.94 | – |
| Ripper (default) | Sdf | 0.90 | 15.96 | 55.80 | – | – | 16 | – |
| Ripper (default) | Awf | 0.88 | 13.32 | 73.38 | – | – | 12.04 | – |
| Ripper (default) | Swf | 0.91 | 15.08 | 75.83 | – | – | 13.64 | – |
| Ripper (default) | All | 0.92 | 16.30 | 248.24 | – | – | 14.06 | – |
| Ripper (min. rule weight) | Exi | 0.69 | 2.39 | 4.72 | – | – | 2.28 | – |
| Ripper (min. rule weight) | Adf | 0.75 | 3.43 | 23.66 | – | – | 2.94 | – |
| Ripper (min. rule weight) | Sdf | 0.74 | 3.83 | 20.94 | – | – | 2.74 | – |
| Ripper (min. rule weight) | Awf | 0.75 | 3.41 | 31.24 | – | – | 2.79 | – |
| Ripper (min. rule weight) | Swf | 0.76 | 4.16 | 32.74 | – | – | 2.94 | – |
| Ripper (min. rule weight) | All | 0.76 | 3.78 | 106.32 | – | – | 3 | – |
| SECPI | Exi | 0.95 | 1.38 | 7.20 | – | – | – | 0.78 |
| SECPI | Adf | 0.98 | 2.43 | 10.40 | – | – | – | 0.96 |
| SECPI | Sdf | 0.98 | 2.15 | 7.02 | – | – | – | 0.94 |
| SECPI | Awf | 0.97 | 2.84 | 21.69 | – | – | – | 0.91 |
| SECPI | Swf | 0.98 | 3.01 | 19.99 | – | – | – | 0.94 |
| SECPI | All | 0.98 | 7.14 | 58.86 | – | – | – | 0.96 |

The classification accuracy and explanation length of Pareto-optimal results are presented in boldface, as are values which are not significantly different from the Pareto-optimal combinations at a 5% level
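
As an illustration of what Pareto-optimality means for the two criteria in the note above, the following minimal sketch computes the non-dominated (accuracy, explanation length) combinations from the averaged values reported in Table 10. It treats higher accuracy and shorter explanations as better and uses plain dominance on those averages only; the significance test at the 5% level mentioned in the note is not reproduced. The names `Result`, `dominates` and `pareto_front` are hypothetical helpers introduced for this example and are not part of SECPI.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    technique: str
    feature: str
    accuracy: float            # higher is better
    explanation_length: float  # lower is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is no worse than b on both criteria and strictly better on one."""
    no_worse = (a.accuracy >= b.accuracy
                and a.explanation_length <= b.explanation_length)
    strictly_better = (a.accuracy > b.accuracy
                       or a.explanation_length < b.explanation_length)
    return no_worse and strictly_better


def pareto_front(results: List[Result]) -> List[Result]:
    """Keep every result that no other result dominates (input order preserved)."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


FEATURES = ["Exi", "Adf", "Sdf", "Awf", "Swf", "All"]

# (accuracy, explanation length) per feature set, copied from Table 10 (8 clusters).
TABLE_10 = {
    "C4.5 (default)": [(0.80, 4.80), (0.91, 6.92), (0.91, 7.68),
                       (0.90, 6.35), (0.92, 6.69), (0.93, 6.99)],
    "C4.5 (min. leaf size)": [(0.76, 2.99), (0.83, 3.49), (0.81, 3.51),
                              (0.82, 3.49), (0.84, 3.46), (0.86, 3.59)],
    "Ripper (default)": [(0.77, 7.13), (0.90, 16.44), (0.90, 15.96),
                         (0.88, 13.32), (0.91, 15.08), (0.92, 16.30)],
    "Ripper (min. rule weight)": [(0.69, 2.39), (0.75, 3.43), (0.74, 3.83),
                                  (0.75, 3.41), (0.76, 4.16), (0.76, 3.78)],
    "SECPI": [(0.95, 1.38), (0.98, 2.43), (0.98, 2.15),
              (0.97, 2.84), (0.98, 3.01), (0.98, 7.14)],
}

ROWS = [
    Result(technique, feature, acc, length)
    for technique, values in TABLE_10.items()
    for feature, (acc, length) in zip(FEATURES, values)
]

front = pareto_front(ROWS)
for r in front:
    print(f"{r.technique} / {r.feature}: "
          f"accuracy={r.accuracy:.2f}, length={r.explanation_length:.2f}")
```

On these averaged values, only SECPI with the Exi and Sdf feature sets remain non-dominated. The boldface in the original table may cover more combinations, since it also marks values that are not significantly different from the Pareto-optimal ones.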