Identifying outliers in R with ggplot2

One of the first steps when working with a fresh data set is to plot its values to identify patterns and outliers. When outliers appear, it is often useful to know which observations they correspond to, in order to check whether they are caused by data entry errors, data anomalies or something else.

Unfortunately, ggplot2 does not have an interactive mode for identifying a point on a chart, so one has to look for other solutions such as GGobi (package rggobi) or iPlots.

However, if all that is needed is to give a “name” to the outliers, ggplot2’s labeling capabilities can be used for the purpose. While labeling all points would usually produce a crowded, hard-to-read plot, we can limit the labeling to the points that meet certain conditions, namely our outliers.

Here is an example to illustrate this useful technique. We will be using the following data set consisting of 100 observations. The data set has been generated using rnorm for x and y. The label column provides an identifier for each observation in the form of “Data N”, where N is the number of the observation.

To generate an outlier in the data set, the x value for observation number 87 has been changed to 100.
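A comparable data set could be built along the following lines. This is only a sketch: the seed, the mean and standard deviation passed to rnorm, and the rounding to whole numbers are assumptions made for illustration, so the values will not match the table below.

```r
# Sketch of how a comparable data set could be built.
# The seed, mean, standard deviation and rounding are assumptions.
set.seed(42)
n <- 100
data <- data.frame(
  x = round(rnorm(n, mean = 10, sd = 5)),
  y = round(rnorm(n, mean = 10, sd = 5)),
  label = paste0("Data", seq_len(n)),
  stringsAsFactors = FALSE  # keep labels as plain character strings
)

# Turn observation 87 into an outlier, as described in the text.
data$x[87] <- 100

head(data)
```

Keeping the label column as character rather than factor matters later, because ifelse would otherwise return the factor's integer codes instead of the names.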

Sample data set. Observation 87 is an outlier.

x y label

1715Data1

21416Data2

31214Data3

41410Data4

51716Data5

6812Data6

7139Data7

8124Data8

91215Data9

10108Data10

11812Data11

12715Data12

13118Data13

14179Data14

15147Data15

161018Data16

17168Data17

18156Data18

191016Data19

201316Data20

21411Data21

221511Data22

231210Data23

241212Data24

251111Data25

26107Data26

2717Data27

28129Data28

29614Data29

301219Data30

31115Data31

32315Data32

33137Data33

34815Data34

351415Data35

36018Data36

37113Data37

38-111Data38

39129Data39

40015Data40

411216Data41

421112Data42

4385Data43

44177Data44

451112Data45

46210Data46

47177Data47

4870Data48

49115Data49

50109Data50

51411Data51

52109Data52

53189Data53

54129Data54

55614Data55

56118Data56

57129Data57

581018Data58

59719Data59

60128Data60

61514Data61

62244Data62

63134Data63

64243Data64

65148Data65

661010Data66

67815Data67

68814Data68

69911Data69

701913Data70

71129Data71

7280Data72

73516Data73

74710Data74

7514-1Data75

76147Data76

771414Data77

781614Data78

79151Data79

8078Data80

811113Data81

82917Data82

83159Data83

84613Data84

85111Data85

8661Data86

871005Data87

881315Data88

8923Data89

90107Data90

91111Data91

9257Data92

93915Data93

94188Data94

95175Data95

9647Data96

97815Data97

98811Data98

99814Data99

100513Data100

Let’s use qplot to plot the data.

Plotting the data from the sample data set.

library(ggplot2)
qplot(data = data, x = x, y = y)

And here is the resulting plot.

The next step is to label the outlier, and only the outlier (the point with x = 100, observation number 87), with its name. This is as easy as adding a geom_text call to qplot and setting the condition that determines which points receive a label.
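A call of this kind could look like the sketch below. The stand-in data frame (seed, mean and standard deviation) and the hjust value of 1.1 are assumptions chosen for illustration; the condition inside ifelse is the one described in the paragraphs that follow.

```r
library(ggplot2)

# Hypothetical stand-in for the sample data; seed, mean and sd are assumptions.
set.seed(42)
data <- data.frame(
  x = round(rnorm(100, mean = 10, sd = 5)),
  y = round(rnorm(100, mean = 10, sd = 5)),
  label = paste0("Data", 1:100),
  stringsAsFactors = FALSE
)
data$x[87] <- 100  # the artificial outlier

# Label only the points that satisfy the outlier condition;
# every other point receives an empty string, so nothing is drawn for it.
p <- qplot(data = data, x = x, y = y) +
  geom_text(
    aes(label = ifelse(x > 4 * IQR(x) | y > 4 * IQR(y), label, "")),
    hjust = 1.1  # assumed value: places the label just to the left of its point
  )
print(p)
```

Recent ggplot2 releases mark qplot as deprecated; if it is unavailable, ggplot(data, aes(x, y)) + geom_point() should draw the same scatter plot and accept the same geom_text layer.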

The call to geom_text adds a label to every point, but only the points for which either x is greater than four times the interquartile range (IQR) of all x in data, or y is greater than four times the IQR of all y in data, receive a non-empty label (equal to the corresponding name in the label column). All the other points, those that are not outliers according to the condition we have set, receive an empty label, which means that no label is displayed for them.

The hjust parameter offsets the label slightly in the horizontal direction relative to the point, so that the two do not overlap.

Here is the graphical result, which correctly identifies the outlier as “Data 87”.

The right condition to specify within the ifelse statement to correctly select the outliers to label largely depends on the data set. Often it is a matter of trial and error (trying 1.5 * IQR, 2 * IQR, 3 * IQR, …) until only the “right” outliers are labeled.
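One way to shorten the trial and error is to count how many points each multiplier would label before drawing anything. The sketch below does this for a stand-in data set; the seed, the rnorm parameters and the list of multipliers are assumptions made for illustration.

```r
# Hypothetical stand-in for the sample data; seed, mean and sd are assumptions.
set.seed(42)
data <- data.frame(
  x = round(rnorm(100, mean = 10, sd = 5)),
  y = round(rnorm(100, mean = 10, sd = 5)),
  label = paste0("Data", 1:100),
  stringsAsFactors = FALSE
)
data$x[87] <- 100  # the artificial outlier

# For each candidate multiplier k, count the points that the
# condition "x > k * IQR(x) or y > k * IQR(y)" would label.
multipliers <- c(1.5, 2, 3, 4)
flagged <- sapply(multipliers, function(k) {
  sum(data$x > k * IQR(data$x) | data$y > k * IQR(data$y))
})

data.frame(multiplier = multipliers, flagged = flagged)
```

A multiplier that leaves only a handful of flagged points is a reasonable starting value for the ifelse condition in the plot.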

A small addition to the code above also allows us to label the outliers with their x and y values.
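One possible form of that addition is sketched below: the label passed to ifelse is built with paste0 so that it contains the name followed by the coordinates on a second line. The stand-in data, the hjust value and the exact label layout are assumptions, not taken from the original listing.

```r
library(ggplot2)

# Hypothetical stand-in for the sample data; seed, mean and sd are assumptions.
set.seed(42)
data <- data.frame(
  x = round(rnorm(100, mean = 10, sd = 5)),
  y = round(rnorm(100, mean = 10, sd = 5)),
  label = paste0("Data", 1:100),
  stringsAsFactors = FALSE
)
data$x[87] <- 100  # the artificial outlier

# Same condition as before, but the label of an outlier now also
# carries its x and y values on a second line.
p <- qplot(data = data, x = x, y = y) +
  geom_text(
    aes(label = ifelse(
      x > 4 * IQR(x) | y > 4 * IQR(y),
      paste0(label, "\n", "x=", x, ", y=", y),
      ""
    )),
    hjust = 1.1  # assumed value: places the label just to the left of its point
  )
print(p)
```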