Confidentialise Your Data with the randomNames Package

Sensitive data has its restrictions for good reason. Personal data such as names and other identifying information should be protected, and policies are in place to prevent accidental data breaches by governments and businesses. This can be a hurdle for data projects, particularly when socialising your work. A common technique is to strip each individual’s name and replace it with a random number. This does the job, but the story is much better told when you can refer to a person. Another method is to shuffle the names in the list, giving each individual a random first and last name that is already present in your data. This often leaves you uneasy: what if, by chance, you assign someone their own name? It could happen to a John Smith.

A better way is to use the randomNames package. It is simple to use, so this important step can be done without too much thought. Simply call the function below.

# Random names
library(randomNames)
randomNames(10)

##  [1] "Shibles, Suzanna"  "Foehl, Meghan"     "Marino, Jebediah"
##  [4] "May, Cheyenne"     "Lockhart, Isaiah"  "Vera, Ian"
##  [7] "al-Othman, Ilhaam" "Sanchez, Garrett"  "Aguilar, Madison"
## [10] "Johnson, Nico"
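To apply this to a real table, each real name needs one consistent pseudonym, and no pseudonym should coincide with a real name in the data. The following is a minimal sketch of one way to do that, not a prescribed workflow; the `people` data frame and its columns are hypothetical stand-ins for your own data.

```r
library(randomNames)

# Hypothetical input: a data frame with a sensitive `name` column
people <- data.frame(
  name  = c("Smith, John", "Nguyen, Anh", "Smith, John", "Brown, Ella"),
  score = c(71, 88, 64, 93),
  stringsAsFactors = FALSE
)

set.seed(42)  # keeps the pseudonyms reproducible between runs

real <- unique(people$name)

# Keep drawing until there are enough candidates that are distinct from
# each other and from every real name in the data
fake <- character(0)
while (length(fake) < length(real)) {
  candidates <- randomNames(length(real))
  fake <- unique(c(fake, setdiff(candidates, real)))
}
fake <- fake[seq_along(real)]

# One pseudonym per real person, applied consistently to every row
lookup <- setNames(fake, real)
people$name <- unname(lookup[people$name])
```

Because the replacement goes through a lookup table, repeated rows for the same person receive the same pseudonym, so grouped summaries still work on the confidentialised data.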

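Other parameters control the sex and ethnicity of the name. In the package’s coding, gender takes 0 for male and 1 for female, and ethnicity takes an integer code from 1 to 6. The call below asks for ten female names from ethnicity group 5 (White).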

randomNames(10,gender=1,ethnicity=5)

##  [1] "Cramer, Kylie"      "Baldocchi, Melissa" "Ray, Alexis"
##  [4] "Hoffman, Jennifer"  "Ellerbrock, Nikki"  "Sholdt, Kimberly"
##  [7] "Lewis, Emily"       "Riddle, Laura"      "Davison, Rachel"
## [10] "Mounts, Kristin"
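If your data already records gender and ethnicity, both arguments also accept vectors, so each generated name can match the attributes of its row. A small sketch, assuming a hypothetical `records` data frame whose columns already use the package’s integer codes:

```r
library(randomNames)

# Hypothetical records, coded the way randomNames expects
# (gender: 0 = male, 1 = female; ethnicity: integer codes 1 to 6)
records <- data.frame(
  gender    = c(0, 1, 1, 0, 1),
  ethnicity = c(5, 2, 4, 3, 5)
)

# One name per row; n is omitted, so the vector length decides how many
records$pseudonym <- randomNames(
  gender    = records$gender,
  ethnicity = records$ethnicity
)
```

This keeps the confidentialised names plausible for each individual, which helps when the story you are telling refers to people by name.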

Perhaps you only need the first or last name to be confidentialised.

randomNames(10,which.names="first")

##  [1] "David"     "Jeong Min" "Daniel"    "Jake"      "Nyamekye"
##  [6] "Lamontee"  "Juana"     "Connor"    "Mariah"    "Deante"

randomNames(10,which.names="last")

##  [1] "Williams"  "Pham"      "el-Hares"  "Farrell"   "Hayes"
##  [6] "al-Ullah"  "Cantu"     "Burnett"   "Lightfoot" "Nguyen"
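When both names are returned, the default layout is “Last, First”. As far as I know the package also exposes `name.order` and `name.sep` arguments for changing this; the sketch below assumes those arguments behave as documented and asks for “First Last” instead.

```r
library(randomNames)

# "First Last" instead of the default "Last, First"
full_names <- randomNames(5, name.order = "first.last", name.sep = " ")
full_names
```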

You can also return the complete feature set if needed, with gender and ethnicity randomised along with the names.

randomNames(10,return.complete.data=TRUE)

##     gender ethnicity first_name last_name
##  1:      1         2  Kathleena   Bellino
##  2:      0         5    Michael Farabaugh
##  3:      1         2       Erin   Sterett
##  4:      1         3        Eva    Waldon
##  5:      1         5     Fatima     Mills
##  6:      0         1     Joseph Cournoyer
##  7:      1         5      Sasha    Pawlak
##  8:      0         2      Mario      Pham
##  9:      0         2      Grant    Nguyen
## 10:      1         1   Danielle     Felix

Addresses are also sensitive since they can be identifying, whether it’s the property, the business or the people who reside there. Here’s a freebie: a simple function that uses the randomNames package to generate fake addresses.

# random address for australian household

randomAddress<-function(n,punit=0.2){

#Weighting number between 1-30 to be equally as likely and numbers beyond with decreasing probability

It’s not the most accurate recreation of Australian addresses, but it’s not bad. The proportions are consistent with the current Australian population figures. This looks better, and it is easier to tell a story with these addresses than with every address labelled “1 Aardvark Ave”… which I have regretfully done before. I intend to build a much better function that offers more flexibility when randomising addresses, simulates the Australian population and allows you to subset the addresses to a user-defined set of states and postcodes, a more realistic set of unit numbers, etc. But for now this does the job.
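For readers who want something to experiment with, here is a rough sketch of the general idea, written under my own assumptions rather than as a copy of the function above. The name `randomAddressSketch`, the treatment of `punit` as the probability of a unit number, the street suffixes and the uniform state sampling are all illustrative placeholders, not population-based figures.

```r
library(randomNames)

# Hypothetical sketch only: weights, suffixes and states are placeholders
randomAddressSketch <- function(n, punit = 0.2) {
  # Street numbers: 1-30 equally likely, 31-200 with decreasing probability
  numbers <- 1:200
  weights <- c(rep(1, 30), 30 / (31:200))
  number  <- sample(numbers, n, replace = TRUE, prob = weights)

  # Assumed meaning of punit: the chance an address gets a unit prefix like "3/"
  has_unit <- runif(n) < punit
  unit <- ifelse(has_unit, paste0(sample(1:20, n, replace = TRUE), "/"), "")

  # Random surnames stand in for street names
  street <- randomNames(n, which.names = "last")
  suffix <- sample(c("St", "Rd", "Ave", "Ct"), n, replace = TRUE)

  # States are sampled uniformly here, not by population share
  state <- sample(c("NSW", "VIC", "QLD", "WA", "SA", "TAS", "ACT", "NT"),
                  n, replace = TRUE)

  paste0(unit, number, " ", street, " ", suffix, ", ", state)
}

set.seed(1)
addresses <- randomAddressSketch(5)
addresses
```

Passing a `prob` vector to the state `sample()` call would be one way to bring the sketch closer to real population shares.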