
After seeing Brian Sherwin’s presentation on Jupyter notebooks and participating in Matthew Renze’s Practical Data Science with R Workshop at CodeMash 2017, I wanted to play with both technologies more. Here we are, many months later, and I still have that curiosity and excitement over both. So… I’m presenting to you one of my adventures with R, as documented in an R Notebook (yes, similar to the Jupyter notebooks I’ve seen in Brian’s presentation). I am using RStudio to generate this.

The Project

My husband has various temperature and humidity sensors scattered throughout the house, recording data points to a MySQL server. The data is stored in a table that looks like this:

|    | id \<int\> | date \<chr\>        | sensorname \<chr\> | sensorvalue \<dbl\> |
|----|------------|---------------------|--------------------|---------------------|
| 1  | 31         | 2016-12-18 22:20:23 | temp5              | 63.6116             |
| 2  | 32         | 2016-12-18 22:20:23 | finalDHTTempF2     | 68.0000             |
| 3  | 33         | 2016-12-18 22:20:23 | humidity2          | 36.0000             |
| 4  | 34         | 2016-12-18 22:25:23 | temp5              | 64.1750             |
| 5  | 35         | 2016-12-18 22:25:23 | finalDHTTempF2     | 68.0000             |
| 6  | 36         | 2016-12-18 22:25:23 | humidity2          | 36.0000             |
| 7  | 37         | 2016-12-18 22:30:23 | temp5              | 63.7250             |
| 8  | 38         | 2016-12-18 22:30:23 | finalDHTTempF2     | 69.8000             |
| 9  | 39         | 2016-12-18 22:30:23 | humidity2          | 35.0000             |
| 10 | 40         | 2016-12-18 22:35:23 | temp5              | 63.3866             |

I wanted to use his dataset as the testbed for my adventures in applying R.

Our current dataset is a data frame with 198,164 rows.

The Problem

Looking at this data, the first thing I thought was: untidy. There has to be a better way. When I think of tidy data, I think of the tidyr package, which helps make data tidy and easier to work with. Specifically, I thought of the spread() function, which I could use to break things up into columns. Once the data was spread into appropriate columns, I figured I could operate on it a bit better.

The Adventures so far…

As seen in the date field, the values are logged with their times. This is why we have so many data points. The first thing I wanted to do was group the values into daily means.

Cleaning up Dates

I am using lubridate to make some of my date management a bit easier. I am using dplyr to do the chaining with %>%. I grouped my data by sensor then by date parts – year, month, and day. After grouping the data, I summarized the data to get daily means. Once the data was summarized, I spread it out to make it more meaningful:
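A sketch of that pipeline, assuming the raw data frame is called sensordata and has the columns shown earlier (the variable names here are my own, not necessarily the ones in the original notebook):

```r
library(dplyr)
library(tidyr)
library(lubridate)

# A few rows standing in for the real 198,164-row table.
sensordata <- data.frame(
  date = c("2016-12-18 22:20:23", "2016-12-18 22:25:23",
           "2016-12-18 22:20:23", "2016-12-18 22:25:23"),
  sensorname = c("temp5", "temp5", "humidity2", "humidity2"),
  sensorvalue = c(63.6116, 64.1750, 36, 36),
  stringsAsFactors = FALSE
)

# Group by sensor and by date parts, summarize into daily means,
# then spread each sensor out into its own column.
daily <- sensordata %>%
  mutate(date = ymd_hms(date)) %>%
  group_by(sensorname, year = year(date),
           month = month(date), day = day(date)) %>%
  summarise(sensorvalue = mean(sensorvalue)) %>%
  ungroup() %>%
  spread(sensorname, sensorvalue)
```

After the spread, each row is one calendar day and each sensor has its own column, which is exactly the shape shown below.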

|    | year(date) \<dbl\> | month(date) \<dbl\> | day(date) \<int\> | finalDHTTempF1 \<dbl\> | finalDHTTempF2 \<dbl\> | finalDHTTempF3 \<dbl\> | humidity1 \<dbl\> |
|----|------|----|----|----|----------|----|----|
| 1  | 2016 | 12 | 18 | NA | 68.34286 | NA | NA |
| 2  | 2016 | 12 | 19 | NA | 67.77578 | NA | NA |
| 3  | 2016 | 12 | 20 | NA | 67.88750 | NA | NA |
| 4  | 2016 | 12 | 21 | NA | 68.95625 | NA | NA |
| 5  | 2016 | 12 | 22 | NA | 69.74375 | NA | NA |
| 6  | 2016 | 12 | 23 | NA | 69.71875 | NA | NA |
| 7  | 2016 | 12 | 24 | NA | 70.97500 | NA | NA |
| 8  | 2016 | 12 | 25 | NA | 70.85625 | NA | NA |
| 9  | 2016 | 12 | 26 | NA | 71.78750 | NA | NA |
| 10 | 2016 | 12 | 27 | NA | 71.08750 | NA | NA |

As a developer, I find working with date parts a bit annoying, so I want to compress these back into a single date field. Since the month and day fields can be single digits, I need to account for that – thankfully, str_pad from stringr makes that easy. str_pad, paste, and some date functions from lubridate make this data cleanup a little easier:
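For example, a minimal sketch of the recombination step (the real code pastes the grouped year/month/day columns rather than these literal values):

```r
library(stringr)
library(lubridate)

# Single-digit date parts, as they can come out of the grouping step.
y <- 2016
m <- 1
d <- 5

# Zero-pad the short parts, paste them into an ISO-style string, and parse.
datestr <- paste(y,
                 str_pad(m, 2, pad = "0"),
                 str_pad(d, 2, pad = "0"),
                 sep = "-")
date <- ymd(datestr)
```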

| finalDHTTempF1 \<dbl\> | finalDHTTempF2 \<dbl\> | finalDHTTempF3 \<dbl\> | humidity1 \<dbl\> | humidity2 \<dbl\> | humidity3 \<dbl\> | temp4 \<dbl\> | temp5 \<dbl\> |
|----|----------|----|----|----------|----|----|----------|
| NA | 68.34286 | NA | NA | 35.80952 | NA | NA | 63.08703 |
| NA | 67.77578 | NA | NA | 35.55709 | NA | NA | 62.37841 |
| NA | 67.88750 | NA | NA | 35.50347 | NA | NA | 62.41281 |
| NA | 68.95625 | NA | NA | 35.46528 | NA | NA | 63.40109 |
| NA | 69.74375 | NA | NA | 35.24306 | NA | NA | 64.36713 |
| NA | 69.71875 | NA | NA | 35.25000 | NA | NA | 64.33000 |

Cleaning up NAs

Now some of the data shows NA. If there's anything I've learned working with data, it's that NULL and NA can be problematic, depending on the data tool and the user operating said tool. In this case, I can easily convert my NA values to 0 without ruining the meaning of the data:

| finalDHTTempF1 \<dbl\> | finalDHTTempF2 \<dbl\> | finalDHTTempF3 \<dbl\> | humidity1 \<dbl\> | humidity2 \<dbl\> | humidity3 \<dbl\> | temp4 \<dbl\> | temp5 \<dbl\> |
|----|----------|----|----|----------|----|----|----------|
| 0  | 68.34286 | 0  | 0  | 35.80952 | 0  | 0  | 63.08703 |
| 0  | 67.77578 | 0  | 0  | 35.55709 | 0  | 0  | 62.37841 |
| 0  | 67.88750 | 0  | 0  | 35.50347 | 0  | 0  | 62.41281 |
| 0  | 68.95625 | 0  | 0  | 35.46528 | 0  | 0  | 63.40109 |
| 0  | 69.74375 | 0  | 0  | 35.24306 | 0  | 0  | 64.36713 |
| 0  | 69.71875 | 0  | 0  | 35.25000 | 0  | 0  | 64.33000 |
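One way to do that replacement across the whole data frame (a sketch; the toy daily frame here stands in for the summarized one above):

```r
# A toy frame with the kind of gaps that spread() leaves behind.
daily <- data.frame(
  finalDHTTempF2 = c(68.34286, 67.77578),
  humidity2 = c(35.80952, NA),
  temp5 = c(NA, 62.37841)
)

# Replace every NA in the frame with 0 in one shot.
daily[is.na(daily)] <- 0
```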

Presentation

So now that I have daily averages in a format that I can work with, let’s do something meaningful with the data – let’s plot it! I am using ggplot2 for plotting.
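A sketch of one such plot, assuming the daily means live in a wide frame like the one above (I gather it back into long form first, since ggplot2 prefers that shape; the column names are illustrative):

```r
library(ggplot2)
library(tidyr)

# Toy daily-means frame; the real one comes from the pipeline above.
daily <- data.frame(
  date = as.Date(c("2016-12-18", "2016-12-19", "2016-12-20")),
  finalDHTTempF2 = c(68.34286, 67.77578, 67.88750),
  temp5 = c(63.08703, 62.37841, 62.41281)
)

# Gather the sensor columns back into long form for plotting.
long <- gather(daily, sensor, value, -date)

# One line per sensor across the days.
p <- ggplot(long, aes(x = date, y = value, colour = sensor)) +
  geom_line() +
  labs(x = "Date", y = "Daily mean", colour = "Sensor")
```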

Conclusion

So far, I’m having fun putting my skills to work, especially with this dataset. I’m at the tail end of the second course of an R specialization on Coursera. Between CodeMash and Coursera, I’ve been enjoying my exploRation into R. Here’s to many adventures ahead!