Look at that! Those dots aren't too far from being a straight line. It looks as though height and weight are more or less linearly related! We'll come back to that in a moment, but first, let's convert them to standard units.

A value in standard units is how many standard deviations above the mean it is. So, to convert a data point to standard units, we need the mean and the standard deviation of the data set that it came from. Then we subtract the mean from it, and then divide that by the standard deviation The standard_units function below, which comes from the Data 8 textbook, does this for a whole array of numbers at once.

In [4]:

defstandard_units(nums):return(nums-np.mean(nums))/np.std(nums)

Now we can create a version of the heightweight table that's in standard units.

Now that we've got standard units, though, it's easy to compute the correlation coefficient $r$. First, we need to take the product of our $x$ and $y$ values -- in this case, height and weight -- for each data point. We'll add a new column to our table that has these products.

r=np.mean(heightweight_product.column('Product of standardized height and weight'))r

Out[8]:

0.9910523777994954

Wow. $r$ is very close to 1, which means that our data has an almost perfect linear correlation.

What does "almost perfect" look like? Now that we have $r$, we can plot the regression line. (To be specific, we're plotting a segment of the regression line that goes from -2 to 2 on the $x$-axis, but we could plot a longer segment if we wanted to.)

The slope of the actual regression line is just sliiiiightly smaller than the slope of the line $y = x$. That is, the regression line is slightly higher than the red line on the left side of the plot, and slightly lower on the right side. If we scale up the plot size and tweak the line width and range of the axes, we can see the separation between the two lines a bit more clearly.

We can now compute the number of days since birth for each row in our data set. Sylvia was born on July 24, 2017, so we'll define that date to be her birthday and work from there.

In [13]:

birthday=date(2017,7,24)days_since_birth=[(date-birthday).daysfordateinheightweight_dates.column('Date')]heightweight_with_days=heightweight_dates.with_columns('Days since birth',days_since_birth)heightweight_with_days

Out[13]:

Date

Height (cm)

Weight (kg)

Days since birth

2017-07-28

53.3

4.204

4

2017-08-07

54.6

4.65

14

2017-08-25

55.9

5.425

32

2017-09-25

61

6.41

63

2017-11-28

63.5

7.985

127

2018-01-26

67.3

9.125

186

2018-04-27

71.1

10.39

277

2018-07-30

74.9

10.785

371

And now we can easily plot the relationship between days since birth and height, and the relationship between days since birth and weight.

How linear are these relationships? We can find out by computing $r$ for each relationship.

In [15]:

heightweight_with_days_standard=Table().with_columns("Date",heightweight_with_days.column('Date'),"Height (standard units)",standard_units(heightweight_with_days.column('Height (cm)')),"Weight (standard units)",standard_units(heightweight_with_days.column('Weight (kg)')),"Days since birth (standard units)",standard_units(heightweight_with_days.column('Days since birth')))heightweight_with_days_product=heightweight_with_days_standard.with_columns("Product of standardized height and days since birth",heightweight_with_days_standard.column('Height (standard units)')*heightweight_with_days_standard.column('Days since birth (standard units)'),"Product of standardized weight and days since birth",heightweight_with_days_standard.column('Weight (standard units)')*heightweight_with_days_standard.column('Days since birth (standard units)'))heightweight_with_days_product

Out[15]:

Date

Height (standard units)

Weight (standard units)

Days since birth (standard units)

Product of standardized height and days since birth

Product of standardized weight and days since birth

2017-07-28

-1.26135

-1.3158

-1.03738

1.3085

1.36498

2017-08-07

-1.08691

-1.13054

-0.957736

1.04097

1.08276

2017-08-25

-0.912464

-0.808628

-0.814374

0.743087

0.658526

2017-09-25

-0.228116

-0.399485

-0.567474

0.12945

0.226697

2017-11-28

0.107349

0.254728

-0.0577429

-0.00619863

-0.0147087

2018-01-26

0.617255

0.728253

0.412165

0.254411

0.30016

2018-04-27

1.12716

1.2537

1.13694

1.28151

1.42538

2018-07-30

1.63707

1.41777

1.88561

3.08686

2.67336

In [16]:

r_days_height=np.mean(heightweight_with_days_product.column('Product of standardized height and days since birth'))print("r_days_height:",r_days_height)r_days_weight=np.mean(heightweight_with_days_product.column('Product of standardized weight and days since birth'))print("r_days_weight:",r_days_weight)

r_days_height: 0.9798241135426988
r_days_weight: 0.9646449280709019

Both of these values for $r$ are pretty close to 1, although a little less so than the relationship between height and weight. Let's plot the regression lines:

Although $r$ is pretty close to 1 in each case, the shapes of the scatter plots look to me like the data would be better modeled by a curved line, especially the second one. Maybe the next part of the course will talk about those.

This website does not host notebooks, it only renders notebooks
available on other websites.