5.5: Fitting a Line to Data

The real-world situations you have studied so far can be described exactly by linear equations. However, most real data is messy and does not fit a line in slope-intercept form with 100% accuracy. Because of this, some people spend entire careers fitting lines to data. The resulting equations are used to make predictions, as you will see in the next lesson.

This lesson focuses on graphing scatter plots and using the scatter plot to find a linear equation that will best fit the data.

A scatter plot is a plot of all the ordered pairs in the table. This means that a scatter plot is a relation, and not necessarily a function. Also, the scatter plot is discrete, as it is a set of distinct points. Even when we expect the relationship we are analyzing to be linear, we should not expect that all the points would fit perfectly on a straight line. Rather, the points will be “scattered” about a straight line. There are many reasons why the data does not fall perfectly on a line. Such reasons include measurement errors and outliers.

Measurement error is the amount by which a reading taken from a ruler or a graph differs from the true value.

An outlier is a data point that does not fit with the general pattern of the data. It tends to be “outside” the majority of the scatter plot.

Example: Make a scatter plot of the following ordered pairs.

(0, 2), (1, 4.5), (2, 9), (3, 11), (4, 13), (5, 18), (6, 19.5)

Solution: Graph each ordered pair on one Cartesian plane.

Fitting a Line to Data

Notice that the points graphed on the plane above look like they might be part of a straight line, although they would not fit perfectly. If the points were perfectly lined up, it would be quite easy to draw a line through all of them and find the equation of that line. However, if the points are “scattered,” we try to find a line that best fits the data. The graph below shows several potential lines of best fit.

You see that we can draw many lines through the points in our data set. These lines have equations that are very different from each other. We want to use the line that is closest to all the points on the graph. The best candidate in our graph is the red line \begin{align*}A\end{align*}. Line \begin{align*}A\end{align*} is the line of best fit for this scatter plot.
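One way to make "closest to all the points" concrete is to add up the squared vertical distances from each point to a candidate line; the smaller the total, the closer the line. The Python sketch below compares the line this lesson identifies as line A, \begin{align*}y=3.25x+1.25\end{align*}, with two alternatives. Lines B and C and the function name are assumptions invented for illustration; they are not the other lines from the original graph, and squared distance is only one common way to measure closeness.

```python
# Data from the scatter plot example above.
points = [(0, 2), (1, 4.5), (2, 9), (3, 11), (4, 13), (5, 18), (6, 19.5)]

def sum_squared_error(slope, intercept, data):
    """Add up the squared vertical distances between each point and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in data)

# Candidate lines as (slope, intercept).
# Line A is the lesson's line of best fit; lines B and C are hypothetical
# alternatives made up for this comparison.
candidates = {
    "A": (3.25, 1.25),
    "B": (2.0, 4.0),
    "C": (4.0, 0.0),
}

errors = {name: sum_squared_error(m, b, points) for name, (m, b) in candidates.items()}
best = min(errors, key=errors.get)

for name, err in errors.items():
    print(f"Line {name}: total squared error = {err}")
print("Closest line:", best)
```

Under this measure, line A has the smallest total of the three candidates, which agrees with the choice made by eye.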

Writing Equations for Lines of Best Fit

Once you have decided upon your line of best fit, you need to write its equation by finding two points on it and using either:

Point-slope form;

Standard form; or

Slope-intercept form.

The form you use will depend upon the situation and the ease of finding the \begin{align*}y-\end{align*}intercept.

Using the red line from the example above, locate two points on the line.

The equation for the line that fits the data best is \begin{align*}y=3.25x+1.25\end{align*}.
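The step from two points to an equation can be sketched in Python. The two points used below, (1, 4.5) and (3, 11), are an assumption: they are data points that happen to lie on the line above, not necessarily the points read from the original graph. The function name `line_through` is also just illustrative.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the non-vertical line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no slope-intercept form")
    slope = (y2 - y1) / (x2 - x1)   # rise over run
    intercept = y1 - slope * x1     # solve y1 = slope * x1 + b for b
    return slope, intercept

# Assumed points on the line of best fit (chosen for illustration).
slope, intercept = line_through((1, 4.5), (3, 11))
print(f"y = {slope}x + {intercept}")  # prints: y = 3.25x + 1.25
```

Any two distinct points on the same line give the same slope and intercept, so the choice of points affects only how easy the arithmetic is.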

Finding Equations for Lines of Best Fit Using a Calculator

Graphing calculators can make writing equations of best fit easier and more accurate. Two people working with the same data might get two different equations because they would be drawing different lines. To get the most accurate equation for the line, we can use a graphing calculator. The calculator uses a mathematical algorithm to find the line that minimizes error between the data points and the line of best fit.

Example: Use a graphing calculator to find the equation of the line of best fit for the following data: (3, 12), (8, 20), (1, 7), (10, 23), (5, 18), (8, 24), (11, 30), (2, 10).

Step 1: Input the data into your calculator.

Press [STAT] and choose the [EDIT] option.

Input the data into the table by entering the \begin{align*}x\end{align*} values in the first column and the \begin{align*}y\end{align*} values in the second column.

Step 2: Find the equation of the line of best fit.

Press [STAT] again and use the right arrow to select [CALC] at the top of the screen.

Choose option number 4: \begin{align*}LinReg(ax+b)\end{align*} and press [ENTER]. The calculator will display \begin{align*}LinReg(ax+b)\end{align*}.

Press [ENTER] and you will be given the \begin{align*}a\end{align*} and \begin{align*}b\end{align*} values.

Here \begin{align*}a\end{align*} represents the slope and \begin{align*}b\end{align*} represents the \begin{align*}y-\end{align*}intercept of the equation. The linear regression line is \begin{align*}y=2.01x+5.94\end{align*}.

Step 3: Draw the scatter plot.

To draw the scatter plot, press [2nd] [Y=] to open [STATPLOT].

Choose Plot 1 and press [ENTER].

Select the On option and set the Type to scatter plot (the one highlighted in black).

Make sure that the \begin{align*}X\end{align*} list and \begin{align*}Y\end{align*} list names match the names of the columns of the table in Step 1.

Choose the box or plus as the mark since the simple dot may make it difficult to see the points.

Press [GRAPH] and adjust the window size so you can see all the points in the scatter plot.

Step 4: Draw the line of best fit through the scatter plot.

Press [Y=].

Enter the equation of the line of best fit that you just found: \begin{align*}Y_1 = 2.01X+5.94\end{align*}.

Press [GRAPH].
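The calculator's result can be checked without a calculator. The sketch below assumes that the "mathematical algorithm" is ordinary least squares, the usual meaning of linear regression: it picks the slope and intercept that minimize the sum of squared vertical distances to the data. The function name `least_squares_line` is illustrative and is not a calculator command.

```python
# Data from the calculator example above.
data = [(3, 12), (8, 20), (1, 7), (10, 23), (5, 18), (8, 24), (11, 30), (2, 10)]

def least_squares_line(points):
    """Return (a, b) for the least-squares line y = a*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # How x and y vary together, and how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # the line passes through (mean_x, mean_y)
    return a, b

a, b = least_squares_line(data)
print(f"y = {a:.2f}x + {b:.2f}")  # prints: y = 2.01x + 5.94
```

The output agrees with the values of \begin{align*}a\end{align*} and \begin{align*}b\end{align*} found in Step 2.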

Using Lines of Best Fit to Solve Situations

Example: Gal is training for a 5K race (a total of 5000 meters, or about 3.1 miles). The following table shows her average time for each month of her training program. Assume that her times decrease linearly over time. Find an equation of a line of fit. Predict her running time if her race is in August.

| Month | Month number | Average time (minutes) |
|---|---|---|
| January | 0 | 40 |
| February | 1 | 38 |
| March | 2 | 39 |
| April | 3 | 38 |
| May | 4 | 33 |
| June | 5 | 30 |

Solution: Begin by making a scatter plot of Gal’s running times. The independent variable, \begin{align*}x\end{align*}, is the month number and the dependent variable, \begin{align*}y\end{align*}, is the running time in minutes. Plot all the points in the table on the coordinate plane.

Draw a line of fit through the points. The line used here has the equation \begin{align*}y=-1.75x+41\end{align*}.

Since the slope is negative, Gal’s 5K time decreases as the months pass. The slope tells us that Gal’s running time decreases by 1.75 minutes per month.

The \begin{align*}y-\end{align*}intercept tells us that when Gal started training, the line gives a 5K time of 41 minutes. This is only an estimate; her actual time was 40 minutes.

The problem asks us to predict Gal’s running time in August. Since June is assigned to month number five, August will be month number seven. Substitute \begin{align*}x=7\end{align*} into the line of best fit equation:

\begin{align*}y=-1.75(7)+41=28.75\end{align*}

The equation predicts that Gal will be running the 5K race in 28.75 minutes.
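The prediction can be reproduced with a few lines of Python. This sketch takes the slope (a decrease of 1.75 minutes per month) and the \begin{align*}y-\end{align*}intercept (41 minutes) from the discussion above; the function name `predicted_time` is an illustrative choice.

```python
# Gal's average 5K times in minutes for month numbers 0 (January) to 5 (June).
times = [40, 38, 39, 38, 33, 30]

def predicted_time(month_number, slope=-1.75, intercept=41):
    """Evaluate the line of fit y = slope * x + intercept at a month number."""
    return slope * month_number + intercept

# Compare the line's estimate with each recorded time.
for month, actual in enumerate(times):
    print(f"month {month}: actual {actual}, line gives {predicted_time(month)}")

# August is month number 7.
august = predicted_time(7)
print("Predicted August time:", august, "minutes")  # 28.75
```

Comparing the two columns shows that the line does not pass through every data point; it only follows the general trend, which is why the January estimate is 41 minutes rather than 40.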

Practice Set

Shiva is trying to beat the samosa eating record. The current record is 53.5 samosas in 12 minutes. The following table shows how many samosas he eats during his daily practice for the first week of his training. Will he be ready for the contest if it occurs two weeks from the day he started training? What are the meanings of the slope and the \begin{align*}y-\end{align*}intercept in this problem?

| Day | No. of Samosas |
|---|---|
| 1 | 30 |
| 2 | 34 |
| 3 | 36 |
| 4 | 36 |
| 5 | 40 |
| 6 | 43 |
| 7 | 45 |

Nitisha is trying to find the elasticity coefficient of a Superball. She drops the ball from different heights and measures the maximum height of the resulting bounce. The table below shows her data. Draw a scatter plot and find the equation. What is the initial height if the bounce height is 65 cm? What are the meanings of the slope and the \begin{align*}y-\end{align*}intercept in this problem?

| Initial height (cm) | Bounce height (cm) |
|---|---|
| 30 | 22 |
| 35 | 26 |
| 40 | 29 |
| 45 | 34 |
| 50 | 38 |
| 55 | 40 |
| 60 | 45 |
| 65 | 50 |
| 70 | 52 |

Baris is testing the burning time of “BriteGlo” candles. The following table shows how long it takes to burn candles of different weights. Let's assume it’s a linear relation. We can then use a line to fit the data. If a candle burns for 95 hours, what must be its weight in ounces?

Candle Burning Time Based on Candle Weight

| Candle weight (oz) | Time (hours) |
|---|---|
| 2 | 15 |
| 3 | 20 |
| 4 | 35 |
| 5 | 36 |
| 10 | 80 |
| 16 | 100 |
| 22 | 120 |
| 26 | 180 |

The table below shows the median California family income from 1995 to 2002 as reported by the U.S. Census Bureau. Draw a scatter plot and find the equation. What would you expect the median annual income of a Californian family to be in year 2010? What are the meanings of the slope and the \begin{align*}y-\end{align*}intercept in this problem?

| Year | Income |
|---|---|
| 1995 | 53,807 |
| 1996 | 55,217 |
| 1997 | 55,209 |
| 1998 | 55,415 |
| 1999 | 63,100 |
| 2000 | 63,206 |
| 2001 | 63,761 |
| 2002 | 65,766 |

Mixed Review

Sheri bought an espresso machine and paid $119.64 including tax. The sticker price was $110.27. What was the percent of tax?

What are the means of \begin{align*}\frac{4}{x}=\frac{141}{98}\end{align*}? What are the extremes?

Solve the proportion in the previous question.

The distance traveled varies directly with the time traveled. If a car has traveled 328.5 miles in 7.3 hours, how many hours will it take to travel 82.8 miles?