Linear Regression

Finding the Line That Best Fits a Set of Data Points.

A researcher at the National Highway Traffic Safety Administration tracked how driver reaction time changes with age. She collected data from dozens of drivers — their age in years and how long it took them to brake after seeing a hazard. When she plotted the points, no single straight line passed through all of them. But one line came closer to all the points than any other. That line let her predict reaction times for ages she had never tested. The method she used is called linear regression.

Linear regression finds the line that best fits a scatter plot. That line has the form

$$y = ax + b$$

where $a$ is the slope and $b$ is the y-intercept, the same form you already know. The difference is that here you did not construct the line from two exact points. A calculation found it by making the total squared vertical distance between the line and the data points as small as possible. Your graphing calculator does this automatically. The result is called the linear regression equation or the line of best fit.
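The minimization the calculator performs can be written out. As a sketch, assuming the standard least-squares criterion (the symbols $n$, $x_i$, $y_i$, $\bar{x}$, and $\bar{y}$ are notation introduced here and are not used elsewhere in this lesson), the line is chosen to make the sum of squared vertical distances as small as possible:

$$S(a, b) = \sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr)^2$$

The values of $a$ and $b$ that minimize $S$ are

$$a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b = \bar{y} - a\bar{x}$$

where $(x_i, y_i)$ are the $n$ data points, $\bar{x}$ is the mean of the $x$-values, and $\bar{y}$ is the mean of the $y$-values. You are not expected to compute these by hand on the exam; the formulas only show what the calculator is doing.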

Here is a small data set. A student tracked how many hours she studied for each of six quizzes and what score she earned.

| Hours studied | Score |
|---|---|
| 1 | 62 |
| 2 | 70 |
| 3 | 75 |
| 4 | 80 |
| 5 | 88 |
| 6 | 91 |

Enter the hours in L1 and the scores in L2 on your calculator. Run LinReg(ax+b) from the STAT → CALC menu. The calculator returns values for $a$ and $b$. For this data, it returns approximately $a = 5.83$ and $b = 57.27$, so the regression line is

$$y = 5.83x + 57.27$$

The graph below shows that line against the actual data points.

[Graph: the six data points plotted with the regression line $y = 5.83x + 57.27$]

The line does not pass through any of the six points exactly. That is expected. It is the line that gets as close as possible to all of them together.

Now the interpretation. The slope is $5.83$. That means for each additional hour studied, the model predicts a score increase of about 5.83 points. The y-intercept is $57.27$. Technically, it says the predicted score for someone who studied zero hours is about 57. That prediction might not be meaningful, because zero hours lies outside the 1-to-6-hour range of the data and a student who studied zero hours might still have prior knowledge, but the y-intercept is still part of the equation.

Predicting a value using the regression line

$$y = 5.83x + 57.27$$

Start with the regression equation the calculator gave you.

$$y = 5.83(4.5) + 57.27$$

The student wants to predict a score for 4.5 hours of studying. Substitute $x = 4.5$.

$$y = 26.235 + 57.27$$

Multiply first.

$$y \approx 83.5 \checkmark$$

A student who studies 4.5 hours is predicted to score about 83.5 points.
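The same fit and prediction can be reproduced away from the calculator. What follows is a minimal sketch in Python, assuming the standard least-squares formulas for slope and intercept; the names `hours`, `scores`, `fit_line`, and `predict` are made up for this illustration and are not part of the lesson or of any calculator.

```python
# Quiz data from the table: hours studied (x) and scores earned (y).
hours = [1, 2, 3, 4, 5, 6]
scores = [62, 70, 75, 80, 88, 91]


def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = ax + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: sum of products of deviations over sum of squared x-deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    a = sxy / sxx
    # The least-squares line passes through the point (x_mean, y_mean).
    b = y_mean - a * x_mean
    return a, b


def predict(a, b, x):
    """Substitute x into y = ax + b."""
    return a * x + b


a, b = fit_line(hours, scores)
predicted = predict(a, b, 4.5)
print(f"a = {a:.2f}, b = {b:.2f}")
print(f"predicted score for 4.5 hours: {predicted:.1f}")
```

The script keeps the unrounded slope and intercept until the final print statement, which mirrors the advice to round only at the last step.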

The correlation coefficient $r$ measures how well the regression line fits the data. It is always between $-1$ and $1$. A value close to $1$ means a strong positive linear relationship: as $x$ increases, $y$ tends to increase, and the points cluster tightly around the line. A value close to $-1$ means a strong negative linear relationship, where $y$ tends to decrease as $x$ increases. A value close to $0$ means there is little linear relationship, so a line is probably the wrong model to use. For the quiz data above, the calculator gives $r \approx 0.994$, which is extremely strong. The line fits the data very well.
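To see where a value of $r$ comes from, here is a short sketch in Python, assuming the usual Pearson formula for the correlation coefficient. The function name `correlation` and the variable names are illustrative only; on the exam the calculator reports $r$ for you.

```python
import math

# Quiz data from the table: hours studied (x) and scores earned (y).
hours = [1, 2, 3, 4, 5, 6]
scores = [62, 70, 75, 80, 88, 91]


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = correlation(hours, scores)
# Reversing the scores pairs high hours with low scores, so r changes sign.
r_reversed = correlation(hours, list(reversed(scores)))
print(f"r = {r:.3f}")
print(f"r with scores reversed = {r_reversed:.3f}")
```

Reversing the score list is only a device to show what a strong negative relationship looks like in the same calculation; it is not real data.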

To make $r$ appear on your calculator, go to CATALOG and turn DiagnosticOn on before running LinReg.

Now look at a case where you have to interpret a negative slope.

Interpreting a regression line with negative slope

$$y = -3.2x + 98.4$$

This is a regression line for a data set where $x$ is the number of absences in a semester and $y$ is the final exam grade.

$$\text{slope} = -3.2$$

For each additional absence, the predicted grade drops by 3.2 points.

$$y = -3.2(0) + 98.4 = 98.4$$

The y-intercept says a student with zero absences is predicted to score about 98.4. That is the model's starting point.

$$y = -3.2(10) + 98.4 = 66.4 \checkmark$$

Substitute $x = 10$ and evaluate. A student with 10 absences is predicted to score about 66.4.
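The slope and intercept readings above can be checked by evaluating the line at a few values. This is a small sketch in Python; the function name `predicted_grade` is hypothetical, and the coefficients are the ones given in the example.

```python
def predicted_grade(absences):
    """Evaluate the example regression line y = -3.2x + 98.4."""
    slope = -3.2      # predicted change in grade per additional absence
    intercept = 98.4  # predicted grade at zero absences
    return slope * absences + intercept


for absences in (0, 1, 10):
    print(absences, round(predicted_grade(absences), 1))
# 0 98.4
# 1 95.2
# 10 66.4
```

Going from 0 to 1 absence lowers the prediction by 3.2 points, which is exactly what the slope says.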

One more thing the Regents tests regularly: using the regression line to make predictions within the data range versus outside it. Predicting inside the range of your data is called interpolation, and it is generally reliable. Predicting far outside the range is called extrapolation, and the model may not hold. The study data above only went from 1 to 6 hours, so predicting for 20 hours of studying using the same line gives a score above 170, far more than the 100 points that are possible. The line is a model, not a law.
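The distinction comes down to comparing the new $x$-value with the smallest and largest $x$-values in the data. A minimal sketch in Python, with a hypothetical helper named `prediction_type`:

```python
def prediction_type(x, x_values):
    """Label a prediction at x as interpolation or extrapolation."""
    if min(x_values) <= x <= max(x_values):
        return "interpolation"
    return "extrapolation"


hours = [1, 2, 3, 4, 5, 6]  # x-values from the quiz data
print(prediction_type(4.5, hours))  # interpolation
print(prediction_type(20, hours))   # extrapolation
```

The check says nothing about how far outside the range a value is, so a prediction labeled extrapolation still calls for judgment about whether the model is likely to hold.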

Practice Questions

Given $y = 4.1x + 12.6$, find $y$ when $x = 5$.

Given $y = -2.5x + 80$, interpret the slope and the y-intercept.

If $r = -0.21$, is a linear model a good fit?

Regents Corner

On Part II and Part III of the Algebra I Regents, linear regression questions almost always ask you to write the equation, interpret the slope in context, and use the equation to predict a value. Showing the substitution step earns a process credit, so do not just write the final answer. The grader needs to see a line such as $y = -3.2(10) + 98.4$ before they see $66.4$.

Students often round the slope and intercept too early. They write whole numbers such as $a = 6$ instead of using the calculator's full decimal output, then get a predicted value that is far enough off to lose a point. Keep the values from the calculator until the very last step, then round your final answer.
