Finding the Line That Best Fits a Set of Data Points.
A researcher at the National Highway Traffic Safety Administration tracked how driver reaction time changes with age. She collected data from dozens of drivers — their age in years and how long it took them to brake after seeing a hazard. When she plotted the points, no single straight line passed through all of them. But one line came closer to all the points than any other. That line let her predict reaction times for ages she had never tested. The method she used is called linear regression.
Linear regression finds the line that best fits a scatter plot. That line has the form

$$y = ax + b$$

where $a$ is the slope and $b$ is the y-intercept. This is the same form you already know, $y = mx + b$, with $a$ in place of $m$. The difference is that here you did not construct the line from two exact points. Instead, a calculation found it by minimizing how far the line is from every point at once. Your graphing calculator does this automatically. The result is called the linear regression equation or the line of best fit.
Here is a small data set. A student tracked how many hours she studied for each of six quizzes and what score she earned.
| Hours studied | Score |
|---|---|
| 1 | 62 |
| 2 | 70 |
| 3 | 75 |
| 4 | 80 |
| 5 | 88 |
| 6 | 91 |
Enter the hours in L1 and the scores in L2 on your calculator. Run LinReg(ax+b) from the STAT → CALC menu. The calculator returns values for $a$ and $b$. For this data, it returns approximately $a = 5.83$ and $b = 57.27$, so the regression line is

$$y = 5.83x + 57.27$$
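If you want to check the calculator's output by hand, the sketch below applies the standard least-squares formulas to the six quiz points. The helper name `linreg` is made up for this illustration. It is not a calculator command, and the code is only meant to show where the two numbers come from.

```python
# Quiz data from the table: hours studied and score earned.
hours = [1, 2, 3, 4, 5, 6]
scores = [62, 70, 75, 80, 88, 91]

def linreg(xs, ys):
    """Return (a, b) for the least-squares line y = a*x + b.

    Hypothetical helper written for this example.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the least-squares line always passes through (x_mean, y_mean).
    b = y_mean - a * x_mean
    return a, b

a, b = linreg(hours, scores)
print(f"y = {a:.2f}x + {b:.2f}")  # prints: y = 5.83x + 57.27
```

The intercept step relies on the fact that the least-squares line passes through the point made of the two averages, here about $(3.5, 77.67)$.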
You can see what that line looks like against the actual data points below.
The line does not pass through any of the six points exactly. That is expected. It is the line that gets as close as possible to all of them together.
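One way to see this is to compute each residual, the actual score minus the predicted score. The sketch below assumes the rounded least-squares coefficients for this table, $a = 5.83$ and $b = 57.27$. The variable names are chosen only for this illustration.

```python
hours = [1, 2, 3, 4, 5, 6]
scores = [62, 70, 75, 80, 88, 91]

# Assumed coefficients: the least-squares values for this table, rounded to two decimals.
a, b = 5.83, 57.27

# Residual = actual score - predicted score, rounded to two decimals.
residuals = [round(y - (a * x + b), 2) for x, y in zip(hours, scores)]

for x, y, res in zip(hours, scores, residuals):
    print(f"hours={x}  actual={y}  predicted={a * x + b:.2f}  residual={res:+.2f}")
```

No residual is zero, so the line misses every point by a little. The misses fall on both sides of the line and nearly cancel, which is what you expect from a line of best fit.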
Now the interpretation. The slope is $a \approx 5.83$. That means for each additional hour studied, the model predicts a score increase of about 5.83 points. The y-intercept is $b \approx 57.27$. Technically, it says the predicted score for someone who studied zero hours is about 57. That prediction might not be meaningful, because zero hours lies outside the data (the student never studied less than 1 hour) and a student who studied zero might still have prior knowledge. Even so, the y-intercept is still part of the equation.
The correlation coefficient $r$ measures how well the regression line fits the data. It is always between $-1$ and $1$. A value close to $1$ means a strong positive linear relationship: as $x$ increases, $y$ tends to increase, and the points cluster tightly around the line. A value close to $-1$ means a strong negative linear relationship: as $x$ increases, $y$ tends to decrease. A value close to $0$ means the linear model fits poorly and a line is probably the wrong shape to use. For the quiz data above, the calculator gives $r \approx 0.994$, which is extremely strong. The line fits the data very well.
To make $r$ appear on your calculator, open CATALOG and run DiagnosticOn before running LinReg.
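To see where the value of $r$ comes from, here is a minimal sketch that computes the correlation coefficient for the quiz data from its usual definition. The function name `correlation` is chosen for this example and is not a calculator command.

```python
import math

hours = [1, 2, 3, 4, 5, 6]
scores = [62, 70, 75, 80, 88, 91]

def correlation(xs, ys):
    """Return the correlation coefficient r for paired data (illustrative helper)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    # Dividing by the combined spread of x and y keeps r between -1 and 1.
    return sxy / math.sqrt(sxx * syy)

r = correlation(hours, scores)
print(f"r = {r:.3f}")  # prints: r = 0.994
```

The sign of $r$ always matches the sign of the slope, so a positive slope goes with a positive $r$.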
Now look at a case where you have to interpret a negative slope.
One more thing the Regents tests regularly: using the regression line to make predictions within the data range versus outside it. Predicting inside the range of your data is called interpolation, and it is generally reliable. Predicting far outside the range is called extrapolation, and the model may not hold. The study data only went from 1 to 6 hours. Predicting for 20 hours of studying with the same line, $y = 5.83(20) + 57.27$, gives a score of about 174, which is impossible. The line is a model, not a law.
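The difference can be sketched in a few lines. The code below assumes the rounded quiz-data line $y = 5.83x + 57.27$ and the observed range of 1 to 6 hours. The names `predict` and `kind` are invented for this illustration.

```python
# Assumed coefficients: the rounded least-squares line for the quiz data.
a, b = 5.83, 57.27
# Range of hours actually observed in the data set.
x_min, x_max = 1, 6

def predict(x):
    """Predicted score from the regression line y = a*x + b."""
    return a * x + b

def kind(x):
    """Label a prediction by whether x lies inside the observed range."""
    return "interpolation" if x_min <= x <= x_max else "extrapolation"

for x in (4.5, 20):
    print(f"{x} hours -> predicted score {predict(x):.1f} ({kind(x)})")
```

The prediction for 4.5 hours lands between the scores for 4 and 5 hours, which is believable. The prediction for 20 hours is far above 100, which shows that the line has been pushed past the range where it makes sense.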
On Part II and Part III of the Algebra I Regents, linear regression questions almost always ask you to write the equation, interpret the slope in context, and use the equation to predict a value. Showing the substitution step earns process credit, so do not just write the final answer. For example, when predicting the score for 4 hours of studying with the quiz line, the grader needs to see $y = 5.83(4) + 57.27$ before they see $y = 80.59$.