Algebra I · Lesson 4

Scatter Plots and Correlation

How to find a relationship hiding inside a cloud of points.


A researcher tracked how many hours per week teenagers spent exercising and their resting heart rates. She plotted each person as a single dot — hours on the horizontal axis, heart rate on the vertical. When she stepped back and looked at the whole picture, she saw something: as hours of exercise went up, resting heart rates tended to go down. That picture is a scatter plot, and the pattern she noticed is a correlation.

A scatter plot places two numerical variables against each other on a coordinate plane. Each person, object, or event becomes one point. The horizontal axis holds the independent variable — the one you think might be doing the influencing. The vertical axis holds the dependent variable — the one you think might be responding.

Here is what that exercise data might look like:

[Scatter plot: hours of exercise per week on the horizontal axis, resting heart rate on the vertical axis]

Once the points are plotted, you describe the association. There are three things to name: direction, form, and strength.

Direction tells you which way the pattern goes. A positive association means that as the $x$-variable increases, the $y$-variable tends to increase too. A negative association means that as $x$ increases, $y$ tends to decrease. The exercise and heart rate data show a negative association. If there is no pattern at all, you say there is no association.

Form describes the shape of the pattern. If the points cluster around a straight line, the association is linear. If they bend or curve, it is nonlinear.

Strength describes how tightly the points cluster around that pattern. If the points hug the line closely, the association is strong. If they are spread out loosely, it is weak.

A complete description of a scatter plot combines all three: "There is a strong, negative, linear association between hours of exercise and resting heart rate."

To put an exact number on the strength and direction of a linear association, statisticians use the correlation coefficient, written $r$. It always falls between $-1$ and $1$.

$$-1 \leq r \leq 1$$

When $r$ is close to $1$, the points fall nearly on a line with positive slope: a strong positive linear association. When $r$ is close to $-1$, they fall nearly on a line with negative slope: a strong negative linear association. When $r$ is close to $0$, the linear pattern is weak or absent. The sign of $r$ tells you direction. The size of $r$ tells you strength.

For the Regents, your calculator computes $r$ for you. You will not calculate it by hand. But you need to interpret what it means. An $r$-value of $-0.94$ means a strong negative linear association. An $r$-value of $0.31$ means a weak positive linear association.
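To see what the calculator is doing behind the scenes, here is a minimal Python sketch of the standard Pearson formula for $r$. The data values are made up for illustration (they are not the researcher's actual data), and the cutoffs of 0.8 and 0.5 used to label strength, along with the "moderate" label, are assumptions chosen for this example, not an official Regents rule.

```python
import math

def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient r for paired lists xs and ys.

    Assumes both lists have the same length and that neither is constant
    (otherwise the denominator would be zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def describe(r, strong_cutoff=0.8, moderate_cutoff=0.5):
    """Label direction from the sign of r and strength from its size.

    The cutoffs are hypothetical; textbooks draw these lines differently.
    """
    if r == 0:
        return "no linear association"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size >= strong_cutoff:
        strength = "strong"
    elif size >= moderate_cutoff:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction}"

# Hypothetical data: weekly exercise hours and resting heart rate (bpm).
hours = [0, 1, 2, 3, 4, 5, 6]
heart_rates = [82, 78, 77, 73, 70, 69, 66]

r = correlation_coefficient(hours, heart_rates)
print(round(r, 2), describe(r))
```

With these made-up numbers, $r$ comes out close to $-1$, which matches the tight downward pattern in the list. Values very near $0$ still get the label "weak" here; in a written answer you would say there is little or no linear association.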

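When the association is linear, you can draw a line of best fit, also called a trend line or regression line. A line drawn by hand should pass through the middle of the data, with roughly equal numbers of points above and below it. The calculator's regression line is more precise: it is the line that minimizes the sum of the squared vertical distances from the points to the line. On the Regents, you may be asked to draw one by hand or to use a calculator to find its equation.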

The line of best fit has the same form as any linear equation:

$$y = mx + b$$

Here $m$ is the slope and $b$ is the $y$-intercept, just like always. What changes is how you read them. The slope tells you how much $y$ changes for each one-unit increase in $x$. In context, that matters. If the equation for the exercise data is $y = -2.5x + 81$, the slope $-2.5$ means that for each additional hour of weekly exercise, resting heart rate drops by about $2.5$ beats per minute.

Reading a line of best fit
$y = -2.5x + 81$
This is the regression equation your calculator gives. Each number means something real.
$m = -2.5$
For every one extra hour of exercise per week, resting heart rate decreases by 2.5 bpm on average.
$b = 81$
When $x = 0$ (no exercise at all), the predicted resting heart rate is 81 bpm. This is the $y$-intercept.
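If you are curious where a calculator's $m$ and $b$ come from, the sketch below applies the usual least-squares formulas: the slope is the sum of the products of deviations divided by the sum of squared $x$-deviations, and the intercept makes the line pass through the point of averages. The data are hypothetical, invented so the result lands near the lesson's equation; the function name `fit_line` is just an illustrative choice.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for the line y = m*x + b.

    Assumes xs and ys have the same length and xs is not constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx               # change in predicted y per one-unit change in x
    b = mean_y - m * mean_x     # predicted y when x = 0
    return m, b

# Hypothetical data: weekly exercise hours and resting heart rate (bpm).
hours = [0, 1, 2, 3, 4, 5, 6]
heart_rates = [82, 78, 77, 73, 70, 69, 66]

m, b = fit_line(hours, heart_rates)
print(f"y = {m:.2f}x + {b:.2f}")  # approximately y = -2.61x + 81.39
```

You would read this output the same way as the equation in the box: each extra hour of weekly exercise goes with a predicted drop of about 2.6 bpm, and the predicted resting heart rate at zero hours is about 81 bpm.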

You can also use the equation to make predictions. Plug in an $x$-value and compute the predicted $y$. If you predict within the range of your data, that is called interpolation. If you predict outside the range of your data, that is extrapolation, and you should be cautious, because the pattern may not continue beyond the values you actually observed.

Predicting from the regression equation
$y = -2.5(5) + 81$
Plug in $x = 5$ hours of exercise to predict resting heart rate.
$y = -12.5 + 81$
Multiply first.
$y = 68.5$ ✓
The model predicts a resting heart rate of about 68.5 bpm for someone who exercises 5 hours per week.
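The same prediction can be sketched in Python, along with a check for whether the input is an interpolation or an extrapolation. The equation $y = -2.5x + 81$ comes from the lesson, but the data range of 0 to 10 hours is an assumption made only for this example, since the lesson does not state the actual range.

```python
def predict(x, slope=-2.5, intercept=81.0, x_min=0.0, x_max=10.0):
    """Predict y from the line y = slope*x + intercept.

    x_min and x_max are the assumed smallest and largest x-values in the
    data (hypothetical here). Inputs inside that range are interpolation;
    inputs outside it are extrapolation.
    """
    y = slope * x + intercept
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y, kind

print(predict(5))   # (68.5, 'interpolation')
print(predict(40))  # (-19.0, 'extrapolation')
```

The second call shows why extrapolation deserves caution: at 40 hours per week the line predicts a negative heart rate, which is impossible. The equation only describes the pattern within the data it was built from.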
Practice Questions
$r = -0.87$
$y = 3.2x + 14$, predict $y$ when $x = 6$
$r = 0.04$
Regents Corner

On Part II and Part III of the Algebra I Regents, scatter plot questions often ask you to write a sentence interpreting the slope in context. Writing just "the slope is $-2.5$" earns no credit. You need to say what the numbers mean using the variables described in the problem. A complete answer names both variables, states the direction of change, and includes units.

Students often see a strong correlation, say $r = 0.96$, and write that one variable causes the other. This is wrong, and the Regents rubric will not give credit for causation language when the problem only involves observational data. Correlation measures how variables move together. It says nothing about why. Stick to language like "as $x$ increases, $y$ tends to increase" rather than "$x$ causes $y$ to increase."