Title: Describing Relationships: Scatterplots and Correlation
1Chapter 4
- Describing Relationships: Scatterplots and
Correlation
2Objectives (BPS chapter 4)
- Relationships: scatterplots and correlation
- Explanatory and response variables
- Displaying relationships: scatterplots
- Interpreting scatterplots
- Adding categorical variables to scatterplots
- Measuring linear association (correlation)
- Facts about correlation
3Scatterplot
- A scatterplot is a graph in which paired (x, y)
data (usually collected on the same individuals)
are plotted with one variable represented on a
horizontal (x-) axis and the other variable
represented on a vertical (y-) axis. Each
individual pair (x, y) is plotted as a single
point.
Example
4Student  Number of Beers  Blood Alcohol Level
1         5                0.1
2         2                0.03
3         9                0.19
4         8                0.12
5         3                0.04
6         7                0.095
7         3                0.07
8         5                0.06
9         3                0.02
10        5                0.05
11        4                0.07
12        6                0.1
13        5                0.085
14        7                0.09
15        1                0.01
16        4                0.05
Here we have two quantitative variables for
each of 16 students: 1. how many beers they
drank, and 2. their blood alcohol level
(BAC). We are interested in the relationship
between the two variables: how is one affected by
changes in the other?
5Scatterplots
- In a scatterplot one axis is used to represent
each of the variables, and the data are plotted
as points on the graph.
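As a rough illustration of how each (x, y) pair becomes one point, here is a minimal sketch that draws the beers and BAC data as a text scatterplot using only the Python standard library. The function name `ascii_scatter` and the grid size are assumptions made for this example; in practice a plotting library would normally be used instead.

```python
# Beers (x) and BAC (y) for students 1 to 16, listed by student number.
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]


def ascii_scatter(xs, ys, width=40, height=12):
    """Return a text scatterplot: x runs left to right, y bottom to top.

    Hypothetical helper for illustration; assumes xs and ys each contain
    at least two distinct values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each coordinate onto the grid; nearby points may share a cell.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width


plot = ascii_scatter(beers, bac)
print(plot)
```

The student who drank 9 beers lands in the top right corner and the student who drank 1 beer in the bottom left, which is the upward-sloping pattern the slide describes.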
6Explanatory and response variables
- A response variable measures or records an
outcome of a study. An explanatory variable
explains changes in the response variable.
- Typically, the explanatory or independent
variable is plotted on the x axis and the
response or dependent variable is plotted on the
y axis.
7Some plots don't have clear explanatory and
response variables.
Do calories explain sodium amounts?
Does percent return on Treasury bills explain
percent return on common stocks?
8Examining a Scatterplot
- You can describe the overall pattern of a
scatterplot by its:
- Form: linear or non-linear (quadratic,
exponential, no correlation, etc.)
- Direction: negative or positive.
- Strength: strong, very strong, moderately
strong, weak, etc.
- Look for outliers and how they affect the
correlation.
9Scatterplot
Example: Draw a scatterplot for the data below.
What is the nature of the
relationship between X and Y?
x 1 2 3 4 5
y -4 -2 1 0 2
Strong, positive and linear.
10Examining a Scatterplot
- Two variables are positively correlated when high
values of the variables tend to occur together
and low values of the variables tend to occur
together.
- The scatterplot slopes upwards from left to
right.
- Two variables are negatively correlated when
high values of one of the variables tend to occur
with low values of the other and vice versa.
- The scatterplot slopes downwards from left to
right.
11Types of Correlation
- Negative Linear Correlation: as x increases, y
tends to decrease.
- Positive Linear Correlation: as x increases, y
tends to increase.
- No Correlation
- Non-linear Correlation
12Examples of Relationships
13Caution
- Relationships require that both variables be
quantitative (thus the order of the data points
is defined entirely by their value).
- Correspondingly, relationships between
categorical data are meaningless.
- Example: Beetles trapped on boards of different
colors. What association? What relationship?
14Thought Question 1
What type of association would the following
pairs of variables have: positive, negative, or
none?
- Temperature during the summer and electricity
bills
- Temperature during the winter and heating costs
- Number of years of education and height
(elementary school)
- Frequency of brushing and number of cavities
- Number of churches and number of bars in cities
- Height of husband and height of wife
15Thought Question 2
- Consider the two scatterplots below. How does
the outlier impact the correlation for each plot?
- Does the outlier increase the correlation,
decrease the correlation, or have no impact?
16Strength of the association
- The strength of the relationship between the two
variables can be seen by how much variation, or
scatter, there is around the main form.
With a strong relationship, you can get a pretty
good estimate of y if you know x.
With a weak relationship, for any x you might get
a wide range of y values.
17How to scale a scatterplot
Same data in all four plots
- Using an inappropriate scale for a scatterplot
can give an incorrect impression.
- Both variables should be given a similar amount
of space:
- Plot roughly square
- Points should occupy all the plot space (no
blank space)
18Adding categorical variables to scatterplots
- Often, things are not simple and one-dimensional.
We need to group the data into categories to
reveal trends.
What may look like a positive linear relationship
is in fact a series of negative linear
associations. Plotting different habitats in
different colors allowed us to make that
important distinction.
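The effect described above can be sketched numerically. The data below are HYPOTHETICAL, invented purely for illustration (they are not the habitat data from the slide), and the helper `correlation` is an assumed name implementing the usual Pearson formula. Within each group the trend is negative, yet the pooled data show a positive correlation.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# HYPOTHETICAL data for three groups (made up for this sketch).
groups = {
    "habitat A": ([1, 2, 3], [3, 2, 1]),
    "habitat B": ([4, 5, 6], [6, 5, 4]),
    "habitat C": ([7, 8, 9], [9, 8, 7]),
}

# Correlation inside each group, then for all points pooled together.
within = {name: correlation(xs, ys) for name, (xs, ys) in groups.items()}
all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]
pooled = correlation(all_x, all_y)

for name, r in within.items():
    print(name, round(r, 2))
print("pooled", round(pooled, 2))
```

Ignoring the grouping variable here would suggest a positive relationship, while every group on its own shows a negative one.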
19Comparison of men's and women's racing records
over time. Each group shows a very strong
negative linear relationship that would not be
apparent without the gender categorization.
Relationship between lean body mass and metabolic
rate in men and women. While both men and women
follow the same positive linear trend, women show
a stronger association. As a group, males
typically have larger values for both variables.
20Measuring Strength and Direction of a Linear
Relationship
- How closely does a non-horizontal straight line
fit the points of a scatterplot?
- The correlation coefficient (often referred to as
just correlation), r, is a:
- measure of the strength of the relationship:
the stronger the relationship, the larger the
magnitude of r.
- measure of the direction of the relationship:
positive r indicates a positive relationship,
negative r indicates a negative relationship.
21Correlation Coefficient
The Greek capital letter sigma (Σ) denotes
summation (addition).
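The formula from this slide did not survive in the text. Assuming the slide uses the standard definition based on z-scores (which these notes mention later), the correlation for n pairs can be written as below. The symbols are assumptions for this sketch: x̄ and ȳ are the sample means, and s_x and s_y are the sample standard deviations.

```latex
r = \frac{1}{n-1}\sum_{i=1}^{n}
      \left(\frac{x_i-\bar{x}}{s_x}\right)
      \left(\frac{y_i-\bar{y}}{s_y}\right)
  = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \,\sum_{i=1}^{n}(y_i-\bar{y})^2}}
```

The second form follows from substituting the definitions of s_x and s_y, and it is convenient for hand calculation because it only needs the deviations from the means.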
22Example: Find the correlation between X and Y
x 1 2 3 4 5
y -4 -2 1 0 2
Means: x̄ = 3, ȳ = -0.6
x  x - x̄  y   y - ȳ  (x - x̄)(y - ȳ)
1  -2     -4  -3.4   6.8
2  -1     -2  -1.4   1.4
3  0      1   1.6    0
4  1      0   0.6    0.6
5  2      2   2.6    5.2
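The table can be carried through to a value of r with a short script. This is a minimal sketch assuming the standard Pearson formula (sum of the deviation products divided by the square root of the product of the two sums of squared deviations); the variable names are chosen for this example only.

```python
import math

x = [1, 2, 3, 4, 5]
y = [-4, -2, 1, 0, 2]

n = len(x)
x_bar = sum(x) / n  # mean of x
y_bar = sum(y) / n  # mean of y

# Deviations from the means, matching the middle columns of the table.
dx = [xi - x_bar for xi in x]
dy = [yi - y_bar for yi in y]

# Products of deviations, matching the last column of the table.
products = [a * b for a, b in zip(dx, dy)]
sum_products = sum(products)

# Sums of squared deviations for each variable.
sxx = sum(a * a for a in dx)
syy = sum(b * b for b in dy)

r = sum_products / math.sqrt(sxx * syy)
print(round(r, 3))
```

The products sum to 14, and r comes out at about 0.92, which agrees with the earlier description of this relationship as strong, positive, and linear.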
23Correlation Coefficient
- The range of the correlation coefficient is -1 to
1.
If r = -1 there is a perfect negative correlation.
If r = 1 there is a perfect positive correlation.
If r is close to 0 there is no linear correlation.
24Linear Correlation
r = -0.91: Strong negative correlation
r = 0.88: Strong positive correlation
r = 0.42: Weak positive correlation
r = 0.07: Non-linear correlation
25Correlation Coefficient
- Special values for r:
- a perfect positive linear relationship would have
r = 1
- a perfect negative linear relationship would have
r = -1
- if there is no linear relationship, or if the
scatterplot points are best fit by a horizontal
line, then r = 0
- Note: r must be between -1 and 1, inclusive
- r > 0: as one variable changes, the other
variable tends to change in the same direction
- r < 0: as one variable changes, the other
variable tends to change in the opposite direction
26Correlation Coefficient
- Because r uses the z-scores for the observations,
it does not change when we change the units of
measurement of x, y, or both.
- Correlation ignores the distinction between
explanatory and response variables.
- r measures the strength of only linear
association between variables.
- A large value of r does not necessarily mean that
there is a strong linear relationship between the
variables; the relationship might not be linear.
Always look at the scatterplot.
- When r is close to 0, it does not mean that there
is no relationship between the variables; it
means there is no linear relationship.
- Outliers can inflate or deflate correlations.
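Several of these facts can be checked on the small x, y data set used in the worked example. The sketch below assumes the standard Pearson formula; the rescaling factors and the extra outlying point (6, -10) are hypothetical values chosen only to illustrate the idea.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
y = [-4, -2, 1, 0, 2]

r_xy = correlation(x, y)

# Swapping the roles of x and y leaves r unchanged.
r_yx = correlation(y, x)

# Changing units (hypothetical factors 100 and 1/2.54) leaves r unchanged.
r_rescaled = correlation([100 * a for a in x], [b / 2.54 for b in y])

# Adding one hypothetical outlier can change r drastically.
r_outlier = correlation(x + [6], y + [-10])

print(round(r_xy, 3), round(r_yx, 3), round(r_rescaled, 3), round(r_outlier, 3))
```

With this made-up outlier the correlation even changes sign, which is why the scatterplot should always be inspected alongside r.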
27Not all Relationships are Linear: Miles per Gallon
versus Speed
- Curved relationship (r is misleading)
- Speed chosen for each subject varies from 20 mph
to 60 mph
- MPG varies from trial to trial, even at the same
speed
- Statistical relationship
r = -0.06
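To see how a clear curved relationship can produce a correlation near zero, consider the sketch below. The speed and MPG values are HYPOTHETICAL, made up for illustration (they are not the data behind the slide's r = -0.06), and are chosen so that MPG rises and then falls symmetrically around 40 mph.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


# HYPOTHETICAL values: MPG peaks at a moderate speed, then drops again.
speed = [20, 30, 40, 50, 60]
mpg = [24, 28, 30, 28, 24]

r_curved = correlation(speed, mpg)
print(round(r_curved, 3))
```

Here MPG depends strongly on speed, but the rising and falling halves cancel in the calculation, so r is essentially zero.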
28Common Errors Involving Correlation
- 1. Causation: It is wrong to conclude that
correlation implies causality.
- 2. Averages: Averages suppress individual
variation and may inflate the correlation
coefficient.
- 3. Linearity: There may be some relationship
between x and y even when there is no linear
correlation.
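The second error can be illustrated with a small sketch. The data are HYPOTHETICAL, invented for this example: three individuals are measured at each of three x values, and the correlation computed from individual observations is compared with the correlation computed from the group averages.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


# HYPOTHETICAL individual observations: three y values at each x.
by_x = {1: [0, 2, 4], 2: [1, 3, 5], 3: [2, 4, 6]}

# Correlation using every individual observation.
ind_x = [x for x, ys in by_x.items() for _ in ys]
ind_y = [y for ys in by_x.values() for y in ys]
r_individuals = correlation(ind_x, ind_y)

# Correlation using only the average y at each x.
avg_x = list(by_x)
avg_y = [sum(ys) / len(ys) for ys in by_x.values()]
r_averages = correlation(avg_x, avg_y)

print(round(r_individuals, 3), round(r_averages, 3))
```

Averaging removes the scatter among individuals, so the averages line up perfectly even though the individual-level association is only moderate.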
29Example
- A survey of the world's nations in 2004 shows a
strong positive correlation between the percentage
of each country's population using cell phones and
life expectancy in years at birth.
- Does this mean that cell phones are good for your
health?
- No. It simply means that in countries where cell
phone use is high, the life expectancy tends to
be high as well.
- What might explain the strong correlation?
- The economy could be a lurking variable. Richer
countries generally have more cell phone use and
better health care.
30Example
- The correlation between Age and Income as
measured on 100 people is r = 0.75. Explain
whether or not each of these conclusions is
justified.
- When Age increases, Income increases as well.
- The form of the relationship between Age and
Income is linear.
- There are no outliers in the scatterplot of
Income vs. Age.
- Whether we measure Age in years or months, the
correlation will still be 0.75.
31Example
- Explain the mistakes in the statements below:
- My correlation of -0.772 between GDP and Infant
Mortality Rate shows that there is almost no
association between GDP and Infant Mortality
Rate.
- There was a correlation of 0.44 between GDP and
Continent.
- There was a very strong correlation of 1.22
between Life Expectancy and GDP.
32Key Concepts
- Strength of Linear Relationship
- Direction of Linear Relationship
- Correlation Coefficient
- Common Problems with Correlations
- r can only be calculated for quantitative data.