Title: SLIDES PREPARED
1 STATISTICS for the Utterly Confused, 2nd ed.
- SLIDES PREPARED
- By
- Lloyd R. Jaisingh Ph.D.
- Morehead State University
- Morehead KY
2 Chapter 5
3 Outline
- Do I Need to Read This Chapter?
- 5-1 Scatter Plots
- 5-2 Looking for Patterns in the Data
- 5-3 Linear Correlation
- 5-4 Correlation and Causation
4 Outline
- 5-5 Least-Squares Regression Line
- 5-6 The Coefficient of Determination
- 5-7 Residual Plots
- 5-8 Outliers and Influential Points
5 Objectives
- Introduction of some basic statistical terms that
are related to correlation and regression
analysis.
- Basic introduction to the concepts of linear
correlation and linear regression analysis.
6 5-1 Scatter Plots
- In simple correlation and regression studies,
data are collected on two quantitative variables
(bivariate data) to determine whether a
relationship exists between the two variables.
- To illustrate this graphically, consider the
following example.
7 5-1 Scatter Plots
- Example: The bivariate data given in the
following table relate the high temperature (°F)
reached on a given day and the number of cans of
soft drinks sold from a particular vending
machine in front of a grocery store. Data were
collected for 15 different days.
8 5-1 Scatter Plots
We would like to graphically study the
association between the temperature and the
number of cans of soft drinks sold.
9 5-1 Scatter Plots
- To analyze graphically, we can display the data
on a two-dimensional graph.
- We can plot the number of cans of soft drinks
along the vertical axis and the temperature along
the horizontal axis.
- Such plots are called scatter plots.
10 5-1 Scatter Plots
Observe that the number of cans of soft drinks
sold increases as the temperature increases.
Scatter Plot of Number of Cans versus Temperature
11 5-1 Scatter Plots
- The variable plotted along the vertical axis is
called the dependent variable.
- The variable plotted along the horizontal axis is
called the independent variable.
- Notation: We will let y represent the dependent
variable and we will let x represent the
independent variable.
12 5-1 Scatter Plots
- Explanation of the term scatter plot: A
scatter plot is a graph of the ordered pairs
(x, y) of values for the independent variable x and
the dependent variable y.
13 5-1 Scatter Plots
- NOTE: The number of cans of soft drinks sold
will depend on the temperature.
- Thus, the dependent variable (y) will be the
number of cans of soft drinks sold, and the
independent variable (x) will be the temperature.
14 5-2 Looking at Patterns
- Detecting an association or a relationship for
bivariate data starts with a scatter plot.
- When examining a scatter plot, one should try to
answer the following questions:
- Is there a straight-line pattern or association?
15 5-2 Looking at Patterns
- Does the pattern or association slope upward or
downward?
- Are the plotted values tightly clustered together
in the pattern or widely separated?
- Are there noticeable deviations from the pattern?
16 Quick Tips
- Two variables are said to be positively related
if larger values of one variable tend to be
associated with larger values of the other.
- Two variables are said to be negatively related
if larger values of one variable tend to be
associated with smaller values of the other.
17 Perfect Positive Linear Association
Perfect positive association rarely occurs when
sample data are collected.
18 Perfect Negative Linear Association
Perfect negative association rarely occurs when
sample data are collected.
19 Very Strong Positive Linear Association
The points are closely packed along a positive
linear trend.
20 Very Strong Negative Linear Association
The points are closely packed along a negative
linear trend.
21 Little or No Association
The points are scattered around with no apparent
trend.
22 Nonlinear Association
The points display a nonlinear relationship.
23 5-3 Correlation
- So far you have seen how a scatter plot can
provide a visual display of the association
between two variables.
- Here we will discuss a numerical measure of the
linear association between two variables called
the Pearson product moment correlation
coefficient, or simply the correlation coefficient.
24 5-3 Correlation
- Explanation of the term sample correlation
coefficient: The sample correlation coefficient
measures the strength and direction of the linear
relationship between two variables using sample
data.
- The sample correlation coefficient is denoted by
the letter r and is computed from the equation on
the next slide.
25 5-3 Correlation
- n is the number of (x, y) data pairs.
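The equation for r is not reproduced in the text of this slide. A standard computational form of the Pearson sample correlation coefficient, which matches the table-based solution used in the example that follows, is sketched below. It is an assumed reconstruction, so the slide's own notation may differ.

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{\left[\,n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[\,n\sum y^{2} - \left(\sum y\right)^{2}\right]}}
```

Here every sum runs over the n data pairs (x, y).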
26 5-3 Correlation
- Example: Compute the linear correlation
coefficient for the following set of observations
for the independent variable x and the dependent
variable y.
27 5-3 Correlation
- Solution: The formula may look intimidating, but
we can construct a table to help with the
computations.
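The table-based computation can be sketched in Python. The observations below are hypothetical (they are not the data set from the slides), and the function name `correlation_table` is an assumption made for illustration. Each table row holds x, y, xy, x² and y², and the column totals feed the standard computational formula for r.

```python
import math

# Hypothetical observations, chosen only for illustration;
# they are NOT the data set from the slides.
x = [1, 2, 3, 4, 5]
y = [10, 8, 7, 5, 2]

def correlation_table(xs, ys):
    """Build the helper table (x, y, xy, x^2, y^2) and return (rows, r)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    rows = [(a, b, a * b, a * a, b * b) for a, b in zip(xs, ys)]
    n = len(rows)
    # Column totals: sum x, sum y, sum xy, sum x^2, sum y^2
    sx, sy, sxy, sxx, syy = (sum(col) for col in zip(*rows))
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return rows, numerator / denominator

rows, r = correlation_table(x, y)
print("".join(f"{name:>5}" for name in ("x", "y", "xy", "x^2", "y^2")))
for row in rows:
    print("".join(f"{value:>5}" for value in row))
print(f"r = {r:.4f}")
```

For this made-up data set r comes out close to -0.985, a very strong negative linear association. The sketch does not guard against a constant x or y column, which would make the denominator zero.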
28 5-3 Correlation
- Solution: Using the values from the previous
table, we have
29 5-3 Correlation
- Note: We may use available technology to help
compute the correlation coefficient. The
following is a MINITAB output with the value.
30 5-3 Correlation
- The scatter plot displays the negative
correlation between x and y.
31 Properties of the Correlation Coefficient
- The range of the correlation coefficient is from
-1 to 1.
- If there is a perfect positive linear
relationship between the variables, the value of
r will be equal to 1.
- If there is a perfect negative linear
relationship between the variables, the value of
r will be equal to -1.
32 Properties of the Correlation Coefficient
- If there is a strong positive linear relationship
between the variables, the value of r will be
close to 1.
- If there is a strong negative linear relationship
between the variables, the value of r will be
close to -1.
- If there is little or no linear relationship
between the variables, the value of r will be
close to 0.
33 Quick Tip
- One should always examine the scatter plot and
not just rely on the value of the linear
correlation.
- This measure will not detect curvilinear or other
types of complex relationships.
- That is, there may be a non-linear relationship
between two variables even though the linear
correlation is close to 0. See the next slide.
34 Quick Tip
Small linear correlation but strong non-linear
correlation.
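A small numerical sketch of this point, using hypothetical data and an illustrative helper named `pearson_r` (an assumption, not something from the slides): y is completely determined by x through y = x², yet the linear correlation is zero because the pattern is a parabola rather than a line.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient via the standard computational formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(a * b for a, b in zip(xs, ys))
    sxx = sum(a * a for a in xs)
    syy = sum(b * b for b in ys)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2)
    )

# Hypothetical data: x values symmetric about 0, y = x^2 exactly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [v * v for v in xs]

r_curve = pearson_r(xs, ys)
print(f"r = {r_curve:.3f}")
```

The symmetry of the x values makes the positive and negative contributions cancel, so r is 0 even though the relationship is perfectly predictable. Only a scatter plot would reveal it.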
35 Correlation and Causation
This illustration shows the distinction between
association and causation.
36 Correlation and Causation
- Suppose that a high correlation is observed
between the weekly sales of hot chocolate and the
number of skiing accidents.
- One can reasonably conclude that hot chocolate
sales could not cause skiers to have accidents
while skiing.
- Also, one can reasonably conclude that more
skiing accidents could not cause an increase in
hot chocolate sales.
37 Correlation and Causation
- Since neither variable causes the other, what
could explain such a relationship?
- The apparent relationship between the two
variables may be caused by a third variable.
- In this case, both variables may be related to the
weather conditions during the winter months.
38 Correlation and Causation
- Hence, one can conclude that correlation is not
the same as causation.
39 5-5 Least-Squares Regression Line
- In investigating the relationship between two
variables, the first thing one should do is to
prepare a scatter plot after the data are
collected.
- From the plot one can observe any pattern.
- If the correlation coefficient is reasonably
large (positive or negative), the next step would
be to fit the regression line that best models
the data.
40 5-5 Least-Squares Regression Line
- The following scatter plot (next slide) shows
two possible straight lines that may be used to
model the data.
- Question: Which of these lines best represents
the association between the two variables?
41 5-5 Least-Squares Regression Line
42 5-5 Least-Squares Regression Line
- NOTE: Regression analysis allows us to
determine which of the two lines best
represents the relationship.
- The equation of the linear regression
line is usually written as ŷ = ax + b (where a is the
slope and b is the y-intercept).
43 5-5 Least-Squares Regression Line
- Least-squares analysis allows us to determine
the values for a and b such that the equation
of the regression line best represents the
relationship between the two variables by
minimizing the error sum of squares.
- The regression line is usually called the line
of best fit.
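The idea of minimizing the error sum of squares can be sketched in Python. Everything here is illustrative: the data, the two hand-picked candidate lines, and the helper names `sse` and `least_squares` are assumptions, and the line is taken in the form ŷ = ax + b with a as the slope and b as the y-intercept, as defined above. The closed-form expressions used for a and b are the standard least-squares formulas.

```python
def sse(xs, ys, slope, intercept):
    """Error sum of squares: squared vertical gaps between points and line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

def least_squares(xs, ys):
    """Return (a, b) = (slope, y-intercept) minimizing the error sum of squares."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = sy / n - a * sx / n
    return a, b

# Hypothetical data, not taken from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
best = sse(xs, ys, a, b)

# Two hand-drawn candidate lines (also hypothetical) for comparison.
candidates = {"line 1": (1.0, 1.0), "line 2": (0.5, 2.5)}
for name, (slope, intercept) in candidates.items():
    print(name, "SSE =", round(sse(xs, ys, slope, intercept), 3))
print("least squares: a =", round(a, 3), "b =", round(b, 3),
      "SSE =", round(best, 3))
```

For this data the fitted line has a = 0.6 and b = 2.2 with an error sum of squares of 2.4, smaller than either candidate line (4.0 and 2.5). No other straight line can do better on these points.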
44 5-5 Least-Squares Regression Line
- We usually refer to this type of regression
analysis as simple regression analysis since we
are dealing with straight-line models involving
one independent variable.
45 5-5 Least-Squares Regression Line
The equations that one can use to compute the
values for a and b are
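The equations themselves are not reproduced in the text of this slide. With a as the slope and b as the y-intercept, as defined on the earlier slide, the standard least-squares formulas can be written as follows. This is an assumed reconstruction, and the slide may use an equivalent form.

```latex
a = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {n\sum x^{2} - \left(\sum x\right)^{2}},
\qquad
b = \bar{y} - a\,\bar{x}
  = \frac{\sum y - a\sum x}{n}
```

Here n is the number of (x, y) data pairs, and \bar{x} and \bar{y} are the sample means of x and y.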
46 5-5 Least-Squares Regression Line
NOTE Because of the availability of different
technologies, there is no need to memorize the
formulas (or even work with them) when we
have real data. We will illustrate using the
MINITAB software.
47 5-5 Least-Squares Regression Line
- Example: The following data relate the high
temperature (°F) reached on a given day and the
number of cans of soft drinks sold from a
particular vending machine in front of a grocery
store. Data were collected for 15 days.
- The data are given on the next slide.
48 5-5 Least-Squares Regression Line
49 5-5 Least-Squares Regression Line
Write the equation as
50 Quick Tip
- When using the line of best fit to make
predictions, care must be taken to use values of
the independent variable that lie within the
range of the observed values.
- Using values outside of the range of observed
independent values may lead to incorrect
predictions because we do not know how the model
behaves outside this range.
51 Quick Tip
- The model reflects the behavior of the
association between the two variables only within
the range of the observed values.
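One way to make this tip concrete is a prediction helper that refuses to extrapolate. The sketch below is illustrative only: the temperature and can counts are hypothetical (not the 15-day data set from the slides), the names `fit_line` and `predict` are assumptions, and the line is taken in the form ŷ = ax + b with a as the slope and b as the y-intercept.

```python
def fit_line(xs, ys):
    """Least-squares slope a and y-intercept b for the line y-hat = a*x + b."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = sy / n - a * sx / n
    return a, b

def predict(x_new, xs, a, b):
    """Predict y at x_new, refusing to go outside the observed x range."""
    lo, hi = min(xs), max(xs)
    if not lo <= x_new <= hi:
        raise ValueError(
            f"x = {x_new} is outside the observed range [{lo}, {hi}]"
        )
    return a * x_new + b

# Hypothetical temperatures (degrees F) and cans sold.
temps = [70, 75, 80, 85, 90, 95]
cans = [30, 36, 40, 47, 52, 59]

a, b = fit_line(temps, cans)
print("prediction at 85:", round(predict(85, temps, a, b), 1))

try:
    predict(110, temps, a, b)  # 110 was never observed
except ValueError as err:
    print("refused:", err)
```

Raising an error is a deliberately strict design choice. A softer variant could return the prediction together with a warning, but the strict version makes the out-of-range case impossible to overlook.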
52 5-5 Least-Squares Regression Line
- Example: For the previous example, what is the
predicted number of soft drinks sold for a
temperature of 85 °F?
- The predicted number of soft drinks sold is
53 5-6 The Coefficient of Determination
- Explanation of the term coefficient of
determination: The coefficient of determination
measures the proportion of the variability in the
dependent variable (y variable) that is explained
by the regression model through the independent
variable (x variable).
54 5-6 The Coefficient of Determination
- The coefficient of determination is obtained by
squaring the value of the correlation
coefficient.
- The notation used is r² or R².
- Note: 0 ≤ R² ≤ 1, or 0% ≤ R² ≤ 100%.
55 5-6 The Coefficient of Determination
- r² or R² close to 1 (or 100%) would imply that
the model is explaining most of the variation in
the dependent variable and may be a very useful
model.
- r² or R² close to 0 (or 0%) would imply that the
model is explaining little of the variation in
the dependent variable and may not be a very
useful model.
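The link between "squaring the correlation coefficient" and "proportion of variability explained" can be checked numerically. The sketch below uses hypothetical data and an illustrative function name (`r_squared_two_ways`, an assumption). It computes R² once as r² and once as 1 − SSE/SST, where SSE is the error sum of squares about the least-squares line and SST is the total sum of squares about the mean of y. SST is standard terminology rather than a term used in the slides.

```python
import math

def r_squared_two_ways(xs, ys):
    """Return (r**2, 1 - SSE/SST) for a simple least-squares line."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)

    # (1) Square of the sample correlation coefficient.
    r = (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2)
    )

    # (2) Share of the variation in y explained by the fitted line.
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope
    b = sy / n - a * sx / n                         # y-intercept
    y_bar = sy / n
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return r ** 2, 1 - sse / sst

# Hypothetical data, not taken from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r2_from_r, r2_from_ss = r_squared_two_ways(xs, ys)
print(f"r^2 = {r2_from_r:.3f}, 1 - SSE/SST = {r2_from_ss:.3f}")
```

Both routes give 0.6 for this data, so the fitted line accounts for about 60% of the variation in y.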
56 Display of the Least-Squares Regression Line Superimposed on the Scatter Plot