Title: Relations between 2 quantitative variables
1Relations between 2 quantitative variables
2Example
- Students who apply to MBA programs must write
the Graduate Management Admission Test (GMAT).
University admission committees use the GMAT
scores as one of the critical indicators of how
well a student is likely to perform in the MBA
program. However, the GMAT may not be a very
strong indicator for all MBA programs. Suppose
that an MBA program designed for middle managers
who wish to upgrade their skills was launched 3
years ago. To judge how well the GMAT score
predicts MBA performance, a sample of 12
graduates was taken. Their grade point average in
the MBA program (values from 0 to 12) and the
GMAT score (values from 200 to 800) are listed in
the table below and stored in file Xm04-16.
4Scatterplots
- Show the relationship between two paired quantitative variables (X and Y).
- Examples
- Variable X: age; Variable Y: height
- Variable X: SAT score; Variable Y: academic success
- Variable X: alcohol level; Variable Y: no. of heart diseases
5- Data for variable X: x1, x2, ..., xn
- Data for variable Y: y1, y2, ..., yn
- Plot the points (x1, y1), (x2, y2), ..., (xn, yn)
6Interpreting scatterplots
- Three important aspects of pattern
- Form
- Direction
- Strength
7- For the GMAT-GPA data
- Form: close to linear association
- Direction: positive
- Strength: moderate to low
8Relationship does not necessarily mean that one variable causes the other!
- Two levels of relationship
- 1. Simple association (and response variable and explanatory variable)
- 2. Causality
9Two levels of relationship between variables
- 1. Simple association
- Two variables are associated if some values of
one variable tend to occur more often with some
values of the second variable than with other
values of that variable.
- Examples
- Statistics course grade and American history course grade
- Cholesterol level and blood pressure
- No. of babies and number of storks
- Heights of fathers and their sons
- GMAT scores and GPA
- The association between the variables means that we can predict one variable using the other.
10- Response variable and explanatory variable
- If we think that a variable x can explain changes in variable y, we call x an explanatory variable and y a response variable.
For each of the following, if possible, determine which variable is the explanatory variable and which is the response variable:
- Statistics course grade and American history course grade
- Cholesterol level and blood pressure
- No. of babies and number of storks
- Heights of fathers and their sons
- SAT scores and performance at university
11- 2. Causality
- Changes in one (or more) explanatory variables cause changes in a response variable.
- Examples
- Time spent studying for an exam and the grade on the exam
- Smoking and life expectancy
12How to quantify the relationship?
The most common way of quantifying a relationship is due primarily to Karl Pearson: the Pearson correlation coefficient.
13Correlation coefficient
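- The correlation coefficient r measures the strength of the linear relationship between two quantitative variables.
- To compute r for two variables X and Y we need
- Mean and standard deviation of X: $\bar{x}$, $s_x$
- Mean and standard deviation of Y: $\bar{y}$, $s_y$
- r is given by
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$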
14- Note that r is a dimensionless number.
- That is, its value depends only on the association between the two variables, not on their units or the magnitude of their values.
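To make the "dimensionless" point concrete, here is a minimal sketch that computes r by standardizing each variable, multiplying the standardized values, summing, and dividing by n - 1. The height and weight numbers are hypothetical and are not taken from the slides; the helper names are illustrative only.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    # Sample standard deviation (divides by n - 1).
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    # Average product of standardized x and y values, using n - 1.
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data: heights in inches, weights in pounds.
height_in = [60.0, 62.0, 65.0, 68.0, 71.0]
weight_lb = [115.0, 120.0, 140.0, 155.0, 170.0]
r_original = correlation(height_in, weight_lb)

# Same measurements converted to centimetres and kilograms.
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.4536 for w in weight_lb]
r_rescaled = correlation(height_cm, weight_kg)

# The two values agree up to floating-point rounding.
print(r_original, r_rescaled)
```

Changing the units rescales the deviations and the standard deviations by the same factor, so the standardized values, and therefore r, stay the same.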
15Interpreting the Correlation Coefficient
The correlation coefficient r is always between -1 and 1. A positive value of r indicates a positive relationship; a negative value of r indicates a negative relationship.
r = -1: perfect negative linear association
r = 0: no linear association
r = 1: perfect positive linear association
16Scatterplot panels with correlations r = -0.3, r = 0, r = 0, r = 0.5, r = -0.7, r = 0.9 and r = -0.99
17Outliers affect correlation
Two scatterplots, labelled r = 0.5 and r = 0.8
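As a rough illustration of how much a single point can move r, the sketch below computes the correlation for a small hypothetical data set (not the data behind the slide's plots) with and without one outlying point.

```python
from math import sqrt

def correlation(xs, ys):
    # r = Sxy / sqrt(Sxx * Syy), with sums of deviations from the means.
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data with a clear positive trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
r_without = correlation(xs, ys)

# Add one point far from the pattern of the others.
r_with = correlation(xs + [10], ys + [0])

print("without outlier:", r_without)
print("with outlier:   ", r_with)
```

In this made-up example one extra point is enough to turn a fairly strong positive correlation into a negative one, which is why the scatterplot should be inspected before r is interpreted.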
18- The correlation coefficient r ONLY measures a linear relationship. If all the data points are spread evenly around a circle, you will get r = 0, even though the nonlinear relationship is perfect.
- You MUST examine the scatterplot first in order to get a sense of what type of relationship, if any, is present.
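The circle case can be checked numerically. The following sketch places 12 points at equally spaced angles on the unit circle (the number of points and the radius are arbitrary choices for illustration) and computes r.

```python
from math import cos, sin, pi, sqrt

def correlation(xs, ys):
    # r = Sxy / sqrt(Sxx * Syy), with sums of deviations from the means.
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Twelve points equally spaced around the unit circle: x^2 + y^2 = 1 exactly.
n_points = 12
xs = [cos(2 * pi * k / n_points) for k in range(n_points)]
ys = [sin(2 * pi * k / n_points) for k in range(n_points)]

r_circle = correlation(xs, ys)
print(r_circle)  # zero up to floating-point rounding
```

Every point satisfies an exact relationship between x and y, yet r is essentially zero because that relationship is not linear.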
19Assigning categorical variables to scatterplots
Example
Scatterplot of income versus age
20- To add a categorical variable to a scatterplot, use a different color or symbol for each category.
Scatterplot of income versus age classified by sex
21Regression
- Correlation measures the direction and the strength of the linear relationship between 2 quantitative variables.
- Regression allows us to predict y from the knowledge of x.
22Example
23We want a line that is as close as possible to
all points. This line will be used to predict y
from x.
24Fitting a straight line using least squares
25The method of least squares chooses the line that makes the sum of squares of these errors as small as possible. To find this line we must find the a and b that minimize
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
or, equivalently,
$$\sum_{i=1}^{n} (y_i - a - b x_i)^2$$
26- The equation of the straight line is
$$\hat{y} = a + b x$$
where $\hat{y}$ is the predicted value of y.
What is a? Intercept: the value of y when x = 0.
What is b? Slope: the amount of increase in y for every one-unit increase in x.
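A minimal sketch of a least squares fit, assuming the standard closed-form solution b = Sxy / Sxx and a = (mean of y) - b (mean of x). The data are hypothetical, and the function names are illustrative only. The last lines check that nudging a or b away from the fitted values makes the sum of squared errors larger.

```python
def least_squares(xs, ys):
    # Closed-form least squares estimates for the line y-hat = a + b * x.
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

def sum_squared_errors(a, b, xs, ys):
    # Sum over all points of (observed y - predicted y)^2.
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, roughly on a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
best = sum_squared_errors(a, b, xs, ys)

# Any other line has a larger sum of squared errors.
worse_intercept = sum_squared_errors(a + 0.1, b, xs, ys)
worse_slope = sum_squared_errors(a, b + 0.1, xs, ys)

print("intercept a =", a, "slope b =", b)
print(best, worse_intercept, worse_slope)
```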
27Fitting a least squares line to the GMAT data
GMAT (x): mean 632.3, standard deviation 43.6
GPA (y): mean 8.867, standard deviation 1.120
The correlation between GMAT and GPA is r = 0.536
Slope
Intercept
The equation of the least squares line
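The slope and intercept can be recovered from the summary statistics alone, assuming the standard identities b = r * s_y / s_x and a = (mean of y) - b * (mean of x). The sketch below also assumes that 632.3 and 43.6 are the mean and standard deviation of GMAT, and that 8.867 and 1.120 are the mean and standard deviation of GPA; that reading is inferred from the scales of the two variables. The results may differ slightly from the slide's values because the inputs are rounded.

```python
# Summary statistics (assignment to variables is an assumption, see above).
gmat_mean, gmat_sd = 632.3, 43.6
gpa_mean, gpa_sd = 8.867, 1.120
r = 0.536

# Least squares slope and intercept from summary statistics.
slope = r * gpa_sd / gmat_sd
intercept = gpa_mean - slope * gmat_mean

print(f"slope     = {slope:.5f}")
print(f"intercept = {intercept:.3f}")
print(f"predicted GPA = {intercept:.3f} + {slope:.5f} * GMAT")
```

The slope is small because a one-point change in GMAT score is a very small change on a 200 to 800 scale.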
28Using the least squares line to calculate predicted y values
Predicted GPA for the 12 graduates:
8.408, 9.649, 8.201, 8.849, 8.339, 9.015,
9.194, 8.339, 9.939, 8.574, 8.326, 9.566
29How far is the fitted line from the observations?
Error = observed value of y - predicted value of y
30Computing the errors
Predicted GPA (ŷ)   Error (y - ŷ)
8.408               1.192
9.649               -0.849
8.201               -0.801
8.849               1.151
8.339               -0.539
9.015               0.185
9.194               0.406
8.339               0.061
9.939               1.261
8.574               -0.974
8.326               0.474
9.566               -1.566
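Because error = observed - predicted, the predicted values and errors listed above are enough to rebuild the observed GPAs and to check a basic property of a least squares fit: the errors sum to (approximately) zero. The sketch assumes the two lists are in the same order.

```python
# Predicted GPAs and errors as listed on the slides (assumed to be in matching order).
predicted = [8.408, 9.649, 8.201, 8.849, 8.339, 9.015,
             9.194, 8.339, 9.939, 8.574, 8.326, 9.566]
errors = [1.192, -0.849, -0.801, 1.151, -0.539, 0.185,
          0.406, 0.061, 1.261, -0.974, 0.474, -1.566]

# error = observed - predicted, so observed = predicted + error.
observed = [round(p + e, 3) for p, e in zip(predicted, errors)]

# For a least squares line with an intercept, the errors sum to zero
# (here only approximately, because the listed values are rounded).
error_sum = sum(errors)

# The quantity that least squares minimizes.
sse = sum(e ** 2 for e in errors)

print("observed GPAs:", observed)
print("sum of errors:", round(error_sum, 3))
print("sum of squared errors:", round(sse, 3))
```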
31Example of correlation and regression
- The manager of a small restaurant studied the
absentee rate of employees. Whenever employees
called in sick, or simply didn't appear, the
restaurant had to find replacements in a hurry,
or else work short-handed. The manager had data
on the total number of absences of each employee
per month (y) and the number of months of
experience at the restaurant (x) for 10
employees. The manager expected that longer-term
employees would be more reliable and absent less
often.
The data
32Scatterplot of the data
33- Correlation between x and y
- Mean of x = 22.44, mean of y = 5.9
- Sx = 2.658, Sy = 2.99
r = -0.875
34Slope
Intercept
Minitab CORRELATION restaurants.MPJ
35Minitab output: Stat > Regression > Regression
Regression Analysis: absences versus experience
The regression equation is
absences = 28.01 - 0.985 experience
36Minitab output: Stat > Regression > Fitted Line Plot
37- What number of absences would the model predict for a worker who has 22 months of experience?
- The value of experience is x = 22.
- The predicted number of absences according to the model is about 6.334 days.
38- The actual number of absences for an employee who has 22 months of experience was 7 days. What is the residual (error) for this observation?
- 7 - 6.334 = 0.666 days
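The prediction and residual can be reproduced from the regression equation reported by Minitab, absences = 28.01 - 0.985 experience. With these rounded coefficients the prediction comes out as 6.34 rather than 6.334; the small gap is presumably because the slide's figure was computed from unrounded coefficients, which is an assumption here.

```python
# Coefficients as printed in the Minitab regression equation (rounded).
INTERCEPT = 28.01
SLOPE = -0.985

def predict_absences(experience_months):
    # Predicted absences from the fitted line.
    return INTERCEPT + SLOPE * experience_months

experience = 22
observed_absences = 7

predicted = predict_absences(experience)
residual = observed_absences - predicted  # observed minus predicted

print(f"predicted absences: {predicted:.2f}")
print(f"residual:           {residual:.2f}")
```

A positive residual means the employee was absent more often than the line predicts.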
39- For what experience level was the residual the largest?
- Was the Y larger or smaller than what the model predicted?
- Hint: to find the largest residual, look for the point on the scatterplot that is furthest from the regression line in the vertical direction.
- For 20 months of experience the absences were larger than predicted.
40- Describe how the model relates the experience to the absences.
- The slope is -0.985, or about -1.
- For each extra month of experience we would predict the number of absences to decrease by about one day.
41- Would you recommend using this model to predict the absence days for employees who have been at the restaurant for less than one year? Why or why not?
- The answer is no.
- Explanation
- The predictions are known to be valid only in the range of the explanatory variable used to create the prediction equation. In this case, that is from 18 to 27.3 months.
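One way to make the "no extrapolation" rule hard to forget is to build it into the prediction function. The sketch below uses the rounded Minitab coefficients and the 18 to 27.3 month range stated above; the function name and the choice to raise an error are illustrative, not part of the original analysis.

```python
# Range of experience (in months) covered by the data, as stated in the text.
X_MIN, X_MAX = 18.0, 27.3

def predict_absences(experience_months):
    """Predict absences from the fitted line, refusing to extrapolate."""
    if not (X_MIN <= experience_months <= X_MAX):
        raise ValueError(
            f"{experience_months} months is outside the observed range "
            f"[{X_MIN}, {X_MAX}]; the model is not known to be valid there."
        )
    return 28.01 - 0.985 * experience_months

print(predict_absences(22))      # inside the observed range

try:
    predict_absences(6)          # less than one year: outside the range
except ValueError as exc:
    print("refused:", exc)
```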
42More regression questions
- Suppose we developed the following least squares regression equation: Y = 14 + 1.5X. Which of the following statements is correct?
- A. The equation crosses the Y axis at 14
- B. The equation crosses the Y axis at 1.5
- C. The equation crosses the Y axis at 0
- D. None of the above
43- Couples who share more similar attitudes indicate that they are more satisfied with their relationship.
- This reflects a ___________ correlation.
- positive / negative
Answer: positive
44A study was conducted to examine whether the age at which a child begins to talk can predict a later Gesell score on a test of mental ability.
Age   Score
15    95
26    71
10    83
9     91
15    102
20    87
18    93
11    100
8     104
20    94
7     113
9     96
10    83
11    84
11    102
10    100
12    105
42    57
17    121
11    86
10    100
The regression equation is score = 109.874 - 1.12699 age. Draw the straight line. Hint: in order to draw a straight line you need 2 points. Use ages 10 and 40 as your points.
Age   Predicted score
10    98.604
40    64.7944
45Regression line: score = 109.874 - 1.12699 age
Draw the straight line
(we need just 2 points to draw the line)
Age   Predicted score
10    109.874 - 1.12699(10) = 98.604
40    109.874 - 1.12699(40) = 64.7944
r = -0.640
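The regression equation and correlation for the age and score data can be checked directly from the 21 listed pairs. This is a minimal sketch using the closed-form least squares formulas; the variable names are illustrative.

```python
from math import sqrt

# (age, score) pairs as listed in the example.
data = [(15, 95), (26, 71), (10, 83), (9, 91), (15, 102), (20, 87),
        (18, 93), (11, 100), (8, 104), (20, 94), (7, 113), (9, 96),
        (10, 83), (11, 84), (11, 102), (10, 100), (12, 105), (42, 57),
        (17, 121), (11, 86), (10, 100)]

ages = [a for a, _ in data]
scores = [s for _, s in data]
n = len(data)

age_mean = sum(ages) / n
score_mean = sum(scores) / n

# Sums of squares and cross-products of deviations from the means.
s_xy = sum((a - age_mean) * (s - score_mean) for a, s in data)
s_xx = sum((a - age_mean) ** 2 for a in ages)
s_yy = sum((s - score_mean) ** 2 for s in scores)

slope = s_xy / s_xx
intercept = score_mean - slope * age_mean
r = s_xy / sqrt(s_xx * s_yy)

# Two points are enough to draw the line.
pred_10 = intercept + slope * 10
pred_40 = intercept + slope * 40

print(f"score = {intercept:.3f} + ({slope:.5f}) * age, r = {r:.3f}")
print(f"age 10 -> {pred_10:.3f}, age 40 -> {pred_40:.3f}")
```

Plotting the two predicted points, (10, 98.6) and (40, 64.8), and joining them gives the fitted line.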
46Right or wrong?
- When the correlation between variables x and y is very high, it implies that variable x causes y.
- It is o.k. to predict the y value of a point that is outside the range of the x observations.
- A correlation of -0.93 means a very weak linear relationship.
- An influential observation should always be removed from the data.
- The following linear equation describes the relationship between sales of a product (in thousands of dollars) and days of advertising: sales = 3 + 2x. This means that for every extra day of advertising we obtain an extra 3 thousand dollars.