Relations between 2 quantitative variables - PowerPoint PPT Presentation

Slides: 47
Provided by: naama6

Transcript and Presenter's Notes

1
Relations between 2 quantitative variables
2
Example
  • Students who apply to MBA programs must write
    the Graduate Management Admission Test (GMAT).
    University admission committees use the GMAT
    scores as one of the critical indicators of how
    well a student is likely to perform in the MBA
    program. However, the GMAT may not be a very
    strong indicator for all MBA programs. Suppose
    that an MBA program designed for middle managers
    who wish to upgrade their skills was launched 3
    years ago. To judge how well the GMAT score
    predicts MBA performance, a sample of 12
    graduates was taken. Their grade point average in
    the MBA program (values from 0 to 12) and the
    GMAT score (values from 200 to 800) are listed in
    the table below and stored in file Xm04-16.

3
(No Transcript)
4
Scatterplots
  • Show the relationship between two paired
    quantitative variables (X and Y).
  • Examples
  • Variable X: age; Variable Y: height
  • Variable X: SAT score; Variable Y: academic
    success
  • Variable X: alcohol level; Variable Y: no. of
    heart diseases

5
  • Data for variable X: x1, x2, ..., xn
  • Data for variable Y: y1, y2, ..., yn
  • Plot (x1, y1), (x2, y2), ..., (xn, yn)

6
Interpreting scatterplots
  • Three important aspects of pattern
  • Form
  • Direction
  • Strength

7
  • For the GMAT-GPA data
  • Form: close to linear association
  • Direction: positive
  • Strength: moderate to low

8
Relationship does not necessarily mean that one
variable causes the other!
  • Two levels of relationship
  • 1. Simple association (including response and
    explanatory variables)
  • 2. Causality

9
Two levels of relationship between variables
  • 1. Simple association
  • Two variables are associated if some values of
    one variable tend to occur more often with some
    values of the second variable than with other
    values of that variable.
  • Examples
  • Statistics course grade and American history
    course grade
  • Cholesterol level and blood pressure
  • No. of babies and number of storks
  • Heights of fathers and their sons
  • GMAT scores and GPA
  • The association between the variables means
    that we can predict one variable using the other.

10
  • Response variable and explanatory variable
  • If we think that a variable x can explain
    changes in variable y, we call x an explanatory
    variable and y a response variable.

For each of the following, if possible, determine
which variable is the explanatory variable and
which is the response variable:
  • Statistics course grade and American history
    course grade
  • Cholesterol level and blood pressure
  • No. of babies and number of storks
  • Heights of fathers and their sons
  • SAT scores and performance at university
11
  • 2. Causality
  • Changes in one (or more) explanatory variables
    cause changes in a response variable.
  • Examples
  • Time spent studying for an exam and the grade on
    the exam
  • Smoking and life expectancy

12
How to quantify the relationship?
The most common way of quantifying a relationship
is due primarily to Karl Pearson: the Pearson
correlation coefficient.
13
Correlation coefficient
  • The correlation coefficient r measures the
    strength of the linear relationship between two
    quantitative variables.
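  • To compute r for two variables X and Y we need
  • the mean and standard deviation of X ($\bar{x}$ and $s_x$),
  • the mean and standard deviation of Y ($\bar{y}$ and $s_y$).
  • r is given by
    $r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$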

14
  • Note that r is a dimensionless number.
  • That is, its value depends only on the
    association between the two variables, not on
    their units or the magnitude of their values.
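
As a rough illustration of the two slides above, the sketch below computes r as the sum of products of standardized values divided by n - 1. This assumes the usual sample formula, since the slide's own formula did not survive in this transcript. The data are the age and score pairs from the talking-age example near the end of the presentation; the function name `pearson_r` and the rescaling of age by 1/12 are illustrative choices, not part of the slides.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation: sum of products of standardized values / (n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / sx) * ((y - mean_y) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

# Age and score pairs from the talking-age example (slide 44).
ages = [15, 26, 10, 9, 15, 20, 18, 11, 8, 20, 7,
        9, 10, 11, 11, 10, 12, 42, 17, 11, 10]
scores = [95, 71, 83, 91, 102, 87, 93, 100, 104, 94, 113,
          96, 83, 84, 102, 100, 105, 57, 121, 86, 100]

r = pearson_r(ages, scores)
# Rescaling one variable (here dividing age by 12) leaves r unchanged,
# because the standardized values do not depend on the units.
r_rescaled = pearson_r([a / 12 for a in ages], scores)

print(round(r, 3))
print(round(r_rescaled, 3))
```

The first printed value can be compared with the r reported on slide 45; the second shows that changing the units of one variable does not change r.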

15
Interpreting the Correlation Coefficient
The correlation coefficient is always between -1
and 1. A positive value of r indicates a positive
relationship; a negative value of r indicates a
negative relationship.
r = -1: perfect negative linear association
r = 0: no linear association
r = 1: perfect positive linear association
16
Correlation r = -0.3
Correlation r = 0
Correlation r = 0
Correlation r = 0.5
Correlation r = -0.7
Correlation r = 0.9
Correlation r = -0.99
17
Outliers affect correlation
[Figure: two scatterplots, one with r = 0.5 and one with r = 0.8]
18
  • The correlation coefficient r ONLY measures a
    linear relationship. If all of the data points
    lie evenly spaced on a circle, you will get
    r = 0, even though the nonlinear relationship
    is perfect.
  • You MUST examine the scatterplot first in order
    to get a sense of what type of relationship, if
    any, is present.
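
The circle case can be checked numerically. The sketch below is only an illustration: it places 12 points evenly around the unit circle (a made-up data set, not from the slides) and computes r using the algebraically equivalent form Sxy / sqrt(Sxx * Syy); the helper name `pearson_r` is an assumption.

```python
from math import cos, sin, pi, sqrt

def pearson_r(xs, ys):
    """Correlation written as Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# 12 points evenly spaced around the unit circle (hypothetical data).
n_points = 12
angles = [2 * pi * k / n_points for k in range(n_points)]
xs = [cos(t) for t in angles]
ys = [sin(t) for t in angles]

r_circle = pearson_r(xs, ys)
# r is zero up to floating-point rounding, although x and y are
# tied together exactly by x**2 + y**2 = 1.
print(abs(r_circle) < 1e-9)
```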

19
Assigning categorical variables to scatterplots
Example
Scatterplot of income versus age
20
To add a categorical variable to a
scatterplot, use a different color or
symbol for each category.
Scatterplot of income versus age classified by sex
21
Regression
  • Correlation measures the direction and the
    strength of linear relationship between 2
    quantitative variables
  • Regression allows us to predict y from the
    knowledge of x.

22
Example
23
We want a line that is as close as possible to
all points. This line will be used to predict y
from x.
24
Fitting a straight line using least squares
25
The method of least squares chooses the line that
makes the sum of squares of these errors as small
as possible. To find this line we must find the
a and b that minimize
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
or, equivalently,
$\sum_{i=1}^{n} (y_i - a - b x_i)^2$
26
  • The equation of the straight line is
    $\hat{y} = a + bx$

where $\hat{y}$ is the predicted value of y.
What is a? The intercept: the value of y when x = 0.
What is b? The slope: the amount of increase in y
for every unit increase in x.
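
A minimal sketch of the least squares idea, assuming the standard closed-form solution (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x), which is not shown in this transcript. The points are synthetic and the names `fit_line` and `sse` are illustrative. The check at the end confirms that moving a or b away from the fitted values makes the sum of squared errors larger.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing sum((y - a - b*x)**2)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

def sse(a, b, xs, ys):
    """Sum of squared vertical errors for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Synthetic points (not from the slides): the line 2 + 3x with an
# alternating +1 / -1 disturbance added.
xs = [0, 1, 2, 3, 4, 5]
ys = [2 + 3 * x + (1 if x % 2 == 0 else -1) for x in xs]

a, b = fit_line(xs, ys)
best = sse(a, b, xs, ys)
# Nudging either coefficient away from the fitted values increases the error.
nudged = [sse(a + da, b + db, xs, ys)
          for da, db in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]]
print(all(best < s for s in nudged))
```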
27
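Fitting a least squares line to the GMAT data
GMAT (x): mean $\bar{x}$ = 632.3, standard deviation $s_x$ = 43.6
GPA (y): mean $\bar{y}$ = 8.867, standard deviation $s_y$ = 1.120
The correlation between GMAT and GPA is r = 0.536
Slope: $b = r \, s_y / s_x$ = 0.536 × 1.120 / 43.6 ≈ 0.01377
Intercept: $a = \bar{y} - b\bar{x}$ = 8.867 - 0.01377 × 632.3 ≈ 0.16
The equation of the least squares line: $\hat{y} = 0.16 + 0.01377x$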
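
The slope and intercept on this slide can be recovered from the summary statistics alone. The sketch below assumes that 632.3 and 43.6 are the mean and standard deviation of GMAT and that 8.867 and 1.120 are the mean and standard deviation of GPA (the labels did not survive in the transcript), and it uses the standard relations b = r * s_y / s_x and a = y_bar - b * x_bar. The function name `predict_gpa` is hypothetical.

```python
# Summary statistics as read from slide 27 (the labels are an assumption).
mean_gmat, sd_gmat = 632.3, 43.6
mean_gpa, sd_gpa = 8.867, 1.120
r = 0.536

slope = r * sd_gpa / sd_gmat              # b = r * s_y / s_x
intercept = mean_gpa - slope * mean_gmat  # a = y_bar - b * x_bar

def predict_gpa(gmat):
    """Predicted GPA from the fitted line a + b * GMAT."""
    return intercept + slope * gmat

print(round(slope, 5), round(intercept, 2))
# The least squares line always passes through the point of means.
print(round(predict_gpa(mean_gmat), 3))
```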
28
Using the least squares line to calculate
predicted y values
Predicted GPA: 8.408, 9.649, 8.201, 8.849, 8.339, 9.015,
9.194, 8.339, 9.939, 8.574, 8.326, 9.566
29
How far is the fitted line from the observations?

Error = observed value of y - predicted value of y
30
Computing the errors
Predicted GPA ($\hat{y}$)   Error ($y - \hat{y}$)
8.408                 1.192
9.649                -0.849
8.201                -0.801
8.849                 1.151
8.339                -0.539
9.015                 0.185
9.194                 0.406
8.339                 0.061
9.939                 1.261
8.574                -0.974
8.326                 0.474
9.566                -1.566
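
A short sketch using only the predicted values and errors listed on slides 28 and 30, assuming they are paired in the order shown. It recovers each observed GPA as prediction plus error, and checks two properties of a least squares fit: the errors sum to roughly zero, and their squares add up to the quantity that the method minimizes.

```python
predicted = [8.408, 9.649, 8.201, 8.849, 8.339, 9.015,
             9.194, 8.339, 9.939, 8.574, 8.326, 9.566]
errors = [1.192, -0.849, -0.801, 1.151, -0.539, 0.185,
          0.406, 0.061, 1.261, -0.974, 0.474, -1.566]

# Observed value = predicted value + error, by the definition of the error.
observed = [round(p + e, 3) for p, e in zip(predicted, errors)]

# For a least squares line with an intercept, the errors sum to about zero
# (not exactly here, because the slide values are rounded).
total_error = sum(errors)
# The sum of squared errors is the quantity that least squares minimizes.
sum_sq_error = sum(e ** 2 for e in errors)

print(observed)
print(round(total_error, 3), round(sum_sq_error, 3))
```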
31
Example of correlation and regression
  • The manager of a small restaurant studied the
    absentee rate of employees. Whenever employees
    called in sick, or simply didn't appear, the
    restaurant had to find replacements in a hurry,
    or else work short-handed. The manager had data
    on the total number of absences of each employee
    per month (y) and the number of months of
    experience at the restaurant (x) for 10
    employees. The manager expected that longer-term
    employees would be more reliable and absent less
    often.

The data
32
Scatterplot of the data
33
  • Correlation between x and y
  • $\bar{x}$ = 22.44, $\bar{y}$ = 5.9
  • $s_x$ = 2.658, $s_y$ = 2.99
  • r = -0.875

34
  • Fitting a straight line

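Slope: $b = r \, s_y / s_x$ = -0.875 × 2.99 / 2.658 ≈ -0.984
Intercept: $a = \bar{y} - b\bar{x}$ = 5.9 + 0.984 × 22.44 ≈ 28.0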
Minitab CORRELATION restaurants.MPJ
35
Minitab output (Stat > Regression > Regression)
Regression Analysis: absences versus experience
The regression equation is
absences = 28.01 - 0.985 experience
36
Minitab output (Stat > Regression > Fitted Line Plot)
  • The regression line

37
  • What number of absences would the model predict
    for a worker who has 22 months of experience?
  • The value of experience is x = 22.
  • The predicted absences according to the model
    are $\hat{y}$ = 28.01 - 0.985(22) ≈ 6.3 days.

38
  • The actual number of absences for an employee
    who has 22 months of experience was 7 days. What
    is the residual (error) for this observation?
  • 7 - 6.334 = 0.666 days
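
The prediction and residual on these two slides can be sketched as follows, using the rounded coefficients from the Minitab output (28.01 and -0.985). With those rounded values the results differ slightly from the slide's 6.334 and 0.666, which were presumably computed from unrounded coefficients. The function names are illustrative.

```python
INTERCEPT, SLOPE = 28.01, -0.985  # rounded coefficients from the Minitab output

def predicted_absences(months_experience):
    """Predicted absences from the fitted line."""
    return INTERCEPT + SLOPE * months_experience

def residual(observed, months_experience):
    """Observed value minus the model's prediction."""
    return observed - predicted_absences(months_experience)

y_hat = predicted_absences(22)
err = residual(7, 22)
print(round(y_hat, 2), round(err, 2))
```

A positive residual means the observed number of absences was higher than the line predicts.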

39
  • For what experience level was the residual the
    largest?
  • Was the Y larger or smaller than what the
    model predicted?
  • Hint: to find the largest residual, look for the
    point on the scatterplot that is furthest from
    the regression line in the vertical direction.
  • For 20 months of experience, the absences were
    larger than predicted.

40
  • Describe how the model relates the experience
    to the absences.
  • The slope is -0.985, or about -1.
  • For each extra month of experience we would
    predict the number of absences to decrease by
    about one day.

41
  • Would you recommend using this model to
    predict the absence days for employees who have
    been at the restaurant for less than one year?
    Why or why not?
  • The answer is no.
  • Explanation
  • The predictions are known to be valid only in
    the range of the explanatory variable used to
    create the prediction equation. In this case,
    that is from 18 to 27.3 months.
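
One way to make the "no extrapolation" rule concrete is to have the prediction function refuse inputs outside the observed range. This is only a sketch: the range 18 to 27.3 months and the coefficients come from the slides, while the function name `predict_within_range` and the choice to raise a `ValueError` are assumptions.

```python
X_MIN, X_MAX = 18.0, 27.3  # range of experience (months) stated on the slide

def predict_within_range(months_experience, intercept=28.01, slope=-0.985):
    """Predict absences, refusing to extrapolate outside the observed x range."""
    if not (X_MIN <= months_experience <= X_MAX):
        raise ValueError(
            f"{months_experience} months is outside the observed range "
            f"[{X_MIN}, {X_MAX}]; the model has not been checked there."
        )
    return intercept + slope * months_experience

print(round(predict_within_range(20), 2))
try:
    # Less than one year of experience: outside the observed range.
    predict_within_range(6)
except ValueError as exc:
    print("refused:", exc)
```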

42
More regression questions
  • Suppose we developed the following least
    squares regression equation: Y = 14 + 1.5X. Which
    of the following statements is correct?
  • A. The equation crosses the Y axis at 14
  • B. The equation crosses the Y axis at 1.5
  • C. The equation crosses the Y axis at 0
  • D. None of the above

43
  • Couples who share more similar attitudes
    indicate that they are more satisfied with their
    relationship.
  • This reflects a ___________ correlation.
  • positive / negative

Answer: positive
44
A study was conducted to examine whether the age
at which a child begins to talk can predict a later
Gesell score on a test of mental ability.
Age  Score
15   95
26   71
10   83
9    91
15   102
20   87
18   93
11   100
8    104
20   94
7    113
9    96
10   83
11   84
11   102
10   100
12   105
42   57
17   121
11   86
10   100
The regression equation is score = 109.874 -
1.12699 age. Draw the straight line. Hint:
to draw a straight line you need 2
points. Use ages 10 and 40 as your points.
Age  Predicted score
10   ____________
40   ____________
Answers: 98.604 (age 10), 64.7944 (age 40)
45
Regression line: score = 109.874 - 1.12699 age
Draw the straight line
(we need just 2 points to draw the line)
Age  Predicted score
10   109.874 - 1.12699(10) = 98.604
40   109.874 - 1.12699(40) = 64.7944
r = -0.640
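
As a check on this example, the sketch below refits the line from the 21 age and score pairs listed on slide 44, assuming the standard closed-form least squares formulas, and then evaluates it at ages 10 and 40. The variable names are illustrative.

```python
# Age and score pairs from slide 44.
ages = [15, 26, 10, 9, 15, 20, 18, 11, 8, 20, 7,
        9, 10, 11, 11, 10, 12, 42, 17, 11, 10]
scores = [95, 71, 83, 91, 102, 87, 93, 100, 104, 94, 113,
          96, 83, 84, 102, 100, 105, 57, 121, 86, 100]

n = len(ages)
mean_age, mean_score = sum(ages) / n, sum(scores) / n
# Least squares slope and intercept.
slope = (sum((x - mean_age) * (y - mean_score) for x, y in zip(ages, scores))
         / sum((x - mean_age) ** 2 for x in ages))
intercept = mean_score - slope * mean_age

# Two points are enough to draw the line.
line_points = {age: intercept + slope * age for age in (10, 40)}

print(round(intercept, 3), round(slope, 5))
print({age: round(score, 3) for age, score in line_points.items()})
```

The printed coefficients and the two predicted scores can be compared with the values given on the slide.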
46
Right or wrong?
  • When correlation between variables x and y is
    very high, it implies that variable x causes y.
  • It is o.k. to predict the y value of a point that
    is outside the range of the x observations.
  • A correlation of -0.93 means a very weak linear
    relationship.
  • An influential observation should always be
    removed from the data.
  • The following linear equation describes the
    relationship between sales of a product (in
    thousands of dollars) and days of advertising:
    sales = 3 + 2x. This means that for every extra
    day of advertising we obtain an extra 3 thousand
    dollars.