Title: BIVARIATE DATA
1Chapter 5
2BIVARIATE DATA
- The study designs considered so far investigated
only one characteristic of a population. They are
single-variable studies.
- Many study designs aim to look for an association
between two quantitative variables measured on
the same subject. These are bivariate study
designs. The prefix bi means two.
3Scatterplots
- Scatterplots are the most useful graphical device
to examine the possible association between two
quantitative variables.
- Scatterplots help us identify
- Trends
- Outliers
4Linear Relationship
- A scatterplot of systolic blood pressure and age
for 29 subjects.
5Constructing a Scatterplot
- A scatterplot is a two-dimensional display
- The data for one variable are plotted on the
horizontal axis and the data for the other
variable are plotted on the vertical axis.
- The convention is to place the studied variable
(the response variable) on the vertical axis (y-axis). The
variable that is used to do the predicting (the
explanatory variable) is placed on the horizontal
axis (x-axis).
6Response and Explanatory Variables
- Some studies are conducted to predict the value
of one variable using the value of another
associated variable. In these studies we can
identify response and explanatory variables.
- Others are conducted simply to look for potential
associations between two variables. In
these studies the choice of response and
explanatory variables is arbitrary.
7Examples
- Decide which variable is the response and which
is the explanatory variable
- Serving size of an ice cream cone and the
calories of the ice cream cone
- Explanatory: size. Response: calories
- The gas mileage of an automobile and the weight
of the automobile
- Explanatory: weight. Response: mpg
- The price of a theatre ticket and the number of
ticket sales
- Explanatory: price. Response: tickets sold
- The age at marriage of the woman and the age at
marriage of the man
- No clear choice, just an association
8Linear Relationship
- A scatterplot of systolic blood pressure and age
for 29 subjects.
- Points rise to the right
- Positive association
9Example Car Fuel Efficiency
Fuel_efficiency.xls
10Scatterplot for Fuel Efficiency
- Points fall to the right
- Negative association
11Positive and Negative Trends
- Two variables are said to have a positive (or
direct) association if larger values of one
variable occur with larger values of the other
variable
- They are said to have a negative (or inverse)
association if smaller values of one variable
occur with larger values of the other variable
12Examples
- Classify each association in the following slides
as a strong, moderate, weak or no association
- If there is an association, go on to classify
the association as either a positive or negative
association
13Classify Graph 1
- Moderate, positive association
14Classify Graph 2
- Moderate, negative association
15Classify Graph 3
- Strong, positive association
16Classify Graph 4
- Strong, negative association
17Classify Graph 5
18Non-linear Relationship
- Monthly temperatures in Raleigh, N.C.
19Measuring the Strength of a Linear Relationship
- We have seen that associations can be
- Strong, weak or non-existent
- Positive or negative
- Linear or non-linear
- Now we want to numerically measure the
strength of linear relationships
20Correlation Coefficient
- The correlation coefficient, r, measures the
strength of a linear association
- It is always the case that -1 ≤ r ≤ 1
- If r = 1 then there is perfect positive
association
- If r = -1 then there is perfect negative
association
- We only quote a correlation coefficient if
there is a linear relationship!
21Correlation Guessing Game
- http://www.stat.uiuc.edu/courses/stat100/java/GCApplet/GCAppletFrame.html
22Computational Formula
- The computational formula for the sample
correlation coefficient is
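One commonly used computational form is written below in terms of sums over the n data pairs (x, y). The layout on the original slide may differ, so treat this as a standard textbook version rather than a copy of it.

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}
```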
23Good News
- Graphing and statistical calculators, as well as
programs like Excel and StatCrunch, provide the
correlation coefficient as one of their options
when doing bivariate analysis
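As a rough illustration of what such tools compute, here is a minimal Python sketch of the sums-based formula for r. The function name `correlation` and the weight and mpg values are hypothetical, made up for illustration. They are not taken from Fuel_efficiency.xls.

```python
import math

def correlation(x, y):
    """Sample correlation coefficient r from sums of x, y, xy, x^2 and y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: car weight (thousands of pounds) and fuel efficiency (mpg).
weight = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [36, 31, 28, 22, 19]

r = correlation(weight, mpg)
print(round(r, 3))  # negative, because mpg falls as weight rises
```

A value of r close to -1 here would go with a scatterplot whose points fall steeply and tightly to the right.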
24Example Car Fuel Efficiency
Fuel_efficiency.xls
25Scatterplot for Fuel Efficiency
- Describe the association and estimate r
26Results
r = -0.816
27Association Does Not Mean Causation!
Life_expectancy.xls
28Results
r = -0.789
29Question
- The Number of TVs per Person has a strong
negative correlation with Life Expectancy
- Does this mean buying more TVs will increase
life expectancy?
- Association does not mean Causation!
30Linear Models
- The correlation coefficient measures the strength
of the linear relationship between two
quantitative variables x and y.
- A linear equation describing how a dependent
variable, y, is associated with an explanatory
variable, x, looks like
- y = b + mx
31Example
- A college charges a basic fee of $100 a semester
for a meal plan plus $2 a meal. The linear
equation describing the association between the
cost of the meal plan, y, and the number of meals
purchased, x, is
- y = 100 + 2x
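The meal plan equation can be tried out with a few values of x. This is a small sketch; the function name `meal_plan_cost` and its parameter names are made up for illustration.

```python
def meal_plan_cost(meals, base_fee=100, price_per_meal=2):
    """Cost y of the meal plan in dollars for x = meals purchased."""
    return base_fee + price_per_meal * meals

# The y-intercept is the cost with no meals; each extra meal adds the slope.
for meals in (0, 1, 50, 100):
    print(meals, meal_plan_cost(meals))
```

The fixed fee plays the role of the y-intercept and the price per meal plays the role of the slope.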
32Linear Equations
- A linear equation takes the form
- y = b + mx
- m = slope
- b = y-intercept
- The slope measures the rate of change of y
with respect to x
- The y-intercept measures the initial value of y
(value of y when x = 0)
33Linear Modeling
- Rarely does an exact linear relationship exist
between two studied variables.
- The correlation coefficient and the scatterplot
help us decide if there is a reasonably strong
linear relationship between two studied variables.
34Airfare from Baltimore, MD (1995 data)
Airfare.xls
35What is the Line of Best Fit?
- What properties do we want the line we fit to the
data to have?
- It should be as close as possible to the data
- To decide if one line is better than another we
could measure how far data values are from the
line
- We will use the symbol ŷ (y-hat) to designate a
predicted y-value
36Scatter Plot
Y = 278
Residual for Dallas
The airfare for Dallas is $52 higher than
predicted
37Residuals
- To find a residual we take the observed y-value
and subtract the predicted y-value
- residual = observed y - predicted y
- Positive residuals imply the observed value is
higher than predicted
- Negative residuals imply the observed value is
less than predicted
38Minimize the Residuals
- It would seem like a good criterion would be to
have the line of best fit be one that minimizes the
residuals.
- Since residuals can be both positive and negative
we don't want to just add them up. Cancellation
of positive and negative residuals would occur.
39Remember the Standard Deviation?
- Remember how when we defined the standard
deviation we squared the difference between a
data value and the mean to create a non-negative
quantity?
40Do it Again
- We apply the same process to create non-negative
quantities from the residuals. We square them
- Then we look for the line that minimizes the sum
of these squared residuals
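To make the criterion concrete, the sketch below compares two candidate lines by their sum of squared residuals. The distances, airfares and both candidate lines are hypothetical values chosen for illustration, not the Airfare.xls data.

```python
def sum_squared_residuals(x, y, intercept, slope):
    """Add up (observed y - predicted y)^2 for the line y = intercept + slope * x."""
    total = 0.0
    for xi, yi in zip(x, y):
        predicted = intercept + slope * xi
        residual = yi - predicted
        total += residual ** 2
    return total

# Hypothetical distances (miles) and airfares (dollars).
distance = [200, 400, 600, 800, 1000]
airfare = [110, 125, 160, 170, 205]

sse_line_a = sum_squared_residuals(distance, airfare, intercept=80, slope=0.12)
sse_line_b = sum_squared_residuals(distance, airfare, intercept=100, slope=0.08)

# The line with the smaller sum fits these points more closely.
print(sse_line_a, sse_line_b)
```

The least squares line is the one line for which this sum is smaller than for every other candidate.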
41How it Works
- Minimize Σ(observed y - predicted y)², the sum of the
squared residuals
- It's a good plan, but you need calculus to carry
it out.
- The end results are actually quite easy to use. All
that's important is that you realize the formulas
that will be presented minimize the sum of the
squared residuals
42The Formulas
- The methods of calculus can be used to find
equations for the slope and y-intercept of the
least squares line. Here are the results.
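The standard textbook results are given below, written with m for the slope and b for the y-intercept. The original slide may have presented them in a different but equivalent form. Here s_x and s_y are the sample standard deviations of x and y, and x-bar and y-bar are the sample means.

```latex
m = r\,\frac{s_y}{s_x}
  = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {n\sum x^{2} - \left(\sum x\right)^{2}},
\qquad
b = \bar{y} - m\,\bar{x}
```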
43Least Squares Line for Airfare Data
- Distance (x)
- Airfare (y)
- r = .795
44Prediction Equation
- Airfare = 83.53 + (.117)(distance)
- Each additional 100 miles costs an additional
$11.70
45A Few Residuals
Pittsburgh
- Distance from Baltimore = 210 miles
- Airfare = $138
- Predicted airfare = .117(210) + 83.53 = 107.92
- Residual = 138 - 107.92 = 30.08
- The airfare from Pittsburgh to Baltimore
is $30.08 more than you would expect based on
the distance between these cities
St. Louis
- Distance from Baltimore = 737 miles
- Airfare = $98
- Predicted airfare = .117(737) + 83.53 = 169.77
- Residual = 98 - 169.77 = -71.77
- The airfare from St. Louis to Baltimore is
$71.77 less than you would expect based on the
distance between these cities
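The same residual calculations can be sketched in Python using the rounded coefficients quoted in the prediction equation (intercept 83.53, slope .117). Because those coefficients are rounded, the results can differ by a few cents from the figures quoted above, which were presumably computed from unrounded software output. The function names are made up for illustration.

```python
def predicted_airfare(distance_miles, intercept=83.53, slope=0.117):
    """Predicted airfare in dollars from the rounded prediction equation."""
    return intercept + slope * distance_miles

def residual(observed, predicted):
    """Observed value minus predicted value."""
    return observed - predicted

# (distance from Baltimore in miles, observed airfare in dollars)
cities = {"Pittsburgh": (210, 138), "St. Louis": (737, 98)}

for city, (distance, fare) in cities.items():
    fitted = predicted_airfare(distance)
    print(city, round(fitted, 2), round(residual(fare, fitted), 2))
```

A positive residual means the observed fare is above the line, and a negative residual means it is below.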
46Coefficient of Determination
- The coefficient of determination is the
correlation coefficient squared
- It can help us determine how good the least
squares line is as a prediction equation
- It is the fractional amount of total variation in
y that can be explained by the linear
relationship with x
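The two descriptions above, "r squared" and "fraction of variation explained", can be checked against each other numerically. The sketch below fits a least squares line to hypothetical data (not the airfare data set) and computes the coefficient of determination both ways. All function and variable names are made up for illustration.

```python
import math

def r_squared_two_ways(x, y):
    """Return (r squared, 1 - SSE/SST) for the least squares line through (x, y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    sst = sum((b - mean_y) ** 2 for b in y)  # total variation in y

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left over after fitting the line (sum of squared residuals).
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

    r = sxy / math.sqrt(sxx * sst)
    return r ** 2, 1 - sse / sst

# Hypothetical distances (miles) and airfares (dollars).
distance = [200, 400, 600, 800, 1000]
airfare = [110, 125, 160, 170, 205]

squared_r, explained_fraction = r_squared_two_ways(distance, airfare)
print(round(squared_r, 4), round(explained_fraction, 4))
```

Both numbers agree for a least squares line, which is why r squared can be read as the fraction of the variation in y explained by x.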
47Airfare Data
- r = .795
- r² = .63
- 63% of the variability in airfare cost can be
explained by a linear relationship with distance
- 37% of the variability in airfare cost is due to
factors other than distance
48Age and Systolic Blood Pressure for 30 Adults
SBP.xls
49Approximate Positive Linear Relationship
50Equation of Fitted Line
SBP = 98.7 + 0.97(AGE)
y = 98.7 + 0.97x
51Interpretation of the Slope
- The slope of the SBP vs Age fitted equation is
0.97
- 0.97 = rate of change of SBP with respect to age
- For each additional year of age, predicted blood
pressure rises by approximately 0.97 units.
52Analysis of Variability
- r = .657 and r² = .43
- 43% of the variability in SBP can be explained by
a linear relationship with age
- 57% of the variability in SBP is due to factors
other than age
53Interpreting Residuals
Subject 2
- Age = 47
- SBP = 220
- Predicted SBP = 98.71 + .97(47) = 144.35
- Residual = 220 - 144.35 = 75.65
- This subject's SBP is 75.65 units higher
than would be expected for his age
Subject 23
- Age = 39
- SBP = 120
- Predicted SBP = 98.71 + .97(39) = 136.58
- Residual = 120 - 136.58 = -16.58
- This subject's SBP is 16.58 units lower
than would be expected for his age
54More Good News
- Many computer programs, including Excel and
StatCrunch, as well as graphing calculators,
provide the slope and y-intercept of the least
squares line
57Only selected items are relevant
Regression equation
Correlation coefficient
Coefficient of determination
Not relevant
Not relevant
Look for these in the data table
58Notice, the dialog box that presents the summary
statistics has a Next button. Press next to get
the scatterplot
Look back at the data table and notice
the predicted (fitted) and residual values
are now included