Title: Descriptive methods in regression and correlation
1 MATH 1530 Elements of Statistics, Dr. Kirsten Boyd
- Chapter 4
- Descriptive Methods in Regression and Correlation
Slides adapted from Ms. Smyth, Dr. Griffy, and the Weiss text
3 Sec. 4.1 Linear Equations with One Independent Variable
- A linear equation with one independent variable has the form y = b0 + b1x. Its graph is a straight line with y-intercept b0 and slope b1. This is the same as the y = mx + b slope-intercept form you probably saw in algebra class, but with different letters.
- Example: y = 5 + 2x
- The y-intercept, b0, is?
- The slope, b1, is?
- Graph this line.
4 Slope
Word-problem interpretation of slope: Whenever x increases by one unit, y increases by b1 units (or decreases if b1 is negative, or stays the same if b1 is zero).
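A minimal Python sketch (not from the slides) that tabulates the example line y = 5 + 2x from Sec. 4.1; notice that y changes by b1 = 2 units each time x increases by one unit.

# Example line from slide 3: y = 5 + 2x, so b0 = 5 and b1 = 2.
def y(x, b0=5, b1=2):
    return b0 + b1 * x

for x in range(4):
    print(x, y(x))   # each 1-unit increase in x raises y by b1 = 2 units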
5 Problem 4.6, page 160
- A repair shop charges $55 per hour plus a $30 service charge. Let x denote the number of hours required for the job and let y denote the total cost to the customer.
- Part a. Find the equation that expresses y in terms of x.
- Part b. Determine b0 and b1.
- Part c. Construct a table like Table 4.1 on p. 157 for the x-values 0.5, 1, and 2.25 hours.
- Part d. Draw the graph of the equation from Part a. by plotting the points from Part c. and connecting them with a line.
- Part e. Use the graph from Part d. to estimate visually the cost of a job that takes 1.75 hours. Then calculate the cost exactly, using the equation from Part a.
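A short Python sketch (not part of the assignment) for checking Parts a, c, and e; the equation y = 30 + 55x follows from the $30 service charge and $55 hourly rate stated above.

# Total cost: y = 30 + 55x, so b0 = 30 and b1 = 55.
def cost(hours):
    return 30 + 55 * hours

for x in [0.5, 1, 2.25]:       # Part c: table of (x, y) pairs
    print(x, cost(x))
print(cost(1.75))              # Part e: exact cost of a 1.75-hour job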
6 Sec. 4.2 The Regression Equation
- Regression equations explain a (linear) pattern in scatterplot data
- x is the explanatory or predictor variable
- y is the response variable
7 Scatterplot (Table 4.2 and Fig. 4.7, page 162)
8 Regression Equation
- The goal is to construct a line with the smallest possible distances from the data points to the corresponding points on the line.
- This line is the graph of the regression equation.
9 Example 4.3 (p. 163) Which Line Is Better?
10 Example 4.3 Comparing Lines
The sum of the squared errors is less for Line B than for Line A, so Line B is a better fit than Line A. Error: e = y - ŷ.
Notation: For any x-value, y is the actual (observed) value and ŷ is the value predicted by the line.
11 Best-Fitting Line Possible
12 Computing the Regression Equation
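The formulas from this slide are not in the transcript; the Python sketch below uses the standard least-squares formulas b1 = Sxy/Sxx and b0 = y-bar - b1*x-bar on a small made-up data set.

def regression_line(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # y-intercept
    return b0, b1

# Made-up data for illustration only.
b0, b1 = regression_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(b0, b1)                   # regression equation: y-hat = b0 + b1*x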
13 Use of Regression Equations
- A regression equation models the data approximately (not perfectly)
- x-values (explanatory variable) predict values of y (response variable)
WARNING: You can predict accurately only within the range (spread) of the observed x-values.
14 Extrapolation
- Using an x-value outside the range of the data is unacceptable because the trend could change. Making predictions for x-values outside the range is called extrapolation, and it should be avoided.
15 Extrapolation
16 Outliers and Influential Observations
- An outlier is a data point that lies vertically far from the regression line relative to the other points
- An influential observation is a data point that lies horizontally far from the rest of the data and whose removal would considerably change the slope of the regression line
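A small Python sketch (made-up data, assuming SciPy is available) showing how an influential observation works: removing the point that lies far to the right changes the fitted slope considerably.

from scipy.stats import linregress

x = [1, 2, 3, 4, 20]            # the point at x = 20 lies horizontally far from the rest
y = [2.0, 4.1, 5.9, 8.2, 5.0]
print(linregress(x, y).slope)            # slope with the influential point included
print(linregress(x[:-1], y[:-1]).slope)  # slope after removing it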
17 Outliers and Influential Observations (Fig. 4.12, p. 169)
18 Data Must Be Linear
19 Problem 4.53, page 174
20 4.53 a
21 4.53 b
22 4.53 c-g
- Emission increases as weight increases
- Δy/Δx = ΔEmissions/ΔWeight ≈ 0.16/1
- For each additional gram of potato plant weight, emissions increase by about 0.16 hundred nanograms, which is 16 nanograms.
- ŷ = 3.524 + 0.1628(75) = 15.73
- Predictor: x = weight of potato plant
- Response: y = emission quantity
- None
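A quick Python check of the prediction above, using the slide's fitted equation ŷ = 3.524 + 0.1628x.

b0, b1 = 3.524, 0.1628
print(round(b0 + b1 * 75, 2))   # 15.73 (hundred nanograms) for a 75-gram plant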
23 Finding the Regression Equation Using Your Calculator
- 1. Enter x-values in L1 and y-values in L2 (or, alternatively, use INS to insert new lists with whatever names you want)
- 2. Stat > Calc > 8: LinReg(a+bx)
- 3. Use LIST to enter L1, L2 (or whatever the names of your lists are), don't forget the comma between them, then press Enter
- 4. The calculator tells you a and b, which correspond to b0 and b1 in the book (the equation is y = a + bx = b0 + b1x)
Be sure to turn your diagnostic on: Catalog > DiagnosticOn > Enter (Catalog is the 2nd function above the 0 key)
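For anyone working outside the TI calculator, a rough Python equivalent (assuming SciPy is installed; the data here are made up, so substitute your own lists).

from scipy.stats import linregress

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
result = linregress(x, y)
print(result.intercept, result.slope)   # a and b, i.e. b0 and b1
print(result.rvalue ** 2)               # r², the coefficient of determination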
24 Sec. 4.3 The Coefficient of Determination
- The coefficient of determination is denoted r²
- Always between 0 and 1
- Measures how well the regression equation describes the relationship between x and y
- Close to 0 (0 to 0.4) => regression is not useful
- Close to 1 (0.6 to 1) => regression is useful
25 Formulas for r²
26 SSE
The error sum of squares, SSE, is the variation in the observed values of y that is not explained by the regression. With SST the total sum of squares and SSR the regression sum of squares,
SSE = SST - SSR
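A sketch (made-up data, not from the slides) that computes SST, SSR, and SSE for the least-squares line, confirms SSE = SST - SSR, and uses the standard identity r² = SSR/SST.

xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)               # total variation in y
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # variation explained by the line
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained variation
print(round(sse, 6) == round(sst - ssr, 6))           # SSE = SST - SSR
print(ssr / sst)                                      # r²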
27 Coefficient of Determination, r²
- To calculate r², use your calculator; follow the same instructions as for obtaining the regression equation.
- We will not calculate r² by hand.
Be sure to turn your diagnostic on: Catalog > DiagnosticOn > Enter (Catalog is the 2nd function above the 0 key)
28 Percentage of Variation
- The percentage of variation in the y-values that is explained by the variation in the x-values is r² × 100%.
29 4.91 (p. 185, same data as 4.53)
- Ignore part (a) (computing SSR, SST, and SSE) in all of Sec. 4.3
- r² = 0.1096
- r² × 100% = 10.96%
- Not useful
30 Sec. 4.4 Linear Correlation
31 The Linear Correlation Coefficient, r
- The sign of r indicates the direction (slope) of the relationship
- A positive r means a positive relationship
- A negative r means a negative relationship
- The magnitude of r indicates the strength of the linear relationship (magnitude = how far r is from zero)
Weak => between -0.6 and 0.6; Strong => less than -0.75 or more than 0.75
32 The Linear Correlation Coefficient, r
- Always between -1 and 1, inclusive
- The sign of r is the same as the sign of b1 (the slope of the regression line)
- Squaring r always gives r²
- To find r, use your calculator; follow the same instructions as for finding the regression equation and r²
Be sure to turn your diagnostic on: Catalog > DiagnosticOn > Enter (Catalog is the 2nd function above the 0 key)
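A quick Python check (made-up data, assuming NumPy is available) of the relationship between r and r².

import numpy as np

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
r = np.corrcoef(x, y)[0, 1]     # linear correlation coefficient
print(r)                        # sign gives direction, magnitude gives strength
print(r ** 2)                   # squaring r gives r², the coefficient of determination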
33 Linear Relationships
34 Interpreting r
- r only has meaning if the data are linear
- r can be computed for nonlinear data, but the data may not be linear even if r is strong
35 4.125 (p. 194, same data as 4.53 and 4.91)
- Ignore the "use the computing formula" part of (a) in all of Sec. 4.4 and use your calculator
- r = 0.3311
- Weak, positive, linear relationship
- Very scattered, not close to the line
- r² = 0.3311² = 0.1096, the coefficient of determination (the square of the correlation coefficient)
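A one-line Python check that the slide's r and r² values agree.

print(round(0.3311 ** 2, 4))   # 0.1096, matching the r² reported in 4.91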