Title: Biostatistics and Computer Applications
1Biostatistics and Computer Applications
Correlation and regression analysis; linear regression; significance tests; confidence interval estimation; SAS programming. 1/07/2003
2Recap (Analysis of variance)
- Detect the effect of (response to) treatments (levels or combinations of levels of experimental factors).
- Main idea: partition the total variance into different components.
- One-way ANOVA
- Two-way ANOVA
- Hierarchical (nested) ANOVA
- ANOVA for different experimental designs
- Purpose: detect treatment effects.
- Designs to minimize experimental error.
3Correlation relationship analysis
- ANOVA: analysis of a dependent variable (usually a continuous variable) against treatment effects (categorical variables).
- Regression analysis: finds the relationship between two or more continuous variables (for two variables, X and Y).
4Correlation Relationship
- Function: Y = f(X); for every value of the variable X there is a fixed value of Y corresponding to X. Deterministic models; prediction error is negligible. Example: force is exactly mass times acceleration, F = ma, or the area of a circle is Y = πX².
- Correlation: Y = f(X) + ε; under certain conditions, for every value of X we do not have a fixed Y, but rather a probability distribution of Y associated with X. Probabilistic models. Example: sales volume is 10 times advertising spending plus random error, Y = 10X + ε.
5Dependent and independent variables
- In a correlation relationship, if one variable responds to variation in another variable, we call the responding variable the dependent variable (Y) and the predictor variable the independent variable (X).
- For example, yield of wheat (Y) responds to plant density (X); income (Y) responds to education level (X).
- Regression analysis.
- If there is no cause-response relationship between the two variables, but they co-vary with other variables, we do not separate them into dependent and independent variables (X, Y).
- For example, height and weight of a person.
- Correlation analysis.
6Tasks of correlation relationship analysis
- Regression analysis
- Develop the regression equation (Y = f(X)) and estimate the standard error of regression s_Y/X.
- Correlation analysis
- Calculate the correlation coefficient r, which measures the degree of association between the two variables.
7Scatter diagram
9Background information
- Historically, the study of predicting one measurement from knowledge of another occurred before the development of correlation procedures.
- Sir Francis Galton published a paper, "Regression towards Mediocrity in Hereditary Stature," in 1885. His work concentrated on the prediction of physical traits of offspring from knowledge of the parents' physical traits.
- A general finding of his work was that children tend to "regress toward the mean." For example, taller-than-average parents tended to have children shorter than themselves; likewise, shorter-than-average parents tended to have children taller than themselves. Because of this regression phenomenon, prediction studies came to be called regression studies.
10Background information
- Today's use of the term regression has nothing to do with the biological phenomenon observed by Galton. Instead, regression refers to the prediction of one measurement based on knowledge of another.
- The regression techniques used now were pioneered by Karl Pearson. The most commonly used correlation technique today is the Pearson product-moment correlation coefficient. It is applied to data measured on a numerical (discrete or continuous) scale.
- Categorical data require other correlation techniques, such as the contingency coefficient. Fully ranked (ordinal) data may be analyzed using rank-order correlation methods.
11Types of regression analysis
(Diagram: regression models are classified by the number of explanatory variables (one: simple regression; two or more: multiple regression), and each type may be linear or non-linear.)
12Regression Modeling Steps
- 1. Determine the regression equation
- 2. Calculate unknown model parameters
- 3. Estimate the standard deviation of regression
- 4. Test the significance of the regression equation
- 5. Use the model for estimation and prediction.
13Simple Linear Regression
- Linear regression equation of Y on X: Ŷ = a + bX.
- a: intercept. When X = 0, Ŷ = a.
- b: regression coefficient or slope. When X increases by 1 unit, Y is expected to change by b units.
14Simple Linear Regression
- How would you draw a line through the points?
How do you determine which line fits best?
15 Least squares method
- Best fit means the difference between actual Y values and predicted Y values is a minimum.
- But positive differences offset negative ones.
- Least squares minimizes the sum of the squared differences (SSe, Q), as written out below.
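In symbols, the least-squares criterion (standard notation, with Ŷi = a + bXi) is:

Q = SS_e = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - a - b X_i)^2 \;\rightarrow\; \min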
16Least squares estimation
(Figure: Y plotted against X with the fitted least-squares line and the vertical deviations being minimized.)
17Fitting Regression Lines
To obtain the minimum of a function, we find the values that make its derivatives equal to zero. So, to find the (a, b) that minimize Q, we set the partial derivatives of Q with respect to a and b to zero and solve the resulting normal equations. Solving these equations yields the estimates given below.
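In the usual notation, the normal equations and the resulting least-squares estimates are:

\frac{\partial Q}{\partial a} = -2\sum_{i}(Y_i - a - bX_i) = 0, \qquad
\frac{\partial Q}{\partial b} = -2\sum_{i} X_i (Y_i - a - bX_i) = 0

b = \frac{\sum_i (X_i-\bar{X})(Y_i-\bar{Y})}{\sum_i (X_i-\bar{X})^2} = \frac{SP_{xy}}{SS_x}, \qquad
a = \bar{Y} - b\,\bar{X}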
18Coefficient Equations
Prediction equation: Ŷ = a + bX
Sample slope: b = SP_xy / SS_x
Sample Y-intercept: a = Ȳ - bX̄
SP = sum of products: SP_xy = Σ(Xi - X̄)(Yi - Ȳ); SS_x = Σ(Xi - X̄)²
19Interpretation of Coefficients
- 1. Slope (b)
- Estimated Y changes by b for each 1-unit increase in X.
- If b = 4, then Y is expected to increase by 4 for each 1-unit increase in X.
- 2. Y-intercept (a)
- Average value of Y when X = 0.
- If a = 2, then the average Y is expected to be 2 when X is 0.
20Properties of a, b and the regression equation
21Computation Table
22Example of Parameter Estimation
- Example: The amount of pests changes with climatic conditions. The following data are the number of pests observed on 100 plants (Y) and the ratio of precipitation to temperature (PPT/T, X) during ten years at one site. Develop a regression equation.
23Parameter Estimation Solution
(4.976, 109)
Meaning of a, b
24Standard deviation from regression
Standard error of estimation, also called the standard deviation from regression. Q is the sum of squares due to deviation from regression; the formula is given below.
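In standard form, with n - 2 degrees of freedom for simple linear regression:

s_{Y/X} = \sqrt{\frac{Q}{n-2}} = \sqrt{\frac{\sum_i (Y_i - \hat{Y}_i)^2}{n-2}}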
25Example of standard deviation
Approximately 68.26% of the Y observations fall within Ŷ ± 25.74, and 95.45% fall within Ŷ ± 2(25.74).
26Linear Regression Model
- Assumptions
- 1. Predictor variables are fixed, i.e., they have the same meaning among individuals, and the predictor variable is measured without error.
- 2. For each value of the predictor variable, there is a normal distribution of outcomes (subpopulations), and the variances of these distributions are equal.
27Linear Regression Model
- Assumptions
- 3. The variance σ²_Y/X is constant, and the mean μ_Y/X changes with X linearly.
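Putting assumptions 1 through 3 together, the simple linear regression model can be written in the standard form:

Y_i = \alpha + \beta X_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)\ \text{independently}, \qquad \mu_{Y/X} = \alpha + \beta X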
28Linear Regression Model
29Population Sample Regression Models
(Diagram: an unknown population relationship between X and Y; a random sample of observations is drawn from the population and used to estimate the regression model.)
30Hypothesis test of regression equation
- Given X and Y data, you can always develop a regression equation. But is the relationship real?
- Test whether the sample could have been drawn from a population in which Y and X have no correlation relationship.
- F test
- Student t test for the regression coefficient.
31F test
(Population regression model and the sample regression equation.)
32Measures of Variation in Regression
- Total sum of squares (SST): measures the variation of the observed Yi around the mean Ȳ.
- Regression sum of squares (U): explained variation, due to the relationship between X and Y.
- Residual sum of squares (Q): unexplained variation, due to other factors.
- The partition is shown below.
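In symbols (standard notation, with Ŷi the fitted value):

SST = \sum_i (Y_i - \bar{Y})^2 = U + Q, \qquad
U = \sum_i (\hat{Y}_i - \bar{Y})^2, \qquad
Q = \sum_i (Y_i - \hat{Y}_i)^2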
33Variation Measures
(Figure: an observed value of Y, its random error e_i, and the mean Ȳ relative to the fitted regression line.)
34Variation Measures
Unexplained sum of squares: Σ(Yi - Ŷi)²
Total sum of squares: Σ(Yi - Ȳ)²
Explained sum of squares: Σ(Ŷi - Ȳ)²
(Figure: these three quantities shown for a point (Xi, Yi) relative to the fitted line and the mean Ȳ.)
35F test
- The ANOVA table for regression analysis
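In its usual form, the ANOVA table for simple linear regression (with U the regression sum of squares and Q the residual sum of squares) is:

Source        df       SS     MS             F
Regression    1        U      U / 1          MS_regression / MS_residual
Residual      n - 2    Q      Q / (n - 2)
Total         n - 1    SST

Under H0: β = 0, the F statistic follows an F distribution with 1 and n - 2 degrees of freedom.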
36Example Measures of Variation in Regression
- Test if the linear regression equation is
significant.
37Student t test of Slope
- Test if there is a linear relationship between X and Y.
- Hypotheses
- H0: β = 0 (no linear relationship)
- HA: β ≠ 0 (linear relationship)
- Theoretical basis: the sampling distribution of the slope b.
38Slope Test Statistic
Relationship between F test and t test
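In standard notation, the slope test statistic, its standard error, and its relation to the F test are:

t = \frac{b}{s_b}, \qquad s_b = \frac{s_{Y/X}}{\sqrt{\sum_i (X_i - \bar{X})^2}}, \qquad df = n - 2, \qquad F = t^2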
39Example of Slope Test
(Figure: two-tailed t test; rejection regions with 0.025 in each tail, beyond the critical values t = ±3.355.)
40Confidence Interval of regression
- Population mean response μ_Y/X for a given X: a point on the population regression line.
- Population individual response (Y) for a given X (prediction interval of Y).
- Intercept α.
- Slope β.
41Estimation of μ_Y/X and prediction of Y
(Figure: fitted line Ŷ = a + bX, showing the estimated mean of Y and the prediction of an individual Y at a given Xi.)
42Confidence interval of μ_Y/X
The 1 - α confidence interval for μ_Y/X is given below.
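In standard form:

\hat{Y}_i \pm t_{\alpha/2,\, n-2}\; s_{Y/X}\,\sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_j (X_j - \bar{X})^2}}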
43Example of CI for μ_Y/X
Calculate the 95% confidence interval for μ_Y/X when PPT/T, X = 7.
Influencing factors: level of confidence (1 - α), data dispersion (s), sample size, and the distance of Xi from the mean X̄.
44Why Distance from Mean?
(Figure: predicted values at an X far from X̄ show greater dispersion than at X1 near the mean, so intervals widen with distance from X̄.)
45Prediction Interval of Individual Response Y
For a population individual observation Y, the 1 - α prediction interval is given below.
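In standard form:

\hat{Y}_i \pm t_{\alpha/2,\, n-2}\; s_{Y/X}\,\sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_j (X_j - \bar{X})^2}}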
46Why the Extra Term?
(Figure: at Xi, the individual Y we are trying to predict differs from the expected (mean) value on the line Ŷ = a + bX by a random error e, so the prediction interval must allow for this extra variability.)
47Example of Prediction Interval of Y
Calculate the 95% prediction interval for the individual observation Y when X = 7.
48Hyperbolic Interval Bands
(Figure: the confidence and prediction bands around the fitted line Ŷ = a + bX are hyperbolic, narrowest at X̄ and widening as Xi moves away from X̄.)
49Confidence Intervals of Intercept and Slope
The 1 - α confidence intervals for the slope β and for the intercept α are given below.
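In standard form:

b \pm t_{\alpha/2,\, n-2}\, s_b, \qquad s_b = \frac{s_{Y/X}}{\sqrt{SS_x}}; \qquad\qquad
a \pm t_{\alpha/2,\, n-2}\, s_a, \qquad s_a = s_{Y/X}\,\sqrt{\frac{1}{n} + \frac{\bar{X}^2}{SS_x}}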
50Comparison of two regression equations
51Example of comparison of two regression equations
Example: For two deciduous trees we measured leaf area and the product of leaf length and width. Test whether the relationship between leaf area and the product length × width differs between the two trees.
52Example of comparison of two regression equations
53Example of comparison of two regression equations
Since there is no significant difference between the intercepts or between the slopes, we can merge the two data sets and estimate a single equation.
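One common way to carry out such a comparison in SAS is to fit both groups in a single model with a group indicator and an interaction term, then test whether the intercepts and slopes differ. This is a minimal sketch, not the slides' own program; the data set LEAVES and the variable names tree, lw, and leafarea are assumed here for illustration.

/* Assumes a data set LEAVES with variables:                  */
/*   tree     - tree identifier (e.g., A or B)                */
/*   lw       - product of leaf length and width (X)          */
/*   leafarea - measured leaf area (Y)                        */
PROC GLM DATA=leaves;
  CLASS tree;
  MODEL leafarea = tree lw lw*tree;
  /* The TREE term tests equal intercepts; the LW*TREE        */
  /* interaction tests equal slopes. If neither is            */
  /* significant, merge the data and fit a single regression. */
RUN;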
54Summary
- 1. Described the correlation relationship
- 2. Stated the regression modeling steps
- 3. Computed the regression coefficients
- 4. Tested the significance of the regression model
- 5. Estimated confidence intervals.
55SAS Programming
- Procedures: PROC REG, PROC GLM.
- Special procedures such as PROC LOGISTIC, PROC RSREG, PROC LIFEREG, PROC ORTHOREG, PROC PHREG, PROC SURVEYREG, PROC TRANSREG.
- PROC REG handles simple and multiple linear regression and is the most commonly used procedure.
56PROC REG
- PROC REG < options > ;
- < label: > MODEL dependents = < regressors > < / options > ;
- VAR variables ;
- RESTRICT equation , ... , equation ;
- < label: > MTEST < equation , ... , equation > < / options > ;
- < label: > TEST equation < , ... , equation > < / options > ;
- ADD variables ;
- DELETE variables ;
- REFIT ;
- PAINT < condition | ALLOBS > < / options > | < STATUS | UNDO > ;
- PLOT < yvariable*xvariable > < =symbol > < ... yvariable*xvariable > < =symbol > < / options > ;
- PRINT < options > < ANOVA > < MODELDATA > ;
- OUTPUT < OUT=SAS-data-set > keyword=names < ... keyword=names > ;
- BY variables ;
- FREQ variable ;
- ID variables ;
- WEIGHT variable ;
- REWEIGHT < condition | ALLOBS > < / options > | < STATUS | UNDO > ;
57SAS program 1
DATA pest;
  INPUT x y;
  DATALINES;
1.58 180
9.98 28
9.42 25
11.01 40
1.85 160
6.04 120
5.92 80
;
PROC REG SIMPLE CORR;
  MODEL y = x / CLM CLI CLB;
  PLOT y*x;
RUN;
/* CLB: confidence intervals for the regression parameters (slope and intercept),
   CLM: confidence interval for the population mean mu_Y/X,
   CLI: prediction interval for an individual Y */
58SAS program 2
DATA frog;
  INPUT temperature heartrate;
  DATALINES;
2 5
4 11
6 11
8 14
10 22
12 23
14 32
16 29
18 32
;
PROC REG NOPRINT;
  MODEL heartrate = temperature;
  PRINT ALL;
RUN;