Title: General Announcement (01.07.2004)
1- General Announcement (01.07.2004)
- All course slides, solutions to quizzes, solutions to assignments 1 and 3, and practice questions on probability have been uploaded to the Intranet.
- All quizzes and assignments have been corrected. The corrected documents can be seen by approaching Mr. Parmanand Bhuye in the Secretarial section on the ground floor (in front of reception). You can report totalling mistakes, if any.
- The mark sheet for the above will be put on the notice board by this evening.
2SAMIR K. SRIVASTAVA
3Correlation and Regression
- Univariate vs. bivariate (multivariate) data
- More than one attribute for each member of the population:
- Height and weight
- Absenteeism and production
- Advertising expenditure and sales volume
- Unemployment and crime rate
- Rainfall and food production
- Web site visitor profile
4Correlation and Regression
- Are the two attributes related to each other?
- Can we use one to predict the other?
- Can we change one to control the other?
- Predictor Variable and Response Variable
- Relationship may be
- Positive or negative (or nonexistent)
- Weak or strong
- Two variables are said to be correlated if the value of one is indicative of the value of the other.
5Organizing Bivariate Data: Scatter Plots
[Figure: example scatter plots showing negatively correlated, positively correlated, loosely correlated, strongly correlated, and uncorrelated data]
6Measuring the Strength of Correlation
- Can we define a quantitative measure of the strength of correlation?
- Covariance is such a measure.
- Looks similar to variance.
- Can be positive as well as negative.
- When will it have a positive vs. a negative value?
- A high vs. a low value?
7Measuring the Strength of Correlation
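A standard definition of covariance, given here as a sketch that assumes the sample form with divisor n - 1 (the slide's own formula is not shown and may use n instead), alongside the variance it resembles:

```latex
\[
s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}),
\qquad
s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2
\]
```

Each product $(x_i-\bar{x})(y_i-\bar{y})$ is positive when both values lie on the same side of their means and negative when they lie on opposite sides, so the sum tends to be positive for positively correlated data and negative for negatively correlated data.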
8Coefficient of Correlation
- Suppose we wish to measure the strength of correlation on a scale of 0 to 1.
- Is covariance an appropriate measure?
- What if we multiply all X values by a constant?
- The measure should not be affected by a change of scale.
- Coefficient of correlation:
- $r = \sigma_{xy} / (\sigma_x \sigma_y)$
- The value of r lies between -1 and 1.
- Values close to 0 indicate little or no correlation.
- Values close to 1 or -1 indicate a very strong correlation.
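The effect of a change of scale can be checked numerically. The following is a minimal sketch, assuming the sample covariance with divisor n - 1; the data and the function names `covariance` and `correlation` are hypothetical and chosen only for illustration.

```python
import math

def covariance(xs, ys):
    """Sample covariance with divisor n - 1 (an assumed convention)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Coefficient of correlation r = cov(x, y) / (sd(x) * sd(y))."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data, not taken from the slides.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
scaled_xs = [100.0 * x for x in xs]  # e.g. metres converted to centimetres

print(covariance(xs, ys), covariance(scaled_xs, ys))    # covariance grows 100-fold
print(correlation(xs, ys), correlation(scaled_xs, ys))  # r is unchanged
```

Multiplying every X value by a constant multiplies the covariance by the same constant, while r stays the same because the standard deviation of X in the denominator is scaled by that constant too.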
9Illustration
          x      y    x-x̄    y-ȳ    (x-x̄)²    (y-ȳ)²    (x-x̄)(y-ȳ)
       1.25    125   -0.90     45    0.8100      2025        -40.50
       1.75    105   -0.40     25    0.1600       625        -10.00
       2.25     65    0.10    -15    0.0100       225         -1.50
       2.00     85   -0.15      5    0.0225        25         -0.75
       2.50     75    0.35     -5    0.1225        25         -1.75
       2.25     80    0.10      0    0.0100         0          0.00
       2.70     50    0.55    -30    0.3025       900        -16.50
       2.50     55    0.35    -25    0.1225       625         -8.75
Sum   17.20    640    0.00      0    1.5600      4450        -79.75
Mean   2.15     80
Sxx = 1.56, Syy = 4450, Sxy = -79.75
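The coefficient of correlation for this illustration follows from the three sums as r = Sxy / sqrt(Sxx * Syy). A minimal sketch that recomputes the sums from the x and y columns; the variable names are assumptions made for this example.

```python
import math

xs = [1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50]
ys = [125, 105, 65, 85, 75, 80, 50, 55]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Sums of squared deviations and of cross-products, as in the table.
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = s_xy / math.sqrt(s_xx * s_yy)
print(f"Sxx={s_xx:.2f}  Syy={s_yy:.0f}  Sxy={s_xy:.2f}  r={r:.3f}")
```

The result is roughly -0.96, a strong negative correlation: larger x values go with smaller y values.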
10Correlation and Causation
- Is there a causal relationship between the two variables?
- Rainfall and food production
- Absenteeism and production
- Advertising expenditure and sales
- Strange, spurious or nonsense correlations
- Teachers' salaries and liquor sales
- Divorce rate and death rate (negative correlation)
11Correlation and Causation
- Spurious correlation is due to a third, "lurking" variable
- Economic growth → higher salaries, higher liquor consumption
- Age → older couples have fewer divorces, but a higher death rate
- To establish causation between variables, establish
- Consistency (relationship holds in a variety of contexts)
- Responsiveness (change in one precedes change in the other)
- Mechanism (manner in which a change in X changes Y)
12Regression
- Francis Galton introduced the term in 1877
- Height of children → mean of the population (regression toward the mean)
- Predicting one variable from another
- Relationships of association, not causal
- Relating variables mathematically
- Linear or non-linear
- Bivariate: linear, between two variables
13Bivariate Regression Assumptions
- Assumptions for bivariate regression
- 1. Random sample
- Ideally N > 20
- But different rules of thumb exist (10, 30, etc.)
- 2. Variables are linearly related
- i.e., the mean of Y increases (or decreases) linearly with X
- Check the scatter plot for a general linear trend
- Watch out for non-linear relationships (e.g., U-shaped)
14Bivariate Regression Assumptions
- 3. Y is normally distributed for every outcome of X in the population
- Conditional normality
- Ex.: Years of Education (X), Job Prestige (Y)
- Suppose we look only at a sub-sample with X = 12 years of education
- Is a histogram of Job Prestige approximately normal?
- What about for people with X = 4? X = 16?
- If all are roughly normal, the assumption is met
15Two Possible Regressions
16Simple Linear Regression An Example
- For a sample of 8 employees, a personnel director has collected the following data on ownership of company stock, y, versus years with the firm, x.
- x: 6 12 14 6 9 13 15 9
- y: 300 408 560 252 288 650 630 522
- (a) Determine the least squares regression line and interpret its slope. (b) For an employee who has been with the firm 10 years, what is the predicted number of shares of stock owned?
17An Example, cont.
          x        y       xy     x²
          6      300    1,800     36
         12      408    4,896    144
         14      560    7,840    196
          6      252    1,512     36
          9      288    2,592     81
         13      650    8,450    169
         15      630    9,450    225
          9      522    4,698     81
Mean   10.5   451.25
Sum                    41,238    968
18An Example, cont.
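- Slope: $b_1 = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2} = \frac{41{,}238 - 8(10.5)(451.25)}{968 - 8(10.5)^2} = \frac{3{,}333}{86} \approx 38.8$
- y-intercept: $b_0 = \bar{y} - b_1\bar{x} = 451.25 - 38.756(10.5) \approx 44.3$
- So the best-fit linear model, rounding to the nearest tenth, is $\hat{y} = 44.3 + 38.8x$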
19An Example, cont.
- Interpretation of the slope: for every additional year an employee works for the firm, the employee is estimated to own an additional 38.8 shares of stock.
- If x = 10, the point estimate for the number of shares of stock that this employee owns is $\hat{y} = 44.3 + 38.8(10) = 432.3$
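The arithmetic of this example can be reproduced in a few lines. This is a minimal sketch using the computational form of the least-squares formulas; the function name `least_squares` and the variable names are assumptions made for illustration.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Data from the example: years with the firm (x) and shares owned (y).
years = [6, 12, 14, 6, 9, 13, 15, 9]
shares = [300, 408, 560, 252, 288, 650, 630, 522]

b0, b1 = least_squares(years, shares)
prediction = b0 + b1 * 10  # employee with 10 years at the firm
print(f"y-hat = {b0:.1f} + {b1:.1f} x")
print(f"predicted shares at 10 years: {prediction:.1f}")
```

Because the code keeps the unrounded coefficients, its prediction comes out slightly below the figure obtained from the model rounded to the nearest tenth.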
20Using the Regression Equation
- Before using the regression model, we need to assess how well it fits the data.
- If we are satisfied with how well the model fits the data, we can use it to make predictions for y.
- Illustration
- Predict the selling price of a three-year-old car with 40,000 km on the odometer
21Bivariate Regression Assumptions
22Estimating the Coefficients
- The estimates are determined by
- drawing a sample from the population of interest,
- calculating sample statistics,
- producing a straight line that cuts into the data.
The question is: which straight line fits best?
23 Ordinary Least Squares
- 1. "Best Fit" Means the Differences Between Actual Values (Yi) and Predicted Values (Ŷi) Are a Minimum
- But Positive Differences Off-Set Negative Ones
- 2. OLS Minimizes the Sum of the Squared Differences (or Errors)
24Least Squares Method
The best line is the one that minimizes the sum
of squared vertical differences between the
points and the line.
Let us compare two lines; the second line is horizontal.
The smaller the sum of squared differences, the better the fit of the line to the data.
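The comparison can be sketched numerically. The two lines pictured on the slide are not reproduced here, so this example assumes two candidates for the stock-ownership data from the earlier example: the fitted line with intercept 44.3 and slope 38.8, and a horizontal line at the mean of y, 451.25. The function name `sum_squared_errors` is hypothetical.

```python
def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared vertical differences between the points and a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

years = [6, 12, 14, 6, 9, 13, 15, 9]
shares = [300, 408, 560, 252, 288, 650, 630, 522]

sse_sloped = sum_squared_errors(years, shares, 44.3, 38.8)  # fitted line
sse_flat = sum_squared_errors(years, shares, 451.25, 0.0)   # horizontal line at the mean

print(f"sloped line: {sse_sloped:.1f}")
print(f"horizontal line: {sse_flat:.1f}")
print("sloped line fits better:", sse_sloped < sse_flat)
```

The sloped line has the smaller sum of squared differences, so by this criterion it fits the data better than the horizontal line.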
25Assumptions of OLS regression
- Model is linear in parameters
- The residuals are normally distributed
- The residuals have constant variance
- The expected value of the residuals is always zero
- The residuals are independent of one another
- The X values are precise
- The independent variables are not too strongly collinear
- If these assumptions are satisfied, then the OLS estimator is unbiased and has minimum variance among all unbiased estimators.
- How can we test these assumptions?
- If assumptions are violated,
- what does this do to our conclusions?
- how do we fix the problem?
26The Model
- The first-order linear model, or simple regression model: $y = \beta_0 + \beta_1 x + \varepsilon$
- y = dependent variable
- x = independent variable
- β0 = y-intercept
- β1 = slope of the line
- ε = error variable
27Least Squares Method
To calculate the estimates of the coefficients
that minimize the differences between the data
points and the line, use the formulas
28Least Squares Method
Now we define
29Least Squares Method
Then
The estimated simple linear regression equation
that estimates the equation of the first order
linear model is
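The standard least-squares formulas, written as a sketch in the Sxx and Sxy notation of the earlier correlation illustration. The names $b_0$ and $b_1$ for the estimated intercept and slope are assumptions, and the slides' own formulas may be arranged differently.

```latex
\[
S_{xx} = \sum_{i=1}^{n}(x_i-\bar{x})^2, \qquad
S_{xy} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})
\]
\[
b_1 = \frac{S_{xy}}{S_{xx}}, \qquad
b_0 = \bar{y} - b_1\bar{x}, \qquad
\hat{y} = b_0 + b_1 x
\]
```

Expanding the sums gives the equivalent computational form $b_1 = (\sum x_i y_i - n\bar{x}\bar{y}) / (\sum x_i^2 - n\bar{x}^2)$, which needs only the column totals of xy and x².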
30Error Variable Required Conditions
- The error variable ε is a critical part of the regression model.
- Five requirements involving the distribution of ε must be satisfied.
- The mean of ε is zero: E(ε) = 0.
- The standard deviation of ε is a constant (σε) for all values of x.
- The errors are independent.
- The errors are independent of the independent variable x.
- The probability distribution of ε is normal.
31 Standard error of estimate
- If σε is small, the errors tend to be close to zero (close to the mean error). Then the model fits the data well.
- Therefore, we can use σε as a measure of the suitability of using a linear model.
- An unbiased estimator of σε² is given by $s_\varepsilon^2 = \frac{SSE}{n-2}$, where SSE is the sum of squared errors about the fitted line.
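As a minimal sketch, the standard error of estimate can be computed as the square root of SSE / (n - 2), the usual unbiased estimator of the error variance in simple regression. The example reuses the stock-ownership data from the earlier example; the function name `standard_error_of_estimate` is hypothetical.

```python
import math

def standard_error_of_estimate(xs, ys):
    """Return sqrt(SSE / (n - 2)) for the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx              # slope
    b0 = y_bar - b1 * x_bar       # intercept
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))

years = [6, 12, 14, 6, 9, 13, 15, 9]
shares = [300, 408, 560, 252, 288, 650, 630, 522]

s_e = standard_error_of_estimate(years, shares)
print(f"standard error of estimate: {s_e:.1f} shares")
```

The value is in the units of y, so it can be judged against the mean holding of 451.25 shares when deciding whether the linear model is suitable.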
32Assessing the Model
- The least squares method will produce a regression line whether or not there is a linear relationship between x and y.
- Consequently, it is important to assess how well the linear model fits the data.
- Several methods are used to assess the model:
- Testing and/or estimating the coefficients.
- Using descriptive measurements.
33Outliers
- An outlier is an observation that is unusually small or large.
- Several possibilities need to be investigated when an outlier is observed:
- There was an error in recording the value.
- The point does not belong in the sample.
- The observation is valid.
- Identify outliers from the scatter diagram and remove them.
34Practice Problem
- A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
- A random sample of 100 cars is selected, and the data recorded.
- Find the regression line.
35Solution
- We need to calculate several statistics first, where n = 100.
36Coefficient of Determination
- A measure of the
- Strength of the linear relationship between x and y.
- The larger the value of r², the more the value of y depends in a linear way on the value of x.
- Amount of variation in y that is related to variation in x.
- Ratio of the variation in y that is explained by the regression model to the total variation in y.
37 Coefficient of determination
- To understand the significance of this coefficient, note that the overall variability in y is explained in part by the regression model, while the remainder is left unexplained (the error).
38 Coefficient of determination
- When we want to measure the strength of the
linear relationship, we use the coefficient of
determination.
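A minimal sketch of the calculation, assuming the usual definition of the coefficient of determination as 1 - SSE / Syy and reusing the stock-ownership data from the earlier example. The variable names are assumptions made for illustration.

```python
years = [6, 12, 14, 6, 9, 13, 15, 9]
shares = [300, 408, 560, 252, 288, 650, 630, 522]

n = len(years)
x_bar, y_bar = sum(years) / n, sum(shares) / n
s_xx = sum((x - x_bar) ** 2 for x in years)
s_yy = sum((y - y_bar) ** 2 for y in shares)  # total variation in y
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, shares))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(years, shares))  # unexplained variation

r_squared = 1 - sse / s_yy        # share of variation explained by the line
r = s_xy / (s_xx * s_yy) ** 0.5   # coefficient of correlation
print(round(r_squared, 3), round(r ** 2, 3))
```

Both printed values agree, which illustrates that in simple linear regression the coefficient of determination equals the square of the coefficient of correlation. For these data about 72% of the variation in shares owned is explained by years with the firm.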
39Conclusion
- Used scatter diagrams to visualize the relationship between two variables
- Learnt the use of correlation analysis
- Described the linear regression model
- Explained the ordinary least-squares method for generating the regression equation
- Learnt the limitations of regression and correlation analysis