General Announcement (01.07.2004) - PowerPoint PPT Presentation

About This Presentation

Title:

General Announcement (01.07.2004)

Description:

All course slides, solutions to quizzes and solutions to assignments 1 and 3 as well as practice questions on probability have been uploaded on the Intranet.

Slides: 41

Provided by: sami128

Transcript and Presenter's Notes

1

General Announcement (01.07.2004)

All course slides, solutions to quizzes and
solutions to assignments 1 and 3 as well as
practice questions on probability have been
uploaded on the Intranet.
All quizzes and assignments have been corrected.
The corrected documents can be seen by
approaching Mr. Parmanand Bhuye in the Secretarial
section on the ground floor (in front of reception).
You can report totalling mistakes, if any.
The mark sheet for the above will be put on the notice
board by this evening.

QUANTITATIVE METHODS 1

SAMIR K. SRIVASTAVA
3
Correlation and Regression

Univariate vs. Bivariate data (Multivariate)
More than one attribute for each member of
population.
Height and Weight
Absenteeism and Production
Advertising Expenditure and Sales Volume
Unemployment and Crime Rate
Rainfall and Food Production
Web Site Visitor Profile

4
Correlation and Regression

Are the two attributes related to each other?
Can we use one to predict the other?
Can we change one to control the other?
Predictor Variable and Response Variable
Relationship may be
Positive or negative (or nonexistent)
Weak or strong
Two variables are said to be correlated if value
of one is indicative of the value of the other.

5
Organizing Bivariate Data: Scatter Plots
Negatively Correlated
Positively Correlated
Loosely Correlated
Strongly Correlated
Not Correlated
6
Measuring the Strength of Correlation

Can we define a quantitative measure of strength
of correlation?
Covariance is such a measure.

Looks similar to variance.
Can be positive as well as negative.
When will it have a positive vs. negative value?
A High vs. Low value?

7
Measuring the Strength of Correlation
8
Coefficient of Correlation

Suppose we wish to measure the strength of
correlation on a scale of 0 to 1
Is Covariance an appropriate measure?
What if we multiply all X values by a constant?
The measure should not be affected by change of
scale.
Coefficient of Correlation
$r = \sigma_{xy} / (\sigma_x \sigma_y)$
Value of r lies between -1 and 1
Values close to 0 indicate little or no
correlation
Values close to 1 or -1 indicate a very strong
correlation.
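
The scale argument above can be checked numerically. The following is a minimal sketch, not taken from the slides: the height and weight values are invented for illustration, and the n - 1 denominator in `covariance` is an assumed convention (the slides do not show which denominator they use). Changing the units of X rescales the covariance but leaves r unchanged.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance with an n - 1 denominator (an assumed convention)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """r = covariance of x and y divided by the product of their standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data, invented for illustration only.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [52.0, 58.0, 69.0, 74.0, 85.0]
heights_cm = [100 * h for h in heights_m]  # same data, different scale

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)  # 100 times larger than cov_m
r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)   # equal to r_m

print(cov_m, cov_cm, r_m, r_cm)
```

Because the denominator of r is rescaled by the same factor as the covariance, the factor cancels, which is why r is preferred as a measure of strength.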

9
Illustration

        x      y    x - x̄    y - ȳ   (x - x̄)^2   (y - ȳ)^2   (x - x̄)(y - ȳ)
     1.25    125    -0.90      45      0.8100        2025           -40.50
     1.75    105    -0.40      25      0.1600         625           -10.00
     2.25     65     0.10     -15      0.0100         225            -1.50
     2.00     85    -0.15       5      0.0225          25            -0.75
     2.50     75     0.35      -5      0.1225          25            -1.75
     2.25     80     0.10       0      0.0100           0             0.00
     2.70     50     0.55     -30      0.3025         900           -16.50
     2.50     55     0.35     -25      0.1225         625            -8.75
Sum 17.20    640     0          0      1.5600        4450           -79.75

Mean of x: x̄ = 2.15. Mean of y: ȳ = 80.
Sxx = 1.56, Syy = 4450, Sxy = -79.75
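
The illustration stops at the sums Sxx, Syy and Sxy. As a sketch of the remaining step, the code below recomputes those sums from the x and y columns and forms r = Sxy / sqrt(Sxx * Syy). This form is equivalent to the covariance-based definition because the common denominator cancels. The variable names are chosen here for illustration.

```python
from math import sqrt

# x and y columns from the illustration.
x = [1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50]
y = [125, 105, 65, 85, 75, 80, 50, 55]

n = len(x)
x_bar = sum(x) / n  # 2.15
y_bar = sum(y) / n  # 80

# Sums of squared deviations and of cross-products.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Coefficient of correlation.
r = s_xy / sqrt(s_xx * s_yy)
print(round(r, 3))
```

The result is about -0.96, a strong negative correlation, consistent with y falling as x rises in the table.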
10
Correlation and Causation

Is there a causal relationship between the two
variables?
Rainfall and food production
Absenteeism and production
Advertising expenditure and sales
Strange, spurious or nonsense correlations
Teachers' salaries and liquor sales
Divorce rate and death rate (negative
correlation)

11
Correlation and Causation

Spurious correlation is due to a third, "lurking"
variable
Economic growth → higher salaries, higher liquor
consumption
Age → older couples have fewer divorces, but a
higher death rate.
To establish causation between variables,
establish
Consistency (relationship true in a variety of
contexts)
Responsiveness (change in one precedes change in
other)
Mechanism (manner in which change in X changes Y)

12
Regression

Francis Galton introduced the term in 1877
Height of children → Mean of population
Predicting one variable from another
Relationships of association, not causal
Relating variables mathematically
Linear or non-linear
Bivariate: linear, between two variables

13
Bivariate Regression Assumptions

Assumptions for bivariate regression
1. Random sample
Ideally N > 20
But different rules of thumb exist. (10, 30,
etc.)
2. Variables are linearly related
i.e., the mean of Y increases linearly with X
Check scatter plot for general linear trend
Watch out for non-linear relationships (e.g.,
U-shaped)

14
Bivariate Regression Assumptions

3. Y is normally distributed for every outcome
of X in the population
Conditional normality
Ex: Years of Education (X), Job Prestige (Y)
Suppose we look only at a sub-sample: X = 12
years of education
Is a histogram of Job Prestige approximately
normal?
What about for people with X = 4? X = 16?
If all are roughly normal, the assumption is met

15
Two Possible Regressions
16
Simple Linear Regression: An Example

For a sample of 8 employees, a personnel
director has collected the following data on
ownership of company stock, y, versus years with
the firm, x.
x:    6   12   14    6    9   13   15    9
y:  300  408  560  252  288  650  630  522
(a) Determine the least squares regression line
and interpret its slope. (b) For an employee who
has been with the firm 10 years, what is the
predicted number of shares of stock owned?

17
An Example, cont.

           x        y       xy    x^2
           6      300     1800     36
          12      408     4896    144
          14      560     7840    196
           6      252     1512     36
           9      288     2592     81
          13      650     8450    169
          15      630     9450    225
           9      522     4698     81
Mean    10.5   451.25
Sum                     41,238    968

18
An Example, cont.

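Slope:
$b_1 = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2} = \frac{41{,}238 - 8(10.5)(451.25)}{968 - 8(10.5)^2} = \frac{3333}{86} \approx 38.8$
y-Intercept:
$b_0 = \bar{y} - b_1\bar{x} = 451.25 - 38.756(10.5) \approx 44.3$
So the best-fit linear model, rounding to the
nearest tenth, is
$\hat{y} = 44.3 + 38.8x$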

19
An Example, cont.

Interpretation of the slope: for every
additional year an employee works for the firm,
the employee acquires an estimated 38.8 additional
shares of stock.
If $x = 10$, the point estimate for the number of
shares of stock that this employee owns is
$\hat{y} = 44.3 + 38.8(10) = 432.3$
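
The worked example can be reproduced with a short script. This is a sketch that uses the computational form of the least squares formulas suggested by the xy and x2 columns of the table; the names `b0`, `b1` and `predict` are chosen here for illustration.

```python
# Data from the example: x = years with the firm, y = shares of stock owned.
x = [6, 12, 14, 6, 9, 13, 15, 9]
y = [300, 408, 560, 252, 288, 650, 630, 522]

n = len(x)
x_bar = sum(x) / n                             # 10.5
y_bar = sum(y) / n                             # 451.25
sum_xy = sum(xi * yi for xi, yi in zip(x, y))  # 41,238
sum_x2 = sum(xi ** 2 for xi in x)              # 968

# Least squares slope and intercept.
b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar

def predict(years):
    """Point estimate of shares owned after the given number of years."""
    return b0 + b1 * years

print(round(b1, 1), round(b0, 1), round(predict(10), 1))
```

With unrounded coefficients the prediction for 10 years is about 431.9 shares. Using the coefficients rounded to the nearest tenth (44.3 and 38.8) gives 432.3, so the two differ only by rounding.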

20
Using the Regression Equation

Before using the regression model, we need to
assess how well it fits the data.

If we are satisfied with how well the model fits
the data, we can use it to make predictions for
y.
Illustration
Predict the selling price of a three-year-old car
with 40,000 km on the odometer

21
Bivariate Regression Assumptions
22
Estimating the Coefficients

The estimates are determined by
drawing a sample from the population of interest,
calculating sample statistics,
producing a straight line that cuts into the data.

The question is: which straight line fits best?
23
Ordinary Least Squares

1. "Best fit" means the differences between actual
values ($Y_i$) and predicted values ($\hat{Y}_i$) are a
minimum
But positive differences offset negative ones
2. OLS minimizes the sum of the squared
differences (or errors)

24
Least Squares Method
The best line is the one that minimizes the sum
of squared vertical differences between the
points and the line.
Let us compare two lines
The second line is horizontal
The smaller the sum of squared differences the
better the fit of the line to the data.
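
The comparison of two lines can be sketched numerically. The slide's own figure and data are not available, so this example reuses the stock-ownership data from the earlier example and assumes the horizontal line is drawn at the mean of y. It also shows why the differences are squared: the raw differences around the fitted line cancel to zero.

```python
# Data reused from the stock-ownership example (an assumption; the
# figure on this slide may have used different points).
x = [6, 12, 14, 6, 9, 13, 15, 9]
y = [300, 408, 560, 252, 288, 650, 630, 522]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

def sse(intercept, slope):
    """Sum of squared vertical differences between the points and a line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Line 1: the least squares line.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Line 2: a horizontal line through the mean of y (assumed position).
sse_fit = sse(b0, b1)
sse_flat = sse(y_bar, 0.0)

# Unsquared differences offset each other for the fitted line.
residual_sum = sum(yi - (b0 + b1 * xi) for xi, yi in zip(x, y))

print(sse_fit, sse_flat, residual_sum)
```

The fitted line has a much smaller sum of squared differences than the horizontal line, so by this criterion it fits the data better.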
25
Assumptions of OLS regression

Model is linear in parameters
The residuals are normally distributed
The residuals have constant variance
The expected value of the residuals is always
zero
The residuals are independent from one another
The X values are precise
The independent variables are not too strongly
collinear

If these assumptions are satisfied, then the OLS
estimator is unbiased and has the minimum variance
among all unbiased estimators.
How can we test these assumptions?
If assumptions are violated,
what does this do to our conclusions?
how do we fix the problem?

26
The Model

The first-order linear model, or simple
regression model, is
$y = \beta_0 + \beta_1 x + \varepsilon$
where
$y$ = dependent variable
$x$ = independent variable
$\beta_0$ = y-intercept
$\beta_1$ = slope of the line
$\varepsilon$ = error variable

27
Least Squares Method
To calculate the estimates of the coefficients
that minimize the differences between the data
points and the line, use the formulas
28
Least Squares Method
Now we define
29
Least Squares Method
Then
The estimated simple linear regression equation
that estimates the equation of the first order
linear model is
$\hat{y} = b_0 + b_1 x$
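
For reference, the standard least squares estimates can be written as follows. The notation $S_{xy}$ and $S_{xx}$ follows the earlier correlation illustration and is an assumption about the notation; the slides may have defined these quantities differently (for example with covariance and variance, which gives the same slope).

```latex
S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2

b_1 = \frac{S_{xy}}{S_{xx}}, \qquad
b_0 = \bar{y} - b_1 \bar{x}
```

Here $b_0$ and $b_1$ are the sample estimates of the intercept and slope of the first order linear model.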
30
Error Variable Required Conditions

The error $\varepsilon$ is a critical part of the regression
model.
Five requirements involving the distribution of $\varepsilon$
must be satisfied.
The mean of $\varepsilon$ is zero: $E(\varepsilon) = 0$.
The standard deviation of $\varepsilon$ is a constant ($\sigma_\varepsilon$)
for all values of x.
The errors are independent.
The errors are independent of the independent
variable x.
The probability distribution of $\varepsilon$ is normal.

31
Standard error of estimate

If $\sigma_\varepsilon$ is small, the errors tend to be close to
zero (close to the mean error). Then, the model
fits the data well.
Therefore, we can use $\sigma_\varepsilon$ as a measure of the
suitability of using a linear model.
An unbiased estimator of $\sigma_\varepsilon^2$ is given by $s_\varepsilon^2 = SSE / (n - 2)$
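
As a sketch of how the standard error of estimate is computed, the code below applies the usual formula, the sum of squared errors divided by n - 2, to the stock-ownership data from the earlier example. Reusing that data set here is an assumption made for illustration; the slide itself gives no numbers.

```python
from math import sqrt

# Data reused from the stock-ownership example (illustrative choice).
x = [6, 12, 14, 6, 9, 13, 15, 9]
y = [300, 408, 560, 252, 288, 650, 630, 522]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares coefficients.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Sum of squared errors around the fitted line.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Unbiased estimate of the error variance, and the standard error of estimate.
s2_e = sse / (n - 2)
s_e = sqrt(s2_e)

print(round(s_e, 2))
```

The standard error is in the units of y (shares), so it is judged relative to the size of y. Here it is roughly 91 shares against a mean of about 451.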

32
Assessing the Model

The least squares method will produce a
regression line whether or not there is a linear
relationship between x and y.
Consequently, it is important to assess how well
the linear model fits the data.
Several methods are used to assess the model
Testing and/or estimating the coefficients.
Using descriptive measurements.

33
Outliers

An outlier is an observation that is unusually
small or large.
Several possibilities need to be investigated
when an outlier is observed
There was an error in recording the value.
The point does not belong in the sample.
The observation is valid.
Identify outliers from the scatter diagram and
remove them.

34
Practice Problem

A car dealer wants to find the relationship
between the odometer reading and the selling
price of used cars.
A random sample of 100 cars is selected, and the
data recorded.
Find the regression line.

35
Solution

We need to calculate several statistics first

where n = 100.
36
Coefficient of Determination

A measure of the strength of the linear
relationship between x and y.
The larger the value of $r^2$, the more the value of
y depends in a linear way on the value of x.
Amount of variation in y that is related to
variation in x.
Ratio of the variation in y that is explained by
the regression model to the total variation
in y.
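
The ratio described above can be computed directly. This sketch again uses the stock-ownership data from the earlier example (an illustrative choice, not the practice problem's data) and checks that the explained share of variation equals the square of the correlation coefficient.

```python
# Data reused from the stock-ownership example (illustrative choice).
x = [6, 12, 14, 6, 9, 13, 15, 9]
y = [300, 408, 560, 252, 288, 650, 630, 522]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)  # total variation in y

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Variation left unexplained by the regression line.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Coefficient of determination: explained variation / total variation.
r2 = 1 - sse / s_yy

# Same quantity obtained by squaring the correlation coefficient.
r2_from_r = s_xy ** 2 / (s_xx * s_yy)

print(round(r2, 4))
```

For this data about 72% of the variation in shares owned is explained by years with the firm.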

37
Coefficient of determination