??9:10?12:00 A211?

Slides: 46

Provided by: Hun117

Transcript and Presenter's Notes

1
?????

? ?
???????
??910?1200 A211?
hchen_at_math.ntu.edu.tw

2
????

????,????????(?????2?)
???????
??????
?????????????
????(?????7?)
??????????
?????
?????
?????(?????8?)
Principal Component Analysis
Factor Analysis
Discriminant Analysis
Cluster Analysis
Canonical Correlation Analysis

???
??
????
R(??????)
R has a home page at http://www.r-project.org/
Download
??????
???(30)?projects(70)

4
? ?

??
Exploratory Data Analysis Decision Making
Data Mining
Data Collection ?????
????
R Software
????,????????
Probability and Random Variables
Variance
????
Association
IntroRegression
MultipleRegression
DAonREgression

5
? ?

?????
Principal Component Analysis
Factor Analysis
Discriminant Analysis
Cluster Analysis
Canonical Correlation Analysis

6
Statistics for Decision Making

Describing Sets of Data
Objective: Introduce numerical methods and
graphical displays to summarize data sets.
Graphical and numerical tools
for examining the distribution of a single
variable,
for comparing several distributions, and
for investigating changes over time.
Sampling and Statistical Inference
Objective: Provide methods to infer about a
population based on a sample of observations
drawn from that population.
Forecasting with Distinguishable Data
Objective: Introduce the basic concepts of
forecasting to motivate a regression model.
Method for studying relationships among several
variables.
Regression Coefficients and Forecasts
Objective: Understand regression coefficients and
how to use them for forecasting.

7
Statistics for Decision Making

Measures of Goodness of Fit and Residual Analysis
Objective: Introduce a few statistics that
measure how well a regression model fits the data
and show how to use residual analysis to detect
inadequacies of a regression model.
Developing a Regression Model
Objective: Demonstrate how to develop a useful
regression model through
Selection of the Dependent Variable
Selection of the Independent Variables
Determining the Nature of Relationships

8
Sampling and Statistical Inference

Objective: Provide methods to infer about a
population based on a sample of observations
drawn from that population.
Inference from a Sample
Statistical Estimation
From Margin of Error to Confidence Interval
Test of Significance

9
Inference from a Sample

The sample provides useful information, but the
information is imperfect.
Samples are taken when it is impossible,
impractical or too expensive to obtain complete
data on relevant population.
Ex. Suppose you asked 100 potential customers
how much they will spend on a proposed new
product next year.
From the 100 responses you obtained a sample
average of 250. You could make the following
inferences:
My best estimate of average sales per potential
customer is 250.
Average sales per potential customer will be
between 210 and 290 with 95% confidence.
Average sales per potential customer will be
greater than the break-even amount of 210 at a
2.5% level of significance.
Law of Large Numbers
Draw independent observations at random from any
population with finite mean $\mu$.
As the number of observations drawn increases,
the mean of the observed values eventually
approaches the mean $\mu$ of the population as
closely as you specify and then stays that
close.
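
The Law of Large Numbers can be watched in a small simulation. The sketch below is only an illustration: the population (rolls of a fair die, mean 3.5), the seed, and the helper name `running_means` are assumptions, not part of the slides.

```python
import random

def running_means(draws):
    """Return the mean of the first k observations, for every k."""
    total = 0.0
    means = []
    for k, x in enumerate(draws, start=1):
        total += x
        means.append(total / k)
    return means

random.seed(1)   # fixed seed so the run is repeatable
mu = 3.5         # population mean of a fair six-sided die (assumed population)
rolls = [random.randint(1, 6) for _ in range(100_000)]
means = running_means(rolls)

# The running mean wanders early on and settles near mu as n grows
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7,}: running mean = {means[n - 1]:.4f}")
```

With few observations the running mean can be far from 3.5; by the last rows it should sit very close to it.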

10
Sampling variability

Parameter: $p$, the proportion of the adult
population in the US (190 million) that finds
clothes shopping frustrating.
Statistic: $\hat{p} = 0.66$ (66%), or 1650 out of 2500 adults.
Sampling variability: The value of a statistic
varies in repeated random sampling.
Answer to "What would happen if we took many
samples?"
Take a large number of samples from the same
population.
Calculate the sample proportion $\hat{p}$ for each
sample.
Make a histogram of the values of $\hat{p}$.
Examine the distribution displayed in the
histogram.
We can imitate the chance behavior of many samples by
using random digits or a computer (simulation).
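
The four steps above can be sketched as a short simulation. The values used here ($p = 0.6$, samples of size 100, 1000 samples) are assumed for illustration, and each respondent is drawn independently, which approximates a simple random sample from a very large population.

```python
import random
from collections import Counter

def sample_proportion(p, n, rng):
    """Proportion of 'yes' answers in one sample of size n."""
    return sum(rng.random() < p for _ in range(n)) / n

rng = random.Random(2)             # fixed seed, so reruns match
p, n, n_samples = 0.6, 100, 1000   # assumed values, not survey data
p_hats = [sample_proportion(p, n, rng) for _ in range(n_samples)]

# Text histogram of the sample proportions, bins of width 0.02
bins = Counter(round(ph * n) // 2 * 2 / n for ph in p_hats)
for start in sorted(bins):
    print(f"{start:.2f} {'#' * (bins[start] // 5)}")

mean_p_hat = sum(p_hats) / n_samples
print(f"mean of the {n_samples} sample proportions: {mean_p_hat:.3f}")
```

The histogram should look roughly bell-shaped and centered near the assumed $p$.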

11
Sampling variability

The sampling distribution of a statistic is the
distribution of values taken by the statistic in
all possible samples of the same size from the
same population.
Can be either
approximated by simulation or
obtained exactly by probability theory in
statistics.
1000 SRSs of size 100 when $p = 0.6$.

12
1000 SRSs of size 100 and 2500 when $p = 0.6$
13
Bias and variance

A statistic is unbiased if the mean of its
sampling distribution is equal to the true value
of the parameter being estimated (no
favoritism).
The variability of a statistic is described by
the spread of its sampling distribution.
95% of the sample proportions will lie in the
range $0.6 \pm 0.1$ ($n = 100$) or $0.6 \pm 0.02$ ($n = 2500$).
Larger samples have smaller spreads.
As long as the population is much larger than the
sample, the spread of the sampling distribution
for a sample of fixed size n is approximately the
same for any population size.
An SRS of size 2500 from 270 million US residents
gives results as precise as an SRS of size 2500
from 740,000 inhabitants of SFO!
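
The two ranges and the population-size claim can be checked with the usual large-sample formulas. This is a sketch: it uses two standard errors, $2\sqrt{p(1-p)/n}$, as the approximate 95% half-width, and the standard finite population correction $\sqrt{(N-n)/(N-1)}$ for sampling without replacement. The value $p = 0.6$ in the population comparison is an assumption.

```python
import math

def two_standard_errors(p, n):
    """Approximate 95% half-width for a sample proportion."""
    return 2 * math.sqrt(p * (1 - p) / n)

def fpc(population_size, n):
    """Finite population correction for an SRS drawn without replacement."""
    return math.sqrt((population_size - n) / (population_size - 1))

p = 0.6
for n in (100, 2500):
    print(f"n = {n:>4}: 0.6 +/- {two_standard_errors(p, n):.3f}")

# Same sample size, very different population sizes
for N in (270_000_000, 740_000):
    half_width = two_standard_errors(p, 2500) * fpc(N, 2500)
    print(f"N = {N:>11,}: half-width = {half_width:.4f}")
```

The correction factor stays very close to 1 for both populations, which is why the precision depends almost entirely on the sample size.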

15
Why randomize?

The act of randomizing guarantees that the
results of analyzing our data are subject to the
laws of probability.
Randomization removes bias.
Replication (bigger sample) reduces variance.
Better answer What would happen if the sample or
the experiment were repeated many times?
Caution: the sampling distribution does not
reflect bias due to under-coverage, non-response,
lack of realism, etc.

16
Presidential Election and Poll

17
??1936???????

??????????????????????????????
??????????????
??????,?1929??1933?????????????
??????????????????The spender must go?
???????????????? (deficit financing)????Balance
the budget of the American people first?
????????????????????
???Literary Digest????????57?43?????
?????????????????????
????1916??,????????????????
????????62?38?????????
?????-???-???
??Literary Digest??????????????,????????,???????56
?44?????
?????????????,??????????56?44?????

18
Digest???????

?????????????,????????,????????????????????
????????????????,????????
???????
????Digest?????????????,???????????????
??????????
????????,?????????????,???20???,????????????,????
????????????????

19
??????????????????

????16??????393??????????????,
???1033???????
????????,???????????????????,?????????????????,???
??16??????????????(????)???
?????????,?????????????????

20
Digest???????

?????????????,????????,????????????????????
(???????????????????????)?
???????
????Digest?????????????,???????????????
??????????
????????,?????????????,???20???,????????????,????
????????????????

21
Statistical Estimation

A parameter is a number that describes the
population.
Its value is fixed but unknown.
A statistic is a number that describes a sample.
Its value is known for a sample, but it can
change from sample to sample.
We use a statistic to estimate an unknown
parameter.
Error of estimation is the difference between an
estimate and the estimated parameter.
In case of estimating the population mean using
the sample mean,
Error of estimation = sample mean - population mean
The distribution of the error of estimation: Central
Limit Theorem
If the sample size is large, the error of
estimation is approximately normally distributed
with mean zero and a standard deviation which can
be estimated by
Standard error: $\text{SE} = s/\sqrt{n}$, where $s$ is the sample standard
deviation and $n$ is the sample size.
The Normal Distribution
If $X$ has the $N(\mu, \sigma^2)$ distribution, then $Z = (X - \mu)/\sigma$
has the $N(0, 1)$ distribution.

22
The normal density

The height of the normal density curve for the
normal distribution with mean $\mu$ and SD $\sigma$ is given
by
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x - \mu)^2 / (2\sigma^2)}$

Why are normal distributions important?
Good description for some distributions of real
data. (e.g. test scores, repeated measurements,
characteristics of biological populations, etc.)
Good approximations to the results of many kinds
of chance outcomes. (e.g. coin tossing).
Many statistical inference procedures based on
normal distributions work well for other roughly
symmetric distributions.

23
From Margin of Error to Confidence Interval

What is the probability that the error of
estimation exceeds two standard errors?
If we add two standard errors to our estimate as
the margin of error, what can we say about the
resulting interval estimate?
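
Under the normal approximation, both questions have a direct numerical answer. The sketch below uses only the standard library. Pairing the sample mean of 250 from the customer example with a sample SD of 200 (the figure quoted in the test-of-significance example) is an assumption made for illustration.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Chance that the error of estimation exceeds two standard errors
# in either direction, if the error is exactly N(0, SE^2)
p_outside = 2 * (1 - normal_cdf(2))
print(f"P(|Z| > 2) = {p_outside:.4f}")

# Interval estimate: estimate plus or minus two standard errors
sample_mean, sample_sd, n = 250, 200, 100   # sd = 200 is an assumed pairing
se = sample_sd / math.sqrt(n)
low, high = sample_mean - 2 * se, sample_mean + 2 * se
print(f"approximate 95% interval: ({low:.0f}, {high:.0f})")
```

The error exceeds two standard errors a little under 5% of the time, so the interval built this way captures the population mean in roughly 95% of repeated samples.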
Confidence and Probability
When reporting that a confidence interval for a
population mean extends from 210 to 290, it is
tempting to slip into the language of
probability, and say there is only a 5% chance that
the true mean of the population is outside this
interval.
Such a probabilistic interpretation is much more
natural and appealing than the rather convoluted
interpretation above. But is it legitimate?
Example
Suppose from a sample of 100 potential customers
one market researcher obtained a 95% confidence
interval of (190, 210) for the average amount a
potential customer will spend on a product next
year.
Another market researcher from a different sample
of size 400 obtained a 95% confidence interval of
(215, 225).
How do you reconcile these two results?
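
One way to think about this question is that the 95% describes the interval-building procedure over many samples, not any single interval. The simulation below is a sketch with entirely hypothetical population values (mean 220, SD 100, normal shape). It counts how often the interval "sample mean plus or minus two standard errors" contains the true mean.

```python
import random
import statistics

def covers(true_mean, sample):
    """True if mean +/- 2 SE computed from this sample contains true_mean."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m - 2 * se <= true_mean <= m + 2 * se

rng = random.Random(3)                                 # fixed seed
true_mean, true_sd, n, trials = 220, 100, 100, 2000    # all hypothetical
hits = sum(
    covers(true_mean, [rng.gauss(true_mean, true_sd) for _ in range(n)])
    for _ in range(trials)
)
coverage = hits / trials
print(f"intervals covering the true mean: {coverage:.1%}")
```

About one interval in twenty misses, so two researchers can both follow a sound procedure and still report intervals that do not overlap.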

24
Test of Significance

Example 1: A market researcher asked a sample of
100 potential customers how much they plan to
spend on a product next year.
The mean of the sample turned out to be 250 and
the standard deviation 200.
Is it likely that average sales per capita
exceed a break-even level of 208?
Example 2: Suppose a manager is trying to decide
which of two new products, A or B, to
introduce. Break-even sales per capita are 208
for both A and B.
Sample results are given below.
Product A: sample size = 10,000, sample mean = 211,
sample SD = 100
Product B: sample size = 100, sample mean = 250,
sample SD = 300
Example 3: In a Business Week/Harris executive
poll, senior executives were asked, "Compared
with the last 12 months, do you think the rate of
growth of the gross domestic product will go up,
go down, or stay the same for the next 12 months?"
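
For Examples 1 and 2, a one-sided z test against the break-even level of 208 might be sketched as follows. The function name `z_test_greater` is hypothetical. Example 1 is run with a sample mean of 250, the figure from the earlier customer-survey slide, which is an assumption here.

```python
import math

def z_test_greater(sample_mean, sample_sd, n, break_even):
    """One-sided z test of H0: mean <= break_even vs H1: mean > break_even."""
    se = sample_sd / math.sqrt(n)
    z = (sample_mean - break_even) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail normal probability
    return z, p_value

cases = {
    "Example 1 (mean assumed 250)": (250, 200, 100),
    "Example 2, Product A": (211, 100, 10_000),
    "Example 2, Product B": (250, 300, 100),
}
for label, (mean, sd, n) in cases.items():
    z, p = z_test_greater(mean, sd, n, break_even=208)
    print(f"{label}: z = {z:.2f}, one-sided p = {p:.4f}")
```

Product A shows the stronger evidence of exceeding break-even even though its sample mean is only 3 above it, because its standard error is so small. Product B has the larger estimated margin but much weaker evidence. Statistical significance and practical importance are separate questions.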

25
Test for Independence

Application on Business outlook
Results of this poll are summarized below
(Business Week, 1/09/95).
                        Date of Survey
Outlook          12/94    6/94   12/93   Total
Go Up              152     177     101     430
Go Down            104      72      36     212
Stay the Same      144     152     261     557
Not Sure             0       0       4       4
Total              400     401     402    1203
Have the executives changed their outlook over
time?
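
A Pearson chi-square test of independence is one common way to approach this question. The sketch below computes the statistic from the poll counts with plain Python. The helper name `chi_square` is hypothetical, and the "Not Sure" row has very small expected counts, so the chi-square approximation should be read with some caution.

```python
counts = {
    "Go Up": [152, 177, 101],          # columns: 12/94, 6/94, 12/93
    "Go Down": [104, 72, 36],
    "Stay the Same": [144, 152, 261],
    "Not Sure": [0, 0, 4],
}

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(rows, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand   # count expected under independence
            stat += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, df

stat, df = chi_square(counts)
print(f"chi-square = {stat:.1f} on {df} degrees of freedom")
# The 5% critical value of chi-square with 6 degrees of freedom is 12.59
```

A statistic far above the critical value would suggest that the distribution of outlooks differs across the three survey dates.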

26
Relations in categorical data

Relationship between two or more categorical
variables.
Use counts (frequencies) or percent (relative
frequencies) of individuals that fall into
various categories.
Two-way table
A two-way table describes two categorical
variables.
Each horizontal row in the table describes
individuals with one level of the row variable.
Each vertical column describes individuals with
one level of the column variable.
Ex.: Years of school completed, by age (thousands
of persons)

27
Marginal distributions

Look at the distribution of each variable
separately.
Total columns list the totals for each of the
rows or row totals. Similarly for column totals.
Row and column totals specify the marginal
distributions of each of the two categorical
variables.
The distribution of years of schooling completed
among people age 25 years and over

28
Describing relationships

What percent of people aged 25 to 34 have
completed 4 years of college?
What percent of people aged 35 to 54 have
completed 4 years of college?
What percent of people aged 55 and over have
completed 4 years of college?
Conclusion?

29
Conditional distribution of age group on the
education level
30
Three way table

The table of outcome by hospital by patient
condition is a three-way table that reports the
frequencies of each combination of levels of
three categorical variables.
We can aggregate a three-way table into a two-way
table.
A variable being aggregated can become a lurking
variable.

31
NSF study on the salaries of new women engineers

The median salary of newly graduated female
engineers and scientists was 73% of that for
males.
Field is a lurking variable. (life and social
sciences against physical and engineering)

32
Establishing causation

The best (and only?) method of establishing
causation is to conduct a carefully designed
experiment in which the effects of possible
lurking variables are controlled.
What other criteria can we use when we can't do an
experiment?

33
Smoking causes lung cancer

The association is strong.
The association is consistent.
Higher doses are associated with stronger
responses.
The alleged cause precedes the effect in time.
The alleged cause is plausible.

34
Forecasting with Distinguishable Data

Objective: Introduce the basic concepts of
forecasting to motivate a regression model.
Forecasting with Indistinguishable Data
If the future value of the variable you would
like to forecast is indistinguishable from the
sample values you collected, then you forecast
with indistinguishable data.
Example 1: To help forecast the selling price
of your house, you obtained a sample of ten
selling prices: 109,360; 137,980; 131,230;
130,230; 125,410; 124,370; 139,030; 140,160;
144,220; 154,190.
Forecasting when the Data are Distinguishable
When your sample contains additional information
so that the sample values are no longer
indistinguishable from the future value you would
like to forecast, you forecast with
distinguishable data.
Example 2: Our sample also contains
information on the square footage of the ten
houses, as (price, square feet) pairs:
(109,360, 1404), (137,980, 1477),
(131,230, 1503), (130,230, 1552),
(125,410, 1608), (124,370, 1633),
(139,030, 1717), (140,160, 1775),
(144,220, 1838), (154,190, 1934).

35
Forecasting with Distinguishable Data

Assume that your house has 1682 square feet of
living area.
Analysis 1: sample average of all ten houses =
133,618 (SD = 12,406)
Analysis 2: Stratify the sample according to house
size (square feet).
Size Range    Sample Average        SD    Number of Observations
1400-1599            127,200    12,381                         4
1600-1799            132,243     8,513                         4
1800-1999            149,205     7,050                         2
Then use 132,243 (instead of 133,618) to
forecast the selling value.
Does the cell standard deviation properly measure
the forecast uncertainty?
Is it possible to have a measure of overall
efficacy of our partitioning the sample into
cells?
Use the data more efficiently: The stratification
method that we used is unsatisfactory for two
reasons. First, we have ignored data on houses
that are like, but not most like, yours.
Second, we have stratified the data somewhat
arbitrarily.
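
The two analyses can be reproduced from the ten (price, square feet) pairs given in the earlier example. This is a minimal sketch using the standard library. The cell boundaries at 1400, 1600 and 1800 square feet follow the table above, and `statistics.stdev` is the sample standard deviation (divisor n - 1).

```python
import statistics

houses = [  # (selling price, square feet)
    (109_360, 1404), (137_980, 1477), (131_230, 1503), (130_230, 1552),
    (125_410, 1608), (124_370, 1633), (139_030, 1717), (140_160, 1775),
    (144_220, 1838), (154_190, 1934),
]

# Analysis 1: ignore size and use the overall sample average
prices = [price for price, _ in houses]
print(f"all houses: mean = {statistics.mean(prices):,.1f}, "
      f"SD = {statistics.stdev(prices):,.1f}")

# Analysis 2: group houses into 200-square-foot cells
cells = {}
for price, sqft in houses:
    low = sqft // 200 * 200            # 1400, 1600 or 1800
    cells.setdefault(low, []).append(price)

for low in sorted(cells):
    cell = cells[low]
    print(f"{low}-{low + 199}: mean = {statistics.mean(cell):,.1f}, "
          f"SD = {statistics.stdev(cell):,.1f}, n = {len(cell)}")

# Forecast for a 1682-square-foot house: the mean of its own cell
forecast = statistics.mean(cells[1682 // 200 * 200])
print(f"stratified forecast: {forecast:,.1f}")
```

Only four of the ten observations feed the stratified forecast, which is the inefficiency that motivates a regression model.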

36
The question of causation

Mother's adult height vs daughter's adult height.
Amount of saccharin in a rat's diet vs count of
tumors in the rat's bladder.
A student's SAT score vs the student's first-year
GPA.
Monthly flow of money into stock mutual funds vs
monthly rate of return for the stock market.
The anesthetic used in surgery vs whether the
patient survives the surgery.
The number of years of education a worker has vs
the worker's income.

37
Explaining association

Causation.
Common response. (a lurking variable).
Confounding: two variables are confounded when
their effects on a response variable are mixed
together.

38
Data on the survival of patients after surgery in
hospital A and B

Hospital A loses 3% of patients while Hospital B
loses 2%.

39
Lurking variable...

1% vs 1.3% for patients in good condition
3.8% vs 4% for patients in bad condition

40
Simpson's paradox

How can A do better in each group, yet do worse
overall??
An association or comparison that holds for all
of several groups can reverse direction when the
data are combined to form a single group.
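
The reversal is easy to reproduce with a small table of counts. The transcript does not include the underlying survival table, so the counts below are hypothetical. They were chosen only because they give the percentages quoted on the hospital slides (1% vs 1.3%, 3.8% vs 4%, and 3% vs 2% overall).

```python
# (deaths, patients) for each hospital and patient condition.
# Hypothetical counts, chosen only to match the slide percentages.
counts = {
    "A": {"good": (6, 600), "bad": (57, 1500)},
    "B": {"good": (8, 600), "bad": (8, 200)},
}

def death_rate(deaths, patients):
    """Death rate in percent."""
    return 100 * deaths / patients

def overall_rate(groups):
    """Death rate in percent after combining all condition groups."""
    deaths = sum(d for d, _ in groups.values())
    patients = sum(n for _, n in groups.values())
    return death_rate(deaths, patients)

for hospital, groups in counts.items():
    for condition, (d, n) in groups.items():
        print(f"Hospital {hospital}, {condition} condition: {death_rate(d, n):.1f}%")
    print(f"Hospital {hospital}, all patients: {overall_rate(groups):.1f}%")
```

With these counts, Hospital A treats far more patients in bad condition, so its combined rate is pulled toward the higher bad-condition rate even though it does better within each group.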

41
Regression Model

Try to create a model that specifies the
relationship between selling price (dependent
variable) and other variables (independent or
explanatory variable) that help you forecast the
selling price.
It is reasonable to assume that as size goes up,
selling price will go up on average.
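
As a sketch of such a model, a straight line can be fitted by least squares to the ten (price, square feet) pairs from the earlier forecasting example and then used to forecast the price of a 1682-square-foot house. The simple linear form is an assumption for illustration. The slides do not state which model the course fits.

```python
houses = [  # (selling price, square feet)
    (109_360, 1404), (137_980, 1477), (131_230, 1503), (130_230, 1552),
    (125_410, 1608), (124_370, 1633), (139_030, 1717), (140_160, 1775),
    (144_220, 1838), (154_190, 1934),
]

n = len(houses)
mean_x = sum(sqft for _, sqft in houses) / n
mean_y = sum(price for price, _ in houses) / n

# Least-squares slope and intercept for price = intercept + slope * sqft
sxy = sum((sqft - mean_x) * (price - mean_y) for price, sqft in houses)
sxx = sum((sqft - mean_x) ** 2 for _, sqft in houses)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

forecast = intercept + slope * 1682
print(f"price = {intercept:,.0f} + {slope:.1f} * sqft")
print(f"forecast for 1682 sq ft: {forecast:,.0f}")
```

Unlike the stratified average, the fitted line uses all ten houses and needs no arbitrary cell boundaries.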

42
Regression Coefficients and Forecasts

Objective: Understand regression coefficients and
how to use them for forecasting.

43
Measures of Goodness of Fit and Residual Analysis

Objective: Introduce a few statistics that
measure how well a regression model fits the data
and show how to use residual analysis to detect
inadequacies of a regression model.

44
Developing a Regression Model