Title: Financial Data Mining and Analysis
1Financial Data Mining and Analysis
- References
- Prof. Hua Chen's lecture notes (National Taiwan University)
- William J. Holstein, U.S. News and World Report, Business Technology section, 12/21/98
- Prof. Juran's lecture note 1 (Columbia University)
- J.H. Friedman (1999), Data Mining and Statistics, technical report, Dept. of Statistics, Stanford University
2Main Goal
- Study statistical tools useful in managerial decision making.
- Most management problems involve some degree of uncertainty.
- People have poor intuitive judgment of uncertainty.
- IT revolution: abundance of available quantitative information
- data mining large databases of information
- market segmentation and targeting
- stock market data
- almost anything else you may want to know...
- What conclusions can you draw from your data?
- How much data do you need to support your conclusions?
3Applications in Management
- Operations management
- e.g., model uncertainty in demand, production function...
- Decision models
- portfolio optimization, simulation, simulation-based optimization...
- Capital markets
- understand risk, hedging, portfolios, betas...
- Derivatives, options, ...
- it is all about modeling uncertainty
- Operations and information technology
- dynamic pricing, revenue management, auction design, ...
- Data mining... many applications
4Portfolio Selection
- You want to select a stock portfolio of companies A, B, C, ...
- Information: annual returns of each stock, by year
- A: 10, 14, 13, 27, ...
- B: 16, 27, 42, 23, ...
- Questions
- How do we measure the volatility of each stock?
- How do we quantify the risk associated with a given portfolio?
- What is the tradeoff between risk and returns?
6Currency Value (Relative to Jan 2 1998)
7Introduction
- Premise: all business is becoming information driven.
- The concept of data mining is becoming increasingly popular as a business information management tool, where it is expected to reveal knowledge structures that can guide decisions in conditions of limited certainty.
- Competitiveness: how do you collect and exploit information to your advantage?
- The challenges
- Most corporate data systems are not ready.
- Can they share information?
- What is the quality of the input information?
- Most data techniques come from the empirical sciences, but the world is not a laboratory.
- Defining good metrics: abandoning gut rules of thumb may be too "risky" for the manager.
- Communicating success, setting the right expectations.
8A visualization of a Naive Bayes model for predicting who in the U.S. earns more than 50,000 dollars in yearly salary. The higher the bar, the greater the amount of evidence that a person with this attribute value earns a high salary.
9Data Mining and Statistics
- Data mining is used to discover patterns and relationships in data, with an emphasis on large observational databases.
- It sits at the common frontiers of several fields, including database management, artificial intelligence, machine learning, pattern recognition, and data visualization.
- From a statistical perspective it can be viewed as computer-automated exploratory data analysis of large, complex data sets.
- Many organizations have large transaction-oriented databases used for inventory, billing, accounting, etc. These databases were very expensive to create and are costly to maintain. For a relatively small additional investment, DM tools offer to discover highly profitable nuggets of information hidden in these data.
- Data, especially large amounts of it, resides in database management systems (DBMS).
- Conventional DBMS are focused on online transaction processing (OLTP), that is, the storage and fast retrieval of individual records for purposes of data organization. They are used to keep track of inventory, payroll records, billing records, invoices, etc.
10Data Mining Techniques
- Data mining is an analytic process designed
- to explore data (usually large amounts of data, typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and
- to validate the findings by applying the detected patterns to new subsets of data.
- The ultimate goal of data mining is prediction; predictive data mining is the most common type of data mining and the one that has the most direct business applications.
- The process of data mining consists of three stages:
- the initial exploration,
- model building or pattern identification with validation and verification,
- deployment (i.e., the application of the model to new data in order to generate predictions).
11Stage 1 Exploration
- It usually starts with data preparation which may
involve cleaning data, data transformations,
selecting subsets of records and - in case of
data sets with large numbers of variables
("fields") - performing some preliminary feature
selection operations to bring the number of
variables to a manageable range (depending on the
statistical methods which are being considered).
- Depending on the nature of the analytic problem,
this first stage of the process of data mining
may involve anywhere between a simple choice of
straightforward predictors for a regression
model, to elaborate exploratory analyses using a
wide variety of graphical and statistical methods
in order to identify the most relevant variables
and determine the complexity and/or the general
nature of models that can be taken into account
in the next stage.
12Stage 2 Model building and validation
- This stage involves considering various models and choosing the best one based on predictive performance, that is,
- explaining the variability in question and
- producing stable results across samples.
- This may sound like a simple operation, but in fact it sometimes involves a very elaborate process.
- Many of the techniques used are based on the so-called "competitive evaluation of models," that is, applying different models to the same data set and then comparing their performance to choose the best.
- These techniques, which are often considered the core of predictive data mining, include Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.
13Models for Data Mining
- In the business environment, complex data mining projects may require the coordinated efforts of various experts, stakeholders, or departments throughout an entire organization.
- In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organize the process of gathering data, analyzing data, disseminating results, implementing results, and monitoring improvements.
- CRISP (Cross-Industry Standard Process for data mining) was proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining.
- The Six Sigma methodology is a well-structured, data-driven methodology for eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery, management, and other business activities.
14CRISP
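- CRISP postulates the following general sequence of steps for data mining projects:
- Business understanding --> Data understanding --> Data preparation --> Modeling --> Evaluation --> Deployment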
15Six Sigma
- This model has recently become very popular (due to its successful implementations) in various American industries, and it appears to be gaining favor worldwide. It postulates a sequence of so-called DMAIC steps.
- The categories of activities: Define (D), Measure (M), Analyze (A), Improve (I), Control (C).
- It postulates the following general sequence of steps for data mining projects:
- Define (D) --> Measure (M) --> Analyze (A) --> Improve (I) --> Control (C)
- It grew out of the manufacturing, quality improvement, and process control traditions and is particularly well suited to production environments (including "production of services," i.e., service industries).
- Define. This phase is concerned with the definition of project goals and boundaries, and the identification of issues that need to be addressed to achieve the higher sigma level.
- Measure. The goal of this phase is to gather information about the current situation, to obtain baseline data on current process performance, and to identify problem areas.
- Analyze. The goal of this phase is to identify the root cause(s) of quality problems, and to confirm those causes using the appropriate data analysis tools.
- Improve. The goal of this phase is to implement solutions that address the problems (root causes) identified during the previous (Analyze) phase.
- Control. The goal of the Control phase is to evaluate and monitor the results of the previous phase (Improve).
16Sampling
- Objective: Determine the average amount of money spent in the Central Mall.
- Sampling: A Central City official randomly samples 12 people as they exit the mall.
- He asks them the amount of money spent and records the data.
- Data for the 12 people:

Person  Spent    Person  Spent    Person  Spent
1       132      5       123      9       449
2       334      6       5        10      133
3       33       7       6        11      44
4       10       8       14       12      1

- The official is trying to estimate the mean and variance of the population based on a sample of 12 data points.
17Population versus Sample
- A population is usually a group we want to know something about:
- all potential customers, all eligible voters, all the products coming off an assembly line, all items in inventory, etc.
- Finite population $\{u_1, u_2, \ldots, u_N\}$ versus infinite population
- A population parameter is a number ($\theta$) relevant to the population that is of interest to us:
- the proportion (in the population) that would buy a product, the proportion of eligible voters who will vote for a candidate, the average number of M&M's in a pack, ...
- A sample is a subset of the population that we actually do know about (by taking measurements of some kind):
- a group who fill out a survey, a group of voters that are polled, a number of randomly chosen items off the line, ...
- $x_1, x_2, \ldots, x_n$
- A sample statistic $g(x_1, x_2, \ldots, x_n)$ is often the only practical estimate of a population parameter.
- We will use $g(x_1, x_2, \ldots, x_n)$ as a proxy for $\theta$, but remember that the two are different.
18Average Amount of Money spent in the Central Mall
- A sample $(x_1, x_2, \ldots, x_n)$
- Its mean is the sum of the values divided by the number of observations.
- The sample mean, the sample variance, and the sample standard deviation are 107, 20,854, and 144.41, respectively.
- The sample suggests that on average 107 dollars are spent per shopper, with a standard deviation of 144.41 dollars.
19- The variance $s^2$ of a set of observations is the average of the squares of the deviations of the observations from their mean (for a sample of size $n$, the sum of squared deviations is divided by $n - 1$).
- The standard deviation $s$ is the square root of the variance $s^2$.
- How far are the observations from the mean? $s^2$ and $s$ will be
- large if the observations are widely spread about their mean,
- small if they are all close to the mean.
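As a quick check of the summary statistics quoted for the Central Mall sample, here is a minimal sketch using Python's standard `statistics` module. The list `spent` is transcribed from the 12 amounts on the Sampling slide; the variable names are illustrative only.

```python
import statistics

# Amounts spent by the 12 sampled shoppers (transcribed from the Sampling slide).
spent = [132, 334, 33, 10, 123, 5, 6, 14, 449, 133, 44, 1]

sample_mean = statistics.mean(spent)      # sum of the values divided by n
sample_var = statistics.variance(spent)   # squared deviations summed, divided by n - 1
sample_sd = statistics.stdev(spent)       # square root of the sample variance

print(sample_mean, round(sample_var), round(sample_sd, 2))
```

Note that `statistics.variance` divides by n - 1; `statistics.pvariance` would divide by n and give a smaller value.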
20Stock Market Indexes
- A stock market index is a statistical measure that shows how the prices of a group of stocks change over time.
- Price-Weighted Index: DJIA
- Market-Value-Weighted Index: Standard & Poor's 500 Composite Index
- Equally Weighted Index: Wilshire 5000 Equity Index
- Price-Weighted Index: it shows the change in the average price of the stocks that are included in the index.
- Price per share in the current period $P_0$ and price per share in the next period $P_1$.
- Number of shares outstanding in the current period $Q_0$ and number of shares outstanding in the next period $Q_1$.
21Data Analysis
- Statistical thinking is understanding variation and how to deal with it.
- Move as far as possible to the right on this continuum:
- Ignorance --> Uncertainty --> Risk --> Certainty
- Information science: learning from data
- Probabilistic inference based on mathematics
- What is Statistics?
- What is the connection, if any?
- Fields including Data Base Management, Artificial Intelligence
22Probability the study of randomness
- It is based on a lecture given by Professor
Costis Maglaras at Columbia University.
23Randomness
- A phenomenon is random
- if individual outcomes are uncertain but there is
a regular distribution of outcomes in a large
number of repetitions.
24Probability
- The probability of any outcome of a random phenomenon is its long-term relative frequency, i.e.,
- the proportion of times the outcome would occur in a very long series of repetitions (empirical).
- Trials need to be independent.
- Computer simulation is a good tool to study random behavior.
- The uses of probability
- It began with gambling.
- It is now applied to analyze data in astronomy, mortality data, traffic flow, telephone interchange, genetics, epidemics, investment...
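The long-run relative frequency idea can be illustrated by simulation. The following is a minimal sketch, assuming a fair coin (probability of heads 0.5) and a fixed seed so the run is repeatable; the function name `relative_frequency` is hypothetical.

```python
import random

def relative_frequency(n_tosses, p_heads=0.5, seed=0):
    """Simulate n_tosses independent coin tosses and return the proportion of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p_heads for _ in range(n_tosses))
    return heads / n_tosses

# The proportion of heads settles near 0.5 as the number of tosses grows.
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

Individual tosses remain unpredictable, but the proportion stabilizes, which is the sense of "random" used on the Randomness slide.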
25Probability Terms
- Random Experiment: A process leading to at least 2 possible outcomes with uncertainty as to which will occur.
- Event: An event is a subset of all possible outcomes of an experiment.
- Intersection of Events: Let A and B be two events. Then the intersection of the two events, denoted $A \cap B$, is the event that both A and B occur.
- Union of Events: The union of the two events, denoted $A \cup B$, is the event that A or B (or both) occurs.
- Complement: Let A be an event. The complement of A (denoted $A^c$) is the event that A does not occur.
- Mutually Exclusive Events: A and B are said to be mutually exclusive if at most one of the events A and B can occur.
- Basic Outcomes: The simple indecomposable possible results of an experiment. One and exactly one of these outcomes must occur. The set of basic outcomes is mutually exclusive and collectively exhaustive.
- Sample Space: The totality of basic outcomes of an experiment.
26Basic Probability Rules
- 1. For any event A, $0 \le P(A) \le 1$.
- 2. If A and B can never both occur (they are mutually exclusive), then
- $P(A \text{ and } B) = P(A \cap B) = 0$.
- 3. $P(A \text{ or } B) = P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
- 4. If A and B are mutually exclusive events, then $P(A \text{ or } B) = P(A \cup B) = P(A) + P(B)$.
- 5. $P(A^c) = 1 - P(A)$.
- Independent Events
- Two events A and B are said to be independent if the fact that A has occurred or not does not affect your assessment of the probability of B occurring. Conversely, the fact that B has occurred or not does not affect your assessment of the probability of A occurring.
- 6. If A and B are independent events, then
- $P(A \text{ and } B) = P(A \cap B) = P(A) \times P(B)$.
(Markov??)
27Probability models
- A probability model has two parts, as in coin tossing:
- A list of possible outcomes.
- A probability for each outcome.
- The sample space $S$ of a random phenomenon is the set of all possible outcomes.
- Example: $S = \{\text{heads}, \text{tails}\} = \{H, T\}$
- General analysis is possible.
28Event
- An event is an outcome or a set of outcomes (it is a subset of the sample space).
- Example: $A = \{HHTT, HTHT, HTTH, THHT, THTH, TTHH\}$, the event of exactly two heads in four coin tosses.
- Two events A and B are independent if knowing that one occurs does not change the probability that the other occurs.
- If A and B are independent, $P(A \text{ and } B) = P(A)P(B)$.
- Are the heads of successive coin tosses independent or not independent?
- Are the colors of successive cards dealt from the same deck independent or not independent?
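The event A above and the independence question for coin tosses can be checked by enumerating the sample space. This is a small sketch assuming four tosses of a fair coin, so that all 16 outcomes are equally likely; the helper `prob` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space for four tosses of a fair coin: 16 equally likely outcomes.
sample_space = ["".join(t) for t in product("HT", repeat=4)]

def prob(event):
    """Probability of an event (a set of outcomes) under equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

A = {s for s in sample_space if s.count("H") == 2}        # exactly two heads
first_heads = {s for s in sample_space if s[0] == "H"}    # heads on toss 1
second_heads = {s for s in sample_space if s[1] == "H"}   # heads on toss 2

print(sorted(A), prob(A))
# Independence check: P(A and B) equals P(A) * P(B) for successive tosses.
print(prob(first_heads & second_heads) == prob(first_heads) * prob(second_heads))
```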
29Conditional Probability
- $P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
- In these simple calculations, we are making use of the conditional probability formula
- $P(A \mid B) = P(A \text{ holds given that } B \text{ holds}) = P(A \cap B)/P(B)$
- This relationship is known as Bayes' Law, after the English clergyman Thomas Bayes (1702-1761), who first derived it. Bayes' Law was later generalized by the French mathematician Pierre-Simon Laplace (1749-1827).
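To make the conditional probability formula concrete, the sketch below enumerates all ordered pairs of cards dealt from a 52-card deck, modeled only by color (26 red, 26 black). Here A is "first card is red" and B is "second card is red"; the deck encoding and helper `prob` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations

deck = ["R"] * 26 + ["B"] * 26                 # colors only: 26 red, 26 black
draws = list(permutations(range(52), 2))       # ordered pairs of distinct cards

def prob(pred):
    """Probability that a random ordered pair of cards satisfies pred."""
    return Fraction(sum(1 for d in draws if pred(d)), len(draws))

p_A = prob(lambda d: deck[d[0]] == "R")                           # first card red
p_B = prob(lambda d: deck[d[1]] == "R")                           # second card red
p_AB = prob(lambda d: deck[d[0]] == "R" and deck[d[1]] == "R")    # both red

p_B_given_A = p_AB / p_A                 # P(B | A) = P(A and B) / P(A)
p_A_given_B = p_B_given_A * p_A / p_B    # reversing the conditioning

print(p_B, p_B_given_A, p_A_given_B)
```

Since P(B | A) differs from P(B), the colors of successive cards from the same deck are not independent.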
30Random Variables
- A random variable is a variable whose value is a numerical outcome of a random phenomenon.
- Sample spaces need not consist of numbers.
- Example: the number of heads in 4 coin tosses.
31Random Variable
- A random variable is called discrete if it has countably many possible values; otherwise, it is called continuous.
- The following quantities would typically be modeled as discrete random variables:
- The number of defects in a batch of 20 items.
- The number of people preferring one brand over another in a market research study.
- The following would typically be modeled as continuous random variables:
- The yield on a 10-year Treasury bond three years from today.
- The proportion of defects in a batch of 10,000 items.
- Sometimes, we approximate a discrete random variable with a continuous one if the possible values are very close together; e.g., stock prices are often treated as continuous random variables.
32Distribution discrete
- If $X$ is a discrete random variable, then we denote its pmf by $P_X$.
- The rule that assigns specific probabilities to specific values for a discrete random variable is called its probability mass function or pmf.
- For any value $x$, $P_X(x)$ is the probability of the event that $X = x$, i.e.,
- $P_X(x) = P(X = x)$ = probability that the value of $X$ is $x$.
- We always use capital letters for random variables. Lower-case letters like $x$ and $y$ stand for possible values (i.e., numbers).
- The pmf gives us one way to describe the distribution of a random variable. Another way is provided by the cumulative probability function (cpf), denoted by $F_X$ and defined by $F_X(x) = P(X \le x)$.
- It is the probability that $X$ is less than or equal to $x$.
- While the pmf gives the probability that the random variable lands on a particular value, the cpf gives the probability that it lands on or below a particular value. In particular, $F_X$ is always a nondecreasing function.
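As an illustration of a pmf and its cumulative probability function, the sketch below uses the number of heads in 4 tosses of a fair coin (the example from the Random Variables slide). The names `pmf` and `cdf` are illustrative.

```python
from fractions import Fraction
from math import comb

n = 4  # number of fair coin tosses

# pmf: P_X(x) = C(n, x) / 2**n for x = 0, 1, ..., n
pmf = {x: Fraction(comb(n, x), 2**n) for x in range(n + 1)}

def cdf(x):
    """F_X(x) = P(X <= x): add up the pmf over all values not exceeding x."""
    return sum(p for value, p in pmf.items() if value <= x)

for x in range(n + 1):
    print(x, pmf[x], cdf(x))
```

The printed cumulative values never decrease and reach 1 at the largest possible value.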
33Distribution continuous
- The distribution of a continuous random variable cannot be specified through a probability mass function because if $X$ is continuous, then $P(X = x) = 0$ for all $x$; i.e., the probability of any particular value is zero. Instead, we must look at probabilities of ranges of values.
- The probabilities of ranges of values of a continuous random variable are determined by a density function. It is denoted by $f_X$. The area under a density is always 1.
- The probability that $X$ falls between two points $a$ and $b$ is the area under $f_X$ between the points $a$ and $b$. The familiar bell-shaped normal curve is an example of a density.
- The cumulative distribution function or cdf of a continuous random variable is obtained from the density in much the same way a cpf is obtained from the pmf of a discrete distribution.
- The cdf of $X$, denoted by $F_X$, is given by $F_X(x) = P(X \le x)$.
- $F_X(x)$ is the area under the density $f_X$ to the left of $x$.
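The link between a density, the area under it, and the cdf can be sketched with the standard normal distribution. The choice of a = -1 and b = 1 and the midpoint-rule approximation are assumptions made for illustration.

```python
from statistics import NormalDist

X = NormalDist(mu=0, sigma=1)   # standard normal, chosen only as an example
a, b = -1.0, 1.0

# P(a < X < b) from the cdf: F_X(b) - F_X(a)
p_between = X.cdf(b) - X.cdf(a)

# The same probability as an area under the density (midpoint rule).
steps = 10_000
width = (b - a) / steps
area = sum(X.pdf(a + (i + 0.5) * width) for i in range(steps)) * width

print(p_between, area)
```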
34Expectation
- The expected value of a random variable is denoted by $E[X]$.
- It can be thought of as the average value attained by the random variable.
- The expected value of a random variable is also called its mean, in which case we use the notation $\mu_X$.
- The formula for the expected value of a discrete random variable is $E[X] = \sum_x x\,P_X(x)$.
- The expected value is the sum, over all possible values $x$, of $x$ times its probability $P_X(x)$.
- The expected value of a continuous random variable cannot be expressed as a sum; instead it is an integral involving the density.
- If $g$ is a function (for example, $g(x) = x^2$), then the expected value of $g(X)$ is $E[g(X)] = \sum_x g(x)\,P_X(x)$; for $g(x) = x^2$ this is $\sum_x x^2 P_X(x)$.
- The variance of a random variable $X$ is denoted by either $\mathrm{Var}[X]$ or $\sigma_X^2$.
- The variance is defined by $\sigma_X^2 = E[(X - \mu_X)^2] = E[X^2] - (E[X])^2$.
- For a discrete distribution, we can write the variance as $\sum_x (x - \mu_X)^2 P_X(x)$.
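The two variance formulas above can be compared on a small discrete distribution. The sketch below again assumes X is the number of heads in 4 fair coin tosses; all names are illustrative.

```python
from fractions import Fraction
from math import comb

# pmf of X = number of heads in 4 fair coin tosses
pmf = {x: Fraction(comb(4, x), 16) for x in range(5)}

mean = sum(x * p for x, p in pmf.items())                       # E[X]
second_moment = sum(x**2 * p for x, p in pmf.items())           # E[g(X)] with g(x) = x^2
var_definition = sum((x - mean)**2 * p for x, p in pmf.items()) # E[(X - mu)^2]
var_shortcut = second_moment - mean**2                          # E[X^2] - (E[X])^2

print(mean, second_moment, var_definition, var_shortcut)
```

Both routes give the same variance, as the identity requires.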
35Discrete random variable
- A discrete random variable $X$ has a finite number of possible values.
- The probability distribution of $X$ lists the values and their probabilities.
- The probabilities $p_i$ must satisfy:
- Every probability $p_i$ is a number between 0 and 1.
- $p_1 + p_2 + \cdots + p_k = 1$.
- Probability histogram
- Possible values of $X$ and corresponding probability.
36Commonly Used Continuous Distribution
- The Normal Distribution
- History
- Abraham de Moivre (1667-1754) first described the normal distribution in 1733.
- Adolphe Quetelet (1796-1874) used the normal distribution to describe the concept of l'homme moyen (the average man), thus popularizing the notion of the bell-shaped curve.
- Carl Friedrich Gauss (1777-1855) used the normal distribution to describe measurement errors in geography and astronomy.
37Bernoulli Processes and the Binomial Distribution
- An airline reservations switchboard receives calls for reservations, and it is found that:
- When a reservation is made, there is a good chance that the caller will actually show up for the flight. In other words, there is some probability $p$ (say for now $p = 0.9$) that the caller will show up and buy the ticket the day of departure.
- Consider a single person making a reservation. This particular reservation can either result in the person being on the flight (a success) or a no-show (a failure). Let $X$ (a random variable) represent the result of a particular reservation. That is, we assign a value of 1 to $X$ if the person shows up for the flight ($X = 1$), and let $X = 0$ if the person does not. Then, $P(X = 0) = 1 - p$ and $P(X = 1) = p$.
- The airline is not particularly interested in the decision made by any one individual, but is more concerned with the behavior of the total number of people with reservations.
- Suppose each passenger carried on the plane provides a revenue of 100 dollars for the airline and each bumped passenger (a passenger who does not find a seat due to overbooking) results in a loss of 200 dollars for the airline.
- If a plane holds 16 people, not including pilots and crew, how many reservations should be taken?
38Bernoulli process
- This is an example of a Bernoulli process, named for the Swiss mathematician James Bernoulli (1654-1705).
- A Bernoulli process is a sequence of n identical trials of a random experiment such that each trial
- (1) produces one of two possible complementary outcomes that are conventionally called success and failure, and
- (2) is independent of any other trial, so that the probability of success or failure is constant from trial to trial.
- Note that the success and failure probabilities are assumed to be constant from trial to trial, but they are not necessarily equal to each other.
- In our example, the probability of a success is 0.9 and the probability of a failure is 0.1.
- The number of successes in a Bernoulli process is a binomial random variable.
- Random Variable: A numerical value determined by the outcome of an experiment.
39Analysis
- If the airline takes 16 reservations, what is the probability that there will be at least one empty seat?
- $P(\text{at least one empty seat}) = 1 - (0.9)^{16} = 0.815$.
- An 81.5% chance of having at least one empty seat! So the airline would be foolish not to overbook.
- Suppose we take 20 reservations for a particular flight, and let $Y$ be the number of people who show up.
- $Y$ is a binomial random variable that takes on an integer value between 0 and 20.
- What is the probability function or distribution of $Y$?
- What is the probability of getting exactly 16 passengers? Answer: $P(Y = 16) = 0.08978$.
- $P(Y \le 16) = 0.133$, $P(Y = 17) = 0.190$, $P(Y = 18) = 0.285$, $P(Y = 19) = 0.270$, $P(Y = 20) = 0.122$.
- Let $B$ be the number of people bumped. The load is $L = Y - B$.
- The airline's total revenue is $R = 100L - 200B$, with expected value
- $E(R) = E(100L - 200B) = 100E(L) - 200E(B)$ = 1,182.81.
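The probabilities quoted in this analysis follow from the binomial probability function. A minimal sketch, assuming each of the reservations shows up independently with probability 0.9 as stated earlier; `binom_pmf` is a hypothetical helper name.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(Y = k) for a binomial random variable with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p = 0.9

# 16 reservations: at least one empty seat unless all 16 show up.
p_empty_seat = 1 - binom_pmf(16, 16, p)

# 20 reservations: distribution of Y, the number who show up.
pmf_20 = [binom_pmf(k, 20, p) for k in range(21)]
p_exactly_16 = pmf_20[16]
p_at_most_16 = sum(pmf_20[:17])

print(round(p_empty_seat, 3), round(p_exactly_16, 5), round(p_at_most_16, 3))
for k in range(17, 21):
    print(k, round(pmf_20[k], 3))
```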
40How many reservations?

Reservations   20      19      18      17      16
E(Load)        15.943  15.839  15.599  15.132  14.396
E(Bumps)       2.057   1.261   0.600   0.167   0.000
E(Revenue)     1,183   1,332   1,440   1,480   1,440

- In this case, the best strategy is to take 17 reservations.
- Expected Value: The expected value (or mean or expectation) of a random variable $X$ with probability function $P(X = x)$ is
- $E(X) = \sum_x x\,P(X = x)$,
- where the summation is over all $x$ that have $P(X = x) > 0$. It is sometimes denoted $\mu_X$ or $\mu$.
- Variance: The variance of a random variable $X$ with probability function $P(X = x)$ is
- $\mathrm{Var}(X) = \sum_x (x - E(X))^2 P(X = x)$,
- where the summation is over all $x$ such that $P(X = x) > 0$. It is sometimes denoted $\sigma^2(X)$ or $\sigma^2$.
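The reservation comparison can be reproduced by applying the expected-value definition to the load and the number of bumped passengers. This sketch assumes 16 seats, a show-up probability of 0.9, revenue of 100 per carried passenger, and a loss of 200 per bumped passenger, as in the airline example; the function name is hypothetical, and exact binomial values may differ slightly in the last digit from the slide's figures.

```python
from math import comb

SEATS, P_SHOW = 16, 0.9
REVENUE_PER_SEAT, LOSS_PER_BUMP = 100, 200

def expected_load_bumps_revenue(n_reservations):
    """Return (E[load], E[bumps], E[revenue]) when n_reservations are accepted."""
    exp_load = exp_bumps = 0.0
    for k in range(n_reservations + 1):
        prob = comb(n_reservations, k) * P_SHOW**k * (1 - P_SHOW)**(n_reservations - k)
        exp_load += min(k, SEATS) * prob        # passengers actually carried
        exp_bumps += max(k - SEATS, 0) * prob   # passengers left without a seat
    revenue = REVENUE_PER_SEAT * exp_load - LOSS_PER_BUMP * exp_bumps
    return exp_load, exp_bumps, revenue

for n in range(20, 15, -1):
    load, bumps, revenue = expected_load_bumps_revenue(n)
    print(n, round(load, 3), round(bumps, 3), round(revenue))

best = max(range(16, 21), key=lambda n: expected_load_bumps_revenue(n)[2])
print("best number of reservations:", best)
```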
41Inference
- Mean, Proportion, CLT
- Bootstrap
42From Probability to Statistics
- In all our probability calculations, we have assumed that we know all quantities needed to solve the problem:
- To find the expected return and standard deviation of a portfolio, we assumed we knew the mean and standard deviation of the returns of the underlying stocks.
- To find the proportion of bags below the 8-ounce minimum, we assumed we knew the mean and standard deviation of the weight of chips in each bag.
- In practice, these types of parameters are not given to us; we must estimate them from data.
- Statistical analysis usually proceeds along the following lines:
- Postulate a probability model (usually including unknown parameters) for a situation involving uncertainty, e.g., assume that a certain quantity follows a normal distribution.
- Use data to estimate the unknown parameters in the model.
- Plug the estimated parameters into the model in order to make predictions from the model.
43How do we start?
- The first step, picking a model, must be based on an understanding of the situation to be modeled.
- Which assumptions are plausible?
- Which are not?
- These questions are answered by judgment, not by precise statistical techniques.
- Examples
- Assume that daily changes in a stock price follow a normal distribution.
- Use historical data to estimate the mean and standard deviation.
- Once we have estimates, we might use the model to predict future price ranges or to value an option on the stock.
- Assume that demand for a fashion item is normally distributed.
- Use historical data to estimate the mean and standard deviation.
- Once we have estimates, we might use the model to set production levels.
44How do we get data and make inference?
- The first step in understanding the process of estimation is understanding basic properties of sampled data and sample statistics, since these are the basis of estimation.
- When we talk about sampling, it is always in the context of a fixed underlying population:
- If we look at 50 daily changes in IBM stock, we are looking at a sample of size 50 from the population of all daily changes in IBM stock.
- If the population is very large (as in these examples), we generally treat it as though it were infinite; this simplifies matters. Thus, we are primarily concerned with finite samples from infinite populations.
- A single sample from a population is a random variable. Its distribution is the population distribution, e.g.,
- The distribution of a randomly selected daily change in IBM stock is the distribution over all daily changes.
45Random Sample
- A random sample from a population is a set of randomly selected observations from that population. If $X_1, \ldots, X_n$ are a random sample, then
- they are independent
- they are identically distributed, all with the distribution of the underlying population.
- A sample statistic is any quantity calculated from a random sample. The most familiar example of a sample statistic is the sample mean $\bar{X}$, given by
- $\bar{X} = (X_1 + X_2 + \cdots + X_n)/n$
- The sample mean gives an estimate of the population mean $\mu = E[X_i]$.
46Distribution of the Sample Mean
- Every sample statistic is a random variable.
- Randomness is introduced through the sampling mechanism.
- As noted above, the sample mean of a random sample $X_1, \ldots, X_n$ is an estimate of the population mean $\mu = E[X_i]$.
- How good an estimate is it?
- How can we assess the uncertainty in the estimate?
- To answer these questions, we need to examine the sampling distribution of the sample mean, that is, the distribution of the random variable $\bar{X}$.
- Assume that the underlying population is normal with mean $\mu$ and variance $\sigma^2$.
- This means that $X_i \sim N(\mu, \sigma^2)$ for all $i$.
- The $X_i$'s are independent, since we assume we have a random sample.
- The sum of independent normal random variables is normally distributed. The usual rules for means and variances apply:
- The expected value of the sum is the sum of the expected values.
- The variance of the sum is the sum of the variances (by independence).
- Any linear transformation of a normal random variable is normal; in particular, multiplication by a constant preserves normality.
47Distribution of the Sample Mean
- Using these two facts, we find that if $X_i \sim N(\mu, \sigma^2)$ for all $i$, then
- $X_1 + X_2 + \cdots + X_n \sim N(n\mu, n\sigma^2)$, and hence $\bar{X} \sim N(\mu, \sigma^2/n)$.
- The sample mean from a normal population has a normal distribution.
- First consequence
- The expected value of the sample mean is the population mean: "on average" the sample mean correctly estimates the underlying mean.
- The standard deviation of a sample statistic is called its standard error. Thus, we have shown that the standard error of the sample mean is $\sigma/\sqrt{n}$, where $\sigma$ is the underlying standard deviation and $n$ is the sample size.
- Second consequence
- Because the standard error of the sample mean is $\sigma/\sqrt{n}$, the uncertainty in this estimate decreases as the sample size $n$ increases. (That's good.)
- The uncertainty (as measured by the standard deviation) decreases rather slowly: to cut the standard deviation in half, we need to collect four times as much data, because of the square root. (That's not so good, but that's life.)
48Example
- Suppose the number of miles driven each week by US car owners is normally distributed with a standard deviation of $\sigma = 75$ miles.
- Suppose we plan to estimate the population mean number of miles driven per week by US car owners using a random sample of size $n = 100$.
- What is the probability that our estimate will differ from the true value by more than 10 miles?
- Denote the population mean by $\mu$ and the sample mean by $\bar{X}$.
- We need to find $P(|\bar{X} - \mu| > 10)$.
- The standard error is $\sigma/\sqrt{n} = 75/\sqrt{100} = 7.5$, so by symmetry of the normal distribution this is $2P(Z > 10/7.5) \approx 2P(Z > 1.33) = 2(0.0918) = 0.1836$, where $Z$ is a standard normal random variable. Thus, the probability that our estimate will be off by more than 10 miles is 18.36%.
- If the underlying population is not normal, what can be done?
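The miles-driven calculation can be checked numerically. This is a minimal sketch using `statistics.NormalDist` for the standard normal cdf; it uses the unrounded z-value, so it gives roughly 18.2% rather than the 18.36 figure quoted above, which appears to correspond to a table lookup at z rounded to 1.33 (an inference, not something the slides state).

```python
from math import sqrt
from statistics import NormalDist

sigma, n, tolerance = 75, 100, 10

standard_error = sigma / sqrt(n)     # sigma / sqrt(n) = 7.5
z = tolerance / standard_error       # 1.333...

# Two-sided tail probability: P(|Xbar - mu| > 10) = 2 * P(Z > z)
p_off = 2 * (1 - NormalDist().cdf(z))

# Same calculation with z rounded to two decimals, as a printed table would force.
p_off_rounded_z = 2 * (1 - NormalDist().cdf(round(z, 2)))

print(standard_error, round(p_off, 4), round(p_off_rounded_z, 4))
```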
49Central Limit Theorem
- By the central limit theorem, regardless of the underlying population, the distribution of the sample mean tends towards $N(\mu, \sigma^2/n)$ as $n$ becomes large.
- If we accept the use of this approximation, we don't need to assume that the number of miles driven per week in the example is normally distributed (as long as our sample size $n$ is large).
- This approximation is used repeatedly to assess the error in $\bar{X}$ as an estimate of $\mu$.
- How large should $n$ be for the normal approximation to be accurate?
- There is no simple answer (it depends on the underlying distribution), but $n \ge 30$ is a reasonable rule of thumb.
- If the underlying population is finite of size $N$, and if the sample size $n$ is not a small proportion of $N$, we use the following correction (the finite population correction) to the standard error:
- $\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$
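The central limit theorem can be explored by simulation. The sketch below assumes, purely for illustration, an exponential population with mean 1 and standard deviation 1 (a clearly non-normal, skewed distribution), a sample size of n = 30, and a fixed seed.

```python
import random
import statistics
from math import sqrt

rng = random.Random(42)
n, n_samples = 30, 5000
mu, sigma = 1.0, 1.0                 # mean and sd of the Exponential(1) population

def sample_mean(size):
    """Mean of one random sample of the given size from the exponential population."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(size)])

means = [sample_mean(n) for _ in range(n_samples)]

standard_error = sigma / sqrt(n)     # theoretical standard error of the sample mean
mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)

# If the normal approximation is good, about 95% of sample means fall within 1.96 SE of mu.
coverage = sum(abs(m - mu) <= 1.96 * standard_error for m in means) / n_samples

print(round(mean_of_means, 3), round(sd_of_means, 3), round(standard_error, 3), round(coverage, 3))
```

Rerunning with a smaller n (say 5) shows a noticeably worse approximation for this skewed population.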
50Sampling Distribution of the Sample Proportion
- Consider estimating any of the following quantities:
- Proportion of voters who will vote for a third-party candidate in the next election.
- Proportion of visits to a web site that result in a sale.
- Proportion of shoppers who prefer crunchy over creamy.
- In each of these examples, we are trying to estimate a population proportion. Denote a generic population proportion by the symbol $p$.
- Estimate a population proportion using a sample proportion.
- For example, if a poll surveys 1000 voters and finds that 85 of those surveyed plan to vote for a third-party candidate, then the sample proportion is 8.5%.
- The population proportion is what the poll would find if it could ask every voter in the population.
- Denote the sample proportion by the symbol $\hat{p}$.
- Once we have collected a random sample, the sample proportion is known. We use it to estimate the true, unknown population proportion $p$.
51EXAMPLE
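- Suppose that the true, unknown proportion $p$ of voters who will vote for a third-party candidate in the next election is 9%.
- What is the probability that a poll of 1000 voters will find a sample proportion that differs from the true proportion by more than 2 percentage points?
- We need to find $P(|\hat{p} - p| > 0.02)$.
- The standard error of $\hat{p}$ is $\sqrt{p(1 - p)/n} = \sqrt{(0.09)(0.91)/1000} \approx 0.00905$, so by the normal approximation this probability is about $2P(Z > 0.02/0.00905) \approx 2P(Z > 2.21) \approx 0.027$.
- We conclude that the probability that the poll will be off by more than two percentage points is about 0.027.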
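The polling example can be reproduced with the normal approximation to the sampling distribution of the sample proportion. A minimal sketch, with illustrative variable names:

```python
from math import sqrt
from statistics import NormalDist

p, n, margin = 0.09, 1000, 0.02

standard_error = sqrt(p * (1 - p) / n)   # standard error of the sample proportion
z = margin / standard_error              # about 2.21

# P(|p_hat - p| > 0.02) under the normal approximation
p_off = 2 * (1 - NormalDist().cdf(z))

print(round(standard_error, 5), round(z, 2), round(p_off, 3))
```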
52Bootstrap
- As a general term, bootstrapping describes any operation which allows a system to generate itself from its own small well-defined subsets (e.g. compilers, software to read tapes written in computer-independent form).
- The word is borrowed from the saying "pull yourself up by your own bootstraps."
- In statistics, the bootstrap is a method allowing one to judge the uncertainty of estimators obtained from small samples, without prior assumptions about the underlying probability distributions.
- The method consists of forming many new samples of the same size as the observed sample, by drawing a random selection of the original observations with replacement, so that some of the observations usually appear several times.
- The estimator under study (e.g. a mean, a correlation coefficient) is then formed for every one of the samples thus generated, and will show a probability distribution of its own.
- From this distribution, confidence limits can be given.
- For details, see B. Efron (Computers and the Theory of Statistics, SIAM Rev. 21 (1979) 460) or Efron (The Jackknife, the Bootstrap and Other Resampling Plans, SIAM, Bristol, 1982).
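The resampling procedure just described can be sketched on the Central Mall sample from the Sampling slide. The number of resamples (5000), the fixed seed, and the use of a simple percentile interval are assumptions made for illustration, not choices taken from the slides.

```python
import random
import statistics

# Amounts spent by the 12 sampled shoppers (from the Sampling slide).
spent = [132, 334, 33, 10, 123, 5, 6, 14, 449, 133, 44, 1]

def bootstrap_means(data, n_resamples=5000, seed=1):
    """Sample means of resamples drawn with replacement, each the size of the data."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]

means = sorted(bootstrap_means(spent))

boot_se = statistics.stdev(means)               # bootstrap estimate of the standard error
lower = means[int(0.025 * len(means))]          # 2.5th percentile
upper = means[int(0.975 * len(means)) - 1]      # 97.5th percentile

print(round(boot_se, 1), round(lower, 1), round(upper, 1))
```

The spread of the resampled means plays the role of the sampling distribution, so the percentiles serve as approximate 95% confidence limits for the mean amount spent.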
53Jackknife
- The jackknife is a method in statistics allowing one to judge the uncertainties of estimators derived from small samples, without assumptions about the underlying probability distributions.
- The method consists of forming new samples by omitting, in turn, one of the observations of the original sample.
- For each of the samples thus generated, the estimator under study can be calculated, and the probability distribution thus obtained will allow one to draw conclusions about the estimator's sensitivity to individual observations.
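A leave-one-out sketch of the jackknife, again using the Central Mall sample from the Sampling slide as illustrative data. The standard-error formula used here is the usual jackknife one, sqrt((n - 1)/n * sum of squared deviations of the leave-one-out estimates); for the sample mean it reduces to s/sqrt(n).

```python
import statistics
from math import sqrt

# Amounts spent by the 12 sampled shoppers (from the Sampling slide).
spent = [132, 334, 33, 10, 123, 5, 6, 14, 449, 133, 44, 1]
n = len(spent)

# Recompute the estimator (here the mean) with each observation omitted in turn.
loo_means = [statistics.fmean(spent[:i] + spent[i + 1:]) for i in range(n)]

jack_center = statistics.fmean(loo_means)
jack_se = sqrt((n - 1) / n * sum((m - jack_center) ** 2 for m in loo_means))

# Sensitivity to individual observations: which omission moves the mean the most?
full_mean = statistics.fmean(spent)
most_influential = max(range(n), key=lambda i: abs(loo_means[i] - full_mean))

print(round(jack_se, 2), spent[most_influential], round(loo_means[most_influential], 2))
```

Dropping the largest purchase changes the mean far more than dropping any other observation, which is the kind of sensitivity the jackknife is meant to reveal.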