Title: Topic VI: Sampling Theory
1 Topic VI: Sampling Theory and Sampling Methods
- Concepts and Definitions
- Sampling With and Without Replacement
- Probability and Non-Probability Sampling
- Types of Sampling Methods
- Determining Sample Size
2 Concepts and Definitions
- Population
- Sampling Frame
- Unit of Analysis
- Sampling Units
- Principal Information
- Auxiliary Information
- Sampling Error
- Non-Sampling Error
- Sampling Fraction
- Bias
3 Population
- This is a collection of all the units of a specified type defined over a given space or time
- It is defined by:
- Content: this refers to who or what exactly are the subjects of interest. Eg. all persons aged 18 and over
- Units: this refers to how the subjects are grouped. Eg. within households
- Extent: this refers to the spatial feature of the population. Eg. the subjects can only be living in Jamaica
- Time: this refers to the period of time during which your subjects must possess the particulars named above. Eg. June to October 1998
4 Sampling Frame
- This is a list of all the units in the target population from which the sample is to be chosen
- A subset of subjects for a survey should only be taken from a sampling frame
5 Sampling Frames II
- When conducting a national survey there are two types of sampling frames that can be used (note there are others)
- Electoral registers: these list electors in each polling district by street, with streets in alphabetical order
- Postcode Sectors: used as the primary sampling unit
6 Sampling Frames III
- Postal Sectors are determined from a Postcode Address File
- Postal Areas (eg. CT): 121
- Postal Districts (eg. CT2): 2,700
- Postal Sectors (eg. CT2 7): 8,900
- Postcodes (eg. CT2 7PE): 1.6 million
- Delivery Points: 26 million
7 Sampling Frames IV
8 Sampling Frames - Notes
Electoral Register
- not all adults are electors (excludes felons, non EU or Commonwealth, ...)
- limited corrections
- includes institutions (colleges)
- coverage high but incomplete
- compiled annually
- compiled locally with a variety of software
- in force until 18 months from data collection
Post Code Sectors
- does not give names
- no indication of household size
- multi occupancy indicator available
- will contain some small business addresses
- updated quarterly
- national system
- computerised format, so lower selection cost
9 Unit of Analysis
- Sometimes referred to as Sampling Units
- These are the items/units being investigated
- Eg. individuals, households, hospitals
10 Sampling Units
- This refers to the items/units selected for inclusion in the sample
- Eg. If John Brown was selected to be included in the sample then he is a sampling unit
11 Principal Information
- This refers to information on the central variable of the study
- Also known as principal variable or principal data
- Eg. For a household budget survey, the principal variable would be considered to be expenditure on food
12 Auxiliary Information
- This refers to any information other than the principal data
- In the example on the previous slide
13 Sampling Error I
- This is a measure of the departure of all the possible estimates of a probability sampling procedure from the population quantity being measured
- In other words, it refers to the difference between the estimate derived from a sample survey and the 'true' value that would result if the whole population was studied (under the same conditions)
- It is sometimes loosely called the bias, although bias strictly refers to a systematic departure
14 Sampling Error II
- The standard error, variance and coefficient of variation (C.V.) are all measures of the sampling error
- Note: A census does not have a sampling error
- The sampling error is therefore equal to the difference between the Population Parameter and the Sample Statistic
15 Sampling Error III - Characteristics
- The sampling error:
- generally decreases as the sample size increases (but not proportionally)
- depends on the size of the population under study
- depends on the variability of the characteristic of interest in the population
- can be accounted for and reduced by an appropriate sample plan
- can be measured and controlled in probability sample surveys
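The first three characteristics can be illustrated with the textbook standard error of a sample mean under simple random sampling. This is a minimal sketch, not taken from the slides: the function name `standard_error`, the assumption that the population standard deviation is known, and all the numbers are illustrative.

```python
import math

def standard_error(sigma, n, N=None):
    """Standard error of the sample mean under simple random sampling.

    sigma: population standard deviation (assumed known for this sketch)
    n: sample size
    N: population size; None means the population is treated as infinite
    """
    se = sigma / math.sqrt(n)
    if N is not None:
        # Finite population correction for sampling without replacement.
        se *= math.sqrt((N - n) / (N - 1))
    return se

# Hypothetical values: quadrupling n only halves the standard error.
se_100 = standard_error(sigma=10.0, n=100)
se_400 = standard_error(sigma=10.0, n=400)

# When the whole population is "sampled" (a census) the error vanishes.
se_census = standard_error(sigma=10.0, n=5000, N=5000)
```

Here `se_100` is twice `se_400` even though the sample is four times larger, which is the sense in which the error falls "but not proportionally".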
16 Sampling Error IV
- Although the Margin of Error and Sampling Error are sometimes used interchangeably they are TWO different concepts
- Sampling Error
- Margin of Error
17 Non-Sampling Errors
- These are errors resulting from some imperfection in the research design that causes response error, or from a mistake in the execution of the research
- Examples
- sample bias
- errors in recording responses
- nonresponses
- Also known as Systematic Errors
18 Sampling Fraction
- This is the size of the sample as a proportion of the population from which it was drawn
- It is equal to n/N
- If n/N > 1 then there is sampling with replacement
19 Bias I
- This means that results based on the sample do not (even on average) reflect the same answers as would come from a census
- Biases are caused by both sampling and non-sampling factors
20 Bias II
21 Bias III
This is an adaptation of the diagram in Kish, pg. 519
22 Sampling With and Without Replacement
- Sampling with Replacement
- Occurs when a unit sampled is placed back into the population
- A particular unit can be included more than once in the sample
- It is possible that n > N
- Sampling without Replacement
- Occurs when a unit sampled is not placed back into the population
- A particular unit can only occur ONCE in the sample
- Sampling without replacement from an infinite (or very large) population is, for practical purposes, equivalent to sampling with replacement from a small population
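The difference between the two schemes can be sketched with Python's standard `random` module. The population of ten unit labels, the seed and the sample sizes below are illustrative assumptions, not values from the slides.

```python
import random

rng = random.Random(42)           # fixed seed so the sketch is repeatable
population = list(range(1, 11))   # N = 10 hypothetical unit labels

# Without replacement: each unit can appear at most once, so n <= N.
without = rng.sample(population, k=5)

# With replacement: each unit is "put back" after being drawn,
# so repeats are possible and n may exceed N.
with_repl = rng.choices(population, k=15)

# Sampling fraction n/N for the with-replacement draw: 15 / 10 = 1.5.
sampling_fraction = len(with_repl) / len(population)
```

Because 15 units are drawn from only 10, the with-replacement sample must contain repeats, and its sampling fraction exceeds 1.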
23 Probability and Non-Probability Sampling I
- Probability Samples
- Aka Random Samples (though sample units are not chosen haphazardly)
- The probabilities for selecting different samples are specified
- For each unit of the population the probability of it appearing in any sample is known
- It provides an estimate for the unknown population quantity
- It also allows for the assessment of the standard error, which can be used to obtain confidence intervals
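As a rough illustration of the last point, the sketch below turns a sample into an estimate, a standard error and an approximate 95% confidence interval. It assumes simple random sampling and a normal approximation with z = 1.96 (for a sample this small a t-value would normally be preferred); the helper name `mean_confidence_interval` and the data are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Return (estimate, standard error, (lower, upper)) for the sample mean."""
    n = len(sample)
    estimate = statistics.mean(sample)
    # Estimated standard error of the mean: s / sqrt(n).
    se = statistics.stdev(sample) / math.sqrt(n)
    return estimate, se, (estimate - z * se, estimate + z * se)

sample = [12, 15, 11, 14, 13, 16, 12, 15]  # hypothetical observations
estimate, se, (lower, upper) = mean_confidence_interval(sample)
```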
24
- There are 3 main steps involved in choosing a probability sample
- Decide on the population of interest
- Establish a sampling frame
- Select units from the frame using a probabilistic algorithm
25 Probability and Non-Probability Sampling II
- Non-Probability Sampling
- This involves the selection of units by arbitrary methods
- The probability of selection for each unit is unknown
- It is dangerous to make inferences about the target population
- It is often used to test aspects of a survey such as questionnaire design, processing systems etc. rather than to make inferences about the target population
26 Probability and Non-Probability Sampling III
- Choosing between the two types depends on
- the objectives and scope of the survey
- the method of data collection suitable to those objectives
- the precision required of the results and whether that precision needs to be measurable
- the availability of a sampling frame
- the resources required to maintain the frame
- the availability of extra information about the units in the population
27 Sampling Methods
Probability Sampling
- Simple Random
- Stratified
- Systematic
- Cluster
- Multi-stage
Non-Probability Sampling
- Purposive
- Quota
- Snowball
- Convenience
28 Simple Random Sampling I
- Each member of the population has the same probability of being a part of the sample, independent of whether another subject is in the sample
- AKA equal probability selection method
- It is the simplest sampling method
29 Simple Random Sampling II
- n units are selected from N possible units in the population
- Every combination of n units is equally likely to be the sample selected
- The selection process can either be
- sampling without replacement, which is more common
- unrestricted sampling / sampling with replacement
30 Simple Random Sampling III
- It is important because it possesses simple mathematical properties which are useful for statistical theory, and the computations are relatively easy
- All other probability sampling methods are restrictions of SRS (usually where some combinations of population elements are suppressed)
- For the mathematical properties to hold we must assume an infinite population
31 Simple Random Sampling IV
- There are three main ways of choosing a simple random sample
- Table of Random Numbers
- Lottery Method
- Computer Generated Numbers
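The computer-generated option can be sketched with the standard library's pseudo-random generator. The frame of 100 labelled units, the helper name `simple_random_sample` and the seed are assumptions made for illustration only.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units from the frame without replacement.

    random.Random.sample gives every subset of size n the same chance
    of selection, which is the defining property of SRS.
    """
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame of N = 100 labelled units.
frame = [f"unit-{i:03d}" for i in range(1, 101)]
sample = simple_random_sample(frame, n=10, seed=1)
```

Fixing the seed makes the draw reproducible, which is useful when a selection has to be documented or audited.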
32 Stratified Random Sampling I
- A population is subdivided or partitioned
- Each subdivision is called a stratum
- All the subdivisions are the strata
- The idea is to ensure that the observations of
the units of a stratum are closer to each other
than to units of another stratum
33 Stratified Random Sampling II
- SRS does not produce good results in cases where the population to be sampled contains easily recognisable subpopulations or strata
- Strata do not overlap and any member of the population can belong to only ONE stratum
34 Stratified Random Sampling III
35 Stratified Random Sampling IV
- Examples
- Household income or expenditure surveys: urban / rural
- Business surveys: employee size, production, sales, industrial classification
- Agricultural surveys
- Stratification depends on purpose of survey
36 Stratified Random Sampling V
- Why Stratify?
- In situations where there is foreknowledge of some non-homogeneity in the population, proportional stratified sampling ensures a representative sample across the non-homogeneous population
- Used for administrative or scientific reasons, where each stratum needs to be reported separately
- Eg. crop yield in each agro-climatic stratum, where the results for each stratum will have their own meaning
37 Stratified Random Sampling VI
- Why Stratify? Contd
- Stratification has the advantage of administrative convenience
- May be due to practical constraints of access to the population or cost
- It may be easier to have each province/parish conduct the survey
- Survey problems may be different in different strata
- Eg. A financial survey of businesses may be done differently for small companies, which are not required to pay a certain tax, than for larger companies, which are
38 Stratified Random Sampling VII
- Why Stratify? Contd
- Each stratum is more homogeneous than the population taken as a whole
- A stratified sample would provide relatively precise estimates within each stratum
- It would yield more precise population estimates than if simple random sampling was used
39 Stratification - The Procedure I
- The population is divided into H strata
- The strata do not overlap and are exhaustive
- A SRS of size n_h is taken from each stratum with population N_h
40 Stratification - The Procedure II
- There are three types of Allocation
- Equal Allocation
- Used when the main interest is to compare strata parameters
- and/or the population is thought to have a homogeneous variance within each stratum (ie the variances are similar)
- The same number of elements is taken from each stratum
41 Stratification - The Procedure III
- Proportional Allocation
- Used in cases where the sample is supposed to reflect the population with respect to the stratification variable
- The number of units sampled within a given stratum is proportional to the size of the stratum
- It is best to use proportional allocation in situations where the variances of each stratum are approximately equal
42 Stratification - The Procedure IV
- Optimal Allocation
- Used in cases where the variances for the strata differ greatly
- Also used when
- primary interests are the estimates for the entire population
- it is assumed that the variance differs between strata
- It produces estimates for the population mean or total with the lowest variance for a fixed total sample size, n
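The three allocation rules can be compared side by side. This is a sketch under stated assumptions: "optimal" is implemented here as Neyman allocation (n_h proportional to N_h times the stratum standard deviation, with equal cost per unit in every stratum), and the stratum sizes and standard deviations are invented for illustration.

```python
def equal_allocation(n, sizes):
    """Same sample size in every stratum."""
    return {h: n / len(sizes) for h in sizes}

def proportional_allocation(n, sizes):
    """n_h proportional to the stratum size N_h."""
    N = sum(sizes.values())
    return {h: n * N_h / N for h, N_h in sizes.items()}

def optimal_allocation(n, sizes, sds):
    """Neyman allocation: n_h proportional to N_h * S_h (equal unit costs assumed)."""
    total = sum(sizes[h] * sds[h] for h in sizes)
    return {h: n * sizes[h] * sds[h] / total for h in sizes}

# Hypothetical strata: sizes N_h and within-stratum standard deviations S_h.
sizes = {"urban": 6000, "rural": 4000}
sds = {"urban": 20.0, "rural": 5.0}

equal = equal_allocation(100, sizes)               # 50 and 50
proportional = proportional_allocation(100, sizes) # 60 and 40
optimal = optimal_allocation(100, sizes, sds)      # about 85.7 and 14.3
```

The results are fractional in general and would be rounded in practice. Note how the optimal rule shifts the sample toward the more variable stratum.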
43 Stratification - Advantages
- Estimates for each stratum can be evaluated separately
- Differences among the strata can be evaluated
- Totals, means and proportions can be estimated with high precision using appropriate weights
- Savings in time and cost (convenience)
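The "appropriate weights" for a population mean are usually the stratum shares W_h = N_h / N. The sketch below assumes an SRS within each stratum and uses invented data; `stratified_mean` is a hypothetical helper name.

```python
def stratified_mean(strata):
    """Weighted estimate of the population mean.

    strata: list of (N_h, sample_values) pairs, one per stratum.
    Each stratum sample mean is weighted by W_h = N_h / N.
    """
    N = sum(N_h for N_h, _ in strata)
    return sum((N_h / N) * (sum(values) / len(values)) for N_h, values in strata)

# Hypothetical strata: (stratum size, sampled values).
strata = [
    (6000, [10, 12, 14]),  # stratum sample mean 12
    (4000, [20, 22, 24]),  # stratum sample mean 22
]
estimate = stratified_mean(strata)  # 0.6 * 12 + 0.4 * 22
```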
44 Stratification - Disadvantages
- The proportion of the total population that belongs to each stratum needs to be known
- It may be complex and time consuming
45 Systematic Random Sampling I
- Most widely known method of selection
- Simple to apply
- Consists of taking every kth sampling unit after a random start
- AKA Pseudo-Random selection
- Often used jointly with stratification and with cluster sampling
46 Systematic Random Sampling II
- The first element is based on random selection but subsequent elements are not
- Procedure
- The population is divided into k groups of size n = N/k each
- One unit is chosen randomly from the first k units
- Every kth unit following is included in the sample
- It is possible that N/k is not an integer
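The procedure above can be sketched as a list slice with a random start. The frame, the interval k = 10 and the seed are illustrative assumptions; `systematic_sample` is a hypothetical helper name.

```python
import random

def systematic_sample(frame, k, seed=None):
    """Take every k-th unit after a random start among the first k units."""
    rng = random.Random(seed)
    start = rng.randrange(k)   # random start: an index from 0 to k - 1
    return frame[start::k]     # then every k-th unit after it

# Hypothetical ordered frame of N = 500 units with interval k = 10.
frame = list(range(1, 501))
sample = systematic_sample(frame, k=10, seed=7)

# When N/k is not an integer the sample size depends on the random start.
uneven = systematic_sample(list(range(503)), k=10, seed=0)
```

With N = 500 and k = 10 the sample always has 50 units. With N = 503 it has 50 or 51, depending on where the random start falls.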
47 Systematic Random Sampling III
- Examples
- Agricultural Survey: selection of every 10th farm from 500 farms in an area (would produce 50 farms)
- Industrial Quality Control: every 30 minutes or every 10th batch
- Marketing or Political Surveys: every 10th person passing a particular location
- Surveys to supplement censuses
- Large multistage surveys: samples are selected systematically at the different stages
48 Systematic Random Sampling - Advantages
- Operationally convenient
- Flexible
- Convenient to use when the sampling frame is not available
- It is spread out more evenly over the population so that it is more likely to produce a more representative sample
49 Systematic Random Sampling - Disadvantages
- More precise than SRS when units within the sample are heterogeneous, but less precise when the units are homogeneous
- Generally, it is not possible to obtain a suitable estimate of the variance of the estimator from one sample; only an approximate variance can be calculated
50 Cluster Sampling I
- Cluster sampling divides the population into groups, or clusters
- A number of clusters are selected randomly to represent the population, and then all units within selected clusters are included in the sample
- No units from non-selected clusters are included in the sample. They are represented by those from selected clusters
- This differs from stratified sampling, where some units are selected from each group
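A one-stage cluster sample can be sketched as follows. The school clusters, the helper name `cluster_sample` and the choice of an equal-probability draw of clusters are assumptions made for illustration.

```python
import random

def cluster_sample(clusters, m, seed=None):
    """Select m clusters at random and include every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)             # random draw of clusters
    units = [u for c in chosen for u in clusters[c]]     # take ALL their units
    return chosen, units

# Hypothetical population grouped into four clusters.
clusters = {
    "school-A": ["a1", "a2", "a3"],
    "school-B": ["b1", "b2"],
    "school-C": ["c1", "c2", "c3", "c4"],
    "school-D": ["d1", "d2"],
}
chosen, units = cluster_sample(clusters, m=2, seed=3)
```

Units in the two clusters that were not drawn never enter the sample, in contrast with a stratified design, which would take some units from all four groups.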
51 Cluster Sampling II
- The unit of selection contains more than one population element
- Examples of possible clusters
52 Cluster Sampling III
- Advantages
- The cost per element is lower due to the lower cost of listing or of location. Cost is also lower because sampling is done within clusters
- Since all elements are in one cluster, there is the convenience of reaching each member
- Disadvantages
- Combining the variance from two separately homogeneous clusters may cause the variance of the entire sample to be higher when compared with SRS
- Less accurate results are often obtained due to higher sampling error than for simple random sampling with the same sample size
53 Multi-Stage Sampling I
- Multi-stage sampling is like cluster sampling
- It involves selecting a sample within each chosen cluster, rather than including all units in the cluster
- It is sometimes referred to as sub-sampling
54 Multi-Stage Sampling II
- Multi-stage sampling involves selecting a sample in at least two stages
- 1st stage: large groups or clusters are selected
- These clusters are designed to contain more population units than are required for the final sample
- 2nd stage: population units are chosen from selected clusters to derive a final sample
- This is called TWO-STAGE SAMPLING
- If more than two stages are used, the process of choosing population units within clusters continues until the final sample is achieved; this would be considered MULTI-STAGE SAMPLING
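A two-stage design can be sketched by drawing clusters first and then sub-sampling within each. The districts, household labels and helper name `two_stage_sample` are hypothetical, and simple random sampling is assumed at both stages (the stages need not use the same method).

```python
import random

def two_stage_sample(clusters, m, n_per_cluster, seed=None):
    """Stage 1: draw m clusters. Stage 2: draw units within each chosen cluster."""
    rng = random.Random(seed)
    stage1 = rng.sample(sorted(clusters), m)
    stage2 = {}
    for c in stage1:
        units = clusters[c]
        take = min(n_per_cluster, len(units))  # cannot take more than exist
        stage2[c] = rng.sample(units, take)
    return stage2

# Hypothetical frame: 6 districts, each with 20 households.
districts = {
    f"district-{d}": [f"hh-{d}-{i}" for i in range(1, 21)]
    for d in range(1, 7)
}
sample = two_stage_sample(districts, m=3, n_per_cluster=5, seed=11)
```

Only the three selected districts need a household listing, which is where the cost saving of multi-stage designs comes from.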
55 Multi-Stage Sampling - Advantages
- Useful when there is no sampling frame
- If sub-units within a selected unit give similar results, it is uneconomical to measure all the second stage units
- Lists are prepared for a small portion of the total population of second stage units, so it is considered economical
- No need for sampling procedures at each stage to be the same
56 Multi-Stage Sampling - Disadvantages
- The sampling of compact clusters may present
practical difficulties
57 Summary
58 Purposive/Judgemental Sampling
- The sample is hand-picked
- The researcher exercises deliberate subjective choice in drawing what he/she regards as a representative sample
- Often used for case study research
- It may also be used to eliminate anticipated sources of distortion
59 Quota Sampling
- Participants are selected from certain subgroups in the population
- In most cases, participants are chosen just before the interview begins, although the aim is to be as random as possible
- Usually used in market surveys and opinion polls
- A proper statistical design is used to determine what numbers are needed for each subgroup
60 Snowball
- Members of the sample name other persons who can be (and usually are) included in the sample
- Used mainly for populations which do not have a proper or adequate sampling frame
- Researcher identifies a few key participants who then identify other relevant participants
61 Convenience
- Participants are selected because they are readily available
- Considered to be the most unreliable method of sampling
62 Sampling Rare Populations I
- Problems arise if there is no relevant, accurate sampling frame for a rare group
- Special methods need to be used to estimate
- Prevalence / incidence of occurrence
- Characteristics of the population
- Population Means, Totals etc
- Examples
- Medical Conditions
- Social Conditions
63 Sampling Rare Populations II
- There are 6 methods used to sample rare populations
- Screening
- Disproportionate Sampling
- Multiplicity Sampling
- Snowballing
- Multiple Frames
- Sequential Sampling
64 Screening I
- This involves double or two phase sampling
- Procedure
- Survey the general population
- Identify potential members of the group
- Detailed survey of the potential members
65 Screening II
- Problems
- High cost
- Should apparent non-members also be sampled?
- High costs can be cut by
- Telephone interviews (which can be unrepresentative)
- Postal questionnaires (which can have low response)
- Sharing costs with other surveys
- Sampling more intensively in clusters with a relatively high concentration of the rare group (eg sample a cluster only if the 1st selected element is a member of the rare population)
66 Disproportionate Sampling
- Sampling more intensively in clusters with a relatively high concentration of the rare group
- Eg. sample a cluster only if the 1st selected element is a member of the rare population
- Gains are only high when the stratum to be oversampled does have a high prevalence relative to other strata
- Optimal allocation theory can be used to determine sampling fractions in each stratum
67 Multiplicity
- Sample all close neighbours and / or relatives of selected sample members
- May use proxy information to estimate prevalence
68 Snowballing
- Creation of a sampling frame through other
relevant contacts as suggested by group members
69 Multiple Frame
- This involves using many sampling frames
- Overlaps are dealt with by
- merging the files
- cleaning the data file
- use of weights related to the probability of
selection
70 Sequential Sampling
- Continue sampling until a large enough sample of
the rare population is achieved
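The stopping rule can be sketched as a loop that screens units in random order until the target number of rare-population members is reached. The frame, the `is_rare` test (a stand-in for a screening question) and the target of 5 are all illustrative assumptions.

```python
import random

def sequential_sample(frame, is_rare, target, seed=None):
    """Screen units in random order until `target` rare members are found.

    Returns the rare members found and the number of units screened.
    If the frame is exhausted first, fewer than `target` are returned.
    """
    rng = random.Random(seed)
    order = list(frame)
    rng.shuffle(order)
    rare, screened = [], 0
    for unit in order:
        screened += 1
        if is_rare(unit):
            rare.append(unit)
            if len(rare) == target:
                break
    return rare, screened

# Hypothetical frame of 1000 units in which every 50th unit is "rare" (2%).
frame = range(1, 1001)
rare, screened = sequential_sample(frame, lambda u: u % 50 == 0, target=5, seed=2)
```

The ratio of rare members found to units screened also gives a rough indication of how costly the screening is for a given prevalence.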
71 Choice of Sample Size I
- An increase in sample size leads to an increase in the precision of the sample mean as an estimator of the population mean
- An increase in sample size typically leads to an increase in sampling costs
72 Choice of Sample Size II
- The trade off between cost and precision is key
- Sample too large: waste of resources
- Sample too small: an estimator with inadequate precision
- Choose either the precision required OR the maximum cost which can be expended, and then choose the sample size
73 Choice of Sample Size III
- The main method presupposes that the population variance is known
- In practice, the population variance is usually unknown
- Usually the sample variance is used to replace the population variance, but at the planning stage there is no sample
- Solution: the variance can be estimated using one of the following methods
74 Choice of Sample Size IV
- From Pilot Studies
- If the pilot study uses SRS then its results may give some indication of the value of the population variance
- NB: A pilot study is limited to a certain part of the population, so the estimate of the variance will be biased
- From Previous Surveys
- Usually a study of a population with similar characteristics has previously been conducted
- The measure of variability from earlier surveys can be used to estimate the variance of the population currently under study
- NB: Caution must be taken in using the information
75 Choice of Sample Size V
- From a Preliminary Sample
- Most reliable approach
- May not be feasible because of administration or cost
- A preliminary SRS is taken and used to estimate the population variance
- Procedure
- A preliminary sample of size n1 is chosen and used to estimate the population variance by the sample variance s1^2
- If n1 is inadequate to produce the necessary precision, another sample of size (n - n1) is chosen, using s1^2 as the preliminary estimate of the population variance
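The two-step procedure can be sketched as below. The slides do not state the formula linking the variance to n, so this sketch assumes the common large-population form n = (z * s1 / e)^2, where e is the desired margin of error and z = 1.96 for 95% confidence. The helper name and the data are hypothetical.

```python
import math
import statistics

def additional_units_needed(preliminary, margin, z=1.96):
    """Return (n, n - n1): total sample size required and units still to draw.

    Assumes n = z^2 * s1^2 / margin^2 (large population, SRS).
    """
    n1 = len(preliminary)
    s1_sq = statistics.variance(preliminary)   # sample variance s1^2
    n = math.ceil(z ** 2 * s1_sq / margin ** 2)
    return n, max(n - n1, 0)

# Hypothetical preliminary sample of n1 = 10 observations.
preliminary = [48, 52, 50, 47, 53, 51, 49, 50, 46, 54]
n, extra = additional_units_needed(preliminary, margin=1.0)
```

For these numbers s1^2 = 60/9, so n = 26 and a further 16 units would be drawn.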
76 Choice of Sample Size VI
- From Practical Considerations of the Structure of the Population
- It might be possible to determine what kind of distribution an event may have
- The variance is estimated using the formulas for the specified distribution
- Eg. if it is assumed that a specific event follows a Poisson distribution then we can assume that the mean and the variance are equal
77 Choice of Sample Size VII - Large Populations
Table 1
Source: Parker & Rea, Designing and Conducting Research
78 Choice of Sample Size VIII - Small Populations
Table 2
Source: Parker & Rea, Designing and Conducting Research
79 Choice of Sample Size IX - Interval Variables
- Large Populations
- Small Populations
- N: population size
- n: sample size
- C: confidence interval (Z times std deviation)
- Z: Z score for level of confidence
- s: standard error for the distribution of sample means
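The formulas for the two cases did not survive on this slide. The sketch below uses a commonly quoted textbook form, offered as an assumption rather than as the slide's own formula: n = (Z * sd / C)^2 for large populations, with a finite population adjustment for small ones. Here `sd` is assumed to be the standard deviation of the variable and `c` the acceptable half-width of the interval in the variable's own units; the numbers are invented.

```python
import math

def sample_size_mean(z, sd, c, N=None):
    """Sample size for estimating a mean to within +/- c.

    z: Z score for the confidence level
    sd: assumed standard deviation of the variable
    c: acceptable half-width of the confidence interval
    N: population size; None means a large population
    """
    n0 = (z * sd / c) ** 2              # large-population form
    if N is not None:
        n0 = n0 * N / (n0 + N - 1)      # finite population adjustment
    return math.ceil(n0)

n_large = sample_size_mean(z=1.96, sd=10.0, c=2.0)          # 97
n_small = sample_size_mean(z=1.96, sd=10.0, c=2.0, N=500)   # 81
```

The small-population figure is always at most the large-population one, since a larger share of the population is being observed.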
80 Choice of Sample Size X
- If interested in BOTH proportions and intervals then choose the higher sample size
- Tables 1 and 2 should be adequate to cover sample sizes for both