Title: Chapter 1: Data Collection
1. Chapter 1 Data Collection
- 1.1 Introduction to the Practice of Statistics
- 1.2 Observational Studies, Experiments, and Simple Random Sampling
- 1.3 Other Effective Sampling Methods
- 1.4 Sources of Errors in Sampling
- 1.5 The Design of Experiments
September 3, 2008
2. Definition of Statistics
Given a question, statistics is the art and science of designing studies, collecting the data, summarizing the data, and then analyzing the data to draw conclusions. In particular, statistics involves
- collecting data
- organizing the data
- summarizing the organized data
- analyzing the summarized data
- drawing conclusions from this analysis
Section 1.1
3. Data
Data is information that is collected about a generic population (people, animals, machines, etc.). In the social sciences it is usually about people: their characteristics (height, weight, age, etc.) or attitudes (beliefs, political opinions, religion, etc.).
4. Types of Statistics
- Descriptive Statistics: This type of statistics uses graphs, tables, charts, and the calculation of various statistical measures (mean, standard deviation, etc.) to organize and summarize information about a population. This is the material in Math 127A.
- Inferential Statistics: This type of statistics consists of techniques (hypothesis testing, confidence intervals, etc.) to reach conclusions about a population based upon information obtained from a subset of the population. This is the material in Math 127B.
5. Average Yearly Temperature in Nashville
Question: Is the climate of Nashville warming?
The average temperature of Nashville is available on the National Weather Service website for 1872-2007. The average daily temperature is calculated by summing the highest and lowest hourly temperatures and then dividing by 2. The monthly average temperature is obtained by computing the average of the daily average temperatures, and the yearly average temperature is obtained by computing the average of the monthly average temperatures.
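The averaging scheme for the Nashville temperatures can be sketched in Python. This is only an illustration: the function names are hypothetical and the hourly readings are made-up values, not National Weather Service data.

```python
from statistics import mean

def daily_average(hourly_temps):
    """Average of the highest and lowest hourly temperature for one day."""
    return (max(hourly_temps) + min(hourly_temps)) / 2

def monthly_average(daily_averages):
    """Average of the daily average temperatures in one month."""
    return mean(daily_averages)

def yearly_average(monthly_averages):
    """Average of the monthly average temperatures in one year."""
    return mean(monthly_averages)

# Made-up hourly readings (degrees Fahrenheit), for illustration only.
day1 = [50, 54, 61, 66, 60]
day2 = [48, 52, 58, 62, 55]
daily = [daily_average(day1), daily_average(day2)]
print(daily)                   # [58.0, 55.0]
print(monthly_average(daily))  # 56.5
```

Note that the daily value uses only the highest and lowest readings, so it is a midrange rather than the mean of all hourly readings.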
6. Mathematica Notebook
7. The Statistical Method (QDDI)
- Question: What is the problem of interest? Identify your research objective.
- Design: How will the data be collected? From whom? About what?
- Description: Give the characteristics of the data. This is where mathematics can play a major role. Summarize the data. Give a graphical description of the data. (Descriptive Statistics)
- Inference: What does the data tell us? If you started with a hypothesis, does the data confirm this hypothesis? (Inferential Statistics)
8. Example
Harvard Medical School studied 22,000 male physicians to determine if taking aspirin could prevent heart attacks. The physicians were split into two equal groups: 11,000 would receive an aspirin per day and the other 11,000 would receive a placebo. The assignment of physicians was done randomly. During the course of the study, 0.9% of the male physicians who were taking aspirin had a heart attack, while 1.7% of those taking the placebo experienced a heart attack. The researchers then used the statistical method to predict that if all male physicians could have participated in the study, the percentage having a heart attack would have been lower for those taking aspirin.
9. QDDI
- Question: Does taking aspirin each day reduce the incidence of heart attacks in male physicians?
- Design: Take a sample with half taking aspirin and half taking a placebo. This is called an experiment.
- Description: Heart attack rate for aspirin (0.9%) versus placebo (1.7%).
- Inference: All male physicians would benefit from taking daily aspirin.
10. Terminology of Statistics
- Population: A population is the complete collection of all elements to be studied.
- Sample: Any subset or group of a population is called a sample.
- Variable: A variable is a characteristic of the individuals in the population that will be analyzed.
- Parameter: A parameter is a numerical summary of a variable for the population.
- Statistic: A statistic is a numerical summary of a variable obtained from a sample of the population.
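The difference between a parameter and a statistic can be illustrated with a small Python sketch. The population of ages here is randomly generated and purely hypothetical; only the distinction between the two summaries matters.

```python
import random
from statistics import mean

random.seed(1)

# Hypothetical population: ages of 200 students (made-up values).
population_ages = [random.randint(17, 23) for _ in range(200)]

# Parameter: numerical summary of the variable for the whole population.
parameter = mean(population_ages)

# Statistic: the same summary computed from a sample of the population.
sample_ages = random.sample(population_ages, 20)
statistic = mean(sample_ages)

print(f"parameter (population mean age): {parameter:.2f}")
print(f"statistic (sample mean age):     {statistic:.2f}")
```

The statistic will usually be close to the parameter but not equal to it, and it changes whenever a different sample is drawn.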
11. Types of Data
- Quantitative data is composed of measurements (numbers) about the population.
- Categorical (or qualitative) data is data that can be separated into categories and can be identified by some non-numeric characteristic.
- Continuous data is quantitative data that can take any value within a range.
- Discrete data is quantitative data that is not continuous (for example, counts).
12. Example
- Population: All of the students in Math 127A who are in WH 103 today.
- Sample: The students in Row 10 of the classroom.
- Variables
  - Color of eyes
  - Month of birth
  - Home state
  - Age
  - Religion
13. Example (continued)
- Data (Qualitative/Quantitative)
  - Blue eyes
  - October
  - Georgia
  - 18
  - Lutheran
- Parameters
  - The average age.
  - The standard deviation of heights.
- Statistics
  - The average age of students in Row 5.
  - The fraction of students with blue eyes in Row 9.
14. Data for Statistical Studies
- Census: A census is a list of all individuals in a population along with certain characteristics of each individual in the population (e.g., age, race, home ownership, etc.).
- Observational Study: An observational study attempts to measure a characteristic of the population by examining a sample, but does not manipulate the sample. An observational study often uses a sample survey to collect data.
- Experimental Study: An experiment selects a sample of the population and manipulates one or more variables of the population. The variable that is manipulated is called an independent variable and the variable that is affected is called a dependent variable.
Section 1.2
15. Census Website
http://www.census.gov
16. Observational Study
- Observational Study: An observational study measures the characteristics of a population by studying a sample of individuals. It attempts to find connections between these characteristics without manipulation of the sample. The study is passive or ex post facto.
17. Design of Observational Studies
18. Example of Sample Survey
- Sample Survey: A random sample of 10,000 people in which each individual is interviewed to determine information about the following variables of the population:
  - age
  - race
  - gender
  - number of children
  - income bracket (0-25K, 25K-50K, ...)
  - wealth bracket
  - homeowner
- Question: Is there a relationship between homeownership and number of children?
19. Algorithm for Setting Up a Sample Survey
- Step 1: Identify the population from which the sample is to be drawn.
- Step 2: Compile a list of subjects in the population from which the sample will be taken. This is called the sampling frame.
- Step 3: Specify a method for selecting subjects from the sampling frame. This is called the sampling design.
- Step 4: Collect the data.
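The four steps can be sketched in Python as follows. The household IDs, the sample size, and the helper name draw_sample are assumptions made for illustration, and the sampling design shown (a simple random sample without replacement) is just one possible choice.

```python
import random

# Step 1: the population (hypothetical: 10,000 households identified by ID).
# Step 2: the sampling frame, a list of the subjects that can be selected.
sampling_frame = [f"household-{i:05d}" for i in range(1, 10_001)]

# Step 3: the sampling design; here, a simple random sample without replacement.
def draw_sample(frame, n, seed=None):
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Step 4: collect the data from the selected subjects (not shown here).
selected = draw_sample(sampling_frame, n=100, seed=2008)
print(len(selected), selected[:3])
```

Fixing the seed makes the selection reproducible, which can help when the survey design has to be documented or audited.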
20. Designed Experiments
- Experimental Study: An experiment is a study in which one or more variables are deliberately manipulated and data is collected to determine the effects of those variables (called explanatory variables) on another variable (called the response variable). That is, the explanatory variable is controlled to see how the response variable changes with changes in the explanatory variable. The conditions placed on the explanatory variable are called treatments. In this type of study, the explanatory variable is sometimes called a factor of the experiment.
21. Design of Experiments
22. Remark
Observational studies are useful for detecting connections between two variables in a population. Experimental studies are useful to determine the nature of the connection.
23. Types of Sampling
- Random (good)
- Non-random (bad)
Examples: Suppose that our population is 200 students who are seated in a classroom of 10 rows with 20 seats per row. If we choose as our sample the students who sit in the even-numbered rows, then this would be a non-random sample. Now suppose that we place 10 balls, each marked with a different number (1-10), in a bag. We would generate a random sample of 20 students by drawing one ball from the bag and using the number on the ball as the row for our sample.
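The classroom example can be simulated with a short Python sketch. Representing each student as a (row, seat) pair is an assumption made here for illustration.

```python
import random

ROWS, SEATS_PER_ROW = 10, 20

# Each student is identified by (row, seat).
classroom = [(row, seat)
             for row in range(1, ROWS + 1)
             for seat in range(1, SEATS_PER_ROW + 1)]

# Non-random sample: always the students in the even-numbered rows.
non_random = [s for s in classroom if s[0] % 2 == 0]

# Random sample: draw one ball numbered 1-10 and take that whole row.
def draw_row_sample(rng=random):
    ball = rng.randint(1, ROWS)
    return [s for s in classroom if s[0] == ball]

sample = draw_row_sample()
print(len(non_random))  # 100
print(len(sample))      # 20
```

Every run of draw_row_sample may return a different row, whereas the even-row rule always returns the same 100 students.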
Section 1.3
24. Simple Random Sample
- Simple Random Sampling: Each individual in the population has the same chance of being selected for a sample as any other individual. More precisely, every possible sample of the given size is equally likely to be chosen. A list of individuals in the population from which a sample is to be drawn is called a frame.
25. Two Sets of Random Numbers
Frequency Chart of Numbers
26. Types of Samples
- Simple Random Sample: A sample that is obtained by randomly choosing individuals in the population.
- Stratified Sample: A stratified sample is a sample that is obtained by separating the population into non-overlapping groups (called strata) and then randomly selecting individuals from each stratum.
- Systematic Sample: A systematic sample is a sample that is obtained by selecting individuals in the population in a systematic way, e.g., every 5th individual.
- Cluster Sample: A cluster sample is a sample that is obtained by selecting all individuals within a randomly selected subset or group of the population.
- Convenience Sample: A convenience sample is a type of sample that is drawn because it is easy or convenient to collect. Convenience samples are likely to underrepresent portions of the population. They may not be random and may contain bias due to time or location.
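The stratified, systematic, and cluster methods can be compared in a minimal Python sketch. The function names, the population of 100 numbered individuals, and the four groups of 25 are all hypothetical. The random starting position in the systematic sample is also an assumption; the slide only says "every 5th individual".

```python
import random

def stratified_sample(strata, n_per_stratum, rng):
    """Randomly select n_per_stratum individuals from every stratum."""
    return [x for group in strata.values()
            for x in rng.sample(group, n_per_stratum)]

def systematic_sample(population, k, rng):
    """Select every k-th individual, starting from a random position."""
    start = rng.randrange(k)
    return population[start::k]

def cluster_sample(clusters, n_clusters, rng):
    """Select all individuals in n_clusters randomly chosen clusters."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [x for name in chosen for x in clusters[name]]

rng = random.Random(0)
population = list(range(1, 101))  # individuals numbered 1..100
groups = {g: population[g * 25:(g + 1) * 25] for g in range(4)}

print(len(stratified_sample(groups, 5, rng)))       # 20
print(len(systematic_sample(population, 5, rng)))   # 20
print(len(cluster_sample(groups, 2, rng)))          # 50
```

The same dictionary of groups serves as strata in one call and as clusters in another: stratified sampling takes a few individuals from every group, while cluster sampling takes everyone from a few groups.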
27. Three Main Sampling Methods
Random
Cluster
Stratified
28. Advantages of Different Random Sampling Methods
- Simple Random Sampling: Gives a good picture of the whole population.
- Cluster Random Sampling: Often easier and cheaper to implement because subjects are close together and well-defined once clusters are chosen.
- Stratified Random Sampling: Guarantees that each stratum (segment) is sampled.
29. Sources of Errors in Sampling
- Fact: Erroneous conclusions can be drawn from observational or experimental studies due to faulty statistical design and sampling.
- Non-sampling Errors: These errors occur when the sampling process (design) is faulty. This usually occurs when there is a problem with the sampling frame or sampling design. In other words, preference is given to selecting some individuals over other individuals in the population.
  - response errors
  - non-response errors
  - processing errors
  - analysis errors
  - coverage errors
- Sampling or Estimation Errors: This error occurs when the sample gives an incomplete picture of the population. This type of error is due to the fact that we are using a sample instead of the whole population.
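Sampling error can be seen in a small simulation. This is a sketch with a made-up population of 10,000 values. Each simple random sample gives a slightly different estimate of the population mean even though nothing is wrong with the design.

```python
import random
from statistics import mean

rng = random.Random(42)

# Hypothetical population of 10,000 measurements (made-up values).
population = [rng.gauss(50, 10) for _ in range(10_000)]
mu = mean(population)  # the parameter

# Draw several simple random samples and compare each estimate with mu.
errors = []
for _ in range(5):
    sample = rng.sample(population, 100)
    errors.append(mean(sample) - mu)

print([round(e, 2) for e in errors])
```

The errors are typically small and scatter on both sides of zero. They disappear only when the whole population is measured.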
Section 1.4
30. Non-sampling Errors
- Response Errors: Poor questionnaire design, interviewer bias, respondent errors, poor survey process. For example, the organization of the survey could be confusing, individuals may give deceptive responses to questions, the data collector may not speak the language of the individual to be interviewed, etc.
- Non-response Errors: Complete or partial non-response. For example, individuals may agree to be interviewed, but then choose not to answer some or all of the questions.
- Processing Errors: Errors are made in coding, capturing, editing, and presenting the final data.
- Analysis Errors: Incorrect statistical tests are applied to the data, resulting in erroneous conclusions.
- Coverage Errors: Individuals are duplicated in or omitted from the sample.
31. Non-sampling Bias
Example: Suppose we are interested in the approval rating of Mayor Dean and we will conduct a random telephone survey on whether citizens of Nashville approve or disapprove of his job performance since he took office. Is there bias in this sample survey?
Answer: Maybe, since it will miss citizens who do not have a telephone, and this group of people may have different opinions about the mayor than those who do have a telephone.
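The mechanism behind this coverage bias can be shown with a toy calculation in Python. All of the counts are hypothetical and chosen only to illustrate the effect; they are not survey results.

```python
# Hypothetical numbers chosen only to illustrate the mechanism.
# Each citizen is recorded as (has_phone, approves).
citizens = (
    [(True, True)] * 540 + [(True, False)] * 360    # 900 with a telephone
    + [(False, True)] * 30 + [(False, False)] * 70  # 100 without a telephone
)

def approval_rate(people):
    """Fraction of the given people who approve."""
    return sum(approves for _, approves in people) / len(people)

everyone = approval_rate(citizens)
phone_only = approval_rate([c for c in citizens if c[0]])
print(everyone)    # 0.57
print(phone_only)  # 0.6
```

Even a perfectly random telephone survey would estimate the phone-only rate, so the estimate is biased whenever the two groups hold different opinions.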
32. Design of Experiments
Review from Section 1.3: An experiment is a study for the collection of data that is used to determine the effects of one or more variables (called explanatory variables) on another variable (called the response variable). The individuals from which the data is collected are called subjects or experimental units. The conditions placed on the explanatory variable are called treatments. In this type of study, the explanatory variable is sometimes called a factor. An experiment is called double-blind if neither the subjects nor the experimenter knows which treatment is being administered to each subject. We say that the experiment is completely randomized if each experimental unit is randomly assigned to a treatment. A randomized experiment comparing medical treatments is called a clinical trial.
Section 1.5
33. Types of Experiments
- Completely Randomized Design: Each experimental unit is randomly assigned a treatment.
- Randomized Matched-pairs Design: Experimental units are paired, with each experimental unit in the pair assigned a different treatment. The matched pair can be the same individual so that the individual receives both treatments (e.g., before and after).
- Randomized Block Design: Experimental units are grouped into blocks. Units in each group (block) are randomly assigned treatments.
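A completely randomized assignment can be sketched in Python. The unit names and the function name are hypothetical. Dealing the shuffled units out in turn, so that the groups end up equal in size, is one common choice and not something the slides require.

```python
import random

def completely_randomized(units, treatments, rng):
    """Shuffle the units, then deal them out to the treatments in turn."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    return {unit: treatments[i % len(treatments)]
            for i, unit in enumerate(shuffled)}

rng = random.Random(7)
units = [f"unit-{i}" for i in range(1, 9)]
plan = completely_randomized(units, ["treatment", "placebo"], rng)

print(sum(t == "treatment" for t in plan.values()))  # 4
print(sum(t == "placebo" for t in plan.values()))    # 4
```

A randomized block design would apply the same procedure separately inside each block instead of to all units at once.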
34. Example
Object of Study: Does aspirin reduce the heart attack rate?
Population: Male physicians in the U.S.
Sample: 22,071 male physicians between the ages of 40 and 84.
Study: The sample was split into two groups. One group took an aspirin per day and the other group took a placebo. The doctors were randomly assigned to these two groups. The doctors were monitored over a 5-year period.
Explanatory Variable: aspirin, yes or no (categorical)
Response Variable: heart attack, yes or no (categorical)
Type of Experiment: Completely randomized design.
35. Example (continued)
Heart attack   Yes      No       Total
Aspirin        104      10,933   11,037
Placebo        189      10,845   11,034
Total          293      21,778   22,071
This is an experiment and the aspirin/placebo are the treatments. We manipulated the explanatory variable to see the effect on the response variable.
36. Example (continued)
Fraction of Heart Attacks for both Treatments
Heart attack   Yes      No       Total
Aspirin        0.0094   0.9906   1.0
Placebo        0.0171   0.9829   1.0
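The fractions on this slide follow directly from the counts on the previous slide, as a short Python check shows. The function name fractions is a hypothetical helper.

```python
def fractions(yes, no):
    """Return the fraction with and without a heart attack."""
    total = yes + no
    return yes / total, no / total

# Counts from the aspirin study table: (heart attack, no heart attack).
counts = {"Aspirin": (104, 10_933), "Placebo": (189, 10_845)}

for treatment, (yes, no) in counts.items():
    p_yes, p_no = fractions(yes, no)
    print(f"{treatment}: yes={p_yes:.4f}  no={p_no:.4f}  "
          f"per 1000={1000 * p_yes:.1f}")
# Aspirin: yes=0.0094  no=0.9906  per 1000=9.4
# Placebo: yes=0.0171  no=0.9829  per 1000=17.1
```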
37. Example (continued)
Conclusion from Study: The heart attack rate per 1000 male physicians is 9.4 for those taking aspirin and 17.1 for those taking the placebo. Hence, we would conclude that taking aspirin reduces the heart attack rate.
38. Matched-pairs Designs
A matched-pair design experiment is a study where there are only two treatments and the experimental units are matched in pairs. One experimental unit in each pair receives one treatment and the other experimental unit receives the second treatment. A pair may be the same individual (before treatment and after treatment) or it may be two individuals who have similar characteristics (e.g., gender, age, etc.). The assignment of the treatments within each pair should be random.
39. Example of Matched-Pairs
Purpose: Study the effect of taking caffeine one half hour before swimming.
Sample: 50 randomly chosen swimmers.
Explanatory Variable: A caffeine pill or a placebo.
Response Variable: Time to swim one mile.
Study Design: Experiment, matched-pair design.
The 50 swimmers are selected. Each swimmer is randomly given the caffeine pill or the placebo and swims one mile with the time recorded. After 1 week, the same 50 swimmers return and are given the treatment that they did not receive the previous week. They swim the mile and the time is recorded. Each swimmer's two times (caffeine versus placebo) are compared.
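The matched-pair procedure for the swimmers can be sketched in Python. The swimmer labels and the two example times are made up, and the helper paired_differences is hypothetical; only the structure (random order of treatments, then a within-swimmer comparison) comes from the example.

```python
import random
from statistics import mean

rng = random.Random(39)
swimmers = [f"swimmer-{i}" for i in range(1, 51)]

# Randomize which treatment each swimmer receives in week 1;
# the other treatment is given in week 2.
order = {}
for s in swimmers:
    first = rng.choice(["caffeine", "placebo"])
    second = "placebo" if first == "caffeine" else "caffeine"
    order[s] = (first, second)

def paired_differences(times):
    """times maps swimmer -> {'caffeine': minutes, 'placebo': minutes}."""
    return [t["caffeine"] - t["placebo"] for t in times.values()]

# Made-up times for two swimmers, purely to show the calculation.
example = {"swimmer-1": {"caffeine": 29.5, "placebo": 30.0},
           "swimmer-2": {"caffeine": 31.0, "placebo": 30.5}}
diffs = paired_differences(example)
print(diffs)        # [-0.5, 0.5]
print(mean(diffs))  # 0.0
```

Working with the difference for each swimmer removes swimmer-to-swimmer variation in ability from the comparison.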
40. Blocks and Block Designs
- A collection of experimental units that have the same (or similar) values on a key variable is called a block. In the previous example, each subject (person) is a block.
- Experimental units are divided into groups (blocks) and each treatment is randomly assigned to one or more of the units in each block. In other words, a block design identifies blocks before the start of the experiment and assigns subjects to treatments within those blocks.
- To reduce bias, the order of treatments within each block is randomized, and we call this a randomized block design.
- A matched-pair design is a special type of block design. Here each pair of experimental units forms a block.
- In a block design study, an experimental unit (subject) may receive only one treatment.
41. Example of Block Design
Purpose: Study the effect of taking caffeine one half hour before swimming.
Sample: 50 swimmers: 16 males who swim competitively, 14 males who do not swim competitively, 8 females who swim competitively, and 12 females who do not swim competitively.
Explanatory Variable: A caffeine pill or a placebo.
Response Variable: Time to swim one mile.
Study Design: Experiment, randomized block design.
We create four blocks (16, 14, 8, and 12 subjects). Within each block, individuals are randomly assigned either the caffeine pill or the placebo. Each subject's swim time is recorded. The times of the swimmers within each block, as well as across the blocks, are compared (caffeine pill versus placebo).
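The within-block randomization for the four groups of swimmers might look like the following Python sketch. The function name and subject labels are hypothetical. Splitting each block evenly between the two treatments is an assumption, since the example does not say how many swimmers in each block receive caffeine.

```python
import random

def assign_within_blocks(blocks, treatments, rng):
    """Randomly assign treatments separately inside each block."""
    plan = {}
    for name, units in blocks.items():
        shuffled = units[:]
        rng.shuffle(shuffled)
        for i, unit in enumerate(shuffled):
            plan[unit] = (name, treatments[i % len(treatments)])
    return plan

# Block sizes from the example.
sizes = {"male competitive": 16, "male non-competitive": 14,
         "female competitive": 8, "female non-competitive": 12}
blocks = {name: [f"{name} #{i}" for i in range(1, n + 1)]
          for name, n in sizes.items()}

plan = assign_within_blocks(blocks, ["caffeine", "placebo"], random.Random(41))
for name in sizes:
    n_caffeine = sum(1 for b, t in plan.values()
                     if b == name and t == "caffeine")
    print(name, n_caffeine)
```

Because the shuffle happens inside each block, every block contributes swimmers to both treatments, so gender and competitive status cannot be confounded with the treatment.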
42. What type of experiment?
A drug company wanted to test a new arthritis
medication. The researchers found 200 adults
aged 25-35 and randomly assigned them to two
groups. The first group received the new drug,
while the second received a placebo. After one
month of treatment, the percentage of each group
whose arthritis symptoms decreased was recorded
and compared with their original condition. What
type of experimental design is this?
43. What type of experiment?
A medical journal published the results of an experiment on insomnia. The experiment investigated the effects of a controversial new therapy for insomnia. Researchers measured the insomnia levels of 86 adult women who suffer moderate conditions of the disorder. After the therapy, the researchers again measured the women's insomnia levels. The differences between the pre- and post-therapy insomnia levels were reported. What type of experimental design is this?
44. What type of experiment?
A farmer wishes to test the effects of a new
fertilizer on her tomato yield. She has four
equal-sized plots of land--one with sandy soil,
one with rocky soil, one with clay-rich soil, and
one with average soil. She divides each of the
four plots into three equal-sized portions and
randomly labels them A, B, and C. The four A
portions of land are treated with her old
fertilizer. The four B portions are treated with
the new fertilizer, and the four C's are treated
with no fertilizer. At harvest time, the tomato
yield is recorded for each section of land. What
type of experimental design is this?
45. What type of experiment?
A random sample of 1,000 overweight male adults is recruited. Each male is weighed and his weight is recorded. Each individual is given a diet and is told to follow it for one month. After one month, each individual is weighed again and the weight is recorded. The before and after weights are compared. What type of experimental design is this?
46. What type of experiment?
A random sample of 30 Vanderbilt students is selected. We are interested in reaction times when using or not using a cell phone while driving. Each student's reaction time was measured when he or she was using and when he or she was not using a cell phone on a driving course in a Vanderbilt parking lot. What type of experimental design is this?