Chapter 1: Data Collection

Slides: 47

Provided by: philip52

Learn more at: https://math.vanderbilt.edu

Transcript and Presenter's Notes

Title: Chapter 1: Data Collection

1
Chapter 1: Data Collection
1.1 Introduction to the Practice of Statistics
1.2 Observational Studies, Experiments, and Simple Random Sampling
1.3 Other Effective Sampling Methods
1.4 Sources of Errors in Sampling
1.5 The Design of Experiments
1
September 3, 2008
2
Definition of Statistics

Given a question, statistics is the art and
science of designing studies, collecting the data,
summarizing the data, and then analyzing the data
to draw conclusions. In particular, statistics
involves
collecting data
organizing this data
summarizing the organized data
analyzing the summarized data
drawing conclusions from this analysis

2
Section 1.1
3
Data
Data is information that is collected about a
generic population (people, animals, machines,
etc.). In the social sciences it is usually about
people: their characteristics (height, weight, age,
etc.) or attitudes (beliefs, political opinions,
religion, etc.).
3
4
Types of Statistics

Descriptive Statistics: This type of statistics
uses graphs, tables, charts and the calculation
of various statistical measures (mean, standard
deviation, etc.) to organize and summarize
information about a population. This is the
material in Math 127A.
Inferential Statistics: This type of statistics
consists of techniques (hypothesis testing,
confidence intervals, etc.) to reach conclusions
about a population based upon information
obtained from a subset of the population. This is
the material in Math 127B.

4
5
Average Yearly Temperature in Nashville
Question: Is the climate of Nashville warming?
The average temperature of Nashville is available
on the National Weather Service website for
1872-2007. The average daily temperature is
calculated by summing the highest and lowest
hourly temperatures and then dividing by 2. The
monthly average temperature is obtained by
computing the average of the daily average
temperatures, and the yearly average temperature
is obtained by computing the average of the
monthly average temperatures.
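
The averaging scheme described on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names and the temperature readings are made up for the example and are not National Weather Service data.

```python
from statistics import mean

def daily_average(high, low):
    """Daily average: (highest + lowest hourly temperature) / 2."""
    return (high + low) / 2

def monthly_average(daily_averages):
    """Monthly average: mean of the daily averages in that month."""
    return mean(daily_averages)

def yearly_average(monthly_averages):
    """Yearly average: mean of the monthly averages in that year."""
    return mean(monthly_averages)

# Made-up (high, low) readings in degrees Fahrenheit, for illustration only.
example_days = [(50, 30), (54, 34), (46, 30)]
example_daily = [daily_average(high, low) for high, low in example_days]
example_month = monthly_average(example_daily)
```

Note that the yearly figure is an average of monthly averages, so each month counts equally regardless of how many days it has.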
5
6
Mathematica Notebook
6
7
The Statistical Method (QDDI)

Question: What is the problem of interest?
Identify your research objective.
Design: How will the data be collected? From
whom? About what?
Description: Give the characteristics of the
data. This is where mathematics can play a major
role. Summarize the data. Give a graphical
description of the data. (Descriptive Statistics)
Inference: What does the data tell us? If you
started with a hypothesis, does the data confirm
this hypothesis? (Inferential Statistics)

7
8
Example
Harvard Medical School studied about 22,000 male
physicians to determine if taking aspirin could
prevent heart attacks. The physicians were split
into two groups of about 11,000 each: one group
would receive an aspirin per day and the other
would receive a placebo. The assignment of
physicians was done randomly. During the course
of the study, 0.9% of the male physicians who
were taking aspirin had a heart attack, while
1.7% of those taking the placebo experienced a
heart attack. They then used the statistical
method to predict that if all male physicians
could have participated in the study, the
percentage having a heart attack would have been
lower for those taking aspirin.
8
9
QDDI

Question: Does taking aspirin each day reduce the
incidence of heart attacks in male physicians?
Design: Take a sample with half taking aspirin and
half taking a placebo. This is called an
experiment.
Description: Heart attack rate for aspirin (0.9%)
versus placebo (1.7%).
Inference: All male physicians would benefit from
taking daily aspirin.

9
10
Terminology of Statistics

Population: A population is the complete
collection of all elements to be studied.
Sample: Any subset or group of a population is
called a sample.
Variable: A variable is a characteristic of the
individuals in the population that will be
analyzed.
Parameter: A parameter is a numerical summary of a
variable for the population.
Statistic: A statistic is a numerical summary of a
variable obtained from a sample of the
population.
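
To make the parameter/statistic distinction concrete, here is a minimal sketch in Python. The ages are hypothetical values chosen for illustration, and treating the first four entries as "the sample" is just an assumption of the example.

```python
from statistics import mean

# Hypothetical ages for a small population (illustrative values only).
population_ages = [18, 19, 18, 20, 21, 19, 18, 22]

# Hypothetical sample: a subset of the population, e.g. one row of students.
sample_ages = population_ages[:4]

parameter = mean(population_ages)  # numerical summary of the whole population
statistic = mean(sample_ages)      # numerical summary of the sample only
```

The same calculation (a mean) yields a parameter or a statistic depending only on whether it is applied to the whole population or to a sample.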

10
11
Types of Data

Quantitative data is composed of measurements
(numbers) about the population.
Categorical (or qualitative) data is data that
can be separated into categories and can be
identified by some non-numeric characteristic.
Continuous data is quantitative data that can
take any value in an interval.
Discrete data is quantitative data that is not
continuous.

11
12
Example

Population: All of the students in Math 127A that
are in WH 103 today.
Sample: The students in Row 10 of the classroom.
Variables:
Color of eyes
Month of birth
Home state
Age
Religion

12
13
Example (continued)

Data (Qualitative/Quantitative):
Blue eyes
October
Georgia
18
Lutheran
Parameters:
The average age.
The standard deviation of heights.
Statistics:
The average age of students in Row 5.
The fraction of students with blue eyes in Row 9.

13
14
Data for Statistical Studies

Census: A census is a list of all individuals in a
population along with certain characteristics of
each individual in the population (e.g., age,
race, home ownership, etc.).
Observational Study: An observational study
attempts to measure a characteristic of the
population by examining a sample, but does not
manipulate the sample. An observational study
often uses a sample survey to collect data.
Experimental Study: An experiment selects a
sample of the population and manipulates one or
more variables. The variable that is manipulated
is called an independent variable and the
variable that is affected is called a dependent
variable.

14
Section 1.2
15
Census Website
http://www.census.gov
15
16
Observational Study

Observational Study: An observational study
measures the characteristics of a population by
studying a sample of individuals. It attempts to
find connections between these characteristics
without manipulation of the sample. The study is
passive or ex post facto.

16
17
Design of Observational Studies
17
18
Example of Sample Survey

Sample Survey: A random sample of 10,000 people
in which each individual is interviewed to
determine information about the following
variables of the population:
age
race
gender
number of children
income bracket (0-25K, 25K-50K, ...)
wealth bracket
homeowner
Question: Is there a relationship between
homeownership and number of children?

18
19
Algorithm for Setting Up a Sample Survey

Step 1: Identify the population from which the
sample is to be drawn.
Step 2: Compile a list of subjects in the
population from which the sample will be taken.
This is called the sampling frame.
Step 3: Specify a method for selecting subjects
from the sampling frame. This is called the
sampling design.
Step 4: Collect the data.
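
The four steps above can be sketched as a short Python routine. Everything named here (`run_sample_survey`, the `in_frame` predicate, the numbered residents) is hypothetical, and the sampling design is assumed to be a simple random sample; other designs would replace the `rng.sample` call.

```python
import random

def run_sample_survey(population, in_frame, n, rng):
    """Build a sampling frame from the population and draw n subjects from it."""
    # Step 2: the sampling frame lists the subjects we can actually reach.
    frame = [person for person in population if in_frame(person)]
    # Step 3: sampling design; here, a simple random sample without replacement.
    sample = rng.sample(frame, n)
    return frame, sample

# Step 1: hypothetical population of 1,000 numbered residents.
population = list(range(1, 1001))

# Assumption for illustration: every resident appears in the frame.
frame, sample = run_sample_survey(population, lambda person: True, 10, random.Random(1))

# Step 4: collect the data by interviewing each person in `sample`.
```

Separating the frame from the population makes it visible when the two differ, which is where coverage problems arise.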

19
20
Designed Experiments

Experimental Study: An experiment is a study in
which one or more variables (called explanatory
variables) are manipulated to determine their
effects on another variable (called the response
variable). That is, the explanatory variable is
controlled to see how the response variable
changes with changes in the explanatory variable.
The conditions placed on the explanatory variable
are called treatments. In this type of study, the
explanatory variable is sometimes called a factor
of the experiment.

20
21
Design of Experiments
21
22
Remark
Observational studies are useful for detecting
connections between two variables in a
population. Experimental studies are useful to
determine the nature of the connection.
22
23
Types of Sampling

Random (good)
Non-random (bad)

Examples: Suppose that our population is 200
students who are seated in a classroom of 10 rows
with 20 seats per row. If we chose as our sample
the students who sit in the even-numbered rows,
then this would be a non-random sample. Suppose
instead that we place 10 balls, each marked with
a separate number (1-10), in a bag. We could
generate a random sample of 20 students by
drawing one ball from the bag and using the
number on the ball as the row for our sample.
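
The classroom example can be mimicked in Python. This is a sketch under the slide's setup (10 rows of 20 seats); the function names are invented for the example, and the random number generator stands in for drawing a ball from the bag.

```python
import random

ROWS, SEATS_PER_ROW = 10, 20

def seats_in_row(row):
    """All (row, seat) positions in one row of the classroom."""
    return [(row, seat) for seat in range(1, SEATS_PER_ROW + 1)]

def random_row_sample(rng):
    """Random sample: draw one row number (the ball) and take that whole row."""
    row = rng.randint(1, ROWS)  # randint includes both endpoints, 1 and 10
    return seats_in_row(row)

def even_row_sample():
    """Non-random sample: always the even-numbered rows."""
    return [seat for row in range(2, ROWS + 1, 2) for seat in seats_in_row(row)]

sample = random_row_sample(random.Random(0))
```

The even-row rule returns the same 100 students every time, whereas the ball draw gives each row the same chance of being chosen.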
23
Section 1.3
24
Simple Random Sample

Simple Random Sampling: Each individual in the
population has the same chance of being selected
for the sample as any other individual, and every
possible sample of a given size is equally
likely. A list of the individuals in the
population from which a sample is to be drawn is
called a frame.

24
25
Two Sets of Random Numbers
Frequency Chart of Numbers
25
26
Types of Samples
Simple Random Sample: A sample that is obtained
by randomly choosing individuals in the
population.
Stratified Sample: A stratified sample is a
sample that is obtained by separating the
population into non-overlapping groups (called
strata) and then randomly selecting individuals
from each stratum.
Systematic Sample: A systematic sample is a
sample that is obtained by selecting individuals
in the population in a systematic way, e.g.,
every 5th individual.
Cluster Sample: A cluster sample is a sample that
is obtained by selecting all individuals within a
randomly selected subset or group of the
population.
Convenience Sample: A convenience sample is a
type of sample that is drawn because it is easy
or convenient to collect. Convenience samples are
likely to underrepresent portions of the
population. They may not be random and may
contain bias due to time or location.
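
The stratified, systematic, and cluster definitions above can be sketched as small Python functions. The function names, the dictionary layout for strata and clusters, and the random starting point for the systematic sample are assumptions of this sketch rather than anything prescribed by the slides.

```python
import random

def stratified_sample(strata, n_per_stratum, rng):
    """strata: dict of stratum name -> list of individuals.
    Randomly selects n_per_stratum individuals from every stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def systematic_sample(frame, k, rng):
    """Every k-th individual in the frame, starting from a random offset."""
    start = rng.randrange(k)
    return frame[start::k]

def cluster_sample(clusters, n_clusters, rng):
    """clusters: dict of cluster name -> list of individuals.
    Randomly selects whole clusters and keeps everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]

# Hypothetical frame: 200 numbered students seated in 10 rows of 20.
rng = random.Random(0)
frame = list(range(200))
rows = {row: frame[row * 20:(row + 1) * 20] for row in range(10)}

every_fifth = systematic_sample(frame, 5, rng)
two_rows = cluster_sample(rows, 2, rng)
three_per_row = stratified_sample(rows, 3, rng)
```

Here the rows serve as clusters in one call and as strata in another, which highlights the difference: a cluster sample keeps everyone in a few groups, while a stratified sample takes a few individuals from every group.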
26
Section 1.3
27
Three Main Sampling Methods
Random
Cluster
Stratified
27
28
Advantages of Different Random Sampling Methods

Simple Random Sampling: Gives a good picture of
the whole population.
Cluster Random Sampling: It is often easier and
cheaper to implement because subjects are close
together and well-defined once clusters are
chosen.
Stratified Random Sampling: Guarantees that each
stratum (segment) is sampled.

28
29
Sources of Errors in Sampling

Fact: Erroneous conclusions can be drawn from
observational or experimental studies due to
faulty statistical design and sampling.
Non-sampling Errors: These errors occur when the
sampling process (design) is faulty. This
usually occurs when there is a problem with the
sampling frame or sampling design. In other
words, preference is given to selecting some
individuals over other individuals in the
population.
response errors
non-response errors
processing errors
analysis errors
coverage errors
Sampling or Estimation Errors: This error
occurs when the sample gives an incomplete
picture of the population. This type of error is
due to the fact that we are using a sample
instead of the whole population.
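
Sampling error can be illustrated with a short simulation. This is a sketch only: the population (the integers 1 to 200) is hypothetical, the mean is used as the summary of interest, and `sampling_error` is a name invented for the example.

```python
import random
from statistics import mean

def sampling_error(population, n, rng):
    """Sample mean minus population mean for one simple random sample of size n."""
    return mean(rng.sample(population, n)) - mean(population)

# Hypothetical population: the integers 1..200 (illustrative values only).
population = list(range(1, 201))
rng = random.Random(0)

error_small = sampling_error(population, 10, rng)                # a sample of 10
error_census = sampling_error(population, len(population), rng)  # everyone
```

When the "sample" is the whole population the error is zero, which is the sense in which this error exists only because a sample is used instead of the population.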

29
Section 1.4
30
Non-sampling Errors

Response Errors: Poor questionnaire design,
interview bias, respondent errors, poor survey
process. For example, the organization of the
survey could be confusing, individuals may give
deceptive responses to questions, the data
collector may not speak the language of the
individual to be interviewed, etc.
Non-response Errors: Complete or partial
non-response. For example, individuals may agree
to be interviewed, but then choose not to answer
some or all of the questions.
Processing Errors: There are computational
errors in coding, capturing, editing and
presenting the final data.
Analysis Errors: Incorrect statistical tests are
applied to the data, resulting in erroneous
conclusions.
Coverage Errors: Individuals are duplicated in or
omitted from the sample.

31
Non-sampling Bias
Example: Suppose we are interested in the approval
rating of Mayor Dean and we will conduct a random
telephone survey on whether citizens of Nashville
approve or disapprove of his job performance
since he took office. Is there bias in this
sample survey?
Answer: Maybe, since it will miss citizens who do
not have a telephone, and this group of people
may have different opinions about the mayor than
those who do have a telephone.
31
32
Design of Experiments
Review from Section 1.3: An experiment is a
study for the collection of data that is used to
determine the effects of one or more variables
(called explanatory variables) on another
variable (called the response variable). The
individuals from which the data is collected are
called subjects or experimental units. The
conditions placed on the explanatory variable are
called treatments. In this type of study, the
explanatory variable is sometimes called a
factor. An experiment is called double-blind if
neither the subjects nor the experimenter knows
which treatment is being administered to each
subject. We say that the experiment is
completely randomized if each experimental unit
is randomly assigned to a treatment. A randomized
experiment comparing medical treatments is called
a clinical trial.
32
Section 1.5
33
Types of Experiments

Completely Randomized Design: Each experimental
unit is randomly assigned a treatment.
Randomized Matched-pairs Design: Experimental
units are paired, with each experimental unit in
the pair assigned a different treatment. The
matched pair can be the same individual, so that
the individual receives both treatments (e.g.,
before and after).
Randomized Block Design: Experimental units are
grouped together in groups. Units in each group
(block) are randomly assigned treatments.
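
As a minimal sketch of the first design, the following Python function assigns treatments completely at random. The function name, the unit labels, and the choice to keep group sizes as equal as possible are assumptions of the example; the slides do not specify how the randomization is carried out.

```python
import random

def completely_randomized(units, treatments, rng):
    """Shuffle the units, then deal the treatments out in turn so that
    group sizes differ by at most one."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: treatments[i % len(treatments)]
            for i, unit in enumerate(shuffled)}

# Hypothetical experiment: 20 numbered units and two treatments.
assignment = completely_randomized(range(1, 21), ["treatment", "placebo"],
                                   random.Random(0))
```

A matched-pairs or block design would apply the same kind of randomization separately inside each pair or block instead of across all units at once.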

34
Example
Object of Study: Does aspirin reduce the heart
attack rate?
Population: Male physicians in the U.S.
Sample: 22,071 male physicians between the ages
of 40 and 84.
Study: The sample was split into two groups. One
group took an aspirin per day and the other group
took a placebo. The doctors were randomly
assigned to these two groups. The doctors were
monitored over a 5-year period.
Explanatory Variable: aspirin, yes or no
(categorical)
Response Variable: heart attack, yes or no
(categorical)
Type of Experiment: Completely randomized design.
34
35
Example (continued)
Heart attack:   Yes      No       Total
Aspirin         104      10,933   11,037
Placebo         189      10,845   11,034
Total           293      21,778   22,071
This is an experiment and the aspirin/placebo are
the treatments. We manipulated the explanatory
variable to see the effect on the response
variable.
35
36
Example (continued)
Fraction of Heart Attacks for Both Treatments
           Yes      No       Total
Aspirin    0.0094   0.9906   1.0
Placebo    0.0171   0.9829   1.0
36
37
Example (continued)
Conclusion from Study: The heart attack rate per
1000 male physicians is 9.4 for those taking
aspirin and 17.1 for those taking the placebo.
Hence, we would conclude that taking aspirin
reduces the heart attack rate.
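
The fractions and the rates per 1000 follow directly from the counts in the table. A short Python check, using the counts given in the example (the variable and function names are chosen for this sketch):

```python
# (heart attack: yes, heart attack: no) counts from the table in the example
counts = {"Aspirin": (104, 10_933), "Placebo": (189, 10_845)}

def heart_attack_rate(yes, no):
    """Fraction of a group that had a heart attack."""
    return yes / (yes + no)

rates = {group: heart_attack_rate(*c) for group, c in counts.items()}
fractions = {group: round(rate, 4) for group, rate in rates.items()}
per_1000 = {group: round(1000 * rate, 1) for group, rate in rates.items()}
```

Each rate is computed within its own group (row total), not against the overall total of 22,071, since the two groups are being compared with each other.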
37
38
Matched-pairs Designs
A matched-pair design experiment is a study
where there are only two treatments and
experimental units are matched. One experimental
unit receives one treatment and the other
experimental unit receives the second treatment.
The pairs may be the same individual (before
treatment and after treatment) or they may be two
individuals who have similar characteristics
(e.g., gender, age, etc.). The assignment of the
treatments within each pair should be random.
38
39
Example of Matched-Pairs
Purpose: Study the effect of taking caffeine one
half hour before swimming.
Sample: 50 randomly chosen swimmers.
Explanatory Variable: A caffeine pill or a
placebo.
Response Variable: Time to swim one mile.
Study Design: Experiment
Matched-pair Design: The 50 swimmers are
selected. Each swimmer is randomly given the
caffeine pill or the placebo and swims one mile
with the time recorded. After 1 week, the same
50 swimmers return and are given the treatment
that they did not receive the previous week.
They swim the mile and the time is recorded.
Each swimmer's times under the two treatments are
compared.
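
A sketch of this matched-pair design in Python follows. The function names and data layout are hypothetical, the swim time in the usage line is made up, and comparing treatments by a per-swimmer difference is one natural choice, not something the slide specifies.

```python
import random

TREATMENTS = ("caffeine", "placebo")

def crossover_order(swimmers, rng):
    """Randomly decide which treatment each swimmer receives in week 1;
    the other treatment is given in week 2."""
    orders = {}
    for swimmer in swimmers:
        first = rng.choice(TREATMENTS)
        second = TREATMENTS[1] if first == TREATMENTS[0] else TREATMENTS[0]
        orders[swimmer] = (first, second)
    return orders

def paired_differences(times):
    """times: swimmer -> {"caffeine": minutes, "placebo": minutes}.
    Returns caffeine time minus placebo time for each swimmer."""
    return {swimmer: t["caffeine"] - t["placebo"] for swimmer, t in times.items()}

orders = crossover_order(range(1, 51), random.Random(0))

# Made-up times (in minutes) for a single swimmer, for illustration only.
example_diff = paired_differences({1: {"caffeine": 30.0, "placebo": 31.5}})
```

Because each swimmer serves as his or her own comparison, differences between swimmers (fitness, technique) cancel out of each paired difference.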
39
40
Blocks and Block Designs

A collection of experimental units that have the
same (or similar) values on a key variable is
called a block. In the previous example, each
subject (person) is a block.
Experimental units are divided into groups
(blocks) and each treatment is randomly assigned
to one or more of the units in each block. In
other words, a block design identifies blocks
before the start of the experiment and assigns
subjects to treatments within those blocks.
To reduce bias, the order of treatments within
each block is randomized and we call this a
randomized block design.
A matched-pair design is a special type of block
design. Here each pair of experimental units
forms a block.
In a block design study, an experimental unit
(subject) may receive only one treatment.

40
41
Example of Block Design
Purpose: Study the effect of taking caffeine one
half hour before swimming.
Sample: 50 swimmers: 16 males who swim
competitively, 14 males who do not swim
competitively, 8 females who swim competitively
and 12 females who do not swim competitively.
Explanatory Variable: A caffeine pill or a
placebo.
Response Variable: Time to swim one mile.
Study Design: Experiment
Randomized Block Design: We create four blocks
(16, 14, 8, 12 subjects). Within each block,
individuals take either the caffeine pill or the
placebo. Each subject's swim time is recorded.
The times of each swimmer within each block as
well as across the blocks are compared (caffeine
pill versus placebo).
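
The block assignment in this example can be sketched in Python using the four block sizes given above. The block labels, the function name, and the equal split between caffeine and placebo within each block are assumptions of this sketch; the slide does not say how many swimmers in each block receive each treatment.

```python
import random

# Block sizes from the example; the labels are chosen for this sketch.
BLOCK_SIZES = {
    "male, competitive": 16,
    "male, non-competitive": 14,
    "female, competitive": 8,
    "female, non-competitive": 12,
}

def randomized_block_assignment(block_sizes, rng):
    """Within each block, randomly assign half the swimmers to each treatment.
    Assumes every block size is even."""
    assignment = {}
    for block, size in block_sizes.items():
        labels = ["caffeine", "placebo"] * (size // 2)
        rng.shuffle(labels)
        for i, treatment in enumerate(labels, start=1):
            assignment[(block, i)] = treatment  # key: (block, swimmer number)
    return assignment

assignment = randomized_block_assignment(BLOCK_SIZES, random.Random(0))
```

Randomizing separately inside each block guarantees that both treatments appear in every block, so a difference between blocks cannot be mistaken for a treatment effect.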
41
42
What type of experiment?
A drug company wanted to test a new arthritis
medication. The researchers found 200 adults
aged 25-35 and randomly assigned them to two
groups. The first group received the new drug,
while the second received a placebo. After one
month of treatment, the percentage of each group
whose arthritis symptoms decreased was recorded
and compared with their original condition. What
type of experimental design is this?
43
What type of experiment?
A medical journal published the results of an
experiment on insomnia. The experiment
investigated the effects of a controversial new
therapy for insomnia. Researchers measured the
insomnia levels of 86 adult women who suffer
moderate conditions of the disorder. After the
therapy, the researchers again measured the
women's insomnia levels. The differences between
the pre- and post-therapy insomnia levels
were reported. What type of experimental design
is this?
44
What type of experiment?
A farmer wishes to test the effects of a new
fertilizer on her tomato yield. She has four
equal-sized plots of land--one with sandy soil,
one with rocky soil, one with clay-rich soil, and
one with average soil. She divides each of the
four plots into three equal-sized portions and
randomly labels them A, B, and C. The four A
portions of land are treated with her old
fertilizer. The four B portions are treated with
the new fertilizer, and the four C's are treated
with no fertilizer. At harvest time, the tomato
yield is recorded for each section of land. What
type of experimental design is this?
45
What type of experiment?
A random sample of 1,000 overweight male adults
is recruited. Each male is weighed and his
weight is recorded. Each individual is given a
diet and is told to follow it for one month.
After one month, each individual is weighed and
the weight is recorded. The before and after
weights are compared.
What type of experimental design is this?
46
What type of experiment?
A random sample of 30 Vanderbilt students is
selected. We are interested in the reaction times
when using or not using a cell phone during
driving. Each student's reaction time was
measured when he or she was using or not using a
cell phone on a driving course in a Vanderbilt
parking lot. What type of experimental design is
this?
