Title: STA616611
1 Introduction and Data Gathering (Chapters 1 and 2)
- At the end of this lecture, the student should:
- Be able to provide a definition of Statistics.
- Discuss the role of statistics in research.
- Be able to state reasons for using statistics.
- Identify the difference between observational and experimental studies.
- Be able to organize data into a two-dimensional matrix or array.
I hear and I forget. I see and I understand. I do and I remember.
(Chinese proverb)
2 A Motivating Example: The HIP Trial
- Breast cancer is a common malignancy among women in rich countries.
- Mammography (screening) is today known to lead to fewer deaths.
- HIP Trial (1960s): the first study to conclusively show the merits of screening.
- 62,000 women age 40-64 (members of the Health Insurance Plan, NY).
- Randomized into treatment and control groups, 31,000 in each.
- Treatment: an invitation to 4 rounds of annual screening.
- Control: received the usual health care prevalent at that time.
If we compare the cancer rates of the screened (1.1) vs. those who refused (1.5), there's hardly a difference? (More later!)
3 What Is Statistics?
- Descriptive Statistics. Summary measures, such as totals, averages or percentages of measurements, counts, or ranks. Graphics used to present, organize, and summarize data, e.g. pie charts, histograms, boxplots, scatterplots, etc.
- Inferential Statistics. The analysis and interpretation of data. Concerned with the extraction of information from data and its use in reaching conclusions (inferences) about the population from which the data are obtained. E.g. confidence intervals, hypothesis tests.
- We will concentrate on the second of these, inferential statistics, although the distinction will not always be clear.
4 Basic Definitions
- Experimental unit. The basic object on which measurements are taken. (May be composed of measurement units.)
- Factors. Variables in an experiment that are set by the investigator. (Controllable.)
- Response. Variable that is observed in an experiment. (Not controllable.)
- Treatments. Conditions constructed from the factors in order to observe the impact on the response.
- Control Treatment. Benchmark with respect to which the remaining treatments are compared.
- Population. The set of all measurements of interest.
- Sample. The subset of the population that is actually measured.
- Statistic. A number calculated from the sample, e.g. the sample average, the sample variance.
- Parameter. A number calculated from the entire population, e.g. the population average, the population variance.
5 Population vs. Sample
Using the sample average to make statements about
the population average is an example of
inferential statistics.
Descriptive statistical methods describe the
sample. Inferential statistical methods make
statements about the population based on the
sample.
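As a small illustration of this distinction, the sketch below simulates a hypothetical population and uses a sample average to estimate the population average. The population values, the seed, and the sample size of 50 are assumptions made only for this example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible (arbitrary choice)

# Hypothetical population: 10,000 measurements (made-up values).
population = [random.gauss(50, 10) for _ in range(10_000)]

# Parameter: computed from the entire population (usually unknown in practice).
population_mean = statistics.mean(population)

# Statistic: computed from a simple random sample of 50 measurements.
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"population mean (parameter): {population_mean:.2f}")
print(f"sample mean (statistic):     {sample_mean:.2f}")
```

Describing the 50 sampled values is descriptive statistics; using `sample_mean` to say something about `population_mean` is inference.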
6 First Principle of Statistical Inference
- You make inference about the population from which the sample was obtained. (Seems obvious, but is often forgotten.)
- In each of the examples below, identify the population being sampled and the inference being made:
- Study cow grazing behavior. One cow (Daisy) in pasture (A). Randomly select time intervals for observation during the month of May.
- Study capital punishment and homicide rates. Randomly select 100 US cities. Objective is to make causal statements about a process.
- In a pilot study, 20 runs of a manufacturing process are carried out in the lab. Objective: find out how the process will work in large-scale production.
- Study yield of 3 varieties of winter wheat. Randomly sample 30 farms in Kansas: 10 farms grow variety A, 10 variety B, and 10 variety C. Measure the yield per acre over one growing season.
7 Scientific Method
- The pursuit of systematic interrelation of facts by logical arguments from accepted postulates, observation, and experimentation, and a combination of these three in varying proportions.
Roles of Statistics
- Aid in creating the 'best' research design with which to generate new data.
- Extract the information from the noise or variability at the data analysis step.
8 Logical Arguments
- Deductive argument: Conclusion follows with logical necessity or certainty from the premises. Nothing new is revealed because we are arguing from the general to the specific.
- Specialization: Moving from a large set of objects, postulates, or events, to consideration of a smaller set of objects or events.
- Inductive argument: Discovering general laws by the observation and combination of particular instances. Passing from the specific to the general.
- Generalization: Passing from the consideration of one object, postulate, or occurrence, to the consideration of a set of objects, postulates, or occurrences.
In statistics we attempt to formalize and use
these concepts in a quantitative way.
9 Scientific Progress
We gain knowledge by iterating between models and
data.
10 Basic Study Steps
- State the problem. What are the questions?
- Devise a plan of solution. What will I do?
- Implement the plan. How do I do it?
- Analysis of data. What happened?
- Interpretation of results. What does this mean?
- Reexamination. Is my logic correct? What next?
Study design and study implementation may require
iteration.
11 Graphical Depiction of Scientific Study
[Figure: diagram of a scientific study; only the label "Problem" is recoverable.]
12 Research Design Categories
- Census (Complete Enumeration): Every individual in the population of interest is observed. In a census, the sample equals the population.
- Observational Studies (Mensurative Experiments): Populations to be compared are defined, and individuals are randomly selected from these populations for measurement. This involves mere data collection, with no interference with the processes generating the data.
- Experimental Studies (Manipulative Experiments): Individuals in one or more populations are carefully chosen or created to test specific manipulations under highly controlled conditions. Explanatory variables are manipulated; their effect on the response variable(s) is then observed.
13 Observational Study Design
- Observational studies are of 3 varieties:
- Sample survey: studies a population at a particular point in time.
- Prospective study: observes a population in the present using a sample survey, and proceeds to follow subjects into the future.
- Retrospective study: observes a population in the present using a sample survey, and collects data about the subjects on events in the past.
- The possible presence of confounding variables poses a severe limitation in observational studies.
- Confounder. A (non-measured) variable, other than the explanatory variable, that affects the response variable. Confounders may affect both response and explanatory variables, and are outside the control of the researcher.
14 Observational Study Design
- Example: Study lung cancer rates among smokers and non-smokers.
- What are the populations of interest?
- How will individuals be selected for measurement?
- What will be measured?
- Which analyses will be performed?
- How many individuals are needed?
- How large an effect will be considered important?
- Are available resources adequate for this study?
Many of these questions are answered by subject matter experts; some can be answered by a statistical analysis.
15 Observational Study (Mensurative Experiment)
What is measured?
16 How are individuals selected?
- Individually identified (the sample unit).
- Randomly chosen (no biases introduced in selection).
- Each possible set of individuals has the same probability of selection (Simple Random Sampling).
Special situations allow for increased efficacy of selection.
- Stratification (account for an extraneous factor)
- Clusters (select natural groups of sample units)
- Multi-stage (select large units then parts of units)
- Systematic (set pattern)
17 Simple Random Sampling
A researcher wishes to determine the prevalence
of a disease in a greenhouse of tomato seedlings.
Each seedling tested for the disease is destroyed
in the process, hence only a minimal number
should be tested. Expectations are that only
about .01 of the roughly 50,000 seedlings in the
greenhouse have the disease.
How to select a simple random sample?
- Number each pot. Use a random number table (or spreadsheet random number generator) to produce a list of numbers, in random order, from 1 to the total number of pots. Measure plants in pots whose numbers are selected (difficult).
- Align pots in rows and columns. Use a random number table to select a list of row and column number pairs. Measure the plant in each pot located at a selected (row, column) pair (easier).
Table 13 in Ott and Longnecker.
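The two selection schemes can be sketched with a software random number generator in place of a printed table. The greenhouse layout (200 rows by 250 columns), the sample size of 20, and the helper `to_row_col` are assumptions introduced only for this illustration.

```python
import random

random.seed(2)  # arbitrary seed, for reproducibility only

# Assumed layout for illustration: 50,000 pots arranged as 200 rows x 250 columns.
N_ROWS, N_COLS = 200, 250
N_POTS = N_ROWS * N_COLS
SAMPLE_SIZE = 20  # hypothetical number of seedlings we can afford to destroy

# Method 1: number each pot 1..N_POTS and draw pot numbers without replacement.
pot_numbers = random.sample(range(1, N_POTS + 1), SAMPLE_SIZE)


def to_row_col(pot_number):
    """Map pot number 1..N_POTS to a 1-based (row, column) position."""
    row, col = divmod(pot_number - 1, N_COLS)
    return row + 1, col + 1


# Method 2: select (row, column) pairs. Drawing distinct pot numbers first and
# converting them guarantees that no (row, column) pair is selected twice.
positions = [to_row_col(p) for p in random.sample(range(1, N_POTS + 1), SAMPLE_SIZE)]

print(sorted(pot_numbers))
print(positions)
```

Both methods give every set of 20 pots the same chance of selection; the second is easier to carry out because a (row, column) pair can be located without labeling every pot.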
18 Simple Random Sample
Textbook definition.
A simple random sample of n units is defined such
that each possible sample of size n is equally
likely to be drawn.
Practical definition.
This sampling principle assures that each unit in
the population has the same probability
(likelihood) of being selected in the sample.
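The two definitions are connected by a short counting argument. As a sketch, write N for the population size (the symbol N is introduced here; the slide names only n):

```latex
P(\text{a particular sample of size } n) = \frac{1}{\binom{N}{n}},
\qquad
P(\text{unit } i \text{ is in the sample})
  = \frac{\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{n}{N}.
```

So the textbook definition implies the practical one. The converse does not hold in general: systematic sampling, for instance, can give every unit the same chance of selection without making every possible sample equally likely.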
19 Stratification
Stratification allows us to take into account a factor we already know affects the response of interest, and so to remove a known source of variability.
Example: pine forest. Estimate expected yield from the plot.
Strata (from the figure): 16 years, healthy; 20 years, diseased; 22 years, healthy.
Individuals are selected at random within each stratum. Variability in the diseased subpopulation is expected to be much greater than in the healthy areas. Mean yield is greater at 22 years than at 16 years.
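A minimal sketch of a stratified estimate for the pine forest example follows. The per-tree yields, the stratum sizes, and the allocation of more sample units to the more variable (diseased) stratum are all assumptions made for illustration, not values from the slide.

```python
import random
import statistics

random.seed(3)  # arbitrary seed, for reproducibility only

# Hypothetical per-tree yields for the three strata (made-up numbers).
# The diseased stratum is simulated with a larger spread.
strata = {
    "16 years, healthy": [random.gauss(40, 4) for _ in range(300)],
    "20 years, diseased": [random.gauss(35, 12) for _ in range(300)],
    "22 years, healthy": [random.gauss(60, 4) for _ in range(300)],
}

# Assumed allocation: more units from the more variable stratum.
n_per_stratum = {
    "16 years, healthy": 10,
    "20 years, diseased": 30,
    "22 years, healthy": 10,
}

total_units = sum(len(units) for units in strata.values())
estimate = 0.0
for name, units in strata.items():
    sample = random.sample(units, n_per_stratum[name])  # SRS within the stratum
    weight = len(units) / total_units  # stratum's share of the population
    estimate += weight * statistics.mean(sample)

print(f"stratified estimate of mean yield: {estimate:.1f}")
```

Each stratum mean is weighted by the stratum's share of the population, so the known differences between strata do not inflate the variability of the overall estimate.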
20 Clusters
Estimate the average sponge size on natural reefs.
[Figure: reefs, each labeled with the number of sponges on the reef: 9, 25, 12, 21, 14, 7, 5.]
Selecting sponges at random would be very
resource inefficient. Cheaper to select reefs
(sponge clusters) at random with probability
proportional to size. All sponges on selected
reefs are measured (a cheap thing to do that
increases the sample size easily).
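The selection step can be sketched as follows, using the reef counts from the figure. The sponge sizes are made-up data, and drawing reefs with replacement via `random.choices` is a simplifying assumption, not necessarily the procedure intended on the slide.

```python
import random
import statistics

random.seed(4)  # arbitrary seed, for reproducibility only

# Number of sponges on each reef, as read off the slide's figure.
sponges_per_reef = [9, 25, 12, 21, 14, 7, 5]

# Hypothetical sponge sizes for every reef (made-up data, arbitrary units).
reefs = [[random.gauss(30, 8) for _ in range(count)] for count in sponges_per_reef]

# Select 3 reefs with probability proportional to size (number of sponges).
# Assumption: random.choices samples WITH replacement, so a reef can repeat.
N_SELECTED = 3
chosen = random.choices(range(len(reefs)), weights=sponges_per_reef, k=N_SELECTED)

# Measure every sponge on each selected reef. Under PPS sampling with
# replacement, the average of the selected reef means estimates the
# mean sponge size over all reefs.
reef_means = [statistics.mean(reefs[i]) for i in chosen]
estimate = statistics.mean(reef_means)

print(f"selected reefs: {chosen}, estimated mean sponge size: {estimate:.1f}")
```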
21 Multi-Stage Sampling
Typically large areas or large complex
populations can be more effectively sampled in
stages. At the first stage, natural or synthetic
clusters are selected. At subsequent stages the
selected clusters are subdivided into units and
samples of these are selected.
Example: National crop yield survey.
22 Greenhouse Example
- Stratification: Maybe we have observed that plants near the door seem less healthy than those further into the greenhouse. Divide the room into plants near the door and plants inside. Take random samples from each stratum.
- Cluster: Suppose plants are arranged on tables. We could select tables at random, then examine all plants on each table selected. Note that if one plant on a table is diseased, all plants on that table have an increased probability of also being diseased.
- Multi-Stage: Again suppose plants are on tables. Select some tables at random. Next select a few plants from each selected table for testing. First stage unit is the table. Second stage unit is the plant. Third stage unit could be the leaf on the plant, etc.
- Systematic: Imagine plants arranged on a large table. Randomly pick a row and column to start. Then, following a systematic route, pick, say, every 10th plant.
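The cluster, multi-stage, and systematic schemes for the greenhouse can be sketched side by side. The layout (12 tables of 40 plants) and all sample sizes are assumptions chosen for illustration. The systematic route here is simply table by table, with a random start among the first 10 plants.

```python
import random

random.seed(5)  # arbitrary seed, for reproducibility only

# Assumed layout: 12 tables, each holding 40 plants, identified as (table, plant).
N_TABLES, PLANTS_PER_TABLE = 12, 40

# Cluster: select tables at random, then examine every plant on them.
cluster_tables = random.sample(range(N_TABLES), 2)
cluster = [(t, p) for t in cluster_tables for p in range(PLANTS_PER_TABLE)]

# Multi-stage: select tables (first-stage units), then a few plants
# (second-stage units) on each selected table.
tables = random.sample(range(N_TABLES), 3)
multi_stage = [
    (t, p) for t in tables for p in random.sample(range(PLANTS_PER_TABLE), 4)
]

# Systematic: random start, then every 10th plant along a fixed route.
route = [(t, p) for t in range(N_TABLES) for p in range(PLANTS_PER_TABLE)]
STEP = 10
start = random.randrange(STEP)
systematic = route[start::STEP]

print(len(cluster), len(multi_stage), len(systematic))
```

The cluster sample measures many plants but from few tables, which matters when disease spreads within a table; the multi-stage sample spreads the same effort over more tables.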
23 What is measured?
Variable: Apt or liable to vary or change from individual to individual; capable of being varied or changed (factor); alterable; inconsistent; having much variation or diversity; a quantity that may assume any given value from a set of values (the variable's range).
24 Types of Variables: Categorical
- Categorical, classification, or qualitative variable
- Discrete; essentially describes some characteristic of a sample unit. E.g. color, gender, grade, health status, treatment group. Further subdivided into:
- nominal (think name): arithmetic doesn't make sense, e.g. gender M, F, even if coded 0, 1
- ordinal (think order): nominal data with order, e.g. grades A, B, C, D, F; strength of agreement 1 = strongly agree, 2 = agree, 3 = neutral, 4 = disagree, 5 = strongly disagree.
- In ordinal data the order is meaningful, but the difference between responses isn't. Also, arithmetic is sometimes done, but its meaning is debatable.
25 Types of Variables: Quantitative
- Quantitative or amount variable
- Can be either discrete or continuous; measures the amount or level of a characteristic of a sample unit. For example: age, weight, height, temperature, biomass, volume. Further subdivided into:
- interval: differences between values have meaning but there is no definite or meaningful zero point, e.g. GPA, SAT scores, temperature
- ratio: like interval but with a meaningful zero point, e.g. weight, money, yield.
In this course we will deal primarily with quantitative variables (ratio).
26 Study Design Questions
- How is the response (effect) to be measured?
- What characteristics of the response are to be analyzed?
- What factors influence the characteristics to be analyzed?
- Which of these factors will be studied in this investigation?
- How many times should the basic experiment be performed?
- What should be the form of the analysis?
- How large an effect (effect size) will be considered important?
- What resources are available for this study? Are they adequate?
It is important to be able to define the key terms in these questions: response, characteristics, factor, basic experiment, form of the analysis, effect size, and resources.
27 Terminology
- The response typically refers to the measured variable(s) of primary interest (e.g. weight, health status, growth, etc.).
- Characteristics: Is it change in the average response, the spread of responses, the maximum response, etc., that will be examined? These characteristics typically refer to some statistical aspect of effects measured among individuals in the populations being studied.
- A factor refers to the characteristic(s) that primarily differ among the populations being studied (compared). Some factors we cannot manipulate (e.g. descriptors such as gender, geographic location, genetic makeup). Other factors identify characteristics we have caused to be different between the two populations (as in an experiment where we manipulate the populations by giving them different treatments).
- Basic Experiment: The selecting of an individual for measurement. In an observational study, the basic experiment is the selection and measurement of an individual from the population. In an experimental study, the basic experiment is the selection of an individual from the pool, the application of a treatment, and the measurement of responses.
28 Terminology (Cont.)
- By the form of the analysis, we refer to the statistical procedure(s) that match the characteristics of the study design, the characteristics of the responses measured, and the estimates and hypothesis tests needed to answer the questions of interest. So, when someone asks "What form will your analysis take?" you might answer with something like "I will be using regression analysis (the statistical method) to explore associations between fat intake and cholesterol level (the hypotheses of interest) between two populations identified geographically and by gender (study design factors)."
- The size of the effect of interest refers to how big a difference there must be before I (or others) would conclude that there is a real difference. Typically we are interested in specifying this at the design phase of a study, since the size of the effect of interest drives the sample size question. Thus if you say a difference of less than 2 points in cholesterol level between gender groups would not be significant but anything greater than 2 is significant, you could use this to set the study sample size. If the difference were raised to 10 points, a much smaller sample size would be needed.
- Resources: Money, personnel, time, access, material.
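To see how the effect size drives the sample size, here is a minimal sketch using a common normal-approximation formula for comparing two group means. The formula choice, the standard deviation of 20 points, the 5% significance level, and the 80% power are all assumptions introduced for illustration; the slides do not specify them.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group sample size needed to detect a difference
    `delta` between two means (two-sided test, normal approximation):
    n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta) ** 2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


SIGMA = 20  # hypothetical standard deviation of cholesterol level

n_small_effect = n_per_group(delta=2, sigma=SIGMA)   # 2-point difference
n_large_effect = n_per_group(delta=10, sigma=SIGMA)  # 10-point difference
print(n_small_effect, n_large_effect)
```

Under this formula the required sample size scales with 1/delta squared, so raising the difference of interest from 2 to 10 points cuts the sample size by a factor of about 25.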
29 Experimental Study
- Manipulation Experiment: A research design in which the researcher deliberately introduces certain changes in the levels of factors that are hypothesized as affecting the process of interest, and then makes observations to determine the effect of these changes.
- Experimental Design: A study plan which assures that measurements will be relevant to the problem under study.
- Treatments: Changes to those factors which are suspected of affecting the process under study.
30 Example: Factorial Experiment
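In a factorial experiment, every combination of factor levels forms one treatment. Since the slide's own example is not recoverable, the sketch below uses two hypothetical factors (fertilizer and watering) and 18 hypothetical experimental units to show how treatments are built and randomly assigned.

```python
import itertools
import random

random.seed(6)  # arbitrary seed, for reproducibility only

# Hypothetical factors and their levels (not taken from the slide).
factors = {
    "fertilizer": ["none", "low", "high"],
    "watering": ["daily", "weekly"],
}

# Each treatment is one combination of factor levels: 3 x 2 = 6 treatments.
treatments = [
    dict(zip(factors, levels)) for levels in itertools.product(*factors.values())
]

# Randomly assign 18 experimental units (e.g. pots) so that each
# treatment is applied to exactly 3 units.
units = list(range(1, 19))
random.shuffle(units)
assignment = {unit: treatments[i % len(treatments)] for i, unit in enumerate(units)}

for unit in sorted(assignment):
    print(unit, assignment[unit])
```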
31 Standard Form for a Data Set
                       CATEGORIES                                 AMOUNTS
Observation  strata  gender  color  other categorical   weight  other quantitative
number                              variables                   variables
1            1       F       RED    x x ...             10.2    x x ...
2            1       F       WHITE  x x ...             12.9    x x ...
3            1       M       BLUE   x x ...             20.1    x x ...
.            .       .       .      .                   .       .
n            1       F       BLUE   x x ...             16.0    x x ...
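The standard form, one row per observation and one column per variable, maps directly onto a list of records. The sketch below uses only the three fully numbered rows from the slide; the short column names are assumptions made for the example.

```python
import statistics

# Column names follow the slide; the rows are observations 1 to 3 shown there.
columns = ["obs", "strata", "gender", "color", "weight"]
rows = [
    (1, 1, "F", "RED", 10.2),
    (2, 1, "F", "WHITE", 12.9),
    (3, 1, "M", "BLUE", 20.1),
]

# One record (dict) per observation: rows are units, columns are variables.
data = [dict(zip(columns, row)) for row in rows]

# A categorical variable is used to group; an amount variable is summarized.
weights_by_gender = {}
for record in data:
    weights_by_gender.setdefault(record["gender"], []).append(record["weight"])

mean_weight = {g: statistics.mean(w) for g, w in weights_by_gender.items()}
print(mean_weight)
```

Keeping categories and amounts in separate columns is what makes this kind of grouping and summarizing straightforward.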
32 Example Data Set in Spreadsheet Format
[Figure: spreadsheet screenshot with a callout labeled "Indicator of missing data".]
33 Inventor's Paradox
The more ambitious the plan, the more chances of
success, and the more opportunity for failure.
How does one decide on what to do?
- Are there open questions?
- Are there available resources?
- Does someone really want the answer?
- Can a study be done?
- Will the study be able to answer the question?
Statistics may help answer the last question!
34 The HIP Trial Revisited
- It seems natural to compare screened (cancer rate = 1.1) vs. refused (cancer rate = 1.5) within the treatment group: hardly a difference!
- But realize that this is an observational comparison (in an experimental study), and hence is prone to confounding.
- Social status is a confounder. Richer and better educated women were more likely to accept the screening, and breast cancer hits the richer harder than the poorer. (Pregnancy, esp. early pregnancy, is now known to protect against breast cancer.)
- So the analysis by treatment received is biased. But the analysis by intention-to-treat is appropriate.
- Intention to screen: cancer rate = 1.3.
- Control: cancer rate = 2.0.
- A sizeable difference.
- Five-year cancer rate ratio (treat/control) is 39/63 = 62%.
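The intention-to-treat figures can be checked against the group sizes given in the motivating example. The sketch below rests on assumptions that the slides do not state explicitly: that 39 and 63 are event counts in the treatment and control groups, that each group has 31,000 women, and that the quoted rates are per 1,000 women.

```python
# Assumptions for this check: 39 and 63 are event counts in the treatment and
# control groups, each group has 31,000 women, and rates are per 1,000 women.
GROUP_SIZE = 31_000
events_treatment, events_control = 39, 63

rate_treatment = events_treatment / GROUP_SIZE * 1000
rate_control = events_control / GROUP_SIZE * 1000
rate_ratio = events_treatment / events_control

print(f"intention-to-screen rate: {rate_treatment:.1f} per 1,000")
print(f"control rate:             {rate_control:.1f} per 1,000")
print(f"rate ratio:               {rate_ratio:.2f}")
```

Because the two groups are the same size, the ratio of rates equals the ratio of counts. Under these assumptions the computed rates round to the 1.3 and 2.0 quoted on the slide.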