STA616611 - PowerPoint PPT Presentation

Slides: 35
Provided by: kennethmp

Transcript and Presenter's Notes

Title: STA616611


1
Introduction and Data Gathering (Chapters 1 & 2)
  • At the end of this lecture, the student should:
  • Be able to provide a definition of Statistics.
  • Discuss the role of statistics in research.
  • Be able to state reasons for using statistics.
  • Identify the difference between observational and
    experimental studies.
  • Be able to organize data into a two-dimensional
    matrix or array.

I hear and I forget. I see and I understand.
I do and I remember. (Chinese proverb)
2
A Motivating Example: The HIP Trial
  • Breast cancer is a common malignancy among women in
    rich countries.
  • Mammography (screening) is today known to lead
    to fewer deaths.
  • HIP Trial (1960s): the first study to conclusively
    show the merits of screening.
  • 62,000 women aged 40-64 (members of the Health
    Insurance Plan, NY).
  • Randomized into treatment and control groups,
    31,000 in each.
  • Treatment: an invitation to 4 rounds of annual
    screening.
  • Control: received the usual health care prevalent at
    that time.

If we compare the cancer rates of the screened (1.1) vs. those who
refused (1.5), there's hardly a difference? (More later!)
3
What Is Statistics?
  • Descriptive Statistics. Summary measures, such as
    totals, averages or percentages of measurements,
    counts, or ranks. Graphics used to present,
    organize, and summarize data, e.g. pie-charts,
    histograms, boxplots, scatterplots, etc.
  • Inferential Statistics. The analysis and
    interpretation of data. Concerned with the
    extraction of information from data and its use
    in reaching conclusions (inferences) about a
    population from which the data are obtained. E.g.
    confidence intervals, hypothesis tests.
  • We will concentrate on inferential statistics, although the
    distinction will not always be clear.

4
Basic Definitions
  • Experimental unit. The basic object on which
    measurements are taken. (May be composed of
    measurement units.)
  • Factors. Variables in an experiment that are set
    by the investigator. (Controllable.)
  • Response. Variable that is observed in an
    experiment. (Not Controllable.)
  • Treatments. Conditions constructed from the
    factors in order to observe the impact on the
    response.
  • Control Treatment. Benchmark with respect to
    which the remaining treatments are compared.
  • Population. The set of all measurements of
    interest.
  • Sample. The subset of measurements from the
    population that is actually taken.
  • Statistic. A number calculated from the sample,
    e.g. the sample average, the sample variance.
  • Parameter. A number calculated from the entire
    population, e.g. the population average, the
    population variance.

5
Population vs. Sample
Using the sample average to make statements about
the population average is an example of
inferential statistics.
Descriptive statistical methods describe the
sample. Inferential statistical methods make
statements about the population based on the
sample.
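The distinction can be sketched in a few lines of Python. The population below is simulated (1,000 invented values, not data from the lecture); its average plays the role of the parameter, and the average of a random sample of 50 plays the role of the statistic used to make a statement about it.

```python
import random
import statistics

rng = random.Random(616)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 measurements (illustrative values only).
population = [rng.gauss(50.0, 10.0) for _ in range(1000)]

# Parameter: a number calculated from the entire population.
population_mean = statistics.mean(population)

# Statistic: a number calculated from a sample of 50 units drawn without replacement.
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"population mean (parameter): {population_mean:.2f}")
print(f"sample mean (statistic):     {sample_mean:.2f}")
```

In practice the population mean is unknown; only the sample mean is available, which is why inference is needed.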
6
First Principle of Statistical Inference
  • You make inference about the population from
    which the sample was obtained. (Seems obvious,
    but is often forgotten.)
  • In each of the examples below, identify the
    population being sampled and the inference being
    made:
  • Study cow grazing behavior. One cow (Daisy) in
    pasture (A). Randomly select time intervals for
    observation during the month of May.
  • Study capital punishment and homicide rates.
    Randomly select 100 US cities. Objective is to
    make causal statements about a process.
  • In a pilot study, 20 runs of a manufacturing
    process are carried out in the lab. Objective:
    find out how the process will work in large-scale
    production.
  • Study yield of 3 varieties of winter wheat.
    Randomly sample 30 farms in Kansas, 10 farms grow
    variety A, 10 variety B, and 10 variety C.
    Measure the yield per acre over one growing
    season.

7
Scientific Method
  • The pursuit of systematic interrelation of facts
    by logical arguments from accepted postulates,
    observation, and experimentation and a
    combination of these three in varying proportions.

Roles of Statistics
  • Aid in creating the "best" research design with
    which to generate new data.
  • Extract the information from the noise or
    variability at the data analysis step.

8
Logical Arguments
  • Deductive argument: The conclusion follows with
    logical necessity or certainty from the premises.
    Nothing new is revealed because we are arguing
    from the general to the specific.
  • Specialization: Moving from a large set of
    objects, postulates, or events, to consideration
    of a smaller set of objects or events.
  • Inductive argument: Discovering general laws by
    the observation and combination of particular
    instances. Passing from the specific to the
    general.
  • Generalization: Passing from the consideration of
    one object, postulate, or occurrence, to the
    consideration of a set of objects, postulates, or
    occurrences.

In statistics we attempt to formalize and use
these concepts in a quantitative way.
9
Scientific Progress
We gain knowledge by iterating between models and
data.
10
Basic Study Steps
  • State the problem. What are the questions?
  • Devise a plan of solution. What will I do?
  • Implement the plan. How do I do it?
  • Analysis of data. What happened?
  • Interpretation of results. What does this mean?
  • Reexamination. Is my logic correct? What next?

Study design and study implementation may require
iteration.
11
Graphical Depiction of Scientific Study
[Figure: graphical depiction of a scientific study, starting from the problem.]
12
Research Design Categories
  • Census (Complete Enumeration): Every individual
    in the population of interest is observed. In a
    census, the sample equals the population.
  • Observational Studies (Mensurative Experiments):
    Populations to be compared are defined, and
    individuals are randomly selected from these
    populations for measurement. This involves mere
    data collection, with no interference with the
    processes generating the data.
  • Experimental Studies (Manipulative Experiments):
    Individuals in one or more populations are
    carefully chosen or created to test specific
    manipulations under highly controlled conditions.
    Explanatory variables are manipulated; their
    effect on the response variable(s) is then
    observed.

13
Observational Study Design
  • Observational studies are of 3 varieties:
  • Sample survey: studies a population at a
    particular point in time.
  • Prospective study: observes a population in the
    present using a sample survey, and proceeds to
    follow subjects into the future.
  • Retrospective study: observes a population in the
    present using a sample survey, and collects data
    about the subjects on events in the past.
  • The possible presence of confounding variables
    poses a severe limitation in observational
    studies.
  • Confounder. A (non-measured) variable, other than
    the explanatory variable, that affects the
    response variable. Confounders may affect both
    response and explanatory variables, and are
    outside the control of the researcher.

14
Observational Study Design
  • Example: study lung cancer rates among smokers
    and non-smokers.
  • What are populations of interest?
  • How will individuals be selected for measurement?
  • What will be measured?
  • Which analyses will be performed?
  • How many individuals are needed?
  • How large an effect will be considered important?
  • Are available resources adequate for this study?

Many of these questions are answered by subject
matter experts; some can be answered by a
statistical analysis.
15
Observational Study (Mensurative Experiment)
What is measured?
16
How are individuals selected?
  • Individually identified (the sample unit).
  • Randomly chosen (no biases introduced in
    selection).
  • Each possible set of individuals has the same
    probability of selection (Simple Random Sampling).

Special situations allow for increased efficiency
of selection.
  • Stratification (account for an extraneous
    factor)
  • Clusters (select natural groups of sample units)
  • Multi-stage (select large units then parts of
    units)
  • Systematic (set pattern)

17
Simple Random Sampling
A researcher wishes to determine the prevalence
of a disease in a greenhouse of tomato seedlings.
Each seedling tested for the disease is destroyed
in the process, hence only a minimal number
should be tested. Expectations are that only
about .01 of the roughly 50,000 seedlings in the
greenhouse have the disease.
How to select a simple random sample?
  • Number each pot. Use a random number table (or
    spreadsheet random number generator) to produce a
    list of numbers, in random order from 1 to the
    total number of pots. Measure plants in pots
    whose numbers are selected (difficult).
  • Align pots in rows and columns. Use random number
    table to select a list of row and column number
    pairs. Measure plant in pots located in the (row,
    column) pair selected (easier).

Table 13 in Ott and Longnecker.
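Both selection schemes can be sketched in Python, with the standard library's random number generator standing in for the random number table. The function names, the 200 x 250 grid layout, and the sample size of 100 are assumptions made for illustration; only the figure of roughly 50,000 seedlings comes from the slide.

```python
import random

def simple_random_sample(n_units, n_sample, seed=None):
    """Return n_sample distinct pot numbers drawn from 1..n_units."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n_units + 1), n_sample))

def random_row_column_pairs(n_rows, n_cols, n_sample, seed=None):
    """Return n_sample distinct (row, column) positions from an n_rows x n_cols grid."""
    rng = random.Random(seed)
    # Sample grid cells without replacement, then convert each cell to a pair.
    cells = rng.sample(range(n_rows * n_cols), n_sample)
    return [(cell // n_cols + 1, cell % n_cols + 1) for cell in cells]

# Assumed layout: 50,000 pots arranged as 200 rows by 250 columns; sample of 100.
pots = simple_random_sample(50_000, 100, seed=1)
positions = random_row_column_pairs(200, 250, 100, seed=1)
print(pots[:5])
print(positions[:5])
```

Sampling whole grid cells, rather than drawing rows and columns separately, guarantees that no (row, column) pair is selected twice.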
18
Simple Random Sample
Textbook definition.
A simple random sample of n units is defined such
that each possible sample of size n is equally
likely to be drawn.
Practical definition.
This sampling principle assures that each unit in
the population has the same probability
(likelihood) of being selected in the sample.
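The link between the two definitions can be checked by brute force on a toy population. The sketch below (five made-up units, samples of size 2) lists every possible sample, treats them as equally likely, and computes each unit's probability of being included.

```python
from fractions import Fraction
from itertools import combinations

units = ["A", "B", "C", "D", "E"]  # toy population of N = 5 units
n = 2                              # sample size

# Every possible sample of size n; under simple random sampling
# each one has probability 1 / len(all_samples).
all_samples = list(combinations(units, n))

# Inclusion probability of a unit = share of samples that contain it.
inclusion = {
    u: Fraction(sum(u in s for s in all_samples), len(all_samples))
    for u in units
}

print(len(all_samples))
print(inclusion)
```

Every unit comes out with the same inclusion probability, n/N. The reverse does not hold in general: a design such as systematic sampling can give every unit the same chance of selection without making every possible sample equally likely.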
19
Stratification
Allows us to take into account a factor we
already know affects the response of interest, in order to
remove a source of known variability.
Example (pine forest): estimate expected yield from a plot.
Strata: 16 years, healthy; 20 years, diseased; 22 years, healthy.
Individuals are selected at random within each
stratum. Variability in the diseased subpopulation is
expected to be much greater than in the healthy areas.
Mean yield is greater at 22 years than at 16 years.
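A rough sketch of a stratified estimate for the pine forest example follows. Only the three stratum labels come from the slide. The stratum sizes, mean yields, and spreads are invented, and the equal allocation of 20 trees per stratum is an arbitrary choice.

```python
import random
import statistics

rng = random.Random(2024)

# Hypothetical strata: label -> (number of trees, mean yield, spread).
# All numbers are invented; the diseased stratum gets the largest spread.
strata = {
    "16 years, healthy": (400, 20.0, 2.0),
    "20 years, diseased": (300, 15.0, 8.0),
    "22 years, healthy": (300, 30.0, 2.0),
}

# Simulated yield for every tree in each stratum (a stand-in for the forest).
forest = {
    name: [rng.gauss(mu, sd) for _ in range(size)]
    for name, (size, mu, sd) in strata.items()
}

def stratified_mean(forest, n_per_stratum, rng):
    """Sample at random within each stratum; weight stratum means by stratum size."""
    total = sum(len(trees) for trees in forest.values())
    estimate = 0.0
    for trees in forest.values():
        sample = rng.sample(trees, n_per_stratum)
        estimate += (len(trees) / total) * statistics.mean(sample)
    return estimate

true_mean = statistics.mean(y for trees in forest.values() for y in trees)
estimate = stratified_mean(forest, n_per_stratum=20, rng=rng)
print(f"true mean {true_mean:.2f}, stratified estimate {estimate:.2f}")
```

Because the differences between strata are handled by the weights, only the variability within each stratum contributes to the uncertainty of the estimate.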
20
Clusters
Estimate the average sponge size on natural reefs.
[Figure: reefs labeled with the number of sponges on each reef: 9, 25, 12, 21, 14, 7, 5.]
Selecting sponges at random would be very
resource inefficient. Cheaper to select reefs
(sponge clusters) at random with probability
proportional to size. All sponges on selected
reefs are measured (a cheap thing to do that
increases the sample size easily).
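Selection with probability proportional to size can be sketched as below. It is assumed that the seven numbers in the figure (9, 25, 12, 21, 14, 7, 5) are the sponge counts of seven reefs. Drawing 3 reefs, and drawing them with replacement, are simplifications chosen for the example.

```python
import random

# Sponge counts per reef, read off the figure labels (assumed to be one count per reef).
reef_sizes = [9, 25, 12, 21, 14, 7, 5]
total = sum(reef_sizes)

# Probability that a given reef is picked on a single draw, proportional to its size.
draw_prob = [size / total for size in reef_sizes]

rng = random.Random(7)
# Draw 3 reefs with replacement, with probability proportional to size.
chosen = rng.choices(range(len(reef_sizes)), weights=reef_sizes, k=3)

for reef in chosen:
    print(f"reef {reef + 1}: measure all {reef_sizes[reef]} sponges "
          f"(draw probability {draw_prob[reef]:.3f})")
```

Under this with-replacement scheme, the plain average of the selected reefs' mean sponge sizes is an unbiased estimate of the overall mean sponge size, because larger reefs are already favored at the selection step.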
21
Multi-Stage Sampling
Typically large areas or large complex
populations can be more effectively sampled in
stages. At the first stage, natural or synthetic
clusters are selected. At subsequent stages the
selected clusters are subdivided into units and
samples of these are selected.
Example: national crop yield survey.
22
Greenhouse Example
Stratification: Maybe we have observed that
plants near the door seem less healthy than those
further into the greenhouse. Divide the room into plants
near the door and plants inside. Take random samples
from each stratum.
Cluster: Suppose plants are
arranged on tables. We could select tables at
random, then examine all plants on each table
selected. Note that if one plant on a table is
diseased, all plants on that table have an increased
probability of also being diseased.
Multi-Stage:
Again suppose plants are on tables. Select some
tables at random. Next select a few plants from
each selected table for testing. The first-stage unit
is the table. The second-stage unit is the plant.
A third-stage unit could be the leaf on the plant,
etc.
Systematic: Imagine plants arranged on a
large table. Randomly pick a row and column to
start. Then, following a systematic route, pick,
say, every 10th plant.
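The cluster, multi-stage, and systematic schemes can be sketched side by side. The layout of 12 tables with 40 plants each and the numbers of tables and plants drawn are invented for illustration. The systematic route is simplified to walking the tables in order.

```python
import random

rng = random.Random(3)

# Assumed layout: 12 tables with 40 plants each (numbers invented for illustration).
tables = {t: [f"table{t}-plant{p}" for p in range(1, 41)] for t in range(1, 13)}

# Cluster sample: pick 3 tables at random, examine every plant on them.
cluster_tables = rng.sample(sorted(tables), 3)
cluster_sample = [plant for t in cluster_tables for plant in tables[t]]

# Two-stage sample: pick 3 tables, then 5 plants at random from each.
stage1 = rng.sample(sorted(tables), 3)
two_stage_sample = [plant for t in stage1 for plant in rng.sample(tables[t], 5)]

# Systematic sample: random start among the first 10 plants, then every 10th plant.
all_plants = [plant for t in sorted(tables) for plant in tables[t]]
start = rng.randrange(10)
systematic_sample = all_plants[start::10]

print(len(cluster_sample), len(two_stage_sample), len(systematic_sample))  # 120 15 48
```

The cluster sample is large but its plants share tables, so, as noted above, their disease states are likely to be related. The two-stage sample trades size for spread across more of each table's variation.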
23
What is measured?
Variable: Apt or liable to vary or change from
individual to individual; capable of being varied
or changed (factor); alterable; inconsistent;
having much variation or diversity; a quantity
that may assume any given value from a set of
values (the variable's range).
24
Types of Variables: Categorical
  • Categorical, classification, or qualitative
    variable.
  • Discrete; essentially describes some
    characteristic of a sample unit, e.g. color,
    gender, grade, health status, treatment group.
    Further subdivided into:
  • nominal (think "name"): arithmetic doesn't make
    sense, e.g. gender (M, F), even if coded 0, 1.
  • ordinal (think "order"): nominal data with order,
    e.g. grades A, B, C, D, F; strength of agreement
    1 = strongly agree, 2 = agree, 3 = neutral,
    4 = disagree, 5 = strongly disagree.
  • In ordinal data the order is meaningful, but the
    difference between responses isn't. Also,
    arithmetic is sometimes done, but its meaning is
    debatable.

25
Types of Variables: Quantitative
  • Quantitative or amount variable.
  • Can be either discrete or continuous; measures
    the amount or level of a characteristic of a
    sample unit, e.g. age, weight, height,
    temperature, biomass, volume. Further subdivided
    into:
  • interval: differences between values have
    meaning, but there is no definite or meaningful
    zero point, e.g. GPA, SAT scores, temperature.
  • ratio: like interval, but with a meaningful zero
    point, e.g. weight, money, yield.

In this course we will deal primarily with
quantitative variables (ratio).
26
Study Design Questions
  • How is the response (effect) to be measured?
  • What characteristics of the response are to be
    analyzed?
  • What factors influence the characteristics to be
    analyzed?
  • Which of these factors will be studied in this
    investigation?
  • How many times should the basic experiment be
    performed?
  • What should be the form of the analysis?
  • How large an effect (effect size) will be
    considered important?
  • What resources are available for this study? Are
    they adequate?

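It is important to be able to define the key terms
in these questions: response, characteristics, factors,
basic experiment, form of the analysis, effect size,
and resources.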
27
Terminology
  • The response typically refers to the measured
    variable(s) of primary interest (e.g. weight,
    health status, growth, etc).
  • Characteristics. Is it a change in the average
    response, the spread of responses, the maximum
    response, etc, that will be examined? These
    characteristics typically refer to some
    statistical aspect of effects measured among
    individuals in the populations being studied.
  • A factor refers to the characteristic(s) that
    primarily differ among the populations being
    studied (compared). Some factors we cannot
    manipulate (e.g. descriptors such as gender,
    geographic location, genetic makeup). Other
    factors identify characteristics we have caused
    to be different between the two populations (as
    in an experiment where we manipulate the
    populations by giving them different
    treatments).
  • Basic experiment. The selection of an individual
    for measurement. In an observational study, the
    basic experiment is the selection and measurement
    of an individual from the population. In an
    Experimental Study, the basic experiment is the
    selection of an individual from the pool, the
    application of a treatment, and the measurement
    of responses.

28
Terminology (Cont.)
  • By the form of the analysis, we refer to the
    statistical procedure(s) that match the
    characteristics of the study design, the
    characteristics of the responses measured and the
    estimates and hypothesis tests needed to answer
    the questions of interest. So, when someone asks
    "What form will your analysis take?" you might
    answer with something like "I will be using
    regression analysis (the statistical method) to
    explore associations between fat intake and
    cholesterol level (the hypotheses of interest)
    between two populations identified geographically
    and by gender (study design factors)."
  • The size of the effect of interest refers to how
    big a difference there must be before I (or
    others) would conclude that there is a real
    difference. Typically we are interested in
    specifying this at the design phase of a study,
    since the size of the effect of interest drives
    the sample size question. Thus if you say that a
    difference of less than 2 points in cholesterol
    level between gender groups would not be
    important but anything greater than 2 would be,
    you could use this to set the study
    sample size. If the difference were raised to 10
    points, a much smaller sample size would be
    needed.
  • Resources Money, personnel, time, access,
    material.
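To see how strongly the effect size drives the sample size, here is a sketch that uses one common textbook approximation for comparing two group means: n per group is about 2 (sigma (z_alpha + z_beta) / delta)^2. The formula choice, the 5% two-sided significance level, the 80% power, and the standard deviation of 15 points are all assumptions for illustration; the lecture itself supplies only the differences of 2 and 10 points.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group sample size for comparing two means (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)

sigma = 15.0  # assumed standard deviation of cholesterol level; not given in the lecture
for delta in (2, 10):
    print(f"difference of {delta} points: about {n_per_group(delta, sigma)} per group")
```

Under this approximation the required sample size is proportional to 1/delta^2, so raising the difference of interest from 2 to 10 points cuts the sample size by a factor of about 25.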

29
Experimental Study
  • Manipulation Experiment: A research design in
    which the researcher deliberately introduces
    certain changes in the levels of factors that are
    hypothesized as affecting the process of
    interest, and then makes observations to
    determine the effect of these changes.
  • Experimental Design: A study plan which assures
    that measurements will be relevant to the problem
    under study.
  • Treatments: Changes to those factors which are
    suspected of affecting the process under study.

30
Example: Factorial Experiment
31
Standard Form for a Data Set
                          CATEGORIES                                    AMOUNTS
Observation Number  strata  gender  color  other categorical variable  weight  other quantitative variable
1                   1       F       RED    x x ...                     10.2    x x ...
2                   1       F       WHITE  x x ...                     12.9    x x ...
3                   1       M       BLUE   x x ...                     20.1    x x ...
.                   .       .       .      .                           .       .
n                   1       F       BLUE   x x ...                     16.0    x x ...
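One way to hold a data set in this standard form in Python is a list with one dictionary per observation (row) and one key per variable (column). The values below are the ones shown on the slide. The key names are chosen for illustration, and the columns the slide leaves as "x x ..." are omitted.

```python
from statistics import mean

# One dict per observation (row); keys are the variables (columns).
# "n" stands for the last observation number, as on the slide.
data = [
    {"obs": 1, "strata": 1, "gender": "F", "color": "RED", "weight": 10.2},
    {"obs": 2, "strata": 1, "gender": "F", "color": "WHITE", "weight": 12.9},
    {"obs": 3, "strata": 1, "gender": "M", "color": "BLUE", "weight": 20.1},
    {"obs": "n", "strata": 1, "gender": "F", "color": "BLUE", "weight": 16.0},
]

# A column is one variable's values across all rows.
weights = [row["weight"] for row in data]
colors = [row["color"] for row in data]

# Arithmetic suits an amount; counting suits a category.
print(mean(weights))
print({c: colors.count(c) for c in sorted(set(colors))})
```

Keeping every observation in one row and every variable in one column is what most statistical software expects as input.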
32
Example Data Set in Spreadsheet Format
[Figure: example data set in spreadsheet format, with the indicator of missing data marked.]
33
Inventor's Paradox
The more ambitious the plan, the more chances of
success, and the more opportunity for failure.
How does one decide on what to do?
  • Are there open questions?
  • Are there available resources?
  • Does someone really want the answer?
  • Can a study be done?
  • Will the study be able to answer the question?
Statistics may help answer the last question!
34
The HIP Trial Revisited
  • Seems natural to compare screened (cancer
    rate = 1.1) vs. refused (cancer rate = 1.5) in the
    treatment group: hardly a difference!
  • But realize that this is an observational
    comparison (in an experimental study), and hence
    is prone to confounding.
  • Social status is a confounder. Richer and better
    educated women were more likely to accept the
    screening, and breast cancer hits the richer
    harder than the poorer. (Pregnancy, esp. early
    pregnancy, is now known to protect against breast
    cancer.)
  • So the analysis by treatment received is biased.
    But the analysis by intention-to-treat is
    appropriate.
  • Intention to screen: cancer rate = 1.3.
  • Control: cancer rate = 2.0.
  • A sizeable difference.
  • Five-year cancer rate ratio (treat/control) is
    39/63 ≈ 62%.
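The intention-to-treat numbers can be checked with a little arithmetic. The sketch reads the ratio on the last bullet as 39 events in the treatment arm against 63 in the control arm, and expresses each as a rate per 1,000 women. Both readings are interpretations, not stated explicitly on the slide; the group size of 31,000 per arm is from the earlier HIP slide.

```python
group_size = 31_000     # women randomized to each arm (from the lecture)
events_treatment = 39   # assumed: numerator of the treat/control ratio
events_control = 63     # assumed: denominator of the treat/control ratio

rate_treatment = 1000 * events_treatment / group_size  # events per 1,000 women
rate_control = 1000 * events_control / group_size
ratio = events_treatment / events_control              # equal group sizes cancel

print(f"treatment: {rate_treatment:.1f} per 1,000")  # treatment: 1.3 per 1,000
print(f"control:   {rate_control:.1f} per 1,000")    # control:   2.0 per 1,000
print(f"ratio:     {ratio:.0%}")                     # ratio:     62%
```

Under these assumptions the computed rates reproduce the 1.3 and 2.0 quoted for the two randomized groups, which supports reading the rates as events per 1,000 women.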