Introduction to Clinical Investigation - PowerPoint PPT Presentation


Transcript and Presenter's Notes

1
Introduction to Clinical Investigation
Analyzing the Data: Applied Biostatistics
October 22, 2003
James Patrie, MS, Senior Biostatistician
Department of Health Evaluation Sciences
University of Virginia Health Science Center
Charlottesville, Virginia
2
Goal of Data Analysis
The goal of data analysis is to use the
information in a sample of data, from a defined
study population, to make valid statements about
the study population. This is accomplished by
reducing the sample of data to a small number of
summary measures. Together these summary
measures retain sufficient information to allow
features of the study population to be estimated.

3
Analysis of the Marathon Completion Times
of Four US Marathons: A Case Study
4
Study Population
The study population will
consist of those marathon runners who completed
either the 2002 New York Marathon, the 2002
Chicago Marathon, the 2002 Twin Cities
Marathon, or the 2002 Philadelphia Marathon.
As inclusion criteria, the runners had to be at
least 18 years of age at the time of the marathon
and they had to have completed their respective
marathon in a time of no more than 6 hours.
5
Four Study Populations

Marathon        Runners   Excluded (Ex)   Included
New York          31184             990      30194
Chicago           31120             866      30254
Philadelphia       4565              88       4477
Twin Cities        6641              41       6600
6
Available Data
7
Questions of Interest

Were the marathon completion times similar across
the four different marathon study
populations? By how many minutes would we
estimate the mean marathon completion times to
differ between these four populations of
marathon runners?
8
Steps in the Statistical Analysis Process

There are three steps in the statistical
analysis process.
I. Data description.
II. Data analysis.
III. Data analysis interpretation.
9
Important Definitions
A sample is a collection of values of one or
more characteristics of the study population
(data set).
An element is a member of the sample (runner).
A variable is a quantity that varies from one
member of the sample to the next (marathon time).
A parameter is a numeric characteristic of the
study population (population mean marathon time).
A statistic is a numeric characteristic of a
sample (the mean marathon time among the runners
sampled).
10
Important Definitions
The empirical frequency distribution of a
variable is a listing of the values or the
range of values of the variable together with
the frequencies with which these values or
range of values occur in the sample. The
relative empirical frequency distribution is
the empirical frequency distribution divided by
the sample size.
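
The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original analysis: the ten-runner sample and the helper name `frequency_distribution` are hypothetical.

```python
from collections import Counter


def frequency_distribution(values):
    """Return {value: (count, relative frequency)} for a sample."""
    counts = Counter(values)
    n = len(values)
    return {value: (count, count / n) for value, count in counts.items()}


# Hypothetical sample of the 'marathon' variable for ten runners.
sample = [
    "New York", "Chicago", "Chicago", "Twin Cities", "New York",
    "Philadelphia", "New York", "Chicago", "New York", "Twin Cities",
]

distribution = frequency_distribution(sample)
for marathon, (count, relative) in sorted(distribution.items()):
    print(f"{marathon:<13} {count:>3} {relative:.2f}")
```

The counts form the empirical frequency distribution; dividing each count by the sample size gives the relative empirical frequency distribution, whose entries sum to 1.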
11
Step I Data Description
The goal of data description is to describe the
empirical frequency distribution of the observed
values of the variable. The manner of the
description will be dependent on the class of
the variable. There are essentially two classes
of variables: qualitative variables and
quantitative variables.
12
Qualitative Variables
Qualitative (categorical) variables are
intrinsically non-numeric, and two types of
qualitative variables are distinguishable:
nominal variables and ordinal variables.
Nominal variables, such as gender (female, male)
and marathon (New York, Chicago, Twin Cities,
and Philadelphia), have categories that have no
natural order of rank.
Ordinal variables, such as the runner's age class
(18-34, 35-39, 40-44, 45-49, 50-54, and 55+),
have categories that have a natural order of rank.
13
Qualitative Variable Description
Nominal and ordinal variables are described by
their respective:
Empirical frequency distribution.
Relative empirical frequency distribution.
14
(No Transcript)
15
(No Transcript)
16
(No Transcript)
17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
Quantitative Variables
Quantitative variables are intrinsically numeric,
and two types of quantitative variables are
distinguishable: discrete variables and
continuous variables.
Discrete variables, such as the number of previous
marathons the runner had completed (count: 0, 1,
2, ...), take on a limited number of unique values.
Continuous variables, such as the runner's age and
the runner's marathon finishing time, take on a
large or an infinite number of unique values.
21
Quantitative Variable Description
Discrete and continuous variables are
generally described by one of the following types
of statistics.
Statistics derived from the sample moments of the
empirical frequency distribution.
Statistics derived from the percentiles of the
empirical frequency distribution.
22
Statistics Derived from Sample Moments
The arithmetic mean is a statistic that is
derived from the first sample moment of the
empirical frequency distribution. The
arithmetic mean is computed by the following
formula.
$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$
Here $y_i$ is the $i$th observed value and $n$ is
the sample size.
The arithmetic mean estimates the central
location of the distribution of the values of
the variable within the study population.
23
The standard deviation is a statistic that is
derived from the second sample moment of the
empirical frequency distribution. The standard
deviation is computed by the following
formula.
$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2}$
The standard deviation estimates the degree of
the dispersion of the observations about the
mean of the distribution of the values of the
variable within the study population.
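
As a minimal sketch of these two moment-based statistics, the following Python computes the arithmetic mean and the standard deviation for a small sample. The completion times are hypothetical, and the divisor n - 1 in the standard deviation is an assumption (the usual sample convention); the transcript does not show which divisor the slide used.

```python
import math
import statistics


def sample_mean(values):
    """Arithmetic mean: sum of the observations divided by n."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation, assuming the n - 1 divisor."""
    mean = sample_mean(values)
    sum_of_squares = sum((y - mean) ** 2 for y in values)
    return math.sqrt(sum_of_squares / (len(values) - 1))


# Hypothetical marathon completion times in hours.
times_h = [3.2, 3.9, 4.1, 4.4, 4.9]

print(f"mean = {sample_mean(times_h):.3f} h")
print(f"sd   = {sample_sd(times_h):.3f} h")

# The standard library gives the same results with the same convention.
print(statistics.mean(times_h), statistics.stdev(times_h))
```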
24
(No Transcript)
25
Statistics Derived from Sample Percentiles
The Pth percentile of a sample of n
observations is the value of the variable with
rank (P/100)(n + 1). The 50th percentile of a
sample of n observations is referred to as
the sample median. The sample median
estimates the central location of the
distribution of the values of the variable
within the study population.
26
The 25th percentile of a sample of n
observations is referred to as the lower
quartile, and the 75th percentile of a sample
of n observations is referred to as the upper
quartile. The difference between the upper
quartile value and the lower quartile value is
referred to as the interquartile range of the
frequency distribution. The interquartile
range estimates the degree of the dispersion
of the observations within the middle 50% of
the distribution of the values of the variable
within the study population.
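
The rank rule for percentiles can be sketched as follows. The helper `percentile` and the seven completion times are hypothetical, and the linear interpolation used when the rank is not a whole number is an assumption; the slides do not say how fractional ranks are handled.

```python
def percentile(values, p):
    """Value with rank (p/100)(n + 1) in the ordered sample.

    Fractional ranks are resolved by linear interpolation between the
    two neighbouring order statistics (an assumed convention).
    """
    ordered = sorted(values)
    n = len(ordered)
    rank = (p / 100) * (n + 1)       # 1-based rank
    rank = min(max(rank, 1), n)      # keep the rank inside the sample
    lower = int(rank)                # whole part of the rank
    fraction = rank - lower
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


# Hypothetical marathon completion times in hours (n = 7).
times_h = [4.3, 3.1, 5.6, 3.8, 4.0, 3.5, 4.7]

lower_quartile = percentile(times_h, 25)
median = percentile(times_h, 50)
upper_quartile = percentile(times_h, 75)
interquartile_range = upper_quartile - lower_quartile

print(lower_quartile, median, upper_quartile, round(interquartile_range, 2))
```

With n = 7 the ranks are 2, 4, and 6, so the quartiles and the median fall exactly on observed values.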

27
(No Transcript)
28
(No Transcript)
29
(No Transcript)
30
(No Transcript)
31
(No Transcript)
32
(No Transcript)
33
(No Transcript)
34
(No Transcript)
35
(No Transcript)
36
(No Transcript)
37
(No Transcript)
38
Choosing Between Statistics
Robustness: The robustness of a statistic is
related to the statistic's resistance to being
affected by extreme values. The arithmetic mean
is a nonrobust statistic, while the median is a
robust statistic. If the empirical distribution
is skewed or extreme values are present, the
median will provide a better measure of central
location.
Summarizing Capability: The arithmetic mean is a
more appropriate statistic if the data can be
described by a particular mathematical model
such as the Normal (Gaussian) distribution.
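
A small numerical sketch of the robustness point, using hypothetical completion times: replacing one typical value with an extreme one moves the mean but leaves the median unchanged.

```python
import statistics

# Hypothetical completion times in hours.
typical = [3.4, 3.6, 3.8, 4.0, 4.2]
# Same sample, but the slowest runner finished in 5.9 h instead of 4.2 h.
with_slow_runner = [3.4, 3.6, 3.8, 4.0, 5.9]

mean_shift = statistics.mean(with_slow_runner) - statistics.mean(typical)
median_shift = statistics.median(with_slow_runner) - statistics.median(typical)

print(f"shift in mean:   {mean_shift:.2f} h")
print(f"shift in median: {median_shift:.2f} h")
```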
39
Case Study: Marathon Completion Time
40
Step II Data Analysis
There are two distinct types of data analysis
methods.
Nonparametric methods.
Parametric methods.
41
Nonparametric Methods
Nonparametric methods are typically utilized when
the form of the distribution of the values of
the variable within the study population is
assumed unknown or thought not to be easily
characterized by a few parameters. Nonparametric
procedures are generally used when the sample
size is too small to make a reliable decision
about the shape of the distribution of the
values of the variable within the study
population.
42
Parametric Methods
Parametric methods are typically utilized when
the form of the distribution of the values of
the variable within the study population is
assumed known and easily characterized by a few
parameters. Parametric methods are generally
used when the sample size is large enough to
make a reliable decision about the shape of the
distribution of the values of the variable
within the study population.
43
Case Study
Since the sample size is large (n = 71525), there is
sound theoretical justification (central limit
theorem) to assume the 4 sample means will be
approximately normally distributed. Therefore, a
parametric data analysis method that utilizes
the population mean as the parameter of
comparison would be well suited to address the
study objectives.
44
Parametric Data Analysis
Parametric data analysis involves the following
five steps.
  • Model formulation.
  • Parameter estimation.
  • Hypothesis testing.
  • Mean separation.
  • Confidence interval construction.

45
I. Model Formulation
Analysis of variance (ANOVA) is a linear model
in which the response variable is a continuous
variable and all of the predictor variables
are categorical. Each categorical variable
(marathon) is referred to as a factor and each
category within a factor is referred to as a
level of the factor (New York). ANOVA
estimates a study population mean for each
level of the factor and provides a global test
for equal means across all levels of the
factor.
46
One-Way ANOVA Layout
47
One-Way ANOVA Linear Model
$y_{ij} = \mu + \tau_j + \varepsilon_{ij}$,
where $\mu_j = \mu + \tau_j$ is the mean of study
population $j$.
48
Model Assumptions
(Figure: the group effects $\tau_1, \ldots, \tau_4$,
the group means $\mu_1, \ldots, \mu_4$, and the
overall mean $\mu$.)
49
Marathon Completion Time
50
II. Parameter Estimation
The ANOVA model parameters are estimated by the
method of least squares. The least squares
estimates for the $\mu_j$ are unbiased and have the
minimum variance of all linear unbiased estimators
if the $\varepsilon_{ij}$ are independent and have
homogeneous variance. Note that this method of
estimation is inappropriate if the
$\varepsilon_{ij}$ are correlated, such as when
measurements are obtained from the same
observational unit on multiple occasions.

51
III. Hypothesis Testing
The hypothesis testing procedure involves
the following three steps.
Based on the study objective, we formulate a null
hypothesis ($H_0$), which generally we wish to
reject.
We select a statistical test that will allow us
to test the hypothesis.
We specify the probability of making an incorrect
decision to reject the null hypothesis ($H_0$)
when we should not.
52
Case Study
The study objective is to determine whether the
marathon completion times were similar across
the four marathon study populations. As the
null hypothesis, we state that the mean
marathon completion time is equal for the four
study populations, $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$.
As the alternative hypothesis, we state that the
mean marathon completion time is not equal for
at least two of the study populations,
$H_a: \mu_j \neq \mu_{j'}$.
53
Hypotheses
$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
$H_a: \mu_j \neq \mu_{j'}$ for at least one $j \neq j'$
54
Test Statistic
If the assumptions of the ANOVA model are valid,
we can formulate a test statistic F that is a
function of the ratio of two chi-square random
variables. Under the null hypothesis
$\mu_1 = \mu_2 = \ldots = \mu_t$, the F-statistic
has a known probability distribution, the F
distribution with t-1 and N-t degrees of freedom,
where t is the number of groups and N is the
total sample size.
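
The F-statistic can be sketched as the ratio of the between-group mean square to the within-group mean square. The function below is a minimal illustration, not the software used for the case study; the function name and the three small groups of completion times are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, numerator df, denominator df) for a one-way ANOVA."""
    t = len(groups)                              # number of groups
    n_total = sum(len(g) for g in groups)        # total sample size N
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group means about the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    )
    # Within-group sum of squares: observations about their group mean.
    ss_within = sum(
        (y - m) ** 2 for g, m in zip(groups, means) for y in g
    )

    ms_between = ss_between / (t - 1)
    ms_within = ss_within / (n_total - t)
    return ms_between / ms_within, t - 1, n_total - t


# Hypothetical completion times (hours) for three small groups of runners.
groups = [
    [4.6, 4.3, 4.9, 4.4],
    [4.2, 4.5, 4.1, 4.4],
    [4.0, 4.3, 3.9, 4.2],
]

f_value, df_num, df_den = one_way_anova_f(groups)
print(f"F({df_num}, {df_den}) = {f_value:.2f}")
```

In the case study, t = 4 and N = 71525, which gives the 3 and 71521 degrees of freedom of the null F distribution.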
55
(No Transcript)
56
Type I Error Rate
The type I error rate ($\alpha$) of a statistical
test is the probability that the test leads to
falsely rejecting the null hypothesis ($H_0$) when
$H_0$ is true. The type I error rate should always
be set at the study design stage.
Traditionally, $\alpha$ is set at either the 0.05
or 0.01 level. For a given type I error rate
$\alpha$, we can find a critical F value such that
if the observed value of the F statistic exceeds
the critical value, the type I error rate for a
false rejection of
$H_0: \mu_1 = \mu_2 = \ldots = \mu_t$ will be
$\leq \alpha$.
57
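Null F(3, 71521) Probability Distribution
(Figure: density of the null F distribution
against the F-value. The critical value
$F_c = F(1 - 0.05, 3, 71521) = 2.605$ separates the
acceptance region, with probability 1 - 0.05,
from the rejection region, with probability 0.05.)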
58
Case Study: One-Way ANOVA
ANOVA summary for marathon completion time.
The P-value is interpreted as the probability,
under the null hypothesis
$\mu_1 = \mu_2 = \mu_3 = \mu_4$, of observing an
F-statistic at least as extreme as the one observed.
59
IV. Mean Separation
If the null hypothesis that
$\mu_1 = \mu_2 = \ldots = \mu_t$ is rejected, it
implies that the mean response differs between at
least two of the groups. The process of
determining which group means differ is referred
to as mean separation. The objective of mean
separation is to determine which pairs of means
differ while still maintaining an overall type I
error rate $\leq \alpha$.

60
Steps In Mean Separation
The process of mean separation includes the
following three steps.
Selecting an appropriate multiple comparison test.
Selecting a multiple type I error rate
adjustment criterion.
Selecting the appropriate critical value of the
test.
61
I. Multiple Comparison Test
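The multiple comparison test procedure is
carried out by performing the equivalent of
multiple Student t-tests, where the test
statistic is defined as

$t = \frac{\bar{y}_j - \bar{y}_{j'}}{\sqrt{MSE \, (1/n_j + 1/n_{j'})}}$

Here $\bar{y}_j$ and $n_j$ are the sample mean and
sample size of group $j$, and MSE is the
within-group mean square from the ANOVA.
Under the null hypothesis $\mu_j = \mu_{j'}$, the
test statistic t has a known probability
distribution, the t distribution with N-t degrees
of freedom.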
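
A minimal sketch of one pairwise comparison, assuming the standard error is built from the pooled within-group mean square (MSE) of the ANOVA, which is what N-t degrees of freedom implies. The function name and all numbers are hypothetical.

```python
import math


def pairwise_t(mean_j, mean_k, n_j, n_k, mse):
    """t-statistic for comparing two group means with a pooled MSE."""
    standard_error = math.sqrt(mse * (1 / n_j + 1 / n_k))
    return (mean_j - mean_k) / standard_error


# Hypothetical inputs: group means in hours, 50 runners per group,
# and a within-group mean square of 0.25 squared hours.
t_statistic = pairwise_t(mean_j=4.5, mean_k=4.2, n_j=50, n_k=50, mse=0.25)
print(f"t = {t_statistic:.2f}")
```

Using the pooled MSE rather than the two groups' own variances is what ties each comparison to the fitted ANOVA model.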
62
II. Multiple Comparison Adjustment

To maintain a pre-specified type I error rate
($\alpha$) over interrelated hypothesis tests, a
more stringent criterion is required to reject the
null hypothesis $\mu_j = \mu_{j'}$. The Bonferroni
type I error rate adjustment is widely used when
the number of hypothesis tests is small. The
Bonferroni criterion states that to maintain a
pre-specified type I error rate of $\alpha$ over
multiple hypothesis tests, the original type I
error rate must be divided by the number of
hypothesis tests, or equivalently by the total
number of comparisons ($\alpha/c$).
63
(No Transcript)
64
III. Selecting the Appropriate Critical Value
A two-sided null hypothesis $\mu_j = \mu_{j'}$ has
a single alternative hypothesis
$\mu_j \neq \mu_{j'}$. For a two-sided hypothesis
test with an adjusted type I error rate
($\alpha/c$) we can find a critical t-value
($t_c$), such that if the absolute value of the
observed t-statistic exceeds $t_c$ we reject the
null hypothesis $\mu_j = \mu_{j'}$.
65
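Null t(N-t) Probability Distribution
$H_0: \mu_j = \mu_{j'}$
$H_a: \mu_j \neq \mu_{j'}$
(Figure: density of the null t distribution
against the t-value. The critical values
$\pm t_c$, with $t_c = t(1 - \alpha/(2c), N - t)$,
bound the acceptance region, which has probability
$1 - \alpha/c$. Each rejection region in the tails
has probability $\alpha/(2c)$.)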
66
Case Study Example
Since the study objective is to determine which
of the mean marathon completion times differ,
regardless of the direction of the difference,
the alternative hypothesis is two-sided,
$\mu_j \neq \mu_{j'}$. Since there are 4 marathons,
6 pairwise hypothesis tests will have to be
conducted. To maintain an overall type I error
rate of 0.05 will require the comparison type I
error rate for any single test to be
$0.05/6 \approx 0.008$.
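
The arithmetic of the Bonferroni adjustment for the case study can be sketched as follows. The helper name is hypothetical, and the critical value is obtained from the standard normal distribution as an approximation, on the assumption that a t distribution with 71521 degrees of freedom is practically indistinguishable from it.

```python
import math
from statistics import NormalDist


def bonferroni(alpha, n_groups):
    """Return (number of pairwise comparisons, per-test alpha)."""
    comparisons = math.comb(n_groups, 2)
    return comparisons, alpha / comparisons


comparisons, alpha_per_test = bonferroni(alpha=0.05, n_groups=4)

# Two-sided test: half of the per-test alpha goes in each tail.
# Normal approximation to the t distribution with 71521 df (assumption).
t_critical = NormalDist().inv_cdf(1 - alpha_per_test / 2)

print(f"comparisons    = {comparisons}")
print(f"per-test alpha = {alpha_per_test:.4f}")
print(f"critical value = {t_critical:.2f}")
```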
67
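Null t(71521) Probability Distribution
$H_0: \mu_j = \mu_{j'}$
$H_a: \mu_j \neq \mu_{j'}$
(Figure: density of the null t distribution
against the t-value, from -5 to 5. The critical
values $\pm t_c$, with
$t_c = t(1 - 0.05/12, 71521) = 2.64$, bound the
acceptance region, which has probability
1 - 0.05/6. Each rejection region in the tails
has probability 0.05/12.)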
68
Mean Separation
The P-value represents the probability, under the
null hypothesis $\mu_j = \mu_{j'}$, of obtaining a
t-statistic at least as extreme as the one
observed.
69
V. Confidence Interval Construction
A confidence interval for a parameter $\theta$
represents a plausible range of values for the
parameter $\theta$.
An interval (L, U) is a $100(1-\alpha)\%$
confidence interval for the parameter $\theta$ if
$P(L \leq \theta \leq U) = 1 - \alpha$.
The quantity $1 - \alpha$ is called the confidence
coefficient or the confidence level.
70
Case Study
To compute a 95% confidence interval for the true
difference between the mean marathon times of
any two marathon study populations, we invert the
t-test and evaluate the formula at $t = t_c$.
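
Inverting the t-test amounts to taking the estimated difference plus or minus the critical value times the standard error of the difference. The sketch below uses the 16.4 min New York versus Philadelphia difference and the critical value 2.64 from the presentation; the standard error of 0.66 min is an assumed value chosen only for illustration and is not reported in the slides.

```python
def confidence_interval(difference, standard_error, t_critical):
    """Interval: difference +/- t_critical * standard_error."""
    half_width = t_critical * standard_error
    return difference - half_width, difference + half_width


# difference and t_critical come from the presentation;
# standard_error is a hypothetical value for illustration only.
lower, upper = confidence_interval(
    difference=16.4, standard_error=0.66, t_critical=2.64
)
print(f"95% CI: ({lower:.1f}, {upper:.1f}) min")
```

Because the critical value is the Bonferroni-adjusted one, the resulting intervals hold the 95% confidence level jointly over all six comparisons rather than for each comparison separately.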
71
Estimated Difference in Marathon Completion Time
72
(No Transcript)
73
Step III Data Analysis Interpretation
The following two features of study design play a
major role in how the data analysis is
interpreted.
How were the subjects sampled (selected)?
How were the subjects assigned to groups?
74
Study Design and Data Interpretation
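Each study falls into one of four cells, defined
by how the units were selected (random or
non-random) and how they were assigned to groups
(by randomization or not by randomization).
  • Random selection, assigned by randomization
    (strongest inference): A random sample is
    selected from one population; units are then
    randomly assigned to different groups.
  • Random selection, not assigned by
    randomization: Random samples are selected
    from existing distinct populations.
  • Non-random selection, assigned by
    randomization: A group of study units is found;
    units are then randomly assigned to study
    groups.
  • Non-random selection, not assigned by
    randomization (weakest inference): Collections
    of available units from distinct groups are
    examined.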
75
Data Analysis Summary
Statistical Methods: The marathon completion
times from the 4 marathon sites were analyzed by
one-way analysis of variance. The response
variable was the runner's completion time (h)
and the independent factor was the marathon
site, which had four levels (New York, Chicago,
Twin Cities, and Philadelphia). Multiple
comparison adjustment was based on a Bonferroni
criterion in which the overall type I error rate
($\alpha$) was $\leq 0.05$.
76
Data Analysis Summary
Results: There was an association between the
marathon site and the mean marathon completion
time (P < 0.001). The largest difference in
mean marathon completion time was between the
New York and Philadelphia marathons (16.4 min,
95% CI: 14.6, 18.1, P < 0.001), followed by
the Chicago and Philadelphia marathons (11.2 min,
95% CI: 9.98, 12.92, P < 0.001) and the Twin
Cities and Philadelphia marathons (10.4 min,
95% CI: 8.2, 12.5, P < 0.001). There was no
statistically significant difference between the
Chicago mean marathon completion time and the
Twin Cities mean marathon completion time
(0.78 min, 95% CI: -0.72, 2.3, P = 1.00).
77
Data Analysis Summary
Conclusions: Although we found an association
between the marathon site and the mean
marathon completion time, assigning cause to this
relationship would only be speculative. Since the
runners were not selected at random, nor were
they randomly assigned to participate in a
particular marathon, there may be multiple
confounding factors that contributed to the
discrepancy between the average marathon
completion times of these 4 marathons.
78
Statistical Resource Material
Rosner, B. Fundamentals of Biostatistics. 5th ed.
2000. Duxbury Press, Pacific Grove, CA.
Fisher, L.D., van Belle, G. Biostatistics: A
Methodology for the Health Sciences. 1993. John
Wiley & Sons, NY.
Campbell, M.J., Machin, D. Medical Statistics: A
Common Sense Approach. 1993. John Wiley & Sons, NY.
Bailar, J.C., Mosteller, F. Medical Uses of
Statistics. 2nd ed. 1992. NEJM Books, Boston, MA.