Title: STATISTICS WORKSHOP - 2
1 STATISTICS WORKSHOP - 2
- Contingency tables
- Correlation
- Analysis of variance
2 Why relations between variables are important
- The ultimate goal of every research or scientific analysis is finding relations between variables.
- The philosophy of science teaches us that there is no other way of representing meaning except in terms of relations between some quantities or qualities; either way involves relations between variables.
- The advancement of science must always involve finding new relations between variables.
3 Qualitative Data (Contingency Table)
Example: This test would be the one to use if we have, say, different classes of patients (e.g., six types of cancer) and, for a set of 1000 markers, the presence or absence of each marker in each patient (this would yield 1000 contingency tables of dimensions 6x2, one for each marker by cancer type).
4 Contingency Table
Question: Is there evidence in the data for association between the categorical variables?
For cross-classified data, the Pearson chi-square test for independence and Fisher's exact test can be used to test the null hypothesis that the row and column classification variables of the data's two-way contingency table are independent.
5 Chi-Square test
Chi-square statistic: χ² = Σ (Observed - Expected)² / Expected, summed over all cells of the table.
For a 2x2 table with cells a, b (first row) and c, d (second row):
Odds Ratio (OR) = (ad)/(bc)
Relative risk (RR) = [a/(a+b)] / [c/(c+d)] = a(c+d) / [c(a+b)]
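A minimal sketch in Python (scipy assumed available) of the quantities on this slide for a generic 2x2 table; the cell counts a, b, c, d below are hypothetical and serve only to illustrate the formulas.

```python
# Minimal sketch (hypothetical counts): chi-square test, odds ratio and
# relative risk for a 2x2 table with rows = groups, columns = outcome yes/no.
from scipy.stats import chi2_contingency

a, b = 30, 70    # group 1: outcome yes / outcome no  (hypothetical)
c, d = 15, 85    # group 2: outcome yes / outcome no  (hypothetical)

chi2, p, dof, expected = chi2_contingency([[a, b], [c, d]])  # Pearson chi-square

odds_ratio    = (a * d) / (b * c)                  # OR = (ad)/(bc)
relative_risk = (a / (a + b)) / (c / (c + d))      # RR = a(c+d) / [c(a+b)]

print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
print(f"OR = {odds_ratio:.2f}, RR = {relative_risk:.2f}")
```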
6 Contingency Table
Chi-Square Test Example: 3500 people were observed and recorded as snoring or not snoring. Is there an association between snoring and gender?
7 Contingency Table
Example - Is there an association between snoring and gender?
8 Contingency Table
Odds ratio = 1.58, 95% CI 1.39 to 1.81
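A sketch of how an odds ratio and its 95% confidence interval of this kind are usually computed from a 2x2 table of counts. The slide's actual snoring-by-gender table is not reproduced in this transcript, so the cell counts below are hypothetical.

```python
# Hypothetical 2x2 counts (snore / do not snore by gender); not the slide's data.
from math import exp, log, sqrt

a, b = 400, 1300   # males:   snore / do not snore  (hypothetical)
c, d = 300, 1500   # females: snore / do not snore  (hypothetical)

or_hat = (a * d) / (b * c)                 # point estimate of the odds ratio
se_log_or = sqrt(1/a + 1/b + 1/c + 1/d)    # standard error of log(OR)
lo = exp(log(or_hat) - 1.96 * se_log_or)   # lower 95% confidence limit
hi = exp(log(or_hat) + 1.96 * se_log_or)   # upper 95% confidence limit
print(f"OR = {or_hat:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```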
9 Contingency Table
Is there evidence of differences in smoking
pattern between the sexes?
10 Contingency Table
11 Measuring treatment differences with Y/N response
- For outcomes such as reduction in blood pressure there are obvious summaries of treatment effect, such as the difference between the averages of the two groups.
- For yes/no outcomes like death or cure the choice of summary is not so obvious.

            Dead: Y    Dead: N    % dead
aspirin         804       7783       9.4
placebo        1016       7584      11.8
TOTAL          1820      15367
12 Relative Risk or Risk Ratio
- Relative risk (risk ratio) = the risk of death in the aspirin group divided by the risk in the placebo group.
- Relative Risk = 9.4 / 11.8 = 0.80
- Mortality is reduced by 20%.
- Relative risk estimates are likely to generalise well from one population to another.
13 Absolute Risk Difference
- Absolute risk difference = the proportion of deaths in the aspirin group minus the proportion in the placebo group.
- Risk difference = 9.4 - 11.8 = -2.4
- "2.4 lives saved for each 100 patients treated"
- Risk difference has a more direct clinical interpretation, especially when considering cost-effectiveness.
14 Odds Ratio
- Odds ratio = the odds of death in the aspirin group divided by the odds in the placebo group.
- Odds Ratio = (9.4/90.6) / (11.8/88.2) ≈ 0.77
- "A reduction of 23% in the odds of death."
- The odds ratio has some purely mathematical advantages, but it is not much used in randomised studies.
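A short sketch, assuming Python, that reproduces the three treatment-effect summaries directly from the counts in the table on slide 11 (804/7783 deaths/survivors on aspirin, 1016/7584 on placebo).

```python
# Reproduce RR, absolute risk difference and OR from the aspirin/placebo table.
dead_asp, alive_asp = 804, 7783    # aspirin group
dead_pla, alive_pla = 1016, 7584   # placebo group

risk_asp = dead_asp / (dead_asp + alive_asp)     # 0.094 (9.4%)
risk_pla = dead_pla / (dead_pla + alive_pla)     # 0.118 (11.8%)

rr  = risk_asp / risk_pla    # 0.79 from the raw counts, 0.80 from the rounded percentages
ard = risk_asp - risk_pla    # about -0.025, i.e. roughly 2.4-2.5 fewer deaths per 100 treated
odds_ratio = (dead_asp / alive_asp) / (dead_pla / alive_pla)   # 0.77

print(f"RR = {rr:.2f}, ARD = {100 * ard:.1f} per 100, OR = {odds_ratio:.2f}")
```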
15 Berkson's Fallacy
- It is a treatment-seeking bias, so called because Berkson indicated that individuals with more than one disorder are more likely to seek clinical services than are those with only one disorder.
- This leads to an erroneously high estimate of the association between these disorders, compared with what would be the case if each single disorder independently led the patient to seek care.
16 Berkson's Fallacy
- 2784 individuals were surveyed to determine whether each subject suffered from disease A, disease B, or both. 257 of the 2784 were hospitalised for their condition.

Hospitalised patients (n = 257):
                Disease B
Disease A     Yes     No   Total
Yes             7     29      36
No             13    208     221
Total          20    237     257
P < 0.025: there is some association between having disease A and having disease B.

All surveyed individuals (n = 2784):
                Disease B
Disease A     Yes     No   Total
Yes            22    171     193
No            202   2389    2591
Total         224   2560    2784
P > 0.1: there is no association between having disease A and having disease B.
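A sketch, assuming scipy is available, applying the Pearson chi-square test (with Yates' continuity correction, scipy's default for 2x2 tables) to the two tables above.

```python
from scipy.stats import chi2_contingency

hospitalised = [[7, 29], [13, 208]]      # the 257 hospitalised patients
all_surveyed = [[22, 171], [202, 2389]]  # all 2784 surveyed individuals

for label, table in (("hospitalised", hospitalised), ("all surveyed", all_surveyed)):
    chi2, p, dof, _ = chi2_contingency(table)   # Yates-corrected for 2x2 tables
    print(f"{label:>13}: chi-square = {chi2:.2f}, P = {p:.3f}")

# hospitalised: P < 0.025 (apparent association); all surveyed: P > 0.1 (none).
```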
17 Gene Association Studies Typically Wrong
- Evolution of the strength of an association as more information is accumulated. The strength of the association is shown as an estimate of the odds ratio (OR) without confidence intervals.
- a: Eight topics in which the results of the first study or studies differed beyond chance (P < 0.05) when compared with the results of the subsequent studies.
- b: Eight topics in which the first study or studies did not claim formal statistical significance for the genetic association, but formal significance was reached by the end of the meta-analysis.
- Each trajectory starts at the OR of the first study or studies. Updated cumulative OR estimates are obtained at the end of each subsequent year, summarizing all information to that time. (Adapted from J.P. Ioannidis et al., Nature Genetics 29:306-309, 2001)
19 Studies of disease association
- Given the number of potentially identifiable genetic markers and the multitude of clinical outcomes to which these may be linked, the testing and validation of statistical hypotheses in genetic epidemiology is a task of unprecedented scale.
20 Testing for equality of two proportions
- Example: Two groups of genes: 1. genes for transcription and translation; 2. genes in the immune system.
- Question: Do they have similar purine-pyrimidine compositions? The question asks whether the percentage of purines (or pyrimidines) in group 1 is the same as the percentage of purines (or pyrimidines) in group 2.
- To form the null and alternative hypotheses, let G1 = the percentage of purines in group 1 and G2 = the percentage of purines in group 2. Then H0: G1 = G2 and H1: G1 > G2 or G2 > G1 (that is, G1 ≠ G2).
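A minimal sketch of the corresponding two-proportion z-test, assuming Python with scipy; the purine counts below are hypothetical, since the slide does not give the actual data.

```python
from math import sqrt
from scipy.stats import norm

purines_1, bases_1 = 5200, 10000   # group 1 (transcription/translation) - hypothetical
purines_2, bases_2 = 5050, 10000   # group 2 (immune system) - hypothetical

p1, p2 = purines_1 / bases_1, purines_2 / bases_2
p_pool = (purines_1 + purines_2) / (bases_1 + bases_2)        # pooled estimate under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / bases_1 + 1 / bases_2))
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))        # two-sided test of H0: G1 = G2 vs H1: G1 != G2
print(f"z = {z:.2f}, p = {p_value:.4f}")
```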
21 Correlation
- Correlation can be used to summarise the amount of linear association between two continuous variables x and y.
- Let (x1, y1), (x2, y2), ..., (xn, yn) denote the data points.
- A scatter plot gives a "cloud" of points.
[Scatter plots omitted: y against x showing positive correlation, negative correlation and no correlation.]
22 Positive and Negative Association
- If the points are nearly in a straight line, then knowing the value of one variable helps you to predict the value of the other.
- If there is little or no association, the "cloud" is more spread out and information about one variable doesn't tell you much about the other.
23 A simple correlation formula
- Suppose there are n points altogether and that n(A) is the number in region A, and similarly for n(B), n(C) and n(D).
- Give a value of +1/n to every point in A or C and -1/n to every point in B or D.
- Define cor = [n(A) + n(C) - n(B) - n(D)] / n
- What are the properties of cor?
[Diagram omitted: the scatter plot divided into four regions, labelled D and A across the top and C and B across the bottom.]
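A small sketch of this quadrant-count cor in Python, under the assumption (suggested by the quadrant labels) that A and C are the regions where the deviations from the two means have the same sign, and B and D the regions where they differ.

```python
def quadrant_cor(x, y):
    """cor = [n(A) + n(C) - n(B) - n(D)] / n, counting points by quadrant."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    same = sum(1 for xi, yi in zip(x, y) if (xi - x_bar) * (yi - y_bar) > 0)  # A or C
    diff = sum(1 for xi, yi in zip(x, y) if (xi - x_bar) * (yi - y_bar) < 0)  # B or D
    return (same - diff) / n

print(quadrant_cor([1, 2, 4, 5, 7, 8], [1, 3, 3, 6, 7, 9]))   # 1.0: every point in A or C
```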
24 The Pearson product moment correlation coefficient
- The formula for cor works, but it is rather crude. For example, both of the diagrams below would give cor = 1. [Two scatter-plot diagrams omitted.]
- (xi - x̄) and (yi - ȳ) are positive or negative in the different regions, and so is the product (xi - x̄)(yi - ȳ).
- The sum Σ (xi - x̄)(yi - ȳ) will not lie between -1 and 1. It depends on:
- The scale of x and y
- The number of points
25 Correlation formula
r = Σ (xi - x̄)(yi - ȳ) / √[ Σ (xi - x̄)² · Σ (yi - ȳ)² ]
where x̄ and ȳ are the sample means of the xi and yi.
Partial correlation: correlation between 2 variables that controls for the effects of one or more other variables.
Rank correlation: correlation computed on the ranks of the observations (see the Spearman coefficient below).
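A sketch of the formula above in plain Python; scipy.stats.pearsonr gives the same r together with a p-value. The small data set in the last line is illustrative only.

```python
from math import sqrt

def pearson_r(x, y):
    """r = sum (xi - x_bar)(yi - y_bar) / sqrt[ sum (xi - x_bar)^2 * sum (yi - y_bar)^2 ]"""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

print(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 7]))   # about 0.87 for this small cloud
```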
26 Pearson Correlation Coefficient
- A measure of linear association between two variables, denoted as r.
- Values of the correlation coefficient range from -1 to 1.
- The sign of the coefficient indicates the direction of the relationship, and its absolute value indicates the strength, with larger absolute values indicating stronger relationships.
27 Interpretation of correlation
- r measures the extent of linear association between two continuous variables.
- Association does not imply causation: both variables may be affected by a third variable.
- If r = 0, there is no linear association between X and Y.
- r does not indicate the extent of non-linear associations.
- The value of r can be affected by outliers.
- Correlations Do Not Establish Causality
- Example: When a gene is isolated that has some positive correlation with cancer, the claim is often made that it enhances susceptibility to the disease, rather than that it causes it.
28 Some misconceptions
- When the value of the correlation coefficient is large (small), the relation between the two variates is close to linear; thus, when r = 0.9 or 0.95 the relation is nearly linear.
- When the value of the correlation coefficient is zero or near zero, the two variates have no or almost no functional relation.
- When the value of the correlation coefficient is positive (negative), the value of Y becomes larger (smaller) as a whole as the value of X becomes larger.
- Example: Let (X,Y) take the values (1,-1), (2,-2), (3,-3), (4,-4), (5,20), each with probability 1/5. Then Cor(X,Y) = 0.62.
- Concerning the first four points, Y decreases as X increases. This example shows that even when the correlation coefficient between X and Y is positive, Y does not always increase as a whole as X increases.
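A quick check of the example above, assuming scipy is available: r comes out positive even though Y decreases over the first four points.

```python
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]
y = [-1, -2, -3, -4, 20]
r, _ = pearsonr(x, y)       # Pearson correlation and (ignored) p-value
print(round(r, 2))          # 0.62, despite Y falling over the first four points
```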
29 Examples
- Eg1. In Australia, total alcohol consumption and the number of ministers of religion have both increased over time and would be positively correlated, but the increase in one has not caused the increase in the other (both are related to the total population size).
- Eg2. In Japanese schoolchildren, shoe size was reported to be correlated (positively) with scores on a test of mathematical ability.
- Eg3. Extracting informative genes with negative correlation for accurate cancer classification.
30 Effectiveness of the first Cold-War arms agreement
- "Most important, the negative correlation between
the mutation rate and the parental year of birth
among those born between 1950 and 1956 provides
experimental evidence for change in human
germline mutation rate with declining exposure to
ionizing radiation and therefore shows that the
Moscow treaty banning nuclear weapon tests in the
atmosphere (August 1963) has been effective in
reducing genetic risk to the affected
population."
31 Example - Heights and weights of 6 female students
- The table below shows the heights and weights of 6 female students. How closely related are the heights and the weights? [Table omitted.]
The correlation coefficient = 0.904
32 Spearman Correlation Coefficient
- A commonly used nonparametric measure of correlation between two ordinal variables. For all of the cases, the values of each of the variables are ranked from smallest to largest, and the Pearson correlation coefficient is computed on the ranks.
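A sketch of the procedure just described, assuming scipy: rank each variable, then apply the Pearson coefficient to the ranks; scipy.stats.spearmanr does the same in one step. The rank data below are hypothetical, not the class data used on the next slide.

```python
from scipy.stats import pearsonr, rankdata, spearmanr

lab     = [8, 3, 9, 2, 7, 10, 4, 6, 1, 5]   # laboratory ranks (hypothetical)
lecture = [9, 5, 10, 1, 8, 6, 3, 4, 2, 7]   # lecture ranks (hypothetical)

r_on_ranks, _ = pearsonr(rankdata(lab), rankdata(lecture))  # Pearson on the ranks
rho, _ = spearmanr(lab, lecture)                            # Spearman directly
print(round(r_on_ranks, 4), round(rho, 4))                  # identical values (about 0.79 here)
```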
33 Rank Correlation
- 10 students, arranged in alphabetical order, were ranked according to their achievements in both the laboratory and lecture sections of a biology course. Find the coefficient of rank correlation.
Rank correlation = 0.8545
34 Thoughts
"Patterns often emerge before the reasons for them become apparent." - Vasant Dhar
"If you do not expect, you cannot find the unexpected." - Heraclitus
"To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." - R.A. Fisher