STATISTICS WORKSHOP - 2

Provided by: SING64
1
STATISTICS WORKSHOP - 2
  • Contingency tables
  • Correlation
  • Analysis of variance

2
Why relations between variables are important
  • The ultimate goal of every research project or scientific
    analysis is to find relations between variables.
  • The philosophy of science teaches us that there
    is no way of representing meaning except
    in terms of relations between quantities or
    qualities; either way, this involves relations between
    variables.
  • The advancement of science must always involve
    finding new relations between variables.

3
Qualitative Data (Contingency Table)
Example: This test is the one to use if we
have, say, different classes of patients (e.g.,
six types of cancer) and, for a set of 1000
markers, we record the presence or absence of
each marker in each patient. This would yield
1000 contingency tables of dimension 6x2, one per
marker (cancer type by marker presence/absence).
4
Contingency Table
Question: Is there evidence in the data for an
association between the categorical variables?
For cross-classified data, the Pearson chi-square
test for independence and Fisher's exact test can
be used to test the null hypothesis that the row
and column classification variables of the data's
two-way contingency table are independent.
5
Chi-Square test
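For a 2x2 table with cell counts a, b (first row) and c, d (second row):
$\chi^2 = \sum \frac{(O - E)^2}{E}$, where O and E are the observed and expected cell counts
Odds Ratio: $OR = \frac{ad}{bc}$
Relative risk: $RR = \frac{a(c+d)}{c(a+b)}$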
6
Contingency Table
Chi-Square Test Example: 3500 people were observed
to see whether they snore or not. Is there an
association between snoring and gender?
7
Contingency Table
Example - Is there an association between snoring
and gender?
8
Contingency Table
Odds ratio: 1.58 (95% CI 1.39 to 1.81)
9
Contingency Table
Is there evidence of differences in smoking
pattern between the sexes?
10
Contingency Table
11
Measuring treatment differences with Y/N response
  • For outcomes such as reduction in blood pressure
    there are obvious summaries of treatment effect
    such as the difference between the average of
    each group
  • For yes/no outcomes like death or cure the choice
    of summary is not so obvious

              Dead: Y   Dead: N   % dead
aspirin           804      7783      9.4
placebo          1016      7584     11.8
TOTAL            1820     15367
12
Relative Risk or Risk Ratio
  • Relative risk (or risk ratio) = risk of death in the
    aspirin group divided by risk in the placebo group
  • Relative risk = 9.4% / 11.8% = 0.80
  • Mortality is reduced by 20%
  • Relative risk estimates are likely to generalise
    well from one population to another

13
Absolute Risk Difference
  • Absolute risk difference is the proportion of
    deaths in the aspirin group minus the proportion
    in the placebo group
  • Risk difference = 9.4% - 11.8% = -2.4%
  • "2.4 lives saved for each 100 patients treated"
  • Risk difference has a more direct clinical
    interpretation, especially when considering
    cost-effectiveness

14
Odds Ratio
  • Odds ratio = the odds of death in the aspirin
    group divided by the odds in the placebo group
  • Odds ratio = (804/7783) / (1016/7584) = 0.77
  • "Reduction of 23% in the odds of death"
  • The odds ratio has some purely mathematical
    advantages. It is not much used in randomised
    studies
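
The three summaries on slides 12 to 14 can be recomputed from the counts in the aspirin/placebo table on slide 11. The following is a minimal Python sketch, not part of the original slides; the variable and function names are illustrative. Working from the raw counts rather than the rounded percentages gives a relative risk of about 0.79 instead of 0.80.

```python
# Counts taken from the aspirin/placebo table on slide 11.
aspirin_dead, aspirin_alive = 804, 7783
placebo_dead, placebo_alive = 1016, 7584


def risk(dead, alive):
    """Proportion of patients in a group who died."""
    return dead / (dead + alive)


risk_aspirin = risk(aspirin_dead, aspirin_alive)
risk_placebo = risk(placebo_dead, placebo_alive)

# Relative risk: ratio of the two risks.
relative_risk = risk_aspirin / risk_placebo

# Absolute risk difference: aspirin risk minus placebo risk.
risk_difference = risk_aspirin - risk_placebo

# Odds ratio: odds of death on aspirin divided by odds on placebo.
odds_ratio = (aspirin_dead / aspirin_alive) / (placebo_dead / placebo_alive)

print(f"risk (aspirin)  = {risk_aspirin:.4f}")
print(f"risk (placebo)  = {risk_placebo:.4f}")
print(f"relative risk   = {relative_risk:.3f}")
print(f"risk difference = {100 * risk_difference:.2f} percentage points")
print(f"odds ratio      = {odds_ratio:.3f}")
```

Because the risks here are fairly small, the odds ratio and the relative risk are close to each other; they diverge as the outcome becomes more common.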

15
Berkson's Fallacy
  • It is a treatment-seeking bias, so called because
    Berkson indicated that individuals with more than
    one disorder are more likely to seek clinical
    services than are those with only one disorder. 
  • This leads to an erroneously higher estimate of
    the prevalence of the association between these
    disorders than would be the case if each single
    disorder independently led the patient to seek
    care.

16
Berkson's Fallacy
  • 2784 individuals were surveyed to determine
    whether each subject suffered from disease A,
    disease B, or both. It was found that
    257 of the 2784 subjects were hospitalised
    for the condition.

Hospitalised subjects (n = 257)
                 Disease B: Yes   Disease B: No   Total
Disease A: Yes                7              29      36
Disease A: No                13             208     221
Total                        20             237     257
P < 0.025: There is some association between
having disease A and having disease B

All surveyed subjects (n = 2784)
                 Disease B: Yes   Disease B: No   Total
Disease A: Yes               22             171     193
Disease A: No               202            2389    2591
Total                       224            2560    2784
P > 0.1: There is no evidence of an association between having
disease A and having disease B
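
The two P values can be checked with a chi-square test on each 2x2 table. The sketch below is an illustration rather than the slides' own calculation: it assumes the Yates continuity correction, which the slides do not mention but which gives P values that agree with the thresholds quoted above (without the correction, the P value for all surveyed subjects falls below 0.1). The function name is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square statistic and P value (1 degree of freedom)
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates continuity correction (an assumption, see lead-in).
        diff = max(0.0, diff - n / 2)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of chi-square with 1 df, via the normal distribution.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hospitalised subjects (n = 257).
hospital_stat, hospital_p = chi_square_2x2(7, 29, 13, 208)

# All surveyed subjects (n = 2784).
population_stat, population_p = chi_square_2x2(22, 171, 202, 2389)

print(f"hospitalised: chi2 = {hospital_stat:.2f}, P = {hospital_p:.3f}")
print(f"all subjects: chi2 = {population_stat:.2f}, P = {population_p:.3f}")
```

The association appears only in the hospitalised subgroup, which is the selection effect that Berkson's fallacy describes.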
17
Gene Association Studies Typically Wrong
  • Evolution of the strength of an association as
    more information is accumulated. The strength of
    the association is shown as an estimate of the
    odds ratio (OR) without confidence intervals.
  • a, Eight topics in which the results of the first
    study or studies differed beyond chance (P < 0.05)
    when compared with the results of the subsequent
    studies.
  • b, Eight topics in which the first study or
    studies did not claim formal statistical
    significance for the genetic association but
    formal significance was reached by the end of the
    meta-analysis.
  • Each trajectory starts at the OR of the first
    study or studies. Updated cumulative OR estimates
    are obtained at the end of each subsequent year,
    summarizing all information to that time.

(Adapted from J.P. Ioannidis et al., Nature
Genetics 29:306-9, 2001)
19
Studies of disease association
  • Given the number of potentially identifiable
    genetic markers and the multitude of clinical
    outcomes to which these may be linked, the
    testing and validation of statistical hypotheses
    in genetic epidemiology is a task of
    unprecedented scale

20
Testing for equality of two proportions
  • Example: Two groups of genes: (1) genes for
    transcription and translation; (2) genes in the
    immune system.
  • Question: Do they have similar purine-pyrimidine
    compositions? That is, is the percentage of purines
    (or pyrimidines) in group 1 the same as the
    percentage of purines (or pyrimidines) in group 2?
  • To form the null and alternative hypotheses, let
    G1 = the percentage of purines in group 1 and
    G2 = the percentage of purines in group 2.
  • H0: G1 = G2
  • H1: G1 > G2 or G2 > G1
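
One common way to test H0 against a two-sided H1 is the pooled two-proportion z-test. The sketch below is illustrative only: the slides do not say which test was used, the purine counts are hypothetical, and treating each base as an independent observation is a simplifying assumption.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 == p2 using the pooled proportion.

    x1, x2 are the counts of "successes" (here, purines) and
    n1, n2 are the group sizes (here, total bases).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided P value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, not from the slides: purines out of total bases.
z, p_value = two_proportion_z_test(520, 1000, 480, 1000)
print(f"z = {z:.3f}, two-sided P = {p_value:.3f}")
```

For a 2x2 table this test is equivalent to the Pearson chi-square test without continuity correction, since z squared equals the chi-square statistic.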

21
Correlation
  • Correlation can be used to summarise the amount
    of linear association between two continuous
    variables x and y.
  • Let (x1, y1), (x2, y2), ..., (xn, yn) denote the
    data points.
  • A scatter plot gives a "cloud" of points

[Figure: scatter plots of Y against X showing positive correlation, negative correlation, and no correlation]
22
Positive and Negative Association
  • If the points are nearly in a straight line then
    knowing the value of one variable helps you to
    predict the value of the other.
  • If there is little or no association, the "cloud"
    is more spread out and information about one
    variable doesn't tell you much about the other.

23
A simple correlation formula
  • Suppose there are n points altogether and that
    n(A) is the number in region A, and similarly for
    n(B), n(C) and n(D).
  • Give a value of 1/n to every point in A or C and
    -1/n to every point in B or D.
  • Define
    $cor = \frac{n(A) + n(C) - n(B) - n(D)}{n}$
  • What are the properties of cor?

[Figure: scatter plot divided into four regions, with D at top left, A at top right, C at bottom left and B at bottom right]
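
The counting rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the regions are taken to be the four quadrants formed by splitting at the mean of x and the mean of y (the slide does not state where the dividing lines fall), points lying exactly on a dividing line score 0, and the data are hypothetical.

```python
def quadrant_cor(points, x_split, y_split):
    """Simple correlation: (n(A) + n(C) - n(B) - n(D)) / n.

    A = upper right, C = lower left (score +1 each),
    B = lower right, D = upper left (score -1 each).
    Points exactly on a dividing line score 0 (an assumption).
    """
    total = 0
    for x, y in points:
        product = (x - x_split) * (y - y_split)
        if product > 0:       # region A or C
            total += 1
        elif product < 0:     # region B or D
            total -= 1
    return total / len(points)


def mean(values):
    return sum(values) / len(values)


# Hypothetical data, not from the slides.
points = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
xs = [x for x, _ in points]
ys = [y for _, y in points]

cor = quadrant_cor(points, mean(xs), mean(ys))
print(f"cor = {cor:.3f}")
```

Because every point contributes +1/n, -1/n or 0, cor always lies between -1 and 1, and it depends only on which region each point falls in, not on how far the point is from the dividing lines.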
24
The Pearson product moment correlation coefficient
  • The formula for cor works, but it is rather
    crude. For example, both the diagrams below
    would give cor = 1.

$(x_i - \bar{x})$ and $(y_i - \bar{y})$
are positive or negative in the different regions,
and so is the product $(x_i - \bar{x})(y_i - \bar{y})$
  • The sum $\sum (x_i - \bar{x})(y_i - \bar{y})$ will not
    lie between -1 and 1. It depends on:
  • The scale of x and y
  • The number of points

25
Correlation formula
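$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
where $\bar{x}$ and $\bar{y}$ are the means of the x and y values
Partial correlation: Correlation between two
variables that controls for the effects of one or
more other variables.
Rank correlation: Correlation computed on the ranks of the data rather than on the raw values.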
26
Pearson Correlation Coefficient
  • A measure of linear association between two
    variables, denoted as r.
  • Values of the correlation coefficient range from
    -1 to 1.
  • The sign of the coefficient indicates the
    direction of the relationship, and its absolute
    value indicates the strength, with larger
    absolute values indicating stronger relationships.

27
Interpretation of correlation
  • r measures the extent of linear association
    between two continuous variables.
  • Association does not imply causation - both
    variables may be affected by a third variable.
  • If r = 0, there is no linear association between X and Y
  • r does not indicate the extent of nonlinear
    associations
  • The value of r can be affected by outliers
  • Correlations do not establish causality
  • Example: When a gene is isolated that has some
    positive correlation with cancer, the claim often
    made is that it enhances susceptibility to the
    disease, not that it causes it.

28
Some misconceptions
  • Misconception: when the absolute value of the
    correlation coefficient is large, the relation
    between the two variates is close to linear; thus,
    when r = 0.9 or 0.95 the relation is nearly linear
  • Misconception: when the value of the correlation
    coefficient is zero or near zero, the two variates
    have no or almost no functional relation
  • Misconception: when the value of the correlation
    coefficient is positive (negative), the value of Y
    becomes larger (smaller) as a whole as the value of
    X becomes larger
  • Example: Let (X,Y) take the values (1,-1), (2,-2),
    (3,-3), (4,-4), (5,20), each with probability 1/5.
    Then we have Cor(X,Y) = 0.62
  • For the first four points, Y decreases as
    X increases. This example shows that even when
    the correlation coefficient between X and Y is
    positive, Y does not always increase as a whole
    as X increases.
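
The value 0.62 in the example can be reproduced with the standard Pearson product moment formula. The sketch below is illustrative and not from the slides; the function name is an assumption. Since the five points are equally likely, the population correlation equals the ordinary sample calculation.

```python
import math


def pearson_r(xs, ys):
    """Pearson product moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# The five points from the example on this slide.
xs = [1, 2, 3, 4, 5]
ys = [-1, -2, -3, -4, 20]

r_all = pearson_r(xs, ys)                  # all five points
r_first_four = pearson_r(xs[:4], ys[:4])   # without the point (5, 20)

print(f"r (all five points)  = {r_all:.2f}")
print(f"r (first four only)  = {r_first_four:.2f}")
```

The first four points lie exactly on a decreasing line, so the single point (5, 20) is enough to turn a perfectly negative correlation into a positive one. This also illustrates how strongly r can be affected by outliers.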

29
Examples
  • Eg1. In Australia total alcohol consumption and
    the number of ministers of religion have both
    increased over time and would be positively
    correlated but the increase in one has not caused
    the increase in the other (both are related to
    the total population size)
  • Eg2. In Japanese schoolchildren shoe size was
    reported to be correlated (positively) with
    scores on a test of mathematical ability.
  • Eg3. Extracting informative genes with negative
    correlation for accurate cancer classification

30
Effectiveness of the first Cold-War arms agreement
  • "Most important, the negative correlation between
    the mutation rate and the parental year of birth
    among those born between 1950 and 1956 provides
    experimental evidence for change in human
    germline mutation rate with declining exposure to
    ionizing radiation and therefore shows that the
    Moscow treaty banning nuclear weapon tests in the
    atmosphere (August 1963) has been effective in
    reducing genetic risk to the affected
    population."

31
Example - Heights and weights of 6 female students
  • The table below shows the heights and weights of
    6 female students. How closely related are the
    heights and the weights?

The correlation coefficient is 0.904
32
Spearman Correlation Coefficient
  • Commonly used nonparametric measure of
    correlation between two ordinal variables. For
    all of the cases, the values of each of the
    variables are ranked from smallest to largest,
    and the Pearson correlation coefficient is
    computed on the ranks.

33
Rank Correlation
  • 10 students, arranged in alphabetical order, were
    ranked according to their achievements in both
    the laboratory and lecture sections of a biology
    course. Find the coefficient of rank
    correlation.

Rank correlation = 0.8545
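
The ranking table for this example did not survive in the transcript, so the sketch below uses hypothetical ranks for 10 students. They are chosen so that the squared rank differences sum to 24, which for n = 10 untied ranks gives the same coefficient as the value reported above. The shortcut formula used here, 1 - 6 * sum(d^2) / (n * (n^2 - 1)), is the standard one for untied ranks and equals the Pearson correlation computed on the ranks.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman rank correlation for untied ranks:
    1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the rank difference."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical ranks, NOT the data from the original slide.
lab_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
lecture_rank = [3, 2, 1, 6, 5, 4, 9, 8, 7, 10]

rs = spearman_from_ranks(lab_rank, lecture_rank)
print(f"rank correlation = {rs:.4f}")
```

If the data contain ties, this shortcut no longer applies exactly; in that case the Pearson coefficient should be computed directly on the (averaged) ranks.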
34
Thoughts
"Patterns often emerge before the reasons for them
become apparent." - Vasant Dhar
"If you do not expect, you cannot find the unexpected." -
Heraclitus
"To consult the statistician after an
experiment is finished is often merely to ask him
to conduct a post mortem examination. He can
perhaps say what the experiment died of." -
R.A. Fisher