Data Analysis: Simple Statistical Tests - PowerPoint PPT Presentation

1 / 45
About This Presentation
Title:

Data Analysis: Simple Statistical Tests

Description:

Title: Slide 1 Author: UNC Last modified by: Rachel Wilfert Created Date: 10/31/2003 3:48:11 PM Document presentation format: On-screen Show Company – PowerPoint PPT presentation

Number of Views:92
Avg rating:3.0/5.0
Slides: 46
Provided by: UNC61
Category:

less

Transcript and Presenter's Notes

Title: Data Analysis: Simple Statistical Tests


1
Data AnalysisSimple Statistical Tests
2
Goals
  • Understand confidence intervals and p-values
  • Learn to use basic statistical tests including
    chi square and ANOVA

3
Types of Variables
  • Types of variables indicate which estimates you
    can calculate and which statistical tests you
    should use
  • Continuous variables
  • Always numeric
  • Generally calculate measures such as the mean,
    median and standard deviation
  • Categorical variables
  • Information that can be sorted into categories
  • Field investigation often interested in
    dichotomous or binary (2-level) categorical
    variables
  • Cannot calculate mean or median but can calculate
    risk

4
Measures of Association
  • Strength of the association between two
    variables, such as an exposure and a disease
  • Two measure of association used most often are
    the relative risk, or risk ratio (RR), and the
    odds ratio (OR)
  • The decision to calculate an RR or an OR depends
    on the study design
  • Interpretation of RR and OR
  • RR or OR 1 exposure has no association with
    disease
  • RR or OR gt 1 exposure may be positively
    associated with disease
  • RR or OR lt 1 exposure may be negatively
    associated with disease

5
Risk Ratio or Odds Ratio?
  • Risk ratio
  • Used when comparing outcomes of those who were
    exposed to something to those who were not
    exposed
  • Calculated in cohort studies
  • Cannot be calculated in case-control studies
    because the entire population at risk is not
    included in the study
  • Odds ratio
  • Used in case-control studies
  • Odds of exposure among cases divided by odds of
    exposure among controls
  • Provides a rough estimate of the risk ratio

6
Analysis Tool 2x2 Table
  • Commonly used with dichotomous variables to
    compare groups of people
  • Table puts one dichotomous variable across the
    rows and another dichotomous variable along the
    columns
  • Useful in determining the association between a
    dichotomous exposure and a dichotomous outcome

7
Calculating an Odds Ratio
Table 1. Sample 2x2 table for Hepatitis A at
Restaurant A
Outcome Outcome Outcome Outcome
Exposure Hepatitis A No Hepatitis A Total
Exposure Ate salsa 218 45 263
Exposure Did not eat salsa 21 85 106
Exposure Total 239 130 369
  • Table displays data from a case control study
    conducted in Pennsylvania in 2003 (2)
  • Can calculate the odds ratio
  • OR ad (218)(85) 19.6
  • bc (45)(21) 

8
Confidence Intervals
  • Point estimate a calculated estimate (like risk
    or odds) or measure of association (risk ratio or
    odds ratio)
  • The confidence interval (CI) of a point estimate
    describes the precision of the estimate
  • The CI represents a range of values on either
    side of the estimate
  • The narrower the CI, the more precise the point
    estimate (3)

9
Confidence Intervals - Example
  • Examplelarge bag of 500 red, green and blue
    marbles
  • You want to know the percentage of green marbles
    but dont want to count every marble
  • Shake up the bag and select 50 marbles to give an
    estimate of the percentage of green marbles
  • Sample of 50 marbles
  • 15 green marbles, 10 red marbles, 25 blue marbles

10
Confidence Intervals - Example
  • Marble example continued
  • Based on sample we conclude 30 (15 out of 50)
    marbles are green
  • 30 point estimate
  • How confident are we in this estimate?
  • Actual percentage of green marbles could be
    higher or lower, ie. sample of 50 may not reflect
    distribution in entire bag of marbles
  • Can calculate a confidence interval to determine
    the degree of uncertainty

11
Calculating Confidence Intervals
  • How do you calculate a confidence interval?
  • Can do so by hand or use a statistical program
  • Epi Info, SAS, STATA, SPSS and Episheet are
    common statistical programs
  • Default is usually 95 confidence interval but
    this can be adjusted to 90, 99 or any other
    level

12
Confidence Intervals
  • Most commonly used confidence interval is the 95
    interval
  • 95 CI indicates that our estimated range has a
    95 chance of containing the true population
    value
  • Assume that the 95 CI for our bag of marbles
    example is 17-43
  • We estimated that 30 of the marbles are green
  • CI tells us that the true percentage of green
    marbles is most likely between 17 and 43
  • There is a 5 chance that this range (17-43)
    does not contain the true percentage of green
    marbles

13
Confidence Intervals
  • If we want less chance of error we could
    calculate a 99 confidence interval
  • A 99 CI will have only a 1 chance of error but
    will have a wider range
  • 99 CI for green marbles is 13-47
  • If a higher chance of error is acceptable we
    could calculate a 90 confidence interval
  • 90 CI for green marbles is 19-41

14
Confidence Intervals
  • Very narrow confidence intervals indicate a very
    precise estimate
  • Can get a more precise estimate by taking a
    larger sample
  • 100 marble sample with 30 green marbles
  • Point estimate stays the same (30)
  • 95 confidence interval is 21-39 (rather than
    17-43 for original sample)
  • 200 marble sample with 60 green marbles
  • Point estimate is 30
  • 95 confidence interval is 24-36
  • CI becomes narrower as the sample size increases

15
Confidence Intervals
  • Returning to example of Hepatitis A in a
    Pennsylvania restaurant
  • Odds ratio 19.6
  • 95 confidence interval of 11.0-34.9 (95 chance
    that the range 11.0-34.9 contained the true OR)
  • Lower bound of CI in this example is 11.0 (e.g.,
    gt1)
  • Odds ratio of 1 means there is no difference
    between the two groups, OR gt 1 indicates a
    greater risk among the exposed
  • Conclusion people who ate salsa were truly more
    likely to become ill than those who did not eat
    salsa

16
Confidence Intervals
  • Must include CIs with your point estimates to
    give a sense of the precision of your estimates
  • Examples
  • Outbreak of gastrointestinal illness at 2 primary
    schools in Italy (4)
  • Children who ate corn/tuna salad had 6.19 times
    the risk of becoming ill as children who did not
    eat salad
  • 95 confidence interval 4.81 7.98
  • Pertussis outbreak in Oregon (5)
  • Case-patients had 6.4 times the odds of living
    with a 6-10 year-old child than controls
  • 95 confidence interval 1.8 23.4
  • Conclusion true association between exposure and
    disease in both examples

17
Analysis of Categorical Data
  • Measure of association (risk ratio or odds ratio)
  • Confidence interval
  • Chi-square test
  • A formal statistical test to determine whether
    results are statistically significant

18
Chi-Square Statistics
  • A common analysis is whether Disease X occurs as
    much among people in Group A as it does among
    people in Group B
  • People are often sorted into groups based on
    their exposure to some disease risk factor
  • We then perform a test of the association between
    exposure and disease in the two groups

19
Chi-Square Test Example
  • Hypothetical outbreak of Salmonella on a cruise
    ship
  • Retrospective cohort study conducted
  • All 300 people on cruise ship interviewed, 60 had
    symptoms consistent with Salmonella
  • Questionnaires indicate many of the case-patients
    ate tomatoes from the salad bar

20
Chi-Square Test Example (cont.)
Table 2a. Cohort study Exposure to tomatoes and
Salmonella infection
Salmonella? Salmonella?
Yes No Total
Tomatoes 41 89 130
No Tomatoes 19 151 170
Total 60 240 300
  • To see if there is a statistical difference in
    the amount of illness between those who ate
    tomatoes (41/130) and those who did not (19/170)
    we could conduct a chi-square test

21
Chi-Square Test Example (cont.)
  • To conduct a chi-square the following conditions
    must be met
  • There must be at least a total of 30 observations
    (people) in the table
  • Each cell must contain a count of 5 or more
  • To conduct a chi-square test we compare the
    observed data (from study results) with the data
    we would expect to see

22
Chi-Square Test Example (cont.)
Table 2b. Row and column totals for tomatoes and
Salmonella infection
Salmonella? Salmonella?
Yes No Total
Tomatoes 130
No Tomatoes 170
Total 60 240 300
  • Gives an overall distribution of people who ate
    tomatoes and became sick
  • Based on these distributions we can fill in the
    empty cells with the expected values

23
Chi-Square Test Example (cont.)
  • Expected Value Row Total x Column Total
  • Grand Total
  • For the first cell, people who ate tomatoes and
    became ill
  • Expected value 130 x 60 26
  • 300
  • Same formula can be used to calculate the
    expected values for each of the cells

24
Chi-Square Test Example (cont.)
Table 2c. Expected values for exposure to tomatoes
Salmonella? Salmonella?
Yes No Total
Tomatoes 130 x 60 26 300 130 x 240 104 300 130
No Tomatoes 170 x 60 34 300   170 x 240 136 300   170
Total 60 240 300
  • To calculate the chi-square statistic you use the
    observed values from Table 2a and the expected
    values from Table 2c
  • Formula is (Observed Expected)2/Expected for
    each cell of the table

25
Chi-Square Test Example (cont.)
Table 2d. Expected values for exposure to tomatoes
Salmonella? Salmonella?
Yes No Total
Tomatoes (41-26)2 8.7 26   (89-104)2 2.2 104   130
No Tomatoes (19-34)2 6.6 34   (151-136)2 1.7 136   170
Total 60 240 300
  • The chi-square (?2) for this example is 19.2
  • 8.7 2.2 6.6 1.7 19.2

26
Chi-Square Test
  • What does the chi-square tell you?
  • In general, the higher the chi-square value, the
    greater the likelihood there is a statistically
    significant difference between the two groups you
    are comparing
  • To know for sure, you need to look up the p-value
    in a chi-square table
  • We will discuss p-values after a discussion of
    different types of chi-square tests

27
Types of Chi-Square Tests
  • Many computer programs give different types of
    chi-square tests
  • Each test is best suited to certain situations
  • Most commonly calculated chi-square test is
    Pearsons chi-square
  • Use Pearsons chi-square for a fairly large
    sample (gt100)

28
Types of Statistical Tests
Parade of Statistics Guys
The right test...   To use when.  
Pearson chi-square (uncorrected) Sample size gt100 Expected cell counts gt 10
Yates chi-square (corrected) Sample size gt30 Expected cell counts 5
Mantel-Haenszel chi-square Sample size gt 30 Variables are ordinal
Fishers exact test Sample size lt 30 and/or Expected cell counts lt 5
29
Using Statistical TestsExamples from Actual
Studies
  • In each study, investigators chose the type of
    test that best applied to the situation (Note
    while the chi-square value is used to determine
    the corresponding p-value, often only the p-value
    is reported.)
  • Pearson (Uncorrected) Chi-Square A North
    Carolina study investigated 955 individuals
    because they were identified as partners of
    someone who tested positive for HIV. The study
    found that the proportion of partners who got
    tested for HIV differed significantly by
    race/ethnicity (p-value lt0.001). The study also
    found that HIV-positive rates did not differ by
    race/ethnicity among the 610 who were tested (p
    0.4). (6)

30
Using Statistical TestsExamples from Actual
Studies
  • Additional examples
  • Yates (Corrected) Chi-Square In an outbreak of
    Salmonella gastroenteritis associated with eating
    at a restaurant, 14 of 15 ill patrons studied had
    eaten the Caesar salad, while 0 of 11 well
    patrons had eaten the salad (p-value lt0.01). The
    dressing on the salad was made from raw eggs that
    were probably contaminated with Salmonella. (7)
  • Fishers Exact Test A study of Group A
    Streptococcus (GAS) among children attending
    daycare found that 7 of 11 children who spent 30
    or more hours per week in daycare had
    laboratory-confirmed GAS, while 0 of 4 children
    spending less than 30 hours per week in daycare
    had GAS (p-value lt0.01). (8)

31
P-Values
  • Using our hypothetical cruise ship Salmonella
    outbreak
  • 32 of people who ate tomatoes got Salmonella as
    compared with 11 of people who did not eat
    tomatoes
  • How do we know whether the difference between 32
    and 11 is a real difference?
  • In other words, how do we know that our
    chi-square value (calculated as 19.2) indicates a
    statistically significant difference?
  • The p-value is our indicator

32
P-Values
  • Many statistical tests give both a numeric result
    (e.g. a chi-square value) and a p-value
  • The p-value ranges between 0 and 1
  • What does the p-value tell you?
  • The p-value is the probability of getting the
    result you got, assuming that the two groups you
    are comparing are actually the same

33
P-Values
  • Start by assuming there is no difference in
    outcomes between the groups
  • Look at the test statistic and p-value to see if
    they indicate otherwise
  • A low p-value means that (assuming the groups are
    the same) the probability of observing these
    results by chance is very small
  • Difference between the two groups is
    statistically significant
  • A high p-value means that the two groups were not
    that different
  • A p-value of 1 means that there was no difference
    between the two groups

34
P-Values
  • Generally, if the p-value is less than 0.05, the
    difference observed is considered statistically
    significant, ie. the difference did not happen by
    chance
  • You may use a number of statistical tests to
    obtain the p-value
  • Test used depends on type of data you have

35
Chi-Squares and P-Values
  • If the chi-square statistic is small, the
    observed and expected data were not very
    different and the p-value will be large
  • If the chi-square statistic is large, this
    generally means the p-value is small, and the
    difference could be statistically significant
  • Example Outbreak of E. coli O157H7 associated
    with swimming in a lake (1)
  • Case-patients much more likely than controls to
    have taken lake water in their mouth (p-value
    0.002) and swallowed lake water (p-value 0.002)
  • Because p-values were each less than 0.05, both
    exposures were considered statistically
    significant risk factors

36
Note Assumptions
  • Statistical tests such as the chi-square assume
    that the observations are independent
  • Independence value of one observation does not
    influence value of another
  • If this assumption is not true, you may not use
    the chi-square test
  • Do not use chi-square tests with
  • Repeat observations of the same group of people
    (e.g. pre- and post-tests)
  • Matched pair designs in which cases and controls
    are matched on variables such as sex and age

37
Analysis of Continuous Data
  • Data do not always fit into discrete categories
  • Continuous numeric data may be of interest in a
    field investigation such as
  • Clinical symptoms between groups of patients
  • Average age of patients compared to average age
    of non-patients
  • Respiratory rate of those exposed to a chemical
    vs. respiratory rate of those who were not exposed

38
ANOVA
  • May compare continuous data through the Analysis
    Of Variance (ANOVA) test
  • Most statistical software programs will calculate
    ANOVA
  • Output varies slightly in different programs
  • For example, using Epi Info software
  • Generates 3 pieces of information ANOVA results,
    Bartletts test and Kruskal-Wallis test

39
ANOVA
  • When comparing continuous variables between
    groups of study subjects
  • Use a t-test for comparing 2 groups
  • Use an f-test for comparing 3 or more groups
  • Both tests result in a p-value
  • ANOVA uses either the t-test or the f-test
  • Example testing age differences between 2 groups
  • If groups have similar average ages and a similar
    distribution of age values, t-statistic will be
    small and the p-value will not be significant
  • If average ages of 2 groups are different,
    t-statistic will be larger and p-value will be
    smaller (p-value lt0.05 indicates two groups have
    significantly different ages)

40
ANOVA and Bartletts Test
  • Critical assumption with t-tests and f-tests
    groups have similar variances (e.g., spread of
    age values)
  • As part of the ANOVA analysis, software conducts
    a separate test to compare variances Bartletts
    test for equality of variance
  • Bartletts test
  • Produces a p-value
  • If Bartletts p-value gt0.05, (not significant) OK
    to use ANOVA results
  • Bartletts p-value lt0.05, variances in the groups
    are NOT the same and you cannot use the ANOVA
    results

41
Kruskal-Wallis Test
  • Kruskal-Wallis test generated by Epi Info
    software
  • Used only if Bartletts test reveals variances
    dissimilar enough so that you cant use ANOVA
  • Does not make assumptions about variance,
    examines the distribution of values within each
    group
  • Generates a p-value
  • If p-value gt0.05 there is not a significant
    difference between groups
  • If p-value lt 0.05 there is a significant
    difference between groups

42
Analysis of Continuous Data
43
Conclusion
  • In field epidemiology a few calculations and
    tests make up the core of analytic methods
  • Learning these methods will provide a good set of
    field epidemiology skills.
  • Confidence intervals, p-values, chi-square tests,
    ANOVA and their interpretations
  • Further data analysis may require methods to
    control for confounding including matching and
    logistic regression

44
References
  • 1. Bruce MG, Curtis MB, Payne MM, et al.
    Lake-associated outbreak of Escherichia coli
    O157H7 in Clark County, Washington, August 1999.
    Arch Pediatr Adolesc Med. 20031571016-1021.
  • 2. Wheeler C, Vogt TM, Armstrong GL, et al. An
    outbreak of hepatitis A associated with green
    onions. N Engl J Med. 2005353890-897.
  • 3. Gregg MB. Field Epidemiology. 2nd ed. New
    York, NY Oxford University Press 2002.
  • 4. Aureli P, Fiorucci GC, Caroli D, et al. An
    outbreak of febrile gastroenteritis associated
    with corn contaminated by Listeria monocytogenes.
    N Engl J Med. 20003421236-1241.

45
References
  • 5. Schafer S, Gillette H, Hedberg K, Cieslak P. A
    community-wide pertussis outbreak an argument
    for universal booster vaccination. Arch Intern
    Med. 20061661317-1321.
  • 6. Centers for Disease Control and Prevention.
    Partner counseling and referral services to
    identify persons with undiagnosed HIV --- North
    Carolina, 2001. MMWR Morb Mort Wkly
    Rep.2003521181-1184.
  • 7. Centers for Disease Control and Prevention.
    Outbreak of Salmonella Enteritidis infection
    associated with consumption of raw shell eggs,
    1991. MMWR Morb Mort Wkly Rep. 199241369-372.
  • 8. Centers for Disease Control and Prevention.
    Outbreak of invasive group A streptococcus
    associated with varicella in a childcare center
    -- Boston, Massachusetts, 1997. MMWR Morb Mort
    Wkly Rep. 199746944-948.
Write a Comment
User Comments (0)
About PowerShow.com