Title: BASIC STATISTICS
1BASIC STATISTICS
- Alan J. Chaput, BScPhm, PharmD, MD, MSc (Epid),
FRCPC - June 28, 2007
2Descriptive statistics
3Types of data
- Categorical
- Places individuals into one of several categories
- Quantitative (numerical)
- Numerical values for which arithmetic operations
such as adding and averaging make sense
4How would you categorize these variables?
- Height
- Hours of sleep
- Hair color
- Eye color
- Ever taken stats course
- Heart rate
- 7-point Likert scale
5Distribution of a variable
- Tells us what values a variable takes and how often it takes them
- Start data analyses by exploring distributions of single variables with a graph
- Later, move on to studying relationships between variables
6Displaying distributions
- Categorical
- Pie charts
- Bar graphs
- Quantitative
- Histograms
- Stem-and-leaf plots
- Boxplots
7Graphs vs. tables
- Graphs
- A picture is worth a thousand words
- Can show ALL the data points
- Visual impact of data (presentations)
- Tables
- More efficient (usually)
- Show actual values (more precision)
- Easier to produce (historically)
8Categorical Variables Example: U.S. Solid Waste (2000)
Pie Chart
Difficult to do by hand; use a computer program (e.g., Excel) for production if necessary.
9Categorical Variables Example: U.S. Solid Waste (2000)
Bar Graph
Notes: 1) Bars do not touch (categorical data)
2) Can plot counts or percents
10Quantitative variables - histograms
- Divide data into class intervals of equal width
- Count how many observations fall in each interval
- Draw histogram
11Weight Data Histogram
[Histogram: x-axis = weight (pounds), y-axis = number of students]
12Interpreting histograms
- Shape
- Symmetric (bell, other)
- Asymmetric (right tail, left tail)
- Unimodal, bimodal (mode = a high point)
- Centre (find the middle position)
- Count observations (n)
- Find the middle position: (n + 1) / 2
- Find the middle value: the value that sits at the middle position
- Spread (from low to high)
- Outliers (values outside the regular pattern)
13Interpreting histograms Illustrative example: State population, Hispanic (Fig 1.3)
- Shape: asymmetrical with a right tail
- Center
- n = 50 states
- Middle position: (50 + 1) / 2 = 25.5
- Middle value is in the first category, so it is between 0 and 5
- Spread: from 0.7 (W. Virginia) to 42.1 (New Mexico)
- Outlier: New Mexico
14Interpreting histograms Illustrative example: Fig 1.4 in text
- Shape: symmetrical bell
- Center
- n = 100
- Middle position: (100 + 1) / 2 = 50.5
- Middle value: around 7
- Spread: from 2 to 12
- Outlier: 12 (maybe)
15Stem-and-leaf plots
- For quantitative variables
- Separate each value into a stem value (the first part of the number) and a leaf value (the remaining part of the number)
- Create a stem axis
- Write each leaf to the right of its stem value
- Place leaves in rank order
- Interpretation: like a histogram on its side
16Weight Data Stemplot (Stem-and-Leaf Plot)
10 | 0166
11 | 009
12 | 0034578
13 | 00359
14 | 08
15 | 00257
16 | 555
17 | 000255
18 | 000055567
19 | 245
20 | 3
21 | 025
22 | 0
23 |
24 |
25 |
26 | 0
Key: 20 | 3 means 203 pounds. Stems = 10s; leaves = 1s.
17Interpretation: like a histogram on its side
[Stemplot repeated from the previous slide]
- Shape: positive skew
- Center
- n = 53
- Middle position: (53 + 1) / 2 = 27
- Middle value: 165
- Spread: from 100 to 260
- Outlier: 260
18Boxplot
- Central box spans Q1 to Q3
- A line in the box marks the median
- Lines extend from the box out to the minimum and
maximum
19Weight Data Boxplot
20Boxplots are especially useful for comparing groups
21Numerical summaries
- Centre
- Median
- Mean
- Spread
- Quartiles (and IQR)
- Standard deviation (and variance)
22Mean (Arithmetic Average)
- Traditional measure of center
- Notation: x̄ (read "x-bar")
- Sum the values and divide by the sample size (n), as shown below
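x̄ = (x₁ + x₂ + … + xₙ) / n = (Σ xᵢ) / n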
23Median
- Half the ordered values are less than or equal to the median (and half are greater)
- If n is odd, the median is the middle ordered value
- If n is even, the median is the average of the two middle ordered values
24Comparing the mean and median
- Mean ≈ median when data are symmetrical
- Mean ≠ median when data are skewed (the mean is pulled toward the long tail); see the sketch below
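A minimal sketch of this contrast, using Python's standard library on hypothetical right-skewed values:

```python
# Mean vs. median on right-skewed data: one extreme value pulls the
# mean toward the tail while the median barely moves.
from statistics import mean, median

values = [22, 25, 26, 28, 30, 31, 33, 35, 40, 250]
print(mean(values))    # 52.0  -- dragged up by the outlier 250
print(median(values))  # 30.5  -- resistant to the outlier
```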
25Spread variability
- Variability: the amount the values spread above and below the centre
- Can be measured in several ways
- Range
- Quartiles and inter-quartile range
- Variance and standard deviation
26Range
- Range = maximum − minimum
- The range is NOT a reliable measure of spread
27Quartiles
- The quartiles are the 3 numbers that divide the ordered data into 4 equally sized groups
- Q1 has 25% of the data below it
- Q2 has 50% of the data below it (median)
- Q3 has 75% of the data below it
28Obtaining the quartiles
- Order the data
- Find the median (this is Q2)
- Look at the lower half of the data (those below the median)
- The median of this lower half = Q1
- Look at the upper half of the data (those above the median)
- The median of this upper half = Q3 (see the sketch below)
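A minimal sketch of this procedure in Python, using the "median of each half" rule described above (statistical software may use slightly different quartile definitions):

```python
# Five-number summary: min, Q1, median, Q3, max.
def median(vals):
    m = len(vals)
    mid = m // 2
    return vals[mid] if m % 2 else (vals[mid - 1] + vals[mid]) / 2

def five_number_summary(data):
    x = sorted(data)
    n = len(x)
    lower = x[:n // 2]         # values below the median
    upper = x[(n + 1) // 2:]   # values above the median
    return x[0], median(lower), median(x), median(upper), x[-1]

print(five_number_summary([1, 3, 4, 7, 8, 9, 12]))  # (1, 3, 7, 9, 12)
```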
29 5-number summary
- Minimum
- Q1
- Median (Q2)
- Q3
- Maximum
- Note
- IQR = Q3 − Q1
- The IQR gives the spread of the middle 50% of the data
30Variance and standard deviation
- The most common measures of spread
- Based on deviations around the mean
- Each data value has a deviation from the mean: xᵢ − x̄
31Variance Formula
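s² = Σ(xᵢ − x̄)² / (n − 1)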
32Standard Deviation: the square root of the variance
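s = √s² = √[ Σ(xᵢ − x̄)² / (n − 1) ]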
33Choosing summary statistics
- Use the mean and standard deviation for reasonably symmetric distributions that are free of outliers
- Use the median and IQR (or the 5-number summary) when data are skewed or when outliers are present
34The normal distribution
35Who is this??
36Mathematical formula of a normal curve
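f(x) = [1 / (σ√(2π))] · e^(−(x − µ)² / (2σ²))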
37Normal curves
- Bell-shaped
- Not too steep, not too flat
- Defined by their means and standard deviations
38Normal Curves
- The mean and standard deviation computed from actual observations (data) are denoted by x̄ and s, respectively.
- The mean and standard deviation of the distribution represented by the density curve are denoted by µ (mu) and σ (sigma), respectively.
39Bell-Shaped Curve: The Normal Distribution
[Figure: Normal curve with the mean and one standard deviation marked]
40The Normal Distribution
- The mean µ defines the center of the curve
- The standard deviation σ defines the spread
- Notation: N(µ, σ)
41 68-95-99.7 Rule for Any Normal Curve
- 68% of the observations fall within one standard deviation of the mean
- 95% of the observations fall within two standard deviations of the mean
- 99.7% of the observations fall within three standard deviations of the mean
42 68-95-99.7 Rule for Any Normal Curve
43Standard Normal (Z) Distribution
- The Standard Normal distribution has mean 0 and standard deviation 1
- We call this a Z distribution: Z ~ N(0, 1)
- Any Normal variable X can be turned into a Z variable (standardized) by subtracting µ and dividing by σ, as shown below
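z = (x − µ) / σ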
44Standard Normal Table
45Statistical tests of normality
- Kolmogorov-Smirnov test
- Anderson-Darling test
- Shapiro-Wilk test
- D'Agostino-Pearson omnibus test (see the sketch below)
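A minimal sketch of these tests as implemented in SciPy; the data here are simulated purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=1, size=200)   # simulated Normal data

print(stats.shapiro(x))         # Shapiro-Wilk
print(stats.kstest(x, "norm"))  # Kolmogorov-Smirnov vs. N(0, 1)
print(stats.anderson(x))        # Anderson-Darling
print(stats.normaltest(x))      # D'Agostino-Pearson omnibus
```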
46Idea of probability
- Probability is the science of chance behavior
- Chance behavior is unpredictable in the short run, but has a predictable pattern in the long run
- A phenomenon is random if individual outcomes are uncertain but there is a predictable distribution of outcomes in a large number of repetitions. The probability of any outcome of a random phenomenon can be defined as the proportion of times the outcome would occur in a very long series of repetitions.
47How probabilities behave
Eventually, the proportion of heads in fair coin
tosses approaches 0.5
48Recall the Normal curve
- We use the Normal density curve to determine
probabilities
49Normal probability distribution
Individuals with X such that x1 < X < x2
The shaded area under the density curve shows the
proportion, or percent, of individuals in the
population with values of X between x1 and x2.
Because the probability of drawing one individual
at random depends on the frequency of this type
of individual in the population, the probability
is also the shaded area under the curve.
50Normal probability distribution
A variable whose value is a number resulting from
a random process is a random variable. The
probability distribution of many random variables
is the normal distribution. It shows what values
the random variable can take and is used to
assign probabilities to those values.
Example: Probability distribution of women's heights. Here, since we choose a woman at random, her height, X, is a random variable.
To calculate probabilities with the normal
distribution, we will standardize the random
variable
51Men's Height Example (NHANES, 1980)
- What proportion of men are less than 68 inches tall?
[Figure: Normal curve with 68 inches at z = −0.71 and the mean at z = 0 (standardized values)]
52Standardized Scores
- How many standard deviations is 68 from µ on X ~ N(70, 2.8)?
- z = (x − µ) / σ
- = (68 − 70) / 2.8 = −0.71
- The value 68 is 0.71 standard deviations below the mean of 70
53Table A: Standard Normal Probabilities
[Table lookup: row −0.7, column .01 → Pr(Z < −0.71) = .2389]
54Men's Height Example (NHANES, 1980)
- What proportion of men are greater than 68 inches tall?
- The area under the curve sums to 1, so Pr(X > x) = 1 − Pr(X < x), as shown below
1 − .2389 = .7611
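A minimal check of this example with SciPy; the exact answer differs slightly from the table value because the table uses the rounded z = −0.71:

```python
from scipy.stats import norm

p_less = norm.cdf(68, loc=70, scale=2.8)  # Pr(X < 68), about 0.2375
print(p_less, 1 - p_less)                 # about 0.2375 and 0.7625
```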
55Reminder: standardizing N(µ, σ)
We standardize Normal data by calculating z-scores.
Any N(µ, σ) can be standardized to N(0, 1).
56Distribution of women's heights
N(µ, σ) = N(64.5, 2.5)
Example: What is the proportion of women with a height between 57" and 72"? That is within 3 standard deviations σ of the mean µ, so that proportion is roughly 99.7%.
Since about 99.7% of all women have heights between 57" and 72", the chance of picking one woman at random with a height in that range is also about 99.7%.
57What is the probability, if we pick one woman at random, that her height will be some value X? For instance, between 68" and 70": P(68 < X < 70)? Because the woman is selected at random, X is a random variable.
N(µ, σ) = N(64.5, 2.5)
P(X < 68) = 0.9192 and P(X < 70) = 0.9861
The area under the curve for the interval 68" to 70" is 0.9861 − 0.9192 = 0.0669. Thus, the probability that a randomly chosen woman falls into this range is 6.69%.
58Odds and probability
59Standard 2x2 table
                          Outcome/Disease
                          Yes    No
Exposure/Treatment  Yes    a      b
                    No     c      d
60Calculations from 2x2 table
- RR = [a/(a+b)] / [c/(c+d)]
- RRR = { [c/(c+d)] − [a/(a+b)] } / [c/(c+d)]
- ARR = c/(c+d) − a/(a+b)
- NNT = 1 / ARR
- OR = (a/b) / (c/d) = ad / (bc)
(see the sketch below)
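A minimal sketch computing these measures in Python from a hypothetical 2x2 table (row 1 = treated: a events, b non-events; row 2 = control: c events, d non-events):

```python
a, b, c, d = 15, 85, 30, 70   # hypothetical counts

risk_treated = a / (a + b)    # 0.15
risk_control = c / (c + d)    # 0.30

RR  = risk_treated / risk_control   # relative risk: 0.5
ARR = risk_control - risk_treated   # absolute risk reduction: 0.15
RRR = ARR / risk_control            # relative risk reduction: 0.5
NNT = 1 / ARR                       # number needed to treat: ~6.7
OR  = (a / b) / (c / d)             # odds ratio ad/(bc): ~0.41

print(RR, RRR, ARR, NNT, OR)
```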
61Why use odds ratios?
- Perfectly good measure of association
- Can be estimated from a case-control study (RR cannot)
- Offers advantages in meta-analysis
- Easier to model than RD or RR
- As the risk falls, the odds and risk come closer together: for low event rates, the OR and RR are very close
62Odds and probability/risk
- Odds are an alternative way of describing the chance of an event
- Odds = Prob / (1 − Prob)
- Rearranging:
- Prob = Odds / (1 + Odds)
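For example, a probability of 0.20 gives odds of 0.20 / 0.80 = 0.25; converting back, 0.25 / 1.25 = 0.20.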
63Risk and odds
64Producing data: sampling
65Inference!
- We often want answers to questions about a large group of individuals: this is the population
- We seldom study the entire population, but instead select a subset of the population: this is the sample
- We use inferential techniques to draw conclusions about the population based on the data in the sample
66Two types of studies
- Observational: individuals are studied without intervention
- E.g., case-control and cohort studies
- Experimental: the investigator assigns an explanatory factor to some subjects but not to others
- E.g., clinical trials
67Observations vs. Experiments
- Both types of studies may be used to learn about relationships between variables
- Experimental studies are better suited for determining cause-and-effect because they can deal with confounding variables via randomization
68Sample Quality
- To do a good study, you need a good sample
- Poor quality samples produce misleading results
- Study should be designed to generate high quality
data that can then be used to infer population
characteristics
69Examples of Poor Quality Sampling Designs
- Voluntary response sampling
- Allows individuals to choose to be in the study
- e.g., call-in polls (pp. 178-9 in text)
- Convenience sampling
- Individuals who are easiest to reach are selected
- e.g., interviewing at the mall (p. 179)
- These techniques favor certain outcomes and cannot be trusted to reveal population characteristics (sampling bias)
70Simple Random Sample (SRS)
- To avoid biased sampling, use impersonal chance mechanisms as the basis for selection
- Simple Random Sample (SRS):
- (1) Each individual in the population has the same chance of being selected
- (2) Every possible sample has an equal chance of being chosen
71Methods for selecting SRSs
- Physical, e.g., pick numbers from a hat
- Computerized random number generators
- Use a table of random digits
72Picking an SRS (Illustration)
- Suppose 30 individuals are labeled 01-30 and we want to select two at random
- Random digit table:
- Select a row of the table at random
- Break the digits into couples: 68 41 73 50 13 15 52 9...
- Skipping couples outside 01-30, the first two individuals in the sample are 13 and 15 (see the sketch below)
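A minimal sketch of the same idea with a computerized random number generator instead of a digit table:

```python
import random

random.seed(1)  # seeded only so the illustration is reproducible
sample = random.sample(range(1, 31), k=2)  # 2 of the 30 individuals
print(sample)
```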
73Producing data: experiments
74Experimentation
- In an experiment, the investigator exposes some individuals to the explanatory factor but not others, and then measures the effect on the response variable.
- In an observational (non-experimental) study, individuals are studied without the imposition of an intervention, creating greater opportunity for confounding.
75Comparison
Comparison is the first principle of experimentation
- The effects of treatment can be judged only in relation to what would happen in its absence (all other things being equal)
- You cannot assess the effects of a treatment without a proper comparison group because:
- Many factors contribute to a response
- Conditions change on their own over time
- People are open to suggestion (placebo effect)
- Observation changes behavior (Hawthorne effect)
76Randomization
Randomization is the second principle of experimentation
- Randomization: the use of chance mechanisms to assign treatments
- Randomization balances confounding variables among treatment groups
77Blinding
Blinding is the third principle of experimentation
- Blinding: assessment of the response in subjects is made without knowledge of which treatment they are receiving
- Single blinding: subjects are unaware of their treatment group
- Double blinding: both subjects and investigators are blinded
78The Logic of Randomization
- Randomization ensures that differences in the response are due to either:
- Treatment
- Chance in the assignment of treatments
- If an experiment finds a difference among groups, we then ask whether this difference is due to the treatment or due to chance
- If the observed difference is larger than what would be expected just by chance, then we say it is statistically significant
79The logic of randomization
- Consider an experiment on weight gain in laboratory rats
- Just by luck, some faster-growing rats are going to end up in one group or the other
- If we assign many rats to each group, the effects of chance will balance out
- Use enough controls to balance out chance differences
80Sampling distribution of means
81Parameters and Statistics
- Parameter: a fixed number that describes the location or spread of a population (the value of the parameter is NOT known)
- Statistic: a number calculated from data in the sample (the value of the statistic IS known once it is calculated)
- Sampling variability: different samples or experiments from the same population yield different values of the same statistic
82Parameters and statistics
- The mean of a population is denoted µ → this is a parameter
- The mean of a sample is called x-bar → this is a statistic
- Illustration:
- The average age of all U of O students (µ) is 26.5
- An SRS of 10 U of O students yields an average age (x-bar) of 22.3
- x-bar and µ are related but are not the same thing!
83Law of Large Numbers
The figure to the right demonstrates the law of
large numbers. The average of the first 50 or so
observations is unreliable. As n increases, the
sample mean becomes a better reflection of the
population mean.
84Sampling distribution of xbar
Key questions: What would happen if we took many samples or repeated the experiment many times? How would this affect the statistics calculated from such samples?
85Case Study: Does This Wine Smell Bad?
- For a variable with µ = 25 µg/L and σ = 7 µg/L and a Normal distribution, suppose we take 1,000 samples, each of n = 10, from this population, calculate x-bar for each sample, and plot the x-bars as a histogram
86The distribution of all x-bars
x-bar is an unbiased estimate of µ.
Averages are less variable than individual observations (see the simulation sketch below).
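A minimal simulation of the wine case study, showing both properties:

```python
import numpy as np

rng = np.random.default_rng(0)
# 1,000 samples of n = 10 from N(mu = 25, sigma = 7); one x-bar each.
xbars = rng.normal(loc=25, scale=7, size=(1000, 10)).mean(axis=1)

print(xbars.mean())       # close to mu = 25 (unbiased)
print(xbars.std(ddof=1))  # close to 7 / sqrt(10), about 2.21
```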
87Central Limit Theorem
No matter the shape of the population, the
distribution of x-bars will tend to be Normal
when the sample is large.
88Central Limit Theorem Illustrative example: time to perform an activity
- Data: time to perform an activity (hours)
- Population NOT Normal (Fig a), with µ = 1 and σ = 1
- Fig (b) is for x-bars based on n = 2
- Fig (c) is for x-bars based on n = 10
- Fig (d) is for x-bars based on n = 25
- The distributions become increasingly Normal because of the Central Limit Theorem (see the simulation sketch below)
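A minimal simulation in the same spirit: the exponential distribution with mean 1 also has standard deviation 1 and is strongly right-skewed, so it stands in here for the activity-time population:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (2, 10, 25):
    xbars = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    z = (xbars - xbars.mean()) / xbars.std()
    # Sample skewness: shrinks toward 0 (the Normal value) as n grows.
    print(n, (z ** 3).mean())
```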
89Confidence intervals: the basics
90Statistical Inference
- Two types of statistical inference
- Confidence Intervals
- Tests of Significance
91Confidence IntervalMean of a Normal Population
- Take an SRS of size n from a Normal population
with unknown mean m and known standard deviation
s. A level C confidence interval for m is
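x̄ ± z* (σ / √n), where z* is the critical value for confidence level C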
92Confidence Interval: Mean of a Normal Population
93How Confidence Intervals Behave
- The margin of error is m = z* (σ / √n)
- The margin of error gets smaller, resulting in more accurate inference:
- when n gets larger
- when z* gets smaller (the confidence level gets smaller)
- when σ gets smaller (less variation)
94Interpretation of a confidence interval
- We are (1 − α) × 100% confident that the true value of µ lies in the interval µ_L to µ_H
- If we used the interval many times, then (1 − α) × 100% of the time it would cover the true value of µ
- The main purpose of a CI is to estimate an unknown parameter with an indication of how accurate the estimate is and of how confident we are that the result is correct
95Level of Confidence (C)
- Confidence level = the success rate of the method
- e.g., a 95% CI says we got this interval by a method that gives correct results 95% of the time (next slide)
- The most common levels of confidence are 90%, 95%, and 99%
96Common MISinterpretations of a CI
- The mean µ will lie within the interval with probability 0.95
- µ is in this interval with probability 0.95
- The mean of a future sample from this population will lie in the interval
- 95% of the data will lie in the interval
97Factors that influence a CI
- The higher the level of confidence, the wider the CI
- The larger the variability in the sample, the wider the CI
- The larger the sample size, the narrower the CI
98Tests of significance: the basics
99Recall basics about inference
- Goal: to generalize from the sample (statistic) to the population (parameter)
- Two forms of inference:
- Confidence intervals
- Significance testing
- Both CIs and significance testing are based on the idea of a sampling distribution
100Stating Hypotheses
- The goal of this procedure is to quantify the evidence against a claim of no difference
- The claim being tested is called the null hypothesis, H0
- The null hypothesis is contradicted by the alternative hypothesis, Ha (which indicates that there is a difference)
- The test is designed to assess the strength of evidence against the null hypothesis
101Test hypotheses
- Null: H0: µ = µ0
- One-sided alternatives:
- Ha: µ > µ0
- Ha: µ < µ0
- Two-sided alternative:
- Ha: µ ≠ µ0
102Hypothesis testing
103P-value
- The P-value is the probability that the test statistic would take a value as extreme as, or more extreme than, the value observed if the null hypothesis were true
- The smaller the P-value, the stronger the evidence the data provide against the null hypothesis
104P-value
- A measure of the strength of evidence against the null hypothesis
- Large p-values indicate that H0 is quite plausible given the data
- Small p-values indicate that H0 is implausible; the data are inconsistent with H0
- Rejecting the null hypothesis usually implies that a treatment effect or a real difference exists
- Strength of evidence is a continuous spectrum, but we tend to use p < 0.05 as the point at which we reject H0
- The value of p which just rejects H0 is called α, the Type I error: we will inappropriately reject H0 with risk α
- Always compute an exact p-value if possible (as opposed to saying p < 0.05 or 0.025 < p < 0.05)
105Strength of evidence
106Statistical Significance
- If the P-value is as small as or smaller than the significance level α (i.e., P-value ≤ α), then we say that the data are statistically significant at level α
- If we choose α = 0.05, we are requiring that the data give evidence against H0 so strong that it would occur no more than 5% of the time when H0 is true
- If we choose α = 0.01, we are insisting on stronger evidence against H0, evidence so strong that it would occur only 1% of the time when H0 is true
107One vs. two-sided tests
- Determined by the alternative hypothesis
- A two-sided test implies that we are interested in detecting departures from H0 in BOTH directions
- A one-sided test implies that we are specifying a DIRECTIONAL alternative hypothesis (usually dictated by our understanding of the biology)
- The direction of the effect must be specified a priori
108One vs. two-sided tests
- Some people refuse to accept that one-sided tests are legitimate (e.g., NEJM, FDA)
- "Statistical trickery"
- It is always possible that the treatment has a negative effect
- But:
- Perfectly acceptable statistically speaking
- Accepting H0 does not rule out a negative treatment effect
- It is legitimate, and often desirable, not to have to prove that treatment is harmful
109Inference about a population mean
110Conditions for Inferenceabout a Mean
- Data are an SRS of size n
- The population has a Normal distribution with mean µ and standard deviation σ (both µ and σ are unknown)
- Because σ is unknown (the realistic situation), we can NOT use z procedures for our confidence intervals and significance tests
111Standard Error
When we do not know the population standard deviation σ, we use the sample standard deviation s to calculate the standard error:
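SE = s / √n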
This is called the standard error of the mean
112One-Sample t Statistic
- The one-sample z statistic now becomes a one-sample t statistic (below)
- The t statistic does NOT follow a Normal distribution
- It follows a t distribution with n − 1 degrees of freedom
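t = (x̄ − µ0) / (s / √n)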
113The t Distributions
- t distributions are a family of distributions similar to the Standard Normal Z distribution
- Each member of the family is identified by its degrees of freedom
- Notation: t(k) denotes a t distribution with k degrees of freedom
114The t Distributions
As k increases, the t(k) curve approaches the Z curve, because as n increases, s becomes a better estimate of σ.
115t Table
Table C gives t critical values with upper tail probability p and the corresponding confidence level C.
The bottom row of the table applies to z, because a t with infinite degrees of freedom is a standard Normal (Z) variable.
116One-Sample t Confidence Interval
- Take an SRS of size n from a population with unknown mean µ and unknown standard deviation σ. A level C confidence interval for µ is given by x̄ ± t* (s / √n), where t* is the critical value with n − 1 degrees of freedom for confidence level C (from the t table)
117One-Sample t Test
- The t test is similar in form to the z test learned earlier. The test statistic is t = (x̄ − µ0) / (s / √n).
The test statistic has n − 1 degrees of freedom. Get the approximate P-value from the t table.
118Matched Pairs t Procedures
- Matched-pair samples allow us to compare responses to two treatments within paired couples
- Apply the one-sample t procedures to the observed differences within pairs
- The parameter µ is the mean difference in the responses to the two treatments within matched pairs in the entire population
119Case Study: Matched Pairs, Air Pollution
- Pollution index measurements were recorded for two areas of a city on each of 8 days
- To analyze, subtract Area B levels from Area A levels; the 8 differences form a single sample
- Are the average pollution levels the same for the two areas of the city? (see the sketch below)
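A minimal sketch of this analysis with SciPy; the readings below are hypothetical stand-ins for the actual pollution data:

```python
from scipy import stats

area_a = [2.92, 1.88, 5.35, 3.81, 4.69, 4.86, 5.81, 5.55]
area_b = [1.84, 0.95, 4.26, 3.18, 3.44, 3.69, 4.95, 4.47]

# Two equivalent routes: a paired t-test, or a one-sample t-test
# on the within-pair differences against a mean of 0.
print(stats.ttest_rel(area_a, area_b))
diffs = [a - b for a, b in zip(area_a, area_b)]
print(stats.ttest_1samp(diffs, popmean=0))
```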
120Normality Assumption: t Procedure Robustness
- t procedures produce exact results when the population is Normal. They are robust, producing almost exact confidence intervals and P-values even when the Normality condition is only approximately met:
- Sample size less than 15: use t procedures if the data appear about Normal (symmetric, single peak, no outliers). If the data are skewed or if outliers are present, do not use t.
- Sample size at least 15: the t procedures can be used except in the presence of outliers or strong skewness in the data.
- Large samples: the t procedures can be used even for clearly skewed distributions when the sample is large, roughly n ≥ 40.
121Can we use a t procedure?
A moderately sized data set (n = 20) with strong negative skew: t procedures cannot be trusted.
122Can we use t?
- This histogram shows the distribution of word lengths in Shakespeare's plays. The sample is very large.
- The data have a strong positive skew, but there are no outliers. We can use the t procedures since n ≥ 40.
123Can we use t?
The distribution has no clear violations of
Normality. Therefore, we trust the t procedure.
124 2-sample problems
125Conditions for inference: comparing two means
- We have two independent SRSs (simple random samples) coming from two distinct populations (like men vs. women), with (µ1, σ1) and (µ2, σ2) unknown
- Both populations should be Normally distributed. However, in practice, it is enough that the two distributions have similar shapes and that the sample data contain no strong outliers.
126Two-sample t-test
- The null hypothesis is that the two population means µ1 and µ2 are equal; thus their difference is equal to zero
- H0: µ1 = µ2, i.e., µ1 − µ2 = 0
- with either a one-sided or a two-sided alternative hypothesis
- We find how many standard errors (SE) away from (µ1 − µ2) the observed (x̄1 − x̄2) falls by standardizing with t, where SE = √(s1²/n1 + s2²/n2)
- Because in a two-sample test H0 posits (µ1 − µ2) = 0, we simply use t = (x̄1 − x̄2) / SE
- with df = smallest of (n1 − 1, n2 − 1) (see the sketch below)
127Two sample t-confidence interval
- Because we have two independent samples we use
the difference between both sample averages (
1 - 2) to estimate (m1 - m2).
- Practical use of t t
- C is the area between -t and t.
- We find t in the line of Table C for df
smallest (n1-1 n2-1) and the column for
confidence level C. - The margin of error m is
128Robustness
- The two-sample statistic is most robust when both sample sizes are equal and both sample distributions are similar. But even when we deviate from this, two-sample tests tend to remain quite robust.
- As a guideline, a combined sample size (n1 + n2) of 40 or more will allow you to work with even the most skewed distributions.
129Two-sample test assuming equal variance
- There are two versions of the two-sample t-test: one assuming equal variance (the "pooled" two-sample test) and one not assuming equal variance (the unequal-variance test) for the two populations. You may have noticed slightly different formulas and degrees of freedom.
The pooled (equal-variance) two-sample t-test was often used before computers because it has exactly the t distribution with n1 + n2 − 2 degrees of freedom. However, the assumption of equal variance is hard to check, and thus the unequal-variance test is safer.
[Figure: two normally distributed populations with unequal variances]
130Comparing two standard deviations
- It is also possible to compare two population standard deviations σ1 and σ2 by comparing the standard deviations of two SRSs. However, the procedures are not at all robust against deviations from normality.
- When s1² and s2² are sample variances from independent SRSs of sizes n1 and n2 drawn from normal populations, the F-statistic F = s1² / s2²
- has the F distribution with n1 − 1 and n2 − 1 degrees of freedom when H0: σ1 = σ2 is true
- The F-value is then compared with critical values from Table D for the P-value with a one-sided alternative; this P-value is doubled for a two-sided alternative (see the sketch below)
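A minimal sketch of this variance-ratio F-test in Python on hypothetical data (remember it is not robust to non-normality):

```python
import numpy as np
from scipy import stats

x1 = np.array([12.1, 14.3, 11.8, 13.5, 15.0, 12.7])  # larger spread
x2 = np.array([11.9, 12.2, 12.0, 12.4, 11.8, 12.1])

F = x1.var(ddof=1) / x2.var(ddof=1)   # s1^2 / s2^2
p_one = stats.f.sf(F, dfn=len(x1) - 1, dfd=len(x2) - 1)
print(F, 2 * p_one)   # P-value doubled for a two-sided alternative
```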
131Proportions
132Two-Way Tables
- Cross-tabulate counts to form a two-way table
- row variable
- column variable
- The count of observations falling into each combination of categories goes in the table's cells
- Counts are totaled to create marginal totals
133 2 independent samples
- Described as a 2 x 2 table
- Exact test
- Fisher's Exact test
- Approximate tests
- Z-test
- Chi-squared test
- Summary measures
- RD (risk difference): H0: RD = 0
- RR: H0: RR = 1
- OR: H0: OR = 1
134Fisher's test or Chi-square?
- Either can be used for analyzing contingency tables with 2 rows and 2 columns
- Fisher's test is the best choice, as it always gives the exact P value
- The chi-square test is simpler to calculate but yields only an approximate P value
- If using a computer, choose Fisher's test
- Avoid chi-square when the numbers in the contingency table are very small (< 6), as in the sketch below
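A minimal sketch of both tests in SciPy on a hypothetical 2x2 table with small counts, where the exact test is the safer choice:

```python
import numpy as np
from scipy import stats

table = np.array([[8, 2],
                  [1, 5]])   # hypothetical small-count table

odds_ratio, p_exact = stats.fisher_exact(table)        # exact P value
chi2, p_approx, df, expected = stats.chi2_contingency(table)
print(p_exact, p_approx)   # the approximation is shaky at these counts
```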
135 2 related (paired) samples
- Exact test
- Based on the binomial distribution
- Approximate test
- McNemar's Chi-squared test
- Summary measure
- OR estimator
- OR confidence interval
136 2 independent stratified samples
- 2 x 2 x k table
- Estimation of the OR
- Mantel-Haenszel chi-square (good because it can handle cells with 0)
- Woolf (precision-weighted) chi-square (most commonly used for meta-analysis because the formulas are simple, but not as good as M-H)
- Peto (O-E-V) chi-square
- Generalized M-H and P-W estimators
- Tests for homogeneity over strata, i.e., can ORs be pooled across strata?
- Exact test
- Zelen's test
- Approximate chi-squared tests
- Breslow-Day
- Woolf/precision-weighted
137Recommended analysis of 2 x 2 x k tables
- Choose a suitable effect measure (RD, RR, or OR)
- Test for homogeneity of the stratum-specific effect measures
- Compute the summary effect measure estimate and its associated CI
- Test for association
138Sample size calculation
139 http://statpages.org/
- (for all your statistical needs, including sample size calculations)
140Nonparametric tests
141Non-parametric tests
- Make no distributional assumptions about the data (non-normal data)
- The main focus is the p-value; they don't lend themselves to CIs or sample size calculations
- Sign test: equivalent to a one-sample test
- Wilcoxon Rank Sum (Mann-Whitney test): equivalent to the independent two-sample t-test
- Wilcoxon Signed Rank: equivalent to the paired t-test (see the sketch below)
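A minimal sketch of these tests in SciPy on hypothetical data; the sign test is built here from a binomial test on the signs of the differences:

```python
from scipy import stats

x = [1.1, 2.3, 1.9, 3.4, 2.8, 2.0]
y = [2.9, 3.8, 3.1, 4.4, 3.6, 4.0]

print(stats.mannwhitneyu(x, y))  # Wilcoxon rank-sum / Mann-Whitney
print(stats.wilcoxon(x, y))      # Wilcoxon signed-rank (paired)

# Sign test: count positive within-pair differences, binomial p = 0.5.
diffs = [b - a for a, b in zip(x, y)]
n_pos = sum(d > 0 for d in diffs)
print(stats.binomtest(n_pos, n=len(diffs), p=0.5))
```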
142Non-parametric tests - advantages
- Fewer assumptions are required (i.e., no distributional assumptions or assumptions about equality of variances)
- Only nominal (categorical) or ordinal (ranked) data are required, rather than numerical (interval) data
143Non-parametric tests - disadvantages
- They are less efficient
- Less powerful than their parametric counterparts
- Often lead to overestimation of the variances of test statistics when there are large proportions of tied observations
- They don't lend themselves easily to CIs and sample size calculations
- Interpretation of non-parametric results is quite hard
144Scatterplots and correlation
145Variable X and variable Y
- Relationship between 2 quantitative variables
- Explanatory variable: X
- Response variable: Y
- Does X cause Y?
146Scatterplot
- Start by plotting bivariate data points to make a
scatterplot
147Example of a scatterplot
X = students taking the SAT; Y = mean SAT verbal score. What is the relation between X and Y?
148Interpreting scatterplots
- Form: can the data be described by a straight line?
- Exceptions (outliers) to the form
- Direction: upward or downward
- Strength: the extent to which data points adhere to the trend line
149Examples of Forms
150Strength and direction
- Direction: positive, negative, or neither
- Strength: how closely do points adhere to the trend line?
- Close fitting → strong
- Loose fitting → weak
151Strength cannot be judged by eye alone
- These two scatterplots are of the same data set
- The second scatterplot looks like a stronger correlation, but this is an artifact of the axis scaling
152Correlation coefficient (r)
- r = the correlation coefficient
- r is always between −1 and 1, inclusive
- r = 1 → all points on an upward-sloping line
- r = −1 → all points on a downward-sloping line
- r = 0 → no line, or a horizontal line
- The closer r is to 1 or −1, the better the fit and the stronger the correlation
153Interpretation of r
Direction: positive or negative. Strength: the closer |r| is to 1, the stronger the correlation.
0.0 ≤ |r| < 0.3 → weak correlation
0.3 ≤ |r| < 0.7 → moderate correlation
0.7 ≤ |r| < 1.0 → strong correlation
|r| = 1.0 → perfect correlation
(see the sketch below)
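A minimal sketch computing r with SciPy on hypothetical data:

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

r, p_value = stats.pearsonr(x, y)
print(r)   # close to +1: a strong positive linear correlation
```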
154Not all Relationships are Linear: Miles per Gallon versus Speed
- r ≈ 0 (flat line) despite a strong non-linear relation
- But...
155Not all Relationships are Linear: Miles per Gallon versus Speed
- Very strong non-linear (curved) relationship
- r was misleading!
156Outliers and Correlation
The outlier in the above graph decreases r so that r ≈ 0. If we remove the outlier → a strong relation.
157Beware!
- Not all relations are linear
- Outliers can have large influence on r
- Lurking variables confound relations
158Regression
159Objectives of Regression
- Quantitative X (explanatory)
- Quantitative Y (response)
- Objectives
- Describe the change in Y per unit X
- Predict average Y at given X
160Equation for an algebraic line
Y = (intercept) + (slope)(X), or Y = (slope)(X) + (intercept)
Intercept → where the line crosses the Y axis. Slope → the steepness (angle) of the line.
161Equation for a regression line
- Algebraic line → every point falls on the line. Exact: Y = intercept + (slope)(X)
- Statistical line → scatter around the linear trend
- Predicted: Y = intercept + (slope)(X)
ŷ = a + bx, where ŷ (y-hat) is the predicted value of Y, a is the intercept, and b is the slope
162What Line Fits Best?
- The method we use to draw the best-fitting line is called the least squares method
- If we try to draw the line by eye, different people will draw different lines
163The least squares regression line
- Each point has a residual = observed y − predicted y
- = the distance of the point from the line predicted by the model
The least squares line minimizes the sum of the squared residuals
164Regression Line
- For the bird population data:
- a = 31.9343
- b = −0.3040
- The linear regression equation is:
- ŷ = 31.9343 − 0.3040x
The slope (−0.3040) represents the average change in Y per unit change in X (see the sketch below)
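A minimal sketch of least-squares fitting with SciPy; the x/y values here are hypothetical, not the bird-population data:

```python
from scipy import stats

x = [10, 20, 30, 40, 50, 60]
y = [29.0, 25.5, 23.0, 19.8, 16.2, 13.9]

fit = stats.linregress(x, y)
print(fit.intercept, fit.slope)        # a and b in y-hat = a + bx
print(fit.intercept + fit.slope * 35)  # predicted Y at X = 35
```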
165Cautions About Correlation and Regression
- Describe only linear relationships
- Beware of influential outliers
- Cannot predict beyond the range of X (do not extrapolate)
- Beware of lurking variables (variables other than X and Y)
- Association does not equal causation!
166Transformations
- Often we work with transformed data rather than the original numbers
- 1/x (reciprocal) for severely right-skewed data
- log(x) or ln(x) when skewed right
- Transforming data can:
- Make a skewed distribution more symmetric
- Make the distribution more normal-like
- Stabilize variability
- Linearize a relationship between 2 or more variables
- Show summary statistics in original units but test on the transformed scale
167Caution
- Even strong correlations may be non-causal (beware lurking variables!)
168Survival analysis
169One-sample case
- Time-to-event data
- Estimating the survival curve: the Kaplan-Meier (product-limit) method
- Inferences based on the survival curve
170Two-sample case
- Comparison at a fixed time
- Hazard rates
- Overall comparison
- Mantel-Haenszel method
- Logrank method
- Estimation of common hazard ratio
- Summary event rates
- Sample size
171What statistical test should I use?
172Inferential statistics
173Inferential statistics
174Multiple comparisons
- ANOVA, Kruskal-Wallis, and chi-square will test whether there are any differences between groups
- If doing multiple comparisons, you must correct:
- Bonferroni (multiply each p-value by the number of comparisons)
- Student-Newman-Keuls
- Tukey's
- Scheffé's F test
175Common errors in statistical interpretation
176Error 1
- The p-value is the probability of the null hypothesis being true based on the observed result
177Error 2
- Failure to reject the null hypothesis equals
acceptance of the null hypothesis
178Error 3
- Failure to reject the null hypothesis equals
rejection of the alternate hypothesis
179Error 4
- An α level of 0.05 is a standard with an objective basis
180Error 5
- The smaller the p-value, the larger the effect
181Error 6
- Statistical significance implies importance
182Error 7
- Data verify or refute theory