Title: Biostatistics Academic Preview
1 Biostatistics Academic Preview Session 2
Descriptive Statistics
2 Outline
- Descriptive statistics
- The what and why of descriptive statistics
- Types of variables
- Formulas and interpretations of commonly used descriptive statistics
- Pictorial representations of descriptive statistics
- Examining the relationship between two or more variables
3 Descriptive Statistics
- Used to describe the basic features of the data in the study
- Types of variables
- Summary statistics
- Distribution of variables
- Pictorial representation
- Allows you to get a feel for the data
4 Purpose of Descriptive Statistics
- Characterize subjects in a study
- Sample size
- Patterns of sampling
- Summary measures
- Distribution
- Finding errors in data collection or data entry
- Impossible, improbable, or inappropriate values
- Values too high or too low
- Outliers
- Strange combinations
- Missing data
- Response rates
5 Purpose (cont)
- Validity of assumptions
- Distribution
- Outliers
- Equal variance
- Linearity
- Hypothesis generation
- Exploring unanticipated effects
- Difference in effects across subgroups
- Characterization of dose response
- Linear
- Exponential
6 Types of descriptive statistics
- Univariate
- Describing one variable
- Bivariate
- Describing two variables simultaneously
- Trivariate
- Describing three variables simultaneously
7 Types of variables
8 Definitions
- Variable: a characteristic that changes or varies over time and/or across the different subjects under consideration.
- Changing over time
- Blood pressure, height, weight
- Changing across a population
- Gender, race/ethnicity
9 Definitions (cont)
- Quantitative variables (numeric) measure a numerical quantity or amount on each experimental unit
- Qualitative variables (categorical) measure a non-numeric quality or characteristic on each experimental unit by classifying each subject into a category
10 Quantitative variables
- Discrete variables can only take values from a list of possible values
- Number of co-morbidities
- Continuous variables can assume the infinitely many values corresponding to the points on a line interval
- Weight, height
11 Categorical variables
- Nominal: unordered categories
- Race/ethnicity
- Gender
- Ordinal: ordered categories
- Likert scales (disagree, neutral, agree)
- Income categories
12 Univariate statistics (numerical variables)
- Summary measures
- Measures of location
- Measures of spread
- Overall pattern (distribution)
- Unimodal (one major peak) vs. bimodal (two peaks)
- Symmetric vs. skewed
- Outliers: individual values that fall outside the overall pattern
13 Summary statistics: Measures of central tendency (location)
- Mean: the mean of a data set is the sum of the observations divided by the number of observations
- Population mean $\mu$; sample mean $\bar{x}$
- Median: the median of a data set is the middle value
- For an odd number of observations, the median is the observation exactly in the middle of the ordered list
- For an even number of observations, the median is the mean of the two middle observations in the ordered list
- Mode: the mode is the single most frequently occurring data value
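As a minimal sketch of the three definitions above, the following Python snippet computes the mean, median and mode with the standard library. The data values are made up for illustration and are not taken from the course data.

```python
from statistics import mean, median, mode

# Hypothetical sample: number of co-morbidities for 7 subjects (illustrative only).
values = [0, 1, 1, 2, 3, 1, 6]

sample_mean = mean(values)      # sum of the observations / number of observations
sample_median = median(values)  # middle value of the ordered list (odd n)
sample_mode = mode(values)      # most frequently occurring value

# With an even number of observations, the median is the mean
# of the two middle values of the ordered list.
even_median = median([1, 2, 3, 4])

print(sample_mean, sample_median, sample_mode, even_median)
```

Here the single large value (6) pulls the mean above the median, which is the pattern the skewness slide describes.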
14 Skewness
- The skewness of a distribution is measured by comparing the relative positions of the mean, median and mode.
- Distribution is symmetrical
- Mean = Median = Mode
- Distribution skewed right
- Median lies between mode and mean, and mode is less than mean
- Distribution skewed left
- Median lies between mode and mean, and mode is greater than mean
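A rough way to apply this comparison in practice is to compare the mean and the median of a sample. The sketch below does only that; it is an informal check, not a formal skewness statistic, and both samples are hypothetical.

```python
from statistics import mean, median

# Hypothetical right-skewed sample (for example, length of hospital stay in days).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 28]
# Mirror image of the same sample, which is therefore left-skewed.
left_skewed = [30 - x for x in right_skewed]

def mean_vs_median(data):
    """Report whether the mean lies above, below, or at the median."""
    m, med = mean(data), median(data)
    if m > med:
        return "mean > median"
    if m < med:
        return "mean < median"
    return "mean = median"

print(mean_vs_median(right_skewed))     # long right tail pulls the mean up
print(mean_vs_median(left_skewed))      # long left tail pulls the mean down
print(mean_vs_median([1, 2, 3, 4, 5]))  # symmetric sample
```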
15 Relative positions of the mean and median for (a) right-skewed, (b) symmetric, and (c) left-skewed distributions
Note: The mean is sensitive to skewness and outliers. If the data are markedly skewed, it is better to report the median as the measure of location.
16 Summary statistics: Measures of spread (scale)
- Variance: the average of the squared deviations of each sample value from the sample mean, except that instead of dividing the sum of the squared deviations by the sample size N, the sum is divided by N-1.
- $s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2$
- Standard deviation: the square root of the sample variance
- Range: the difference between the maximum and minimum values in the sample.
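The three measures of spread can be sketched directly from their definitions. The function names and the data below are illustrative assumptions; the only point is the N-1 divisor in the variance.

```python
from math import sqrt

def sample_variance(data):
    """Sum of squared deviations from the sample mean, divided by N - 1."""
    n = len(data)
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)

def sample_sd(data):
    """Square root of the sample variance."""
    return sqrt(sample_variance(data))

def sample_range(data):
    """Difference between the maximum and minimum values."""
    return max(data) - min(data)

# Hypothetical sample of 8 values with mean 5 (illustrative only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(sample_variance(data), sample_sd(data), sample_range(data))
```

The squared deviations here sum to 32, so the sample variance is 32/7 rather than 32/8.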
17 Normal curves: same mean but different standard deviation
18 Summary statistics: measures of spread (scale)
- We can describe the spread of a distribution by using percentiles.
- The pth percentile of a distribution is the value such that p percent of the observations fall at or below it.
- Median = 50th percentile
- Quartiles divide data into four equal parts.
- First quartile = Q1
- 25% of observations are below Q1 and 75% are above Q1
- Second quartile = Q2
- 50% of observations are below Q2 and 50% are above Q2
- Third quartile = Q3
- 75% of observations are below Q3 and 25% are above Q3
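One way to turn the percentile definition into code is the nearest-rank rule sketched below: take the smallest observed value that has at least p percent of the observations at or below it. This rule is an assumption made for illustration; statistical software uses several different interpolation conventions, so results can differ slightly. The data are hypothetical.

```python
from math import ceil

def percentile_nearest_rank(data, p):
    """Smallest observed value with at least p percent of observations at or below it."""
    if not 0 < p <= 100:
        raise ValueError("p must be greater than 0 and at most 100")
    ordered = sorted(data)
    rank = ceil(p / 100 * len(ordered))  # 1-based position in the ordered list
    return ordered[rank - 1]

# Hypothetical sample of 8 observations (illustrative only).
data = [12, 15, 11, 20, 18, 14, 16, 13]

q1 = percentile_nearest_rank(data, 25)
q2 = percentile_nearest_rank(data, 50)
q3 = percentile_nearest_rank(data, 75)
print(q1, q2, q3)
```

With an even number of observations, the nearest-rank 50th percentile (14 here) is not identical to the median computed as the mean of the two middle values (14.5), which shows how much the convention matters in small samples.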
19 Quartiles
20 Five number summary
- Maximum
- Minimum
- Median = 50th percentile
- Lower quartile Q1 = 25th percentile
- Upper quartile Q3 = 75th percentile
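The five numbers can be computed together, as in this sketch. It assumes one common textbook convention for quartiles (Q1 and Q3 are the medians of the lower and upper halves of the ordered data, leaving out the middle value when the sample size is odd); other conventions give slightly different quartiles. The function name and data are hypothetical.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a sample of at least 2 values."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values below the median position
    upper = ordered[half + (n % 2):]  # values above the median position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

# Hypothetical samples (illustrative only).
odd_sample = [13, 1, 7, 3, 11, 5, 9]
even_sample = [1, 2, 3, 4, 5, 6, 7, 8]

print(five_number_summary(odd_sample))
print(five_number_summary(even_sample))
```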
21 Graphical display of numerical variables (histogram)
Class Interval    Frequency
20-under 30       6
30-under 40       18
40-under 50       11
50-under 60       11
60-under 70       3
70-under 80       1
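A frequency table like the one above is what a histogram draws. The sketch below counts values into "lower-under upper" class intervals and prints a simple text histogram. The function name and the ages are hypothetical; the raw data behind the table above are not part of this transcript.

```python
from collections import Counter

def class_frequencies(values, start, width, n_classes):
    """Count values per class interval [lower, upper), i.e. 'lower-under upper'."""
    counts = Counter()
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_classes:  # values outside all classes are ignored
            counts[idx] += 1
    return [(start + i * width, start + (i + 1) * width, counts[i])
            for i in range(n_classes)]

# Hypothetical ages (illustrative only).
ages = [23, 27, 31, 35, 38, 42, 44, 51, 59, 63, 71]

table = class_frequencies(ages, start=20, width=10, n_classes=6)
for lower, upper, freq in table:
    print(f"{lower}-under {upper}: {'*' * freq} ({freq})")
```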
22 Graphical display of numerical variables (stem and leaf plot)
[Figure: stem and leaf plot of the raw data, with columns Stem and Leaf]
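To show how a stem and leaf plot is built, here is a small sketch for two-digit values, where the tens digit is the stem and the units digit is the leaf. The data are hypothetical and are not a reconstruction of the plot on this slide.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits).

    Assumes non-negative integers; stems with no observations are omitted.
    """
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

# Hypothetical raw data (illustrative only).
raw = [42, 21, 35, 47, 34, 58, 42]

plot = stem_and_leaf(raw)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Unlike a histogram, the plot keeps every original value: stem 4 with leaves 2, 2, 7 reads back as 42, 42, 47.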
23 Graphical display of numerical variables (box plot)
Median
24 Graphical display of numerical variables (box plot)
25 Univariate statistics (categorical variables)
- Summary measures
- Count = frequency
- Percent = (frequency / total sample) x 100
- The distribution of a categorical variable lists the categories and gives either a count or a percent of individuals who fall in each category
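The count and percent for each category can be tabulated as in the sketch below. The function name and the responses are hypothetical and serve only to illustrate the two summary measures.

```python
from collections import Counter

def frequency_table(categories):
    """Return {category: (count, percent of total sample)}."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

# Hypothetical Likert-scale responses (illustrative only).
responses = ["agree", "neutral", "agree", "disagree",
             "agree", "neutral", "agree", "disagree"]

table = frequency_table(responses)
for category, (count, percent) in table.items():
    print(f"{category}: {count} ({percent:.1f}%)")
```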
26 Displaying categorical variables
28 Bivariate relationships
- An extension of univariate descriptive statistics
- Used to detect evidence of association in the sample
- Two variables are said to be associated if the distribution of one variable differs across groups or values defined by the other variable
29 Bivariate Relationships
- Two quantitative variables
- Scatter plot
- Side by side stem and leaf plots
- Two qualitative variables
- Tables
- Bar charts
- One quantitative and one qualitative variable
- Side by side box plots
- Bar chart
30 Response and explanatory variables
- Response variable: the variable which we intend to model.
- We intend to explain it through statistical modeling
- Explanatory variable: the variable or variables which may be used to model the response variable
- Values may be related to the response variable
31 Two quantitative variables: Correlation
A relationship between two variables.
Explanatory (independent) variable x    Response (dependent) variable y
Hours of training                       Number of accidents
Shoe size                               Height
Cigarettes smoked per day               Lung capacity
Score on SAT                            Grade point average
Height                                  IQ
What type of relationship exists between the two variables, and is the correlation significant?
32 Scatter Plots and Types of Correlation
x = hours of training, y = number of accidents
Negative correlation: as x increases, y decreases
33 Scatter Plots and Types of Correlation
x = SAT score, y = GPA
Positive correlation: as x increases, y increases
34 Scatter Plots and Types of Correlation
x = height, y = IQ
No linear correlation
35 Correlation Coefficient
A measure of the strength and direction of a linear relationship between two variables.
The range of r is from -1 to 1.
If r is close to 1, there is a strong positive correlation.
If r is close to -1, there is a strong negative correlation.
If r is close to 0, there is little or no linear correlation.
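The slide does not give a formula for r. Assuming r denotes the usual Pearson product-moment correlation coefficient, it can be sketched as follows; the function name and the data are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads.

    Assumes paired lists of equal length (at least 2) that are not constant.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data (illustrative only).
hours_of_training = [1, 2, 3, 4, 5]
accidents = [10, 8, 6, 4, 2]  # falls steadily as training increases
scores = [2, 1, 4, 3, 5]      # tends to rise with x, but not perfectly

r_negative = pearson_r(hours_of_training, accidents)
r_positive = pearson_r(hours_of_training, scores)
print(r_negative, r_positive)
```

The first pair lies exactly on a falling straight line, so r is -1; the second pair only tends to rise together, so r is positive but below 1.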
36 Positive and negative correlation
- 1. If two variables x and y are positively correlated, this means that
- large values of x are associated with large values of y, and
- small values of x are associated with small values of y
- 2. If two variables x and y are negatively correlated, this means that
- large values of x are associated with small values of y, and
- small values of x are associated with large values of y
37 Positive correlation
38 Negative correlation
39 Two qualitative variables (Contingency Tables)
- Categorical data are usually displayed using a contingency table, which shows the frequency of each combination of categories observed in the data
- The rows correspond to the categories of the explanatory variable
- The columns correspond to the categories of the response variable
40 Example
- Aspirin and Heart Attacks
- Explanatory variable = drug received
- Placebo
- Aspirin
- Response variable = heart attack status
- Yes
- No
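A contingency table for this kind of example can be built by counting each combination of categories, with the explanatory variable in the rows and the response variable in the columns. The records below are made up to show the mechanics and are not results from any aspirin study; the function name is also hypothetical.

```python
from collections import Counter

def contingency_table(pairs, row_levels, col_levels):
    """Rows = explanatory categories, columns = response categories."""
    counts = Counter(pairs)
    return [[counts[(r, c)] for c in col_levels] for r in row_levels]

# Hypothetical (drug received, heart attack status) records, illustrative only.
records = [
    ("placebo", "yes"), ("placebo", "no"), ("placebo", "no"),
    ("aspirin", "no"), ("aspirin", "no"), ("aspirin", "yes"), ("aspirin", "no"),
]

rows = ["placebo", "aspirin"]
cols = ["yes", "no"]
table = contingency_table(records, rows, cols)

for label, row in zip(rows, table):
    print(label, dict(zip(cols, row)))
```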
41 Contingency table: heart attack example
42 Two qualitative variables
Marijuana use in college: x = parental use, y = student use
43 Case Study 1: Mean birth weight by race
44 One quantitative, one qualitative
Box plot of age by low birth weight
Mean age by low birth weight
45 Case Study 1: Birth weight and age
r = 0.09
46 Trivariate Relationships
- An extension of bivariate descriptive statistics
- We focus on description that helps us decide about the role variables might play in the ultimate statistical analyses
- Identify variables that can increase the precision of the data analysis used to assess associations between two other variables
47 Confounding and effect modification
- A factor, Z, is said to confound a relationship
between a risk factor, X, and an outcome, Y, if
it is not an effect modifier and the unadjusted
strength of the relationship between X and Y
differs from the common strength of the
relationship between X and Y for each level of Z.
- A factor, Z, is said to be an effect modifier of
a relationship between a risk factor, X, and an
outcome measure, Y, if the strength of the
relationship between the risk factor, X, and the
outcome, Y, varies among the levels of Z.
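The definition of confounding can be illustrated with a small numerical sketch. It assumes the strength of the X-Y relationship is measured by a risk difference (the slide does not name a measure), and all counts are invented so that the relationship is the same within each level of Z but different when the levels are pooled.

```python
# Hypothetical (events, total) counts by level of Z and exposure X (illustrative only).
counts = {
    "Z=0": {"exposed": (2, 20), "unexposed": (8, 80)},
    "Z=1": {"exposed": (40, 80), "unexposed": (10, 20)},
}

def risk(events, total):
    """Proportion of subjects with the outcome Y."""
    return events / total

def risk_difference(stratum):
    """Risk among the exposed minus risk among the unexposed."""
    return risk(*stratum["exposed"]) - risk(*stratum["unexposed"])

def pooled(group):
    """Add up events and totals over the levels of Z, ignoring Z."""
    events = sum(stratum[group][0] for stratum in counts.values())
    total = sum(stratum[group][1] for stratum in counts.values())
    return events, total

# Strength of the X-Y relationship within each level of Z.
stratum_rd = {z: risk_difference(stratum) for z, stratum in counts.items()}
# Unadjusted strength of the X-Y relationship.
crude_rd = risk(*pooled("exposed")) - risk(*pooled("unexposed"))

print(stratum_rd, crude_rd)
```

In these invented counts the risk difference is 0 at both levels of Z, so Z is not an effect modifier, yet the unadjusted risk difference is 0.24 because Z is related to both X and Y. That gap between the unadjusted and the common stratum-specific value is what the definition calls confounding.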
48 Example: confounding
- In our low birth weight data, suppose we wish to investigate the association between race and low birth weight.
- Our ability to detect this association might be affected by
- Smoking status being associated with low birth weight
- Smoking status being associated with race
49 Case Study 1: Race and smoking status
50 Case Study 1: Race, smoking status, LBW
Smokers
Non-smokers
51 Multivariate Statistics
- Allows one to calculate the association between an explanatory variable and an outcome of interest, after controlling for potential confounders.
- Allows one to assess the association between an outcome and multiple explanatory variables of interest.
Statistical Models
52 Next Session
- The what and why of statistical inference
- Statistical estimation and confidence intervals
- Statistical significance tests