Title: Describing Variation in Data (Chapter 9)
1 Describing Variation in Data (Chapter 9)
- Kimberly R. Barber, PhD
- HED 547
2 Variation
- Variation is present in every aspect of every characteristic
- In test measures, behaviors, environment.
- All characteristics have some level of variation.
- Level of variation depends on
- Source
- Population
- Accuracy of measure
3 Variation Defined
- Differences in the values of a characteristic.
- Variation always exists
- Because individuals differ,
- Because individuals will not all have the exact same value for every variable.
- Student height
- Student age
- Student blood pressure
4 Sources of Variation
- Biological differences
- Measurement errors
- Differences in measurement technique
- Differences in measurement conditions
- Random variation
5 Biological Variation
- Differences in
- Genes,
- Nutrition,
- Environmental exposures.
- Although people tend to be similar through genetics, they are not exactly the same.
- In addition, differing exposures create different outcomes
- e.g., tall parents tend to have tall children,
- But malnutrition can result in shortness even with tall parents.
6 Disease Variation
- Differences in the presence / absence of disease.
- Differences in the stages of disease.
- Differences in co-morbidity.
- Exercise: Using diabetes as an example, explain how individuals can differ on all of the above factors.
7 Measurement Variation
- Differences in conditions during measurement
- Ambient factors (e.g., temperature, noise)
- Environment and blood pressure
- Patient factors (e.g., fatigue, anxiety)
- White coat syndrome and blood pressure.
- Differences in methods of measurement
- Instrument, technique, and operator.
- Exercise: How can weight differ according to instrument, technique, or operator?
8 Measurement Error
- Differences in instrument recordings.
- Machine error, survey typos, etc.
- Differences in instrument observation.
- Operator interpretation errors,
- Observer errors,
- These measurement variations are all systematic
- They occur for a reason.
- We can control for them.
9 Random Error
- Unexplained variation.
- Also called background noise.
- Unsystematic differences in values
- For unknown, random reasons.
- We cannot control for random errors.
- We can estimate a level of random error
- Poll results (± 5%) (see the sketch below).
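The ±5% attached to poll results can be estimated with the usual margin-of-error formula for a proportion. A minimal Python sketch, using a hypothetical sample size and observed proportion (these numbers are not from the slides):

```python
# Minimal sketch (not from the slides): estimating the random error of a poll
# result as a 95% margin of error. The sample size (n = 400) and observed
# proportion (p = 0.50) are hypothetical values chosen for illustration.
import math

n = 400          # hypothetical number of respondents
p = 0.50         # hypothetical observed proportion answering "yes"

# Standard error of a proportion, and the 95% margin of error (z = 1.96).
se = math.sqrt(p * (1 - p) / n)
margin_95 = 1.96 * se

print(f"95% margin of error: +/- {margin_95:.1%}")  # about +/- 4.9%
```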
10 Random Error, continued.
- Random error produces a distribution of values even if all systematic error is eliminated.
- Statistical tests look for systematic differences between samples beyond the random variation within a sample (a minimal illustration follows below).
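To illustrate the point about systematic differences beyond random variation, here is a minimal sketch of a two-sample comparison. The data are simulated, and the use of numpy/scipy and a t-test is an assumption for illustration, not a method prescribed by this chapter:

```python
# Minimal sketch: a statistical test asks whether the difference between two
# sample means is larger than the random variation within each sample.
# The data are simulated; numpy and scipy are assumed to be available.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two hypothetical samples: same spread, means 2 units apart.
group_a = rng.normal(loc=120, scale=10, size=30)   # e.g., blood pressure readings
group_b = rng.normal(loc=122, scale=10, size=30)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value suggests a systematic difference beyond random variation.
```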
11 Statistics
- Statistical methods explain variation in data
- Describe the pattern of spread (variation) in your data set,
- Compare the pattern of two or more groups,
- Determine whether the differences between two patterns are real (significant) or random (non-significant).
12 Data
- A variable is the characteristic
- Skin color, blood type, age, cholesterol level.
- The data are the values of that characteristic
- White, pink, yellow, brown.
- A, B, O, AB.
- 20, 35, 50 years.
- 136, 201, 400 mg/dL.
- Exercise: Which two are qualitative data and which two are quantitative?
13 Types of Variables
- Nominal
- Naming (categorical)
- No measurement scaling
- Blood type A, B, AB, O
- Dichotomous (binary)
- Categorical but only two levels
- Indicates a direction (normal - abnormal, good - bad)
- Cancer: Yes / No
- Health status: Well / Sick
14 Types of Variables, cont.
- Ordinal (ranked)
- Naming with an order.
- From better to worse.
- Illness scale: no illness / dizzy / nauseated / vomiting (1 / 2 / 3 / 4).
- Test scale: excellent / good / fair / poor (A / B / C / D).
- Contains more information than nominal variables (ill / not ill) (passed / failed)
15 Continuous Variables
- Measured on a scale
- Height, weight, age, glucose level.
- Provides even more information than ordinal variables
- Shows each observation's position relative to the others,
- Shows the extent to which each observation differs from the others.
- e.g., with age we know just how much older we are than the average student.
16 Units of Observation
- Counts
- Counts of a characteristic in persons, things.
- Patterns presented in a frequency table (2 x 2).
- Can compare proportions within or across groups (female, male).
- Proportions (risk)
- Number of persons with a characteristic.
- Number of persons who died, who became ill, etc.
- Can compare the ratio of counts between groups (see the sketch below).
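A minimal sketch of counts arranged as a 2 x 2 frequency table, the proportions (risks) computed from them, and a ratio of those proportions. The group labels and counts are hypothetical:

```python
# Minimal sketch of counts in a 2 x 2 frequency table and the proportions
# (risks) computed from them. All counts are hypothetical, not from the text.
counts = {
    ("female", "ill"): 20, ("female", "well"): 80,
    ("male",   "ill"): 30, ("male",   "well"): 70,
}

risks = {}
for group in ("female", "male"):
    ill = counts[(group, "ill")]
    total = ill + counts[(group, "well")]
    risks[group] = ill / total
    print(f"{group}: {ill}/{total} ill, proportion (risk) = {risks[group]:.2f}")

# Comparing groups as a ratio of proportions (risk ratio).
print(f"risk ratio (male vs. female) = {risks['male'] / risks['female']:.2f}")
```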
17 Collapsing Data in Variables (Combining Data)
- A continuous variable may be converted to an ordinal variable
- By grouping values together,
- To form categories.
- Many values are shrunk into a few values (collapsing).
18 Collapsing data, continued.
- Disadvantage
- Information is lost because
- Individual values are no longer apparent.
- (500 g through 1200 g) becomes (<501 g, >500 g)
- Advantage
- Percentages can be created
- Relationships are easier to show (see the sketch below).
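A minimal sketch of collapsing a continuous variable into the two categories used above (<501 g vs. >500 g). The individual weights are hypothetical; note how the exact values disappear once the data are collapsed, while percentages become easy to report:

```python
# Minimal sketch of collapsing a continuous variable into categories, using the
# slide's weight cutoff (<501 g vs. >500 g). The individual weights are
# hypothetical; the collapse makes percentages easy but loses the exact values.
weights_g = [480, 500, 640, 720, 850, 990, 1100, 1200]   # hypothetical values

categories = ["<501 g" if w < 501 else ">500 g" for w in weights_g]

for label in ("<501 g", ">500 g"):
    n = categories.count(label)
    print(f"{label}: {n} of {len(weights_g)} ({n / len(weights_g):.0%})")
```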
19 Frequency Distributions
- The number of persons with each value in a variable.
- Age Distribution
- 5 people are 22y
- 10 people are 23y
- 12 people are 24y
- 15 people are 25y
- 12 people are 26y
- 10 people are 27y
- 5 people are 28y
- [Histogram of the age distribution, ages 22 through 28; a code sketch of the same tallies follows below]
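A minimal sketch reproducing this frequency distribution in code, turning the slide's tallies into a value-by-value count and a simple text histogram:

```python
# Minimal sketch: building the frequency distribution shown on this slide
# (how many people have each age value) and printing it as a text histogram.
from collections import Counter

# Expand the slide's tallies into individual observations.
tallies = {22: 5, 23: 10, 24: 12, 25: 15, 26: 12, 27: 10, 28: 5}
ages = [age for age, n in tallies.items() for _ in range(n)]

freq = Counter(ages)                      # value -> number of persons
for age in sorted(freq):
    print(f"{age}y: {'*' * freq[age]} ({freq[age]})")
```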
20 Frequency Distributions, cont.
- Real distribution
- The pattern obtained from the actual data in a population or sample population.
- Can be one of many differently shaped curves.
- Gaussian distribution
- The distribution in a population expected (calculated) under normal or average conditions.
- Is a smooth, bell-shaped curve.
21 Distributions
- Discuss range of values histogram
- Refer to Table 9-2 and Figure 9-2
- Textbook page 141
- Discuss normal distribution
- Refer to Figure 9-3 and 9-4
- Textbook page 142
22 Parameters of Frequency Distribution
- Ways to summarize and define the distribution.
- Measures of central tendency
- Where, among the values, the most common value lies.
- Measures of dispersion
- How widely the values are spread out.
23 Measures of Central Tendency
- In a normal distribution
- The density of observed values is greatest near the center.
- Each tail of the curve diminishes in similar frequency toward zero.
- The mean, median, and mode are located in the center of the bell curve.
24 Measures of Central Tendency, continued.
- Distribution is not normal if
- mean, median, and mode are in different locations on the curve.
- [Diagram: skewed curve with the mean, median, and mode at different points]
25 Skewed Distributions
- Refer to Figure 9-8 of text.
- Curve is pushed to the right (negative skew) (Fig. 9-8A).
- Curve is pushed to the left (positive skew) (Fig. 9-8B).
- A curve that is abnormally peaked is leptokurtic (Fig. 9-8C).
- A curve that is abnormally flat is platykurtic (Fig. 9-8D; a sketch for computing skew and kurtosis follows below).
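A minimal sketch of putting numbers on these shapes, using sample skewness and excess kurtosis. The data are simulated, and the use of scipy.stats here is an assumption for illustration, not the chapter's prescribed method:

```python
# Minimal sketch of quantifying the shapes described above: sample skewness and
# (excess) kurtosis. The data are simulated; numpy and scipy are assumed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

symmetric = rng.normal(size=5_000)            # roughly bell-shaped
right_tailed = rng.exponential(size=5_000)    # long right tail (positive skew)

for name, x in [("symmetric", symmetric), ("right-tailed", right_tailed)]:
    print(f"{name:12s} skew = {stats.skew(x):+.2f}, "
          f"excess kurtosis = {stats.kurtosis(x):+.2f}")
# Positive skew -> long right tail; positive excess kurtosis -> more peaked
# (leptokurtic); negative -> flatter (platykurtic).
```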
26 Measures of Dispersion
- Range of values: the amount of spread in the distribution.
- [Diagram: two value scales from 20 to 60 illustrating different amounts of spread]
27 Variance
- How far each value is from the mean.
- How far in both directions.
- Example: Age of a sample.
- Mean is 35 years
- Mary is 20 yrs.
- John is 50 yrs.
- Mean is 35 years
- Mary is 30 yrs.
- John is 40 yrs.
- [Number lines: 20-35-50 (wide spread) vs. 30-35-40 (narrow spread)]
28 How Variation is Measured
- Variance: S² = Σ(xᵢ - X̄)² / (N - 1)
- See Box 9-3.
- Degrees of freedom (N - 1)
- A way to make up for small sample sizes, which throw off the estimates.
- 200 - 1 = 199 (a 0.5% difference)
- 2 - 1 = 1 (a 50% difference)
29 Standard Deviation
- Also describes the amount of spread in the frequency distribution.
- Is the square root of the variance.
- S = √[ Σ(xᵢ - X̄)² / (N - 1) ] (a minimal worked example follows below).
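A minimal worked example of the variance and standard deviation formulas, using the ages from the variance slide (same mean of 35 years, different spreads):

```python
# Minimal worked example of the two formulas above, using the ages from the
# variance slide (mean 35: Mary 20 / John 50 vs. Mary 30 / John 40).
import math

def sample_variance(values):
    """S^2 = sum((x_i - x_bar)^2) / (N - 1), with N - 1 degrees of freedom."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

wide = [20, 50]     # Mary 20 yrs, John 50 yrs (mean 35)
narrow = [30, 40]   # Mary 30 yrs, John 40 yrs (mean 35)

for name, ages in [("wide", wide), ("narrow", narrow)]:
    var = sample_variance(ages)
    sd = math.sqrt(var)
    print(f"{name}: variance = {var:.0f}, SD = {sd:.1f}")
# Same mean (35 years), but the spread differs greatly:
# wide: variance 450, SD ~21.2; narrow: variance 50, SD ~7.1.
```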
30 Standard Deviation, continued.
- For a normal distribution
- 99% of values fall within ±2.5 SD.
- 95% of values fall within ±1.96 SD.
- 68% of values fall within ±1.0 SD.
- Refer to Figure 9-6 (a simulation check of these percentages follows below).
- Symbols
- µ = mean of the population (theoretical).
- σ = standard deviation of the population.
- X̄ = sample mean, S = sample standard deviation.
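A minimal simulation sketch checking those percentages against normally distributed values; numpy is assumed to be available:

```python
# Minimal simulation check of the percentages quoted above, using normally
# distributed values generated with numpy (assumed available).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=0, scale=1, size=100_000)

for k in (1.0, 1.96, 2.5):
    share = np.mean(np.abs(x) < k)      # proportion of values within +/- k SD
    print(f"within +/-{k} SD: {share:.1%}")
# Expected: roughly 68%, 95%, and 99% respectively.
```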
31 Normalized Dataset
- Choice of units affects the mean and std dev.
- Can't compare two groups using two different measuring units.
- For example
- Weight of a group of elephants (µ = 5,000 lbs).
- Weight of a group of mice (µ = 5 mg).
- To compare, we have to equalize both into the same units.
32 Normalized Data, continued
- To eliminate the effects produced by the choice of units
- Data are put into a unit-free form (normalized).
- Calculate a z-score
- Z distributions have a mean of 0 and an SD of 1.
- Values are expressed in terms of how many std deviations each value is away from the mean.
- Zᵢ = (xᵢ - µ) / σ = number of SDs from the group mean (see the sketch below).
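A minimal sketch of the z-score calculation for the elephant-and-mouse comparison. The weights are hypothetical, and the sample mean and SD are used in place of the population µ and σ:

```python
# Minimal sketch of the z-score formula above. The weights are hypothetical;
# after normalizing, both groups are on the same unit-free scale (mean 0, SD 1),
# so an elephant and a mouse can be compared.
import statistics

def z_scores(values):
    """z_i = (x_i - mean) / SD, using sample estimates in place of mu and sigma."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # sample SD (N - 1)
    return [(x - mean) / sd for x in values]

elephants_lbs = [4500, 5000, 5200, 5300]   # hypothetical weights in pounds
mice_mg = [4, 5, 5, 6]                     # hypothetical weights (slide's units)

print([round(z, 2) for z in z_scores(elephants_lbs)])
print([round(z, 2) for z in z_scores(mice_mg)])
```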
33 Assumptions of Normal Distributions
- In order to test whether a sample differs from the population it is drawn from
- The data must meet some assumptions
- That the values are normally distributed,
- The sample size is sufficiently large,
- Bias and error are minimized.
- When these assumptions are met, you may conduct inferential analysis (a normality-check sketch follows below).
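A minimal sketch of checking the normality assumption before doing inferential analysis. The Shapiro-Wilk test shown here is one common choice, not the chapter's prescribed method; the data are simulated and scipy is assumed:

```python
# Minimal sketch of checking the normality assumption before inferential
# analysis, using the Shapiro-Wilk test (one common option; scipy assumed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(loc=120, scale=10, size=50)   # hypothetical measurements

stat, p_value = stats.shapiro(sample)
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p_value:.3f}")
# A large p-value gives no evidence against normality; a small one suggests
# the normal-theory assumptions may not hold.
```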
34 Effect of Error on Distributions
- Random error affects the distribution differently than systematic error.
- How well we can compare two distributions,
- How accurate our conclusions are,
- Depend on how much error is in the data.
- Random error does not affect the group average (regardless of the value).
- Systematic error does affect the average and can really throw off the estimates.
37 Reducing Measurement Error
- Pilot test instruments.
- Train the data collectors.
- Double check or verify data entries.
- Use multiple instrument measures.
- Consult statistician about adjusting for
measurement error.
38 Nonparametric Distributions
- Data from categorical (nominal / ordinal) variables are not normally distributed.
- Statistical tests based on assumptions of the normal curve cannot be used.
- Their parameters are not the mean and standard deviation.
- Their tests do not require that the data follow a particular distribution.
- There are special tests called nonparametric statistics.
39 Nonparametric Distributions
- The nonparametric distribution
- Is not normal,
- Is based on a distribution of counts,
- There are special tests for nonparametric data,
- These tests only require that the data be ordinal (a minimal example follows below).
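A minimal sketch of a nonparametric comparison of two groups measured on an ordinal scale (for example, the illness scale rated 1-4 on an earlier slide). The ratings are hypothetical, and the Mann-Whitney U test via scipy is an assumption chosen for illustration:

```python
# Minimal sketch of a nonparametric comparison of two groups on an ordinal
# scale. The ratings are hypothetical; the Mann-Whitney U test uses only the
# ordering of the values, not a normal distribution (scipy assumed).
from scipy import stats

treated = [1, 1, 2, 2, 2, 3, 1, 2]     # hypothetical illness-scale ratings (1-4)
untreated = [2, 3, 3, 4, 2, 3, 4, 3]

u_stat, p_value = stats.mannwhitneyu(treated, untreated)
print(f"U = {u_stat:.1f}, p = {p_value:.3f}")
```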
40 Summary
- Fundamental to any analysis (descriptive or inferential) is
- Understanding your variables and the type of data they contain.
- Continuous data, when normally distributed, have two parameters
- Measures of central tendency (mean)
- Measures of dispersion (variance)
41 Summary, continued
- The normal distribution has
- Mean, median, and mode that coincide.
- 95% of observations within ±1.96 standard deviations of the mean.
- Some distributions are skewed
- Outliers pull the mean away from the center.
- Non-normal distributions require special nonparametric statistical tests.