Title: Measure of Variability (Dispersion, Spread)
1Measure of Variability (Dispersion, Spread)
- Range
- Inter-Quartile Range
- Variance, standard deviation
- Pseudo-standard deviation
2Measure of Central Location
- Mean
- Median
3Range
- Range = max - min
- Inter-Quartile Range (IQR)
- IQR = Q3 - Q1
4Example
- The data: Verbal IQ for n = 23 students, arranged in increasing order:
- 80 82 84 86 86 89 90 94 94 95 95 96 99 99 102 102 104 105 105 109 111 118 119
- min = 80
- Q1 = 89
- Q2 = 96
- Q3 = 105
- max = 119
5Range and IQR
- Range = max - min = 119 - 80 = 39
- Inter-Quartile Range
- IQR = Q3 - Q1 = 105 - 89 = 16
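The range and IQR above can be checked with a short Python sketch. It assumes the "median of each half" rule for quartiles (described later in these slides) and, as a further assumption, leaves the middle value out of both halves when n is odd; other software may use a different quartile rule and give slightly different values. The function name `quartiles` is illustrative only.

```python
from statistics import median

# Verbal IQ scores for the n = 23 students, already sorted.
iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96,
      99, 99, 102, 102, 104, 105, 105, 109, 111, 118, 119]

def quartiles(sorted_data):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    Assumption: for odd n the middle value belongs to neither half.
    """
    n = len(sorted_data)
    half = n // 2
    lower = sorted_data[:half]
    upper = sorted_data[half + 1:] if n % 2 else sorted_data[half:]
    return median(lower), median(sorted_data), median(upper)

q1, q2, q3 = quartiles(iq)
data_range = max(iq) - min(iq)
iqr = q3 - q1
print(q1, q2, q3, data_range, iqr)  # 89 96 105 39 16
```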
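6Sample Variance
- Let x1, x2, x3, ..., xn denote a set of n numbers.
- Recall that the mean of the n numbers is defined as $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
7- The numbers
- $d_i = x_i - \bar{x}$, for i = 1, 2, ..., n,
- are called deviations from the mean
8- The sum
- $\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2$
- is called the sum of squares of deviations from the mean.
- Writing it out in full: $d_1^2 + d_2^2 + \dots + d_n^2$
- or $(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2$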
9The Sample Variance
- is defined as the quantity $\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
- and is denoted by the symbol $s^2$
10The Sample Standard Deviation s
- Definition: The Sample Standard Deviation is defined by $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
- Hence the Sample Standard Deviation, s, is the square root of the sample variance.
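11- Example
- Let x1, x2, x3, x4, x5 denote the set of numbers in the following table.

  i       1     2     3     4     5
  xi     10    15    21     7    13

12- Then
- $\sum x_i$ = x1 + x2 + x3 + x4 + x5 = 10 + 15 + 21 + 7 + 13 = 66
- and $\bar{x} = 66/5 = 13.2$
13- The deviations from the mean d1, d2, d3, d4, d5 are given in the following table.

  i          1       2       3       4       5     Total
  xi        10      15      21       7      13      66
  di      -3.2     1.8     7.8    -6.2    -0.2       0
  di^2   10.24    3.24   60.84   38.44    0.04   112.80

14- The sum of squares of deviations is 112.80, so the sample variance is $s^2 = 112.80/4 = 28.2$
15- Also the standard deviation is $s = \sqrt{28.2} \approx 5.31$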
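The five-number example can be reproduced step by step. The sketch below follows the definitions on the preceding slides (deviations from the mean, sum of squared deviations, divisor n - 1); the variable names are illustrative.

```python
import math

x = [10, 15, 21, 7, 13]
n = len(x)

mean = sum(x) / n                        # 66 / 5 = 13.2
deviations = [xi - mean for xi in x]     # d_i = x_i - mean
ss = sum(d ** 2 for d in deviations)     # sum of squared deviations
variance = ss / (n - 1)                  # sample variance s^2
s = math.sqrt(variance)                  # sample standard deviation

print(round(mean, 1), round(ss, 2), round(variance, 2), round(s, 2))
# 13.2 112.8 28.2 5.31
```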
16Interpretations of s
- In Normal distributions
- Approximately 2/3 of the observations will lie within one standard deviation of the mean
- Approximately 95% of the observations lie within two standard deviations of the mean
- In a histogram of the Normal distribution, the standard deviation is approximately the distance from the mode to the inflection point
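17[Figure: Normal curve showing the mode, the inflection point, and the distance s between them]
18[Figure: Normal curve with about 2/3 of the area within s of the mean]
19[Figure: Normal curve with the interval of 2s on either side of the mean]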
20Example
- A researcher collected data on 1500 males aged 60-65.
- The variables measured were cholesterol and blood pressure.
- The mean blood pressure was 155 with a standard deviation of 12.
- The mean cholesterol level was 230 with a standard deviation of 15.
- In both cases the data were normally distributed.
21Interpretation of these numbers
- Blood pressure levels vary about the value 155 in males aged 60-65.
- Cholesterol levels vary about the value 230 in males aged 60-65.
22- About 2/3 of males aged 60-65 have blood pressure within 12 of 155, i.e. between 155 - 12 = 143 and 155 + 12 = 167.
- About 2/3 of males aged 60-65 have cholesterol within 15 of 230, i.e. between 230 - 15 = 215 and 230 + 15 = 245.
23- About 95% of males aged 60-65 have blood pressure within 2(12) = 24 of 155, i.e. between 155 - 24 = 131 and 155 + 24 = 179.
- About 95% of males aged 60-65 have cholesterol within 2(15) = 30 of 230, i.e. between 230 - 30 = 200 and 230 + 30 = 260.
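The four intervals above all come from the same rule (mean ± s and mean ± 2s), so they can be generated by one small helper. This is only a sketch; the function name `normal_intervals` is made up for illustration, and the 2/3 and 95% figures are the approximations used in these slides.

```python
def normal_intervals(mean, sd):
    """Approximate ranges for normally distributed data."""
    return {
        "about 2/3": (mean - sd, mean + sd),          # within one s
        "about 95%": (mean - 2 * sd, mean + 2 * sd),  # within two s
    }

blood_pressure = normal_intervals(155, 12)
cholesterol = normal_intervals(230, 15)

print(blood_pressure)  # {'about 2/3': (143, 167), 'about 95%': (131, 179)}
print(cholesterol)     # {'about 2/3': (215, 245), 'about 95%': (200, 260)}
```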
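24A Computing formula for $\sum (x_i - \bar{x})^2$
- Sum of squares of deviations from the mean: $\sum_{i=1}^{n} (x_i - \bar{x})^2$
- The difficulty with this formula is that $\bar{x}$ will have many decimals.
- The result will be that each term in the above sum will also have many decimals.
25- The sum of squares of deviations from the mean can also be computed using the following identity
- $\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$
26- To use this identity we need to compute
- $\sum_{i=1}^{n} x_i$ and $\sum_{i=1}^{n} x_i^2$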
27 28(No Transcript)
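29Example
- The data: Verbal IQ for n = 23 students, arranged in increasing order:
- 80 82 84 86 86 89 90 94
- 94 95 95 96 99 99 102 102
- 104 105 105 109 111 118 119
30- $\sum x_i$ = 80 + 82 + 84 + 86 + 86 + 89 + 90 + 94 + 94 + 95 + 95 + 96 + 99 + 99 + 102 + 102 + 104 + 105 + 105 + 109 + 111 + 118 + 119 = 2244
- $\sum x_i^2$ = $80^2 + 82^2 + 84^2 + 86^2 + 86^2 + 89^2 + 90^2 + 94^2 + 94^2 + 95^2 + 95^2 + 96^2 + 99^2 + 99^2 + 102^2 + 102^2 + 104^2 + 105^2 + 105^2 + 109^2 + 111^2 + 118^2 + 119^2$ = 221494
- Hence $\sum (x_i - \bar{x})^2 = \sum x_i^2 - \left(\sum x_i\right)^2/n = 221494 - (2244)^2/23 \approx 2557.65$
31You will obtain exactly the same answer if you use the left hand side of the equation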
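The claim that both sides of the identity give the same answer can be checked numerically on the Verbal IQ data. The sketch below computes the sum of squared deviations both ways and then the standard deviation; variable names are illustrative.

```python
iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96,
      99, 99, 102, 102, 104, 105, 105, 109, 111, 118, 119]
n = len(iq)

sum_x = sum(iq)                      # 2244
sum_x2 = sum(x * x for x in iq)      # 221494

# Right-hand side of the identity (computing formula).
ss_shortcut = sum_x2 - sum_x ** 2 / n

# Left-hand side (definition): squared deviations from the mean.
mean = sum_x / n
ss_direct = sum((x - mean) ** 2 for x in iq)

variance = ss_shortcut / (n - 1)
s = variance ** 0.5
print(sum_x, sum_x2, round(ss_shortcut, 2), round(s, 3))
# 2244 221494 2557.65 10.782
```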
32(No Transcript)
33(No Transcript)
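34A quick (rough) calculation of s
- $s \approx \frac{\text{Range}}{4} = \frac{\max - \min}{4}$
- The reason for this is that approximately all (95%) of the observations are between $\bar{x} - 2s$ and $\bar{x} + 2s$
- Thus $\max \approx \bar{x} + 2s$ and $\min \approx \bar{x} - 2s$, so that Range = max - min $\approx 4s$
35Example
- Verbal IQ for n = 23 students
- min = 80 and max = 119
- $s \approx (119 - 80)/4 = 39/4 = 9.75$
- This compares with the exact value of s, which is 10.782.
- The rough method is useful for checking your calculation of s.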
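As a quick check of the rough rule against the exact value, the following sketch uses `statistics.stdev` from the Python standard library, which computes the sample standard deviation with the n - 1 divisor.

```python
from statistics import stdev

iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96,
      99, 99, 102, 102, 104, 105, 105, 109, 111, 118, 119]

rough_s = (max(iq) - min(iq)) / 4   # range / 4
exact_s = stdev(iq)                 # sample standard deviation

print(rough_s, round(exact_s, 3))   # 9.75 10.782
```

The two values differ by about one unit, which is the kind of agreement to expect from a rule that is only meant to catch gross calculation errors.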
36The Pseudo Standard Deviation (PSD)
- Definition: The Pseudo Standard Deviation (PSD) is defined by $\text{PSD} = \frac{\text{IQR}}{1.35}$
37Properties
- For Normal distributions the magnitude of the pseudo standard deviation (PSD) and the standard deviation (s) will be approximately the same value
- For leptokurtic distributions the standard deviation (s) will be larger than the pseudo standard deviation (PSD)
- For platykurtic distributions the standard deviation (s) will be smaller than the pseudo standard deviation (PSD)
38Example
- Verbal IQ for n = 23 students
- Inter-Quartile Range
- IQR = Q3 - Q1 = 105 - 89 = 16
- Pseudo standard deviation
- PSD = IQR/1.35 = 16/1.35 = 11.85
- This compares with the standard deviation
- s = 10.782
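A minimal sketch of the PSD comparison for the Verbal IQ data follows. It assumes the usual definition PSD = IQR/1.35 (the divisor is the approximate width of the interquartile range of a Normal distribution, measured in standard deviations) and takes Q1 and Q3 from the earlier slide rather than recomputing them.

```python
from statistics import stdev

iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96,
      99, 99, 102, 102, 104, 105, 105, 109, 111, 118, 119]

q1, q3 = 89, 105             # quartiles quoted on the slides
psd = (q3 - q1) / 1.35       # pseudo standard deviation
s = stdev(iq)                # sample standard deviation

print(round(psd, 2), round(s, 3))   # 11.85 10.782
```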
39- An outlier is a wild observation in the data
- Outliers occur because of
- errors (typographical and computational)
- extreme cases in the population
- We will now consider the drawing of box-plots where outliers are identified
40Box-whisker Plots showing outliers
42To Draw a Box Plot we need to
- Compute the Hinge (Median, Q2) and the Mid-hinges (first and third quartiles, Q1 and Q3)
- To identify outliers we will compute the inner and outer fences
43- The fences are like the fences at a prison. We expect the entire population to be within both sets of fences.
- If a member of the population is between the inner and outer fences it is a mild outlier.
- If a member of the population is outside of the outer fences it is an extreme outlier.
44- Lower outer fence
- F1 = Q1 - (3)IQR
- Upper outer fence
- F2 = Q3 + (3)IQR
45- Lower inner fence
- f1 = Q1 - (1.5)IQR
- Upper inner fence
- f2 = Q3 + (1.5)IQR
46- Observations that are between the lower and upper inner fences are considered to be non-outliers.
- Observations that are outside the inner fences but not outside the outer fences are considered to be mild outliers.
- Observations that are outside the outer fences are considered to be extreme outliers.
47- Mild outliers are plotted individually in a box-plot using their own plotting symbol
- Extreme outliers are plotted individually in a box-plot using a different plotting symbol
- Non-outliers are represented with the box and whiskers, with
- Max = largest observation within the fences
- Min = smallest observation within the fences
48[Figure: Box-whisker plot representing the data that are not outliers, with the inner fences, the outer fences, mild outliers, and an extreme outlier marked]
49Example
- Data collected on n = 109 countries in 1995.
- Data collected on k = 25 variables.
50The variables
- Population: size (in 1000s)
- Density: number of people per square kilometer
- Urban: percentage of population living in cities
- Religion
- lifeexpf: average female life expectancy
- lifeexpm: average male life expectancy
51- literacy: % of population who can read
- pop_inc: increase in population size (1995)
- babymort: infant mortality (deaths per 1000)
- gdp_cap: gross domestic product per capita
- Region: region or economic group
- calories: daily calorie intake
- aids: number of AIDS cases
- birth_rt: birth rate per 1000 people
52- death_rt: death rate per 1000 people
- aids_rt: number of AIDS cases per 100000 people
- log_gdp: log10(gdp_cap)
- log_aidsr: log10(aids_rt)
- b_to_d: birth to death ratio
- fertility: average number of children in family
- log_pop: log10(population)
53- cropgrow: ??
- lit_male: % of males who can read
- lit_fema: % of females who can read
- Climate: predominant climate
54The data file as it appears in SPSS
55Consider the data on infant mortality
Stem-Leaf diagram: stem = 10s, leaf = unit digit
56Summary Statistics
median = Q2 = 27
Quartiles
Lower quartile = Q1 = the median of the lower half = 12
Upper quartile = Q3 = the median of the upper half = 66.5
Interquartile range (IQR): IQR = Q3 - Q1 = 66.5 - 12 = 54.5
57The Outer Fences
lower = Q1 - 3(IQR) = 12 - 3(54.5) = -151.5
upper = Q3 + 3(IQR) = 66.5 + 3(54.5) = 230.0
No observations are outside of the outer fences
The Inner Fences
lower = Q1 - 1.5(IQR) = 12 - 1.5(54.5) = -69.75
upper = Q3 + 1.5(IQR) = 66.5 + 1.5(54.5) = 148.25
Only one observation (168, Afghanistan) is outside of the inner fences (mild outlier)
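The fence rules can be written as a small classifier. The sketch below uses the quartiles quoted for the infant mortality data (Q1 = 12, Q3 = 66.5); the function names `fences` and `classify` are illustrative and not part of any statistics package.

```python
def fences(q1, q3):
    """Return the inner and outer fences as (lower, upper) pairs."""
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3 * iqr, q3 + 3 * iqr),
    }

def classify(value, q1, q3):
    """Label a single observation using the fence rules."""
    f = fences(q1, q3)
    lo_in, hi_in = f["inner"]
    lo_out, hi_out = f["outer"]
    if value < lo_out or value > hi_out:
        return "extreme outlier"
    if value < lo_in or value > hi_in:
        return "mild outlier"
    return "not an outlier"

print(fences(12, 66.5))
# {'inner': (-69.75, 148.25), 'outer': (-151.5, 230.0)}
print(classify(168, 12, 66.5))   # mild outlier (Afghanistan)
print(classify(27, 12, 66.5))    # not an outlier (the median)
```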
58Box-Whisker Plot of Infant Mortality
59Example 2
- In this example we are looking at the weight gains (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork).
- Ten test animals for each diet
60Table: Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)
61[Figure: Box plots of weight gain for the six diets: High Protein and Low Protein, each with Beef, Cereal, and Pork]
62Conclusions
- Weight gain is higher for the high protein meat diets
- Increasing the level of protein increases weight gain, but only if the source of protein is a meat source
63Measures of Shape
64Measures of Shape
- Skewness: Negatively skewed, Symmetric, Positively skewed
- Kurtosis: Leptokurtic, Normal (mesokurtic), Platykurtic
65- Measure of Skewness based on the sum of cubes
- Measure of Kurtosis based on the sum of 4th
powers
66 67The 3 is subtracted so that g2 is zero for the
normal distribution
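68Interpretations of Measures of Shape
- g1 > 0: positively skewed
- g1 = 0: symmetric
- g1 < 0: negatively skewed
- g2 < 0: platykurtic
- g2 = 0: normal (mesokurtic)
- g2 > 0: leptokurtic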
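The formulas for g1 and g2 did not survive in this transcript, so the sketch below is an assumption: it uses the common moment-based definitions g1 = m3 / m2^(3/2) and g2 = m4 / m2^2 - 3, where mk is the average of the k-th powers of the deviations from the mean. This matches the description above (sum of cubes, sum of 4th powers, 3 subtracted), but the original slides may have used a slightly different normalisation. The function name `shape_measures` and both example data sets are made up for illustration.

```python
def shape_measures(data):
    """Return (g1, g2): moment-based skewness and excess kurtosis.

    Assumed definitions: g1 = m3 / m2**1.5 and g2 = m4 / m2**2 - 3,
    with m_k the mean of the k-th powers of deviations from the mean.
    """
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

symmetric = [1, 2, 3, 4, 5]       # evenly spread, symmetric
right_tail = [1, 1, 1, 2, 10]     # one large value in the right tail

g1_sym, g2_sym = shape_measures(symmetric)
g1_right, g2_right = shape_measures(right_tail)
print(g1_sym, g2_sym)   # g1 is 0 (symmetric); g2 is negative (flat)
print(g1_right > 0)     # True: positively skewed
```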
69Descriptive techniques for Multivariate data
In most research situations data is collected on
more than one variable (usually many variables)
70Graphical Techniques
- The scatter plot
- The two dimensional Histogram
71The Scatter Plot
- For two variables X and Y we will have a measurement for each variable on each case:
- (xi, yi)
- xi = the value of X for case i
- and
- yi = the value of Y for case i.
72- To construct a scatter plot we plot the points
- (xi, yi)
- for each case on the X-Y plane.
[Figure: a single point (xi, yi) plotted at horizontal position xi and vertical position yi]
73Data Set 3: The following table gives data on Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program

  Student   Verbal IQ   Math IQ   Initial Reading   Final Reading
                                  Achievement       Achievement
     1         86          94         1.1               1.7
     2        104         103         1.5               1.7
     3         86          92         1.5               1.9
     4        105         100         2.0               2.0
     5        118         115         1.9               3.5
     6         96         102         1.4               2.4
     7         90          87         1.5               1.8
     8         95         100         1.4               2.0
     9        105          96         1.7               1.7
    10         84          80         1.6               1.7
    11         94          87         1.6               1.7
    12        119         116         1.7               3.1
    13         82          91         1.2               1.8
    14         80          93         1.0               1.7
    15        109         124         1.8               2.5
    16        111         119         1.4               3.0
    17         89          94         1.6               1.8
    18         99         117         1.6               2.6
    19         94          93         1.4               1.4
    20         99         110         1.4               2.0
    21         95          97         1.5               1.3
    22        102         104         1.7               3.1
    23        102          93         1.6               1.9
74(No Transcript)
75[Figure: scatter plot with the point (84, 80) labelled]
76(No Transcript)
77Some Scatter Patterns
78(No Transcript)
79(No Transcript)
80- Circular
- No relationship between X and Y
- Unable to predict Y from X
81(No Transcript)
82(No Transcript)
83- Ellipsoidal
- Positive relationship between X and Y
- Increases in X correspond to increases in Y (but not always)
- Major axis of the ellipse has positive slope
84(No Transcript)
85Example
86(No Transcript)
87Some More Patterns
88(No Transcript)
89(No Transcript)
90- Ellipsoidal (thinner ellipse)
- Stronger positive relationship between X and Y
- Increases in X correspond to increases in Y (more frequently)
- Major axis of the ellipse has positive slope
- Minor axis of the ellipse much smaller
91(No Transcript)
92- Increased strength in the positive relationship between X and Y
- Increases in X correspond to increases in Y (almost always)
- Minor axis of the ellipse extremely small in relation to the major axis of the ellipse.
93(No Transcript)
94(No Transcript)
95- Perfect positive relationship between X and Y
- Y perfectly predictable from X
- Data falls exactly along a straight line with
positive slope
96(No Transcript)
97(No Transcript)
98- Ellipsoidal
- Negative relationship between X and Y
- Increases in X correspond to decreases in Y (but not always)
- Major axis of the ellipse has negative slope
99(No Transcript)
100- The strength of the relationship can increase
until changes in Y can be perfectly predicted
from X
101(No Transcript)
102(No Transcript)
103(No Transcript)
104(No Transcript)
105(No Transcript)
106Some Non-Linear Patterns
107(No Transcript)
108(No Transcript)
109- In a linear pattern Y increases with respect to X at a constant rate
- In a non-linear pattern the rate at which Y increases with respect to X is variable
110Growth Patterns
111(No Transcript)
112(No Transcript)
113- Growth patterns frequently follow a sigmoid curve
- Growth at the start is slow
- It then speeds up
- It slows down again as it reaches its limiting size
114Measures of strength of a relationship (Correlation)
- Pearson's correlation coefficient (r)
- Spearman's rank correlation coefficient (rho, $\rho$)
115- Assume that we have collected data on two variables X and Y. Let
- (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn)
- denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population)
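116- From this data we can compute summary statistics for each variable.
- The means
- $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- and
- $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
117- The standard deviations
- $s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
- and
- $s_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2}$
118- These statistics
- give information for each variable separately
- but
- give no information about the relationship between the two variables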
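119- Consider the statistics
- $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$
- $S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
- $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
120- The first two statistics, $S_{xx}$ and $S_{yy}$,
- are used to measure variability in each variable
- they are used to compute the sample standard deviations: $s_x = \sqrt{S_{xx}/(n-1)}$ and $s_y = \sqrt{S_{yy}/(n-1)}$
121- The third statistic, $S_{xy}$,
- is used to measure correlation
- If two variables are positively related, the sign of $x_i - \bar{x}$ will agree with the sign of $y_i - \bar{y}$
122- When $x_i - \bar{x}$ is positive, $y_i - \bar{y}$ will be positive.
- When xi is above its mean, yi will be above its mean
- When $x_i - \bar{x}$ is negative, $y_i - \bar{y}$ will be negative.
- When xi is below its mean, yi will be below its mean
- The product $(x_i - \bar{x})(y_i - \bar{y})$ will be positive for most cases.
123- This implies that the statistic
- $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
- will be positive
- Most of the terms in this sum will be positive
124- On the other hand
- If two variables are negatively related, the sign of $x_i - \bar{x}$ will be opposite to the sign of $y_i - \bar{y}$
125- When $x_i - \bar{x}$ is positive, $y_i - \bar{y}$ will be negative.
- When xi is above its mean, yi will be below its mean
- When $x_i - \bar{x}$ is negative, $y_i - \bar{y}$ will be positive.
- When xi is below its mean, yi will be above its mean
- The product $(x_i - \bar{x})(y_i - \bar{y})$ will be negative for most cases.
126- Again this implies that the statistic
- $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
- will be negative
- Most of the terms in this sum will be negative
127- Pearson's correlation coefficient is defined as below
- $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
128- The denominator
- $\sqrt{S_{xx} S_{yy}}$
- is always positive (provided neither variable is constant)
129- The numerator
- $S_{xy}$
- is positive if there is a positive relationship between X and Y and
- negative if there is a negative relationship between X and Y.
- This property carries over to Pearson's correlation coefficient r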
130Properties of Pearson's correlation coefficient r
- The value of r is always between -1 and +1.
- If the relationship between X and Y is positive, then r will be positive.
- If the relationship between X and Y is negative, then r will be negative.
- If there is no linear relationship between X and Y, then r will be close to zero.
- The value of r will be +1 if the points (xi, yi) lie on a straight line with positive slope.
- The value of r will be -1 if the points (xi, yi) lie on a straight line with negative slope.
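The two extreme cases in the list above are easy to confirm numerically. The sketch below computes r as the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations; the function name `pearson_r` and the made-up data lying exactly on two straight lines are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from sums of deviations and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]     # straight line, positive slope
falling = [10 - 3 * x for x in xs]   # straight line, negative slope

r_up = pearson_r(xs, rising)
r_down = pearson_r(xs, falling)
print(round(r_up, 6), round(r_down, 6))   # 1.0 -1.0
```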
131r = 1
132r = 0.95
133r = 0.7
134r = 0.4
135r = 0
136r = -0.4
137r = -0.7
138r = -0.8
139r = -0.95
140r = -1
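141- Computing formulae for the statistics
- $S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$
- $S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}$
- $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}$
142 143- To compute $S_{xx}$, $S_{yy}$ and $S_{xy}$
- first compute $\sum x_i$, $\sum y_i$, $\sum x_i^2$, $\sum y_i^2$ and $\sum x_i y_i$
- Then
- substitute these five sums into the computing formulae above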
144Example
145Data Set 3: The following table gives data on Verbal IQ, Math IQ, Initial Reading Achievement Score, and Final Reading Achievement Score for 23 students who have recently completed a reading improvement program

  Student   Verbal IQ   Math IQ   Initial Reading   Final Reading
                                  Achievement       Achievement
     1         86          94         1.1               1.7
     2        104         103         1.5               1.7
     3         86          92         1.5               1.9
     4        105         100         2.0               2.0
     5        118         115         1.9               3.5
     6         96         102         1.4               2.4
     7         90          87         1.5               1.8
     8         95         100         1.4               2.0
     9        105          96         1.7               1.7
    10         84          80         1.6               1.7
    11         94          87         1.6               1.7
    12        119         116         1.7               3.1
    13         82          91         1.2               1.8
    14         80          93         1.0               1.7
    15        109         124         1.8               2.5
    16        111         119         1.4               3.0
    17         89          94         1.6               1.8
    18         99         117         1.6               2.6
    19         94          93         1.4               1.4
    20         99         110         1.4               2.0
    21         95          97         1.5               1.3
    22        102         104         1.7               3.1
    23        102          93         1.6               1.9
146(No Transcript)
147 148- Thus Pearson's correlation coefficient is
-
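The transcript does not preserve which pair of variables was used in this example or the resulting value of r. As an illustration only, the sketch below assumes X = Verbal IQ and Y = Math IQ from Data Set 3 and applies the computing-formula route: five raw sums first, then the corrected sums of squares and cross-products, then r. Variable names are illustrative.

```python
import math

# (Verbal IQ, Math IQ) for the 23 students in Data Set 3.
# Assumption: this is the pair of variables used in the example.
pairs = [(86, 94), (104, 103), (86, 92), (105, 100), (118, 115),
         (96, 102), (90, 87), (95, 100), (105, 96), (84, 80),
         (94, 87), (119, 116), (82, 91), (80, 93), (109, 124),
         (111, 119), (89, 94), (99, 117), (94, 93), (99, 110),
         (95, 97), (102, 104), (102, 93)]
n = len(pairs)

# Step 1: the five raw sums.
sum_x = sum(x for x, _ in pairs)
sum_y = sum(y for _, y in pairs)
sum_x2 = sum(x * x for x, _ in pairs)
sum_y2 = sum(y * y for _, y in pairs)
sum_xy = sum(x * y for x, y in pairs)

# Step 2: corrected sums of squares and cross-products.
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

# Step 3: Pearson's correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 3))
```

The printed value is positive, consistent with higher Verbal IQ tending to go with higher Math IQ in this data set.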