Title: Numerical Descriptive Measures
1Numerical Descriptive Measures
2Objectives
- On completion of this module you will be able to:
- calculate and interpret measures of central tendency (mean, median, mode),
- calculate and interpret quartiles, range, interquartile range, variance and standard deviation,
- calculate and interpret the coefficient of variation,
- understand and utilise the empirical rule and the Bienaymé-Chebyshev Rule,
3Objectives
- On completion of this module you will be able to:
- construct a box-and-whisker plot,
- calculate the covariance and correlation and
- discuss pitfalls and ethical issues relating to
descriptive measures.
4Guide for study this week
- Print out Section 3.7 of the text (on CD) so that you can bring it in to the exam room.
- Read Appendices A (algebra review), B (summation notation) and C (statistical symbols). This material will help you understand the course content.
5Example 3-1
- A manufacturer of mobile phones has been concerned that the latest model of the battery is not lasting as long as anticipated.
- They take a random sample of 20 phones and batteries, and record how long they take to go flat (this is done by turning the phones on and leaving them switched on until the battery goes flat).
6Example 3-1
- The following data (battery life in hours) are the result:
42 42 48 45 51 45 48 44 43 42
46 46 47 48 40 48 42 48 51 50
- Check that you can replicate the results discussed here using Excel and PHStat2.
7(a) Mean, median and mode
- Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{916}{20} = 45.8$ hours.
8(a) Mean, median and mode
- Median is the $\frac{n+1}{2} = \frac{20+1}{2} = 10.5$th ordered observation.
- Order the data from smallest to largest.
- Find the 10th and 11th values: 46 and 46.
- Half-way between these is the median: 46 hours.
40 42 42 42 42 43 44 45 45 46
46 47 48 48 48 48 48 50 51 51
9(a) Mean, median and mode
- Mode is the data value that occurs most often, i.e. the most typical value.
- Mode = 48 (appears five times).
10(a) Mean, median and mode
- Mean and median are similar: they are probably the best measures of the middle for this data set.
- Mode is usually only a good measure of the middle for large data sets.
- The manufacturer will use this information to determine what is normal or most usual for battery life. This would be helpful in producing and maintaining quality products and in benchmarking.
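The three measures of central tendency for Example 3-1 can be checked with a minimal Python sketch. Using Python's standard `statistics` module is an assumption made here for illustration; the module itself works in Excel and PHStat2.

```python
import statistics

# Battery life in hours for the 20 sampled phones (Example 3-1)
battery_life = [42, 42, 48, 45, 51, 45, 48, 44, 43, 42,
                46, 46, 47, 48, 40, 48, 42, 48, 51, 50]

mean = statistics.mean(battery_life)      # sum of the values divided by n
median = statistics.median(battery_life)  # average of the 10th and 11th ordered values
mode = statistics.mode(battery_life)      # value that occurs most often

print(f"mean = {mean}, median = {median}, mode = {mode}")
```

The printed values should agree with the mean, median and mode found by hand on the slides.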
11(b) Quartiles
- Lower (or first) quartile (LQ) is the $\frac{n+1}{4} = \frac{20+1}{4}$th ordered value, or the 5.25th value.
- The text rounds this position to the nearest whole number and takes that data value: the 5th data point is 42.
- Some texts take the value 0.25 of the way from the 5th (42) to the 6th (43) value: 42.25.
40 42 42 42 42 43 44 45 45 46
46 47 48 48 48 48 48 50 51 51
12(b) Quartiles
- Upper (or third) quartile (UQ) is the $\frac{3(n+1)}{4} = \frac{3(20+1)}{4}$th ordered value, or the 15.75th value.
- UQ is the 16th value: 48.
- OR take the value 0.75 of the way from the 15th (48) to the 16th (48) value: 48.
40 42 42 42 42 43 44 45 45 46
46 47 48 48 48 48 48 50 51 51
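Both quartile conventions described above (rounding the position, or interpolating between neighbouring values) can be sketched in Python as follows. The helper name `value_at` is hypothetical, and the positions are assumed to be (n+1)/4 and 3(n+1)/4, which matches the 5th/6th and 15th/16th values used on the slides.

```python
import math

# Battery life data from Example 3-1, ordered from smallest to largest
ordered = sorted([42, 42, 48, 45, 51, 45, 48, 44, 43, 42,
                  46, 46, 47, 48, 40, 48, 42, 48, 51, 50])
n = len(ordered)

def value_at(data, position):
    """Value at a 1-based, possibly fractional, position using linear interpolation."""
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return data[lower - 1]
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])

lq_position = (n + 1) / 4       # 5.25th value
uq_position = 3 * (n + 1) / 4   # 15.75th value

# Convention 1: round the position to the nearest whole number.
# Note: Python's round() sends exact .5 positions to the nearest even integer,
# so a position ending in .5 would need separate handling; neither does here.
lq_rounded = ordered[round(lq_position) - 1]   # 5th value
uq_rounded = ordered[round(uq_position) - 1]   # 16th value

# Convention 2: interpolate between the neighbouring ordered values.
lq_interpolated = value_at(ordered, lq_position)
uq_interpolated = value_at(ordered, uq_position)

print(lq_rounded, uq_rounded, lq_interpolated, uq_interpolated)
```

For this data set the two conventions differ only for the lower quartile, because the 15th and 16th ordered values are equal.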
13(c) Variance and standard deviation
- The text uses a different formula, but a better computational formula for the variance is
  $S^2 = \frac{\sum X_i^2 - \frac{\left(\sum X_i\right)^2}{n}}{n - 1}$
- This only requires that we find the sum of the data and the sum of the squared data to do the calculation.
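14(c) Variance and standard deviation
- $\sum X_i = 916$ and $\sum X_i^2 = 42154$, so
  $S^2 = \frac{42154 - \frac{916^2}{20}}{19} = \frac{201.2}{19} \approx 10.5895$ hours squared
15(c) Variance and standard deviation
- $S = \sqrt{10.5895} \approx 3.254$ hours
Note: to avoid rounding errors, don't round until the final stage!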
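The computational approach (only the sum of the data and the sum of the squared data are needed) can be sketched in Python. This is an illustration rather than the text's own working; the cross-check against `statistics.variance`, which uses the definitional formula, is an addition made here.

```python
import math
import statistics

# Battery life in hours (Example 3-1)
battery_life = [42, 42, 48, 45, 51, 45, 48, 44, 43, 42,
                46, 46, 47, 48, 40, 48, 42, 48, 51, 50]
n = len(battery_life)

sum_x = sum(battery_life)                    # sum of the data
sum_x_sq = sum(x * x for x in battery_life)  # sum of the squared data

# Sample variance from the two sums; nothing is rounded until printing
variance = (sum_x_sq - sum_x ** 2 / n) / (n - 1)
std_dev = math.sqrt(variance)

# Cross-check against the definitional formula
matches = math.isclose(variance, statistics.variance(battery_life))

print(f"sum = {sum_x}, sum of squares = {sum_x_sq}")
print(f"variance = {variance:.4f}, standard deviation = {std_dev:.4f}, matches = {matches}")
```

Keeping full precision in `variance` before taking the square root follows the advice not to round until the final stage.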
16Interpreting variance and standard deviation
- Variance and standard deviation measure average scatter around the mean.
- Variance results in squared units (e.g. squared dollars, squared metres etc).
- Therefore we usually interpret the standard deviation, which is measured in the original units (dollars, metres etc).
- Much of the data is within one standard deviation either side of the mean (more on this later).
- Measures of variation (variance, standard deviation, range and IQR) are always greater than or equal to zero.
17Population parameters
- The formulae we have been using calculate the mean and standard deviation for samples of data.
- If you had all the data from the population (i.e. not just a sample), the following formulae would be used:
  $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$ and $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
18Coefficient of variation
- Coefficient of variation (CV) is
  $CV = \left(\frac{S}{\bar{X}}\right) \times 100\%$
- This is a relative measure of variation: it measures the scatter of the data relative to the mean.
- It allows comparison of variability between variables with different units of measure.
19Coefficient of variation
- Imagine two data sets
- Although the standard deviations are equal, we can't say the distributions have the same relative variation.
- Distribution one shows greater relative variation than distribution two.
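20(c) Coefficient of variation
- Back to the example: the coefficient of variation (CV) is
  $CV = \left(\frac{S}{\bar{X}}\right) \times 100\% = \left(\frac{3.254}{45.8}\right) \times 100\% \approx 7.1\%$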
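A short Python sketch of the coefficient of variation, applied first to the battery data and then to two small data sets with equal standard deviations but different means. The second pair of data sets is hypothetical (invented here purely for illustration, not taken from the module), as is the function name `coefficient_of_variation`.

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100

# Battery life in hours (Example 3-1)
battery_life = [42, 42, 48, 45, 51, 45, 48, 44, 43, 42,
                46, 46, 47, 48, 40, 48, 42, 48, 51, 50]
cv_battery = coefficient_of_variation(battery_life)

# Hypothetical data sets (NOT from the module): same spread, different means
set_one = [8, 10, 12]
set_two = [98, 100, 102]
cv_one = coefficient_of_variation(set_one)
cv_two = coefficient_of_variation(set_two)

print(f"battery CV = {cv_battery:.2f}%")
print(f"set one CV = {cv_one:.1f}%, set two CV = {cv_two:.1f}%")
```

Both hypothetical sets have a standard deviation of 2, yet the set with the smaller mean has the larger CV, which is the sense in which equal standard deviations do not imply equal relative variation.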
21(c) Interquartile range and range
- Interquartile range (IQR) = UQ - LQ = 48 - 42 = 6 hours
- Range = maximum - minimum = 51 - 40 = 11 hours
22(d) Box-and-Whisker Plot
- For a box-and-whisker plot we need five values: the minimum, lower quartile, median, upper quartile and the maximum.
- For our data set these are 40, 42, 46, 48 and 51 respectively.
- Create the central box with vertical lines at the lower quartile, median and upper quartile.
- Plot lines out from this box to the minimum and maximum values.
23(d) Box-and-Whisker Plot
[Figure: box-and-whisker plot of battery life, drawn on an axis running from 40 to 51 hours]
24(e) Interpretation
- If the manufacturer intended to issue a statement saying that their batteries will last more than 50 hours, what would you advise them? Why?
- Mean, median and mode are all less than 50 hours, so it is not wise to make this claim.
- Only 3 of the 20 observations (15%) are 50 hours or more, and only 2 (10%) actually exceed 50 hours!
25(f) Changed data: measures of central tendency
- (f) Suppose the first value was 142 instead of
42. Repeat (a) and comment on the differences.
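26(f) Changed data: measures of central tendency
40 42 42 42 43 44 45 45 46 46
47 48 48 48 48 48 50 51 51 142
- Mean = 1016/20 = 50.8 hours (previously 45.8).
- Since the data order has changed, the median becomes 46.5 (half-way between the 10th and 11th values, 46 and 47).
- Mode = 48 (as before).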
27(g) Changed data: measures of spread
28(g) Changed data: measures of spread
29(g) Changed data: measures of spread
- The range, variance and standard deviation have increased dramatically: remember, only one data point changed! (The interquartile range, which ignores the extremes, is hardly affected.)
- We would have interpreted the shape of this data distribution very differently if we had not known these figures were so affected by one data value: it is important to always check the data carefully.
- A stem-and-leaf plot or histogram would have helped us identify the outlier.
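The effect of the single changed value can be seen by computing the same summary measures for both versions of the data. The sketch below is illustrative only; the helper name `summary` and the choice of measures shown are assumptions made here.

```python
import statistics

# Original battery life data (Example 3-1)
original = [42, 42, 48, 45, 51, 45, 48, 44, 43, 42,
            46, 46, 47, 48, 40, 48, 42, 48, 51, 50]
# Part (f): the first value is recorded as 142 instead of 42
changed = [142] + original[1:]

def summary(data):
    """Mean, median, range, sample variance and sample standard deviation."""
    ordered = sorted(data)
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "range": ordered[-1] - ordered[0],
        "variance": statistics.variance(data),
        "std_dev": statistics.stdev(data),
    }

before = summary(original)
after = summary(changed)

for name in before:
    print(f"{name:>8}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

The median barely moves, while the mean and the spread measures that use every data value shift substantially.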
30(h) Description of distribution
- (h) How would you describe the shape of the original data set? The revised data set?
- The best way to do this would be a graph (e.g. produce a stem-and-leaf diagram or a histogram).
- First data set: the mean (45.8) and median (46) are very similar, so the data is fairly symmetrical.
31(h) Description of distribution
- Modified data set: the mean (50.8) is greater than the median (46.5), so the data shows a slight right skew (although we know an outlier caused this result).
32Shape
Negative or left-skewness: Mean < Median
Symmetry or zero-skewness: Mean = Median
Positive or right-skewness: Mean > Median
33Geometric mean
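- Geometric mean is the nth root of the product of n values:
  $\bar{X}_G = \left(X_1 \times X_2 \times \cdots \times X_n\right)^{1/n}$
- Geometric mean rate of return is
  $\bar{R}_G = \left[(1 + R_1)(1 + R_2) \cdots (1 + R_n)\right]^{1/n} - 1$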
34Example 3-2
- The total rate of return (%) of three blue-chip stocks is given in the table below for the years 2003, 2004 and 2005.
- (a) Calculate the geometric mean rate of return for each stock.
- (b) Compare these results.
Year   Stock A   Stock B   Stock C
2003   3.64      1.12      -0.25
2004   2.32      1.70      1.03
2005   0.09      -3.50     2.08
35Solution 3-2
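- The geometric mean rate of return is given by
  $\bar{R}_G = \left[(1 + R_1)(1 + R_2) \cdots (1 + R_n)\right]^{1/n} - 1$
- where the $R_i$ are expressed as decimals.
- Stock A: $\bar{R}_G = \left[(1.0364)(1.0232)(1.0009)\right]^{1/3} - 1 \approx 0.0201$
- Stock B: $\bar{R}_G = \left[(1.0112)(1.0170)(0.9650)\right]^{1/3} - 1 \approx -0.0025$
- Stock C: $\bar{R}_G = \left[(0.9975)(1.0103)(1.0208)\right]^{1/3} - 1 \approx 0.0095$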
36Solution 3-2
- (b) Stock A: 2.01%
- Stock B: -0.25%
- Stock C: 0.95%
- Stock B has the worst rate of return (due to the negative value in 2005).
- Stock A has the best rate of return (positive, but still a considerable drop in 2005).
- Stock C shows an increasing rate of return over the three years, which may actually make it a better choice!
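The geometric mean rate of return for each stock in Example 3-2 can be reproduced with a minimal Python sketch. The function name `geometric_mean_return` is hypothetical; the yearly returns are those in the table, converted from percentages to decimals inside the function.

```python
def geometric_mean_return(percent_returns):
    """Geometric mean rate of return (in %) from yearly returns given in %."""
    growth = 1.0
    for r in percent_returns:
        growth *= 1 + r / 100   # convert each % return to a growth factor
    n = len(percent_returns)
    return (growth ** (1 / n) - 1) * 100

# Yearly total rates of return (%) for 2003, 2004 and 2005
stocks = {
    "A": [3.64, 2.32, 0.09],
    "B": [1.12, 1.70, -3.50],
    "C": [-0.25, 1.03, 2.08],
}

results = {name: geometric_mean_return(returns) for name, returns in stocks.items()}

for name, rate in results.items():
    print(f"Stock {name}: {rate:.2f}%")
```

Working with growth factors (1 + R) rather than the raw returns is what allows negative yearly returns, such as Stock B's in 2005, to be handled.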
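37Population parameters
- Recall the following formulae for the population mean and standard deviation:
  $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$ and $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
- Greek letters are used to indicate population parameters ($\mu$ = mu, $\sigma$ = sigma) and Roman letters for sample statistics ($\bar{X}$, $S$).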
38Empirical Rule
- In bell-shaped distributions (symmetrical, mean = median):
- approximately 68% of observations are within 1 standard deviation of the mean
- approximately 95% of observations are within 2 standard deviations of the mean
- approximately 99.7% of observations are within 3 standard deviations of the mean
39Bienaymé-Chebyshev Rule
- For any data set (i.e. not just bell-shaped distributions), the percentage of observations that are within k standard deviations of the mean is at least
  $\left(1 - \frac{1}{k^2}\right) \times 100\%$
- Often this rule is simply called Chebyshev's rule: a little bit easier to say!
40Example 3-3
- Returning to the data set in Example 3-1, answer the following questions.
- (a) According to the Bienaymé-Chebyshev Rule, what percentage of these battery lives are expected to be within 1 standard deviation of the mean? Within 2 standard deviations of the mean? Within 3 standard deviations of the mean?
41(a) Bienaymé-Chebyshev Rule
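- Given k = 1, $\left(1 - \frac{1}{1^2}\right) \times 100\% = 0\%$, so at least 0% of observations are expected to be within 1 standard deviation of the mean (not very helpful!).
- For k = 2, $\left(1 - \frac{1}{2^2}\right) \times 100\% = 75\%$, so at least 75% of observations are within 2 standard deviations of the mean.
- For k = 3, $\left(1 - \frac{1}{3^2}\right) \times 100\% \approx 88.89\%$, so at least 88.89% of observations are within 3 standard deviations of the mean.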
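The three bounds can be generated by a one-line function. This is a minimal sketch; the function name `chebyshev_lower_bound` and the decision to reject k below 1 (where the formula would give a negative percentage) are choices made here, not part of the module.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("the bound is only meaningful for k >= 1")
    return (1 - 1 / k ** 2) * 100

bounds = {k: chebyshev_lower_bound(k) for k in (1, 2, 3)}

for k, bound in bounds.items():
    print(f"k = {k}: at least {bound:.2f}% of observations")
```

Because the bound depends only on k, it is the same for every data set, which is why it can be far below what a particular sample actually shows.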
42Example 3-3
- (b) Assume that the manufacturer knows that the mean life of the population of batteries is 48.2 hours and the standard deviation of the population of batteries is 3.1 hours.
- What percentage of data values are actually within 1 standard deviation of the mean?
- Within 2 standard deviations of the mean?
- Within 3 standard deviations of the mean?
43(b) Data within intervals
40 42 42 42 42 43 44 45 45 46
46 47 48 48 48 48 48 50 51 51
The interval is 48.2 ± 3.1, i.e. 45.1 to 51.3 hours.
So 11 of the 20 values (55%) are within one standard deviation of the mean.
44(b) Data within intervals
40 42 42 42 42 43 44 45 45 46
46 47 48 48 48 48 48 50 51 51
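The interval is 48.2 ± 2(3.1), i.e. 42.0 to 54.4 hours.
So 19 of the 20 values (95%) are within two standard deviations of the mean (counting the four values of 42, which sit exactly on the lower limit).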
45(b) Data within intervals
40 42 42 42 42 43 44 45 45 46
46 47 48 48 48 48 48 50 51 51
The interval is 48.2 ± 3(3.1), i.e. 38.9 to 57.5 hours.
So all (100%) of the data is within three standard deviations of the mean.
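Counting the observations inside each interval can be sketched in Python as below. Two assumptions are made here: values that fall exactly on an interval limit are counted as inside, and a small tolerance is used so that floating-point rounding of the limits does not exclude them. The function name `percent_within` is hypothetical.

```python
# Battery life in hours (Example 3-1)
battery_life = [42, 42, 48, 45, 51, 45, 48, 44, 43, 42,
                46, 46, 47, 48, 40, 48, 42, 48, 51, 50]

POPULATION_MEAN = 48.2  # hours, as given in part (b)
POPULATION_SD = 3.1     # hours, as given in part (b)
TOLERANCE = 1e-9        # assumption: boundary values count as inside

def percent_within(data, centre, spread, k):
    """Percentage of data values within k * spread of centre (limits included)."""
    low = centre - k * spread - TOLERANCE
    high = centre + k * spread + TOLERANCE
    inside = [x for x in data if low <= x <= high]
    return len(inside) * 100 / len(data)

actual = {k: percent_within(battery_life, POPULATION_MEAN, POPULATION_SD, k)
          for k in (1, 2, 3)}

for k, pct in actual.items():
    print(f"within {k} standard deviation(s): {pct:.0f}%")
```

If boundary values were excluded instead, the figure for k = 2 would drop, since the four observations of 42 lie exactly on the lower limit of that interval.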
46Example 3-3
- (c) Discuss the difference in your answers to (a) and (b).
- The Bienaymé-Chebyshev Rule applies to any distribution: it is a worst case.
- It says at least 0% within 1 standard deviation, at least 75% within 2 standard deviations and at least 88.89% within 3 standard deviations.
- The data set we examined has less spread than this worst-case rule.
47Example 3-4
- A real estate agency is worried that many of their agents are using poor sales techniques and that this is having a negative impact on sales.
- They believe this is because many of their agents received very low scores on their compulsory training course exam (an exam which is sat prior to beginning employment with the agency).
- They randomly select 10 of their agents, recording their exam score (out of 200) and the number of sales they made in the year 2005.
48(a) Produce a scatterplot of the data. Does
there appear to be any correlation between exam
score and sales? Explain.
Score Sales
185 212
122 143
157 184
165 182
183 201
191 235
121 154
158 187
166 178
102 146
50(a) Graph discussion
- Higher exam scores appear to correspond to higher sales figures.
- Therefore there appears to be positive correlation between exam scores and sales figures.
51Hints on producing graphs
- Important: always include the following on a graph:
- descriptive labels for both the x and y axes (in this example, Exam Score and Sales)
- numbers on both axes to indicate the scale
- a title
- Truncate the axes only if it doesn't violate principles of graphical excellence!
52(b) Compute the correlation coefficient.
Comment on this value and its meaning for the
real estate agency.
Score Sales
185 212
122 143
157 184
165 182
183 201
191 235
121 154
158 187
166 178
102 146
53(b) Correlation
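- Using the computational formula
  $r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sqrt{\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]\left[\sum Y^2 - \frac{(\sum Y)^2}{n}\right]}}$
- we need to find the values $\sum X$, $\sum Y$, $\sum X^2$, $\sum Y^2$ and $\sum XY$.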
54(b) Correlation
55(b) Correlation
Note the amount of working required to use the form of the formula that the text uses: see p. 3-12 of the study guide.
56(b) Correlation
- r = 0.934265 is close to 1, indicating strong positive correlation between exam scores and sales figures.
- Interpretation for the real estate agency:
- Allow students (agents) to re-sit the exam to (possibly) improve their sales performance.
- Use the exam as a pre-screening tool when employing potential agents.
- Be very careful: correlation does not imply causation!
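The correlation coefficient for Example 3-4 can be computed from five sums in a few lines of Python. This is a sketch using one common computational form of the formula, assumed here; it is algebraically equivalent to the definitional form, so the result does not depend on which form the slides used.

```python
import math

# Exam score (out of 200) and number of sales in 2005 for the 10 sampled agents
scores = [185, 122, 157, 165, 183, 191, 121, 158, 166, 102]
sales = [212, 143, 184, 182, 201, 235, 154, 187, 178, 146]
n = len(scores)

# The five sums the computational formula needs
sum_x = sum(scores)
sum_y = sum(sales)
sum_xx = sum(x * x for x in scores)
sum_yy = sum(y * y for y in sales)
sum_xy = sum(x * y for x, y in zip(scores, sales))

numerator = sum_xy - sum_x * sum_y / n
denominator = math.sqrt((sum_xx - sum_x ** 2 / n) * (sum_yy - sum_y ** 2 / n))
r = numerator / denominator

print(f"r = {r:.6f}")
```

A value of r this close to 1 describes how tightly the points follow a rising straight line; it says nothing about whether higher exam scores cause higher sales.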
57Covariance
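- The covariance is found via
  $\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
- It is used in the calculation of correlation for the formula used in the text: $r = \frac{\mathrm{cov}(X, Y)}{S_X S_Y}$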
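A minimal Python sketch of the sample covariance for the exam score and sales data, followed by the correlation obtained by dividing the covariance by the two sample standard deviations. The function name `sample_covariance` is hypothetical, and the covariance-based form of r is the standard one, assumed here to be the form the text uses.

```python
import statistics

# Exam score and number of sales for the 10 sampled agents (Example 3-4)
scores = [185, 122, 157, 165, 183, 191, 121, 158, 166, 102]
sales = [212, 143, 184, 182, 201, 235, 154, 187, 178, 146]

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)

covariance = sample_covariance(scores, sales)

# Correlation = covariance scaled by both sample standard deviations
r = covariance / (statistics.stdev(scores) * statistics.stdev(sales))

print(f"covariance = {covariance:.3f}, r = {r:.6f}")
```

The covariance carries the units of both variables (score points times sales), which is why it is usually rescaled into the unit-free correlation coefficient before being interpreted.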
58Descriptive measures from a frequency distribution
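- Approximate mean: $\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n}$
- Approximate standard deviation: $S = \sqrt{\frac{\sum_{j=1}^{c} (m_j - \bar{X})^2 f_j}{n - 1}}$
- Here $m_j$ is the midpoint and $f_j$ the frequency of class j, c is the number of classes and n is the total number of observations.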
59Example 3-5
- Participants at a recent accounting for small businesses workshop were asked to complete an anonymous survey.
- The table below contains data taken from this survey: a frequency distribution of the number of staff employed by each of the 50 small businesses in attendance at the workshop.
- Note that fractional (part-time) staff were recorded in this survey, so for example 2.75 staff could mean two full-time staff and one staff member employed for ¾ of the hours in a working week.
60Example 3-5
Class                 Frequency
0 to less than 5      16
5 to less than 10     19
10 to less than 15    5
15 to less than 20    7
20 to less than 25    2
25 to less than 30    1
Approximate the arithmetic mean and standard deviation of the number of staff.
64Example 3-5
Class                 Frequency fj   Midpoint mj   mj fj
0 to less than 5      16             2.5           16 × 2.5 = 40
5 to less than 10     19             7.5           19 × 7.5 = 142.5
10 to less than 15    5              12.5          62.5
15 to less than 20    7              17.5          122.5
20 to less than 25    2              22.5          45
25 to less than 30    1              27.5          27.5
Total                 50                           440
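65Example 3-5
- Approximate mean: $\bar{X} = \frac{\sum m_j f_j}{n} = \frac{440}{50} = 8.8$ staff
- Approximate standard deviation: $S = \sqrt{\frac{\sum (m_j - \bar{X})^2 f_j}{n - 1}} = \sqrt{\frac{1990.5}{49}} \approx 6.37$ staff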
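The grouped-data approximation for Example 3-5 can be sketched in Python: each class is represented by its midpoint, weighted by its frequency. The n - 1 divisor for the standard deviation is an assumption made here, on the basis that the 50 businesses are treated as a sample.

```python
import math

# (lower limit, upper limit, frequency) for each class in Example 3-5
classes = [(0, 5, 16), (5, 10, 19), (10, 15, 5),
           (15, 20, 7), (20, 25, 2), (25, 30, 1)]

midpoints = [(low + high) / 2 for low, high, _ in classes]
frequencies = [freq for _, _, freq in classes]
n = sum(frequencies)

# Approximate mean: frequency-weighted average of the class midpoints
weighted_total = sum(m * f for m, f in zip(midpoints, frequencies))
approx_mean = weighted_total / n

# Approximate standard deviation: weighted squared deviations of the midpoints
squared_devs = sum(f * (m - approx_mean) ** 2 for m, f in zip(midpoints, frequencies))
approx_sd = math.sqrt(squared_devs / (n - 1))

print(f"n = {n}, sum of mj*fj = {weighted_total}")
print(f"approximate mean = {approx_mean:.2f}, approximate sd = {approx_sd:.2f}")
```

These are approximations because every business in a class is treated as if it employed exactly the midpoint number of staff.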
66Text sections on CD
- Remember to print out Section 3.7 (Obtaining
descriptive summary measures from a frequency
distribution) of the text (on the CD) so that
you can bring it in to the exam room!
67Pitfalls and ethical issues
- Interpretation of numerical values is subjective (although the actual calculations are objective).
- Knowing the shape of the distribution can influence the choice of descriptive measures that you use.
- For example, the centre of a skewed data set might be best described by the median rather than the mean.
- Report results accurately but in a neutral and objective manner.
68Pitfalls and ethical issues
- Report both good and bad results.
- Poor presentation is not necessarily the same as unethical presentation of results.
- Unethical behaviour occurs when
- an inappropriate summary method is chosen wilfully, or
- findings are selectively left unreported because they would not support a particular position.
69After the lecture each week
- Review the lecture material
- Complete all readings
- Complete all of the recommended problems (listed in SG) from the textbook
- Complete at least some of the additional problems
- Consider (briefly) the discussion points prior to tutorials