Title: Statistics 221
1Statistics 221
- Chapter 3 Part A
- Descriptive Statistics
2Summarizing Data
- We learned in Chapter 2 that one way to derive
knowledge (i.e., learn something) is to collect
data regarding some phenomenon and then summarize
and analyze it.
- In Chapter 2, we learned about tabular and
graphical techniques for summarizing data. In
this chapter, we learn about numeric techniques
for summarizing data.
3Numeric techniques for summarizing data
- Measures of Location (mean, median, mode,
percentiles, quartiles)
- Measures of Variability (range, inter-quartile
range, variance, standard deviation, coefficient
of variation)
- Measures of Relative Location (z-scores) and
Detecting Outliers
- Exploratory Data Analysis (the 5-number summary
and box plot)
- Measures of Association Between Two Variables
(covariance and correlation coefficient)
4Parameters vs Statistics
- If a numerical summary measure (such as a mean
or average) is computed from a sample, it is
referred to as a statistic; if it is computed
from a population, it is referred to as a
parameter.
- When a sample is taken from a population and
a statistic is calculated from the sample
data, the sample statistic is considered to be
a point estimate of the population parameter.
5Measures of location (aka measures of central
tendency)
- Here are the five we will learn:
- Mean
- Median
- Mode
- Percentiles
- Quartiles
6The mean (average)
- The mean of a data set is the average of all the
data values.
- If the data are from a sample, the mean is
denoted by $\bar{x}$ (x-bar): $\bar{x} = \frac{\sum x_i}{n}$,
where n = sample size.
- If the data are from a population, the mean is
denoted by $\mu$ (mu): $\mu = \frac{\sum x_i}{N}$,
where N = population size.
7Example Apartment Rents
- Given below is a sample of monthly rent values
for one-bedroom apartments. The data are a sample
of 70 apartments in a particular city, presented
in ascending order.
8Calculating the mean rent
- Add up all the rents and divide by the number of
rents.
- The mean is denoted by $\bar{x}$ (x-bar):
$\bar{x} = \frac{\sum x_i}{n} = \frac{\sum x_i}{70} = 490.8$
9The Median
- The median of a data set is the value in the
middle when the data items are arranged in
ascending order.
- For an odd number of observations (n), the median
is the middle value.
- For an even number of observations (n), the
median is the average of the two middle values.
- The median may be reported instead of the mean
when the data set includes a few extreme values.
10The median rent
- i refers to index. Index is the position number
of a value in a data set that has been arranged
into ascending order.
- i = 50% × 70 = 35
- Since 70 is even, we average the values in the
35th and 36th positions: Median = (475 + 475)/2
= 475
11The median rent
- What would be the median rent if n = 25?
25/2 = 12.5, round up to 13. The 13th value is
440 (the middle value).
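The position rule on the two median slides can be sketched in Python. This is a minimal illustration, not part of the course materials: the function name `median_by_position` is hypothetical, and because the 70 rent values are not reproduced here, the example borrows the twelve midterm scores that appear on a later slide.

```python
import math

def median_by_position(values):
    """Median via the slides' index rule: sort, then look at position n/2."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    i = math.ceil(n / 2)              # 1-based position; rounds up when n is odd
    if n % 2 == 1:
        return data[i - 1]            # odd n: the single middle value
    return (data[i - 1] + data[i]) / 2  # even n: average positions i and i + 1

# Midterm scores from a later slide (n = 12, even)
scores = [70, 73, 79, 82, 83, 87, 88, 90, 91, 94, 98, 100]
print(median_by_position(scores))       # 87.5
print(median_by_position(scores[:11]))  # 87 (n = 11, odd: the 6th value)
```

With n = 25 the same rule gives i = 13, matching the slide.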
12The Mode
- The mode of a data set is the value that occurs
with greatest frequency.
- The greatest frequency can occur at two or more
different values.
- If the data have exactly two modes, the data are
bimodal.
- If the data have more than two modes, the data
are multimodal.
13The mode rent
- 450 occurred most frequently (7 times), so the
mode = 450
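Because a data set can have more than one mode, a small Python sketch that returns every most-frequent value may help. The function name `modes` and both sample lists are hypothetical illustrations, not the apartment rent data.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the greatest frequency."""
    if not values:
        raise ValueError("mode of an empty data set is undefined")
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

# Hypothetical rent samples, for illustration only
print(modes([425, 450, 450, 475]))       # [450]       one mode
print(modes([450, 450, 475, 475, 500]))  # [450, 475]  bimodal
```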
14Percentiles
- A percentile provides information about where a
particular value falls in the rankings of all
data values in the data set.
- For example, admission test scores for colleges
and universities are frequently reported in terms
of percentiles.
- So if you got a 25 on the ACT, a percentile score
would tell you what percentage of people did
worse than you.
- If your score was in the 70th percentile, then
approximately 70% of the students did worse than
you, which means approximately 30% did better.
15Calculating Percentiles
- 1. Arrange the data in ascending order.
- 2. Compute index i, the position of the pth
percentile.
- i = (p/100) × n
- 3a. If i is not an integer, round up to the next
integer. The pth percentile is the value in the
ith position.
- 3b. If i is an integer, the pth percentile is
the average of the values in positions i and
i + 1.
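The three steps above can be sketched as a short Python function. This is an illustrative sketch under stated assumptions: the name `percentile` is hypothetical, p is taken as a whole-number percentage, and the index is computed with integer arithmetic so that cases like .90 × 70 are recognized as exact integers. The sample data are the midterm scores from a later slide.

```python
def percentile(values, p):
    """p-th percentile by the slides' position rule (p a whole number, 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    data = sorted(values)                  # step 1: ascending order
    n = len(data)
    if n == 0:
        raise ValueError("percentile of an empty data set is undefined")
    # step 2: i = (p/100) * n, split into whole part and remainder
    whole, remainder = divmod(p * n, 100)
    if remainder:                          # step 3a: i is not an integer, round up
        return data[whole]                 # 1-based position whole + 1
    # step 3b: i is an integer, average positions i and i + 1
    return (data[whole - 1] + data[whole]) / 2

scores = [70, 73, 79, 82, 83, 87, 88, 90, 91, 94, 98, 100]
print(percentile(scores, 80))  # 94    (i = 9.6, rounded up to position 10)
print(percentile(scores, 50))  # 87.5  (i = 6, average of positions 6 and 7)
```

Statistical software often uses interpolation rules instead, so library results can differ slightly from this rule.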
16Percentiles for Apartment Rents
- What rent amount is at the 90th percentile?
- i = (p/100) × n
- i = .90 × 70
- i = 63
- Since i is an integer, we average the numbers in
the 63rd and 64th positions: (580 + 590)/2 = 585
- At least 90% of the apartments have rents of 585
or less.
17Similar Percentile Question
- Here are the scores on the midterm (n = 12):
- 70 73 79 82 83 87 88 90 91
94 98 100
- If you know that you are in the 80th percentile,
which of these is your score?
- i = (p/100) × n
- i = .8 × 12
- i = 9.6
- Since i is not an integer, we round up to 10.
- The number in the 10th position is 94.
- At least 80% of the scores are less than or
equal to your score of 94.
18Another Percentile Question
- Here are the scores on the midterm (n = 12):
- 70 73 79 82 83 87 88 90 91
94 98 100
- If you got the 79, what percentile are you in?
- After the dataset is sorted in ascending order,
count the number of values below 79 and divide
that by n
- p = (number below you) / n
- p = 2 / 12
- p = 16.7%, round to the 17th percentile
- About 17% of the scores are less than your
score of 79.
19Another Percentile Question
- Here are the scores on the midterm (n = 12):
- 70 73 79 82 83 87 88 90 91
94 98 100
- If you got the 98, what percentile are you in?
- p = (number below you) / n
- p = 10 / 12
- p = 83.3%, round to the 83rd percentile
- About 83% of the scores are less than your
score of 98.
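The "count the values below and divide by n" calculation from the last two slides can be checked with a few lines of Python. The function name `percent_below` is a hypothetical label for this sketch.

```python
def percent_below(values, score):
    """Percentage of values strictly less than the given score."""
    below = sum(1 for v in values if v < score)
    return 100 * below / len(values)

scores = [70, 73, 79, 82, 83, 87, 88, 90, 91, 94, 98, 100]
print(round(percent_below(scores, 79), 1))  # 16.7  (2 of 12 scores are lower)
print(round(percent_below(scores, 98), 1))  # 83.3  (10 of 12 scores are lower)
```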
20Quartiles
- Sometimes statisticians divide datasets into four
parts called quartiles.
- Quartiles are specific percentiles
- First Quartile: all the values in the 0-24th
Percentile
- Second Quartile: all the values in the 25th-49th
Percentile
- Third Quartile: all the values in the 50th-74th
Percentile
- Fourth Quartile: all the values in the 75th-100th
Percentile
21What are the quartile cut-off amounts (Q1, Q2,
Q3)?
- Q1 (25th percentile): i = 25% × 70 = 17.5, rounded
up to 18, so Q1 = 445
- Q2 (50th percentile): i = 50% × 70 = 35, averaged
with position 36, so Q2 = (475 + 475)/2 = 475
(same as the median)
- Q3 (75th percentile): i = 75% × 70 = 52.5, rounded
up to 53, so Q3 = 525
22What are the quartiles?
- 1st quartile: all rents less than 445
- 2nd quartile: all rents ≥ 445 and less than 475
- 3rd quartile: all rents ≥ 475 and less than 525
- 4th quartile: all rents ≥ 525
23Open the file DataSetsForCh3 and click on the
worksheet Cereal - centrals (measures of
central tendency).
24To calculate the mean, first we add up all the
values to get a sum. B18: =SUM(B2:B17)
25... then count the number of values. B19:
=COUNT(B2:B17)
26... then divide the sum by the count of values.
E2: =B18/B19
27To calculate the median, find the middle value in
the sorted data set. To sort the dataset,
position the cell pointer on one of the cells in
the dataset. From the menu bar, click Data,
Sort
28... the entire dataset is selected and the Sort
window opens. In the Sort by box, select Grams
of sugar, make sure Ascending is selected, and
click OK.
29To find the index of the middle value, divide n
by 2. If n is odd, the quotient will not be an
integer, so round up using the CEILING( )
function. F3: =CEILING(B19/2, 1) (If n is even,
n/2 will be an integer and CEILING( ) will not do
any rounding.)
30To calculate the median: since n is even, we
didn't have to round and i is an integer, so add
the values in positions i and i+1 (8 and 9), then
divide by 2. E3: =(B9+B10)/2
31To calculate the mode, identify the values that
occurred most often. E4: .13, .43, and .47
32Excel's Built-in Functions
- Excel has built-in functions to calculate mean,
median, and mode:
- AVERAGE( )
- MEDIAN( )
- MODE( )
33To find what percentile Cocoa Puffs is in, count
the number of values below that row, divide by
the number of values, and round up. E8:
=CEILING(13/B19, .01) Format that cell as a
percentage with 0 decimal places.
34To find what quartile Cocoa Puffs is in, divide
the dataset into 4 quarters and see which quarter
Cocoa Puffs falls into. E9: 4th. You could also
calculate Q3 (the value of the 75th percentile)
and list all values greater than or equal to that
value.
35To find what cereal is at the 30th percentile,
multiply .3 by the number of values and, if i is
not an integer, round up to get i (index or
position number). F12: =CEILING(.3*B19, 1) (If i
is an integer, average the ith value with the
(i+1)th value.)
36i = 16 × .3 = 4.8, and rounding up, i = 5. We
identify what cereal is listed in that
position. E12: Special K
37To identify the third quartile, calculate Q2 and
Q3, and list the cereals in between. We know that
Q2 is the median (.345). To find the index i for
Q3, first multiply n by .75. F13: =16*.75
38Since i is an integer (12), average the values
in positions i and i+1 (12th and 13th positions)
to calculate Q3. G13: =(.44+.45)/2, which gives
.445
39... type in the names of the cereals that have
sugar content that is > .345 (Q2) and < .445
(Q3).
Resave this file.
40When the mean, median, and mode are not
aligned
- The data is said to be skewed.
- Data is skewed if it is not symmetric and if it
extends more to one side than the other.
41Skewness
- Not skewed: symmetric
- Skewed left: a few very small values in the data
set
- Skewed right: a few very large values in the data
set
42Which measure of central tendency should you
regard as most representative of a data set?
- If there are a few extreme values in your data
set, extreme values may distort the mean but not
the median or the mode.
- Let's say you are a fund-raiser. Your last 10
donations were
- 5, 5, 15, 5, 10, 5, 10, 15, 10 and
1,000.
- What do you want to tell the next person you
solicit for a donation?
- 1. That the average donation is over 100
(actually it's 108)
- 2. The median donation is 10.
- 3. The mode donation is 5.
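The three summaries for the ten donations listed above can be verified with Python's standard `statistics` module. This is only a quick check of the arithmetic; the variable names are illustrative.

```python
import statistics

# The ten donations listed on the slide
donations = [5, 5, 15, 5, 10, 5, 10, 15, 10, 1000]

print(statistics.mean(donations))    # 108   (sum is 1,080 over 10 donations)
print(statistics.median(donations))  # 10.0  (average of the 5th and 6th sorted values)
print(statistics.mode(donations))    # 5     (occurs four times)

# Dropping the single 1,000 donation shows how far one extreme value pulls the mean
print(round(statistics.mean(donations[:-1]), 2))  # 8.89
```

The median and mode do not move when the extreme value is removed, which is why they are less sensitive to outliers.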
43Which measure of central tendency should you
consider?
- The median and the mode are often used to
describe a typical value.
- Let's say you are thinking about becoming a
teacher and you are interested in knowing what
type of starting salary you could expect after
graduation. Which value might be most meaningful
to you?
- 1. The mean starting salary
- 2. The median starting salary
- 3. The mode starting salary
44Measures of Variability
- It is often desirable to consider measures of
variability (dispersion) in addition to measures
of location.
- For example, in choosing supplier A or supplier B
we might consider not only the average delivery
time for each, but also the variability in
delivery time for each.
45Measures of Variability
- Range
- Inter-quartile Range
- Variance
- Standard Deviation
- Coefficient of Variation
46The Range
- The range of a data set is the difference between
the largest and smallest data values.
- It is the simplest measure of variability.
- It is very sensitive to the smallest and largest
data values.
47The range of apartment rents
- The range is 615 - 425 = 190
48Inter-quartile range
- The interquartile range of a data set is the
difference between the third quartile and the
first quartile.
- It is the range for the middle 50% of the data.
- Examining the inter-quartile range of a dataset
allows you to get a feel for the middle-range.
49Example Inter-quartile Range
- 3rd Quartile (Q3) = 525
- 1st Quartile (Q1) = 445
- Inter-quartile Range = Q3 - Q1 = 525 - 445 = 80
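As a rough sketch, the range and inter-quartile range can be computed in Python using the same position rule the slides use for percentiles. The helper names (`percentile`, `value_range`, `interquartile_range`) are hypothetical, and the data are the twelve midterm scores from an earlier slide, since the full rent data set is not reproduced here.

```python
def percentile(values, p):
    """p-th percentile by the slides' position rule (p a whole number, 0 < p < 100)."""
    data = sorted(values)
    whole, remainder = divmod(p * len(data), 100)  # i = (p/100) * n
    if remainder:                                  # i not an integer: round up
        return data[whole]
    return (data[whole - 1] + data[whole]) / 2     # average positions i and i + 1

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

def interquartile_range(values):
    """Q3 minus Q1, the spread of the middle 50% of the data."""
    return percentile(values, 75) - percentile(values, 25)

scores = [70, 73, 79, 82, 83, 87, 88, 90, 91, 94, 98, 100]
print(value_range(scores))          # 30    (100 - 70)
print(interquartile_range(scores))  # 12.0  (Q3 = 92.5, Q1 = 80.5)
```

Changing only the largest score changes the range but leaves the inter-quartile range untouched, which illustrates why the range is so sensitive to extreme values.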
50Variance
- The variance is a measure of variability that
utilizes all the data.
- It is based on the difference between the value
of each observation ($x_i$) and the mean ($\bar{x}$ for a
sample, $\mu$ for a population).
51Variance
- The variance is the average of the squared
differences between each data value and the mean.
- If the data set is a sample, the variance is
denoted by $s^2$: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
- If the data set is a population, the variance is
denoted by $\sigma^2$: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
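The difference between the sample and population versions is only the divisor (n - 1 versus N). A brief Python check using the standard `statistics` module, applied to the midterm scores from an earlier slide, shows the two results side by side. Treating the same twelve scores as both a sample and a population is an assumption made purely for comparison.

```python
import statistics

scores = [70, 73, 79, 82, 83, 87, 88, 90, 91, 94, 98, 100]

s2 = statistics.variance(scores)       # sample variance: divides by n - 1
sigma2 = statistics.pvariance(scores)  # population variance: divides by N

print(round(s2, 2))      # 86.2   (948.25 / 11)
print(round(sigma2, 2))  # 79.02  (948.25 / 12)
```

The sample variance is always the larger of the two for the same data, because its divisor is smaller.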
52Standard Deviation
- The standard deviation of a data set is the
positive square root of the variance.
- It is measured in the same units as the data,
making it more easily comparable to the mean
than the variance is.
- If the data set is a sample, the standard
deviation is denoted s.
- If the data set is a population, the standard
deviation is denoted $\sigma$ (sigma).
53Coefficient of variation
- The coefficient of variation indicates how large
the standard deviation is in relation to the
mean.
- If the data set is a sample, the coefficient of
variation is computed as follows: $\frac{s}{\bar{x}} \times 100\%$
- If the data set is a population, the coefficient
of variation is computed as follows: $\frac{\sigma}{\mu} \times 100\%$
54Calculating the variance, standard deviation, and
coefficient of variation in Excel
- We will walk through the formulas using the
Cereal dataset.
55Open the file DataSetsForCh3 and click on the
worksheet Cereal dispersions (measures of
dispersion).
56Enter the formula to calculate the mean ($\bar{x}$). B18:
=AVERAGE(B2:B17)
57Enter the formula to count the number of values
in the data set (n). B19: =COUNT(B2:B17)
58Enter the formula to subtract the mean ($\bar{x}$) from
the first $x_i$. C2: =B2-$B$18 (The absolute
reference $B$18 keeps the mean fixed when the
formula is copied down.)
59Copy the formula in C2 down to C17 to subtract
the mean ($\bar{x}$) from all the other $x_i$ values.
60Enter the formula to square the first $x_i$'s
difference from the mean ($\bar{x}$). D2: =C2*C2
61Copy the formula in D2 down to D17 to square each
$x_i$'s difference from the mean ($\bar{x}$).
62Sum all the squares of the $x_i$ differences from
the mean ($\bar{x}$). D18: =SUM(D2:D17)
63Calculate the variance by dividing the
sum-of-squares by n-1. D21: =D18/(B19-1)
64Calculate the standard deviation by taking the
square root of the variance. D22: =SQRT(D21)
65Calculate the coefficient of variation by
dividing the standard deviation by the mean. D23:
=D22/B18 Format the cell as a percentage.
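The same column-by-column walkthrough can be mirrored in Python. This is a sketch for comparison only: the cereal sugar values are not reproduced in these slides, so the midterm scores from an earlier slide stand in as the data, and the comments map each variable to the worksheet cell it corresponds to.

```python
import math

# Stand-in data (midterm scores from an earlier slide), not the cereal worksheet
scores = [70, 73, 79, 82, 83, 87, 88, 90, 91, 94, 98, 100]

n = len(scores)                           # B19: count of values
mean = sum(scores) / n                    # B18: mean
deviations = [x - mean for x in scores]   # column C: each value minus the mean
squared = [d * d for d in deviations]     # column D: squared differences
sum_of_squares = sum(squared)             # D18: sum of squared differences
variance = sum_of_squares / (n - 1)       # D21: sample variance
std_dev = math.sqrt(variance)             # D22: sample standard deviation
cv = std_dev / mean                       # D23: coefficient of variation

print(mean)             # 86.25
print(sum_of_squares)   # 948.25
print(f"{cv:.1%}")      # 10.8%
```

Note that the deviations in column C always sum to zero, which is why they are squared before being added.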
66Excel's Built-in Formulas
- Standard deviation of a sample
- STDEV( )
- Variance of a sample
- VAR( )
- Excel does not provide a built-in formula for the
coefficient of variation, which is rarely used.
67Excel's Descriptive Statistics
- We can use Excel's data analysis tool to generate
a table of all the descriptive statistics.
68Step 1. Select all cells in the data set, B2:B17.
Step 2. From the menu bar, select Tools, Data
Analysis.
69Step 3. In the Data Analysis window, select
Descriptive Statistics and click OK.
70Step 4. The input range should be B2:B17.
Summary statistics should be checked. New
worksheet ply should be selected. Click OK.
71Step 5. See a new sheet created with the
descriptive statistics. Resize columns as
necessary. Notice that it did not list all three
modes, only the first mode.
72Step 6. Right-click on the Sheet 2 sheet tab and
select Rename.
73Step 7. Type the name Cereal Descriptives and
press Enter. Resave the file.
74Homework 4
- 7 on page 84
- Mean, median, 1st and 3rd quartiles, percentile
- Use data sheet Music
- 18 on page 92
- Mean, median, range, std. deviation, coefficient
of variation, make comparisons
- Create new worksheet.