Statistics 221 - PowerPoint PPT Presentation

Provided by: CBU

Transcript and Presenter's Notes

1
Statistics 221

Chapter 3 Part A
Descriptive Statistics

2
Summarizing Data

We learned in Chapter 2 that one way to derive
knowledge (i.e., learn something) is to collect
data regarding some phenomenon and then summarize
and analyze it.
In Chapter 2, we learned about tabular and
graphical techniques for summarizing data. In
this chapter, we learn about numeric techniques
for summarizing data.

3
Numeric techniques for summarizing data

Measures of Location (mean, median, mode,
percentiles, quartiles)
Measures of Variability (range, inter-quartile
range, variance, standard deviation, coefficient
of variation)
Measures of Relative Location (z-scores) and
Detecting Outliers
Exploratory Data Analysis (the 5-number summary
and box plot)
Measures of Association Between Two Variables
(covariance and correlation coefficient)

4
Parameters vs Statistics

If a numerical summary measure (such as a mean
or average) is computed from a sample, it is
referred to as a statistic; if it is computed
from a population, it is referred to as a
parameter.
When a sample set is taken from a population, and
a statistic is calculated from the sample
dataset, the sample statistic is considered to be
a point estimate of the population parameter.
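
As a rough illustration of the statistic/parameter distinction, the sketch below draws a sample from a made-up population and compares the two means. The population (the integers 1 to 1000), the sample size of 50, and the seed are all assumptions chosen for the example, not data from these slides.

```python
import random
import statistics

# Hypothetical population: the integers 1..1000 (not data from the slides).
population = list(range(1, 1001))
mu = statistics.mean(population)      # population mean: a parameter

rng = random.Random(221)              # fixed seed so the sketch is repeatable
sample = rng.sample(population, 50)   # simple random sample without replacement
x_bar = statistics.mean(sample)       # sample mean: a statistic (point estimate of mu)

print(f"population mean (parameter): {mu}")
print(f"sample mean (statistic):     {x_bar}")
```

The sample mean will usually land near, but not exactly on, the population mean; that gap is why it is called an estimate.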

5
Measures of location (aka measures of central
tendency)

Here are the five we will learn:
Mean
Median
Mode
Percentiles
Quartiles

6
The mean (average)

The mean of a data set is the average of all the
data values.
If the data are from a sample, the mean is
denoted by x̄ (x-bar):

$\bar{x} = \frac{\sum x_i}{n}$, where n = sample size

If the data are from a population, the mean is
denoted by μ (mu):

$\mu = \frac{\sum x_i}{N}$, where N = population size
7
Example Apartment Rents

Given below is a sample of monthly rent values
($) for one-bedroom apartments. The data are a
sample of 70 apartments in a particular city and
are presented in ascending order.

8
Calculating the mean rent

Add up all the rents and divide by the number of
rents.
The mean is denoted by x̄ (x-bar).

$\bar{x} = \frac{\sum x_i}{n} = \frac{\sum x_i}{70} = 490.80$
9
The Median

The median of a data set is the value in the
middle when the data items are arranged in
ascending order.
For an odd number of observations (n), the median
is the middle value.
For an even number of observations (n), the
median is the average of the two middle values.
The median may be reported instead of the mean
when the data set includes a few extreme values.
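
The odd/even rule above can be sketched in a few lines of Python. The two small rent lists are made-up values for illustration only, and the function name `median` is our own choice (Python's `statistics.median` does the same job).

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: a single middle value exists.
        return ordered[mid]
    # Even n: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [440, 425, 475, 450, 615]   # hypothetical rents, n = 5
even_sample = [440, 425, 475, 450]       # hypothetical rents, n = 4

print(median(odd_sample))
print(median(even_sample))
```

Note that the 615 in the first list barely matters: the median only looks at the middle of the sorted data, which is why it is preferred when a few extreme values are present.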

10
The median rent

i refers to index. Index is the position number
of a value in a data set that has been arranged
into ascending order.
i = (50/100) × 70 = 35
Since 70 is even, we average the values in the
35th and 36th positions: Median = (475 + 475)/2
= 475

11
The median rent

What would be the median rent if n = 25?

25/2 = 12.5, round up to 13. The 13th value is
440 (the middle value).
12
The Mode

The mode of a data set is the value that occurs
with greatest frequency.
The greatest frequency can occur at two or more
different values.
If the data have exactly two modes, the data are
bimodal.
If the data have more than two modes, the data
are multimodal.
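
A minimal sketch of finding every mode and labeling the result, using `statistics.multimode` from the Python standard library (available in Python 3.8 and later). The helper name `describe_modes` and the sample lists are assumptions made for this example.

```python
from statistics import multimode

def describe_modes(values):
    """Return (list of modes, label) for a non-empty data set."""
    if not values:
        raise ValueError("mode of an empty data set is undefined")
    modes = multimode(values)   # every value tied for the greatest frequency
    if len(modes) == 1:
        label = "unimodal"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        label = "multimodal"
    return modes, label

print(describe_modes([450, 450, 475, 500]))
print(describe_modes([1, 1, 2, 2, 3]))
print(describe_modes([0.13, 0.13, 0.43, 0.43, 0.47, 0.47, 0.50]))
```

One caveat: if every value occurs exactly once, `multimode` returns all of them, so the label is only meaningful when some value actually repeats.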

13
The mode rent

450 occurred most frequently (7 times), so the
mode = 450

14
Percentiles

A percentile provides information about where a
particular value falls in the rankings of all
data values in the data set.
For example, admission test scores for colleges
and universities are frequently reported in terms
of percentiles.
So if you got a 25 on the ACT, a percentile score
would tell you what percentage of people did
worse than you.
If your score was in the 70th percentile, then
approximately 70% of the students did worse than
you, which means approximately 30% did better.

15
Calculating Percentiles

1. Arrange the data in ascending order.
2. Compute index i, the position of the pth
percentile:
i = (p/100) × n
3a. If i is not an integer, round up to the next
integer. The pth percentile is the value in the
ith position.
3b. If i is an integer, the pth percentile is
the average of the values in positions i and
i + 1.
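
The three steps above can be sketched as a small Python function. This is only one of several percentile conventions in use, so other tools may give slightly different answers. The function name `percentile` is our own, p is assumed to be a whole number from 1 to 99, and the test data are the twelve midterm scores used in later slides.

```python
def percentile(values, p):
    """p-th percentile (p a whole number, 1..99) using the round-up / average rule."""
    if not 1 <= p <= 99:
        raise ValueError("p must be between 1 and 99")
    ordered = sorted(values)                 # step 1: ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("percentile of an empty data set is undefined")
    # Step 2: i = (p/100) * n, kept in integer arithmetic to avoid rounding noise.
    whole, remainder = divmod(p * n, 100)
    if remainder:
        # Step 3a: i is not an integer, so round up; positions are 1-based.
        position = whole + 1
        return ordered[position - 1]
    # Step 3b: i is an integer, so average positions i and i + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2

scores = [70, 73, 79, 82, 83, 87, 88, 90, 91, 94, 98, 100]
print(percentile(scores, 80))   # i = 9.6, rounds up to position 10
print(percentile(scores, 50))   # i = 6, average of positions 6 and 7
```

Working with `divmod(p * n, 100)` instead of a floating-point product is a design choice: it makes the "is i an integer?" test exact.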

16
Percentiles for Apartment Rents

What rent amount is at the 90th percentile?
i = (p/100) × n
i = 0.90 × 70
i = 63
Since i is an integer, we average the numbers in
the 63rd and 64th positions: (580 + 590)/2 = 585
At least 90% of the apartments have rents of $585
or less.

17
Similar Percentile Question

Here are the scores on the midterm (n = 12):
70 73 79 82 83 87 88 90 91
94 98 100
If you know that you are in the 80th percentile,
which of these is your score?
i = (p/100) × n
i = 0.8 × 12
i = 9.6
Since i is not an integer, we round up to 10.
The number in the 10th position is 94.
At least 80% of the scores are at or below your
score of 94.

18
Another Percentile Question

Here are the scores on the midterm (n = 12):
70 73 79 82 83 87 88 90 91
94 98 100
If you got the 79, what percentile are you in?
After the dataset is sorted in ascending order,
count the number of values below 79 and divide
that by n:
p = (number below you) / n
p = 2 / 12
p = 16.7%, which rounds to the 17th percentile
About 17% of the scores are less than your
score of 79.

19
Another Percentile Question

Here are the scores on the midterm (n = 12):
70 73 79 82 83 87 88 90 91
94 98 100
If you got the 98, what percentile are you in?

p = (number below you) / n
p = 10 / 12
p = 83.3%, which rounds to the 83rd percentile

About 83% of the scores are less than your
score of 98.
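
The "count below, divide by n" rule from the last two slides can be checked with a short Python sketch. The function name `percentile_rank` is our own; the data are the twelve midterm scores shown above.

```python
def percentile_rank(values, score):
    """Percentage of values strictly below `score`."""
    if not values:
        raise ValueError("percentile rank of an empty data set is undefined")
    below = sum(1 for v in values if v < score)   # count the values below the score
    return 100 * below / len(values)

scores = [70, 73, 79, 82, 83, 87, 88, 90, 91, 94, 98, 100]

print(round(percentile_rank(scores, 79)))   # 2 of 12 scores are below 79
print(round(percentile_rank(scores, 98)))   # 10 of 12 scores are below 98
```

For these scores, 2/12 is about 16.7% (17th percentile) and 10/12 is about 83.3% (83rd percentile).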

20
Quartiles

Sometimes statisticians divide datasets into four
parts called quartiles.
Quartiles are defined by specific percentiles:
First Quartile: all the values in the 0-24th
percentile
Second Quartile: all the values in the 25th-49th
percentile
Third Quartile: all the values in the 50th-74th
percentile
Fourth Quartile: all the values in the 75th-100th
percentile

21
What are the quartile cut-off amounts (Q1, Q2,
Q3)?
Index of Q1 (25th percentile): 0.25 × 70 = 17.5,
rounded up to 18, so Q1 = 445
Index of Q2 (50th percentile): 0.50 × 70 = 35,
averaged with the 36th value, so
Q2 = (475 + 475)/2 = 475 (same as the median)
Index of Q3 (75th percentile): 0.75 × 70 = 52.5,
rounded up to 53, so Q3 = 525
22
What are the quartiles?
1st quartile: all rents less than 445
2nd quartile: all rents ≥ 445 and less than 475
3rd quartile: all rents ≥ 475 and less than 525
4th quartile: all rents ≥ 525
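
Given the three cut-offs, placing a value in a quartile is a lookup. The sketch below uses `bisect_right` from the Python standard library and assumes the convention that a value equal to a cut-off belongs to the higher quartile. The function name `quartile_of` and the test rents are our own choices; the cut-offs 445, 475, and 525 come from the apartment example.

```python
from bisect import bisect_right

def quartile_of(value, q1, q2, q3):
    """Return 1, 2, 3, or 4; a value equal to a cut-off falls in the higher quartile."""
    # bisect_right counts how many cut-offs are less than or equal to the value.
    return bisect_right([q1, q2, q3], value) + 1

Q1, Q2, Q3 = 445, 475, 525   # cut-offs from the apartment rent example

for rent in (440, 445, 500, 525, 600):
    print(rent, "is in quartile", quartile_of(rent, Q1, Q2, Q3))
```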
23
Open the file DataSetsForCh3 and click on the
worksheet Cereal - centrals (measures of
central tendency).
24
To calculate the mean, first we add up all the
values to get a sum. B18: =SUM(B2:B17)
25
then count the number of values. B19:
=COUNT(B2:B17)
26
then divide the sum by the count of values.
E2: =B18/B19
27
To calculate the median, find the middle value in
the sorted data set. To sort the dataset,
position the cell pointer on one of the cells in
the dataset. From the menu bar, click Data,
Sort
28
the entire dataset is selected and the sort
window opens. In the sort by box, select Grams
of sugar and make sure ascending is selected,
click ok
29
to find the index of the middle value, divide n
by 2. If n is odd, the quotient will not be an
integer, so round up using the CEILING( )
function. F3: =CEILING(B19/2, 1) (If n is even,
n/2 will be an integer and CEILING( ) will not do
any rounding.)
30
to calculate the median, since n is even, we
didn't have to round and i is an integer, so add
the values in positions i and i + 1 (8 and 9),
then divide by 2. E3: =(B9 + B10)/2
31
to calculate the mode, identify the values that
occurred most often. E4: .13, .43, and .47
32
Excel's Built-in Functions

Excel has built-in formulas to calculate mean,
median, and mode
average( )
median( )
mode( )

33
To find what percentile Cocoa Puffs is in, count
the number of values below that row, divide by
the number of values, and round up. E8:
=CEILING(13/B19, 0.01) Format that cell as a
percentage with 0 decimal places.
34
To find what quartile Cocoa Puffs is in, divide
the dataset into 4 quarters and see which quarter
Cocoa Puffs falls into. E9: 4th. You could also
calculate Q3 (the value of the 75th percentile)
and list all values greater than or equal to that
value.
35
To find what cereal is at the 30th percentile,
multiply .3 by the number of values and, if i is
not an integer, round up to get i (the index or
position number). F12: =CEILING(0.3 * B19, 1)
(If i is an integer, average the ith value with
the (i + 1)th value.)
36
i = 16 × .3 = 4.8, and rounding up, i = 5. We
identify what cereal is listed in that
position. E12: Special K
37
To identify the third quartile, calculate Q2 and
Q3, and list the cereals in between. We know that
Q2 is the median (.345). To find the index of Q3,
first multiply n by .75. F13: =16 * 0.75
38
since i is an integer (12), average the values
in positions i and i + 1 (the 12th and 13th
positions) to calculate Q3. G13: =(0.44 + 0.45)/2,
which gives .445
39
Type in the names of the cereals that have
sugar content ≥ .345 (Q2) and < .445 (Q3).
Resave this file.
40
When the mean, median, and mode are not
aligned

The data is said to be skewed.
Data is skewed if it is not symmetric and if it
extends more to one side than the other.

41
Skewness
Not skewed: symmetric
Skewed left: a few very small values in the data set
Skewed right: a few very large values in the data set
42
Which measure of central tendency should you
regard as most representative of a data set?

If there are a few extreme values in your data
set, extreme values may distort the mean but not
the median or the mode.
Let's say you are a fund-raiser. Your last 10
donations were
$5, $5, $15, $5, $10, $5, $10, $15, $10 and
$1,000.
What do you want to tell the next person you
solicit for a donation?
1. That the average donation is over $100
(actually it's $108.00)
2. The median donation is $10.
3. The mode donation is $5.
43
Which measure of central tendency should you
consider?

The median and the mode are often used to
describe a typical value.
Let's say you are thinking about becoming a
teacher and you are interested in knowing what
type of starting salary you could expect after
graduation. Which value might be most meaningful
to you?
1. The mean starting salary
2. The median starting salary
3. The mode starting salary

44
Measures of Variability

It is often desirable to consider measures of
variability (dispersion) in addition to measures
of location.
For example, in choosing supplier A or supplier B
we might consider not only the average delivery
time for each, but also the variability in
delivery time for each.

45
Measures of Variability

Range
Inter-quartile Range
Variance
Standard Deviation
Coefficient of Variation

46
The Range

The range of a data set is the difference between
the largest and smallest data values.
It is the simplest measure of variability.
It is very sensitive to the smallest and largest
data values.

47
The range of apartment rents

The range is 615 - 425, or 190

48
Inter-quartile range

The interquartile range of a data set is the
difference between the third quartile and the
first quartile.
It is the range for the middle 50% of the data.
Examining the inter-quartile range of a dataset
allows you to get a feel for the middle-range.

49
Example Inter-quartile Range

3rd Quartile (Q3) = 525
1st Quartile (Q1) = 445
Inter-quartile Range = Q3 - Q1 = 525 - 445 = 80
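
Both measures are one-line subtractions, sketched below in Python. The function names are our own; the scores are the twelve midterm scores from the percentile slides, the quartiles 445 and 525 come from the apartment example, and the extra value 250 is a made-up outlier added to show how sensitive the range is.

```python
def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

def interquartile_range(q1, q3):
    """Spread of the middle 50% of the data."""
    return q3 - q1

scores = [70, 73, 79, 82, 83, 87, 88, 90, 91, 94, 98, 100]

print(data_range(scores))              # 100 - 70
print(data_range(scores + [250]))      # one hypothetical extreme value stretches the range
print(interquartile_range(445, 525))   # apartment rents: Q3 - Q1
```

A single extreme value changes the range from 30 to 180, while the inter-quartile range depends only on the quartiles and is far less affected.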

50
Variance

The variance is a measure of variability that
utilizes all the data.
It is based on the difference between the value
of each observation (x_i) and the mean (x̄ for a
sample, μ for a population).

51
Variance

The variance is the average of the squared
differences between each data value and the mean.
If the data set is a sample, the variance is
denoted by s².
If the data set is a population, the variance is
denoted by σ².
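
The formulas themselves do not survive in this transcript. The standard definitions, which match the Excel walkthrough later in the deck (sum of squared differences divided by n - 1 for a sample), are given here as a reconstruction rather than a copy of the original slide:

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
\qquad\qquad
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}
```

Note that only the population variance is literally an average of the squared differences; the sample variance divides by n - 1 instead of n.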

52
Standard Deviation

The standard deviation of a data set is the
positive square root of the variance.
It is measured in the same units as the data,
making it easier than the variance to compare
with the mean.
If the data set is a sample, the standard
deviation is denoted s.
If the data set is a population, the standard
deviation is denoted σ (sigma).

53
Coefficient of variation

The coefficient of variation indicates how large
the standard deviation is in relation to the
mean.
If the data set is a sample, the coefficient of
variation is computed as follows:
$\left(\frac{s}{\bar{x}} \times 100\right)\%$
If the data set is a population, the coefficient
of variation is computed as follows:
$\left(\frac{\sigma}{\mu} \times 100\right)\%$

54
Calculating the variance, standard deviation, and
coefficient of variation in Excel

We will walk through the formulas using the
Cereal dataset.

55
Open the file DataSetsForCh3 and click on the
worksheet Cereal dispersions (measures of
dispersion).
56
Enter the formula to calculate the mean (x̄).
B18: =AVERAGE(B2:B17)
57
Enter the formula to count the number of values
in the data set (n). B19: =COUNT(B2:B17)
58
Enter the formula to subtract the mean (x̄) from
the first x_i. C2: =B2 - $B$18 (the absolute
reference keeps pointing at the mean when the
formula is copied down)
59
Copy the formula in C2 down to C17 to subtract
the mean (x̄) from all the other x_i values.
60
Enter the formula to square the first x_i's
difference from the mean (x̄). D2: =C2 * C2
61
Copy the formula in D2 down to D17 to square each
x_i's difference from the mean (x̄).
62
Sum all the squared differences from
the mean (x̄). D18: =SUM(D2:D17)
63
Calculate the variance by dividing the
sum of squares by n - 1. D21: =D18 / (B19 - 1)
64
Calculate the standard deviation by taking the
square root of the variance. D22: =SQRT(D21)
65
Calculate the coefficient of variation by
dividing the standard deviation by the mean. D23:
=D22 / B18 Format the cell as a percentage.
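
The same column-by-column calculation can be sketched in Python, with comments naming the worksheet cells each line mirrors. The eight data values are made up for illustration (the cereal worksheet itself is not reproduced in this transcript), and the variable names are our own.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, not the cereal worksheet

n = len(values)                            # B19: count of values
mean = sum(values) / n                     # B18: mean
deviations = [x - mean for x in values]    # column C: x_i minus the mean
squared = [d * d for d in deviations]      # column D: squared differences
sum_of_squares = sum(squared)              # D18: sum of squares
variance = sum_of_squares / (n - 1)        # D21: sample variance
std_dev = math.sqrt(variance)              # D22: sample standard deviation
coeff_var = std_dev / mean                 # D23: coefficient of variation

print(f"variance = {variance:.4f}")
print(f"standard deviation = {std_dev:.4f}")
print(f"coefficient of variation = {coeff_var:.1%}")

# Cross-check against the standard library's sample formulas.
print(math.isclose(variance, statistics.variance(values)))
print(math.isclose(std_dev, statistics.stdev(values)))
```

`statistics.variance` and `statistics.stdev` use the same n - 1 divisor, so they play the role here that Excel's sample functions play in the worksheet.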
66
Excel's Built-in Formulas

Standard deviation of a sample
stdev( )
Variance of a sample
var( )
Excel does not provide a built-in formula for the
coefficient of variation, which is rarely used.

67
Excel's Descriptive Statistics

We can use Excel's data analysis tool to generate
a table of all the descriptive statistics.

1. Select all cells in the data set (B2:B17).
2. From the menu bar, select Tools, Data
Analysis

69
3. In the data analysis window, select
Descriptive Statistics and click ok
70
4. The input range should be B2:B17.
Summary statistics should be checked. New
worksheet ply should be selected. Click OK.
71
5. A new sheet is created with the descriptive
statistics. Resize columns as necessary.
Notice that it did not list all three modes,
only the first mode.
72
6. Right-click on the Sheet 2 tab and
select Rename
73
7. Type the name "Cereal Descriptives" and press
Enter. Resave the file.
74
Homework 4