Title: Those who don
1Those who don't know statistics are condemned to
reinvent it. (David Freedman)
2All you ever wanted to know about the histogram
and more ...
3Distribution of number of graphics on web pages
(N = 1873)
Mean 17.93
Median 16.00
Std. Dev 17.92
N 1873
Graphic Count
4Horizontal Scale
5Distribution of redundant links on web pages
(N = 1861)
Mean 22.1
Median 14
Std. Dev 37.33
N 1861.00
6Plotting a histogram: endpoint convention, plot
frequencies, make equal intervals, etc.
7Frequency Table
Convention: include the left endpoint in the
class interval
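The left-endpoint convention can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table`, its parameters, and the font counts are hypothetical and are not the data behind the slides. Each value is assigned to the equal-width interval [left, right), so a value sitting exactly on a boundary is counted in the interval that starts there.

```python
from collections import Counter


def frequency_table(values, start, width, n_bins):
    """Count values in equal-width class intervals [left, right)."""
    counts = Counter()
    for v in values:
        # Floor division puts a boundary value in the interval it starts.
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    total = sum(counts.values())
    table = []
    for i in range(n_bins):
        left = start + i * width
        right = left + width
        freq = counts[i]
        proportion = freq / total if total else 0.0
        table.append((left, right, freq, proportion))
    return table


# Hypothetical font counts, chosen only to exercise the boundaries.
fonts = [1, 2, 2, 3, 4, 4, 4, 6, 8, 10]
table = frequency_table(fonts, start=0, width=2, n_bins=6)
for left, right, freq, proportion in table:
    print(f"[{left}, {right}): frequency {freq}, proportion {proportion:.0%}")
```

The last column is the frequency divided by the total count, which is the frequency/probability scale used on the vertical axis of a histogram.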
8Frequency/Probability
9Number of fonts used on a web page
Frequency/probability
10Cleaning up a histogram: getting rid of outliers
11Distribution of word count (N = 1903)
Mean 393.2
Median 223
Std. Dev 725.24
Minimum 0
Maximum 20,357
12Distribution of word count (N = 1897), top six
removed
Mean 368.0
Median 223
Std. Dev 474.04
Minimum 0
Maximum 4132
13Distribution of word count (N = 1873)
Mean 333.4
Median 220
Std. Dev 360.30
Minimum 0
Maximum 4132
WORDCNT2
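The effect of removing the largest values can be reproduced on a small scale. The sketch below uses a hypothetical list of word counts (not the web-page sample) and two helper names, `summarize` and `drop_top`, introduced here only for illustration. It shows the pattern in the three word-count slides: the mean, SD, and maximum drop sharply once the extreme value is removed, while the median moves only a little.

```python
import statistics


def summarize(values):
    """Summary statistics of the kind listed on the slides."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # uses the N-1 divisor
        "min": min(values),
        "max": max(values),
    }


def drop_top(values, k):
    """Return the values with the k largest removed."""
    return sorted(values)[: len(values) - k]


# Hypothetical word counts with one extreme page.
word_counts = [0, 120, 180, 220, 260, 310, 400, 20000]
before = summarize(word_counts)
after = summarize(drop_top(word_counts, 1))

for label, stats in [("all pages", before), ("top one removed", after)]:
    print(label, {key: round(value, 1) for key, value in stats.items()})
```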
14What can histograms tell you?
15Distribution of link count on good and bad web pages
Good Sites
Bad Sites
16Making inferences from histograms: incidence of
riots and temperature
[Figure: histogram of riot incidence; horizontal axis: temperature, ticks from 30 to 110]
17Mean and Median
- Mean is the arithmetic average; median is the 50% point
- Mean is the point where the graph balances
- Mean shifts around; median does not shift much and is more stable
- Computing the median: for odd-numbered N, find the middle number
- For even-numbered N, interpolate between the middle 2, e.g. if they are 7
and 9, then 8 is the median
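The two rules for the median can be written out directly. This is a minimal sketch: the function names `mean` and `median` and the example lists are illustrative, chosen so that the even-N case reproduces the 7 and 9 example.

```python
def mean(values):
    """Arithmetic average."""
    return sum(values) / len(values)


def median(values):
    """Middle number for odd N; halfway between the middle two for even N."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([3, 7, 9]))      # odd N: 7
print(median([3, 7, 9, 12]))  # even N: 8.0, halfway between 7 and 9
print(mean([3, 7, 9, 12]))    # 7.75
```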
18The instability of means and standard deviations
19Add two numbers: watch the mean, median, SD
20Add one outlier...
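A small experiment makes the instability visible. The sketch below starts from the list 1, 2, 2, 3 and appends a single hypothetical outlier of 100; the helper name `describe` is an assumption made for this example. The mean and SD jump while the median stays at 2.

```python
import statistics


def describe(values):
    """Return (mean, median, SD) with the N-1 divisor for the SD."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.stdev(values),
    )


base = [1, 2, 2, 3]
with_outlier = base + [100]  # hypothetical outlier

for label, data in [("base", base), ("with outlier", with_outlier)]:
    m, med, sd = describe(data)
    print(f"{label}: mean {m:.2f}, median {med:.2f}, SD {sd:.2f}")
```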
21Standard Deviation: a measure of spread
22Same mean, different spread
[Figure: two distributions with the same mean, each with its SD marked]
23The Standard Deviation
24- The SD says how far away numbers on a list are from their average.
- Most entries on the list will be somewhere around one SD away from the average.
- Very few will be more than two or three SDs away.
25Understanding the standard deviation
- Let's start with a list: 1, 2, 2, 3
- The histogram is symmetric about 2; 2 is the mean, with
50% to the left of 2 and 50% to the right
26- List 1, 2, 2, 3: Average 2, SD 0.8
- List 1, 2, 2, 5: Average 2.5, SD 1.73
- List 1, 2, 2, 7: Average 3, SD 2.71
27Computing the standard deviation
- List: 20, 10, 15, 15
- Average: 15
- Find deviations from average
- 5, -5, 0, 0
- Square the deviations and add them up
- $(5)^2 + (-5)^2 + (0)^2 + (0)^2 = 50$
- Divide it by N-1: $50/3 \approx 16.67$
- Take the square root: $\sqrt{16.67} \approx 4.08$
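The steps above can be sketched as a short function. The name `sample_sd` is an assumption made for this example; the N-1 divisor follows the slide.

```python
import math


def sample_sd(values):
    """Standard deviation with the N-1 divisor, one step per line."""
    n = len(values)
    average = sum(values) / n
    deviations = [v - average for v in values]   # 5, -5, 0, 0 for the slide's list
    sum_of_squares = sum(d * d for d in deviations)
    variance = sum_of_squares / (n - 1)
    return math.sqrt(variance)


print(round(sample_sd([20, 10, 15, 15]), 2))  # 4.08
```

The same function gives roughly 0.82, 1.73, and 2.71 for the lists 1, 2, 2, 3; 1, 2, 2, 5; and 1, 2, 2, 7, which shows how a single distant value inflates the SD.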
28Properties of the standard deviation
- The standard deviation is in the same units as
the mean
- The standard deviation is inversely related to
sample size (therefore as a measure of spread it
is biased)
- In normally distributed data, 68% of the sample
lies within 1 SD of the mean
29Properties of the Normal Probability Curve
- The graph is symmetric about the mean (the part
to the right is a mirror image of the part to the
left)
- The total area under the curve equals 100%
- The curve is always above the horizontal axis
- It appears to stop after a certain point (the curve
gets really low)
30- The graph is symmetric about the mean
- The total area under the curve equals 100%
- Within 1 SD of the mean: 68%
- Within 2 SD of the mean: 95%
- Within 3 SD of the mean: 99.7%
- You can disregard the rest of the curve
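The 68, 95, and 99.7 figures can be checked numerically. For a normal curve, the area within k SDs of the mean equals erf(k / sqrt(2)); the function name `area_within` is introduced here only for illustration.

```python
import math


def area_within(k):
    """Fraction of a normal curve's area within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {area_within(k):.1%}")
```

Less than 0.3% of the area lies beyond 3 SDs, which is why the rest of the curve can be disregarded.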
31Distribution of judges' ratings for the Webby
Awards
Mean 6.3
Median 6.3
Std. Dev 1.98
N 1867.00
Skewness -.43
Kurtosis -.201
32It is a remarkable fact that many histograms in
real life tend to follow the normal curve. For
such histograms, the mean and SD are good summary
statistics. The average pins down the center,
while the SD gives the spread. For histograms
that do not follow the normal curve, the mean
and SD are not good summary statistics. What happens
when the histogram is not normal ...
33Distribution of word count on web pages
Std. Dev 384.83
Mean 348.3
- 3 SD = 384 × 3 = 1152; Mean - 1152 is below zero, so a normal curve would
imply that about 30% of the sample had a negative number of links
34When SD is influenced by outliers
Use the interquartile range: 75th percentile - 25th percentile
Note: A percentile is a score below which a
certain % of the sample falls
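The interquartile range can be sketched with Python's standard library. The two lists are hypothetical, and `iqr` is a name chosen for this example. `statistics.quantiles` with `n=4` returns the 25th, 50th, and 75th percentiles; the `"inclusive"` method is one of several percentile conventions, so other software may give slightly different cut points.

```python
import statistics


def iqr(values):
    """75th percentile minus 25th percentile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
spiked = [1, 2, 3, 4, 5, 6, 7, 8, 900]  # same list with one hypothetical outlier

for label, data in [("clean", clean), ("spiked", spiked)]:
    print(f"{label}: SD {statistics.stdev(data):.2f}, IQR {iqr(data):.2f}")
```

The outlier leaves the interquartile range unchanged while the SD grows by roughly two orders of magnitude.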
35Measures of Normality
- Visual examination
- Skewness: a measure of symmetry
Symmetric
Positively Skewed
Negatively Skewed
36Kurtosis: Does it cluster in the middle?
Kurtosis is based on a distribution's tail.
- Distributions with a large tail: leptokurtic
- Distributions with a small tail: platykurtic
- Distributions with a normal tail: mesokurtic
Large tail
Small tail
Normal Tail
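Skewness and kurtosis can be computed from the moments of a list. The sketch below uses the simple moment-based definitions, with 3 subtracted from the kurtosis so that a normal curve scores 0. This is an assumption: statistical packages often apply small-sample adjustments, so their output may differ slightly from these formulas. The function name `shape` is hypothetical.

```python
def shape(values):
    """Return (skewness, excess kurtosis) from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal curve
    return skewness, excess_kurtosis


print(shape([1, 2, 2, 3]))  # symmetric list: skewness 0
print(shape([1, 2, 2, 7]))  # long right tail: positive skewness
```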
37Positively Skewed and Leptokurtic: Word Count
Mean 393.2
Median 223
Std. Dev 725.24
Skewness 13.62
Kurtosis 321.84
N 1903.00
38Distribution of word count (N = 1897), top six
removed
Kurtosis 16.40
Skewness 3.49
Mean 368.0
Median 223
Std. Dev 474.04
N 1897.00
39Degrees of Freedom
- The number of independent pieces of information
remaining after estimating one or more parameters
- Example: List 1, 2, 3, 4; Average 2.5
- For the average to remain the same, three of the
numbers can be anything you want; the fourth is fixed
- New List: 1, 5, 2.5, __; Average 2.5
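The idea that the last number is fixed can be checked with one line of arithmetic: the total must equal the average times N, so the remaining value is that total minus the sum of the freely chosen numbers. The function name `fixed_last_value` is hypothetical.

```python
def fixed_last_value(free_values, target_average):
    """Value the last number must take to keep the average unchanged."""
    n = len(free_values) + 1
    return target_average * n - sum(free_values)


print(fixed_last_value([1, 5, 2.5], 2.5))  # 1.5
print(fixed_last_value([1, 2, 3], 2.5))    # 4.0, recovering the original list
```

With four numbers and one estimated average, only three values are free to vary, which is why the SD divides by N-1.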