Title: Methods for Describing Sets of Data
1Chapter 2
- Methods for Describing Sets of Data
2Review
- Descriptive vs. Inferential Statistics
- Vocabulary
- Population
- (Random, representative) sample
- Parameter
- Statistic
- Data types
- Data sources
3Learning Objectives
- 1. Describe Qualitative Data Graphically
- 2. Describe Numerical Data Graphically
- 3. Create & Interpret Graphical Displays
- 4. Explain Numerical Data Properties
- 5. Describe Summary Measures
- 6. Analyze Numerical Data Using Summary Measures
4Data Presentation
5Presenting Qualitative Data
6Data Presentation
7Student Specializations
- Specialization Freq. Percent Cum.
- -----------------------------------------------
- HCI 9 39.13 39.13
- IEMP 9 39.13 78.26
- LIS 3 13.04 91.30
- Undecided 2 8.70 100.00
- -----------------------------------------------
- Total 23 100.00
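A minimal sketch of how the Freq., Percent, and Cum. columns of such a table can be computed. The counts are copied from the table above; the function name `frequency_table` is hypothetical, and sorting categories alphabetically is an assumption that happens to match the table's order.

```python
from collections import Counter

# Specialization counts copied from the table above.
counts = Counter({"HCI": 9, "IEMP": 9, "LIS": 3, "Undecided": 2})

def frequency_table(counts):
    """Return rows of (category, freq, percent, cumulative percent)."""
    total = sum(counts.values())
    rows, running = [], 0
    for category in sorted(counts):
        running += counts[category]
        rows.append((category,
                     counts[category],
                     round(100 * counts[category] / total, 2),
                     # Cumulate the counts, not the rounded percents,
                     # so rounding errors do not pile up.
                     round(100 * running / total, 2)))
    return rows

table = frequency_table(counts)
for category, freq, pct, cum in table:
    print(f"{category:<10} {freq:>3} {pct:>7.2f} {cum:>7.2f}")
```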
8Student Specializations
9Undergrad Majors
- UG major Freq. Percent Cum.
- -----------------------------------------------
- American Studies 1 4.76 4.76
- Cog Sci 1 4.76 9.52
- Comp Sci 3 14.29 23.81
- Economics 3 14.29 38.10
- English 5 23.81 61.90
- Environmental Engineering 1 4.76 66.67
- Graphic Design 1 4.76 71.43
- Math 2 9.52 80.95
- Mechanical Engineering 1 4.76 85.71
- Nutrition 1 4.76 90.48
- Sci and Tech Policy 1 4.76 95.24
- Telecommunications 1 4.76 100.00
- -----------------------------------------------
- Total 21 100.00
10Favorite Colors
- color Freq. Percent Cum.
- -----------------------------------------------
- black 2 8.70 8.70
- blue 12 52.17 60.87
- green 1 4.35 65.22
- orange 1 4.35 69.57
- purple 1 4.35 73.91
- red 5 21.74 95.65
- white 1 4.35 100.00
- -----------------------------------------------
- Total 23 100.00
11Calculus Knowledge
- integrals Freq. Percent Cum.
- -----------------------------------------------
- 1 3 13.04 13.04
- 2 1 4.35 17.39
- 3 11 47.83 65.22
- 4 6 26.09 91.30
- 5 2 8.70 100.00
- -----------------------------------------------
- Total 23 100.00
12Exercises
- 2.1
- 2.2
- 2.9 Which chart type is best for CEO degree categories?
13Presenting Numerical Data
14Data Presentation
15Student Age (Reported) Data
- Stem-and-leaf plot for age
- 2 22233444555777899
- 3 01257
- 4
- 5
- 6
- 7 6
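A minimal sketch of how a stem-and-leaf plot like this one can be built. The ages are read back off the plot (stem = tens digit, leaf = ones digit); the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# Ages read back off the stem-and-leaf plot above.
ages = [22, 22, 22, 23, 23, 24, 24, 24, 25, 25, 25, 27, 27, 27, 28, 29, 29,
        30, 31, 32, 35, 37, 76]

def stem_and_leaf(values):
    """Map each tens digit (stem) to a string of sorted ones digits (leaves)."""
    leaves = defaultdict(str)
    for v in sorted(values):
        leaves[v // 10] += str(v % 10)
    # Keep empty stems so gaps in the data stay visible.
    return {stem: leaves.get(stem, "")
            for stem in range(min(leaves), max(leaves) + 1)}

plot = stem_and_leaf(ages)
for stem, leaf in plot.items():
    print(stem, leaf)
```

Keeping the empty stems 4, 5, and 6 is what makes the age 76 stand out as far from the rest.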
16Histogram
17Starting Salaries (in K)
- 3 8
- 4 000025
- 5 0000
- 6 0000005
- 7 5
- 8 0
18Summation Notation
- Exercise 2.33
- Observations 5, 1, 3, 2, 1
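The exact quantities Exercise 2.33 asks for are not shown on the slide. As an assumption, the sketch below evaluates three sums that summation-notation exercises commonly use, on the observations given.

```python
x = [5, 1, 3, 2, 1]  # observations from the slide

sum_x = sum(x)                      # sum of x_i
sum_x_sq = sum(v ** 2 for v in x)   # sum of x_i squared
sq_sum_x = sum(x) ** 2              # square of the sum of x_i

print(sum_x, sum_x_sq, sq_sum_x)
```

The last two differ: squaring each value and then summing is not the same as summing and then squaring.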
19Numerical Data Properties
20Thinking Challenge
- Salary levels shown: 400,000; 70,000; 50,000; 30,000; 20,000
- ... employees cite low pay -- most workers earn only 20,000.
- ... President claims average pay is 70,000!
21Standard Notation
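- Measure | Sample | Population
- -----------------------------------------------
- Mean | $\bar{x}$ | $\mu$
- Stand. Dev. | $s$ | $\sigma$
- Variance | $s^2$ | $\sigma^2$
- Size | $n$ | $N$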
22Numerical Data Properties
Central Tendency (Location)
Variation (Dispersion)
Shape
23Numerical Data Properties & Measures
- Central Tendency: Mean, Median, Mode
- Variation: Range, Interquartile Range, Variance, Standard Deviation
- Shape: Skew
24Central Tendency
26What's wrong with this?
- Measurements: 1, 4, 2, 9, 8
- Middle measurement is 2, so that's the median
- Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
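The flaw in the median claim is that the measurements were not put in order first. A minimal sketch of the median, assuming the usual convention of averaging the two middle values when the count is even (the function name `median` is chosen for illustration):

```python
def median(values):
    """Middle value of the ordered data; average the two middle values if n is even."""
    ordered = sorted(values)          # ordering is the step the slide skipped
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

measurements = [1, 4, 2, 9, 8]
print(sorted(measurements), median(measurements))
```

With an even number of measurements there is no single middle position, which is why the two central values are averaged.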
27Exercise 2.39
- Why is there a special rule for the median with an even vs. odd number of measurements?
28Exercise 2.37
- 18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11
29What if?
- Replace one of the 18s with 1,118?
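A short sketch of the "what if" using Python's standard `statistics` module, with the Exercise 2.37 data. Which of the two 18s is replaced does not matter; the first one is chosen here, and the helper name `summarize` is hypothetical.

```python
import statistics

data = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
# Replace the first 18 (index 0) with 1,118.
with_outlier = [1118 if i == 0 else v for i, v in enumerate(data)]

def summarize(values):
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }

before = summarize(data)
after = summarize(with_outlier)
print(before)
print(after)
```

The mean is pulled far toward the extreme value, while the median and mode do not move.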
30Exercise 2.41a
31Exercise 2.53
32Ages
- Mean = 29
- Median = 27
- 2 22233444555777899
- 3 01257
- 4
- 5
- 6
- 7 6
33Summary of Central Tendency Measures
- Measure | Equation | Description
- -----------------------------------------------
- Mean | $\sum X_i / n$ | Balance Point
- Median | Position $(n + 1)/2$ | Middle Value When Ordered
- Mode | none | Most Frequent
34Shape
36Shape
- 1. Describes How Data Are Distributed
- 2. Measures of Shape
- Skew = Symmetry
- Left-Skewed: Mean < Median < Mode
- Symmetric: Mean = Median = Mode
- Right-Skewed: Mode < Median < Mean
37Exercise 2.47
- Asked to submit 3 letters
- Observed: mean = 2.28, median = 3, mode = 3
- Interpret
38Exercise 2.50 a-d
39Variation
41Quartiles
- 1. Measure of Noncentral Tendency
- 2. Split Ordered Data into 4 Quarters
- 3. Position of i-th Quartile: positioning point of $Q_i = \frac{i(n + 1)}{4}$
- Ordered data: 25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
42Ages
- Range
- Quartiles
- 2 22233444555777899
- 3 01257
- 4
- 5
- 6
- 7 6
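A minimal sketch of the range and quartiles for these ages, using the positioning point i(n + 1)/4. For these 23 ages the positions are whole numbers (6, 12, 18). Interpolating between neighbours when a position is not whole is an assumption of this sketch, not something the slides specify; the function name `quartile` is hypothetical.

```python
# Ages read back off the stem-and-leaf plot above.
ages = [22, 22, 22, 23, 23, 24, 24, 24, 25, 25, 25, 27, 27, 27, 28, 29, 29,
        30, 31, 32, 35, 37, 76]

def quartile(values, i):
    """i-th quartile via the 1-based positioning point i(n + 1)/4 (needs n >= 3)."""
    ordered = sorted(values)
    pos = i * (len(ordered) + 1) / 4
    lower = int(pos)
    frac = pos - lower
    if frac == 0:
        return ordered[lower - 1]
    # Assumption: linear interpolation when the position is not a whole number.
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

data_range = max(ages) - min(ages)
q1, q2, q3 = (quartile(ages, i) for i in (1, 2, 3))
iqr = q3 - q1
print(data_range, q1, q2, q3, iqr)
```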
43Box Plots
44Age and Salary
- Age
- Quartiles: 24, 27, 30
- Inner fences: (15, 39)
- Outer fences: (6, 48)
- Salary
- Quartiles: 41K, 50K, 60K
- Inner fences: ??
- Outer fences: ??
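The age fences on this slide are consistent with inner fences 1.5 interquartile ranges beyond Q1 and Q3, and outer fences 3 interquartile ranges beyond them. A sketch under that assumption, checked against the age quartiles only (the function name `fences` is hypothetical):

```python
def fences(q1, q3):
    """Return (inner, outer) fences as (low, high) pairs."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return inner, outer

inner, outer = fences(24, 30)   # age quartiles Q1 = 24, Q3 = 30
print(inner, outer)

# The oldest reported age, 76, falls beyond the upper outer fence.
print(76 > outer[1])
```

The same function can be applied to the salary quartiles to fill in the remaining fences.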
45Variance & Standard Deviation
- 1. Measures of Dispersion
- 2. Most Common Measures
- 3. Consider How Data Are Distributed
- 4. Show Variation About Mean ($\bar{X}$ or $\mu$)
- [Figure: data on an axis from 4 to 12, with $\bar{X} = 8.3$ marked]
46Sample Variance Formula
- $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}$
- n - 1 in denominator! (Use N if Population Variance)
47Equivalent Formula
48Another Equivalent Formula
49Deriving the shortcut (p.57)
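The equivalent formulas themselves did not survive in the slide text. As an assumption, the sketch below uses the standard shortcut form, in which the sum of squared deviations equals the sum of squares minus (sum squared)/n, and checks numerically that it agrees with the definitional formula. The data are the observations from Exercise 2.33; both function names are hypothetical.

```python
def variance_definition(x):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(x)
    mean = sum(x) / n
    return sum((v - mean) ** 2 for v in x) / (n - 1)

def variance_shortcut(x):
    """Assumed shortcut: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(x)
    return (sum(v * v for v in x) - sum(x) ** 2 / n) / (n - 1)

x = [5, 1, 3, 2, 1]
print(variance_definition(x), variance_shortcut(x))
```

The shortcut avoids computing the mean first, which is convenient by hand but can lose precision in floating point when the values are large relative to their spread.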
50Exercise 2.54
51Exercise 2.55a
52Exercise 2.59 Same mean, different variances
53Exercise 2.60 Same range, different means
54Exercise 2.61 (simplified) adding a constant
- 2, 1, 1, 0, 6
- Mean = 10/5 = 2
- Sum of squared deviations = 0 + 1 + 1 + 4 + 16 = 22, so variance = 22/4 = 5.5
- Add 3 to each measurement
- Mean = 25/5 = 5
- Variance = ??
- Why doesn't adding a constant affect variance?
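A quick numerical check with Python's `statistics` module, using the sample variance (n - 1 in the denominator). Adding a constant c shifts the mean by c as well, so every deviation (x + c) - (mean + c) is unchanged, and so is the variance.

```python
import statistics

data = [2, 1, 1, 0, 6]
shifted = [v + 3 for v in data]   # add 3 to each measurement

mean_before = statistics.mean(data)
mean_after = statistics.mean(shifted)
var_before = statistics.variance(data)     # sample variance, divides by n - 1
var_after = statistics.variance(shifted)

print(mean_before, var_before)
print(mean_after, var_after)
```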
55Exercise 2.65
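56Chebyshev's Rule: Preliminaries
- Lemma: For any positive variable Y and any constant a > 0, the proportion of values with $Y \ge a$ is at most $\bar{Y}/a$
- Proof of Lemma
- For values of $Y \ge a$, define $Z = a$
- For values of $Y < a$, define $Z = 0$
- Clearly the mean of Y is at least the mean of Z
- But the mean of Z is just $a \times$ (proportion of values with $Y \ge a$)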
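57Chebyshev's Rule
- For any $k > 1$, at least $1 - 1/k^2$ of the measurements lie within $k$ standard deviations of the mean
- (From lemma, with $Y = (x - \bar{x})^2$ and $a = k^2 s^2$)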
58Empirical Rule
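- If x has a symmetric, mound-shaped distribution, about 68% of measurements lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3
- Justification: known properties of the normal distribution, to be studied later in the course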
59Example
- Data set has nine 0 values and one 100
- Mean = 10, Range = 100
- $s^2 = (9 \cdot 100 + 1 \cdot 8100)/9 = 1000$, $s = 31.62$
- 10% are at a distance > 2s (the value 100 is 90 from the mean, about 2.85s)
- Chebyshev's rule applies: 10% < 1/4 = 25%
- Empirical rule violated: 10% > 5%
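A sketch for checking this example numerically: it computes the sample standard deviation and the proportion of values lying more than k standard deviations from the mean, next to Chebyshev's upper bound of 1/k^2. The function names are hypothetical.

```python
import statistics

data = [0] * 9 + [100]            # nine 0 values and one 100
mean = statistics.mean(data)
s = statistics.stdev(data)        # sample standard deviation (n - 1)

def fraction_beyond(k):
    """Proportion of values more than k standard deviations from the mean."""
    return sum(abs(v - mean) > k * s for v in data) / len(data)

def chebyshev_bound(k):
    """Chebyshev: at most 1/k^2 of the values lie k or more s from the mean."""
    return 1 / k ** 2

for k in (2, 3):
    print(k, round(k * s, 2), fraction_beyond(k), round(chebyshev_bound(k), 3))
```

The single value 100 sits 90 units from the mean, so whether it counts as "beyond k standard deviations" depends on comparing 90 with k times s.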
60Preview of Statistical Inference
- You observe one data point
- Make a hypothesis about the mean and standard deviation of the distribution from which it was drawn
- Chebyshev's Rule or the Empirical Rule tells you how (un)likely the data point is
- If very unlikely, you are suspicious of the hypothesis about the mean and standard deviation, and reject it
61Exercise 2.67
- N = 200, mean = 1500, s = 300
- How many measurements in (900, 2100)?
- How many measurements in (600, 2400)?
- How many measurements in (1200, 1800)?
- How many measurements in (1500, 2100)?
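One way to approach these questions is to express each interval as mean plus or minus k standard deviations and apply both rules. The sketch below assumes the usual empirical-rule percentages (about 68%, 95%, 99.7% for k = 1, 2, 3); the function names are hypothetical.

```python
import math

N, mean, s = 200, 1500, 300

# Assumed empirical-rule proportions within k standard deviations.
EMPIRICAL = {1: 0.68, 2: 0.95, 3: 0.997}

def chebyshev_min_count(k, n=N):
    """Chebyshev: at least (1 - 1/k^2) of n measurements within k s (nothing for k <= 1)."""
    return math.ceil(max(0.0, 1 - 1 / k ** 2) * n)

def empirical_count(k, n=N):
    """Approximate count within k s, valid only for mound-shaped, symmetric data."""
    return round(EMPIRICAL[k] * n)

for low, high in [(900, 2100), (600, 2400), (1200, 1800)]:
    k = int((high - mean) / s)
    print((low, high), k, chebyshev_min_count(k), empirical_count(k))

# (1500, 2100) is the upper half of the 2s interval: Chebyshev gives no
# one-sided guarantee, while symmetry under the empirical rule suggests
# about half of the 2s count.
print(empirical_count(2) / 2)
```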
62Summary of Variation Measures
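- Measure | Equation | Description
- -----------------------------------------------
- Range | $X_{largest} - X_{smallest}$ | Total Spread
- Interquartile Range | $Q_3 - Q_1$ | Spread of Middle 50%
- Standard Deviation (Sample) | $\sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$ | Dispersion about Sample Mean
- Standard Deviation (Population) | $\sqrt{\frac{\sum (X_i - \mu)^2}{N}}$ | Dispersion about Population Mean
- Variance (Sample) | $\frac{\sum (X_i - \bar{X})^2}{n - 1}$ | Squared Dispersion about Sample Mean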
63Z-scores
- Number of standard deviations from the mean
- Chebyshev and empirical rules apply
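A minimal sketch of z-scores, computed as (value - mean) / standard deviation with the sample standard deviation. The data are the reported student ages read off the earlier stem-and-leaf plot; the function name `z_scores` is hypothetical.

```python
import statistics

# Ages read back off the stem-and-leaf plot earlier in the chapter.
ages = [22, 22, 22, 23, 23, 24, 24, 24, 25, 25, 25, 27, 27, 27, 28, 29, 29,
        30, 31, 32, 35, 37, 76]

def z_scores(values):
    """Number of sample standard deviations each value lies from the mean."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - mean) / s for v in values]

z = z_scores(ages)
print(round(z[-1], 2))   # z-score of the largest age, 76
```

A z-score beyond 3 in absolute value is rare under either rule: Chebyshev allows at most 1/9 of the data that far out, and the empirical rule about 0.3%.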
64Exercise 2.93c
65Conclusion
- 1. Described Qualitative Data Graphically
- 2. Described Numerical Data Graphically
- 3. Created & Interpreted Graphical Displays
- 4. Explained Numerical Data Properties
- 5. Described Summary Measures
- 6. Analyzed Numerical Data Using Summary Measures
66End of Chapter