Title: Methods for Describing Sets of Data
1Chapter 2
- Methods for Describing Sets of Data
2Review
- Descriptive vs. Inferential Statistics
- Vocabulary
- Population
- (Random, representative) sample
- Parameter
- Statistic
- Data types
- Data sources
3Learning Objectives
- Perform basic data manipulation with Stata
- Create and Interpret Graphical Displays
- Analyze Numerical Data Using Summary Measures
4Data Manipulation with Stata
- Getting started
- Data Preparation
- Data Analysis
5Getting Started with Stata
- Install Stata via NAL
- Start menu > Run
- nal
- Double-click on Stata
- Wait a while
- Run Stata
6Running Commands from the command line
- log using my.log, replace text
- Tells Stata to start logging the rest of what happens to a log file
- set memory 100m
- Tells Stata to allocate memory for data
- log close
- Open the file my.log and see what it says
7Running Commands from a do-file
- Window menu > Do-file Editor
- File menu > Open
- T\Fall-2005\544\Public\Stata\SlashDotPrep.do
- Highlight one or more lines
- Tools menu > Do Selection
8About This Dataset
- From readership logs of the website slashdot.org
- For each page view
- User id (recoded to protect privacy)
- Date/time
- URL
- Viewing threshold
- Display mode
9Data Preparation
- Read it in
- Define labels
- Recode variables
- Keep only what you want
- Save it in stata format
10Data Presentation
11Presenting Qualitative Data
12Data Presentation
13Student Specializations
                    sispec |  Freq.   Percent      Cum.
  -------------------------+---------------------------
                       HCI |      7     46.67     46.67
                      IEMP |      3     20.00     66.67
                  tailored |      5     33.33    100.00
  -------------------------+---------------------------
                     Total |     15    100.00
14Student Specializations
15Knowledge
                   summnot |  Freq.   Percent      Cum.
  -------------------------+---------------------------
         Never taught this |      3     20.00     20.00
   Never really learned it |      1      6.67     26.67
      It's been many years |      4     26.67     53.33
               I know this |      3     20.00     73.33
  Can teach this to others |      4     26.67    100.00
  -------------------------+---------------------------
                     Total |     15    100.00

                     deriv |  Freq.   Percent      Cum.
  -------------------------+---------------------------
         Never taught this |      1      6.67      6.67
   Never really learned it |      2     13.33     20.00
      It's been many years |     10     66.67     86.67
  Can teach this to others |      2     13.33    100.00
  -------------------------+---------------------------
                     Total |     15    100.00
16Knowledge Cont.
                   meanmed |  Freq.   Percent      Cum.
  -------------------------+---------------------------
   Never really learned it |      1      6.67      6.67
      It's been many years |      3     20.00     26.67
               I know this |      4     26.67     53.33
  Can teach this to others |      7     46.67    100.00
  -------------------------+---------------------------
                     Total |     15    100.00

                     stdev |  Freq.   Percent      Cum.
  -------------------------+---------------------------
         Never taught this |      3     20.00     20.00
   Never really learned it |      2     13.33     33.33
      It's been many years |      3     20.00     53.33
               I know this |      4     26.67     80.00
  Can teach this to others |      3     20.00    100.00
  -------------------------+---------------------------
17Knowledge Cont.
                    cenlim |  Freq.   Percent      Cum.
  -------------------------+---------------------------
         Never taught this |      6     40.00     40.00
   Never really learned it |      2     13.33     53.33
      It's been many years |      6     40.00     93.33
               I know this |      1      6.67    100.00
  -------------------------+---------------------------
                     Total |     15    100.00

  -> tabulation of reg

                       reg |  Freq.   Percent      Cum.
  -------------------------+---------------------------
         Never taught this |      7     46.67     46.67
   Never really learned it |      5     33.33     80.00
      It's been many years |      1      6.67     86.67
               I know this |      2     13.33    100.00
  -------------------------+---------------------------
                     Total |     15    100.00
18Exercises
- 2.4
- 2.5
- 2.15: Which chart type is best for CEO degree categories?
19Stata data analysis
- File menu > Open
- T\Fall-2005\544\Public\Stata\SlashDotAnalysis.do
- Counts
- Summary tables
- Bar and pie charts
21Sort, Generate
- Sort data
- sort hour minute second
- Generate new variable
- generate totalduration = 60*60*(hour[_N] - hour[1]) + 60*(minute[_N] - minute[1]) + (second[_N] - second[1])
- _N means the last row
- 1 means the first row
22Generate Within Groups
- Group rows by userid
- Generate within each group
- sort uid hour minute second
- by uid: generate duration = 60*60*(hour[_N] - hour[1]) + 60*(minute[_N] - minute[1]) + (second[_N] - second[1])
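The within-group calculation above can be sketched outside Stata as well. The following Python sketch is only an illustration under stated assumptions: the `views` rows are made-up values, not Slashdot data, and it produces one duration per user, whereas Stata's by-group generate writes the value onto every row of the group.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical page-view rows: (uid, hour, minute, second). Not real data.
views = [
    (1, 9, 0, 0),
    (2, 9, 5, 30),
    (1, 9, 10, 15),
    (2, 10, 0, 0),
    (1, 10, 30, 0),
]

def to_seconds(hour, minute, second):
    """Convert a clock time to seconds since midnight."""
    return 60 * 60 * hour + 60 * minute + second

# Counterpart of sorting by uid, then hour, minute, second.
views.sort()

# Counterpart of the by-group step: last row minus first row within each uid.
duration = {}
for uid, rows in groupby(views, key=itemgetter(0)):
    rows = list(rows)
    first, last = rows[0], rows[-1]
    duration[uid] = to_seconds(*last[1:]) - to_seconds(*first[1:])

print(duration)
```

Sorting first matters for the same reason it does in Stata: the first and last rows of a group are only the earliest and latest page views if the rows are in time order.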
23Egen
- Egen: Many useful options for calculations (within groups)
- Sum, count
- count-so-far with rank option
- See documentation via help egen
24Collapse
- Creates one row per group
- Options specify how to combine multiple rows for a group
- Min
- Max
- Count
- Mean
- Etc.
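To make the idea of collapsing concrete, here is a minimal Python sketch of reducing many rows to one summary row per group. The `rows` values and the choice of four statistics are illustrative assumptions; this mimics the idea, not Stata's own implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (uid, duration) rows; several rows per user.
rows = [(1, 30), (1, 90), (2, 10), (2, 20), (2, 60)]

# Gather the values that belong to each group.
groups = defaultdict(list)
for uid, value in rows:
    groups[uid].append(value)

# One output row per group, combining its rows with min, max, count, and mean.
collapsed = {
    uid: {"min": min(v), "max": max(v), "count": len(v), "mean": mean(v)}
    for uid, v in groups.items()
}

print(collapsed)
```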
25Presenting Numerical Data
26Data Presentation
27Stem in Stata
28Histogram in Stata
29Student Age (Reported) Data
- . stem age
- Stem-and-leaf plot for age
  2 | 2345567
  3 | 0122356
  4 |
  5 |
  6 |
  7 |
  8 | 4
30Histogram
31Starting Salaries (in K)
- Fall 04 Class
   3 | 8
   4 | 000025
   5 | 0000
   6 | 0000005
   7 | 5
   8 | 0
- Fall 05 Class
   4 | 5
   5 | 0000
   6 | 2355
   7 | 5
   8 | 5
   9 |
  10 | 05
32Summation Notation
- Exercise 2.43
- Observations 3, 8, 4, 5, 3, 4, 6
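As an illustration of summation notation on these seven observations, the Python sketch below evaluates three common sums. Which sums Exercise 2.43 actually asks for is an assumption here; the point is the difference between the sum of squares and the square of the sum.

```python
# Observations from the slide.
x = [3, 8, 4, 5, 3, 4, 6]

sum_x = sum(x)                          # sum of x_i
sum_x_squared = sum(v ** 2 for v in x)  # sum of x_i^2: square first, then add
square_of_sum = sum(x) ** 2             # (sum of x_i)^2: add first, then square

print(sum_x, sum_x_squared, square_of_sum)
```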
37Summation with Indexing
39Numerical Data Properties
40Thinking Challenge
- "... employees cite low pay -- most workers earn only $20,000."
- "... President claims average pay is $70,000!"
- Figure: salary levels of $400,000, $70,000, $50,000, $30,000, and $20,000
41Standard Notation
  Measure      | Sample      | Population
  -------------+-------------+------------
  Mean         | $\bar{x}$   | $\mu$
  Stand. Dev.  | $s$         | $\sigma$
  Variance     | $s^2$       | $\sigma^2$
  Size         | $n$         | $N$
42Numerical Data Properties
Central Tendency (Location)
Variation (Dispersion)
Shape
43Numerical Data Properties and Measures
- Central Tendency: Mean, Median, Mode
- Variation: Range, Interquartile Range, Variance, Standard Deviation
- Shape: Skew
44Central Tendency
46What's wrong with this median calculation?
- Measurements: 1, 4, 2, 9, 8
- Middle measurement is 2, so that's the median
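A short Python check of this slide: the median is the middle value of the ordered measurements, so taking the middle of the list as written gives a different answer. The variable names are illustrative only.

```python
from statistics import median

measurements = [1, 4, 2, 9, 8]

# Middle position of the list as written (the calculation on the slide).
middle_as_written = measurements[len(measurements) // 2]

# Middle position after ordering the measurements.
ordered = sorted(measurements)
middle_when_ordered = ordered[len(ordered) // 2]

print(middle_as_written, middle_when_ordered, median(measurements))
```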
47Mean
48Exercise 2.53
- 18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11
- Calculate mode, median, mean
49What if?
- Replace one of the 18s with 1,118?
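The exercise and the what-if can be checked with Python's standard `statistics` module. This is a sketch for comparison with hand calculation; it replaces the first 18 in the list, which is an arbitrary choice since either 18 gives the same summary measures.

```python
from statistics import mean, median, mode

data = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]

# What if one of the 18s were 1,118 instead?
what_if = data.copy()
what_if[what_if.index(18)] = 1118

for label, values in [("original", data), ("what if", what_if)]:
    print(label, mode(values), median(values), round(mean(values), 2))
```

The mode and median stay put while the mean jumps, which is the sensitivity to extreme values that the what-if is probing.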
50Exercise 2.55a
- N10,
- What's the mean?
- What's the median?
51Thinking Challenge
- "... employees cite low pay -- most workers earn only $20,000."
- "... President claims average pay is $70,000!"
- Figure: salary levels of $400,000, $70,000, $50,000, $30,000, and $20,000
52Ages
- Mean: about 32.3
- Median: 30
  2 | 2345567
  3 | 0122356
  4 |
  5 |
  6 |
  7 |
  8 | 4
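Reading the 15 ages off the stem-and-leaf plot, a short Python sketch can recompute the summary measures. Computed this way the mean comes out near 32.3 and the median is 30; dropping the single age of 84 (done here purely as an illustration) moves the mean by more than the median.

```python
from statistics import mean, median

# Ages read from the stem-and-leaf plot (stem = tens, leaf = ones).
ages = [22, 23, 24, 25, 25, 26, 27, 30, 31, 32, 32, 33, 35, 36, 84]

# Same data with the extreme value removed, for comparison only.
without_84 = [a for a in ages if a != 84]

print(round(mean(ages), 2), median(ages))
print(round(mean(without_84), 2), median(without_84))
```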
53Summary of Central Tendency Measures
  Measure | Equation               | Description
  --------+------------------------+---------------------------
  Mean    | $\sum X_i / n$         | Balance Point
  Median  | $(n + 1)/2$ Position   | Middle Value When Ordered
  Mode    | none                   | Most Frequent
54Shape
56Shape
- 1. Describes How Data Are Distributed
- 2. Measures of Shape
- Skew = Symmetry
- Left-Skewed: Mean < Median < Mode
- Symmetric: Mean = Median = Mode
- Right-Skewed: Mode < Median < Mean
57Exercise 2.62
- Asked to submit 3 letters
- Observed: mean = 2.28, median = 3, mode = 3
- Interpret
58Exercise 2.64 a-d
59Variation
61Quartiles
- 1. Measure of Noncentral Tendency
- 2. Split Ordered Data into 4 Quarters (25% each, separated by Q1, Q2, Q3)
- 3. Position of i-th Quartile: $\text{Positioning point of } Q_i = \dfrac{i\,(n + 1)}{4}$
62Ages
- Range
- Quartiles
  2 | 2345567
  3 | 0122356
  4 |
  5 |
  6 |
  7 |
  8 | 4
63Box Plots
64Age and Salary
- Age
- Quartiles: 25, 30, 33
- Inner fences: 13, 45
- Outer fences: 1, 57
- Salary
- Quartiles: 50K, 63K, 75K
- Inner fences: ??
- Outer fences: ??
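The age figures can be reproduced with a small Python sketch. It assumes the positioning-point rule i(n + 1)/4 for quartiles, linear interpolation when that position is fractional (textbooks differ on this detail), and fences placed 1.5 and 3 interquartile ranges beyond the quartiles. The function names are illustrative.

```python
def quartile(ordered, i):
    """Return the i-th quartile (i = 1, 2, 3) of already-sorted data.

    Uses the positioning point i*(n + 1)/4, counted from 1. Interpolating
    between neighbours for fractional positions is an assumption.
    """
    position = i * (len(ordered) + 1) / 4
    below = int(position)               # 1-based index at or below the position
    fraction = position - below
    value = ordered[below - 1]
    if fraction:
        value += fraction * (ordered[below] - ordered[below - 1])
    return value

def fences(q1, q3):
    """Inner fences sit 1.5 IQR beyond the quartiles, outer fences 3 IQR."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return inner, outer

ages = sorted([22, 23, 24, 25, 25, 26, 27, 30, 31, 32, 32, 33, 35, 36, 84])
q1, q2, q3 = (quartile(ages, i) for i in (1, 2, 3))
inner, outer = fences(q1, q3)
beyond_inner = [a for a in ages if a < inner[0] or a > inner[1]]

print(q1, q2, q3, inner, outer, beyond_inner)
```

Calling `fences(50, 75)` applies the same rule to the salary quartiles.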
65Box Plots in Stata
66Variance and Standard Deviation
- 1. Measures of Dispersion
- 2. Most Common Measures
- 3. Consider How Data Are Distributed
- 4. Show Variation About Mean ($\bar{X}$ or $\mu$)
- Figure: number line from 4 to 12 with $\bar{X} = 8.3$ marked
67Sample Variance Formula
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}$$
- n - 1 in denominator! (Use N if Population Variance)
68Equivalent Formula
69Another Equivalent Formula
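As a numerical check, the Python sketch below compares the defining formula for the sample variance with two standard shortcut forms. That these are the two forms the slides had in mind is an assumption; the data are the five measurements used in Exercise 2.79.

```python
from math import isclose

x = [2, 1, 1, 0, 6]
n = len(x)
x_bar = sum(x) / n
sum_sq = sum(v * v for v in x)

# Defining formula: squared deviations from the mean, divided by n - 1.
definitional = sum((v - x_bar) ** 2 for v in x) / (n - 1)

# Shortcut A: (sum of squares - (sum)^2 / n) / (n - 1).
shortcut_a = (sum_sq - sum(x) ** 2 / n) / (n - 1)

# Shortcut B: (sum of squares - n * mean^2) / (n - 1).
shortcut_b = (sum_sq - n * x_bar ** 2) / (n - 1)

print(definitional, shortcut_a, shortcut_b)
```

The shortcut forms avoid computing each deviation, which is convenient by hand, though they can lose precision in floating point when the mean is large relative to the spread.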
70Exercise 2.70
- What is the primary disadvantage of using the range to compare the variability of data sets?
71Exercise 2.74a
- Calculate variance and standard deviation
73Exercise 2.77 Same mean, different variances
- Using only integers in [0, 10], construct two datasets with at least 10 observations each
- Same mean
- Different variances
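One possible construction, sketched in Python: put every value at the mean in one dataset and push the values to the two ends of the allowed range in the other. These particular datasets are just one example of many that satisfy the exercise.

```python
from statistics import mean, variance

tight = [5] * 10              # every observation at the mean
spread = [0] * 5 + [10] * 5   # observations at the two extremes

print(mean(tight), variance(tight))
print(mean(spread), round(variance(spread), 2))
```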
74Exercise 2.78 Same range, different means
75Exercise 2.79 (simplified) adding a constant
- Measurements: 2, 1, 1, 0, 6
- Mean: 10/5 = 2
- Variance: (0 + 1 + 1 + 4 + 16)/4 = 22/4 = 5.5
- Add 3 to each measurement
- Mean: ?
- Variance: ?
76Exercise 2.79 (simplified) adding a constant
- Measurements: 2, 1, 1, 0, 6
- Mean: 10/5 = 2
- Variance: (0 + 1 + 1 + 4 + 16)/4 = 22/4 = 5.5
- Add 3 to each measurement
- Mean: 25/5 = 5
- Variance: ??
77Why doesn't adding a constant affect variance?
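A quick Python check on the measurements from Exercise 2.79: adding 3 to every value raises the mean by 3, so each deviation from the mean is unchanged, and the variance is built only from those deviations.

```python
from statistics import mean, variance

original = [2, 1, 1, 0, 6]
shifted = [v + 3 for v in original]

# Deviations from the mean are identical before and after the shift.
dev_original = [v - mean(original) for v in original]
dev_shifted = [v - mean(shifted) for v in shifted]

print(mean(original), mean(shifted))
print(variance(original), variance(shifted))
```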
78Stata Measures of Central Tendency
. summ accesses, detail

                         (max) accesses
  -------------------------------------------------------------
        Percentiles      Smallest
   1%            1              1
   5%            1              1
  10%            1              1       Obs               22049
  25%            1              1       Sum of Wgt.       22049

  50%            2                      Mean           7.827158
                          Largest       Std. Dev.      559.3611
  75%            4            211
  90%            8            219       Variance       312884.9
  95%           12           1799       Skewness        148.346
  99%           26          83038       Kurtosis        22020.4
79Chebyshev and Empirical Rules: Intuitions
- Can't all be above average
- Can't have too many values very far from the mean
- Or can you?
- What if half the values are 1000, half are -1000?
- Can't have too many values very far from the mean
- "Very far" measured in standard deviations
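80Chebyshev's Rule: Preliminaries
- Lemma: For any positive variable $Y$ and any constant $a > 0$, the fraction of values with $Y \ge a$ is at most $\bar{Y}/a$
- Proof of Lemma
- For values of $Y \ge a$, define $Z = a$
- For values of $Y < a$, define $Z = 0$
- Clearly the mean of $Y$ is at least the mean of $Z$
- But the mean of $Z$ is just $a$ times the fraction of values with $Y \ge a$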
81Chebyshev's Rule
(From lemma)
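A sketch of how the rule follows, assuming the lemma is the usual bound that the fraction of values with $Y \ge a$ is at most $\bar{Y}/a$. Apply it to $Y_i = (x_i - \bar{x})^2$ with $a = k^2 s^2$, for any $k > 0$ and $s > 0$:

```latex
\[
\frac{\#\{\, i : |x_i - \bar{x}| \ge k s \,\}}{n}
  = \frac{\#\{\, i : (x_i - \bar{x})^2 \ge k^2 s^2 \,\}}{n}
  \le \frac{\tfrac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}{k^2 s^2}
  = \frac{(n - 1)\, s^2}{n\, k^2 s^2}
  \le \frac{1}{k^2}
\]
```

So at most $1/k^2$ of the measurements lie $k$ or more standard deviations from the mean, or equivalently at least $1 - 1/k^2$ lie within $k$ standard deviations.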
82Empirical Rule
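- If x has a symmetric, mound-shaped distribution, approximately 68% of measurements fall within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3
- Justification: known properties of the normal distribution, to be studied later in the course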
83Example
- Data set has nine 0 values and one 100
- Mean = 10, Range = 100
- $s^2 = (9 \cdot 100 + 1 \cdot 8100)/9 = 1000$, $s = 31.62$
- 10% of the values are at a distance of 90 = 2.85s from the mean (just short of 3s)
- Chebyshev's rule holds: 10% < 1/2.85^2 = 12.3%
- Empirical rule violated: 10% lie beyond 2s, where only about 5% are expected
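The numbers in this example can be recomputed with a few lines of Python. Note that the distance of the value 100 from the mean works out to slightly under 3 standard deviations, so the sketch evaluates the Chebyshev bound at that exact distance rather than at k = 3.

```python
from statistics import mean, stdev

data = [0] * 9 + [100]

m = mean(data)                 # 10
s = stdev(data)                # sample standard deviation, about 31.62
distance = 100 - m             # how far the one large value is from the mean
k = distance / s               # the same distance in standard deviations

# Share of values at least that far from the mean, and Chebyshev's bound for k.
share_far = sum(abs(v - m) >= distance for v in data) / len(data)
chebyshev_bound = 1 / k ** 2

print(m, round(s, 2), round(k, 3), share_far, round(chebyshev_bound, 4))
```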
84Preview of Statistical Inference
- You observe one data point
- Make a hypothesis about the mean and standard deviation of the distribution from which it was drawn
- Chebyshev's Rule or Empirical Rule tells you how (un)likely the data point is
- If very unlikely, you are suspicious of the hypothesis about mean and standard deviation, and reject it
85Exercise 2.87
- n = 200, mean = 1500, s = 300
- How many measurements in (900, 2100)?
- How many measurements in (600, 2400)?
- How many measurements in (1200, 1800)?
- How many measurements in (1500, 2100)?
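One way to organize the arithmetic is sketched below in Python. It assumes the usual empirical-rule percentages (68%, 95%, 99.7%), which only apply if the data are mound-shaped and symmetric; whether the exercise grants that is an assumption, so the distribution-free Chebyshev minimum is shown alongside.

```python
n, mean, s = 200, 1500, 300

# Approximate shares within k standard deviations for mound-shaped data.
EMPIRICAL_SHARE = {1: 0.68, 2: 0.95, 3: 0.997}

def chebyshev_at_least(k):
    """Minimum count within k standard deviations, for any distribution."""
    return n * max(0.0, 1 - 1 / k ** 2)

def empirical_about(k):
    """Approximate count within k standard deviations, mound-shaped data only."""
    return n * EMPIRICAL_SHARE[k]

results = {}
for low, high in [(900, 2100), (600, 2400), (1200, 1800)]:
    k = (high - mean) // s   # 2, 3, and 1 for these symmetric intervals
    results[(low, high)] = (chebyshev_at_least(k), empirical_about(k))

# (1500, 2100) is the upper half of (900, 2100): by symmetry, about half of 95%.
upper_half = empirical_about(2) / 2

print(results, upper_half)
```

Chebyshev's rule says nothing useful for k = 1 and nothing about one-sided intervals, which is why the last two parts lean on the mound-shaped assumption.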
86Summary of Variation Measures
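  Measure                          | Equation                                    | Description
  ---------------------------------+---------------------------------------------+--------------------------------------
  Range                            | $X_{\text{largest}} - X_{\text{smallest}}$  | Total Spread
  Interquartile Range              | $Q_3 - Q_1$                                 | Spread of Middle 50%
  Standard Deviation (Sample)      | $\sqrt{\sum (X_i - \bar{X})^2 / (n - 1)}$   | Dispersion about Sample Mean
  Standard Deviation (Population)  | $\sqrt{\sum (X_i - \mu)^2 / N}$             | Dispersion about Population Mean
  Variance (Sample)                | $\sum (X_i - \bar{X})^2 / (n - 1)$          | Squared Dispersion about Sample Mean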
87Z-scores
- Number of standard deviations from the mean
- Chebyshev and empirical rules apply
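As a small illustration in Python, z-scores for the student ages read from the stem-and-leaf plot earlier in the chapter can be computed as (value - mean) / standard deviation. Using the sample standard deviation here is an assumption about which version the slide intends.

```python
from statistics import mean, stdev

ages = [22, 23, 24, 25, 25, 26, 27, 30, 31, 32, 32, 33, 35, 36, 84]

m, s = mean(ages), stdev(ages)

# z-score: how many standard deviations each age lies from the mean.
z_scores = [(a - m) / s for a in ages]

for age, z in zip(ages, z_scores):
    print(age, round(z, 2))
```

Only the age of 84 ends up more than 3 standard deviations from the mean; every other age is within 1.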
88Exercise 2.117c, page 85
89Scatterplots
90Misleading With Statistics
- Bar graphs
- Stretch the vertical axis
- Scale break
92Misleading With Statistics
- Bar graphs
- Stretch the vertical axis
- Scale break
- Reporting central tendency
- Medians vs. means
- Not reporting variance
- Small samples with reports of relative frequency
93End of Chapter