Title: Descriptive statistics (Part I)
1. Lecture 2
- Descriptive statistics (Part I)
2. Lecture 2: Descriptive statistics
- Data in raw form are usually not easy to use for decision making
- Some type of organization is needed
- Table
- Graph
- Techniques reviewed here
- Bar charts and pie charts
- Ordered array
- Stem-and-leaf display
- Frequency distributions, histograms
- Cumulative distributions
- Contingency tables
3. Tabulating and Graphing Univariate Categorical Data
- Categorical Data
- Tabulating Data: Summary Table
- Graphing Data: Bar Charts, Pie Charts
4. Summary Table (for an Investor's Portfolio)
Investment Category   Amount (in thousands)   Percentage
Stocks                46.5                    42.27
Bonds                 32                      29.09
CD                    15.5                    14.09
Savings               16                      14.55
Total                 110                     100
Variables are Categorical
5. Bar Chart (for an Investor's Portfolio)
6. Pie Chart (for an Investor's Portfolio)
[Pie chart: Amount Invested in K. Stocks 42%, Bonds 29%, Savings 15%, CD 14%]
Percentages are rounded to the nearest percent
7. Organizing Numerical Data
- Numerical Data: 41, 24, 32, 26, 27, 27, 30, 24, 38, 21
- Ordered Array: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
- Stem and Leaf Display: 2 | 144677, 3 | 028, 4 | 1
- Frequency Distributions and Cumulative Distributions: Tables, Histograms
8. The Ordered Array
- Data in raw form (as collected): 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
- Data in ordered array from smallest to largest: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
- Shows range (min to max)
- May help identify outliers (unusual observations)
- If the data set is large, the ordered array is less useful
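A minimal sketch of building the ordered array and reading the range off its ends, using the ten values from the slide. The variable names are illustrative only.

```python
# Raw data as collected (values taken from the slide).
raw = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]

# sorted() returns a new list and leaves the raw data untouched.
ordered = sorted(raw)

# The smallest and largest values sit at the two ends of the ordered array.
smallest, largest = ordered[0], ordered[-1]
data_range = largest - smallest

print(ordered)
print(f"min={smallest}, max={largest}, range={data_range}")
```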
9. Stem-and-Leaf Display
- A simple way to see distribution details in a data set
- METHOD: Separate the sorted data series into leading digits (the stem) and the trailing digits (the leaves)
10. Example
- Data in Raw Form (as Collected): 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
- Data in Ordered Array from Smallest to Largest: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
- Stem-and-Leaf Display
2 | 1 4 4 6 7 7
3 | 0 2 8
4 | 1
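The stem-and-leaf method can be sketched in a few lines. This version assumes non-negative integers where the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens part) to its sorted list of leaves (units digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 27 -> stem 2, leaf 7
        stems[stem].append(leaf)
    return dict(stems)


data = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]
display = stem_and_leaf(data)

for stem in sorted(display):
    print(stem, "|", " ".join(str(leaf) for leaf in display[stem]))
```

Because the values are sorted before splitting, the leaves within each stem come out in ascending order.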
11. Tabulating Numerical Data: Frequency Distributions
- What is a Frequency Distribution?
- A frequency distribution is a list or a table containing class groupings (ranges within which the data fall) and the corresponding frequencies with which data fall within each grouping or category
- It allows for a quick visual interpretation of the data
12. Tabulating Numerical Data: Frequency Distributions
- Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature
- 24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
- 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
13.
- Sort Raw Data in Ascending Order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
- Find Range: 58 - 12 = 46
- Select Number of Classes: 5 (usually between 5 and 15)
- Compute Class Interval (Width): 10 (46/5 = 9.2, then round up)
- Determine Class Boundaries (Limits): 10, 20, 30, 40, 50, 60
- Count Observations and Assign to Classes
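The steps from range to class boundaries can be sketched as follows. Rounding the width up to a multiple of 10 and starting the first class at a multiple of the width are assumptions chosen to match the slide's numbers, not a general rule.

```python
import math

temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

num_classes = 5
data_range = max(temps) - min(temps)  # 58 - 12 = 46

# Raw width is 46 / 5 = 9.2; round it up to a convenient unit.
# The unit of 10 is an assumption that reproduces the slide's width.
round_to = 10
width = math.ceil(data_range / num_classes / round_to) * round_to

# Start the first class at the multiple of the width at or below the minimum.
start = (min(temps) // width) * width
boundaries = [start + k * width for k in range(num_classes + 1)]

print(data_range, width, boundaries)
```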
14. Frequency Distributions, Relative Frequency Distributions and Percentage Distributions
Data in Ordered Array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class      Frequency   Relative Frequency   Percentage
[10, 20)   3           .15                  15
[20, 30)   6           .30                  30
[30, 40)   5           .25                  25
[40, 50)   4           .20                  20
[50, 60)   2           .10                  10
Total      20          1                    100
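A short sketch of how the three columns relate: count the observations in each half-open class [lower, upper), divide by the number of observations for the relative frequency, and multiply by 100 for the percentage. The tuple layout of `rows` is an illustrative choice.

```python
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]
boundaries = [10, 20, 30, 40, 50, 60]
n = len(temps)

rows = []
for lower, upper in zip(boundaries, boundaries[1:]):
    # Half-open class: the lower limit is included, the upper limit is not.
    freq = sum(1 for t in temps if lower <= t < upper)
    rows.append((lower, upper, freq, freq / n, 100 * freq / n))

for lower, upper, freq, rel, pct in rows:
    print(f"[{lower}, {upper})  {freq:>2}  {rel:.2f}  {pct:.0f}")
```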
15. Graphing Numerical Data: The Histogram
- A graph of the data in a frequency distribution is called a histogram
- The class boundaries (or class midpoints) are shown on the horizontal axis
- The vertical axis is either frequency, relative frequency, or percentage
- Bars of the appropriate heights are used to represent the number of observations within each class
16. Histogram Example
Class      Class Midpoint   Frequency
[10, 20)   15               3
[20, 30)   25               6
[30, 40)   35               5
[40, 50)   45               4
[50, 60)   55               2
[Histogram: frequency plotted against class midpoints; no gaps between bars]
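As a rough illustration, a text-only histogram can be drawn from the class frequencies. It is rotated relative to the slide (bars run horizontally), and the `#` bar character is an arbitrary choice.

```python
# (lower limit, upper limit, frequency) for each class, from the slide.
classes = [(10, 20, 3), (20, 30, 6), (30, 40, 5), (40, 50, 4), (50, 60, 2)]

lines = []
for lower, upper, freq in classes:
    midpoint = (lower + upper) / 2  # class midpoint labels the bar
    lines.append(f"{midpoint:>4.0f} | {'#' * freq} ({freq})")

print("\n".join(lines))
```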
17. Tabulating Numerical Data: Cumulative Frequency
Data in Ordered Array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Upper Limit   Cumulative Frequency   Cumulative Percentage
10            0                      0
20            3                      15
30            9                      45
40            14                     70
50            18                     90
60            20                     100
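The cumulative columns are running totals of the class frequencies. A minimal sketch using `itertools.accumulate`, with a leading 0 for the observations below the first limit of 10:

```python
from itertools import accumulate

upper_limits = [10, 20, 30, 40, 50, 60]
# Frequency of observations falling just below each upper limit;
# nothing lies below 10, so the first entry is 0.
frequencies = [0, 3, 6, 5, 4, 2]

total = sum(frequencies)
cumulative = list(accumulate(frequencies))
cumulative_pct = [100 * c / total for c in cumulative]

for limit, c, pct in zip(upper_limits, cumulative, cumulative_pct):
    print(f"{limit:>3}  {c:>2}  {pct:.0f}")
```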
18. Two categorical variables (contingency table)
- The following data represent the responses to two questions asked in a survey of 20 college students majoring in business
- What is your gender? (Male = M; Female = F)
- What is your major? (Accountancy = A; Information System = I; Marketing = M)
- Gender: M M M F M F F M F M F M M M M F F M F F
- Major: A I I M A I A A I I A A A M I M A A A I
19. Contingency table (contd)
- Raw data set
- Gender: M M M F M F F M F M F M M M M F F M F F
- Major: A I I M A I A A I I A A A M I M A A A I
         A    I    M    Total
Male     6    4    1    11
Female   4    3    2    9
Total    10   7    3    20
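The contingency table can be reproduced by counting (gender, major) pairs. This is a sketch using only the standard library; the `table` dictionary layout is an illustrative choice.

```python
from collections import Counter

gender = "M M M F M F F M F M F M M M M F F M F F".split()
major = "A I I M A I A A I I A A A M I M A A A I".split()

# Count each (gender, major) pair; pairs that never occur count as 0.
counts = Counter(zip(gender, major))

majors = ["A", "I", "M"]
table = {}
for code, label in [("M", "Male"), ("F", "Female")]:
    row = [counts[(code, m)] for m in majors]
    table[label] = row + [sum(row)]  # append the row total

# Column totals; the last entry is the grand total.
table["Total"] = [sum(col) for col in zip(*table.values())]

print("Gender", *majors, "Total")
for label, row in table.items():
    print(label, *row)
```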
20. Graphical methods are
- Good in presenting data
- Not easy for comparison
- Difficult to use for statistical inference
21. Numerical description
Summary Measures
- Central Tendency (location measures): Mean, Median, Mode
- Quartiles
- Variation: Range, Interquartile range, Variance, Standard Deviation
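22. Mean
- Mean (Arithmetic Mean) of Data Values
- Sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, where $n$ is the Sample Size
- Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$, where $N$ is the Population Size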
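23. An example
- TV watching hours/week: 5, 7, 3, 38, 7
- Mean = (5 + 7 + 3 + 38 + 7)/5 = 60/5 = 12
- If the correct time for the 4th subject is 8 (not 38)
- Mean = (5 + 7 + 3 + 8 + 7)/5 = 30/5 = 6
[Figure: two dot plots of the data, one with largest value 8 (Mean = 6) and one with largest value 38 (Mean = 12)]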
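The arithmetic above can be checked directly. A small sketch showing how one mistyped value moves the mean:

```python
hours_recorded = [5, 7, 3, 38, 7]   # 4th subject recorded as 38
hours_corrected = [5, 7, 3, 8, 7]   # 4th subject corrected to 8

# Arithmetic mean: sum of the values divided by the number of values.
mean_recorded = sum(hours_recorded) / len(hours_recorded)
mean_corrected = sum(hours_corrected) / len(hours_corrected)

print(mean_recorded, mean_corrected)
```

A single extreme value doubles the mean here, from 6 to 12.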
24. Mean (Contd)
- The Most Common Measure of Central Tendency, especially when n is large, due to its good theoretical properties
- Affected by Extreme Values (Outliers)
25. Median
- Robust measure of central tendency
- Not affected by extreme values
- In an ordered array, the median is the middle number
- If n is odd, the median is the middle number (i.e., the (n+1)/2 th measurement)
- If n is even, the median is the average of the n/2 th and (n/2 + 1) th measurements
[Figure: two dot plots, one with largest value 8 and one with largest value 38; Median = 7 in both]
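The odd/even rule can be sketched as a small function. The name `median` is used here for illustration; Python's standard library also offers `statistics.median`. The TV-hours values are the ones from the mean example.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n+1)/2 th measurement (index mid in 0-based terms).
        return ordered[mid]
    # Even n: average of the n/2 th and (n/2 + 1) th measurements.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([5, 7, 3, 38, 7]))  # data with the extreme value 38
print(median([5, 7, 3, 8, 7]))   # data with the corrected value 8
```

Both calls give the same result, which illustrates why the median is called robust.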
26. Mode
- A Measure of Central Tendency
- Value that Occurs Most Often
- Not Affected by Extreme Values
- There May Not Be a Mode
- There May Be Several Modes
- Used for Either Numerical or Categorical Data
[Figure: two dot plots. On the axis from 0 to 6: No Mode. On the axis from 0 to 14: Mode = 9]
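A sketch that covers the cases listed above: no mode, one mode, several modes, and categorical data. The data sets are hypothetical (the slide's plotted values are not recoverable), and treating "every value occurs once" as "no mode" is one common convention.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs exactly once: no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([0, 1, 2, 3, 4, 5, 6]))        # no value repeats
print(modes([1, 3, 9, 9, 9, 14]))          # a single mode
print(modes(["A", "I", "A", "M", "I"]))    # categorical data, two modes
```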
27. Quartiles
- Split ordered data into 4 quarters (25% of the observations in each)
- Position of i-th quartile: position of $Q_i$ = $\frac{i(n+1)}{4}$
- $Q_1$ (1st quartile) and $Q_3$ (3rd quartile) are measures of Noncentral Location
- $Q_1$, $Q_2$, and $Q_3$ are called the 25th, 50th, and 75th percentiles respectively. A pth percentile is the value of X such that p% of the measurements are less than X and (100-p)% are greater than X.
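28. Quartiles (example)
Data in Ordered Array: 3 6 6 12 12 12 15 15 18 21
- Position of first quartile is 1(10+1)/4 = 2.75, so $Q_1$ = 6 + 0.75(6 - 6) = 6
- Position of third quartile is 3(10+1)/4 = 8.25, so $Q_3$ = 15 + 0.25(18 - 15) = 15.75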
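A sketch of the quartile calculation, assuming the position rule i(n+1)/4 with linear interpolation between neighbouring values. That assumption reproduces the third quartile of 15.75 reported for this data set; other textbooks use slightly different rules. The function name `quartile` is hypothetical.

```python
def quartile(ordered, i):
    """i-th quartile (i = 1, 2, 3) of already sorted data, position i(n+1)/4."""
    n = len(ordered)
    pos = i * (n + 1) / 4      # 1-based position, may be fractional
    lower = int(pos)           # position of the value just below
    frac = pos - lower         # how far to move toward the next value
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + frac * (above - below)


data = [3, 6, 6, 12, 12, 12, 15, 15, 18, 21]
q1, q2, q3 = (quartile(data, i) for i in (1, 2, 3))
print(q1, q2, q3)
```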
29. 5-number summary
- Box-and-Whisker Plot
- Graphical display of data using 5 numbers
Data in Ordered Array: 3 6 6 12 12 12 15 15 18 21
[Box-and-whisker plot: $X_{smallest}$ = 3, $Q_1$ = 6, Median ($Q_2$) = 12, $Q_3$ = 15.75, $X_{largest}$ = 21]
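The five numbers behind a box-and-whisker plot can also be obtained from Python's standard library. This sketch assumes Python 3.8 or later and uses `statistics.quantiles` with `method="exclusive"`, which places quartiles at positions based on n + 1 and so agrees with the values above for this data set.

```python
from statistics import quantiles

data = [3, 6, 6, 12, 12, 12, 15, 15, 18, 21]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")

five_number_summary = (min(data), q1, q2, q3, max(data))
interquartile_range = q3 - q1

print(five_number_summary)
print(interquartile_range)
```

The box spans the first to the third quartile, the line inside it marks the median, and the whiskers extend to the smallest and largest values.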