Title: Descriptive Statistics
1Descriptive Statistics
- Lecture 02
- Tabular and Graphical Presentation of Data and
Measures of Location
2Presentation of Qualitative Variables
- The simplest way of presenting/summarizing a qualitative variable is by using a frequency table, which shows the frequency of occurrence of each of the different categories.
- Such a table could also include the relative frequency, which indicates the proportion or percentage of occurrence of each of the categories.
- The frequency table could then be pictorially represented by a bar graph or a pie diagram.
3An Example
- A manufacturer of jeans has plants in California (CA), Arizona (AZ), and Texas (TX). A sample of 25 pairs of jeans was randomly selected from a computerized database, and the state in which each was produced was recorded. The data are as follows:
- CA AZ AZ TX CA CA CA TX TX TX AZ AZ CA AZ TX CA AZ TX TX TX CA AZ AZ CA CA
- Quite uninformative at this stage!
- Need to summarize to reveal information.
4The Frequency Table
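The frequency table for this sample can be tallied directly from the 25 recorded states. The following is a minimal Python sketch; the function name `frequency_table` is a hypothetical helper chosen for illustration, not something from the lecture.

```python
from collections import Counter

# The 25 sampled states, copied from the example slide.
states = ("CA AZ AZ TX CA CA CA TX TX TX AZ AZ CA AZ TX CA "
          "AZ TX TX TX CA AZ AZ CA CA").split()

def frequency_table(values):
    """Return {category: (frequency, relative frequency)}."""
    counts = Counter(values)
    n = len(values)
    return {cat: (k, k / n) for cat, k in counts.items()}

table = frequency_table(states)
print("State  Freq  RelFreq")
for cat in ("CA", "AZ", "TX"):
    freq, rel = table[cat]
    print(f"{cat:<5}  {freq:>4}  {rel:>7.2f}")
```

The relative frequencies come out as 0.36, 0.32, and 0.32, which are the values used for the pie diagram angles later in the example.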
5The Bar Chart
[Figure: bar chart of Frequency (vertical axis, 0 to 10) for the states TX, CA, AZ]
6Example continued
- By looking at this frequency table and bar graph, one can see that roughly equal proportions of the sampled pairs of jeans were manufactured in the three states.
- The frequency table and bar graph are certainly more informative than the raw presentation of the sample data.
- Another method of pictorial presentation of qualitative data is the pie diagram. In this case a pie is divided into the categories, with a given category's angle being equal to 360 degrees times the relative frequency of occurrence of that category.
7Pie Diagram
Angles (in degrees):
- CA: (360)(0.36) = 129.6
- AZ: (360)(0.32) = 115.2
- TX: (360)(0.32) = 115.2
[Figure: pie diagram with slices of 129.6°, 115.2°, and 115.2°]
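The angle rule (360 degrees times the relative frequency) can be checked with a few lines of Python. This is only a sketch: the frequencies 9, 8, and 8 are the counts implied by the relative frequencies 0.36, 0.32, and 0.32 with n = 25, and `pie_angles` is a hypothetical helper name.

```python
# Frequencies implied by the jeans example: 9 CA, 8 AZ, 8 TX out of 25.
frequencies = {"CA": 9, "AZ": 8, "TX": 8}

def pie_angles(freqs):
    """Angle in degrees for each category: 360 * relative frequency."""
    n = sum(freqs.values())
    return {cat: 360 * k / n for cat, k in freqs.items()}

angles = pie_angles(frequencies)
for cat, angle in angles.items():
    print(f"{cat}: {angle:.1f} degrees")
```

The angles necessarily add up to 360 degrees, which is a quick sanity check on any hand computation.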
8Pie Chart from Minitab
9Presentation of Quantitative Variables
- When the quantitative variable is discrete (such as counts), a frequency table and a bar graph could also be used for summarizing it.
- The only difference is that the values of the variable cannot be reordered in the graph, since they have a natural numerical order, in contrast to when the variable is categorical or qualitative.
- For example, suppose that we asked a sample of 20 students about the number of siblings in their family. The sample data might be:
- 4, 1, 6, 2, 2, 3, 4, 1, 2, 2, 3, 7, 2, 1, 1, 5, 3, 4, 6, 3
10Its Bar Graph is
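As a rough text stand-in for the bar graph, the sibling counts can be tallied in numerical order. The sketch below assumes nothing beyond the 20 sample values; the variable name `ordered` and the asterisk bars are illustrative choices.

```python
from collections import Counter

# Number of siblings reported by the 20 sampled students.
siblings = [4, 1, 6, 2, 2, 3, 4, 1, 2, 2, 3, 7, 2, 1, 1, 5, 3, 4, 6, 3]

counts = Counter(siblings)
# Walk the values in numerical order and include every integer in the
# observed range, so that a value with zero frequency would still show up.
ordered = [(v, counts.get(v, 0))
           for v in range(min(siblings), max(siblings) + 1)]

for value, freq in ordered:
    print(f"{value} | {'*' * freq} ({freq})")
```

Unlike the state categories, the values 1 through 7 must stay in this order on the horizontal axis.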
11An Example of a Real Data Set: Poverty versus PACT in SC
Lunch  ActualLang  ActualMath
74  48  54
77  43  55
94  41  62
88  49  62
78  50  59
79  46  58
61  41  47
45  26  34
87  49  62
68  36  52
76  45  56
32  22  31
63  39  53
33  20  26
64  44  53
39  20  22
37  21  27
47  23  30
40  29  41
43  25  27
37  24  31
64  37  43
59  36  45
70  32  41
55  37  46
90  38  47
45  32  35
31  25  24
35  29  32
15  14  18
73  30  41
31  24  30
75  45  57
57  29  40
80  51  63
54  30  44
67  28  33
76  45  50
87  61  61
54  27  33
60  32  41
35  26  35
51  29  36
50  35  42
43  23  26
66  32  44
86  63  75
54  25  33
87  60  69
49  29  37
46  38  43
50  38  44
57  40  50
90  60  75
26  17  20
47  23  27
53  37  39
58  34  43
16  13  15
59  32  38
46  26  30
90  63  67
29  17  24
41  24  26
51  30  41
41  25  30
43  32  36
70  33  36
93  50  66
84  50  66
64  27  32
52  36  43
50  31  43
53  28  35
78  36  41
57  31  42
51  39  42
55  41  53
60  37  45
96  46  66
75  34  45
60  29  36
71  43  53
68  42  51
76  47  52
82  49  55
12Frequency Tables and Histograms
Consider the variable Lunch, which represents the percentage of students in the school district whose lunches are not free. The higher the value of this variable, the richer the district.
n = Number of Observations = 86
LV = Lowest Value = 15
HV = Highest Value = 96
Let us construct a frequency table with classes [10,20), [20,30), [30,40), ..., [90,100).
13Frequency Table for Variable Lunch
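The class frequencies for Lunch can be tallied as in the sketch below. The 86 Lunch values are read off the first column of the data listing above, so any transcription slip there would carry over here; `class_frequencies` and its parameters are hypothetical names chosen for illustration.

```python
# Lunch values (first column of the Poverty versus PACT data), 86 districts.
lunch = [
    74, 77, 94, 88, 78, 79, 61, 45, 87, 68, 76, 32, 63, 33, 64,
    39, 37, 47, 40, 43, 37, 64, 59, 70, 55, 90, 45, 31, 35, 15,
    73, 31, 75, 57, 80, 54, 67, 76, 87, 54, 60, 35, 51, 50, 43,
    66, 86, 54, 87, 49, 46, 50, 57, 90, 26, 47, 53, 58, 16,
    59, 46, 90, 29, 41, 51, 41, 43, 70, 93, 84, 64, 52, 50, 53,
    78, 57, 51, 55, 60, 96, 75, 60, 71, 68, 76, 82,
]

def class_frequencies(values, start=10, stop=100, width=10):
    """Count observations in left-closed classes [a, a + width)."""
    classes = [(a, a + width) for a in range(start, stop, width)]
    return [(a, b, sum(a <= v < b for v in values)) for a, b in classes]

n = len(lunch)
print("Class      Freq  RelFreq")
for a, b, freq in class_frequencies(lunch):
    print(f"[{a},{b})  {freq:>5}  {freq / n:>7.3f}")
```

Because each class is closed on the left and open on the right, a value such as 90 falls in [90,100) and not in [80,90).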
14Frequency Histogram
15Stem-and-Leaf Plots
- An important tool for presenting quantitative data when the sample size is not too large is the stem-and-leaf plot.
- By using this method, there is usually no loss of information, in that the exact values of the observations could be recovered (in contrast to a frequency table for continuous data).
- Basic idea: divide each observation into a stem and a leaf.
- The stems will serve as the body of the plant, while the leaves will serve as the branches or leaves of the plant.
- An illustration makes the idea transparent.
16An Example
- A random sample of 30 subjects from the 1910 subjects in the blood pressure data set was selected. We present here the systolic blood pressures of these 30 subjects.
- 30 Systolic Blood Pressures: 122 135 110 126 100 110 110 126 94 124 108 110 92 98 118 110 102 108 126 104 110 120 110 118 100 110 120 100 120 92
- Lowest Value = 92, Highest Value = 135
- Stems: 9, 10, 11, 12, 13
- Leaves: Ones Digit
17Stem-and-Leaf Plot
-  9 | 224
-  9 | 8
- 10 | 00024
- 10 | 88
- 11 | 00000000
- 11 | 88
- 12 | 00024
- 12 | 666
- 13 |
- 13 | 5
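A plot like the one above can be generated programmatically. The following Python sketch assumes stems are the tens part and leaves the ones digit, with each stem split into a lower half (leaves 0 to 4) and an upper half (leaves 5 to 9); `split_stem_and_leaf` is a hypothetical function name.

```python
from collections import defaultdict

# The 30 systolic blood pressures from the example.
bp = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
      108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
      110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

def split_stem_and_leaf(values):
    """Return plot lines with each stem split into leaves 0-4 and 5-9."""
    rows = defaultdict(list)
    for v in sorted(values):              # sorting puts leaves in ascending order
        stem, leaf = divmod(v, 10)
        rows[(stem, leaf >= 5)].append(str(leaf))
    lo, hi = min(values) // 10, max(values) // 10
    lines = []
    for stem in range(lo, hi + 1):
        for upper in (False, True):       # lower half first, then upper half
            lines.append(f"{stem:>2} | {''.join(rows[(stem, upper)])}")
    return lines

lines = split_stem_and_leaf(bp)
print("\n".join(lines))
```

Empty rows (such as the lower half of stem 13) are kept on purpose, since dropping them would distort the shape of the distribution.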
18Stem-and-Leaf continued
- In this stem-and-leaf plot, because there would only be 5 stems if we used 9, 10, 11, 12, 13, we decided to subdivide each stem into two parts, corresponding to leaf values ≤ 4 and those ≥ 5.
- Such a procedure usually produces better looking distributions.
- Looking at this stem-and-leaf plot, notice that many of the observations are in the range of 100-126.
- The exact values could be recovered from this plot.
- By arranging the leaves in ascending order, the plot also becomes more informative.
19Comparative Stem-and-Leaf Plots
- When comparing the distributions of two groups (e.g., when classified according to GENDER), side-by-side stem-and-leaf plots (also side-by-side histograms) could be used.
- To illustrate, consider 30 observations from the blood pressure data set with Gender and Systolic Blood Pressure being the observed variables.
- For the males (Sex = 0): 122, 120, 130, 110, 134, 136, 142, 100, 120, 162, 126, 132, 124, 130
- For the females (Sex = 1): 132, 94, 104, 100, 130, 110, 102, 110, 130, 92, 125, 108, 100, 130, 100, 100
20Comparing Male/Female Systolic Blood Pressures
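One way to put the two groups side by side is a back-to-back stem-and-leaf plot with shared stems. The sketch below is an assumption about layout (unsplit stems, male leaves on the left reading outward from the stem) and is not necessarily how the original figure was drawn; `leaves_by_stem` and `back_to_back` are hypothetical helper names.

```python
males = [122, 120, 130, 110, 134, 136, 142, 100, 120, 162, 126, 132, 124, 130]
females = [132, 94, 104, 100, 130, 110, 102, 110, 130, 92, 125, 108, 100,
           130, 100, 100]

def leaves_by_stem(values):
    """Map each tens-stem to its ones-digit leaves in ascending order."""
    out = {}
    for v in sorted(values):
        out.setdefault(v // 10, []).append(str(v % 10))
    return out

def back_to_back(left, right):
    """Return plot lines: left leaves | stem | right leaves."""
    l, r = leaves_by_stem(left), leaves_by_stem(right)
    stems = range(min(min(l), min(r)), max(max(l), max(r)) + 1)
    width = max(len(x) for x in l.values())
    lines = []
    for s in stems:
        left_leaves = "".join(reversed(l.get(s, [])))  # read outward from stem
        right_leaves = "".join(r.get(s, []))
        lines.append(f"{left_leaves:>{width}} | {s:>2} | {right_leaves}")
    return lines

lines = back_to_back(males, females)
print("Males | Stem | Females")
print("\n".join(lines))
```

In this layout the male readings sit visibly higher on the stems than the female readings, which is the kind of comparison the side-by-side display is meant to support.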
21Scatterplots: Studying Relationship Between Poverty and Math
Question: What kind of relationship is there between Lunch and PACT Math Scores?
22Numerical Summary Measures
- Overview
- Why do we need numerical summary measures?
- Measures of Location
- Measures of Variation
- Measures of Position
- Box Plots
23Why Do We Need Summary Measures?
- A picture is worth a thousand words, but beauty is always in the eyes of the beholder!
- Graphs or pictures are sometimes unwieldy.
- One usually wants a small set of numbers that could provide the important features of the data set.
- When making decisions, objectivity is enhanced when they are based on numbers!
- Numerical summaries and tabular/graphical presentations complement each other.
24The Setting
- In defining and illustrating our summary measures, assume that we have sample data.
- Sample Data: X1, X2, X3, ..., Xn
- Sample Size: n
- These summary measures are thus (sample) statistics.
- If instead they are based on the population values, they will be (population) parameters.
25Measures of Location or Center
- These are summary measures that provide information on the center of the data set.
- Usually, these measures of location are where the observations cluster, but not always.
- In layman's terms, these measures are what we associate with averages.
- We will discuss two measures: the sample mean and the sample median.
26Sample Mean or Arithmetic Average
- The sample mean equals the sum of the observations divided by the number of observations.
- It is defined symbolically via $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$.
27Properties of the Sample Mean
- Center of gravity (balancing point) of the data
- Sum of the deviations of the observations from the mean is always zero (barring rounding errors)
- The sample mean could, however, be affected drastically by extreme values or outliers
- The sample mean is very conducive to mathematical analysis compared to other measures of location
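Two of these properties are easy to check numerically. The sketch below uses the 30 systolic blood pressures from the stem-and-leaf example; the altered data set, in which the reading 135 is replaced by 235, is a made-up outlier used only to illustrate sensitivity.

```python
from statistics import mean

# The 30 systolic blood pressures from the stem-and-leaf example.
bp = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
      108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
      110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

xbar = mean(bp)
deviations = [x - xbar for x in bp]

print(f"sample mean       = {xbar:.1f}")
# Zero up to floating-point rounding error.
print(f"sum of deviations = {sum(deviations):.10f}")

# Hypothetical outlier: pretend the reading 135 had been recorded as 235.
bp_outlier = [235 if x == 135 else x for x in bp]
print(f"mean with outlier = {mean(bp_outlier):.1f}")
```

Changing a single observation out of 30 moves the mean by more than three units, which is what "affected drastically by outliers" means in practice.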
28Illustration
- Consider the systolic blood pressure data set considered in Lecture 01
- Sample Size: n = 30
- Data: 122, 135, 110, 126, 100, 110, 110, 126, 94, 124, 108, 110, 92, 98, 118, 110, 102, 108, 126, 104, 110, 120, 110, 118, 100, 110, 120, 100, 120, 92
29Sample Mean Computation
- Sample mean = (122 + 135 + ... + 92)/30 = 3333/30 = 111.1
- This value of 111.1 could be interpreted as the balancing point of the 30 systolic blood pressure observations.
- Locating this in the histogram we have
30Sample Mean in Histogram
31Sample Median
- Sample median (M): the value that divides the arranged/ordered data set into two equal parts.
- At least 50% of the observations are ≤ M and at least 50% are ≥ M
- Not sensitive to outliers, but harder to deal with mathematically
- Appropriate when the histogram is left- or right-skewed
- Better to present both mean and median in practice
32Illustration of Computation of Median
- Consider again the blood pressure data earlier.
- n = 30, an even number.
- Median will be the average of the 15th and 16th observations in arranged data.
- Arranged data: 92, 92, 94, 98, 100, 100, 100, 102, 104, 108, 108, 110, 110, 110, 110, 110, 110, 110, 110, 118, 118, 120, 120, 120, 122, 124, 126, 126, 126, 135
33Continued ...
- The sample median is the average of 110 and 110, which are the 15th and 16th observations in the arranged data.
- The median equals 110.
- Note that it is very close to the sample mean value of 111.1.
- This closeness is because of the near symmetry of the distribution.
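The median computation just described (sort, then average the two middle observations when n is even) can be sketched in Python as follows. `sample_median` is a hypothetical helper written out for illustration; the standard library's `statistics.median` follows the same rule and is used here only as a cross-check.

```python
from statistics import median

def sample_median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # For n = 30 these are the 15th and 16th ordered observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

bp = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
      108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
      110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

m = sample_median(bp)
print(m, m == median(bp))

# Hypothetical outlier: replacing 135 by 235 leaves the median unchanged.
bp_outlier = [235 if x == 135 else x for x in bp]
print(sample_median(bp_outlier))
```

The second print illustrates why the median is described as not sensitive to outliers: only the middle of the ordered data matters.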
34Relative Positions of Mean and Median
- For symmetric distributions, the mean and the median coincide.
- For right-skewed distributions, the mean tends to be larger than the median (mean pulled up by the large extreme values).
- For left-skewed distributions, the mean tends to be smaller than the median (mean pulled down by the small extreme values).
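These tendencies can be seen on a small scale with the sibling counts from the earlier bar graph example, which are mildly right-skewed. The left-skewed data set below is not from the lecture: it is constructed by mirroring the sibling counts (8 minus each value) purely for illustration.

```python
from statistics import mean, median

# Sibling counts from the earlier example (a few large values on the right).
siblings = [4, 1, 6, 2, 2, 3, 4, 1, 2, 2, 3, 7, 2, 1, 1, 5, 3, 4, 6, 3]

# Constructed mirror image (assumption for illustration): tail on the left.
mirrored = [8 - x for x in siblings]

for label, data in (("right-skewed", siblings), ("left-skewed", mirrored)):
    print(f"{label}: mean = {mean(data):.2f}, median = {median(data):.1f}")
```

With the tail on the right the mean exceeds the median, and with the tail on the left the order reverses.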