Title: Descriptive Analysis and Presentation of Single-Variable Data
1Descriptive Analysis and Presentation of
Single-Variable Data
- Descriptive Statistics for Short
- References Julio Rivera
2Circle graph
- Good for summarizing attribute (nominal) data
3Bar Graphs
- These are more traditional
- Provide basic, quick information
4Stem and Leaf Plots
- Useful for sorting data
- Useful for visualizing the spread of the data
- Stem is the line and the leaves are the values
extended from the stem
- Sometimes shows unexpected things
5Frequency Distribution
- Suppose you have the values 3, 2, 2, 3, 2, 4,
4, 1, 2, 2, 4, 3, 2, 0, 2, 2, 1, 3, 3, 1
- A list like this does not provide a clear picture
to the reader
- Stem and Leaf or other graphs are not always the
most effective way to display the data
6Ungrouped Frequency Distribution
- Used to represent the set of data in summary form
- This is an ungrouped frequency distribution because
each value of x stands alone
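A minimal sketch of how the ungrouped frequency distribution could be tallied for the 20 values on the previous slide. The variable names are illustrative, and Python's `collections.Counter` is just one convenient way to do the counting.

```python
from collections import Counter

# Values from the Frequency Distribution slide.
values = [3, 2, 2, 3, 2, 4, 4, 1, 2, 2, 4, 3, 2, 0, 2, 2, 1, 3, 3, 1]

# Count how often each distinct value of x occurs.
frequency = Counter(values)

# Print the distribution in order of x.
for x in sorted(frequency):
    print(f"x = {x}: f = {frequency[x]}")
```

Each value of x gets its own row, which is what makes the distribution "ungrouped".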
7Grouped Frequency Distribution
- Classes are formed
- Each class here has the same width (not always
the case)
- Each class should not overlap with other classes
8Assume a data set
- 27  68  79  91 107
  43  71  80  91 108
  43  71  81  93 108
  44  71  82  94 116
  47  73  82  94 120
  49  73  84  94 120
  50  74  84  96 122
  54  75  86  97 123
  58  76  88 103 127
  65  77  88 106 128
9Classify it
- No overlaps
- Upper class limit
- largest values fitting into each class
- Lower Class limit
- smallest piece of data that could go in each class
10Class Boundaries
- Numbers that do not occur in the sample data, but
are halfway between the upper limit of one class
and the lower limit of the next
- In this data set the boundaries are 21.5, 32.5,
43.5, 54.5, etc.
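A rough sketch of how the class limits, class boundaries, and frequencies could be tabulated for the 50 values in the data set. It assumes whole-number data, a first lower class limit of 22 (inferred from the first boundary of 21.5), and a class width of 11 (the distance between boundaries). The function name `class_table` is made up for this illustration.

```python
# The 50 values from the "Assume a data set" slide.
data = [
    27, 68, 79, 91, 107, 43, 71, 80, 91, 108,
    43, 71, 81, 93, 108, 44, 71, 82, 94, 116,
    47, 73, 82, 94, 120, 49, 73, 84, 94, 120,
    50, 74, 84, 96, 122, 54, 75, 86, 97, 123,
    58, 76, 88, 103, 127, 65, 77, 88, 106, 128,
]

LOWEST_LIMIT = 22  # assumption: inferred from the first boundary, 21.5
WIDTH = 11         # assumption: distance between successive boundaries


def class_table(data, lowest_limit, width):
    """Return (lower limit, upper limit, lower boundary, upper boundary, f) rows."""
    rows = []
    lower = lowest_limit
    while lower <= max(data):
        upper = lower + width - 1  # largest whole number fitting in the class
        freq = sum(lower <= x <= upper for x in data)
        # Boundaries sit halfway between adjacent class limits.
        rows.append((lower, upper, lower - 0.5, upper + 0.5, freq))
        lower += width
    return rows


table = class_table(data, LOWEST_LIMIT, WIDTH)
for lower, upper, low_b, up_b, freq in table:
    print(f"{lower:>3}-{upper:<3}  boundaries {low_b}-{up_b}  f = {freq}")
```

Because the limits are whole numbers and the boundaries end in .5, no data value can land on a boundary, so the classes cannot overlap.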
11Histogram
- A type of bar graph representing an entire set
of data
- A histogram will contain
- title
- vertical scale (frequency)
- horizontal scale representing the variable
Exam Scores
12Relative Frequency Histogram
- A proportional measure of the frequency of
occurrence (a percentage)
- SYSTAT puts it on the left of the graph
13Shapes of Histograms
- Described as Symmetrical, Normal or Triangular
- Both sides of the distribution are identical
- The normal distribution is your friend
14Shapes of Histograms
- Uniform or Rectangular
- Every value appears with equal frequency
- Notice there are no peaks in this
distribution--we will talk about Kurtosis later.
15Shapes of Histograms
- Skewed
- One tail is stretched out longer than the other
- The direction of skewness is on the side of the
longer tail
- This is skewed to the right (positive skew)
16Shapes of Histograms
- This is skewed to the left
- negative skew
- We will talk about the mathematical formula for
skewness later
17Shapes of Histograms
- J-shaped
- There is no tail on the side of the class with
the highest frequency
18Shapes of Histograms
- Bi-Modal or Multi-modal
- The most populous classes are separated by one or
more classes.
- This situation often implies that two populations
are being sampled
19Cumulative Frequency
Yuck--sorry about this definition!
- A running subtotal: the frequency of the current
class plus the frequencies of all the classes
below it
- The relative cumulative frequency is that subtotal
as a percentage of the total
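The running subtotal is easier to see in code than in words. This is a small sketch using the 20 values from the Frequency Distribution slide. The variable names are illustrative only.

```python
from collections import Counter
from itertools import accumulate

# Values from the Frequency Distribution slide.
values = [3, 2, 2, 3, 2, 4, 4, 1, 2, 2, 4, 3, 2, 0, 2, 2, 1, 3, 3, 1]

frequency = Counter(values)
xs = sorted(frequency)

# Running total of the frequencies, lowest value first.
cumulative = list(accumulate(frequency[x] for x in xs))

# Each running total as a percentage of n.
n = len(values)
relative_cumulative = [100 * c / n for c in cumulative]

for x, c, pct in zip(xs, cumulative, relative_cumulative):
    print(f"x <= {x}: cumulative f = {c} ({pct:.0f}%)")
```

The last cumulative frequency always equals n, and the last relative cumulative frequency is always 100%.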
20Classifying the Data
- Notice the classification of data
- Equal groups of eleven
- Only one of a number of ways of classifying data
- Remember that, but first we have to talk about
some characteristics of data
21Tangibility--the physical reality of the data
- Tangible
- Actually existing, capable of measurement
(snowfall, mountain height, number of beagles,
etc.)
- Abstract
- Computed or derived from other sources (percent
of unemployment, median school years completed)
22Temporality--the situation of the data in time
- Status
- Showing things as they are at a particular time
(land value in 1995, air traffic in December,
mean temperature 1950-80, etc.)
- Trend
- Showing how things change through time (urban
growth since 1950, change in the extent of the
rain forest since 1960)
23Measurement Levels--a quick review
- Nominal Data
- Ordinal Data
- Interval Data
- Ratio Data
24Classifying and Characterizing Data
- Classifications of data are useful because they
allow us to describe data.
- The better we can describe the data, the more
skillful we are at representing it.
- We are going to discuss some of these
classifications in detail
25Exogenous Classifications
- Category boundaries are established based on
threshold values set by criteria from another
source
- Source is usually a government agency or
professional organization
- Low Risk 30 to 49 PPM
- Medium Risk 50 to 99 PPM
- High Risk 100 to 1000 PPM
26Idiographic
- Category boundaries are established based on
details within the data values
- Quantile
- Contains an equal number of pieces of data in
each category
- 50 states, 5 categories, 10 states in each
category
- 87 counties, 4 categories, 22 in three
categories, 21 in one category
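The quantile arithmetic above can be sketched as follows. The helper name `quantile_class_sizes` is hypothetical, and putting the leftover items in the first categories is just one possible convention.

```python
def quantile_class_sizes(n_items, n_classes):
    """Split n_items into n_classes groups that are as equal as possible."""
    base, extra = divmod(n_items, n_classes)
    # The first `extra` categories each take one additional item.
    return [base + 1] * extra + [base] * (n_classes - extra)


states = quantile_class_sizes(50, 5)
counties = quantile_class_sizes(87, 4)
print(states)
print(counties)
```

When the number of items does not divide evenly, the categories cannot all be the same size, which is why the 87-county example ends up with one category of 21.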
27- Multi-modal
- Uses natural breaks in the data to determine
category boundaries
- Be careful--natural breaks--not blips
- Multi-step
- Breaks in the slope of a cumulative frequency
curve
28Serial
- Category boundaries are established that have a
mathematical relationship to one another
- Equal Interval
- Divides the data into equally ranged classes
based on the minimum and maximum data values
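A minimal sketch of equal-interval class breaks. The sample values and the function name `equal_interval_breaks` are invented for illustration.

```python
def equal_interval_breaks(data, n_classes):
    """Return n_classes + 1 break points spanning min(data) to max(data)."""
    low, high = min(data), max(data)
    width = (high - low) / n_classes  # every class covers the same range
    return [low + i * width for i in range(n_classes + 1)]


# Hypothetical data values, not taken from the slides.
sample = [0, 12, 37, 50, 100]
breaks = equal_interval_breaks(sample, 4)
print(breaks)
```

Only the minimum and maximum matter here. How the values are spread between them has no effect on the breaks, so some classes can end up nearly empty.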
29- Arithmetic Progression
- Category ranges differ based on a rate of
increase of a constant (Adding)
- 2, 4, 6, 8, 10, 12, 14, 16
- Geometric Progression
- Category ranges differ based on a rate of
increase that takes into account the prior class
range (Multiplying)
- 2, 4, 8, 16, 32, 64, 128, 256
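The two example sequences can be generated like this. The function names are made up for this sketch. The starting value, step, and ratio of 2 are taken from the example numbers on the slide.

```python
def arithmetic_progression(first, step, count):
    """Each term adds a constant step to the one before."""
    return [first + step * i for i in range(count)]


def geometric_progression(first, ratio, count):
    """Each term multiplies the one before by a constant ratio."""
    return [first * ratio ** i for i in range(count)]


adding = arithmetic_progression(2, 2, 8)
multiplying = geometric_progression(2, 2, 8)
print(adding)
print(multiplying)
```

The geometric version grows much faster, which tends to suit data with many small values and a few very large ones.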
30- Standard Deviation Units
- Centered on the mean of the data, with class
boundaries set at the mean plus or minus
multiples of the standard deviation
- Highlighted
- Based on a decision to highlight particular
sectors of the data set in greater or lesser
detail
31Writing it out in a legend
- Category designations should not overlap
- 0-10, 10-20, 20-30, 30-40, 40-50 (overlapping:
10, 20, 30 and 40 each fall in two categories)
- Designations should reflect the actual data that
is in the category
- 0-10, 11-20, 21-30, 31-40, 41-50
- 0-9, 10-19, 20-29, 30-39, 41-50
- 1-10, 11-20, 21-30, 31-40, 41-50
32Classifying and Characterizing Data
- Nowhere are these classifications more dramatic
than on choropleth maps (and other thematic maps
as well)
- The following seven maps were made with the same
piece of data
- The only difference was how the data was
classified
- All the classifications are standard, accepted
ways of classifying data
33There are more ways to describe data
- Shape and frequency are one way to describe data
sets
- Then we can classify them
- We can describe the data relative to the middle
of the data
34Measures of Central Tendency
- Different ways of describing the middle of the
data set
- Some averages are more average than others
- It's mean to drive on the median, you might get
mode down
- A really bad play on words--I'm truly sorry
- Julio
35Mean
- Arithmetic mean: sometimes called the
average.
- This means add all the values of x and then
divide by the number of values, which is
represented by n
36Median
- The middle value of the sample when the data are
ranked in order according to size
- The median is the ith value in the ranked data
set, where i = (n + 1) / 2
- With n = 50 the median is between the 25th and
26th ranked values (use both values)
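A sketch of finding the median position, assuming the 50-value data set from the "Assume a data set" slide is the one meant here. The position formula (n + 1) / 2 is the usual textbook rule. The variable names are illustrative.

```python
# The 50 values from the "Assume a data set" slide.
data = [
    27, 68, 79, 91, 107, 43, 71, 80, 91, 108,
    43, 71, 81, 93, 108, 44, 71, 82, 94, 116,
    47, 73, 82, 94, 120, 49, 73, 84, 94, 120,
    50, 74, 84, 96, 122, 54, 75, 86, 97, 123,
    58, 76, 88, 103, 127, 65, 77, 88, 106, 128,
]

ranked = sorted(data)
n = len(ranked)

# Position of the median in the ranked list, counting from 1.
position = (n + 1) / 2

if n % 2:
    # Odd n: the single middle value.
    median = ranked[n // 2]
else:
    # Even n: average the two values on either side of the position.
    median = (ranked[n // 2 - 1] + ranked[n // 2]) / 2

print(f"n = {n}, position = {position}, median = {median}")
```

A position ending in .5 is the signal that two values have to be averaged.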
37Other Averages
- Mode
- The value of x which occurs the most frequently
- Excellent for nominal data
- Midrange
- The midrange is the data value which is the
average of the low and high values
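A quick sketch of the mode and midrange for the 20 values used on the Frequency Distribution slide. `statistics.mode` is from the Python standard library. The variable names are illustrative.

```python
import statistics

# Values from the Frequency Distribution slide.
values = [3, 2, 2, 3, 2, 4, 4, 1, 2, 2, 4, 3, 2, 0, 2, 2, 1, 3, 3, 1]

# Mode: the value that occurs most often.
mode = statistics.mode(values)

# Midrange: halfway between the lowest and highest values.
midrange = (min(values) + max(values)) / 2

print(f"mode = {mode}, midrange = {midrange}")
```

The midrange uses only the two extreme values, so a single unusual observation can move it a long way.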
38Weighted Mean
- Takes each class midpoint and multiplies it by the
class frequency.
- These are summed and divided by n
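A sketch of the weighted mean for a grouped distribution. The midpoints and frequencies below assume the 50-value data set grouped into classes 22-32, 33-43, and so on (width 11). Those classes are inferred from the boundaries given earlier, not stated outright in the slides.

```python
# Assumed classes: 22-32, 33-43, ..., 121-131 (inferred, not given explicitly).
midpoints = [27, 38, 49, 60, 71, 82, 93, 104, 115, 126]
frequencies = [1, 2, 5, 2, 9, 9, 10, 5, 3, 4]

n = sum(frequencies)

# Multiply each midpoint by its class frequency, sum, and divide by n.
weighted_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

print(f"n = {n}, weighted mean = {weighted_mean:.2f}")
```

The result is an estimate, because every value in a class is treated as if it sat exactly at the class midpoint.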
39Which tendency is most central?
- All these measures of central tendency describe
the middle of the data.
- McGrew provides a good example with income data
- Mode 21,000
- Median 26,000
- Mean 71,428
40Using Central Tendency
- Mode, Median and Mean are the most common
- Means can be greatly influenced by outlying
values
- Medians are always the middle value
- Modes are the most common value
- These can yield different results using the same
data and have different bases in logic
41Measures of dispersion
- If you have a middle, how is the data arranged
around the middle?
- Examine it graphically and statistically
- Graphically? Did I hear John Tukey?
42Box and Whisker Plots
- Line in the center of the box is the median
- Box represents the Inter-quartile Range (25% to
75% levels)
- Whiskers represent 25%--sometimes outliers are
marked
43Deviation from the Mean
- How much do the values vary from the mean?
- If the mean is 6 and your value is 3 what is the
variation?
- If the mean is 6 and your value is 7 what is the
variation?
- What is the absolute value of those two
variations?
- |x| reads "the absolute value of x" (never
negative)
- |x-y| reads "the absolute value of x-y" (never
negative)
44Mean absolute deviation
- The mean of the absolute deviations
- Tells us the average distance that a piece of
data is from the mean
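A small sketch of the mean absolute deviation. The four values are invented so that the mean comes out to 6, matching the example mean on the previous slide.

```python
# Hypothetical values chosen so that the mean is 6.
values = [3, 7, 6, 8]

mean = sum(values) / len(values)

# Distance of each value from the mean, ignoring direction.
absolute_deviations = [abs(x - mean) for x in values]

# Average of those distances.
mean_absolute_deviation = sum(absolute_deviations) / len(values)

print(f"mean = {mean}")
print(f"absolute deviations = {absolute_deviations}")
print(f"mean absolute deviation = {mean_absolute_deviation}")
```

Without the absolute values the deviations would cancel out and always sum to zero.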
45Standard Deviation and Variance
- These are your friends
- Think of them as Bert and Ernie
- These two statistics are two of the most commonly
referred to in statistical work
46Standard Deviation
- First formula is for the sample, and the second
is for the population
- Standard Deviation is a measure of fluctuation of
the data (a yardstick if you will)
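The formulas from the slide are not reproduced in this text, so this sketch uses the standard textbook definitions. The sample version divides the summed squared deviations by n - 1 and the population version divides by n. The data values are invented.

```python
import math
import statistics

# Hypothetical values, not taken from the slides.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean.
sum_of_squares = sum((x - mean) ** 2 for x in data)

sample_variance = sum_of_squares / (n - 1)  # sample: divide by n - 1
population_variance = sum_of_squares / n    # population: divide by n

sample_sd = math.sqrt(sample_variance)
population_sd = math.sqrt(population_variance)

print(f"sample SD = {sample_sd:.3f}, population SD = {population_sd:.3f}")

# Cross-check against the standard library.
print(statistics.stdev(data), statistics.pstdev(data))
```

The sample standard deviation is always a little larger than the population one for the same numbers, and the gap shrinks as n grows.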
47Variance
- It is the standard deviation squared
- Like standard deviation it is a measurement of
fluctuation of the data.
- Variances are used in other statistical
procedures (later in the semester)
48Coefficient of Variation
- Comparison of data between two data sets
- Takes into account differences in the scale
(mean) of the data sets
- The ratio of the standard deviation to the
mean
49Coefficient of Variation
- Annual precipitation data for 40 years (n = 40)
- Although St. Louis has the larger SD, the CV for
San Diego is larger.
- Why? And what does this mean?
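One way to see how a data set with the larger SD can have the smaller CV. The numbers below are invented for illustration and are not the actual precipitation records. `wet_city` and `dry_city` are hypothetical names.

```python
import statistics


def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the mean."""
    return 100 * statistics.stdev(data) / statistics.mean(data)


# Invented annual totals: one place with a high mean, one with a low mean.
wet_city = [30, 40, 35, 45, 50]
dry_city = [5, 15, 8, 12, 10]

for name, totals in (("wet city", wet_city), ("dry city", dry_city)):
    sd = statistics.stdev(totals)
    cv = coefficient_of_variation(totals)
    print(f"{name}: mean = {statistics.mean(totals)}, SD = {sd:.2f}, CV = {cv:.1f}%")
```

The wetter place swings by more in absolute terms, but the drier place swings by more relative to its own mean, so its CV is larger.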
50Measures of Shape and Relative Position
- Other measures of the frequency Distribution
- Skewness and Kurtosis
51Skewness
- Measures the degree of symmetry in the
distribution
- The extent to which the bulk of values in a
distribution are concentrated on one or the other
side of the mean
- Remember shapes of histograms that were skewed to
the right and skewed to the left (positive and
negative)
52Skewness
- Skewness is important because many geographic
data sets are skewed
- Helps understand the mean
- Little skew, the mean is probably representative
- Lots--maybe not
- s = standard deviation
53Kurtosis
- Measures the flatness or the peakedness of the
data
- Is the distribution full of peaks and valleys or
is it flatter?
- Best compared to a normal curve (kurtosis = 3)
54Kurtosis
- If value is greater than 3 it is leptokurtic
(peaky)
- If value is less than 3 it is platykurtic (flat)
- Note that some programs will use formula 2 so
then zero is the dividing line for kurtosis
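The slide formulas are not reproduced in this text, so the sketch below uses one common moment-based definition and may differ from the course formulas. Skewness is the third central moment over the standard deviation cubed. Kurtosis is the fourth central moment over the variance squared, which gives 3 for a normal curve. Subtracting 3 gives the version where zero is the dividing line. All names and sample values are illustrative.

```python
def central_moment(data, k):
    """Average of (x - mean)^k over the data (population form)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)


def skewness(data):
    """Third central moment scaled by the standard deviation cubed."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def kurtosis(data):
    """Fourth central moment scaled by the variance squared (normal = 3)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2


# Hypothetical samples.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(f"symmetric: skew = {skewness(symmetric):.3f}, kurtosis = {kurtosis(symmetric):.3f}")
print(f"right skewed: skew = {skewness(right_skewed):.3f}")
print(f"excess kurtosis of symmetric = {kurtosis(symmetric) - 3:.3f}")
```

The evenly spread sample has zero skew and a kurtosis below 3 (platykurtic). The sample with one large value has positive skew.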
55Quartiles
- Data is divided into four equal parts (quarters)
- Inter-quartile range is between the 25% and 75% marks
- Do not confuse quartiles with quantiles
- quartile is a type of quantile
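A short sketch using `statistics.quantiles` from the Python standard library. Software packages differ in how they interpolate quartiles, so results from other programs may not match exactly. The data values are invented.

```python
import statistics

# Hypothetical values 1 through 11.
data = list(range(1, 12))

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4)

# Inter-quartile range: the spread of the middle half of the data.
iqr = q3 - q1

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```

Changing `n` to 5 or 10 gives quintiles or deciles, which is the sense in which a quartile is one type of quantile.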
56Z-score (Standard Score)
- The position of a value (x) away from the mean
measured in standard deviations
- More on this later
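A minimal sketch of the z-score calculation, using the usual definition z = (x - mean) / standard deviation. The mean of 6 and standard deviation of 2 are made-up numbers for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


# Hypothetical mean and standard deviation.
mean, sd = 6, 2

below = z_score(3, mean, sd)
above = z_score(7, mean, sd)
print(f"x = 3 -> z = {below}")
print(f"x = 7 -> z = {above}")
```

A negative z means the value is below the mean, and a positive z means it is above.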
57Chebyshev's Theorem
- A useful tidbit of knowledge
- The proportion of any distribution that lies
within k standard deviations of the mean is at
least 1 - (1/k^2), where k is any positive number
larger than one
- This applies to any distribution of data
58Application of Theorem
- This means that for any distribution of data
- At least 75% of the data is within 2 standard
deviations of the mean
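The arithmetic behind that figure, as a small sketch of the bound 1 - 1/k^2. The function name `chebyshev_bound` is invented for this example.

```python
def chebyshev_bound(k):
    """Minimum proportion of any data set within k SDs of the mean."""
    if k <= 1:
        raise ValueError("k must be larger than one")
    return 1 - 1 / k ** 2


for k in (2, 3, 4):
    print(f"k = {k}: at least {100 * chebyshev_bound(k):.1f}% of the data")
```

These are guaranteed minimums for any shape of distribution. A particular data set will often have more than this within k standard deviations.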
59Empirical Rule
- For Normal Distributions
- 68% of the data is within 1 Standard Deviation of
the Mean
- 95% is within 2 SDs
- 99.7% is within 3 SDs
- Really important to know (film at 11)
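The empirical rule percentages are rounded values for a normal curve. This sketch checks them with the error function from Python's `math` module, using the standard result that the proportion within k standard deviations is erf(k / sqrt(2)). The function name is illustrative.

```python
import math


def normal_proportion_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {100 * normal_proportion_within(k):.1f}%")
```

Compare these with Chebyshev's guaranteed minimum of 75% within 2 SDs. Knowing the distribution is normal lets you say a good deal more.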