Graphical Summary of Data Distribution - PowerPoint PPT Presentation

Provided by: UNC5243
1
Graphical Summary of Data Distribution
  • Statistical View Point
  • Histograms
  • Skewness
  • Kurtosis
  • Other Descriptive Summary Measures

Source: www.unc.edu/courses/2006spring/geog/090/001/www/Lectures/2006- Geog090-Week03-Lecture02-SkewsnessKurtosis.ppt
2
Measures of Dispersion: Coefficient of Variation
  • Coefficient of variation (CV) measures the spread
    of a set of data as a proportion of its mean.
  • It is the ratio of the sample standard deviation
    to the sample mean
  • It is sometimes expressed as a percentage
  • There is an equivalent definition for the
    coefficient of variation of a population
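
The definition above (sample standard deviation divided by sample mean) can be sketched in a few lines of Python. The function name and the data are invented for illustration; they do not come from the slides.

```python
from statistics import mean, stdev

def coefficient_of_variation(sample):
    """Return the sample standard deviation divided by the sample mean."""
    m = mean(sample)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(sample) / m

# Hypothetical measurements, invented for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
cv = coefficient_of_variation(heights_cm)
print(f"CV = {cv:.3f} ({cv * 100:.1f}%)")

# Changing the units (cm to mm) rescales mean and standard deviation alike,
# so the ratio is unchanged.
heights_mm = [x * 10 for x in heights_cm]
cv_mm = coefficient_of_variation(heights_mm)
```

Because the units cancel in the ratio, the same value results whether the data are recorded in centimetres or millimetres.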

3
Coefficient of Variation (CV)
  • It is a dimensionless number that can be used to
    compare the relative variability of populations
    with different means

4
Histogram: Frequency Distribution
  • A histogram is one way to depict a frequency
    distribution
  • Frequency is the number of times a variable takes
    on a particular value
  • Note that any variable has a frequency
    distribution
  • e.g., roll a pair of dice several times and
    record the resulting values (constrained to being
    between 2 and 12), count the number of times any
    given value occurs (the frequency of that value),
    and take these counts together to form a
    frequency distribution
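
The dice example can be simulated as a rough sketch. The function name, the number of rolls, and the seed are arbitrary choices made for illustration.

```python
import random
from collections import Counter

def roll_pair_frequencies(n_rolls, seed=0):
    """Roll two dice n_rolls times and count how often each total occurs."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    totals = (rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls))
    return Counter(totals)

freq = roll_pair_frequencies(1000)
for value in range(2, 13):
    # A Counter returns 0 for totals that never occurred.
    print(value, freq[value])
```

The printed pairs of value and count are the frequency distribution of the totals.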

5
Frequency Distribution
  • Frequencies can be absolute (when the frequency
    provided is the actual count of the occurrences)
    or relative (when they are normalized by dividing
    the absolute frequency by the total number of
    observations, giving values in the range [0, 1])
  • Relative frequencies are particularly useful if
    you want to compare distributions drawn from two
    different sources (i.e., when the numbers of
    observations from each source differ)
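
A minimal sketch of the normalization described above, assuming the absolute frequencies are held in a dictionary. The two sources and their counts are hypothetical.

```python
def relative_frequencies(absolute):
    """Divide each absolute frequency by the total number of observations."""
    total = sum(absolute.values())
    return {value: count / total for value, count in absolute.items()}

# Hypothetical counts from two sources with different numbers of observations.
source_a = {"low": 10, "medium": 30, "high": 10}    # 50 observations
source_b = {"low": 40, "medium": 120, "high": 40}   # 200 observations

rel_a = relative_frequencies(source_a)
rel_b = relative_frequencies(source_b)
print(rel_a)
print(rel_b)
```

The absolute counts differ by a factor of four, but the relative frequencies show that the two distributions have the same shape.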

6
Histograms
  • We may summarize our data by constructing
    histograms, which are vertical bar graphs
  • A histogram is used to graphically summarize the
    distribution of a data set
  • A histogram divides the range of values in a data
    set into intervals
  • Over each interval is placed a bar whose height
    represents the frequency of data values in the
    interval.

7
Building a Histogram
  • To construct a histogram, the data are first
    grouped into categories
  • The histogram contains one vertical bar for each
    category
  • The height of the bar represents the number of
    observations in the category (i.e., frequency)
  • It is common to note the midpoint of the category
    on the horizontal axis

8
Building a Histogram: Example
  • 1. Develop an ungrouped frequency table
  • That is, we build a table that counts the number
    of occurrences of each variable value from lowest
    to highest

    TMI Value    Ungrouped Freq.
    4.16         2
    4.17         4
    4.18         0
    13.71        1

  • We could attempt to construct a bar chart from
    this table, but it would have too many bars to
    really be useful

9
Building a Histogram: Example
  • 2. Construct a grouped frequency table
  • Select an appropriate number of classes

10
Building a Histogram: Example
  • 3. Plot the frequencies of each class
  • All that remains is to create the bar graph

(Figure: bar graph of the class frequencies; the plotted variable is a proxy for soil moisture)
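
The grouping and plotting steps can be sketched as follows, assuming equal-width classes. The data values and the choice of five classes are invented for illustration; they are not the TMI data from the slides.

```python
def grouped_frequencies(values, n_classes):
    """Split the data range into equal-width classes and count values per class."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; no range to divide")
    width = (hi - lo) / n_classes
    counts = [0] * n_classes
    for v in values:
        # The maximum value is placed in the last class rather than a new one.
        idx = min(int((v - lo) / width), n_classes - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]

# Hypothetical values, invented for illustration.
data = [4.2, 4.9, 5.1, 5.5, 6.0, 6.3, 6.4, 7.8, 8.1, 9.7, 11.2, 13.7]
table = grouped_frequencies(data, 5)
for left, right, count in table:
    midpoint = (left + right) / 2  # class midpoint, as labeled on the axis
    print(f"{midpoint:6.2f} | {'#' * count} ({count})")
```

Each printed row stands in for one bar: the class midpoint as the axis label and a run of `#` characters whose length is the class frequency.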
11
Further Moments of the Distribution
  • While measures of dispersion are useful for
    helping us describe the width of the
    distribution, they tell us nothing about the
    shape of the distribution

12
Further Moments of the Distribution
  • There are further statistics that describe the
    shape of the distribution, using formulae that
    are similar to those of the mean and variance
  • 1st moment - Mean (describes central value)
  • 2nd moment - Variance (describes dispersion)
  • 3rd moment - Skewness (describes asymmetry)
  • 4th moment - Kurtosis (describes peakedness)

13
Further Moments: Skewness
  • Skewness measures the degree of asymmetry
    exhibited by the data
  • s = sample standard deviation (used in the
    skewness formula)
  • If skewness equals zero, the histogram is
    symmetric about the mean
  • Positive skewness vs negative skewness

14
Further Moments: Skewness
Source: http://library.thinkquest.org/10030/3smodsas.htm
15
Further Moments: Skewness
  • Positive skewness
  • There are more observations below the mean than
    above it
  • When the mean is greater than the median
  • Negative skewness
  • There are a small number of low observations and
    a large number of high ones
  • When the median is greater than the mean
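
The skewness formula from the slides did not survive in this transcript. The sketch below assumes one common definition, the sum of cubed deviations from the mean divided by n times the cube of the sample standard deviation s; other texts use slightly different scaling. The data sets are invented for illustration.

```python
from statistics import mean, median, stdev

def skewness(sample):
    """Sum of cubed deviations from the mean, divided by n * s**3 (assumed definition)."""
    m, s, n = mean(sample), stdev(sample), len(sample)
    return sum((x - m) ** 3 for x in sample) / (n * s ** 3)

symmetric = [1, 2, 3, 4, 5]                 # mirror image about the mean
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]    # one large value stretches the right tail

skew_sym = skewness(symmetric)
skew_right = skewness(right_tailed)
print(skew_sym)
print(skew_right, mean(right_tailed), median(right_tailed))
```

For the right-tailed sample the skewness is positive and the mean (4.75) exceeds the median (3), matching the description of positive skewness above.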

16
Further Moments: Kurtosis
  • Kurtosis measures how peaked the histogram is
  • The kurtosis of a normal distribution is 0
  • Kurtosis characterizes the relative peakedness or
    flatness of a distribution compared to the normal
    distribution

17
Further Moments: Kurtosis
  • Platykurtic: When the kurtosis < 0, the
    frequencies throughout the curve are closer to
    being equal (i.e., the curve is flatter and wider)
  • Thus, negative kurtosis indicates a relatively
    flat distribution
  • Leptokurtic: When the kurtosis > 0, there are
    high frequencies in only a small part of the
    curve (i.e., the curve is more peaked)
  • Thus, positive kurtosis indicates a relatively
    peaked distribution
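
The kurtosis formula is also missing from this transcript. Since the slides state that the kurtosis of a normal distribution is 0, the sketch below assumes the "excess" form: the sum of fourth-power deviations divided by n times s to the fourth power, minus 3. The two samples are invented for illustration.

```python
from statistics import mean, stdev

def excess_kurtosis(sample):
    """Fourth-power deviations over n * s**4, minus 3 (assumed definition)."""
    m, s, n = mean(sample), stdev(sample), len(sample)
    return sum((x - m) ** 4 for x in sample) / (n * s ** 4) - 3

flat = list(range(1, 11))        # every value occurs once: flat and wide
peaked = [0] + [5] * 8 + [10]    # most values at the centre, two far out

k_flat = excess_kurtosis(flat)
k_peaked = excess_kurtosis(peaked)
print(k_flat, k_peaked)
```

The evenly spread sample gives a negative value (platykurtic) and the centre-heavy sample a positive one (leptokurtic).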

18
Further Moments: Kurtosis
(Figure: example platykurtic and leptokurtic curves)
Source: http://www.riskglossary.com/link/kurtosis.htm
  • Kurtosis is based on the size of a distribution's
    tails.
  • Negative kurtosis (platykurtic): distributions
    with short tails
  • Positive kurtosis (leptokurtic): distributions
    with relatively long tails

19
Why Do We Need Kurtosis?
  • These two distributions have the same variance,
    approximately the same skew, but differ markedly
    in kurtosis.

Source: http://davidmlane.com/hyperstat/A53638.html
20
How to Graphically Summarize Data?
  • Histograms
  • Box plots

21
Functions of a Histogram
  • The function of a histogram is to graphically
    summarize the distribution of a data set
  • The histogram graphically shows the following
  • 1. Center (i.e., the location) of the data
  • 2. Spread (i.e., the scale) of the data
  • 3. Skewness of the data
  • 4. Kurtosis of the data
  • 5. Presence of outliers
  • 6. Presence of multiple modes in the data.

22
Functions of a Histogram
  • The histogram can be used to answer the following
    questions
  • 1. What kind of population distribution do the
    data come from?
  • 2. Where are the data located?
  • 3. How spread out are the data?
  • 4. Are the data symmetric or skewed?
  • 5. Are there outliers in the data?

23
Source: http://www.robertluttman.com/vms/Week5/page9.htm (first three)
http://office.geog.uvic.ca/geog226/frLab1.html (last)
24
Box Plots
  • We can also use a box plot to graphically
    summarize a data set
  • A box plot represents a graphical summary of what
    is sometimes called a five-number summary of
    the distribution
  • Minimum
  • Maximum
  • 25th percentile
  • 75th percentile
  • Median
  • Interquartile Range (IQR)

25
Box Plots
  • Example: Consider the first 9 Commodore prices
    (in thousands)
  • 6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99,
    20.0
  • Arrange these in order of magnitude
  • 3.8, 5.8, 5.99, 6.0, 6.7, 7.0, 9.975, 10.5,
    20.0
  • The median is Q2 = 6.7 (there are 4 values on
    either side)
  • Q1 = 5.9 (median of the 4 smallest values)
  • Q3 = 10.2 (median of the 4 largest values)
  • IQR = Q3 - Q1 = 10.2 - 5.9 = 4.3
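
The rule used in this example (take the median, then the median of the values on each side of it) can be sketched as below. The function name is made up for illustration, and this is only one of several quartile conventions in use.

```python
from statistics import median

def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves of the ordered data."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]  # excludes the median itself when n is odd
    return median(lower), median(ordered), median(upper)

prices = [6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99, 20.0]
q1, q2, q3 = quartiles_by_halves(prices)
iqr = q3 - q1
print(f"Q1 = {q1:.1f}, Q2 = {q2:.1f}, Q3 = {q3:.1f}, IQR = {iqr:.1f}")
```

Unrounded, Q1 is 5.895 and Q3 is 10.2375. The slide rounds both to one decimal place before subtracting, which gives the same IQR of 4.3 to one decimal.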

26
  • Example (ranked)
  • 3.8, 5.8, 5.99, 6.0, 6.7, 7.0, 9.975, 10.5,
    20.0
  • The median is Q2 = 6.7
  • Q1 = 5.9, Q3 = 10.2, IQR = Q3 - Q1 = 10.2 - 5.9
    = 4.3

27
Box Plots
Example: Table 1.1 Commuting data (Rogerson, p. 5)
Ranked commuting times: 5, 5, 6, 9, 10, 11, 11,
12, 12, 14, 16, 17, 19, 21, 21, 21, 21, 21, 22,
23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 47
The 25th percentile is represented by observation
(30 + 1)/4 = 7.75
The 75th percentile is represented by observation
3(30 + 1)/4 = 23.25
25th percentile = 11.75
75th percentile = 26
Interquartile range = 26 - 11.75 = 14.25
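
The position rule in this example places the p-th percentile at observation p(n + 1) in the ranked list and interpolates between neighbouring observations when that position is fractional. A minimal sketch, with a function name invented for illustration:

```python
def percentile_by_position(ordered, p):
    """Value at position p * (n + 1), counting from 1, with linear interpolation."""
    n = len(ordered)
    position = p * (n + 1)
    k = int(position)           # rank of the observation just below the position
    fraction = position - k     # how far to move toward the next observation
    if k < 1:
        return ordered[0]
    if k >= n:
        return ordered[-1]
    return ordered[k - 1] + fraction * (ordered[k] - ordered[k - 1])

commute = [5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21,
           21, 21, 21, 22, 23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 47]
p25 = percentile_by_position(commute, 0.25)
p75 = percentile_by_position(commute, 0.75)
print(p25, p75, p75 - p25)
```

Position 7.75 lies three quarters of the way from the 7th observation (11) to the 8th (12), giving 11.75. Position 23.25 lies between the 23rd and 24th observations, both 26.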
28
Example (ranked commuting times): 5, 5, 6, 9,
10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21, 21,
21, 21, 22, 23, 24, 24, 26, 26, 31, 31, 36, 42,
44, 47
25th percentile = 11.75
75th percentile = 26
Interquartile range = 26 - 11.75 = 14.25
29
Other Descriptive Summary Measures
  • Descriptive statistics provide an organization
    and summary of a dataset
  • A small number of summary measures replaces the
    entirety of a dataset
  • We'll briefly talk about other simple descriptive
    summary measures

30
Other Descriptive Summary Measures
  • You're likely already familiar with some simple
    descriptive summary measures
  • Ratios
  • Proportions
  • Percentages
  • Rates of Change
  • Location Quotients

31
Other Descriptive Summary Measures
  • Ratios
  • Ratio = (# of observations in A) /
    (# of observations in B)
  • e.g., A = 6 overcast days, B = 24 mostly cloudy
    days, giving a ratio of 6/24 = 0.25
  • Proportions: Relate one part or category of
    data to the entire set of observations, e.g., a
    box of marbles that contains 4 yellow, 6 red, 5
    blue, and 2 green gives a yellow proportion of
    4/17, or about 0.235

    Color    yellow   red   blue   green
    Count    4        6     5      2


32
Other Descriptive Summary Measures
  • Proportions - The sum of all proportions = 1.
    These are useful for comparing two sets of data
    with different sizes and category counts, e.g., a
    different box of marbles gives a yellow
    proportion of 2/23, and for this to be a
    reasonable comparison we need to know the totals
    for both samples
  • Percentages - Calculated as proportion x 100,
    e.g., 2/23 x 100 = 8.696%; their use should be
    restricted to larger sample sizes, perhaps 20 or
    more observations
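
The ratio, proportion, and percentage calculations from the marble and weather examples can be checked with a short sketch. The variable names are chosen for illustration.

```python
# Ratio: observations in A relative to observations in B.
overcast_days, mostly_cloudy_days = 6, 24
ratio = overcast_days / mostly_cloudy_days

# Proportion: each category relative to the whole set of observations.
marbles = {"yellow": 4, "red": 6, "blue": 5, "green": 2}
total = sum(marbles.values())
proportions = {color: count / total for color, count in marbles.items()}

# Percentage: proportion x 100 (second box: 2 yellow marbles out of 23).
yellow_pct_second_box = 2 / 23 * 100

print(ratio)
print({color: round(p, 3) for color, p in proportions.items()})
print(round(yellow_pct_second_box, 3))
```

The proportions sum to 1, so the yellow shares of the two boxes (4/17 and 2/23) can be compared even though the boxes hold different numbers of marbles.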

33
Other Descriptive Summary Measures
  • Location Quotients - An index of relative
    concentration in space, a comparison of a
    region's share of something to the total
  • Example: Suppose we have a region of 1000 km2
    which we subdivide into three smaller areas of
    200, 300, and 500 km2 (labeled A, B, C)
  • The region has an influenza outbreak with 150
    cases in A, 100 in B, and 350 in C (a total of
    600 flu cases)

    Area   Proportion of Area   Proportion of Cases   Location Quotient
    A      200/1000 = 0.2       150/600 = 0.25        0.25/0.2 = 1.25
    B      300/1000 = 0.3       100/600 = 0.17        0.17/0.3 = 0.57
    C      500/1000 = 0.5       350/600 = 0.58        0.58/0.5 = 1.17
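
The location quotient calculation can be sketched as the share of cases in an area divided by that area's share of the region. The function name is invented for illustration.

```python
def location_quotients(cases, areas):
    """Share of cases in each area divided by that area's share of the region."""
    total_cases = sum(cases.values())
    total_area = sum(areas.values())
    return {
        name: (cases[name] / total_cases) / (areas[name] / total_area)
        for name in areas
    }

areas_km2 = {"A": 200, "B": 300, "C": 500}
flu_cases = {"A": 150, "B": 100, "C": 350}

lq = location_quotients(flu_cases, areas_km2)
for name, value in lq.items():
    print(name, round(value, 2))
```

A quotient above 1 (areas A and C) means the area holds more than its proportional share of cases; below 1 (area B) means less. Computed without intermediate rounding, B comes out at about 0.56 rather than 0.57, because the table rounds 100/600 to 0.17 before dividing.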