Descriptive Analysis and Presentation of Single-Variable Data

Provided by: juliocr


1
Descriptive Analysis and Presentation of
Single-Variable Data
  • Descriptive Statistics for Short
  • References Julio Rivera

2
Circle graph
  • Good for summarizing attribute (nominal) data

3
Bar Graphs
  • These are more traditional
  • Provide basic, quick information

4
Stem and Leaf Plots
  • Useful for sorting data
  • Useful for visualizing the spread of the data
  • Stem is the line and the leaves are the values
    extended from the stem
  • Sometimes shows unexpected things

5
Frequency Distribution
  • Suppose you have the values 3, 2, 2, 3, 2, 4,
    4, 1, 2, 2, 4, 3, 2, 0, 2, 2, 1, 3, 3, 1
  • A list like this does not provide a clear picture
    to the reader
  • Stem and Leaf or other graphs are not always the
    most effective way to display the data

6
Ungrouped Frequency Distribution
  • Used to represent the set of data in summary form
  • This is an ungrouped frequency distribution
    because each value of x stands alone
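
The ungrouped frequency distribution for the 20 values on the previous slide can be tallied in a few lines. This is only a sketch; the names `values` and `freq` are illustrative and not from the slides.

```python
from collections import Counter

# The 20 values listed on the "Frequency Distribution" slide
values = [3, 2, 2, 3, 2, 4, 4, 1, 2, 2, 4, 3, 2, 0, 2, 2, 1, 3, 3, 1]

# Count how often each distinct value of x occurs
freq = Counter(values)

print(" x   f")
for x in sorted(freq):
    print(f"{x:2d}  {freq[x]:2d}")
```

Each row of the printout is one value of x with its frequency f, which is exactly the "each value stands alone" idea.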

7
Grouped Frequency Distribution
  • Classes are formed
  • Each class here has the same width (not always
    the case)
  • Each class should not overlap with other classes

8
Assume a data set
  • 50 values (sorted down each column):
     27   68   79   91  107
     43   71   80   91  108
     43   71   81   93  108
     44   71   82   94  116
     47   73   82   94  120
     49   73   84   94  120
     50   74   84   96  122
     54   75   86   97  123
     58   76   88  103  127
     65   77   88  106  128

9
Classify it
  • No overlaps
  • Upper class limit: the largest value that fits
    into each class
  • Lower class limit: the smallest piece of data
    that could go in each class

10
Class Boundaries
  • numbers that do not occur in the sample data, but
    are halfway between the upper limit of one class
    and the lower limit of the next
  • In this data set the boundaries are 21.5, 32.5,
    43.5, 54.5, etc.
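
A rough sketch of how the class limits, class boundaries, and frequencies fit together for this data set. The first lower limit (22) and the class width (11) are assumptions inferred from the boundaries 21.5 and 32.5 quoted above; the slide's own table is not in the transcript.

```python
# The 50 values from the "Assume a data set" slide
data = [
    27, 68, 79, 91, 107, 43, 71, 80, 91, 108,
    43, 71, 81, 93, 108, 44, 71, 82, 94, 116,
    47, 73, 82, 94, 120, 49, 73, 84, 94, 120,
    50, 74, 84, 96, 122, 54, 75, 86, 97, 123,
    58, 76, 88, 103, 127, 65, 77, 88, 106, 128,
]

FIRST_LOWER_LIMIT = 22  # assumption: inferred from the boundary 21.5
CLASS_WIDTH = 11        # assumption: inferred from 32.5 - 21.5

classes = []
lower = FIRST_LOWER_LIMIT
while lower <= max(data):
    upper = lower + CLASS_WIDTH - 1  # upper class limit
    count = sum(lower <= x <= upper for x in data)
    # Boundaries sit halfway between adjacent class limits
    classes.append((lower, upper, lower - 0.5, upper + 0.5, count))
    lower += CLASS_WIDTH

for lo, hi, lo_b, hi_b, f in classes:
    print(f"{lo:3d}-{hi:3d}  boundaries {lo_b:5.1f} to {hi_b:5.1f}  f = {f}")
```

Because the limits are whole numbers and the boundaries end in .5, no data value can land on a boundary, so every value belongs to exactly one class.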

11
Histogram
  • A type of bar graph representing an entire set
    of data
  • A histogram will contain
  • title
  • vertical scale (frequency)
  • horizontal scale representing the variable

[Figure: histogram titled "Exam Scores"]
12
Relative Frequency Histogram
  • A proportional measure of the frequency of
    occurrence (a percentage)
  • SYSTAT puts it on the left of the graph

13
Shapes of Histograms
  • Described as Symmetrical, Normal or Triangular
  • Both sides of the distribution are identical
  • The normal distribution is your friend

14
Shapes of Histograms
  • Uniform or Rectangular
  • Every value appears with equal frequency
  • Notice there are no peaks in this
    distribution--we will talk about Kurtosis later.

15
Shapes of Histograms
  • Skewed
  • One tail is stretched out longer than the other
  • The direction of skewness is on the side of the
    longer tail
  • This is skewed to the right (positive skew)

16
Shapes of Histograms
  • This is skewed to the left
  • negative skew
  • We will talk about the mathematical formula for
    skewness later

17
Shapes of Histograms
  • J-shaped
  • There is no tail on the side of the class with
    the highest frequency

18
Shapes of Histograms
  • Bi-Modal or Multi-modal
  • The most populous classes are separated by one or
    more classes.
  • This situation often implies that two populations
    are being sampled

19
Cumulative Frequency
Yuck--sorry about this definition!
  • A running subtotal of the class frequencies: each
    class's cumulative frequency is its own frequency
    plus the frequencies of all the classes below it
  • The relative cumulative frequency is the
    cumulative frequency as a percentage of the total
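
The definition is easier to see worked out. The sketch below reuses the 20 values from the frequency distribution slide and builds the frequency, relative frequency, cumulative frequency, and relative cumulative frequency columns; the variable names are illustrative only.

```python
from collections import Counter
from itertools import accumulate

values = [3, 2, 2, 3, 2, 4, 4, 1, 2, 2, 4, 3, 2, 0, 2, 2, 1, 3, 3, 1]

freq = Counter(values)
xs = sorted(freq)                  # distinct values, lowest first
f = [freq[x] for x in xs]          # frequency of each value
cum_f = list(accumulate(f))        # running subtotal of the frequencies
n = cum_f[-1]                      # the last subtotal is the total count

rel_f = [100 * fi / n for fi in f]        # relative frequency (%)
rel_cum_f = [100 * c / n for c in cum_f]  # relative cumulative frequency (%)

for row in zip(xs, f, rel_f, cum_f, rel_cum_f):
    print(row)
```

The last cumulative frequency always equals n, and the last relative cumulative frequency is always 100.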

20
Classifying the Data
  • Notice the classification of data
  • Equal groups of eleven
  • Only one of a number of ways of classifying data
  • Remember that--but first we have to talk about
    some characteristics of data

21
Tangibility--the physical reality of the data
  • Tangible
  • Actually existing, capable of measurement
    (snowfall, mountain height, number of beagles,
    etc.)
  • Abstract
  • Computed or derived from other sources (percent
    of unemployment, median school years completed)

22
Temporality--the situation of the data in time
  • Status
  • Showing things as they are at a particular time.
    (Land value in 1995, air traffic in December,
    Mean temperature 1950-80, etc.)
  • Trend
  • Showing how things change through time (urban
    growth since 1950, change in the extent of the
    rain forest since 1960)

23
Measurement Levels--a quick review
  • Nominal Data
  • Ordinal Data
  • Interval Data
  • Ratio Data

24
Classifying and Characterizing Data
  • Classifications of data are useful because they
    allow us to describe data.
  • The better we can describe the data, the more
    skillful we are at representing it.
  • We are going to discuss some of these
    classifications in detail

25
Exogenous Classifications
  • Category boundaries are established based on
    threshold values set by criteria from another
    source
  • Source is usually a government agency or
    professional organization
  • Low Risk 30 to 49 PPM
  • Medium Risk 50 to 99 PPM
  • High Risk 100 to 1000 PPM
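
As a sketch, an exogenous classification is just a lookup against thresholds that someone else has set. The function name and the "Unclassified" label for readings outside the three ranges are assumptions; the slide does not say how such readings are handled.

```python
# Thresholds as given on the slide (PPM), set by an outside source
THRESHOLDS = [
    (30, 49, "Low Risk"),
    (50, 99, "Medium Risk"),
    (100, 1000, "High Risk"),
]

def risk_category(ppm):
    """Return the category whose range contains the reading."""
    for low, high, label in THRESHOLDS:
        if low <= ppm <= high:
            return label
    # Assumption: readings outside every range get a placeholder label
    return "Unclassified"

for reading in (35, 75, 400, 10):
    print(reading, risk_category(reading))
```

Note that the boundaries do not depend on the data at all, which is what separates this approach from the idiographic methods that follow.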

26
Idiographic
  • Category boundaries are established based on
    details within the data values
  • Quantile
  • Contains an equal number of pieces of data in
    each category
  • 50 states, 5 categories, 10 states in each
    category
  • 87 counties, 4 categories, 22 in three
    categories, 21 in one category
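
The two quantile examples above follow from simple integer division. A minimal sketch (the function name is made up for illustration):

```python
def quantile_sizes(n_items, n_categories):
    """Split n_items into n_categories groups that differ by at most one."""
    base, extra = divmod(n_items, n_categories)
    # 'extra' categories get one additional item each
    return [base + 1] * extra + [base] * (n_categories - extra)

print(quantile_sizes(50, 5))  # states example
print(quantile_sizes(87, 4))  # counties example
```

Which categories receive the extra item is a design choice; here the larger groups simply come first.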

27
  • Multi-modal
  • Uses natural breaks in the data to determine
    category boundaries
  • Be careful--natural breaks--not blips
  • Multi-step
  • Breaks in the slope of a cumulative frequency
    curve

28
Serial
  • Category boundaries are established that have a
    mathematical relationship to one another
  • Equal Interval
  • divides the data into equally ranged classes
    based on the minimum and maximum data values
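
A sketch of equal-interval breaks, assuming the class width is simply (maximum - minimum) / number of classes. The example uses the minimum (27) and maximum (128) of the earlier 50-value data set; the choice of four classes is arbitrary.

```python
def equal_interval_breaks(data, n_classes):
    """Return n_classes + 1 break points spanning min(data) to max(data)."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_classes  # every class gets the same range
    return [lo + i * width for i in range(n_classes + 1)]

# Only the extremes matter for this method
print(equal_interval_breaks([27, 128], 4))
```

Because only the extremes are used, a single outlier can stretch every class, leaving some classes nearly empty.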

29
  • Arithmetic Progression
  • Category ranges increase by a constant amount
    (Adding)
  • 2, 4, 6, 8, 10, 12, 14, 16
  • Geometric Progression
  • Category ranges differ based on a rate of
    increase that takes into account the prior class
    range (Multiplying)
  • 2, 4, 8, 16, 32, 64, 128, 256
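
The two example sequences can be generated as follows. This is a sketch with illustrative function names; the start value, step, and ratio of 2 are read off the sequences on the slide.

```python
def arithmetic_progression(start, step, count):
    """Each term adds a constant step to the previous one."""
    return [start + i * step for i in range(count)]

def geometric_progression(start, ratio, count):
    """Each term multiplies the previous one by a constant ratio."""
    return [start * ratio ** i for i in range(count)]

print(arithmetic_progression(2, 2, 8))
print(geometric_progression(2, 2, 8))
```

The geometric version reaches much larger values in the same number of classes, which is why it suits strongly skewed data.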

30
  • Standard Deviation Units
  • Centered on the mean of the data with class
    boundaries established by the standard deviations
    plus the mean
  • Highlighted
  • based on a decision to highlight particular
    sectors of the data set in greater or lesser
    detail
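
For the standard deviation units method, one possible reading is that the class boundaries sit at the mean plus or minus whole numbers of standard deviations. The sketch below assumes that reading, uses the population standard deviation, and runs on made-up values chosen so the arithmetic is easy to follow.

```python
import statistics

def sd_unit_breaks(data, max_units=2):
    """Class boundaries at mean + k * SD for k = -max_units .. max_units."""
    mean = statistics.mean(data)
    # Assumption: population SD; swap in statistics.stdev() for a sample
    sd = statistics.pstdev(data)
    return [mean + k * sd for k in range(-max_units, max_units + 1)]

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values: mean 5, SD 2
print(sd_unit_breaks(sample))
```

The middle break is the mean itself, so classes read naturally as "within one SD below the mean", "more than one SD above", and so on.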

31
Writing it out in a legend
  • Category designations should not overlap
  • 0-10, 10-20, 20-30, 30-40, 40-50 (these overlap)
  • Designations should reflect the actual data that
    is in the category
  • 0-10, 11-20, 21-30, 31-40, 41-50
  • 0-9, 10-19, 20-29, 30-39, 41-50
  • 1-10, 11-20, 21-30, 31-40, 41-50
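
A quick way to check a legend for overlaps, sketched with an illustrative helper. Ranges are treated as inclusive (low, high) pairs, so 0-10 followed by 10-20 counts as an overlap because the value 10 fits both.

```python
def overlapping_pairs(ranges):
    """Return adjacent (low, high) ranges that share at least one value."""
    ordered = sorted(ranges)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b[0] <= a[1]]

overlapping = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
clean = [(0, 10), (11, 20), (21, 30), (31, 40), (41, 50)]

print(overlapping_pairs(overlapping))  # four offending pairs
print(overlapping_pairs(clean))        # empty list
```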

32
Classifying and Characterizing Data
  • Nowhere are these classifications more dramatic
    than on choropleth maps (and other thematic maps
    as well)
  • The following seven maps were made with the same
    data set
  • The only difference was how the data was
    classified
  • All the classifications are standard, accepted
    ways of classifying data

33
There are more ways to describe data
  • Shape and frequency are one way to describe data
    sets
  • Then we can classify them
  • We can describe the data relative to the middle
    of the data

34
Measures of Central Tendency
  • Different ways of describing the middle of the
    data set
  • Some averages are more average than others
  • It's mean to drive on the median, you might get
    mode down
  • A really bad play on words--I'm truly sorry
  • Julio

35
Mean
  • Arithmetic mean: sometimes called the average
  • Add all the values of x, then divide by the
    number of values, which is represented by n
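
As a small worked example, here is the mean of the 20 values from the frequency distribution slide (variable names are illustrative):

```python
values = [3, 2, 2, 3, 2, 4, 4, 1, 2, 2, 4, 3, 2, 0, 2, 2, 1, 3, 3, 1]

n = len(values)        # number of values
total = sum(values)    # add all the values of x
mean = total / n       # divide by n

print(total, n, mean)
```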

36
Median
  • The middle value of the sample when the data are
    ranked in order according to size
  • The median is the ith value in the ranked data
    set, where i = (n + 1)/2
  • Median is between the 25th and 26th values (use
    both values)
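
A sketch of finding the median by position, assuming the usual rule that the position is (n + 1)/2 and that a position ending in .5 means averaging the two neighbouring values. The data are the 50 values from the earlier data set slide, which is presumably what this slide refers to.

```python
data = [
    27, 68, 79, 91, 107, 43, 71, 80, 91, 108,
    43, 71, 81, 93, 108, 44, 71, 82, 94, 116,
    47, 73, 82, 94, 120, 49, 73, 84, 94, 120,
    50, 74, 84, 96, 122, 54, 75, 86, 97, 123,
    58, 76, 88, 103, 127, 65, 77, 88, 106, 128,
]

def median_by_position(values):
    ranked = sorted(values)
    n = len(ranked)
    i = (n + 1) / 2                # 1-based position of the median
    if i.is_integer():
        return ranked[int(i) - 1]  # odd n: a single middle value
    lower = ranked[int(i) - 1]     # value just below position i
    upper = ranked[int(i)]         # value just above position i
    return (lower + upper) / 2     # even n: use both values

print(median_by_position(data))
```

With n = 50 the position is 25.5, so the 25th and 26th ranked values are averaged.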

37
Other Averages
  • Mode
  • The value of x which occurs the most frequently
  • Excellent for nominal data
  • Midrange
  • The midrange is the data value which is the
    average of the low and high values
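
Both of these are quick to compute. The sketch below uses the 20 values from the frequency distribution slide and `statistics.multimode` (Python 3.8 or later), which returns every value tied for most frequent.

```python
import statistics

values = [3, 2, 2, 3, 2, 4, 4, 1, 2, 2, 4, 3, 2, 0, 2, 2, 1, 3, 3, 1]

# Mode: the value(s) occurring most often (a list, in case of ties)
modes = statistics.multimode(values)

# Midrange: average of the lowest and highest values
midrange = (min(values) + max(values)) / 2

print(modes, midrange)
```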

38
Weighted Mean
  • Takes each class midpoint and multiplies it by the
    class frequency.
  • These are summed and divided by n
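
A sketch of the weighted (grouped) mean. The class limits and frequencies below are assumptions: they were tallied from the earlier 50-value data set using width-11 classes starting at 22, not copied from a table on the slides.

```python
# Assumed classes (lower limit, upper limit) and their tallied frequencies
class_limits = [
    (22, 32), (33, 43), (44, 54), (55, 65), (66, 76),
    (77, 87), (88, 98), (99, 109), (110, 120), (121, 131),
]
frequencies = [1, 2, 5, 2, 9, 9, 10, 5, 3, 4]

# Class midpoint: halfway between the two class limits
midpoints = [(lo + hi) / 2 for lo, hi in class_limits]

n = sum(frequencies)
# Multiply each midpoint by its frequency, sum, then divide by n
weighted_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

print(n, weighted_mean)
```

The result will usually differ a little from the mean of the raw values, because every value in a class is treated as if it sat at the midpoint.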

39
Which tendency is most central?
  • All these measures of central tendency describe
    the middle of the data.
  • McGrew provides a good example with income data
  • Mode 21,000
  • Median 26,000
  • Mean 71,428

40
Using Central Tendency
  • Mode, Median and Mean are the most common
  • Means can be greatly influenced by outlying
    values
  • Medians are always the middle value
  • Modes are the most common value
  • These can yield different results using the same
    data and have different bases in logic

41
Measures of dispersion
  • If you have a middle, how is the data arranged
    around the middle?
  • Examine it graphically and statistically
  • Graphically? Did I hear John Tukey?

42
Box and Whisker Plots
  • Line in the center of the box is the median
  • Box represents the Inter-quartile Range (25% to
    75% levels)
  • Whiskers represent 25% each--sometimes outliers
    are marked
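
The numbers behind a box and whisker plot are the five-number summary. This sketch uses `statistics.quantiles` with its default "exclusive" method; statistics packages differ in how they interpolate quartiles, so a package's plot may show slightly different box edges.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(data, n=4)  # default method: exclusive
    return min(data), q1, q2, q3, max(data)

values = [3, 2, 2, 3, 2, 4, 4, 1, 2, 2, 4, 3, 2, 0, 2, 2, 1, 3, 3, 1]
low, q1, med, q3, high = five_number_summary(values)
print("box from", q1, "to", q3, "with median line at", med)
print("whiskers reach", low, "and", high)
```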

43
Deviation from the Mean
  • How much do the values vary from the mean?
  • If the mean is 6 and your value is 3 what is the
    variation?
  • If the mean is 6 and your value is 7 what is the
    variation?
  • What is the absolute value of those two
    variations?
  • |x| reads the absolute value of x (never
    negative)
  • |x-y| the absolute value of x-y (never negative)

44
Mean absolute deviation
  • The mean of the absolute deviations
  • Tells us the average distance that a piece of
    data is from the mean
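
A sketch of the calculation. The first two lines answer the questions from the previous slide (mean 6, values 3 and 7); the longer list is made up for illustration.

```python
def mean_absolute_deviation(data):
    mean = sum(data) / len(data)
    # Average distance of each value from the mean, ignoring direction
    return sum(abs(x - mean) for x in data) / len(data)

print(3 - 6, abs(3 - 6))  # deviation and its absolute value
print(7 - 6, abs(7 - 6))

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values with mean 5
print(mean_absolute_deviation(sample))
```

The absolute values matter: the raw deviations from the mean always sum to zero, so their plain average tells us nothing.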

45
Standard Deviation and Variance
  • These are your friends
  • Think of them as Bert and Ernie
  • These two statistics are two of the most commonly
    referred to in statistical work

46
Standard Deviation
  • First formula is for the sample, and the second
    is for the population
  • Standard Deviation is a measure of fluctuation of
    the data (a yardstick if you will)
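
The two formulas themselves did not survive in this transcript. The standard textbook forms, which are presumably what the slide showed, are the sample standard deviation s (dividing by n - 1) followed by the population standard deviation σ (dividing by N):

```latex
s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}
\qquad
\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}
```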

47
Variance
  • It is the standard deviation squared
  • Like standard deviation it is a measurement of
    fluctuation of the data.
  • Variances are used in other statistical
    procedures (later in the semester)
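
A sketch showing both statistics side by side, using the usual textbook definitions (divide by n - 1 for a sample, by n for a population). The data values are made up so the population figures come out as whole numbers.

```python
import math

def variance(data, sample=True):
    n = len(data)
    mean = sum(data) / n
    ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations
    return ss / (n - 1) if sample else ss / n

def standard_deviation(data, sample=True):
    # The standard deviation is the square root of the variance
    return math.sqrt(variance(data, sample))

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
print(variance(values, sample=False), standard_deviation(values, sample=False))
print(variance(values), standard_deviation(values))
```

The standard deviation is in the same units as the data, while the variance is in squared units, which is one reason the standard deviation is the one used as a yardstick.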

48
Coefficient of Variation
  • Compares the variability of two data sets
  • Takes into account differences in the magnitude
    of the values (different means)
  • The ratio of the standard deviation to the mean
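
A sketch of the calculation, expressed as a percentage. The two sites and their numbers are entirely hypothetical and are not the precipitation figures from the slides.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * sd / mean

# Hypothetical sites: A has the larger SD, B has the larger CV
site_a = coefficient_of_variation(sd=8, mean=40)
site_b = coefficient_of_variation(sd=4, mean=10)
print(site_a, site_b)
```

A larger standard deviation does not guarantee a larger CV: what counts is how big the spread is relative to the typical value.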

49
Coefficient of Variation
  • Annual precipitation data for 40 years (n = 40)
  • Although St. Louis has the larger SD, the CV for
    San Diego is larger.
  • Why? And what does this mean?

50
Measures of Shape and Relative Position
  • Other measures of the frequency distribution
  • Skewness and Kurtosis

51
Skewness
  • Measures the degree of symmetry in the
    distribution
  • the extent to which the bulk of values in a
    distribution are concentrated on one or the other
    side of the mean
  • Remember shapes of histograms that were skewed to
    the right and skewed to the left (positive and
    negative)

52
Skewness
  • Skewness is important because many geographic
    data sets are skewed
  • Helps understand the mean
  • Little skew, the mean is probably representative
  • Lots--maybe not

(s = standard deviation)
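
The skewness formula from the slide is not in this transcript. One common form divides the sum of cubed deviations by n times the cube of the standard deviation; the sketch below assumes that form and the population standard deviation, so a package may report slightly different values.

```python
import math

def skewness(data):
    n = len(data)
    mean = sum(data) / n
    # Assumption: population standard deviation (divide by n)
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return sum((x - mean) ** 3 for x in data) / (n * s ** 3)

print(skewness([1, 2, 3, 4, 5]))     # symmetric
print(skewness([1, 1, 1, 1, 10]))    # long tail to the right
print(skewness([1, 10, 10, 10, 10])) # long tail to the left
```

The sign follows the longer tail: positive for a right skew, negative for a left skew, and zero for a symmetric set.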
53
Kurtosis
  • Measures the flatness or the peakedness of the
    data
  • Is the distribution full of peaks and valleys or
    is it flatter?
  • Best compared to a normal curve (kurtosis = 3)

54
Kurtosis
  • If value is greater than 3 it is leptokurtic
    (peaky)
  • If value is less than 3 it is platykurtic (flat)
  • Note that some programs will use formula 2 so
    then zero is the dividing line for kurtosis
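
The formulas from the slide are missing from the transcript. A common pair, assumed here, is the fourth moment divided by the squared variance (normal curve = 3) and the same quantity minus 3 (normal curve = 0), often called excess kurtosis. The sketch uses population moments.

```python
def kurtosis(data):
    """Fourth moment over squared variance; a normal curve gives 3."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # population variance
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2 ** 2

def excess_kurtosis(data):
    """Same measure shifted so a normal curve gives 0."""
    return kurtosis(data) - 3

flat = [-1, 1, -1, 1]  # hypothetical values with no central peak
print(kurtosis(flat), excess_kurtosis(flat))
```

When reading software output, check which version is reported before deciding whether 3 or 0 is the dividing line.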

55
Quartiles
  • Data is divided into four equal parts (quarters)
  • Inter-quartile range is between the 25% and 75%
    marks
  • Do not confuse quartiles with quantiles
  • quartile is a type of quantile

56
Z-score (Standard Score)
  • The position of a value (x) away from the mean
    measured in standard deviations
  • More on this later
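
As a preview, the z-score is the value minus the mean, divided by the standard deviation. A minimal sketch with made-up numbers:

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical distribution with mean 5 and standard deviation 2
print(z_score(7, 5, 2))
print(z_score(2, 5, 2))
```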

57
Chebyshev's Theorem
  • A useful tidbit of knowledge
  • The proportion of any distribution that lies
    within k standard deviations of the mean is at
    least 1 - (1/k^2) where k is any positive number
    larger than one
  • This applies to any distribution of data

58
Application of Theorem
  • This means that for any distribution of data
  • At least 75% of the data is within 2 standard
    deviations of the mean
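
The bound is easy to tabulate for other values of k. A sketch (the function name is illustrative):

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of any data set within k SDs of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("k must be larger than one")
    return 1 - 1 / k ** 2

for k in (1.5, 2, 3):
    print(k, chebyshev_lower_bound(k))
```

These are guaranteed minimums for any distribution, so real data sets usually have more than this share inside the interval.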

59
Empirical Rule
  • For Normal Distributions
  • About 68% of the data is within 1 Standard
    Deviation of the Mean
  • About 95% is within 2 SDs
  • About 99.7% is within 3 SDs
  • Really important to know (film at 11)
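
These percentages can be checked against the standard normal curve. A sketch using `statistics.NormalDist` (Python 3.8 or later); the helper name is illustrative.

```python
from statistics import NormalDist

def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))
```

Compare these with Chebyshev's guaranteed minimums: knowing the data are normal lets us say far more than the general theorem does.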