Stat 281: Ch. 2--Presenting Data

Slides: 53
Provided by: Dwig79

Transcript and Presenter's Notes


1
Stat 281: Ch. 2--Presenting Data
  • An engineer, consultant and statistician were
    driving down a steep mountain road. Suddenly, the
    brakes failed and the car careened down the road
    out of control. But halfway down, the driver
    somehow managed to stop the car by running it
    against the embankment, narrowly avoiding going
    over a very steep cliff. They all got out,
    shaken, but otherwise unharmed.
  • The consultant said "To fix this problem we need
    to organize a committee, have meetings, write
    several interim reports and develop a solution
    through a continuous improvement process."
  • The engineer said "No! That would take too long,
    and besides that method has never really worked.
    I have my trusty penknife here and will take
    apart the brake system, isolate the problem and
    correct it."
  • The statistician said "No - you're both wrong!
    Let's all push the car back up the hill and see
    if it happens again. We only have a sample size
    of 1 here!!"

2
Fizzy Cola Sales (Showing first 8 of 50)
Employee  Gallons Sold
P.P.       95.00
S.M.      100.75
P.T.      126.00
P.U.      114.00
M.S.      134.25
F.K.      116.75
L.Z.       97.50
F.E.      102.25
3
The Goal
  • Display data in ways that elucidate the
    information contained in them
  • Raw Data actually contains all the information
    available, but it may not be easy to understand
  • It's not so much the information available that
    counts; it's the information you get out!

4
Ranked Fizzy Cola Sales
Rank  Empl.  Gal. Sold    Rank  Empl.  Gal. Sold
1     T.T.    82.50       43    R.O.   133.25
2     A.D.    88.50       44    M.S.   134.25
3     E.I.    91.00       45    O.U.   135.00
4     A.S.    93.25       46    G.H.   135.50
5.5   P.P.    95.00       47    R.T.   136.00
5.5   E.Y.    95.00       48    A.T.   137.00
7     L.Z.    97.50       49    O.O.   144.00
8     T.N.    99.50       50    R.N.   148.00
5
Viewing Data Directly
  • Ranked Data (aka an Array)
      • Still contains all the information
      • Can quickly see range (max and min)
      • May also easily determine median, quartiles, etc.
  • Stem and Leaf
      • Arranges ranked data into chart-like form

6
Fizzy Cola Stem & Leaf
Stem | Leaves
   8 | 28
   9 | 135579
  10 | 0234556789
  11 | 02344555667889
  12 | 124455688
  13 | 2345567
  14 | 48
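A minimal Python sketch of how such a display can be built. It assumes each leaf is the units digit after the decimal part is truncated; that rule is inferred from the display above rather than stated on the slides. The function name `stem_and_leaf` is hypothetical, and the 16 values are the lowest and highest eight from the Ranked Fizzy Cola Sales slide.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return {stem: leaves} with tens as stems and units digits as leaves."""
    stems = defaultdict(list)
    for value in sorted(values):
        whole = int(value)  # assumption: drop the decimal part, do not round
        stems[whole // 10].append(str(whole % 10))
    return {stem: "".join(leaves) for stem, leaves in sorted(stems.items())}


# Lowest and highest eight sales figures from the ranked table.
sales = [82.50, 88.50, 91.00, 93.25, 95.00, 95.00, 97.50, 99.50,
         133.25, 134.25, 135.00, 135.50, 136.00, 137.00, 144.00, 148.00]

display = stem_and_leaf(sales)
for stem, leaves in display.items():
    print(f"{stem:>3} | {leaves}")
```

With only these 16 values the 8, 9 and 14 rows reproduce the slide's rows exactly; the other rows need the remaining 34 observations, which the transcript does not list.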
7
More Complex Stem & Leaf (MiniTab Style)
Stem-and-Leaf of C1   N = 16
Leaf Unit = 0.010
Depth  Stem  Leaves
   1    59   7
   4    60   148
  (5)   61   02669
   7    62   0247
   3    63   58
   1    64   3

8
Dot Plot for Fizzy Cola Sales
  • Dot plots display vertically stacked dots for
    each data value.
  • They tend to bring out any clustering behavior
    in the data.
  • Stem & Leaf displays and Dot Plots begin to give
    us a picture of the Distribution of Data.

9
Summarized Data
  • Frequency Tables
      • Grouped or ungrouped
      • Frequency Distribution
      • Relative Frequency Distribution
  • Bar Graphs
      • Histogram (Numeric Data Only)
  • Pie Charts
      • Often used for Categorical Data

10
Fizzy Cola Frequency Table
Number of Employees in each Sales Range

Gallons Sold  Employees
80-90          2
>90-100        6
>100-110      10
>110-120      14
>120-130       9
>130-140       7
>140-150       2
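The frequency table converts to a relative frequency distribution by dividing each count by the total. A small Python sketch using the counts from the table (the variable names are illustrative only):

```python
# Class labels and counts as given in the Fizzy Cola frequency table.
classes = ["80-90", ">90-100", ">100-110", ">110-120",
           ">120-130", ">130-140", ">140-150"]
employees = [2, 6, 10, 14, 9, 7, 2]

total = sum(employees)
# Relative frequency = class count / total count.
relative = [count / total for count in employees]

for label, count, share in zip(classes, employees, relative):
    print(f"{label:>9}  {count:>2}  {share:.2f}")
```

The relative frequencies sum to 1, so they can be read directly as the fraction of employees in each sales range.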
11
Histogram of Fizzy Cola Sales
12
Constructing a Histogram
  • 1. Identify the high (H) and low (L) scores.
    Find the range: Range = H - L.
  • 2. Select a number of classes and a class width
    so that their product is a bit larger than the
    range.
  • 3. Pick a starting point a little smaller than L.
    Count from the starting point by the width to
    obtain the class boundaries. Observations that
    fall on class boundaries are placed into the
    class interval to the right.
  • Note:
  • 1. The class width is the difference between the
    upper and lower class boundaries.
  • 2. There is no best choice for class widths,
    number of classes, or starting points.
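The steps above can be sketched in Python. This is one possible implementation, not the course's own: the function name `class_frequencies` and the choices of start = 90, width = 10 and 5 classes are assumptions made for illustration, applied to the eight sales figures shown on the first Fizzy Cola slide.

```python
import math


def class_frequencies(data, start, width, num_classes):
    """Count observations per class; boundary values go to the class on the right."""
    counts = [0] * num_classes
    for x in data:
        index = math.floor((x - start) / width)
        if not 0 <= index < num_classes:
            raise ValueError(f"{x} falls outside the chosen classes")
        counts[index] += 1
    boundaries = [start + i * width for i in range(num_classes + 1)]
    return boundaries, counts


sales = [95.00, 100.75, 126.00, 114.00, 134.25, 116.75, 97.50, 102.25]

# Step 1: range = H - L = 134.25 - 95.00 = 39.25.
data_range = max(sales) - min(sales)
# Step 2: 5 classes of width 10 give a product of 50, a bit larger than the range.
# Step 3: start at 90, a little smaller than L.
boundaries, counts = class_frequencies(sales, start=90, width=10, num_classes=5)
print(boundaries, counts)
```

Note that this follows the "boundary goes to the right" rule from step 3, which differs from the ">90-100" labelling used in the frequency table slide.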

13
Terms Used With Histograms
  • Symmetrical: The sides of the distribution are
    mirror images. There is a line of symmetry.
  • Uniform (rectangular): Every value appears with
    equal frequency.
  • Skewed: One tail is stretched out longer than the
    other. The direction of skewness is on the side
    of the longer tail (positively vs. negatively
    skewed).
  • J-shaped: There is no tail on the side of the
    class with the highest frequency.
  • Bimodal: The two largest classes are separated by
    one or more classes. Often implies two
    populations are sampled.
  • Normal: The distribution is symmetric about the
    mean and bell-shaped.

14
Bimodal Distribution
15
Left-Skewed Distribution
Ages of Nuns
16
Distribution of Categorical Data
Cars Sold in One Week
Day        Number Sold
Monday     15
Tuesday    23
Wednesday  35
Thursday   11
Friday     12
Saturday   42

17
Basic Pie Chart
Cars Sold in One Week
Pie Charts focus our attention on fractions of
the whole, especially for the largest classes.
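As a rough illustration of "fractions of the whole", the Python sketch below turns the Cars Sold in One Week counts into fractions and pie-slice angles. The variable names are illustrative; only the counts come from the slides.

```python
cars_sold = {
    "Monday": 15, "Tuesday": 23, "Wednesday": 35,
    "Thursday": 11, "Friday": 12, "Saturday": 42,
}

total = sum(cars_sold.values())
# Each slice's share of the whole, and the angle it occupies in the pie.
fractions = {day: sold / total for day, sold in cars_sold.items()}
angles = {day: 360 * share for day, share in fractions.items()}

for day in cars_sold:
    print(f"{day:<10} {fractions[day]:6.1%} {angles[day]:6.1f} degrees")
```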
18
Three-D Pie Chart
Cars Sold in One Week
Three-D Pie Charts are pretty but can also be
used to distort the image.
19
Manipulating 3-D Pie Charts
Cars Sold in One Week
Changing the angle or turning the pie may affect
our perception of size.
20
Bar Charts for Categorical Data
Cars Sold in One Week
(Bar charts for categorical data are drawn with
bars separated, while bars in histograms touch.)
21
Manipulating Bar Charts
Cars Sold in One Week
Cutting off the vertical axis distorts
our perception of the differences between bars.
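A small numeric sketch of the distortion, using the Saturday (42) and Monday (15) counts. The axis starting value of 10 is a hypothetical choice for illustration, and `drawn_height_ratio` is not a library function.

```python
def drawn_height_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn bar heights when the vertical axis begins at axis_start."""
    return (a - axis_start) / (b - axis_start)


saturday, monday = 42, 15

honest_ratio = drawn_height_ratio(saturday, monday)                  # axis starts at 0
cut_ratio = drawn_height_ratio(saturday, monday, axis_start=10)      # axis cut off at 10

print(f"Axis from 0:  Saturday's bar is {honest_ratio:.1f}x Monday's")
print(f"Axis from 10: Saturday's bar is {cut_ratio:.1f}x Monday's")
```

The data have not changed, but cutting the axis makes the taller bar look more than twice as dominant as it really is.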
22
Manipulating Bar Charts
Cars Sold in One Week
Removal of labels on the vertical axis allows
bars to be stretched upward to hide the
differences.
23
Hmmm
  • It is proven that the celebration of birthdays is
    healthy. Statistics show that people who
    celebrate the most birthdays become the oldest.
  • In earlier times, they had no statistics, so they
    had to fall back on lies. (Stephen Leacock)

24
Measures of Central Tendency
  • Statistics used to locate the middle of a set of
    numeric data, or where the data is clustered.
  • The term average may be associated with all
    measures of central tendency.
  • The mode for discrete data is the value that
    occurs with greatest frequency.
  • The modal class of a histogram is the class with
    the greatest frequency.
  • A bimodal distribution has two high-frequency
    classes separated by classes with lower
    frequencies.

25
Summation Notation
26
The Mean
  • Mean: The regular average. The sum of all the
    values divided by the total number of values.
  • The population mean, μ (lowercase Greek mu), is
    the mean of all x values for the population. It
    is a parameter of the distribution.
  • We usually cannot measure μ but would like to
    estimate its value.

27
The Sample Mean
  • The sample mean, x̄ (read x-bar), is the mean of
    all x values for the sample. It is a statistic.
  • The mean can be greatly influenced by outliers.
  • E.g. Bill Gates moves to town.
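A quick Python sketch of the outlier effect. All income figures here are made up for illustration, including the newcomer's; nothing below comes from the slides except the idea.

```python
from statistics import mean, median

# Hypothetical annual incomes in a small town.
incomes = [40_000, 45_000, 50_000, 55_000, 60_000]
mean_before, median_before = mean(incomes), median(incomes)

# One extremely wealthy resident moves in (hypothetical figure).
incomes_after = incomes + [10_000_000_000]
mean_after, median_after = mean(incomes_after), median(incomes_after)

print(f"Mean:   {mean_before:,.0f} -> {mean_after:,.0f}")
print(f"Median: {median_before:,.0f} -> {median_after:,.0f}")
```

The mean jumps by several orders of magnitude while the median barely moves, which is why the median is often preferred for skewed data such as incomes.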

28
Median
  • Median: The value of the data that occupies the
    middle position when the data are ranked
    according to size.
  • The sample median (statistic) may be denoted by
    x̃ (x tilde).
  • The population median (parameter), M (uppercase
    Greek mu), is the data value in the middle of the
    population.
  • To find the median:
  • 1. Rank the data.
  • 2. Determine the depth of the median.
  • 3. Determine the value of the median.
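The three steps can be sketched in Python as follows. The slides do not give the depth formula; this sketch assumes the common convention depth = (n + 1)/2, and the function name `median_by_depth` is hypothetical.

```python
def median_by_depth(data):
    """Return (depth, median) using the rank-then-depth procedure."""
    if not data:
        raise ValueError("data must not be empty")
    ranked = sorted(data)            # step 1: rank the data
    n = len(ranked)
    depth = (n + 1) / 2              # step 2: assumed convention for the depth
    if depth.is_integer():           # step 3: odd n, the middle value itself
        return depth, ranked[int(depth) - 1]
    lower = ranked[int(depth) - 1]   # even n: average the two middle values
    upper = ranked[int(depth)]
    return depth, (lower + upper) / 2


print(median_by_depth([2, 4, 5, 9]))      # even number of observations
print(median_by_depth([5, 7, 1, 3, 8]))   # odd number of observations
```

For the 50 Fizzy Cola observations this convention gives a depth of 25.5, meaning the median lies halfway between the 25th and 26th ranked values.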

29
Mode
  • Mode: The mode is the value of x that occurs most
    frequently.
  • Note: If two or more values in a sample are tied
    for the highest frequency (number of
    occurrences), there is no mode.
  • Note: Mode, as defined here, is most applicable
    to categorical or discrete data. The mode for
    continuous data is defined differently.
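A short Python sketch of this definition, including the "tie means no mode" convention from the first note. The function name `mode_or_none` and the sample lists are illustrative assumptions.

```python
from collections import Counter


def mode_or_none(values):
    """Return the single most frequent value, or None if the top frequency is tied."""
    counts = Counter(values)
    top = max(counts.values())
    winners = [value for value, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else None


print(mode_or_none([2, 4, 4, 5, 9]))   # one value clearly most frequent
print(mode_or_none([1, 1, 2, 2]))      # tie for highest frequency: no mode
```

Python's built-in `statistics.mode` handles ties differently (it returns the first mode encountered in recent versions), so it does not follow the slide's convention.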

30
Other Measures of Center
  • Midrange: The number midway between the maximum
    and minimum data values. It is found by
    averaging the max and min.
  • Midquartile: Oops, we haven't defined quartiles
    yet. But this is the average of the first and
    third quartile instead of the max and min.
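For example, the midrange of the Fizzy Cola sales can be computed from the lowest and highest values on the ranked slide (82.50 and 148.00). A minimal Python sketch, with `midrange` as an illustrative helper name:

```python
def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


# Only the extremes matter, so the lowest and highest sales are enough here.
fizzy_midrange = midrange([82.50, 148.00])
print(fizzy_midrange)
```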

31
Dispersion
  • How spread apart are the data?
  • Two populations with the same mean can have very
    different distributions; we would like to
    measure spread somehow.
  • Range (max - min)
      • Values in the middle are ignored
      • Dispersion of the middle could be very different
  • Use the idea of deviation from the mean
      • MAD
      • Variance
      • Standard Deviation

32
Deviations from the Mean
[Figure labels: x-values, mean, deviations]
33
Some example data
Obs    Data x
1      2
2      4
3      5
4      9
Total
34
Calculate the mean
Obs    Data x  Mean
1      2       5
2      4       5
3      5       5
4      9       5
Total  20
35
Deviation From the Mean
Obs    Data x  Mean  Deviation (x - x̄)
1      2       5     -3
2      4       5     -1
3      5       5      0
4      9       5      4
Total  20      20     0
36
Mean Absolute Deviation (MAD)
Obs  Data x  Mean  Deviation (x - x̄)  Absolute Deviation
1    2       5     -3                  3
2    4       5     -1                  1
3    5       5      0                  0
4    9       5      4                  4
Sum of Absolute Deviations: 8
MAD (divide sum by n): 2
37
Formula
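The formula on this slide did not survive in the transcript. Based on the preceding table (sum of absolute deviations, divided by n), it was presumably of the following form; the index notation is an assumption:

```latex
\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert
```

For the example data this gives (3 + 1 + 0 + 4) / 4 = 2, matching the table.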
38
Use of Squared Deviations
Obs  Data x  Mean  Deviation (x - x̄)  Squared Deviation
1    2       5     -3                  9
2    4       5     -1                  1
3    5       5      0                  0
4    9       5      4                  16
Sum of Squared Deviations SS(x): 26
Variance (divide sum by n-1): 8.67
Standard Deviation (take square root): 2.94
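The calculations in the last few tables can be reproduced with a few lines of Python. This is a plain step-by-step sketch using the example data 2, 4, 5, 9; the variable names are illustrative.

```python
import math

data = [2, 4, 5, 9]
n = len(data)

mean = sum(data) / n                          # 20 / 4
deviations = [x - mean for x in data]         # always sum to zero
mad = sum(abs(d) for d in deviations) / n     # mean absolute deviation
ss = sum(d ** 2 for d in deviations)          # sum of squares, SS(x)
variance = ss / (n - 1)                       # sample variance: divide by n - 1
std_dev = math.sqrt(variance)                 # sample standard deviation

print(f"mean={mean}, MAD={mad}, SS(x)={ss}")
print(f"variance={variance:.2f}, std dev={std_dev:.2f}")
```

Because the deviations always sum to zero, they cannot measure spread on their own, which is why absolute values or squares are used.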
39
Sums of Squares
  • The sum of squared deviations is denoted by SS(x)
    and often called the Sum of Squares for x.
  • There are also other notations used, including
    SSx and Sxx

40
Variance
  • The Variance is the statistician's favorite
    measure of dispersion, but in reports or
    everyday use the standard deviation is more
    commonly given.
  • The Standard Deviation is the square root of the
    variance.
  • The Variance may be thought of as the average
    squared deviation from the mean.
      • For a sample, divide by n-1.
      • For a population, divide by N.

41
Formulas
42
Formulas
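The formulas on these two slides were lost in the transcript. From the definitions on the Variance slide (divide by n-1 for a sample, by N for a population), the standard forms are as follows; the exact notation the slides used is an assumption:

```latex
s^2 = \frac{SS(x)}{n-1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1},
\qquad s = \sqrt{s^2}

\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N},
\qquad \sigma = \sqrt{\sigma^2}
```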
43
  • Example: Find the variance and standard deviation
    for the data 5, 7, 1, 3, 8.
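One way to check this example in Python, assuming the five values are a sample (so the divisor is n - 1; the slide does not say which is intended). It uses the standard library's `statistics.variance` and `statistics.stdev`, with the sum of squares worked out by hand alongside.

```python
from statistics import mean, stdev, variance

data = [5, 7, 1, 3, 8]

x_bar = mean(data)                               # 24 / 5
ss = sum((x - x_bar) ** 2 for x in data)         # sum of squared deviations
sample_variance = variance(data)                 # divides SS(x) by n - 1
sample_std_dev = stdev(data)                     # square root of the variance

print(f"mean={x_bar}, SS(x)={ss:.1f}")
print(f"s^2={sample_variance:.1f}, s={sample_std_dev:.2f}")
```

If the data were instead treated as a whole population, `statistics.pvariance` and `statistics.pstdev` would divide by N and give smaller values.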

44
Interpretation of s
  • Need to get a sense of the meaning of different
    values of dispersion measures.
  • Are units same as data or squared?
  • Empirical Rule: 68%, 95%, 99.7%
  • Test of Normality
  • Range as estimator of s
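The 68 / 95 / 99.7 figures of the Empirical Rule are the shares of a normal distribution within 1, 2 and 3 standard deviations of the mean. A short Python check using the standard library's `statistics.NormalDist` (Python 3.8+); the helper name `share_within` is illustrative.

```python
from statistics import NormalDist


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

The rule only applies when the data are approximately normal, which is why it also serves as a rough test of normality.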

45
z-Scores
  • Also called standardized scores or just standard
    scores.
  • Expresses a quantity in terms of its distance
    from the mean in standard deviation units.

46
More z-Scores
  • The z-score measures the number of standard
    deviations away from the mean.
  • z-scores typically range from -3.00 to 3.00.
  • z-scores may be used to make comparisons of raw
    scores.
  • You can calculate back from a z-score to the raw
    data value by using the inverse:
    x = mean + z × (standard deviation).
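A small Python sketch of the z-score and its inverse, using the mean (5) and standard deviation (2.94) from the earlier example data 2, 4, 5, 9. The function names are illustrative, and the formula z = (x - mean) / s is the standard definition rather than text recovered from the slides.

```python
def z_score(x, mean, std_dev):
    """Distance of x from the mean, in standard deviation units."""
    return (x - mean) / std_dev


def raw_value(z, mean, std_dev):
    """Inverse of z_score: recover the raw data value from z."""
    return mean + z * std_dev


mean, std_dev = 5, 2.94   # from the worked example with data 2, 4, 5, 9

z = z_score(9, mean, std_dev)
print(f"z for x=9: {z:.2f}")
print(f"back to raw value: {raw_value(z, mean, std_dev):.2f}")
```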

47
Percentiles
  • Values of the variable that divide a set of
    ranked data into 100 equal subsets.
  • Each set of data has 99 percentiles.
  • The kth percentile, Pk, is a value such that at
    most k% of the data are smaller than Pk and at
    most (100-k)% are larger.

48
  • Procedure for finding Pk
  • 1. Rank the n observations, lowest to highest.
  • 2. Compute A = (nk)/100.
  • 3. If A is an integer:
      • d(Pk) = A + 0.5 (depth)
      • Pk is halfway between the value of the datum in
        the Ath position and the value of the next
        datum.
  • If A is a fraction:
      • d(Pk) = B, the next larger integer.
      • Pk is the value of the datum in the Bth position.
  • Some programs, like Excel, also do interpolation.
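The procedure can be sketched in Python as below. This follows the depth rule described on the slide, not the interpolation used by spreadsheet programs, so results may differ from Excel or NumPy. The function name `percentile` is illustrative, and k is assumed to be an integer from 1 to 99.

```python
def percentile(data, k):
    """kth percentile (integer k, 1..99) using the A = nk/100 depth rule."""
    if not data:
        raise ValueError("data must not be empty")
    if not 1 <= k <= 99:
        raise ValueError("k must be between 1 and 99")
    ranked = sorted(data)                          # step 1: rank the data
    whole, remainder = divmod(len(ranked) * k, 100)  # step 2: A = nk/100
    if remainder == 0:
        # A is an integer: halfway between the Ath value and the next one.
        return (ranked[whole - 1] + ranked[whole]) / 2
    # A is a fraction: take the value in position B, the next larger integer.
    b = whole + 1
    return ranked[b - 1]


print(percentile([2, 4, 5, 9], 50))       # A = 2, an integer
print(percentile([5, 7, 1, 3, 8], 25))    # A = 1.25, so B = 2
```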

49
Quartiles
  • Like percentiles except dividing the data set
    into 4 equal subsets.
  • The first quartile, Q1, is the same as the 25th
    percentile, and
  • The third quartile, Q3, is the same as the 75th
    percentile.
  • The second quartile is the 50th percentile, which
    is the median.
  • Sometimes finding Q1 and Q3 is described as
    finding the medians of the bottom half and top
    half of the data, respectively.

50
Five Number Summary
  • The Min, Q1, Median, Q3, and Max
  • Indicate how the data is spread out in each
    quarter.
  • Interquartile Range is the distance between Q1
    and Q3.
  • The Midquartile is the average of Q1 and Q3,
    another measure of central tendency.
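A Python sketch that assembles the five-number summary, the interquartile range and the midquartile. It assumes the quartiles are found with the percentile depth rule from the earlier slide (other conventions give slightly different Q1 and Q3); the helper names are illustrative, and the data are the values from the variance example.

```python
def percentile(data, k):
    """kth percentile via the A = nk/100 depth rule (integer k, 1..99)."""
    ranked = sorted(data)
    whole, remainder = divmod(len(ranked) * k, 100)
    if remainder == 0:
        return (ranked[whole - 1] + ranked[whole]) / 2
    return ranked[whole]  # position B = whole + 1, zero-based index = whole


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max)."""
    return (min(data), percentile(data, 25), percentile(data, 50),
            percentile(data, 75), max(data))


data = [5, 7, 1, 3, 8]
low, q1, med, q3, high = five_number_summary(data)

iqr = q3 - q1                 # interquartile range
midquartile = (q1 + q3) / 2   # average of the first and third quartiles

print(f"min={low}, Q1={q1}, median={med}, Q3={q3}, max={high}")
print(f"IQR={iqr}, midquartile={midquartile}")
```

These five values are exactly what a box and whisker plot draws: the box spans Q1 to Q3 with a line at the median, and the whiskers reach toward the min and max.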

51
Box and Whisker Plots
52
Hmmm
  • What did the Box Plot say to the outlier?
  • Don't you dare get close to my whisker!