Looking at Data - Distributions - PowerPoint PPT Presentation

Slides: 47
Provided by: larryw4


1
Chapter1
  • Looking at Data - Distributions

2
Introduction
  • Goal: Using Data to Gain Knowledge
  • Terms/Definitions
  • Individuals: Units described by or used to
    obtain data, such as humans, animals, objects
    (aka experimental or sampling units)
  • Variables: Characteristics corresponding to
    individuals that can take on different values
    among individuals
  • Categorical Variable: Levels correspond to one of
    several groups or categories
  • Quantitative Variable: Takes on numeric values such
    that arithmetic operations make sense

3
Introduction
  • Spreadsheets for Statistical Analyses
  • Rows Represent Individuals
  • Columns Represent Variables
  • SPSS, Minitab, EXCEL are examples
  • Measuring Variables
  • Instrument: Tool used to make quantitative
    measurement on subjects (e.g. psychological test
    or physical fitness measurement)
  • Independent and Dependent Variables
  • Independent Variable: Describes a group an
    individual comes from (categorical) or its level
    (quantitative) prior to observation
  • Dependent Variable: Random outcome of interest

4
Independent and Dependent Variables
  • Dependent variables are also called response
    variables
  • Independent Variables are also called explanatory
    variables
  • Marketing: Does amount of exposure affect
    attitudes?
  • I.V.: Exposure (in time or number), different
    subjects receive different levels
  • D.V.: Measurement of liking of a product or brand
  • Medicine: Does a new drug reduce heart disease?
  • I.V.: Treatment (Active Drug vs Placebo)
  • D.V.: Presence/Absence of heart disease in a time
    period
  • Psychology/Finance: Risk Perceptions
  • I.V.: Framing of Choice (Loss vs Gain)
  • D.V.: Choice Taken (Risky vs Certain)

5
Rates and Proportions
  • Categorical Variables: Typically we count the
    number with some characteristic in a group of
    individuals.
  • The actual count is not a useful summary on its
    own. More useful summaries include
  • Proportion: The number with the characteristic
    divided by the group size (will lie between 0 and
    1)
  • Percent: Number with characteristic per 100
    individuals (proportion × 100)
  • Rate per 100,000: proportion × 100,000
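
The three summaries above are the same quantity on different scales. A minimal Python sketch, assuming a made-up count and group size (the function name `summarize_count` and the numbers are illustrative, not from the slides):

```python
def summarize_count(count: int, group_size: int) -> dict:
    """Turn a raw count into a proportion, a percent, and a rate per 100,000."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    proportion = count / group_size
    return {
        "proportion": proportion,                 # lies between 0 and 1
        "percent": proportion * 100,              # per 100 individuals
        "rate_per_100k": proportion * 100_000,    # per 100,000 individuals
    }

# Hypothetical numbers, for illustration only.
summary = summarize_count(count=30, group_size=1200)
print(summary)
```
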

6
Graphical Displays of Distributions
  • Graphs of Categorical Variables
  • Bar Graph: Horizontal axis defines the various
    categories, heights of bars represent numbers of
    individuals
  • Pie Chart: Breaks down a circle (pie) such that
    the sizes of the slices represent the numbers or
    percentages of individuals in the categories.

7
Example - AAA Ratings of FL Hotels (Bar Chart)
8
Example - AAA Ratings of FL Hotels (Pie Chart)
9
Graphical Displays of Distributions
  • Graphs of Numeric Variables
  • Stemplot: Crude, but quick method of displaying
    the entire set of data and observing shape of
    distribution
  • Stem: All but rightmost digit; Leaf: Rightmost
    digit
  • Put stems in vertical column (small at top),
    draw vertical line
  • Put leaves in appropriate row in increasing
    order from stem
  • Histogram: Breaks data into equally spaced ranges
    on horizontal axis, heights of bars represent
    frequencies or percentages

10
Example: Time (Hours/Year) Lost to Traffic
Stems: 10s of hours; Leaves: Hours
Step 1: Stems
1
2
3
4
5
Step 2: Stems and Leaves
1 | 48
2 | 01244699
3 | 0112244457778
4 | 122222245566
5 | 0336
Step 3
Source: Texas Transportation Institute
(5/7/2001).
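
The stemplot steps can be sketched in Python. The `hours` list below is read back from the stems and leaves in Step 2 (39 values); the function name `stemplot` is an illustrative choice, and the sketch assumes non-negative whole numbers.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot lines: stem = all but the rightmost digit, leaf = rightmost digit."""
    leaves = defaultdict(list)
    for v in sorted(values):            # sorting puts leaves in increasing order
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # Stems run from smallest (top) to largest, including any empty stems.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(d) for d in leaves[stem])
        lines.append(f"{stem} | {row}")
    return lines

# Hours/year lost to traffic, reconstructed from the Step 2 stems and leaves.
hours = [14, 18,
         20, 21, 22, 24, 24, 26, 29, 29,
         30, 31, 31, 32, 32, 34, 34, 34, 35, 37, 37, 37, 38,
         41, 42, 42, 42, 42, 42, 42, 44, 45, 45, 46, 46,
         50, 53, 53, 56]

for line in stemplot(hours):
    print(line)
```
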
11
Example: Time (Hours/Year) Lost to Traffic - EXCEL
Output
Note: in the histogram, the bins represent the number
up to and including that number (e.g. T ≤ 14,
14 < T ≤ 21, ..., 42 < T ≤ 49, T > 49)
12
Comparing 2 Groups - Back-to-back Stemplots
  • Places Stems in Middle, group 1 to left, group 2
    to right
  • Example: Maze Learning
  • Groups (I.V.): Adults vs Children
  • Measured Response (D.V.): Average number of
    Errors in series of Trials

13
Example - Maze Learning (Average Errors)
Stems: Integer parts; Leaves: Decimal parts
14
Examining Distributions
  • Overall Pattern and Deviations
  • Shape: symmetric, stretched to one direction,
    multiple humps
  • Center: Typical values
  • Spread: Wide or narrow
  • Outlier: Individual whose value is far from
    others (see bottom right corner of previous
    slide)
  • May be due to data entry error, instrument
    malfunction, or individual being unusual with
    respect to others

15
Time Plots - Variable Measured Over Time
16
Time Plot with Trend/Seasonality
17
Numeric Descriptions of Distributions
  • Measures of Central Tendency
  • Arithmetic Mean: Total equally divided among
    individual cases
  • Median: Midpoint of the distribution (M)
  • Measures of Spread (Dispersion)
  • Quartiles (first/third): Points that break out
    the smallest and largest 25% of distribution
    (Q1, Q3)
  • 5 Number Summary: (Minimum, Q1, M, Q3, Maximum)
  • Interquartile Range: IQR = Q3 - Q1
  • Boxplot: Graphical summary of 5 Number Summary
  • Variance: Average squared deviation from mean
    (s²)
  • Standard Deviation: Square root of variance (s)

18
Measures of Central Tendency
  • Arithmetic Mean: Obtain the total by summing all
    values and divide by sample size (equal
    allotment among individuals)
  • Median: Midpoint of Distribution
  • Sort values from smallest to largest
  • If n odd, take the (n+1)/2 ordered value
  • If n even, take average of n/2 and (n/2)+1
    ordered values

19
2005 Oscar Nominees (Best Picture)
  • Movie: Domestic Gross / Worldwide Gross
  • The Aviator: $103M / $214M
  • Finding Neverland: $52M / $116M
  • Million Dollar Baby: $100M / $216M
  • Ray: $75M / $97M
  • Sideways: $72M / $108M
  • Mean & Median Domestic Gross among nominees ($M):
    Mean = 402/5 = 80.4, Median = 75
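
The mean and median rules from the previous slide can be checked against the domestic grosses listed above. A minimal sketch; the helper names `mean` and `median` are illustrative, and Python's built-in `statistics` module offers equivalents.

```python
def mean(values):
    """Total divided equally among the individual cases."""
    return sum(values) / len(values)

def median(values):
    """Midpoint of the sorted values, following the odd/even rule."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # the (n+1)/2-th ordered value
    return (ordered[mid - 1] + ordered[mid]) / 2   # average of n/2 and (n/2)+1

# Domestic gross ($M) for the five nominees listed on the slide.
domestic = {
    "The Aviator": 103,
    "Finding Neverland": 52,
    "Million Dollar Baby": 100,
    "Ray": 75,
    "Sideways": 72,
}

grosses = list(domestic.values())
print(mean(grosses))    # 80.4
print(median(grosses))  # 75
```
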

20
Delta Flight Times - ATL/MCO Oct,2004
  • N = 372 Flights, 10/1/2004-10/31/2004
  • Total actual time: 30536 Minutes
  • Mean Time: 30536/372 = 82.1 Minutes
  • Median: 372/2 = 186, (372/2)+1 = 187
  • 186th and 187th ordered times are 81 minutes:
    M = 81

21
Measures of Spread
  • Quartiles: First (Q1 aka Lower) and Third (Q3 aka
    Upper)
  • Q1 is the median of the values below the median
    position
  • Q3 is the median of the values above the median
    position
  • Notes (see examples on next page)
  • If n is odd, median position is (n+1)/2, and
    finding quartiles does not include this value.
  • If n is even, median position is treated (most
    commonly) as (n+1)/2 and the two values
    (positions) used to compute median are used for
    quartiles.

22
  • Oscar Nominations
  • # of Individuals: n = 5
  • Median Position: (5+1)/2 = 3
  • Positions Below Median Position: 1-2
  • Positions Above Median Position: 4-5
  • Median of Lower Positions: 1.5
  • Median of Upper Positions: 4.5
  • ATL/MCO Flights
  • # of Individuals: n = 372
  • Median Position: (372+1)/2 = 186.5
  • Positions Below Median Position: 1-186
  • Positions Above Median Position: 187-372
  • Median of Lower Positions: 93.5
  • Median of Upper Positions: 279.5
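
A sketch of this quartile convention in Python: the middle value is left out of both halves when n is odd, and the data are split into equal halves when n is even. Other software may use different quartile conventions, so results can differ slightly. The function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]                                   # positions below the median position
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # positions above it
    return median(lower), median(ordered), median(upper)

# Domestic gross ($M) of the five Oscar nominees: n = 5, median position 3.
q1, m, q3 = quartiles([103, 52, 100, 75, 72])
print(q1, m, q3)  # 62.0 75 101.5
```

With n = 5 the lower half is positions 1-2 and the upper half is positions 4-5, matching the median positions 1.5 and 4.5 listed above.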

23
Outliers - 1.5xIQR Rule
  • Outlier: Value that falls a long way from other
    values in the distribution
  • 1.5xIQR Rule: An observation may be considered an
    outlier if it falls more than 1.5 times the
    interquartile range above the third (upper)
    quartile or more than the same distance below
    the first (lower) quartile.
  • ATL/MCO Data: Q1 = 76, Q3 = 86, IQR = 10,
    1.5xIQR = 15
  • High Outliers: Above 86 + 15 = 101 minutes
  • Low Outliers: Below 76 - 15 = 61 minutes
  • 12 Flights are at 102 minutes or more (Highest is
    122). See (modified) boxplot below
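
The 1.5xIQR cut-offs can be sketched as a small Python function. The quartiles 76 and 86 come from the slide; the `sample_times` list is hypothetical, since the 372 individual flight times are not given here.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (low, high) cut-offs: k * IQR below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, q1, q3, k=1.5):
    """Values strictly outside the fences."""
    low, high = iqr_fences(q1, q3, k)
    return [v for v in values if v < low or v > high]

low, high = iqr_fences(76, 86)
print(low, high)  # 61.0 101.0

# Hypothetical flight times (minutes), for illustration only.
sample_times = [60, 75, 81, 86, 101, 102, 122]
print(flag_outliers(sample_times, 76, 86))  # [60, 102, 122]
```
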

24
Measures of Spread - Variance and S.D.
  • Deviation: Difference between an observed value
    and the overall mean (sign is important)
  • Variance: Average squared deviation (divides
    the sum of squared deviations by n-1, as opposed
    to n, for reasons we see later)
  • Standard Deviation: Positive square root of s²
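
In symbols, these definitions are usually written as follows, where x-bar is the sample mean and n the sample size (a standard formulation, stated here because the slide's own formulas did not survive in the transcript):

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
s = \sqrt{s^2}
```
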

25
Example - 2005 Oscar Movie Revenues
  • Mean: x̄ = 80.4
  • The Aviator: i = 1, x1 = 103, Deviation =
    103 - 80.4 = 22.6
  • Finding Neverland: i = 2, x2 = 52, Dev =
    52 - 80.4 = -28.4
  • Million Dollar Baby: i = 3, x3 = 100, Dev =
    100 - 80.4 = 19.6
  • Ray: i = 4, x4 = 75, Dev = 75 - 80.4 = -5.4
  • Sideways: i = 5, x5 = 72, Dev = 72 - 80.4 = -8.4
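
Carrying the deviations through to the variance and standard deviation, as a minimal Python sketch (variable names are illustrative; the variance divides by n-1):

```python
import math

# Domestic gross ($M) for the five nominees.
grosses = [103, 52, 100, 75, 72]

n = len(grosses)
xbar = sum(grosses) / n                       # 80.4
deviations = [x - xbar for x in grosses]      # 22.6, -28.4, 19.6, -5.4, -8.4 (up to rounding)
sum_sq = sum(d ** 2 for d in deviations)      # about 1801.2
variance = sum_sq / (n - 1)                   # about 450.3
sd = math.sqrt(variance)                      # about 21.2

print(round(variance, 1), round(sd, 2))
```

The deviations always sum to zero (up to floating-point rounding), which is why they are squared before averaging.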

26
Computer Output of Summary Measures and Boxplot
(SPSS) - ATL/MCO Data
27
Linear Transformations
  • Often work with transformed data
  • Linear Transformation: xnew = a + bx for
    constants a and b (e.g. transforming from metric
    system to U.S., Celsius to Fahrenheit, etc.)
  • Effects
  • Multiplying by b causes the mean to be multiplied
    by b and the standard deviation by |b|
  • Addition of a shifts mean and all percentiles by
    a but does not affect the standard deviation or
    spread
  • Note that for locations, multiplication by b
    precedes addition of a
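
These effects can be checked numerically. A small sketch using Celsius to Fahrenheit (a = 32, b = 1.8); the temperature values are hypothetical.

```python
import statistics

# Hypothetical temperatures in Celsius, for illustration only.
celsius = [10.0, 15.0, 20.0, 25.0, 30.0]

a, b = 32.0, 1.8                               # Fahrenheit = 32 + 1.8 * Celsius
fahrenheit = [a + b * c for c in celsius]

mean_c = statistics.mean(celsius)
sd_c = statistics.stdev(celsius)               # sample SD (divides by n-1)
mean_f = statistics.mean(fahrenheit)
sd_f = statistics.stdev(fahrenheit)

# Mean: multiply by b, then add a. SD: multiply by |b| only.
print(mean_f, a + b * mean_c)
print(sd_f, abs(b) * sd_c)
```
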

28
Density Curves/Normal Distributions
  • Continuous (or practically continuous) variables
    that can lie along a continuous (practically)
    range of values
  • Obtain a histogram of data (will be irregular
    with rigid blocks as bars over ranges)
  • Density curves are smooth approximations (models)
    to the coarse histogram
  • Curve lies above the horizontal axis
  • Total area under curve is 1
  • Area under the curve over a range of values
    represents the probability of that range
  • Normal Distributions - Family of density curves
    with very specific properties

29
Mean and Median of a Density Curve
  • Mean is the balance point of a distribution of
    measurements. If the height of the curve
    represented weight, it is where the density curve
    would balance
  • Median is the point where half the area is below
    and half the area is above the point
  • Symmetric Densities: Mean = Median
  • Right Skew Densities: Mean > Median
  • Left Skew Densities: Mean < Median
  • We will mainly work with means. Notation: μ

30
Symmetric (Normal) Distribution
31
Right Skewed Density Curve
32
Mean is the Balance Point
33
Normal Distribution
  • Bell-shaped, symmetric family of distributions
  • Classified by 2 parameters: Mean (μ) and standard
    deviation (σ). These represent location and
    spread
  • Random variables that are approximately normal
    have the following properties with respect to
    individual measurements
  • Approximately half (50%) fall above (and below)
    mean
  • Approximately 68% fall within 1 standard
    deviation of mean
  • Approximately 95% fall within 2 standard
    deviations of mean
  • Virtually all fall within 3 standard deviations
    of mean
  • Notation when X is normally distributed with mean
    μ and standard deviation σ: X ~ N(μ, σ)
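
The 68 / 95 / "virtually all" figures can be reproduced with Python's built-in `statistics.NormalDist` (Python 3.8+). A minimal sketch; the helper name `within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

def within(k):
    """Probability that a normal measurement falls within k SDs of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within(k), 4))
```

The exact values are roughly 0.683, 0.954 and 0.997, which the slide rounds to 68%, 95% and "virtually all".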

34
Two Normal Distributions
35
Normal Distribution
36
Example - Heights of U.S. Adults
  • Female and Male adult heights are well
    approximated by normal distributions:
    XF ~ N(63.7, 2.5), XM ~ N(69.1, 2.6)

Source: Statistical Abstract of the U.S. (1992)
37
Standard Normal (Z) Distribution
  • Problem: Unlimited number of possible normal
    distributions (-∞ < μ < ∞, σ > 0)
  • Solution: Standardize the random variable to have
    mean 0 and standard deviation 1
  • Probabilities of certain ranges of values and
    specific percentiles of interest can be obtained
    through the standard normal (Z) distribution
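
Standardizing is conventionally written as below, together with the reverse step that converts a Z-value back to original units (the slide's own formula is not in the transcript, so this is the standard form rather than a quotation):

```latex
Z = \frac{X - \mu}{\sigma}, \qquad X = \mu + \sigma Z
```
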

38
(No Transcript)
39
Standard Normal (Z) Distribution
[Figure: standard normal curve marked at z; the area below z is the Table Area, the area above z is 1 - Table Area]
40
[Z-table: rows give the integer part and 1st decimal of z; columns give the 2nd decimal place]
41
[Z-table: rows give the integer part and 1st decimal of z; columns give the 2nd decimal place]
42
Finding Probabilities of Specific Ranges
  • Step 1 - Identify the normal distribution of
    interest (e.g. its mean (μ) and standard
    deviation (σ))
  • Step 2 - Identify the range of values that you
    wish to determine the probability of observing
    (XL, XU), where often the upper or lower bounds
    are ∞ or -∞
  • Step 3 - Transform XL and XU into Z-values
  • Step 4 - Obtain P(ZL ≤ Z ≤ ZU) from Z-table

43
Example - Adult Female Heights
  • What is the probability a randomly selected
    female is 5'10" or taller (70 inches)?
  • Step 1 - X ~ N(63.7, 2.5)
  • Step 2 - XL = 70.0, XU = ∞
  • Step 3 - ZL = (70.0 - 63.7)/2.5 = 2.52, ZU = ∞
  • Step 4 - P(X ≥ 70) = P(Z ≥ 2.52) =
    1 - P(Z ≤ 2.52) = 1 - .9941 = .0059 (≈ 1/170)
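
The same four steps can be followed in Python with `statistics.NormalDist` standing in for the Z-table. This is a sketch; the software carries more decimals than the table, so the result agrees with .0059 only to rounding.

```python
from statistics import NormalDist

# Step 1: the distribution of interest, X ~ N(63.7, 2.5) (inches).
heights = NormalDist(mu=63.7, sigma=2.5)

# Step 2: lower bound 70 inches, no upper bound.
x_lower = 70.0

# Step 3: transform the bound into a Z-value.
z_lower = (x_lower - heights.mean) / heights.stdev    # about 2.52

# Step 4: P(X >= 70) = 1 - P(X <= 70).
p_taller = 1 - heights.cdf(x_lower)

print(round(z_lower, 2), round(p_taller, 4))
```
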

44
Finding Percentiles of a Distribution
  • Step 1 - Identify the normal distribution of
    interest (e.g. its mean (μ) and standard
    deviation (σ))
  • Step 2 - Determine the percentile of interest
    100p (e.g. the 90th percentile is the cut-off
    where 90% of scores are below and 10% are
    above).
  • Step 3 - Find p in the body of the z-table and
    its corresponding z-value (zp) on the outer edge
  • If 100p < 50 then use left-hand page of table
  • If 100p ≥ 50 then use right-hand page of table
  • Step 4 - Transform zp back to original units

45
Example - Adult Male Heights
  • Above what height do the tallest 5% of males
    lie?
  • Step 1 - X ~ N(69.1, 2.6)
  • Step 2 - Want to determine 95th percentile
    (p = .95)
  • Step 3 - P(Z ≤ 1.645) = .95
  • Step 4 - X.95 = 69.1 + (1.645)(2.6) = 73.4
    (6', 1.4")
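
A sketch of the percentile steps in Python, with `NormalDist.inv_cdf` taking the place of reading the z-table backwards (variable names are illustrative):

```python
from statistics import NormalDist

mu, sigma = 69.1, 2.6      # Step 1: X ~ N(69.1, 2.6), adult male heights in inches
p = 0.95                   # Step 2: 95th percentile

z_p = NormalDist().inv_cdf(p)     # Step 3: about 1.645
x_p = mu + z_p * sigma            # Step 4: back to original units, about 73.4

feet, inches = divmod(x_p, 12)    # convert inches to feet and inches
print(round(z_p, 3), round(x_p, 1), int(feet), round(inches, 1))
```
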

46
Statistical Models
  • When making statistical inference it is useful to
    write random variables in terms of model
    parameters and random errors
  • Here μ is a fixed constant and ε is a random
    variable
  • In practice m will be unknown, and we will use
    sample data to estimate or make statements
    regarding its value