Looking at Data - Distributions - PowerPoint PPT Presentation

About This Presentation

Title:

Looking at Data - Distributions

Description:

2005 Oscar Nominees (Best Picture) Movie: Domestic Gross/Worldwide Gross ... Mean & Median Domestic Gross among nominees ($M): Delta Flight Times - ATL/MCO Oct,2004 ... – PowerPoint PPT presentation

Slides: 47

Provided by: larryw4

Learn more at: https://users.stat.ufl.edu

Transcript and Presenter's Notes

1
Chapter 1

Looking at Data - Distributions

2
Introduction

Goal: Using Data to Gain Knowledge
Terms/Definitions
Individuals: Units described by or used to obtain data, such as humans, animals, objects (aka experimental or sampling units)
Variables: Characteristics corresponding to individuals that can take on different values among individuals
Categorical Variable: Levels correspond to one of several groups or categories
Quantitative Variable: Takes on numeric values such that arithmetic operations make sense

3
Introduction

Spreadsheets for Statistical Analyses
Rows Represent Individuals
Columns Represent Variables
SPSS, Minitab, EXCEL are examples
Measuring Variables
Instrument: Tool used to make a quantitative measurement on subjects (e.g. psychological test or physical fitness measurement)
Independent and Dependent Variables
Independent Variable: Describes the group an individual comes from (categorical) or its level (quantitative) prior to observation
Dependent Variable: Random outcome of interest

4
Independent and Dependent Variables

Dependent variables are also called response variables
Independent variables are also called explanatory variables
Marketing: Does amount of exposure affect attitudes?
I.V.: Exposure (in time or number), different subjects receive different levels
D.V.: Measurement of liking of a product or brand
Medicine: Does a new drug reduce heart disease?
I.V.: Treatment (Active Drug vs Placebo)
D.V.: Presence/Absence of heart disease in a time period
Psychology/Finance: Risk Perceptions
I.V.: Framing of Choice (Loss vs Gain)
D.V.: Choice Taken (Risky vs Certain)

5
Rates and Proportions

Categorical Variables: Typically we count the number with some characteristic in a group of individuals.
The actual count is not a useful summary on its own, because it depends on the group size. More useful summaries include:
Proportion: The number with the characteristic divided by the group size (will lie between 0 and 1)
Percent: Number with the characteristic per 100 individuals (proportion x 100)
Rate per 100,000: proportion x 100,000
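The three summaries above differ only in their scaling, which a short Python sketch can make concrete. The function name `summarize_count` and the counts used (30 of 1,200) are hypothetical, chosen only for illustration.

```python
def summarize_count(count, group_size):
    """Turn a raw count into a proportion, a percent, and a rate per 100,000."""
    proportion = count / group_size
    return {
        "proportion": proportion,                 # lies between 0 and 1
        "percent": proportion * 100,              # per 100 individuals
        "rate_per_100k": proportion * 100_000,    # per 100,000 individuals
    }

# Hypothetical example: 30 of 1,200 individuals have the characteristic.
summary = summarize_count(30, 1200)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The same count of 30 would give a very different proportion in a group of 120, which is why the raw count alone is a poor summary.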

6
Graphical Displays of Distributions

Graphs of Categorical Variables
Bar Graph: Horizontal axis defines the various categories, heights of bars represent numbers of individuals
Pie Chart: Breaks down a circle (pie) such that the sizes of the slices represent the numbers or percentages of individuals in the categories.

7
Example - AAA Ratings of FL Hotels (Bar Chart)
8
Example - AAA Ratings of FL Hotels (Pie Chart)
9
Graphical Displays of Distributions

Graphs of Numeric Variables
Stemplot: Crude, but quick method of displaying the entire set of data and observing shape of distribution
Stem: All but rightmost digit, Leaf: Rightmost digit
Put stems in vertical column (small at top), draw vertical line
Put leaves in appropriate row in increasing order from stem
Histogram: Breaks data into equally spaced ranges on horizontal axis, heights of bars represent frequencies or percentages

10
Example: Time (Hours/Year) Lost to Traffic
Stems: 10s of hours, Leaves: Hours
Step 1
Stems: 1 2 3 4 5
Step 2
Stems and Leaves:
1 | 48
2 | 01244699
3 | 0112244457778
4 | 122222245566
5 | 0336
Step 3
Source: Texas Transportation Institute (5/7/2001).
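The stemplot steps can be sketched in Python. The `hours` list below is read back from the stems and leaves shown on the slide (39 values); the function name `stemplot` is a hypothetical helper, not something from the original deck.

```python
from collections import defaultdict

# Hours/year lost to traffic, reconstructed from the slide's stems and leaves.
hours = [
    14, 18,
    20, 21, 22, 24, 24, 26, 29, 29,
    30, 31, 31, 32, 32, 34, 34, 34, 35, 37, 37, 37, 38,
    41, 42, 42, 42, 42, 42, 42, 44, 45, 45, 46, 46,
    50, 53, 53, 56,
]

def stemplot(values):
    """Return one text line per stem (stem = tens digit, leaf = ones digit)."""
    rows = defaultdict(list)
    for v in sorted(values):          # sorting puts leaves in increasing order
        rows[v // 10].append(v % 10)
    lines = []
    for stem in range(min(rows), max(rows) + 1):   # keep empty stems too
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

for line in stemplot(hours):
    print(line)
```

Keeping stems with no leaves (the `range` loop) preserves gaps in the distribution, which matters when judging shape.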
11
Example: Time (Hours/Year) Lost to Traffic - EXCEL Output
Note: in the histogram, the bins represent the number up to and including that number (e.g. T ≤ 14, 14 < T ≤ 21, ..., 42 < T ≤ 49, T > 49)
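The "up to and including" bin convention can be checked with a small Python sketch. The data are the 39 values read from the stemplot slide. The slide only names the edges 14, 21, 42 and 49; the intermediate edges 28 and 35 are an assumption based on the equal spacing of 7.

```python
import bisect

# Hours/year lost to traffic, reconstructed from the stemplot slide.
hours = [
    14, 18,
    20, 21, 22, 24, 24, 26, 29, 29,
    30, 31, 31, 32, 32, 34, 34, 34, 35, 37, 37, 37, 38,
    41, 42, 42, 42, 42, 42, 42, 44, 45, 45, 46, 46,
    50, 53, 53, 56,
]

# Upper bin edges; 28 and 35 are assumed (the slide elides the middle bins).
upper_edges = [14, 21, 28, 35, 42, 49]

# counts[i] holds values with edge[i-1] < T <= edge[i]; the last slot is T > 49.
counts = [0] * (len(upper_edges) + 1)
for t in hours:
    # bisect_left returns the index of the first edge >= t,
    # so a value equal to an edge falls in the bin that ends at that edge.
    counts[bisect.bisect_left(upper_edges, t)] += 1

labels = [f"<= {upper_edges[0]}"]
labels += [f"({lo}, {hi}]" for lo, hi in zip(upper_edges, upper_edges[1:])]
labels.append(f"> {upper_edges[-1]}")
for label, count in zip(labels, counts):
    print(f"{label:>10}: {count}")
```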
12
Comparing 2 Groups - Back-to-back Stemplots

Places Stems in Middle, group 1 to left, group 2
to right
Example: Maze Learning
Groups (I.V.): Adults vs Children
Measured Response (D.V.): Average number of Errors in series of Trials

13
Example - Maze Learning (Average Errors)
Stems: Integer parts, Leaves: Decimal parts
14
Examining Distributions

Overall Pattern and Deviations
Shape: symmetric, stretched to one direction, multiple humps
Center: Typical values
Spread: Wide or narrow
Outlier: Individual whose value is far from others (see bottom right corner of previous slide)
May be due to data entry error, instrument malfunction, or individual being unusual wrt others

15
Time Plots -Variable Measured Over Time
16
Time Plot with Trend/Seasonality
17
Numeric Descriptions of Distributions

Measures of Central Tendency
Arithmetic Mean: Total equally divided among individual cases
Median: Midpoint of the distribution (M)
Measures of Spread (Dispersion)
Quartiles (first/third): Points that break out the smallest and largest 25% of the distribution (Q1, Q3)
5 Number Summary: (Minimum, Q1, M, Q3, Maximum)
Interquartile Range: IQR = Q3 - Q1
Boxplot: Graphical summary of 5 Number Summary
Variance: Average squared deviation from mean (s^2)
Standard Deviation: Square root of variance (s)

18
Measures of Central Tendency

Arithmetic Mean: Obtain the total by summing all values and divide by sample size (equal allotment among individuals): x̄ = (x1 + x2 + ... + xn)/n

Median: Midpoint of Distribution
Sort values from smallest to largest
If n odd, take the (n+1)/2 ordered value
If n even, take average of n/2 and (n/2)+1 ordered values

19
2005 Oscar Nominees (Best Picture)

Movie: Domestic Gross / Worldwide Gross
The Aviator: $103M / $214M
Finding Neverland: $52M / $116M
Million Dollar Baby: $100M / $216M
Ray: $75M / $97M
Sideways: $72M / $108M
Mean & Median Domestic Gross among nominees ($M): Mean = 402/5 = 80.4, Median = 75
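A minimal Python sketch of the mean and median rules, applied to the domestic grosses listed on this slide. The helper names `mean` and `median` are illustrative; the median follows the odd/even position rule from the previous slide.

```python
def mean(values):
    """Total divided equally among the individual cases."""
    return sum(values) / len(values)

def median(values):
    """Midpoint of the sorted values, using the (n+1)/2 position rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # n odd: take the (n+1)/2-th ordered value (1-based position)
        return ordered[(n + 1) // 2 - 1]
    # n even: average the n/2-th and (n/2)+1-th ordered values
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Domestic gross in $M for the five nominees on the slide.
domestic = {
    "The Aviator": 103,
    "Finding Neverland": 52,
    "Million Dollar Baby": 100,
    "Ray": 75,
    "Sideways": 72,
}

gross = list(domestic.values())
print(f"mean   = {mean(gross):.1f}")
print(f"median = {median(gross)}")
```

With n = 5 the median is the 3rd ordered value (52, 72, 75, 100, 103), while the mean is pulled slightly above it by the two films near $100M.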

20
Delta Flight Times - ATL/MCO Oct,2004

N = 372 Flights, 10/1/2004-10/31/2004
Total actual time = 30536 Minutes
Mean Time = 30536/372 = 82.1 Minutes
Median: 372/2 = 186, (372/2)+1 = 187
186th and 187th ordered times are both 81 minutes
M = 81

21
Measures of Spread

Quartiles: First (Q1, aka Lower) and Third (Q3, aka Upper)
Q1 is the median of the values below the median position
Q3 is the median of the values above the median position
Notes (see examples on next page):
If n is odd, median position is (n+1)/2, and finding quartiles does not include this value.
If n is even, median position is treated (most commonly) as (n+1)/2 and the two values (positions) used to compute median are used for quartiles.

Oscar Nominations
Number of Individuals: n = 5
Median Position: (5+1)/2 = 3
Positions Below Median Position: 1-2
Positions Above Median Position: 4-5
Median of Lower Positions: 1.5
Median of Upper Positions: 4.5
ATL/MCO Flights
Number of Individuals: n = 372
Median Position: (372+1)/2 = 186.5
Positions Below Median Position: 1-186
Positions Above Median Position: 187-372
Median of Lower Positions: 93.5
Median of Upper Positions: 279.5
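The quartile rule from the previous slide can be sketched in Python as follows. The function names are hypothetical. Running the function on the positions 1..372 reproduces the quartile positions listed above; the raw flight times themselves are not in the transcript, so only positions are checked.

```python
def median_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves.

    If n is odd, the median value itself is excluded from both halves.
    If n is even, the two middle values are split between the halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_sorted(lower), median_sorted(upper)

# Oscar nominees, domestic gross in $M (n = 5, odd).
q1, q3 = quartiles([103, 52, 100, 75, 72])
print(f"Oscar data: Q1 = {q1}, Q3 = {q3}")

# Using the positions 1..372 as data returns the quartile *positions*.
pos_q1, pos_q3 = quartiles(range(1, 373))
print(f"Flight positions: Q1 at {pos_q1}, Q3 at {pos_q3}")
```

Other quartile conventions exist (statistical packages often interpolate), so results from software may differ slightly from this hand rule.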

23
Outliers - 1.5xIQR Rule

Outlier: Value that falls a long way from other values in the distribution
1.5xIQR Rule: An observation may be considered an outlier if it falls more than 1.5 times the interquartile range above the third (upper) quartile or more than the same distance below the first (lower) quartile.
ATL/MCO Data: Q1 = 76, Q3 = 86, IQR = 10, 1.5xIQR = 15
High Outliers: Above 86 + 15 = 101 minutes
Low Outliers: Below 76 - 15 = 61 minutes
12 Flights are at 102 minutes or more (Highest is 122). See (modified) boxplot below
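A short Python sketch of the 1.5xIQR rule using the quartiles quoted for the ATL/MCO data. The function names are illustrative, and the individual flight times tested at the end are hypothetical except for 122, which the slide gives as the highest value.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (low_fence, high_fence) for the k x IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, q1, q3, k=1.5):
    """True if x lies strictly beyond either fence."""
    low, high = iqr_fences(q1, q3, k)
    return x < low or x > high

# Quartiles from the slide: Q1 = 76, Q3 = 86 minutes.
low, high = iqr_fences(76, 86)
print(f"fences: {low} and {high} minutes")

# 122 is the maximum reported on the slide; 60, 81 and 101 are hypothetical.
for minutes in (60, 81, 101, 122):
    print(minutes, is_outlier(minutes, 76, 86))
```

A time of exactly 101 minutes sits on the fence and is not flagged, which matches the slide counting outliers from 102 minutes upward.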

24
Measures of Spread - Variance and S.D.

Deviation: Difference between an observed value and the overall mean (sign is important): x_i - x̄
Variance: Average squared deviation (divides the sum of squared deviations by n-1, as opposed to n, for reasons we see later):
s^2 = Σ (x_i - x̄)^2 / (n - 1)

Standard Deviation: Positive square root of s^2: s = √(s^2)

25
Example - 2005 Oscar Movie Revenues

Mean: x̄ = 80.4
The Aviator: i = 1, x1 = 103, Deviation: 103 - 80.4 = 22.6
Finding Neverland: i = 2, x2 = 52, Deviation: 52 - 80.4 = -28.4
Million Dollar Baby: i = 3, x3 = 100, Deviation: 100 - 80.4 = 19.6
Ray: i = 4, x4 = 75, Deviation: 75 - 80.4 = -5.4
Sideways: i = 5, x5 = 72, Deviation: 72 - 80.4 = -8.4
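The transcript stops at the deviations; a Python sketch can carry the calculation through to the variance and standard deviation, assuming the n-1 divisor described on the previous slide. The standard library's `statistics.variance` uses the same divisor and serves as a cross-check.

```python
import math
import statistics

# Domestic gross in $M for the five nominees.
gross = [103, 52, 100, 75, 72]

n = len(gross)
x_bar = sum(gross) / n                          # 80.4
deviations = [x - x_bar for x in gross]         # signs matter; they sum to ~0
sum_sq = sum(d ** 2 for d in deviations)        # sum of squared deviations
s2 = sum_sq / (n - 1)                           # sample variance
s = math.sqrt(s2)                               # sample standard deviation

print("deviations:", [round(d, 1) for d in deviations])
print(f"sum of squared deviations = {sum_sq:.2f}")
print(f"s^2 = {s2:.1f}, s = {s:.1f}")

# Cross-check against the standard library (also divides by n - 1).
print(f"statistics.variance = {statistics.variance(gross):.1f}")
```

The sum of squared deviations works out to about 1801.2, giving s^2 of about 450.3 and s of about 21.2 ($M).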

26
Computer Output of Summary Measures and Boxplot
(SPSS) - ATL/MCO Data
27
Linear Transformations

Often work with transformed data
Linear Transformation: x_new = a + bx for constants a and b (e.g. transforming from metric system to U.S., Celsius to Fahrenheit, etc.)
Effects:
Multiplying by b causes the mean to be multiplied by b and the standard deviation to be multiplied by |b|
Adding a shifts the mean and all percentiles by a but does not affect the standard deviation or spread
Note that for locations, multiplication by b precedes addition of a
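These effects can be verified numerically. The sketch below converts Celsius to Fahrenheit (a = 32, b = 1.8); the five temperature readings are hypothetical values made up for the illustration.

```python
import statistics

# Hypothetical temperatures in Celsius (illustration only).
celsius = [10, 15, 20, 25, 30]

a, b = 32, 1.8                      # Fahrenheit = 32 + 1.8 * Celsius
fahrenheit = [a + b * x for x in celsius]

mean_c = statistics.mean(celsius)
sd_c = statistics.stdev(celsius)
mean_f = statistics.mean(fahrenheit)
sd_f = statistics.stdev(fahrenheit)

# Location: multiply by b first, then add a.
print(f"mean: {mean_f:.2f} vs a + b*mean = {a + b * mean_c:.2f}")
# Spread: only the multiplier matters; the shift a drops out.
print(f"sd:   {sd_f:.2f} vs |b|*sd = {abs(b) * sd_c:.2f}")
```

Using |b| rather than b for the spread covers the case of a negative multiplier, where the mean changes sign but the standard deviation stays positive.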

28
Density Curves/Normal Distributions

Continuous (or practically continuous) variables can lie anywhere along a continuous (or practically continuous) range of values
Obtain a histogram of data (will be irregular with rigid blocks as bars over ranges)
Density curves are smooth approximations (models) to the coarse histogram
Curve lies on or above the horizontal axis
Total area under curve is 1
Area under the curve over a range of values represents the probability of that range
Normal Distributions - Family of density curves
with very specific properties

29
Mean and Median of a Density Curve

Mean is the balance point of a distribution of measurements. If the height of the curve represented weight, it's where the density curve would balance
Median is the point where half the area is below and half the area is above the point
Symmetric Densities: Mean = Median
Right Skew Densities: Mean > Median
Left Skew Densities: Mean < Median
We will mainly work with means. Notation: μ

30
Symmetric (Normal) Distribution
31
Right Skewed Density Curve
32
Mean is the Balance Point
33
Normal Distribution

Bell-shaped, symmetric family of distributions
Classified by 2 parameters: Mean (μ) and standard deviation (σ). These represent location and spread
Random variables that are approximately normal have the following properties wrt individual measurements:
Approximately half (50%) fall above (and below) mean
Approximately 68% fall within 1 standard deviation of mean
Approximately 95% fall within 2 standard deviations of mean
Virtually all fall within 3 standard deviations of mean
Notation when X is normally distributed with mean μ and standard deviation σ: X ~ N(μ, σ)
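The percentages above can be checked against the exact normal distribution with Python's standard library (`statistics.NormalDist`, available from Python 3.8). This is a quick numerical check, not part of the original slides.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

# Probability of falling within k standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

print(f"above the mean: {1 - z.cdf(0):.4f}")
for k, p in within.items():
    print(f"within {k} SD: {p:.4f}")
```

The exact values are roughly 0.683, 0.954 and 0.997, so "68%", "95%" and "virtually all" are rounded rules of thumb. The same proportions hold for any mean and standard deviation, since they depend only on the number of standard deviations from the mean.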

34
Two Normal Distributions
35
Normal Distribution
36
Example - Heights of U.S. Adults

Female and Male adult heights (in inches) are well approximated by normal distributions:
X_F ~ N(63.7, 2.5), X_M ~ N(69.1, 2.6)

Source: Statistical Abstract of the U.S. (1992)
37
Standard Normal (Z) Distribution

Problem: Unlimited number of possible normal distributions (-∞ < μ < ∞, σ > 0)
Solution: Standardize the random variable to have mean 0 and standard deviation 1:
Z = (X - μ)/σ ~ N(0, 1)

Probabilities of certain ranges of values and specific percentiles of interest can be obtained through the standard normal (Z) distribution

39
Standard Normal (Z) Distribution
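[Figure: standard normal curve. The table area is the area to the left of z, P(Z ≤ z); the area to the right of z is 1 - table area.]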
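40
[Table: standard normal (Z) table. Rows give the integer part and 1st decimal place of z; columns give the 2nd decimal place.]
41
[Table: standard normal (Z) table, continued. Rows give the integer part and 1st decimal place of z; columns give the 2nd decimal place.]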
42
Finding Probabilities of Specific Ranges

Step 1 - Identify the normal distribution of interest (i.e. its mean (μ) and standard deviation (σ))
Step 2 - Identify the range of values that you wish to determine the probability of observing (XL, XU), where often the upper or lower bound is ∞ or -∞
Step 3 - Transform XL and XU into Z-values: ZL = (XL - μ)/σ, ZU = (XU - μ)/σ

Step 4 - Obtain P(ZL ≤ Z ≤ ZU) from Z-table

43
Example - Adult Female Heights

What is the probability a randomly selected female is 5'10" or taller (70 inches)?
Step 1 - X ~ N(63.7, 2.5)
Step 2 - XL = 70.0, XU = ∞
Step 3 - ZL = (70.0 - 63.7)/2.5 = 2.52, ZU = ∞

Step 4 - P(X ≥ 70) = P(Z ≥ 2.52) = 1 - P(Z ≤ 2.52) = 1 - .9941 = .0059 (≈ 1/170)
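The four steps can be reproduced without a printed Z-table. The sketch below uses `statistics.NormalDist` from the Python standard library (3.8+) in place of the table lookup; the function name `prob_between` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def prob_between(mu, sigma, x_low=-math.inf, x_high=math.inf):
    """P(x_low <= X <= x_high) for X ~ N(mu, sigma), via Z-values."""
    z = NormalDist()                      # standard normal
    # Step 3: transform the bounds into Z-values (infinite bounds stay infinite).
    z_low = (x_low - mu) / sigma
    z_high = (x_high - mu) / sigma
    # Step 4: area between the two Z-values.
    p_low = 0.0 if z_low == -math.inf else z.cdf(z_low)
    p_high = 1.0 if z_high == math.inf else z.cdf(z_high)
    return p_high - p_low

# Steps 1-2: adult female heights, X ~ N(63.7, 2.5), range [70, infinity).
z_l = (70.0 - 63.7) / 2.5
p = prob_between(63.7, 2.5, x_low=70.0)
print(f"Z_L = {z_l:.2f}")
print(f"P(X >= 70) = {p:.4f}  (about 1 in {1 / p:.0f})")
```

The computed probability agrees with the table-based answer of .0059 to four decimal places.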

44
Finding Percentiles of a Distribution

Step 1 - Identify the normal distribution of interest (i.e. its mean (μ) and standard deviation (σ))
Step 2 - Determine the percentile of interest 100p (e.g. the 90th percentile is the cut-off where 90% of scores are below and 10% are above).
Step 3 - Find p in the body of the z-table and its corresponding z-value (zp) on the outer edge
If 100p < 50 then use left-hand page of table
If 100p ≥ 50 then use right-hand page of table
Step 4 - Transform zp back to original units: xp = μ + zp σ

45
Example - Adult Male Heights

Above what height do the tallest 5% of males lie?
Step 1 - X ~ N(69.1, 2.6)
Step 2 - Want to determine 95th percentile (p = .95)
Step 3 - P(Z ≤ 1.645) = .95
Step 4 - x.95 = 69.1 + (1.645)(2.6) = 73.4 inches (6'1.4")
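The percentile steps can also be sketched in Python, with `NormalDist.inv_cdf` from the standard library (3.8+) standing in for reading the z-table backwards. The function name `normal_percentile` is a hypothetical helper.

```python
from statistics import NormalDist

def normal_percentile(mu, sigma, p):
    """Value below which a fraction p of an N(mu, sigma) population falls."""
    z_p = NormalDist().inv_cdf(p)    # Step 3: z-value with area p to its left
    return mu + z_p * sigma          # Step 4: back to original units

# Steps 1-2: adult male heights, X ~ N(69.1, 2.6), 95th percentile.
z_95 = NormalDist().inv_cdf(0.95)
x_95 = normal_percentile(69.1, 2.6, 0.95)
feet, inches = divmod(x_95, 12)

print(f"z_.95 = {z_95:.3f}")
print(f"x_.95 = {x_95:.1f} inches ({feet:.0f} ft {inches:.1f} in)")
```

The result rounds to 73.4 inches, matching the table-based calculation on the slide.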

46
Statistical Models

When making statistical inference it is useful to write random variables in terms of model parameters and random errors:
X = μ + ε

Here μ is a fixed constant and ε is a random variable
In practice μ will be unknown, and we will use sample data to estimate or make statements regarding its value
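A small simulation can illustrate the idea of a fixed parameter plus random error. The equation did not survive in the transcript, so the form X = μ + ε used here is an assumption, as are the normal errors with mean 0. The values μ = 69.1 and σ = 2.6 are borrowed from the male height example purely for illustration.

```python
import random
import statistics

random.seed(0)      # fixed seed so the sketch is reproducible

mu = 69.1           # fixed constant (unknown in practice)
sigma = 2.6         # spread of the random errors (assumed normal, mean 0)
n = 10_000

# Each observation is the constant mu plus its own random error.
errors = [random.gauss(0, sigma) for _ in range(n)]
heights = [mu + e for e in errors]

# In practice only `heights` is observed; the sample mean estimates mu.
mu_hat = statistics.mean(heights)
print(f"true mu = {mu}, estimate from sample = {mu_hat:.2f}")
print(f"sample SD = {statistics.stdev(heights):.2f}")
```

Because the errors average out to roughly zero over many observations, the sample mean lands close to μ, which is the basis for using sample data to make statements about the parameter.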
