Title: Looking at Data - Distributions
1Chapter 1
- Looking at Data - Distributions
2Introduction
- Goal: Using Data to Gain Knowledge
- Terms/Definitions
- Individuals: Units described by or used to
obtain data, such as humans, animals, or objects
(aka experimental or sampling units)
- Variables: Characteristics corresponding to
individuals that can take on different values
among individuals
- Categorical Variable: Levels correspond to one of
several groups or categories
- Quantitative Variable: Takes on numeric values such
that arithmetic operations make sense
3Introduction
- Spreadsheets for Statistical Analyses
- Rows Represent Individuals
- Columns Represent Variables
- SPSS, Minitab, and EXCEL are examples
- Measuring Variables
- Instrument: Tool used to make a quantitative
measurement on subjects (e.g. psychological test
or physical fitness measurement)
- Independent and Dependent Variables
- Independent Variable: Describes the group an
individual comes from (categorical) or its level
(quantitative) prior to observation
- Dependent Variable: Random outcome of interest
4Independent and Dependent Variables
- Dependent variables are also called response
variables
- Independent variables are also called explanatory
variables
- Marketing: Does amount of exposure affect
attitudes?
- I.V.: Exposure (in time or number); different
subjects receive different levels
- D.V.: Measurement of liking of a product or brand
- Medicine: Does a new drug reduce heart disease?
- I.V.: Treatment (Active Drug vs Placebo)
- D.V.: Presence/Absence of heart disease in a time
period
- Psychology/Finance: Risk Perceptions
- I.V.: Framing of Choice (Loss vs Gain)
- D.V.: Choice Taken (Risky vs Certain)
5Rates and Proportions
- Categorical Variables: Typically we count the
number with some characteristic in a group of
individuals.
- The raw count alone is not a useful summary, since
it depends on the group size. More useful summaries
include:
- Proportion: The number with the characteristic
divided by the group size (will lie between 0 and 1)
- Percent: Number with the characteristic per 100
individuals (proportion x 100)
- Rate per 100,000: proportion x 100,000
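The three summaries differ only in their scaling, which a short sketch can make concrete. The function name and the counts below are hypothetical, chosen only for illustration.

```python
def summarize_count(count, group_size):
    """Convert a raw count into a proportion, a percent, and a rate per 100,000."""
    proportion = count / group_size
    return {
        "proportion": proportion,
        "percent": proportion * 100,
        "rate_per_100k": proportion * 100_000,
    }


# Hypothetical numbers: 25 of 2,000 individuals have the characteristic.
summary = summarize_count(25, 2000)
print(summary)
```

The same count of 25 would give a very different proportion in a group of 200, which is why the count by itself says little.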
6Graphical Displays of Distributions
- Graphs of Categorical Variables
- Bar Graph: Horizontal axis defines the various
categories; heights of bars represent numbers of
individuals
- Pie Chart: Breaks down a circle (pie) such that
the sizes of the slices represent the numbers or
percentages of individuals in the categories.
7Example - AAA Ratings of FL Hotels (Bar Chart)
8Example - AAA Ratings of FL Hotels (Pie Chart)
9Graphical Displays of Distributions
- Graphs of Numeric Variables
- Stemplot: Crude but quick method of displaying
the entire set of data and observing the shape of
the distribution
- Stem: All but rightmost digit; Leaf: Rightmost
digit
- Put stems in a vertical column (smallest at top),
draw a vertical line
- Put leaves in the appropriate row in increasing
order out from the stem
- Histogram: Breaks data into equally spaced ranges
on the horizontal axis; heights of bars represent
frequencies or percentages
10Example Time (Hours/Year) Lost to Traffic
Stems: 10s of hours   Leaves: Hours
Step 1 (Stems)
1
2
3
4
5
Step 2 (Stems and Leaves)
1 | 48
2 | 01244699
3 | 0112244457778
4 | 122222245566
5 | 0336
Step 3
Source: Texas Transportation Institute
(5/7/2001).
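The stemplot steps can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the slide: the function name is hypothetical, the data values are read back from the stems and leaves above, and the sketch assumes non-negative whole numbers.

```python
from collections import defaultdict

# Values read back from the stemplot above (hours/year lost to traffic).
HOURS = [14, 18, 20, 21, 22, 24, 24, 26, 29, 29,
         30, 31, 31, 32, 32, 34, 34, 34, 35, 37, 37, 37, 38,
         41, 42, 42, 42, 42, 42, 42, 44, 45, 45, 46, 46,
         50, 53, 53, 56]


def stemplot(values):
    """Return one text row per stem: stem = all but the last digit, leaf = last digit."""
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):          # sorting puts leaves in increasing order
        leaves_by_stem[value // 10].append(value % 10)
    rows = []
    # Walk every stem from smallest to largest so empty stems still get a row.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append(f"{stem} | {leaves}")
    return rows


for row in stemplot(HOURS):
    print(row)
```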
11Example Time (Hours/Year) Lost to Traffic - EXCEL
Output
Note: in the histogram, each bin is labeled by its
upper limit and includes that value (e.g. T ≤ 14,
14 < T ≤ 21, ..., 42 < T ≤ 49, T > 49)
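The "up to and including" convention can be sketched with the standard-library bisect module. The data are read back from the stemplot on the previous slide. The intermediate limits 28 and 35 are an assumption based on the equal spacing of 7, since the slide elides them, and the function name is hypothetical.

```python
from bisect import bisect_left

# Values read back from the stemplot (hours/year lost to traffic).
HOURS = [14, 18, 20, 21, 22, 24, 24, 26, 29, 29,
         30, 31, 31, 32, 32, 34, 34, 34, 35, 37, 37, 37, 38,
         41, 42, 42, 42, 42, 42, 42, 44, 45, 45, 46, 46,
         50, 53, 53, 56]

# Upper limits of the bins; 28 and 35 are assumed from the spacing of 7.
UPPER_LIMITS = [14, 21, 28, 35, 42, 49]


def bin_counts(values, upper_limits):
    """Count values per bin, where each bin includes its upper limit."""
    counts = [0] * (len(upper_limits) + 1)    # last slot is the overflow bin
    for value in values:
        # bisect_left finds the first limit >= value, so a value equal to a
        # limit lands in the bin that the limit closes.
        counts[bisect_left(upper_limits, value)] += 1
    return counts


counts = bin_counts(HOURS, UPPER_LIMITS)
labels = [f"<= {u}" for u in UPPER_LIMITS] + [f"> {UPPER_LIMITS[-1]}"]
for label, count in zip(labels, counts):
    print(f"{label:>6}: {count}")
```

Using bisect_right instead would put a value such as 42 into the next bin up, which is the other common convention for histogram edges.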
12Comparing 2 Groups - Back-to-back Stemplots
- Places stems in the middle, group 1 to the left,
group 2 to the right
- Example: Maze Learning
- Groups (I.V.): Adults vs Children
- Measured Response (D.V.): Average number of
Errors in a series of Trials
13Example - Maze Learning (Average Errors)
Stems: Integer parts   Leaves: Decimal parts
14Examining Distributions
- Overall Pattern and Deviations
- Shape: symmetric, stretched in one direction
(skewed), or multiple humps
- Center: Typical values
- Spread: Wide or narrow
- Outlier: Individual whose value is far from
others (see bottom right corner of previous
slide)
- May be due to data entry error, instrument
malfunction, or the individual being unusual with
respect to others
15Time Plots - Variable Measured Over Time
16Time Plot with Trend/Seasonality
17Numeric Descriptions of Distributions
- Measures of Central Tendency
- Arithmetic Mean: Total equally divided among
individual cases
- Median: Midpoint of the distribution (M)
- Measures of Spread (Dispersion)
- Quartiles (first/third): Points that break off
the smallest and largest 25% of the distribution
(Q1, Q3)
- 5 Number Summary: (Minimum, Q1, M, Q3, Maximum)
- Interquartile Range: IQR = Q3 - Q1
- Boxplot: Graphical summary of the 5 Number Summary
- Variance: Average squared deviation from the mean
(s²)
- Standard Deviation: Square root of the variance (s)
18Measures of Central Tendency
- Arithmetic Mean: Obtain the total by summing all
values and divide by the sample size (equal
allotment among individuals)
- Median: Midpoint of the distribution
- Sort values from smallest to largest
- If n is odd, take the (n+1)/2 ordered value
- If n is even, take the average of the n/2 and
(n/2)+1 ordered values
192005 Oscar Nominees (Best Picture)
- Movie: Domestic Gross / Worldwide Gross
- The Aviator: $103M / $214M
- Finding Neverland: $52M / $116M
- Million Dollar Baby: $100M / $216M
- Ray: $75M / $97M
- Sideways: $72M / $108M
- Mean and Median Domestic Gross among nominees ($M):
Mean = (103 + 52 + 100 + 75 + 72)/5 = 80.4, Median = 75
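As a quick check of the mean and median rules, the sketch below applies them to the five domestic grosses listed above. The helper name median_by_position is hypothetical; the standard-library statistics.median gives the same answer.

```python
from statistics import mean, median

# Domestic gross ($M) for the five nominees listed above.
domestic = {
    "The Aviator": 103,
    "Finding Neverland": 52,
    "Million Dollar Baby": 100,
    "Ray": 75,
    "Sideways": 72,
}


def median_by_position(values):
    """Median via ordered positions: (n+1)/2 if n is odd, else average the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]      # positions are 1-based, lists are 0-based
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


gross = list(domestic.values())
print(mean(gross))                  # total of 402 split evenly over 5 movies
print(median_by_position(gross))    # 3rd ordered value
```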
20Delta Flight Times - ATL/MCO Oct,2004
- N = 372 Flights, 10/1/2004-10/31/2004
- Total actual time: 30536 Minutes
- Mean Time: 30536/372 = 82.1 Minutes
- Median positions: 372/2 = 186, (372/2)+1 = 187
- 186th and 187th ordered times are both 81 minutes,
so M = 81
21Measures of Spread
- Quartiles: First (Q1, aka Lower) and Third (Q3, aka
Upper)
- Q1 is the median of the values below the median
position
- Q3 is the median of the values above the median
position
- Notes (see examples on next page)
- If n is odd, the median position is (n+1)/2, and
finding the quartiles does not include this value.
- If n is even, the median position is treated (most
commonly) as (n+1)/2, and the two values
(positions) used to compute the median are included
when finding the quartiles (one in each half).
22- Oscar Nominations
- # of Individuals: n = 5
- Median Position: (5+1)/2 = 3
- Positions Below Median Position: 1-2
- Positions Above Median Position: 4-5
- Median of Lower Positions: 1.5
- Median of Upper Positions: 4.5
- ATL/MCO Flights
- # of Individuals: n = 372
- Median Position: (372+1)/2 = 186.5
- Positions Below Median Position: 1-186
- Positions Above Median Position: 187-372
- Median of Lower Positions: 93.5
- Median of Upper Positions: 279.5
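The position rule above can be sketched as a small function. The function name is hypothetical, and statistical software often uses other quartile conventions, so results may differ slightly from packaged output.

```python
from statistics import median


def quartiles(values):
    """Q1 and Q3 as medians of the values below and above the median position."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]            # positions 1 .. floor(n/2)
    upper = ordered[(n + 1) // 2:]       # skips the middle value when n is odd
    return median(lower), median(upper)


# Domestic gross ($M) of the five 2005 Best Picture nominees from the earlier slide.
q1, q3 = quartiles([103, 52, 100, 75, 72])
print(q1, q3)
```

Calling quartiles(range(1, 373)) on the positions 1 to 372 reproduces the flight-data positions 93.5 and 279.5.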
23Outliers - 1.5xIQR Rule
- Outlier: Value that falls a long way from other
values in the distribution
- 1.5xIQR Rule: An observation may be considered an
outlier if it falls more than 1.5 times the
interquartile range above the third (upper)
quartile or more than the same distance below the
first (lower) quartile.
- ATL/MCO Data: Q1 = 76, Q3 = 86, IQR = 10, 1.5xIQR = 15
- High Outliers: Above 86 + 15 = 101 minutes
- Low Outliers: Below 76 - 15 = 61 minutes
- 12 Flights are at 102 minutes or more (Highest is
122). See (modified) boxplot below
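The 1.5xIQR rule amounts to two cut-offs computed from the quartiles. A minimal sketch, with hypothetical function names and the ATL/MCO quartiles quoted above:

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (low, high) cut-offs; values outside them are flagged as outliers."""
    reach = k * (q3 - q1)
    return q1 - reach, q3 + reach


def is_outlier(value, q1, q3):
    """True if value lies beyond either fence of the 1.5xIQR rule."""
    low, high = iqr_fences(q1, q3)
    return value < low or value > high


# Quartiles quoted above for the ATL/MCO flight times (minutes).
low, high = iqr_fences(76, 86)
print(low, high)
print(is_outlier(102, 76, 86), is_outlier(101, 76, 86))
```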
24Measures of Spread - Variance and S.D.
- Deviation: Difference between an observed value
and the overall mean, xi - x̄ (sign is important)
- Variance: Average squared deviation; divides
the sum of squared deviations by n-1 (as opposed
to n) for reasons we see later:
s² = Σ(xi - x̄)² / (n-1)
- Standard Deviation: Positive square root of s²
25Example - 2005 Oscar Movie Revenues
- Mean: x̄ = 80.4
- The Aviator: i=1, x1 = 103, Deviation = 103 - 80.4 = 22.6
- Finding Neverland: i=2, x2 = 52, Deviation = 52 - 80.4 = -28.4
- Million Dollar Baby: i=3, x3 = 100, Deviation = 100 - 80.4 = 19.6
- Ray: i=4, x4 = 75, Deviation = 75 - 80.4 = -5.4
- Sideways: i=5, x5 = 72, Deviation = 72 - 80.4 = -8.4
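Carrying the deviations through to the variance and standard deviation can be sketched as follows. The variable names are illustrative; the standard-library variance and stdev functions use the same n-1 divisor and should agree with the hand computation.

```python
from math import sqrt
from statistics import stdev, variance

gross = [103, 52, 100, 75, 72]           # domestic gross ($M) from the slide above
n = len(gross)
x_bar = sum(gross) / n                    # 402 / 5 = 80.4

deviations = [x - x_bar for x in gross]   # signed distances from the mean
sum_sq = sum(d ** 2 for d in deviations)
s2 = sum_sq / (n - 1)                     # sample variance divides by n-1
s = sqrt(s2)                              # standard deviation

print([round(d, 1) for d in deviations])
print(round(s2, 1), round(s, 2))
```

The deviations always sum to zero, which is why they are squared before averaging.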
26Computer Output of Summary Measures and Boxplot
(SPSS) - ATL/MCO Data
27Linear Transformations
- Often work with transformed data
- Linear Transformation: xnew = a + bx for
constants a and b (e.g. transforming from the metric
system to U.S. units, Celsius to Fahrenheit, etc.)
- Effects
- Multiplying by b causes the mean to be multiplied
by b and the standard deviation to be multiplied
by |b|
- Adding a shifts the mean and all percentiles by
a but does not affect the standard deviation or
spread
- Note that for locations, multiplication by b
precedes addition of a
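The effects of a linear transformation can be checked numerically. The sketch below uses the Celsius to Fahrenheit conversion (a = 32, b = 1.8) on hypothetical temperatures; the function name is illustrative.

```python
from statistics import mean, stdev


def linear_transform(values, a, b):
    """Apply xnew = a + b*x to every value."""
    return [a + b * x for x in values]


# Hypothetical temperatures in Celsius; Fahrenheit = 32 + 1.8 * Celsius.
celsius = [10.0, 15.0, 20.0, 25.0, 30.0]
a, b = 32.0, 1.8
fahrenheit = linear_transform(celsius, a, b)

print(mean(fahrenheit), a + b * mean(celsius))      # location: multiply by b, then add a
print(stdev(fahrenheit), abs(b) * stdev(celsius))   # spread: multiply by |b| only
```

Each printed pair should agree up to floating-point rounding.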
28Density Curves/Normal Distributions
- Continuous (or practically continuous) variables
can lie anywhere along a continuous (or practically
continuous) range of values
- Obtain a histogram of the data (will be irregular,
with rigid blocks as bars over ranges)
- Density curves are smooth approximations (models)
to the coarse histogram
- Curve lies on or above the horizontal axis
- Total area under the curve is 1
- Area under the curve over a range of values
represents the probability of that range
- Normal Distributions: Family of density curves
with very specific properties
29Mean and Median of a Density Curve
- Mean is the balance point of a distribution of
measurements. If the height of the curve
represented weight, it is where the density curve
would balance
- Median is the point where half the area is below
and half the area is above the point
- Symmetric Densities: Mean = Median
- Right Skew Densities: Mean > Median
- Left Skew Densities: Mean < Median
- We will mainly work with means. Notation: μ
30Symmetric (Normal) Distribution
31Right Skewed Density Curve
32Mean is the Balance Point
33Normal Distribution
- Bell-shaped, symmetric family of distributions
- Classified by 2 parameters: Mean (μ) and standard
deviation (σ). These represent location and
spread
- Random variables that are approximately normal
have the following properties with respect to
individual measurements
- Approximately half (50%) fall above (and below)
the mean
- Approximately 68% fall within 1 standard
deviation of the mean
- Approximately 95% fall within 2 standard
deviations of the mean
- Virtually all fall within 3 standard deviations
of the mean
- Notation when X is normally distributed with mean
μ and standard deviation σ: X ~ N(μ, σ)
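The 68/95/virtually-all figures can be checked against the normal distribution in Python's standard library (statistics.NormalDist, available from Python 3.8). The function name prob_within is hypothetical.

```python
from statistics import NormalDist


def prob_within(k, mu=0.0, sigma=1.0):
    """P(mu - k*sigma < X < mu + k*sigma) for X ~ N(mu, sigma)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The result does not depend on the particular mean and standard deviation, only on the number of standard deviations k.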
34Two Normal Distributions
35Normal Distribution
36Example - Heights of U.S. Adults
- Female and Male adult heights (in inches) are well
approximated by normal distributions:
XF ~ N(63.7, 2.5), XM ~ N(69.1, 2.6)
Source: Statistical Abstract of the U.S. (1992)
37Standard Normal (Z) Distribution
- Problem: Unlimited number of possible normal
distributions (-∞ < μ < ∞, σ > 0)
- Solution: Standardize the random variable to have
mean 0 and standard deviation 1: Z = (X - μ)/σ
- Probabilities of certain ranges of values and
specific percentiles of interest can be obtained
through the standard normal (Z) distribution
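39Standard Normal (Z) Distribution
(Figure: standard normal curve. The area to the left of z is the Table Area; the area to the right is 1 - Table Area.)
40(Z-table page, not reproduced: rows give the integer part and 1st decimal of z, columns give the 2nd decimal place)
41(Z-table page, continued: same layout)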
42Finding Probabilities of Specific Ranges
- Step 1 - Identify the normal distribution of
interest (i.e. its mean (μ) and standard
deviation (σ))
- Step 2 - Identify the range of values that you
wish to determine the probability of observing
(XL, XU), where often the upper or lower bound
is ∞ or -∞
- Step 3 - Transform XL and XU into Z-values:
ZL = (XL - μ)/σ, ZU = (XU - μ)/σ
- Step 4 - Obtain P(ZL ≤ Z ≤ ZU) from the Z-table
43Example - Adult Female Heights
- What is the probability a randomly selected
female is 5'10" or taller (70 inches)?
- Step 1 - X ~ N(63.7, 2.5)
- Step 2 - XL = 70.0, XU = ∞
- Step 3 - ZL = (70.0 - 63.7)/2.5 = 2.52, ZU = ∞
- Step 4 - P(X ≥ 70) = P(Z ≥ 2.52)
= 1 - P(Z ≤ 2.52) = 1 - .9941 = .0059 (≈ 1/170)
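The same four steps can be reproduced without a printed table, using statistics.NormalDist from the Python standard library in place of the Z-table lookup. This is a sketch with illustrative variable names, not part of the original slides.

```python
from statistics import NormalDist

mu, sigma = 63.7, 2.5            # adult female heights (inches), from the slide
x_lower = 70.0                   # 5'10"

z_lower = (x_lower - mu) / sigma                 # Step 3: standardize
p_taller = 1 - NormalDist().cdf(z_lower)         # Step 4: upper-tail area

print(round(z_lower, 2), round(p_taller, 4))
```

NormalDist().cdf plays the role of the table: it returns the area to the left of z under the standard normal curve.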
44Finding Percentiles of a Distribution
- Step 1 - Identify the normal distribution of
interest (i.e. its mean (μ) and standard
deviation (σ))
- Step 2 - Determine the percentile of interest
100p% (e.g. the 90th percentile is the cut-off
where 90% of scores are below and 10% are
above).
- Step 3 - Find p in the body of the z-table and
its corresponding z-value (zp) on the outer edge
- If 100p < 50 then use the left-hand page of the table
- If 100p ≥ 50 then use the right-hand page of the table
- Step 4 - Transform zp back to original units:
xp = μ + zp σ
45Example - Adult Male Heights
- Above what height do the tallest 5% of males lie?
- Step 1 - X ~ N(69.1, 2.6)
- Step 2 - Want to determine the 95th percentile
(p = .95)
- Step 3 - P(Z ≤ 1.645) = .95
- Step 4 - x.95 = 69.1 + (1.645)(2.6) = 73.4
(6', 1.4")
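The percentile steps can be sketched the same way, with NormalDist.inv_cdf standing in for the reverse table lookup. Variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 69.1, 2.6            # adult male heights (inches), from the slide
p = 0.95

z_p = NormalDist().inv_cdf(p)    # Step 3: z-value with area p to its left
x_p = mu + z_p * sigma           # Step 4: back to inches

feet, inches = divmod(x_p, 12)   # split total inches into feet and inches
print(round(z_p, 3), round(x_p, 1), int(feet), round(inches, 1))
```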
46Statistical Models
- When making statistical inference it is useful to
write random variables in terms of model
parameters and random errors
- Here μ is a fixed constant and ε is a random
variable
- In practice μ will be unknown, and we will use
sample data to estimate or make statements
regarding its value
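A small simulation can illustrate the idea of a fixed parameter plus a random error. The sketch assumes the additive form X = μ + ε with normally distributed errors; the parameter values, function name, and seed are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean

TRUE_MU = 63.7      # hypothetical "unknown" parameter, used only to generate data
ERROR_SD = 2.5      # hypothetical spread of the random errors


def simulate(n, seed=0):
    """Draw n observations of the form mu + error, with normal errors (an assumption)."""
    rng = random.Random(seed)
    return [TRUE_MU + rng.gauss(0.0, ERROR_SD) for _ in range(n)]


sample = simulate(1000)
estimate = mean(sample)          # the sample mean serves as the estimate of mu
print(round(estimate, 2))
```

In a real study only the sample is observed, so the estimate is reported without ever seeing the true μ.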