Title: Displaying Distributions with Graphs
1- Displaying Distributions with Graphs
2Interesting Problems
- Poker games, lottery,
- Sports statistics,
- Political voting, poll, survey,
- Business, stock market,
- Census,
- Marketing,
- Biological, medical, psychological,
- Practical for decision making
3Recall
- Statistics is the science of data.
- Collecting
- Analyzing
- Decision making
- Data
- Individuals
- Variables
- Categorical variables
- Quantitative variables
4NBA Draft 2003 Top 5 Picks
5Students in STAT 31
- Class Roll
- Variables: PID, College, Class, Degree, Major
- Type: categorical
- How many categories? How many students in each category?
- Equivalently, what is the distribution of each variable?
6Distributions of Variables
- The distribution of a variable indicates what values the variable takes and how often it takes these values.
- For a categorical variable, distribution
- For a quantitative variable, distribution
7Variable Class
8Exploratory Data Analysis (EDA)
- Use statistical tools and ideas to help us examine data
- Goal: to describe the main features of the data
- NEVER skip this
- EDA
- Displaying distributions with
- Displaying distributions with
9Basic Strategies for EDA
- Graphical visualizations
- Numerical summaries
- One variable at a time
- Relationships among the variables
10Graphic Techniques for Categorical Variables
- Bar graph: uses bars to represent the frequencies (or relative frequencies), such that the height of each bar equals the frequency or relative frequency of its category.
- Frequencies: counts
- Relative frequencies: percents
- Bar height indicates count or percent
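The counts and percents behind a bar graph can be sketched in a few lines of Python. The class-year values below are hypothetical (they are not taken from the STAT 31 class roll), and the helper names `frequency_table` and `text_bar_graph` are made up for this illustration.

```python
from collections import Counter

# Hypothetical class-year values for ten students (illustration only).
classes = ["FR", "SO", "SO", "JR", "SO", "SR", "JR", "SO", "FR", "JR"]

def frequency_table(values):
    """Return {category: (count, percent)} in order of first appearance."""
    counts = Counter(values)
    n = len(values)
    return {cat: (c, 100.0 * c / n) for cat, c in counts.items()}

def text_bar_graph(values):
    """Build one text line per category; the bar length equals the count."""
    lines = []
    for cat, (count, pct) in frequency_table(values).items():
        lines.append(f"{cat:<3}{'#' * count} {count} ({pct:.0f}%)")
    return lines

for line in text_bar_graph(classes):
    print(line)
```

Each bar is drawn with one `#` per student, so the bar length plays the role of the bar height on the slide.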
11Graphic Techniques for Categorical Variables
12Graphic Techniques for Quantitative Variables
- Stemplot (Stem-and-Leaf Plot)
- Histogram
- Time plot
13Example Midterm Scores
- The following data set contains the midterm exam scores:
- 74 76 78 88 87 87 53 95 82 79 79 78
- 62 80 77 70 60 60 84 95 85 93 79 84
- 71 85 100 77 72 95 79 83 97 87 73 84
- 74 83 85 95 62 50 86 83 86 36
- Type of variable?
14Stemplot
- Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.
- Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
- Write each leaf in the row to the right of its stem, in increasing order out from the stem.
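The three steps above can be sketched in Python for non-negative whole-number data. The function name `stemplot` is a placeholder chosen for this sketch, and the scores are the midterm data from the example.

```python
from collections import defaultdict

# Midterm exam scores from the example (46 observations).
scores = [
    74, 76, 78, 88, 87, 87, 53, 95, 82, 79, 79, 78,
    62, 80, 77, 70, 60, 60, 84, 95, 85, 93, 79, 84,
    71, 85, 100, 77, 72, 95, 79, 83, 97, 87, 73, 84,
    74, 83, 85, 95, 62, 50, 86, 83, 86, 36,
]

def stemplot(values):
    """Return stemplot rows as strings, smallest stem at the top."""
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting puts leaves in increasing order
        stem, leaf = divmod(v, 10)    # leaf = final digit, stem = the rest
        leaves[stem].append(str(leaf))
    rows = []
    # Include stems with no leaves so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        rows.append(f"{stem:>2} | {''.join(leaves.get(stem, []))}")
    return rows

for row in stemplot(scores):
    print(row)
```

Empty stems (such as 4 here) are printed on purpose: dropping them would hide how far 36 sits from the rest of the scores.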
15Example Midterm Scores of STAT 101
- The following data set contains the midterm exam scores:
- 74 76 78 88 87 87 53 95 82 79 79 78
- 62 80 77 70 60 60 84 95 85 93 79 84
- 71 85 100 77 72 95 79 83 97 87 73 84
- 74 83 85 95 62 50 86 83 86 36
16Example Midterm Scores of STAT 101
- A stem-and-leaf display is as follows:
-  3 | 6
-  4 |
-  5 | 03
-  6 | 0022
-  7 | 012344677889999
-  8 | 02333444555667778
-  9 | 355557
- 10 | 0
- Leaf: last digit; stem: remaining digit(s)
17Back-to-back Stemplot
- Babe Ruth (New York Yankees) 1920-1934
- 54 59 35 41 46 25 47 60 54 46 49 46 41 34 22
- Mark McGwire (Oakland Athletics and St. Louis Cardinals) 1986-2001
- 3 49 32 33 39 22 42 9 9 39 52 58 70 65 32 29
18Back-to-back Stemplot
-          Ruth |   | McGwire
-               | 0 | 3 9 9
-               | 1 |
-           5 2 | 2 | 2 9
-           5 4 | 3 | 2 2 3 9 9
- 9 7 6 6 6 1 1 | 4 | 2 9
-         9 4 4 | 5 | 2 8
-             0 | 6 | 5
-               | 7 | 0
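A back-to-back stemplot shares one column of stems, with one data set's leaves running leftward and the other's running rightward. A minimal Python sketch, assuming non-negative whole numbers and a hypothetical helper named `back_to_back`, using the home run counts listed above:

```python
# Home runs per season, as listed on the slide.
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
mcgwire = [3, 49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70, 65, 32, 29]

def leaves_for(data, stem):
    """Leaves (final digits) of the values with this stem, in increasing order."""
    return [str(v % 10) for v in sorted(data) if v // 10 == stem]

def back_to_back(left, right):
    """Return rows 'left leaves | stem | right leaves' with a shared stem column."""
    combined = left + right
    rows = []
    for stem in range(min(combined) // 10, max(combined) // 10 + 1):
        # Left leaves are reversed so they increase outward from the stem.
        left_txt = " ".join(reversed(leaves_for(left, stem)))
        right_txt = " ".join(leaves_for(right, stem))
        rows.append((left_txt, stem, right_txt))
    width = max(len(left_txt) for left_txt, _, _ in rows)
    return [f"{lt:>{width}} | {s} | {rt}".rstrip() for lt, s, rt in rows]

for row in back_to_back(ruth, mcgwire):
    print(row)
```

The sketch assumes single-digit stems for alignment; wider stems would need the middle column padded as well.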
19Splitting stems and rounding
- For a moderate number of observations:
- Split each stem into two: one with leaves 0-4 and the other with leaves 5-9
- This increases the number of stems and reduces the number of leaves per stem
- Rounding
- If many stems have no leaves or only one leaf, rounding may help.
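Splitting each stem into a 0-4 row and a 5-9 row can be sketched as follows. This is an illustration on the midterm scores from the earlier example; the name `split_stemplot` is hypothetical, and rows with no leaves are simply omitted here.

```python
# Midterm exam scores from the earlier example (46 observations).
scores = [
    74, 76, 78, 88, 87, 87, 53, 95, 82, 79, 79, 78,
    62, 80, 77, 70, 60, 60, 84, 95, 85, 93, 79, 84,
    71, 85, 100, 77, 72, 95, 79, 83, 97, 87, 73, 84,
    74, 83, 85, 95, 62, 50, 86, 83, 86, 36,
]

def split_stemplot(values):
    """Stemplot with each stem split in two: leaves 0-4, then leaves 5-9."""
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        key = (stem, leaf >= 5)       # False: leaves 0-4, True: leaves 5-9
        rows.setdefault(key, []).append(str(leaf))
    # Sorting the keys lists each stem's low half before its high half.
    return [f"{stem:>2} | {''.join(lv)}" for (stem, _), lv in sorted(rows.items())]

for row in split_stemplot(scores):
    print(row)
```

With split stems, the 15 leaves on stem 7 become rows of 6 and 9 leaves, which makes the shape within the 70s easier to see.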
20Spending at a supermarket
21Example A study on litter size
- Data (170 observations)
- 4 6 5 6 7 3 6 4 4 6 4 4 9 5 10 6
6 5 6 8 2 7 7 7 9 3 7 5 7 7 4 5
5 6 7 6 7 8 6 6 7 6 6 7 5 4 5 6 6
1 3 4 7 5 4 7 5 8 8 5 6 8 5 5 4
9 6 7 3 7 7 5 4 6 9 6 7 7 5 7 3 7
6 5 3 7 10 5 6 8 7 5 5 7 5 5 8 9
7 5 7 5 5 5 6 3 7 8 7 7 6 3 4 4 4
7 2 7 8 5 8 6 6 5 6 4 7 5 5 6 9
3 5 4 8 3 9 8 3 6 5 4 7 8 4 8 6 8
5 6 4 3 8 8 6 9 5 5 6 6 7 6 8 6
11 6 5 6 6 3
22Stem-and-leaf plot for pups
- 0 | 122333333333333344... (35)
- 0 | 555555555555555555555555... (132)
- 1 | 001
23Limitations of Stemplot
- Awkward for large data sets
- Splitting stem/rounding is not very helpful.
24Histogram
- A histogram breaks the range of the values of a quantitative variable into intervals and displays only the count or percent of the observations that fall into each interval.
- You can choose any convenient number of intervals.
- Intervals must be of equal width.
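The counting step behind a histogram can be sketched in Python. The example below bins the midterm scores from the earlier example into equal-width intervals of width 10 starting at 30; the function name `histogram_counts` and the choice of intervals are assumptions made for this illustration.

```python
# Midterm exam scores from the earlier example (46 observations).
scores = [
    74, 76, 78, 88, 87, 87, 53, 95, 82, 79, 79, 78,
    62, 80, 77, 70, 60, 60, 84, 95, 85, 93, 79, 84,
    71, 85, 100, 77, 72, 95, 79, 83, 97, 87, 73, 84,
    74, 83, 85, 95, 62, 50, 86, 83, 86, 36,
]

def histogram_counts(values, start, width, n_bins):
    """Count observations in n_bins equal-width intervals [lo, lo + width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)   # index of the interval containing v
        if 0 <= idx < n_bins:             # values outside the range are ignored
            counts[idx] += 1
    return counts

counts = histogram_counts(scores, start=30, width=10, n_bins=8)
for i, c in enumerate(counts):
    lo = 30 + 10 * i
    print(f"[{lo:>3}, {lo + 10:>3}) {'#' * c} {c}")
```

Each interval includes its left endpoint and excludes its right endpoint, so a score of 80 is counted in [80, 90), not in [70, 80).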
25Example A study on litter size
26Example Call Center Data
- Financial firm call center
- Calls handled by AVI within 60 seconds
- October 666
- December 523
- Avi Service Time Data
27October
28December
29Notes for Making Histogram
- Choose the number of classes sensibly
- Too few classes: skyscraper graph
- Too many classes: pancake graph
- Sturges' rule
- Choose the number of classes k such that
- k ≈ 1 + log2(n)
- where n is the sample size
- Intervals must be of equal width.
- Areas of the bars are proportional to the frequency.
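As a rough illustration, Sturges' rule is commonly stated as k = 1 + log2(n), rounded up to a whole number. The sketch below assumes that form (the slide's own formula did not survive extraction) and applies it to the sample sizes that appear in this lecture; the function name `sturges_classes` is hypothetical.

```python
import math

def sturges_classes(n):
    """Suggested number of classes under Sturges' rule: ceil(1 + log2(n))."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return math.ceil(1 + math.log2(n))

# Sample sizes from this lecture: midterm scores, litter sizes, October calls.
for n in (46, 170, 666):
    print(n, sturges_classes(n))
```

The result is only a starting point; the number of classes should still be adjusted so the histogram is neither a skyscraper nor a pancake.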
30Examining Distributions
- Overall Pattern
- Shape
- Center (numerical, Lecture 3)
- midpoint
- Spread (numerical, Lecture 3)
- range
- Deviations
- Outliers: values that fall outside the overall pattern.
31Shapes of Distributions
- Graphs can help to determine shapes.
- Modes: peaks of a distribution.
- Unimodal: one peak
- Bimodal: two peaks
- Symmetric or skewed?
32Shakespeare's Words
33A unimodal histogram
34Tuition and fees
35A bimodal histogram
36Shakespeare's Words
37Skewness
- Left skewed
- Right skewed
38Iowa Test of Basic Skills vocabulary scores
39A study on litter size
40Bell-shaped Histograms
41Summary Shapes of Distributions
- Symmetric
- a histogram in which the right half is a mirror image of the left half.
- Skewed to the right
- a histogram in which the right tail is more stretched out than the left (long tail to the right).
- Skewed to the left
- a histogram in which the left tail is more stretched out than the right (long tail to the left).
- Number of modal classes
- the number of distinct peaks in a histogram
- Bell-shaped
- a histogram that looks like a bell.
42Time plots
- A time plot of a variable plots each observation against the time at which it was measured.
- Time: x-axis
- Variable: y-axis
- Examples: stock price, unemployment rate, daily temperature
- Great for identifying changing patterns related to time.
- What to look for
- .
- .
- .
43Example Number of Suicides in USA (1900-1970)
44Call Center Daily Call Volume in Sep. 2002
45Call Center Monthly Call Volume in 2002
46Outliers
- Observations that lie outside the overall pattern of a distribution.
- Possible reasons
- error in data entry (most likely reason)
-
-
-
- extraordinary individuals (Jordan's salary)
47Handling Outliers
- Detect it using graphical and numerical methods.
- Check the data to make sure it was entered correctly.
- Reduce the influence of the outlier
- Delete the observation (BE CAREFUL!)
- Use transformations or robust methods.
48Speed of Light (Histogram)
49Speed of Light (Time plot)
50Remember
- Distribution of variables
- Examine distributions
- Overall pattern
- Shape
- Symmetric or skewed
- How many modes?
- Bell-shaped
- Outliers
- Graphical tools for categorical data
- Bar graph
- Pie chart
- Graphical tools for quantitative data
- Stemplot
- Histograms
- Time plots