Title: Chapter 3: Displaying and Describing Categorical Data
1Chapter 3: Displaying and Describing Categorical Data
- The three rules of data analysis won't be
difficult to remember:
- Make a picture: things may be revealed that are
not obvious in the raw data. These will be things
to think about.
- Make a picture: important features of and patterns
in the data will show up. You may also see things
that you did not expect.
- Make a picture: the best way to tell others about
your data is with a well-chosen picture.
2Frequency Tables: Making Piles
- We can pile the data by counting the number of
data values in each category of interest.
- We can organize these counts into a frequency
table, which records the totals and the category
names.
- People on Titanic by Ticket Class
3Frequency Tables: Making Piles (cont.)
- A relative frequency table is similar, but gives
the percentages (instead of counts) for each
category.
Relative Frequency of People on Titanic by Ticket
Class
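As a minimal sketch of how the piles become a table, the Python below tallies one label per person with `collections.Counter` and then converts the counts to percentages. The class counts (325, 285, 706, 885) are an assumption taken from the commonly cited Titanic data set, since the slide's own table is not reproduced here; the names `ticket_class`, `freq`, and `rel_freq` are illustrative only.

```python
from collections import Counter

# Assumed counts from the commonly cited Titanic data set (not from the slide).
assumed_counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}

# Rebuild one label per person so the counting step is explicit.
ticket_class = [label for label, n in assumed_counts.items() for _ in range(n)]

# Frequency table: number of data values in each category.
freq = Counter(ticket_class)
total = sum(freq.values())

# Relative frequency table: percentage of the whole in each category.
rel_freq = {label: 100 * n / total for label, n in freq.items()}

for label in assumed_counts:
    print(f"{label:<7}{freq[label]:>6}{rel_freq[label]:>8.1f}%")
print(f"{'Total':<7}{total:>6}{sum(rel_freq.values()):>8.1f}%")
```

The two tables differ only in the final division by the total, which is why they always tell the same story about which pile is largest.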
4Displaying Data: What's Wrong With This Picture?
- You might think that a good way to show the
Titanic data is with this display.
- There are 2 things wrong with it.
5The Area Principle
- The ship display makes it look like most of the
people on the Titanic were crew members, with a
few passengers along for the ride.
- When we look at each ship, we see the area taken
up by the ship, instead of the length of the
ship.
- The ship display violates the area principle:
- The area occupied by a part of the graph should
correspond to the magnitude of the value it
represents.
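A rough numerical sketch of why scaled pictures mislead, assuming each ship is drawn with its length proportional to the count and its shape unchanged: the height then grows along with the length, so the area grows with the square of the count. The two counts are assumed from the commonly cited Titanic data set, and `bar_area`, `picture_area`, and the `aspect` value are hypothetical helpers for illustration.

```python
# Assumed counts (commonly cited Titanic totals); not taken from the slide.
counts = {"First": 325, "Crew": 885}

def bar_area(count, width=1.0):
    # A bar has fixed width, so its area is proportional to the count.
    return count * width

def picture_area(count, aspect=0.25):
    # A picture scaled to length `count` keeps its shape, so its height
    # is aspect * count and its area grows with the count squared.
    return count * (aspect * count)

true_ratio = counts["Crew"] / counts["First"]
bar_ratio = bar_area(counts["Crew"]) / bar_area(counts["First"])
picture_ratio = picture_area(counts["Crew"]) / picture_area(counts["First"])

print(f"true ratio of counts  {true_ratio:.2f}")
print(f"ratio of bar areas    {bar_ratio:.2f}")
print(f"ratio of picture areas {picture_ratio:.2f}")
```

Under these assumptions the bar areas reproduce the ratio of the counts, while the picture areas exaggerate it to its square.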
6More on Displaying Data (not in Text)
- Your table or graphical display should always
have a title or caption, so that casual readers
can understand what is being presented at a
glance.
- An informative display often leads people to read
your paper!
- If you are writing a paper, it is imperative that
you attribute the source of your data.
- Keep the display easy-to-read
- Use simple fonts, colors/patterns
- Make sure that your reader can distinguish between
different categories
- Common problem: graphs that look good on your
color monitor may be hard to read when printed
with a B/W printer
7Bar Charts
- A bar chart displays the distribution of a
categorical variable, showing the counts for each
category next to each other for easy comparison.
- A bar chart stays true to the area principle.
- Thus, it is a better display for this data.
- Don't forget a title (or at least a caption)!
People on Titanic by Ticket Class
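A bar chart can be sketched even in plain text. The snippet below is one possible illustration, not the chart from the slide: every bar is one text row high, so only its length varies with the count. The counts are assumed from the commonly cited Titanic data set, and `PEOPLE_PER_MARK` is a hypothetical scale chosen to keep the bars short.

```python
# Assumed counts (commonly cited Titanic totals); not taken from the slide.
counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}

PEOPLE_PER_MARK = 25  # hypothetical scale: one '#' per 25 people

def bar(count):
    # Every bar has the same height (one text row), so length alone
    # carries the value and the area principle holds.
    return "#" * round(count / PEOPLE_PER_MARK)

bars = {label: bar(n) for label, n in counts.items()}

print("People on Titanic by Ticket Class")
for label, n in counts.items():
    print(f"{label:<7}{bars[label]} {n}")
```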
8Bar Charts (cont.)
- A relative frequency bar chart displays the
relative proportion of counts for each category.
- A relative frequency bar chart also stays true to
the area principle.
- Replacing counts with percentages in the ship
data:
- Don't forget a title/caption
Percentage of Titanic Passengers in each Ticket
Class
9Pie Charts
- When you are interested in parts of the whole, a
pie chart might be your display of choice.
- Pie charts show the whole group of cases as a
circle.
- They slice the circle into pieces whose size is
proportional to the fraction of the whole in each
category.
- Don't forget a title/caption
Number of Titanic Passengers by Class
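The slicing rule can be made concrete: each category gets a share of the 360 degrees equal to its fraction of the whole. The sketch below assumes the commonly cited Titanic counts (the slide's chart is not reproduced here), and `angles` is an illustrative name.

```python
# Assumed counts (commonly cited Titanic totals); not taken from the slide.
counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}
total = sum(counts.values())

# Each slice's central angle is its fraction of the whole times 360 degrees.
angles = {label: 360 * n / total for label, n in counts.items()}

for label, angle in angles.items():
    print(f"{label:<7}{angle:>6.1f} degrees")
print(f"{'Total':<7}{sum(angles.values()):>6.1f} degrees")
```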
10Contingency Tables
- A contingency table allows us to look at two
categorical variables together.
- It shows how individuals are distributed along
each variable, contingent on the value of the
other variable.
- Example of a contingency table of ticket class
and survival.
Survival and Class of Titanic Passengers
11Contingency Tables (cont.)
- Each cell of the contingency table gives the
count for a combination of values of the two
variables.
- For example, the second cell in the crew column
tells us that 673 crew members died when the
Titanic sank.
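One simple way to hold such a table in code is a dictionary of rows, so that a cell is looked up by the pair (survival status, ticket class). In this sketch only the 673 crew deaths are stated on the slide; the remaining counts are an assumption taken from the commonly cited Titanic data set.

```python
# Assumed counts from the commonly cited Titanic data set.
# Only the 673 crew deaths are stated on the slide itself.
table = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead":  {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}
classes = ["First", "Second", "Third", "Crew"]

# Print the table with one row per survival status.
print(f"{'':<6}" + "".join(f"{c:>8}" for c in classes))
for survival, row in table.items():
    print(f"{survival:<6}" + "".join(f"{row[c]:>8}" for c in classes))

# The second cell in the Crew column: crew members who died.
crew_died = table["Dead"]["Crew"]
print("Crew members who died:", crew_died)
```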
12Contingency Tables
- The margins of the table, both on the right and
on the bottom, give totals and the frequency
distributions for each of the variables.
- Each frequency distribution is called a marginal
distribution of its respective variable.
- The marginal distribution of Survival is:
- (Can also phrase as: What percent of those aboard
the Titanic survived?)
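The margins are just row and column sums. A minimal sketch, again assuming the commonly cited Titanic counts rather than the slide's own table:

```python
# Assumed counts from the commonly cited Titanic data set.
table = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead":  {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}

# Right margin: totals for each Survival category.
row_totals = {survival: sum(row.values()) for survival, row in table.items()}

# Bottom margin: totals for each ticket Class.
col_totals = {}
for row in table.values():
    for label, n in row.items():
        col_totals[label] = col_totals.get(label, 0) + n

grand_total = sum(row_totals.values())

# Marginal distribution of Survival, as percentages of everyone aboard.
survival_pct = {s: 100 * n / grand_total for s, n in row_totals.items()}

print(row_totals, col_totals, grand_total)
print({s: round(p, 1) for s, p in survival_pct.items()})
```

Under these assumed counts, roughly a third of those aboard survived, which is the answer to the rephrased question.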
13Conditional Distributions
- A conditional distribution shows the distribution
of one variable for just the individuals who
satisfy some condition on another variable.
- The following is the conditional distribution of
ticket Class, conditional on having survived:
For example, 28.6% of those who survived were
from First Class. How were these percentages calculated?
What might be another way to phrase this?
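One way to answer the calculation question: restrict attention to the survivors' row and divide each cell by that row's total. The cell counts below are an assumption from the commonly cited Titanic data set; they do reproduce the 28.6% quoted for First Class.

```python
# Assumed survivor counts by class (commonly cited Titanic data set).
alive = {"First": 203, "Second": 118, "Third": 178, "Crew": 212}

survivors = sum(alive.values())

# Conditional distribution of Class, given Survival = Alive:
# each count is divided by the number of survivors, not by everyone aboard.
class_given_alive = {label: 100 * n / survivors for label, n in alive.items()}

for label, pct in class_given_alive.items():
    print(f"{label:<7}{pct:>6.1f}%")
```

Another way to phrase the First Class figure: among the people who survived, 28.6% held First Class tickets.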
14Conditional Distributions (cont.)
- The following is the conditional distribution of
ticket Class, conditional on having perished:
How were these percentages calculated?
What might be another way to phrase this?
15Conditional Distributions (cont.)
- The conditional distributions tell us that there
was a difference in class for those who survived
and those who perished.
- Rather than a table of numbers, this is better
shown with pie charts of the two distributions.
Titanic Survivors and Non-survivors, by Class
What would these pie charts look like if class
had no influence on survival?
16Conditional Distributions (cont.)
- We see that the distribution of Class for the
survivors is different from that of the
non-survivors.
- This leads us to believe that Class and Survival
are associated, and are not independent.
- The variables would be considered independent
when the distribution of one variable in a
contingency table is the same for all categories
of the other variable.
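The independence criterion can be checked numerically by comparing the conditional distributions row by row. This is a descriptive sketch, not a formal test; the counts are assumed from the commonly cited Titanic data set, and `conditional_pct` and `max_gap` are hypothetical helper names.

```python
# Assumed counts from the commonly cited Titanic data set.
table = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead":  {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}

def conditional_pct(row):
    # Distribution of Class within one Survival group, in percent.
    total = sum(row.values())
    return {label: 100 * n / total for label, n in row.items()}

def max_gap(two_way):
    # Largest difference, in percentage points, between the conditional
    # distributions of any two rows. Zero means identical distributions.
    dists = [conditional_pct(row) for row in two_way.values()]
    return max(
        max(d[label] for d in dists) - min(d[label] for d in dists)
        for label in dists[0]
    )

for survival, row in table.items():
    pct = conditional_pct(row)
    print(survival, {label: round(p, 1) for label, p in pct.items()})
print(f"largest gap: {max_gap(table):.1f} percentage points")
```

If Class and Survival were independent, the two printed distributions would match and the gap would be close to zero; with these counts the gap is large.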
17Segmented Bar Charts
Titanic Survivors and Non-survivors, by Class
- A segmented bar chart displays the same
information as a pie chart, but in the form of
bars instead of circles.
- Here is the segmented bar chart for ticket Class
by Survival status
18What Can Go Wrong?
- Don't violate the area principle.
- While some people might like the pie chart on the
left better, it makes it harder to compare fractions
of the whole, which a well-done pie chart lets you do.
19What Can Go Wrong? (cont.)
- Keep it honest: make sure your display shows what
it says it shows.
- This plot of the percentage of high-school
students who engage in specified dangerous
behaviors has a problem. Can you see it?
20What Can Go Wrong? (cont.)
- Don't confuse similar-sounding percentages: pay
particular attention to the wording of the
context.
- Don't forget to look at the variables separately
too: examine the marginal distributions, since it
is important to know how many cases are in each
category.
- Be sure to use enough individuals!
- Do not make a report like "We found that 66.67% of
the rats improved their performance with
training. The other rat died."
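To see how similar-sounding percentages differ, the sketch below computes three statements about the same cell of a two-way table, each with a different denominator. The counts are assumed from the commonly cited Titanic data set; the variable names are illustrative.

```python
# Assumed counts from the commonly cited Titanic data set.
table = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead":  {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}

first_alive = table["Alive"]["First"]
survivors = sum(table["Alive"].values())
first_class = table["Alive"]["First"] + table["Dead"]["First"]
everyone = sum(sum(row.values()) for row in table.values())

# "What percent of survivors were in First Class?" (denominator: survivors)
pct_of_survivors_in_first = 100 * first_alive / survivors

# "What percent of First Class survived?" (denominator: First Class)
pct_of_first_who_survived = 100 * first_alive / first_class

# "What percent were First Class AND survived?" (denominator: everyone)
pct_first_and_survived = 100 * first_alive / everyone

print(f"{pct_of_survivors_in_first:.1f}%")
print(f"{pct_of_first_who_survived:.1f}%")
print(f"{pct_first_and_survived:.1f}%")
```

The numerator is the same in all three; only the wording of the question tells you which total to divide by.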
21What Can Go Wrong? (cont.)
- Don't overstate your case: don't claim something
you can't.
- Don't use unfair or silly averages: this could
lead to Simpson's Paradox, so be careful when you
average one variable across different levels of a
second variable.
22Simpson's Paradox Example
- Chris has a 3.20 SFSU GPA
- Sean has a 3.34 SFSU GPA
- Who seems to do better in SFSU classes, Chris or
Sean?
- Who do you think is likely to get a better grade
in DS412 (Operations Management)?
- What else might be useful to know?
23Simpson's Paradox, Continued
- Is comparing Chris's and Sean's GPAs a fair
assessment of their abilities to get good grades
in particular classes?
- What data is shared in common?
- Who do you think is likely to get a better grade
in DS412 (Operations Management) now?
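The slides give only the two overall GPAs, so the breakdown below is entirely hypothetical: it invents a split into "quantitative" and "other" courses, with made-up unit counts and per-type GPAs chosen so that the overall figures come out to 3.20 and 3.34. It is a sketch of how Simpson's Paradox can arise, not a reconstruction of the actual data.

```python
# HYPOTHETICAL breakdown: (units taken, GPA in those units) per course type.
# Only the overall GPAs of 3.20 and 3.34 come from the slides.
records = {
    "Chris": {"quantitative": (40, 3.10), "other": (10, 3.60)},
    "Sean":  {"quantitative": (10, 2.90), "other": (40, 3.45)},
}

def overall_gpa(by_type):
    # Unit-weighted average across course types.
    units = sum(u for u, _ in by_type.values())
    points = sum(u * g for u, g in by_type.values())
    return points / units

for name, by_type in records.items():
    detail = ", ".join(f"{t}: {g:.2f}" for t, (_, g) in by_type.items())
    print(f"{name}: overall {overall_gpa(by_type):.2f} ({detail})")
```

In this invented scenario Chris has the higher GPA within each course type yet the lower overall GPA, because most of Chris's units are in the type where both students earn lower grades. That is why knowing which classes each student took is useful before comparing the averages.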
24What have we learned?
- We can summarize categorical data by counting the
number of cases in each category (expressing
these as counts or percents).
- We can display the distribution in a bar chart or
pie chart.
- And, we can examine two-way tables called
contingency tables, examining marginal and/or
conditional distributions of the variables.
25Additional Examples
- Pr 9: A May 2001 Gallup Poll found that many
Americans believe in supernatural phenomena.
The poll was based on telephone responses from
1012 randomly selected adults.
- Is it reasonable to conclude 66% of those polled
believe in either Ghosts or Astrology?
- Can you tell what % of people did not believe in
any of these phenomena? Explain.
- What is an appropriate graph?
26Chart for Example 9
27Additional Examples
- Pr 22: A survey of autos in the student lot and
staff lot at SFSU classified cars by country of
origin
- What % of cars surveyed were foreign?
- What % of the American cars were student owned?
- What % of students own American cars?
- What is the marginal distribution of origin?
- What is the conditional distribution of origin by
owner type?
- Do you think that car origin is independent of
owner type? Use a graph to explain your argument.
28Example 22 continued
- First create a contingency table:
Origin of Students' and Staff's Cars
- Answers
- a) (45 + 102)/359 = 41%
- b) We have 212 American cars, of which 107 were
student driven: 50.5%
- c) We have 195 students, of whom 107 drove
American cars: 55%
- (Note that b and c look similar but are not the
same question, nor do they have the same answer!)
- Marginal distribution of car origin:
- American 212/359 = 59%, European 13%, Asian 28%
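The answers above can be checked with a few divisions. This sketch uses only the totals quoted in the answers, reading the numerator in answer a) as 45 European plus 102 Asian cars (an interpretation, though it agrees with the 13% and 28% in the marginal distribution). The variable names are illustrative.

```python
# Totals quoted in the answers; 45 European and 102 Asian is an
# interpretation of the numerator in answer a).
total_cars = 359
origin_totals = {"American": 212, "European": 45, "Asian": 102}
students_total = 195
student_american = 107

# a) Percent of all cars surveyed that were foreign.
pct_foreign = 100 * (origin_totals["European"] + origin_totals["Asian"]) / total_cars

# b) Percent of the American cars that were student driven.
pct_american_student = 100 * student_american / origin_totals["American"]

# c) Percent of the students who drove American cars.
pct_student_american = 100 * student_american / students_total

# Marginal distribution of car origin.
marginal = {origin: 100 * n / total_cars for origin, n in origin_totals.items()}

print(f"a) {pct_foreign:.0f}%")
print(f"b) {pct_american_student:.1f}%")
print(f"c) {pct_student_american:.0f}%")
print({origin: f"{pct:.0f}%" for origin, pct in marginal.items()})
```

Answers b) and c) share the numerator 107 and differ only in the denominator, which is why they look similar but are not the same.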
29Example 22 continued
- f) Now find Conditional Distributions for each
driver type
- Does Car Origin Seem independent of type of
driver?
- What is an even better way to show this?
30Example 22: Extension
- Pick an appropriate graph to show the difference
between the two distributions
31Example: Old Test Problem
- The Bookstore will run a contest where the prize
is a pair of concert tickets and must decide
between the upcoming Rolling Stones and the Black
Eyed Peas concerts. They want to figure out
which will generate more excitement, so they paid
a student to run a survey of students and staff
(profs, administrators, and other non-students)
on preferences. The student produced the
following table:
- A) What percent of people surveyed preferred
tickets to the Rolling Stones?
- B) What percent of people surveyed were Staff?
- C) What percent of the students preferred tickets
to the Rolling Stones?
- D) What percent of those who preferred tickets to
the Rolling Stones were students?
- E) What percent of respondents were students and
preferred tickets to the Rolling Stones?
- F) Does it appear that preference for concert
tickets is independent of a respondent's being
staff or a student? Show why using calculations
or graphics.
32Example: Old Test Problem
- First, it is probably useful to fill out the
contingency table with the marginal
distributions.
- A) What percent of people surveyed preferred
tickets to the Rolling Stones?
- B) What percent of people surveyed were Staff?
- C) What percent of the students preferred tickets
to the Rolling Stones?
- D) What percent of those who preferred tickets to
the Rolling Stones were students?
- E) What percent of respondents were students and
preferred tickets to the Rolling Stones?
- F) Does it appear that preference for concert
tickets is independent of a respondent's being
staff or a student? Show why using calculations
or graphics.