Title: Qualitative (Categorical) Data
1Practical Applications of Statistics or So you've
got data... NOW WHAT?
With special thanks to Herb McGrath and Laura
Simon of the Penn State Stats Dept.
2What is Statistics?
Statistics is a branch of mathematics concerned with interpreting data.
- Statistics guides all aspects of your Science
Fair project
- Experiment Design/Data Collection
- Summarizing Data
- Interpreting Data
- Drawing Conclusions from Data
3Data is a good start; now for a good finish!
- Turning data into information
- Proving that an experimental factor had an effect
- Inferring relationships between factors
- Determining how confident you can be in your
determinations
4Segment I: Understanding Error
- It is important to minimize error and understand
it!
- Some data is inherently biased
- Selection bias
- Measurement bias
5Example: Sampling Bias
- Dog Obedience Experiment: 1000 people are invited
to a public park to demonstrate the ability of
their dogs to respond to commands. 500 turn out
with their dogs. The experimenters document the
number of commands given to each dog and the
number that each dog responds to appropriately. They
find that the dogs respond an astonishing 71% of
the time! What's wrong with this study?
6Example: Measurement Bias
- The dogs are weighed to determine if large dogs
respond to commands better than small dogs. It is
later determined that the scale was a postage
scale and couldn't weigh packages above 10 lbs.
All dogs weighing more than 10 lbs were recorded
as weighing 10 lbs.
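A minimal sketch of why the capped scale biases the result. The dog weights below are made up for illustration and are not data from the study.

```python
from statistics import mean

# Hypothetical true dog weights in lbs (invented for this sketch).
true_weights = [6.0, 8.5, 12.0, 25.0, 40.0, 9.0, 60.0]

SCALE_LIMIT_LBS = 10.0  # the postage scale cannot read above this value

# Every dog heavier than the limit is recorded as exactly the limit.
recorded_weights = [min(w, SCALE_LIMIT_LBS) for w in true_weights]

true_mean = mean(true_weights)
recorded_mean = mean(recorded_weights)
bias = recorded_mean - true_mean  # negative: the scale under-reports

print(f"true mean:     {true_mean:.2f} lbs")
print(f"recorded mean: {recorded_mean:.2f} lbs")
print(f"bias:          {bias:.2f} lbs")
```

The error always pushes in the same direction, which is what separates a bias from random error: collecting more dogs does not make it go away.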
7Segment I: Understanding Error
- Some data is inherently biased
- Selection bias
- Measurement bias
- All non-biased data has a STANDARD ERROR OF
MEASUREMENT
- This is the precision to which the
instrument/experimenter can accurately record the
data
- Data is randomly distributed around the true
value
- Note: Most systems also have randomness of their
own
8Example: Random Error
- On the other hand, when have you ever seen a dog
stand still on a scale? The dogs shake and it is
impossible to measure their weight to better than
an accuracy of 5.
The Bottom Line: Minimize your sources of error
and understand the error that you can't get rid
of.
9Segment I: Understanding Error
- Some data is inherently biased
- Selection bias
- Measurement bias
- All non-biased data has a STANDARD ERROR OF
MEASUREMENT
- This is the precision to which the
instrument/experimenter can accurately record the
data
- Data is randomly distributed around the true
value
- Note: Most systems also have randomness of their
own
- The average data point differs from the mean of
the data by the Standard Deviation
10Example: Fish Catches
Vessel A and Vessel B caught the same
average number of fish every day but obviously
have very different typical days. There are
ways to quantify these differences.
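As a rough illustration of two vessels with the same average but very different typical days, here is a sketch with invented daily catches. The numbers are hypothetical and are not the data shown on the slide.

```python
from statistics import mean, stdev

# Hypothetical daily catches (number of fish) over one week.
vessel_a = [48, 52, 50, 49, 51, 50, 50]   # steady catches
vessel_b = [10, 95, 0, 80, 45, 100, 20]   # feast or famine

for name, catches in (("Vessel A", vessel_a), ("Vessel B", vessel_b)):
    print(
        f"{name}: mean = {mean(catches):.1f}, "
        f"range = {min(catches)} to {max(catches)}, "
        f"standard deviation = {stdev(catches):.1f}"
    )
```

Both means come out to 50 fish per day, while the range and the standard deviation expose the difference.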
11The Frequency Histogram
[Figure: frequency histogram. Horizontal axis: average fish weight in lbs (bin width 0.5 lbs), 0 to 10. Vertical axis: number of members in bin, 0 to 250.]
12The Gaussian Distribution
[Figure: histogram with a Gaussian shape. Horizontal axis: weight in lbs (bin width 0.5 lbs), 0 to 10. Vertical axis: number of members in bin, 0 to 250.]
13The Gaussian Distribution
[Figure: Gaussian curve. Horizontal axis: weight in lbs (no bins), 0 to 10. Vertical axis: likelihood (arbitrary units), 0 to 0.4.]
14The Gaussian Distribution
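15The Gaussian Distribution
[Figure: Gaussian curve labeled with the share of data in each band on either side of the mean: 34.1% within one standard deviation, 13.6% between one and two, 2.14% between two and three.]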
16Three types of "The Average"
- The mean: Bill Gates walks into a coffee shop and
the average person in the shop is rich!
- The median: Half of all people are taller than me
and half are shorter. I am of the average
height.
- The mode: The largest number of golfers shoot an
88.
Important: In most data sets, most data is not
the average.
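The three averages can be compared with Python's standard statistics module. This is only a sketch; the net worth figures are invented to mimic the coffee shop story.

```python
from statistics import mean, median, mode

# Hypothetical net worths (thousands of dollars) of five coffee shop patrons.
patrons = [40, 55, 55, 70, 90]

# One extremely rich customer walks in (value invented for illustration).
with_rich_customer = patrons + [100_000_000]

for label, data in (("before", patrons), ("after", with_rich_customer)):
    print(
        f"{label}: mean = {mean(data):,.1f}, "
        f"median = {median(data)}, mode = {mode(data)}"
    )
```

The mean jumps by several orders of magnitude, while the median and the mode barely move.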
17Example: Fish Catches
If the average data point is not at the
average of the data, then where is it?
Deviation: The distance of a measurement
from the mean.
Variance: The sum of the squared deviations of n
measurements from the mean, divided by n-1.
Standard deviation: The square root of the
variance.
18What is a standard deviation?
- It is the typical (standard) difference
(deviation) of an observation from the mean.
- Think of it as the average distance a data point
is from the mean, although this is not strictly
true.
To calculate the standard deviation (s):
- Subtract the average from each data point and
square the difference.
- Sum the squares.
- Divide the sum by the number of samples minus
one. This is the variance.
- Take the square root of the variance.
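The calculation can be sketched step by step as follows. Dividing the summed squares by the number of samples minus one gives the variance, and a final square root gives the standard deviation. The function name and the measurements are illustrative assumptions; the result is checked against statistics.stdev from the standard library.

```python
from math import sqrt
from statistics import stdev


def sample_standard_deviation(data):
    """Sample standard deviation, computed one step at a time."""
    n = len(data)
    average = sum(data) / n
    # Subtract the average from each data point and square the difference.
    squared_differences = [(x - average) ** 2 for x in data]
    # Sum the squares.
    total = sum(squared_differences)
    # Divide by the number of samples minus one: this is the variance.
    variance = total / (n - 1)
    # The standard deviation is the square root of the variance.
    return sqrt(variance)


measurements = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
s = sample_standard_deviation(measurements)
print(f"s = {s:.3f}")
print(f"statistics.stdev gives {stdev(measurements):.3f}")
```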
19So how do our fishing boats compare?
20So you have the mean (µ) and the standard
deviation (s). Now what?
The standard score: z = (Data Point - Mean) /
Standard Deviation
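A short sketch of the standard score as a function. The function name and the sample numbers are assumptions made for illustration.

```python
def standard_score(data_point, mean, standard_deviation):
    """How many standard deviations a data point lies from the mean."""
    return (data_point - mean) / standard_deviation


# Hypothetical example: a value of 62 in a data set with mean 50 and s = 8.
z = standard_score(62.0, 50.0, 8.0)
print(f"z = {z}")
```

A positive z means the point lies above the mean; a negative z means it lies below.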
21Common Question: What value and error bars should
I report for this data?
- For each data set, find the mean.
- For each data set, find the standard deviation.
- Plot the mean as a dot.
- Plot 2 standard deviations above and below the
mean.
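The recipe above can be sketched without a plotting library by computing the dot and the ends of each error bar. The plant heights and fertilizer amounts are invented, and error_bar is a hypothetical helper name.

```python
from statistics import mean, stdev

# Hypothetical plant heights (cm), keyed by fertilizer amount (grams).
data_sets = {
    0: [10.0, 12.0, 11.0, 13.0],
    5: [14.0, 15.0, 17.0, 16.0],
    10: [19.0, 22.0, 20.0, 21.0],
}


def error_bar(values, k=2):
    """Return (mean, lower end, upper end) for a bar of k standard deviations."""
    m = mean(values)
    half_width = k * stdev(values)
    return m, m - half_width, m + half_width


bars = {amount: error_bar(values) for amount, values in data_sets.items()}
for amount, (m, low, high) in bars.items():
    print(f"{amount:>2} g: mean = {m:.2f}, bar from {low:.2f} to {high:.2f}")
```

Each tuple holds the value to plot as the dot and the two ends of its error bar.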
22Lab 1: Make A Scatter Plot of Data with Error Bars
- Open c\StatsLab.xls
- In sheet 1 is some data. Use the functions
AVERAGE() and STDEV() to put the means and
standard deviations where they are indicated.
- Make a scatter plot of the means versus the
independent variable, the quantity of fertilizer.
- Add error bars of 2s.
23Common Question: Is there really a difference
between these two data sets?
- Experimental Design: The control versus the
experimental.
- Collect data from both samples.
- Find the mean and standard deviation of each.
- Determine the standard score (z) of the one
versus the other.
- If one is 2 standard deviations different from
the other's mean, then the sets probably show a
difference.
- The confidence with which you can say that the
two are different depends on z
- z = 1: about 68% confidence; z = 2: about 95%
confidence; z = 3: about 99.7% confidence
- To find your confidence to report, look up a
table of STANDARD NORMAL PROBABILITIES
But what if you have very few data points?
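For readers without a table of standard normal probabilities at hand, the confidence for a given z can be sketched with the error function from Python's math module. The function name is an assumption; the formula is the fraction of a Gaussian lying within z standard deviations of the mean.

```python
from math import erf, sqrt


def two_sided_confidence(z):
    """Fraction of a Gaussian lying within z standard deviations of the mean."""
    return erf(abs(z) / sqrt(2))


for z in (1, 2, 3):
    print(f"z = {z}: confidence = {two_sided_confidence(z):.4f}")
```

This gives about 0.683, 0.954 and 0.997 for z = 1, 2 and 3.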
24Lab 2: Determine whether data set differences are
statistically significant.
- The second worksheet contains the data from
Lab 1.
- Compute the standard score (z) for the
differences between each data set.
- Which of the data points cannot be shown to be
different from the others with 95% confidence?
26W.S. Gosset, "Student" (1876 - 1937)
27Student's t-Test
- There's a certain chance you just found a weird
sample
- It goes up as the sample size goes down.
- Too few samples won't be explained by a Gaussian
curve.
- Student published an alternative to this.
28Student's t-Table
29Lab 3: Student's t-Test
- In Lab 3, another, much smaller data set from
another fertilizer is to be analyzed. Can it be
concluded that the other fertilizer had a
statistically significant effect on the growth?
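One possible way to compute a t statistic for two small samples is sketched below. It assumes the pooled (equal variance) form of the two-sample test; the slides do not say which form the lab uses, and the growth numbers are invented. The resulting t and degrees of freedom would then be compared against a Student's t table.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Two-sample t statistic with pooled variance, plus degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Weighted average of the two sample variances.
    pooled_variance = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / df
    # Standard error of the difference between the two means.
    standard_error = sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    t = (mean(sample_a) - mean(sample_b)) / standard_error
    return t, df


# Hypothetical plant growth (cm) without and with the other fertilizer.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treated = [2.0, 3.0, 4.0, 5.0, 6.0]

t, df = pooled_t_statistic(treated, control)
print(f"t = {t:.3f} with {df} degrees of freedom")
```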
30Now to Change Gears: Tests for Correlation
- Sometimes it's difficult to perform a controlled
experiment.
- Samples collected from nature
- Studies involving the weather
- Anything where you ask people to do something
over a long period of time
- So what do you do? You look for correlation.
31Positive Correlation: Two Phenomena Rise and Fall
Together
32Negative (Inverse) Correlation: One Phenomenon
Rises as the Other Falls
33Strong vs. Weak Correlations
34Common Question: Are these phenomena correlated?
- Pearson's Product Moment Coefficient
- A.K.A. Correlation Coefficient or r
- Never mind about this formula, it's easy in Excel
- What is the statistical significance?
- Use Student's t table but with n-2 degrees of
freedom
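For the curious, the correlation coefficient and its significance test can be sketched in a few lines. The data are invented, the function names are assumptions, and the conversion t = r * sqrt((n - 2) / (1 - r^2)) is the standard one used with n-2 degrees of freedom.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def t_from_r(r, n):
    """t statistic for r, to look up with n - 2 degrees of freedom.

    Not defined for a perfect correlation (r equal to 1 or -1).
    """
    return r * sqrt((n - 2) / (1 - r ** 2))


# Hypothetical paired measurements.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
t = t_from_r(r, len(xs))
print(f"r = {r:.4f}, t = {t:.4f}, degrees of freedom = {len(xs) - 2}")
```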
35Common Problem: Fitting Data
- Find a trend in the data
- How much more do plants grow when fertilized?
- Can be very complicated. Requires deep
understanding of the physical phenomenon
- Simplest fit: Linear Least Squares
- Accessible in Microsoft Excel
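A linear least squares fit can also be sketched by hand. The fertilizer and growth numbers below are hypothetical, and the function names are assumptions; the formulas are the usual closed-form slope and intercept for a straight-line fit.

```python
def least_squares_line(xs, ys):
    """Slope and intercept of the line minimizing the squared vertical errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variation in ys that the fitted line explains."""
    mean_y = sum(ys) / len(ys)
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total


# Hypothetical fertilizer amounts (grams) and plant growth (cm).
fertilizer = [1, 2, 3, 4, 5]
growth = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(fertilizer, growth)
r2 = r_squared(fertilizer, growth, slope, intercept)
print(f"growth = {slope:.2f} * fertilizer + {intercept:.2f}, r^2 = {r2:.2f}")
```

The slope answers the question "how much more do plants grow per unit of fertilizer" for this made-up data.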
36Lab 4: Determining Correlation
- Make a scatter plot of the hours studied versus
exam score columns.
- Find the correlation coefficient by using the
function Insert > Function > Statistical >
PEARSON
- Plot the linear least squares fit of the plot and
the r² value on the graph.
- Determine how confident we can be that the
phenomena are linked.
37Shifting Gears Again: Discrete Data
- Until now, we assumed that all data was
measurement data.
- Now, we deal with nominal data
- Home runs hit by player.
- Ice cream preference by country.
38Categorical data classified as Nominal, Ordinal,
and/or Binary
[Diagram: categorical data divides into ordinal data and nominal data; each of these can be binary or not binary.]
39Heart Attacks and Aspirin
Does taking aspirin affect the chances of
suffering a second heart attack? How sure can you
be?
40χ² analysis: Expectation vs. observation
- Compute the percent distribution you would expect
if all the data were random.
41χ² analysis: Expectation vs. observation
The χ² can now be looked up on a table to see its
significance. χ² has degrees of freedom, like
Student's t and Pearson's product moment
coefficient.
Degrees of freedom = (rows - 1) × (columns - 1)
The Excel command is CHITEST
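A sketch of the χ² calculation for a table of counts. The 2 x 2 counts are invented and are not the aspirin study data. The p-value shortcut with erfc is valid only for 1 degree of freedom; larger tables need a χ² table or a statistics library.

```python
from math import erfc, sqrt


def chi_square(observed):
    """Chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count expected in this cell if rows and columns were unrelated.
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


def p_value_one_df(statistic):
    """Chance of a statistic at least this large; valid for 1 degree of freedom only."""
    return erfc(sqrt(statistic / 2))


# Hypothetical counts. Rows: treatment, no treatment. Columns: event, no event.
observed = [[10, 20], [20, 10]]

chi2, df = chi_square(observed)
p = p_value_one_df(chi2)
print(f"chi-square = {chi2:.3f}, degrees of freedom = {df}, p = {p:.4f}")
```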
42Lab 5: χ² Analysis
- Wally likes to fish, rain or shine. Some days he
catches fish and other days he doesn't. Find out
if there is a link between rainy days and Wally's
luck with the rod and reel. What is the
probability of such an event occurring by
accident?
43To Recap
- Statistics is a branch of mathematics that is
used to guide experimental design.
- Larger sample sizes are better if unbiased.
- Student's t-tests can check if small data sets
are significantly different.
- For larger sets, use Gaussian z-tests.
- All of this is available in Excel.
44Stats Resources
- CliffsQuickReview Statistics (10)
- Spreadsheet programs
- Lots of stats built into Excel
- http://mail.pittstate.edu/winters/tutorial/
- http://www.statsoft.com/textbook/stathome.html
52Population
- The set of data (numerical or otherwise)
corresponding to the entire collection of units
about which information is sought
53Population Examples
- Unemployment - Status of ALL employable people
(employed, unemployed) in the U.S.
- SAT Scores - Math SAT scores of EVERY person that
took the SAT during 1997
- Responses of ALL currently enrolled underage
college students as to whether they have consumed
alcohol in the last 24 hours
Traits of a Population are called Parameters
54Sample
- A subset of the population data that are actually
collected in the course of a study.
[Diagram: a sample shown inside a population. Sample: 100 people with heart conditions given aspirin.]
55Sample Examples
- Unemployment - Status of the 1000 employable
people interviewed.
- SAT Scores - Math SAT scores of 20 people that
took the SAT during 1997
- Responses of 538 currently enrolled underage
college students as to whether they have consumed
alcohol in the last 24 hours
Traits of Samples are called Statistics
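The difference between a parameter and a statistic can be sketched with a simulated population. The scores are randomly generated stand-ins, not real SAT data, and the seed is fixed only so the sketch is repeatable.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so repeated runs give the same numbers

# Hypothetical population: 10,000 simulated test scores.
population = [random.gauss(500, 100) for _ in range(10_000)]

# The sample: 20 scores drawn at random from the population.
sample = random.sample(population, 20)

parameter = mean(population)  # a trait of the whole population
statistic = mean(sample)      # a trait of the sample, used as an estimate

print(f"population mean (parameter): {parameter:.1f}")
print(f"sample mean (statistic):     {statistic:.1f}")
```

The statistic changes from sample to sample, while the parameter is a fixed property of the population.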