Qualitative (Categorical) Data - PowerPoint PPT Presentation

Slides: 56

Provided by: cente56

Transcript and Presenter's Notes

Title: Qualitative (Categorical) Data

1
Practical Applications of Statistics or So you've
got data... NOW WHAT?
With special thanks to Herb McGrath and Laura
Simon of the Penn State Stats Dept.
2
What is Statistics? Statistics is a branch of
mathematics concerned with interpreting data.

Statistics guides all aspects of your Science
Fair project
Experiment Design/Data Collection
Summarizing Data
Interpreting Data
Drawing Conclusions from Data

3
Data is a good start... now for a good finish!

Turning data into information
Proving that an experimental factor had an effect
Inferring relationships between factors
Determining how confident you can be in your
determinations

4
Segment I: Understanding Error

It is important to minimize error and understand
it!
Some data is inherently biased
Selection bias
Measurement bias

5
Example: Sampling Bias

Dog Obedience Experiment: 1000 people are invited
to a public park to demonstrate the ability of
their dog to respond to commands. 500 turn out
with their dogs. The experimenters document the
number of commands given to each dog and the
number that they respond to appropriately. They
find that the dogs respond an astonishing 71% of
the time! What's wrong with this study?

6
Example: Measurement Bias

The dogs are weighed to determine if large dogs
respond to commands better than small dogs. It is
later determined that the scale was a postage
scale and couldn't weigh packages above 10 lbs.
All dogs weighing more than 10 lbs were recorded
as weighing 10 lbs.

7
Segment I: Understanding Error

Some data is inherently biased
Selection bias
Measurement bias
All non-biased data has a STANDARD ERROR OF
MEASUREMENT
This is the precision to which the
instrument/experimenter can accurately record the
data
Data is randomly distributed around the true
value
Note: Most systems also have randomness of their
own

8
Example: Random Error

On the other hand, when have you ever seen a dog
stand still on a scale? The dogs shake and it is
impossible to measure their weight to better than
an accuracy of 5%.

The Bottom Line: Minimize your sources of error
and understand the error that you can't get rid
of.
9
Segment I: Understanding Error

Some data is inherently biased
Selection bias
Measurement bias
All non-biased data has a STANDARD ERROR OF
MEASUREMENT
This is the precision to which the
instrument/experimenter can accurately record the
data
Data is randomly distributed around the true
value
Note: Most systems also have randomness of their
own
A typical data point differs from the mean of
the data by about one Standard Deviation

10
Example: Fish Catches
Vessel A and Vessel B caught the same
average number of fish every day but obviously
have very different typical days. There are
ways to quantify these differences.
11
The Frequency Histogram
(Figure: frequency histogram. x-axis: average fish
weight in lbs, 0 to 10, bin width 0.5 lbs; y-axis:
number of members in bin, 0 to 250.)
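
A frequency histogram just counts how many measurements fall into each bin. A minimal sketch of that counting, assuming a bin width of 0.5 lbs as on the slide; the function name `bin_counts` and the fish weights are made up for illustration:

```python
from collections import Counter


def bin_counts(values, bin_width=0.5):
    """Count how many values fall into each bin of the given width."""
    counts = Counter()
    for v in values:
        # Left edge of the bin that contains v, e.g. 2.6 -> 2.5 for width 0.5
        left_edge = (v // bin_width) * bin_width
        counts[left_edge] += 1
    return dict(sorted(counts.items()))


# Made-up fish weights in lbs, for illustration only
weights = [2.1, 2.4, 2.6, 3.0, 3.2, 3.3, 3.4, 4.9]
histogram = bin_counts(weights)

# Print a crude text histogram, one '#' per fish in the bin
for left_edge, count in histogram.items():
    print(f"{left_edge:4.1f} to {left_edge + 0.5:4.1f} lbs: {'#' * count}")
```

Each row of the printout corresponds to one bar of the histogram.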
12
The Gaussian Distribution
(Figure: histogram. x-axis: weight in lbs, 0 to 10,
bin width 0.5 lbs; y-axis: number of members in
bin, 0 to 250.)
13
The Gaussian Distribution
(Figure: Gaussian curve. x-axis: weight in lbs, 0 to
10, no bins; y-axis: likelihood in arbitrary units,
0 to 0.4.)
14
The Gaussian Distribution
15
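The Gaussian Distribution
(Figure: Gaussian curve divided into bands one
standard deviation wide. On each side of the mean,
the bands hold 34.1%, 13.6%, and 2.14% of the data.)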
16
Three Types of "The Average"

The mean: Bill Gates walks into a coffee shop and
the average person in the shop is rich!
The median: Half of all people are taller than me
and half are shorter. I am of the average
height.
The mode: The largest number of golfers shoot an
88.

Important: In most data sets, most data points
are not at the average.
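
The three averages can be compared with Python's standard `statistics` module. This is only a sketch; the incomes are made-up numbers chosen to mimic the coffee shop example:

```python
import statistics

# Made-up yearly incomes (thousands of dollars) of people in a coffee shop
incomes = [30, 40, 40, 50, 60]

mean_before = statistics.mean(incomes)      # sum divided by count
median_before = statistics.median(incomes)  # middle value when sorted
mode_before = statistics.mode(incomes)      # most common value

# One extreme value walks in: the mean jumps, the median barely moves
with_billionaire = incomes + [1_000_000]
mean_after = statistics.mean(with_billionaire)
median_after = statistics.median(with_billionaire)

print("before:", mean_before, median_before, mode_before)
print("after: ", round(mean_after), median_after)
```

The mean is the one that makes "the average person in the shop" look rich; the median is far less sensitive to a single extreme value.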
17
Example: Fish Catches
If the average data point is not at the
average of the data, then where is it?
Deviation - The distance of a measurement
from the mean.
Variance - The sum of the squared deviations
of n measurements from the mean, divided by n-1.
Standard deviation - The square root of the
variance.
18
What is a standard deviation?

It is the typical (standard) difference
(deviation) of an observation from the mean.
Think of it as the average distance a data point
is from the mean, although this is not strictly
true.

To calculate the standard deviation (s):
- Subtract the average from each data point and
square the difference.
- Sum the squares.
- Divide the sum by the number of samples, minus
one. This is the variance.
- Take the square root of the variance.
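
The calculation can be sketched step by step in Python. This assumes the usual sample standard deviation: the variance is the sum of squared deviations divided by n-1, and the standard deviation is its square root. The function name and the catch counts are made up for illustration:

```python
import math
import statistics


def sample_std_dev(data):
    """Sample standard deviation, computed one step at a time."""
    mean = sum(data) / len(data)
    # Deviation of each point from the mean, squared
    squared_deviations = [(x - mean) ** 2 for x in data]
    # Variance: sum of squared deviations divided by (n - 1)
    variance = sum(squared_deviations) / (len(data) - 1)
    # Standard deviation: square root of the variance
    return math.sqrt(variance)


# Made-up daily fish catches, for illustration only
catches = [2, 4, 4, 4, 5, 5, 7, 9]
s = sample_std_dev(catches)
print(round(s, 3))

# The standard library gives the same answer
print(round(statistics.stdev(catches), 3))
```

In Excel, the stdev function performs the same calculation.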
19
So how do our fishing boats compare?
20
So you have the mean (µ) and the standard
deviation (s). Now what?
The standard score (z) = (Data Point - Mean) /
Standard Deviation
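
As a small sketch, the standard score is a one-line function. The name `standard_score` and the numbers are illustrative only:

```python
def standard_score(data_point, mean, std_dev):
    """How many standard deviations a data point lies from the mean."""
    return (data_point - mean) / std_dev


# A made-up catch of 9 fish, when the mean is 5 and the standard deviation is 2
z = standard_score(9, mean=5, std_dev=2)
print(z)
```

A positive z means the point is above the mean; a negative z means it is below.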
21
Common Question: What value and error bars should
I report for this data?

For each data set, find the mean.
For each data set, find the standard deviation.
Plot the mean as a dot.
Plot 2 standard deviations above and below the
mean.
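
The recipe above can be sketched as a function that returns the dot and the two ends of the error bar. The plant heights and the name `value_with_error_bars` are made up for illustration:

```python
import statistics


def value_with_error_bars(data, n_sigma=2):
    """Return (low, mean, high), with bars n_sigma standard deviations long."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1)
    return mean - n_sigma * s, mean, mean + n_sigma * s


# Made-up plant heights in cm for two data sets
data_sets = {
    "no fertilizer": [10, 12, 14],
    "with fertilizer": [15, 18, 21],
}

for name, data in data_sets.items():
    low, mean, high = value_with_error_bars(data)
    print(f"{name}: plot {mean} with bars from {low:.1f} to {high:.1f}")
```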

22
Lab 1: Make a Scatter Plot of Data with Error Bars

Open c:\StatsLab.xls
In sheet 1 is some data. Use the functions
average() and stdev() to put the means and
standard deviations where they are indicated.
Make a scatter plot of the means versus the
independent variable, the quantity of fertilizer.
Add error bars of 2s.

23
Common Question: Is there really a difference
between these two data sets?

Experimental Design: The control versus the
experimental.
Collect data from both samples.
Find the mean and standard deviation of each.
Determine the standard score (z) of the one
versus the other.
If one is 2 standard deviations different from
the other's mean, then the sets probably show a
difference.
The confidence with which you can say that the
two are different depends on z:
Z = 1: 68% confidence; Z = 2: 95% confidence;
Z = 3: 99.7% confidence
To find your confidence to report, look up a
table of STANDARD NORMAL PROBABILITIES
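
Instead of a printed table, the confidence for a given z can be computed from the error function in Python's `math` module. This sketch assumes a two-sided reading, that is, the fraction of a Gaussian lying within z standard deviations of the mean; the function name is illustrative:

```python
import math


def confidence_from_z(z):
    """Fraction of a Gaussian within |z| standard deviations of the mean."""
    return math.erf(abs(z) / math.sqrt(2))


for z in (1, 2, 3):
    print(f"Z = {z}: {100 * confidence_from_z(z):.1f}% confidence")
```

These are the same values a table of standard normal probabilities gives for Z = 1, 2, and 3.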

But what if you have very few data points?
24
Lab 2: Determine whether data set differences are
statistically significant.

The second worksheet contains the data from
Lab 1.
Compute the standard score (Z) for the
differences between each data set.
Which of the data points cannot be shown to be
different from the others with 95% confidence?

26

W.S. "Student" Gosset (1876 - 1937)

27
Student's T-Test

There's a certain chance you just found a weird
sample.
It goes up as the sample size goes down.
Too few samples won't be explained by a Gaussian
curve.
Student published an alternative to this.

28
Student's T-Table
29
Lab 3: Student's T-Test

In Lab 3, another, much smaller data set from
another fertilizer is to be analyzed. Can it be
concluded that the other fertilizer had a
statistically significant effect on the growth?
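
A minimal sketch of the t statistic for two small samples follows. It assumes the pooled (equal variance) form of Student's test, which may not be the exact variant intended for the lab; the function name and the data are made up. The result is meant to be compared against a Student's T-table at the returned degrees of freedom:

```python
import math
import statistics


def pooled_t_statistic(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for Student's two-sample t-test.

    Assumes both samples come from populations with equal variance.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1)
    var_b = statistics.variance(sample_b)
    dof = n_a + n_b - 2
    # Weighted average of the two variances
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_error, dof


# Made-up growth measurements, for illustration only
control = [1, 2, 3]
treated = [2, 3, 4]
t, dof = pooled_t_statistic(treated, control)
print(f"t = {t:.3f} with {dof} degrees of freedom")
```

For these made-up numbers t is about 1.22 with 4 degrees of freedom, below the 2.776 that a T-table lists for 95% confidence (two-sided), so no significant difference could be claimed.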

30
Now to Change Gears: Tests for Correlation

Sometimes it's difficult to perform a controlled
experiment.
Samples collected from nature
Studies involving the weather
Anything where you ask people to do something
over a long period of time
So what do you do? You look for correlation.

31
Positive Correlation: Two Phenomena Rise and Fall
Together
32
Negative (Inverse) Correlation: One Phenomenon
Rises as the Other Falls
33
Strong vs. Weak Correlations
34
Common Question: Are these phenomena correlated?

Pearson's Product-Moment Coefficient
A.K.A. Correlation Coefficient or r
Never mind about this formula, it's easy in Excel
What is the statistical significance?
Use Student's T table but with n-2 degrees of
freedom
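
For readers without Excel, here is a sketch of the correlation coefficient and the t value used to look up its significance. It assumes the standard relation t = r * sqrt((n - 2) / (1 - r^2)); the function names and the data are made up for illustration:

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_for_correlation(r, n):
    """t value for testing r, to look up with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Made-up hours studied and exam scores
hours = [1, 2, 3, 4, 5]
scores = [60, 70, 75, 70, 75]

r = pearson_r(hours, scores)
t_r = t_for_correlation(r, len(hours))
print(f"r = {r:.3f}, t = {t_r:.3f}, degrees of freedom = {len(hours) - 2}")
```

Note that `t_for_correlation` is undefined for a perfect correlation (r equal to 1 or -1), where the denominator is zero.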

35
Common Problem: Fitting Data

Find a trend in the data
How much more do plants grow when fertilized?
Can be very complicated; requires deep
understanding of the physical phenomenon
Simplest fit: Linear Least Squares
Accessible in Microsoft Excel
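
A linear least squares fit finds the straight line y = slope * x + intercept that minimizes the sum of squared vertical distances to the data. A minimal sketch, with a made-up function name and made-up data:

```python
def linear_least_squares(xs, ys):
    """Return (slope, intercept, r_squared) of the best-fit straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r squared: fraction of the variation in y explained by the line
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Made-up hours studied and exam scores
hours = [1, 2, 3, 4, 5]
scores = [60, 70, 75, 70, 75]

slope, intercept, r_squared = linear_least_squares(hours, scores)
print(f"score = {slope:.1f} * hours + {intercept:.1f}, r2 = {r_squared:.2f}")
```

The slope answers the "how much more" question directly: it is the expected change in y for each unit increase in x.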

36
Lab 4: Determining Correlation

Make a scatter plot of the hours studied versus
exam score columns.
Find the correlation coefficient by using the
function > Insert > Function > Statistical
> Pearson
Plot the linear least squares fit of the plot and
the r² value on the graph.
Determine how confident we can be that the
phenomena are linked.

37
Shifting Gears Again: Discrete Data

Until now, we assumed that all data was
measurement data.
Now, we deal with nominal data
Home runs hit by player.
Ice cream preference by country.

38
Categorical data classified as Nominal, Ordinal,
and/or Binary
(Diagram: categorical data splits into ordinal data
and nominal data; each of these is either binary or
not binary.)
39
Heart Attacks and Aspirin
Does taking aspirin affect the chances of
suffering a second heart attack? How sure can you
be?
40
χ² analysis: Expectation vs. observation

Compute the percent distribution you would expect
if all the data were random.

41
χ² analysis: Expectation vs. observation
The χ² can now be looked up on a table to see its
significance. χ² has degrees of freedom, like
Student's T and Pearson's Product. The degrees of
freedom = (rows - 1) × (columns - 1). The Excel
command is CHITEST.
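
The expectation versus observation calculation can be sketched as follows. The expected count in each cell is (row total × column total) / grand total, and χ² sums (observed - expected)² / expected over all cells. The function names and counts are made up; the p-value helper is only valid for 1 degree of freedom (a 2 × 2 table), and larger tables still need a χ² table:

```python
import math


def chi_square(observed):
    """Return (chi2, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count expected in this cell if rows and columns were unrelated
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (count - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof


def p_value_one_dof(chi2):
    """Chance of a chi2 at least this large by accident (1 dof only)."""
    return math.erfc(math.sqrt(chi2 / 2))


# Made-up counts. Rows: rainy days, dry days. Columns: fish, no fish.
observed = [[10, 20],
            [20, 10]]

chi2, dof = chi_square(observed)
p = p_value_one_dof(chi2)
print(f"chi2 = {chi2:.2f}, degrees of freedom = {dof}, p = {p:.4f}")
```

A small p means the observed counts would rarely arise by accident if the two categories were unrelated.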
42
Lab 5: χ² Analysis

Wally likes to fish, rain or shine. Some days he
catches fish and other days he doesn't. Find out
if there is a link between rainy days and Wally's
luck with the rod and reel. What is the
probability of such an event occurring by
accident?

43
To Recap

Statistics is a branch of mathematics that is
used to guide experimental design.
Larger sample sizes are better if unbiased.
Student's T-tests can check if small data sets
are significantly different.
- For larger sets, use Gaussian Z-tests.
All of this is available in Excel.

44
Stats Resources

CliffsQuickReview Statistics (10)
Spreadsheet programs
Lots of stats built into Excel
http://mail.pittstate.edu/winters/tutorial/
http://www.statsoft.com/textbook/stathome.html

52
Population

The set of data (numerical or otherwise)
corresponding to the entire collection of units
about which information is sought

53
Population Examples

Unemployment - Status of ALL employable people
(employed, unemployed) in the U.S.
SAT Scores - Math SAT scores of EVERY person that
took the SAT during 1997
Responses of ALL currently enrolled underage
college students as to whether they have consumed
alcohol in the last 24 hours

Traits of a Population are called Parameters
54
Sample

A subset of the population data that are actually
collected in the course of a study.

(Diagram: the sample shown as a subset of the
population. Example sample: 100 people with heart
conditions given aspirin.)
55
Sample Examples

Unemployment - Status of the 1000 employable
people interviewed.
SAT Scores - Math SAT scores of 20 people that
took the SAT during 1997
Responses of 538 currently enrolled underage
college students as to whether they have consumed
alcohol in the last 24 hours

Traits of Samples are called Statistics
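
The difference between a parameter and a statistic can be sketched with a simulated population. Everything here is made up for illustration: the scores are random numbers, not real SAT data, and the sample size of 20 simply mirrors the example above:

```python
import random
import statistics

random.seed(0)  # fixed seed so the example is repeatable

# Simulated population: 10,000 made-up test scores
population = [random.gauss(500, 100) for _ in range(10_000)]

# Parameter: a trait of the whole population
parameter = statistics.mean(population)

# Sample: 20 scores drawn at random, without replacement
sample = random.sample(population, 20)

# Statistic: the same trait, computed from the sample only
statistic = statistics.mean(sample)

print(f"population mean (parameter): {parameter:.1f}")
print(f"sample mean (statistic):     {statistic:.1f}")
```

The statistic is only an estimate of the parameter; a different random sample would give a slightly different value.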
