Describing and Exploring Data

Slides: 62

Provided by: johnba4

Transcript and Presenter's Notes

Title: Describing and Exploring Data

1
Chapter 2

Describing and Exploring Data

2
Describing and Exploring Data

Once a bunch of data has been collected, the raw
numbers must be manipulated in some fashion to
make them more informative.
Several options are available including plotting
the data or calculating descriptive statistics.

3
Plotting Data

Often, the first thing one does with a set of raw
data is to plot frequency distributions.
Usually this is done by first creating a table of
the frequencies broken down by values of the
relevant variable, then the frequencies in the
table are plotted in a histogram.
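
As a rough sketch of the two steps just described (build a frequency table, then plot it), the short Python example below may help. The ages are invented for illustration, since the class data are not reproduced here, and the "histogram" is just rows of asterisks.

```python
from collections import Counter

# Hypothetical ages, invented for illustration only.
ages = [19, 20, 20, 21, 19, 22, 20, 21, 20, 23]

# Frequency table: count how many subjects have each value.
freq_table = Counter(ages)

# Text histogram: one row per value, one '*' per subject with that value.
for value in sorted(freq_table):
    print(f"{value:>3} | {'*' * freq_table[value]} ({freq_table[value]})")
```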

4
Table example: Your age as
estimated by the questionnaire from the first
class.

Note: The frequencies in the adjacent table were
calculated by simply counting the number of
subjects having the specified value for the age
variable.

5
Histogram
6
Grouping Data

Plotting is easy when the variable of interest
has a relatively small number of values (like our
age variable did).
However, the values of a variable are sometimes
more continuous, resulting in uninformative
frequency plots if done in the above manner.

7
Grouping Data

For example, our weight variable ranges from 100
lb. to 200 lb. If we used the previously
described technique, we would end up with about
100 bars, most of which would have a frequency of
less than 2 or 3 (and many a frequency of zero).
We can get around this problem by grouping our
values into bins. Try for around 10 bins with
natural splits.
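
One possible way to do the grouping is sketched below in Python. The weights are hypothetical, and the 10 lb. bin width and the helper name bin_start are assumptions made only for this illustration.

```python
from collections import Counter

# Hypothetical weights in pounds, invented for illustration only.
weights = [104, 118, 123, 125, 131, 137, 142, 145, 148, 150,
           152, 155, 159, 163, 166, 171, 178, 184, 192, 200]

BIN_WIDTH = 10  # bins are 100-109, 110-119, ..., 200-209


def bin_start(weight, width=BIN_WIDTH):
    """Return the lower edge of the bin that contains `weight`."""
    return (weight // width) * width


# Count how many weights fall into each bin.
binned = Counter(bin_start(w) for w in weights)

for start in sorted(binned):
    print(f"{start}-{start + BIN_WIDTH - 1}: {'*' * binned[start]}")
```

Changing BIN_WIDTH (say, to 5 or 25) and re-running shows how strongly the bin width affects the look of the plot.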

8
Table example: Binning our weight variable.
9
Histogram
Check out this demo, which shows how the
width of the bin that you select can clearly
affect the look of the data. Here is another
similar demonstration of the effects of bin width.

See the section in the text on cumulative
frequency distributions.

10
Stem-and-Leaf Plots

If values of a variable must be grouped prior to
creating a frequency plot, then the information
related to the specific values becomes lost in
the process (i.e., the resulting graph depicts
only the frequency values associated with the
grouped values).
However, it is possible to obtain the graphical
advantage of grouping and still keep all of the
information if stem-and-leaf plots are used.

11
Stem-and-Leaf Plots

These plots are created by splitting a data point
into the part associated with the group (the
stem) and the part associated with the individual
point (the leaf).
For example, the numbers 180, 180, 181, 182, 185,
186, 187, 187, 189 could be represented as
18 | 001256779
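
The splitting rule can be sketched in a few lines of Python. This is a minimal version that assumes non-negative whole numbers and uses the last digit as the leaf; the function name stem_and_leaf is made up for this example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (all digits but the last) to a string of its sorted leaves."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)  # stem = tens and above, leaf = last digit
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in stems.items()}


data = [180, 180, 181, 182, 185, 186, 187, 187, 189]
plot = stem_and_leaf(data)
for stem in sorted(plot):
    print(f"{stem} | {plot[stem]}")
```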

12
Thus, we could represent our weight data in the
following stem-and-leaf plot
13
Stem-and-leaf plots are especially nice for
comparing distributions
14
Terminology Related to Distributions

Often, frequency histograms tend to have a
roughly symmetrical bell shape, and such
distributions are called normal or Gaussian.

15
Terminology Related to Distributions

Sometimes, the bell shape is not symmetrical.
The term positive skew refers to the situation
where the tail of the distribution is to the
right, negative skew is when the tail is to the
left.

16
Example: Pizza data.
17
Notation: Variables

When we describe a set of data corresponding to
the values of some variable, we will refer to
that set using an uppercase letter such as X or
Y.
When we want to talk about specific data points
within that set, we specify those points by
adding a subscript to the uppercase letter like
X1.

18
For Example

5, 8, 12, 3, 6, 8, 7
X1, X2, X3, X4, X5, X6, X7

19
Summation

The Greek letter sigma, which looks like Σ, means
add up or sum whatever follows it.
Thus, ΣXi means add up all the Xi's.
If we use the Xi's from the previous example, ΣXi
= 49 (or just ΣX).

20
Summation

Note that sometimes the Σ has numbers above and
below it. These numbers specify the range over
which to sum.
For example, if we again use the Xi's from the
previous example, but now limit the summation
to i = 1 through 5, ΣXi = 34
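
Both sums can be checked with a couple of lines of Python. The limits 1 through 5 are assumed here because they are the only run of consecutive values in this data set that adds up to 34.

```python
X = [5, 8, 12, 3, 6, 8, 7]  # X1 ... X7 from the earlier example

total = sum(X)         # sum over every Xi
partial = sum(X[0:5])  # X1 through X5 (Python counts from 0, so X1 is X[0])

print(total, partial)
```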

21
Nasty Example
22
Nasty Example . . .continued

ΣX
ΣY
Σ(X-Y)
ΣX²
(ΣX)²
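
The quantities listed above can be practised on any pair of variables. The sketch below uses small made-up values for X and Y (the data from the slide are not reproduced here), mainly to show that ΣX² and (ΣX)² are not the same thing.

```python
# Hypothetical paired scores, invented for illustration only.
X = [2, 4, 6]
Y = [1, 3, 5]

sum_x = sum(X)                               # sum of X
sum_y = sum(Y)                               # sum of Y
sum_diff = sum(x - y for x, y in zip(X, Y))  # sum of (X - Y)
sum_x_squared = sum(x ** 2 for x in X)       # square each X, then add
squared_sum_x = sum(X) ** 2                  # add the Xs, then square

print(sum_x, sum_y, sum_diff, sum_x_squared, squared_sum_x)
```

Note that Σ(X-Y) always equals ΣX minus ΣY, whereas ΣX² and (ΣX)² generally differ.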

23
Your turn

Σ(XY)
(Σ(X-Y))²
Σ(X²-Y²)

24
Double Subscripts

Sometimes things are made more complicated
because capital letters (e.g., X) are sometimes
used to refer to entire data sets (as opposed to
single variables) and multiple subscripts are
used to specify specific data points.

25
X24 = 3
ΣΣX or ΣΣXij = 61
26
Measures of Central Tendency

While distributions provide an overall picture of
some data set, it is sometimes desirable to
represent the entire data set using descriptive
statistics.
The first descriptive statistics we will discuss
are those used to indicate where the centre of
the distribution lies.

27
(No Transcript)
28
The Mode

There are, in fact, three different measures of
central tendency.
The first of these is called the mode.
The mode is simply the value of the relevant
variable that occurs most often (i.e., has the
highest frequency) in the sample.

29
The Mode

Note that if you have done a frequency histogram,
you can often identify the mode simply by finding
the value with the highest bar.
However, that will not work when grouping was
performed prior to plotting the histogram
(although you can still use the histogram to
identify the modal group, just not the modal
value).

30
Finding the mode

Create a non-grouped frequency table as described
previously, then identify the value with the
greatest frequency.
Example: Class height.
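
A minimal Python sketch of this procedure follows. The heights are hypothetical, since the class data are not reproduced here, and the code returns every value tied for the highest frequency in case there is more than one mode.

```python
from collections import Counter

# Hypothetical heights in inches, invented for illustration only.
heights = [64, 66, 66, 67, 68, 68, 68, 70, 71, 72]

counts = Counter(heights)              # non-grouped frequency table
top_frequency = max(counts.values())   # the greatest frequency in the table
modes = sorted(v for v, c in counts.items() if c == top_frequency)

print(modes, top_frequency)
```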

31
The Median

A second measure of central tendency is called
the median.
The median is the point corresponding to the
score that lies in the middle of the distribution
(i.e., there are as many data points above the
median as there are below the median).

32
The Median

To find the median, the data points must first be
sorted into either ascending or descending
numerical order.
The position of the median value can then be
calculated using the following formula:
Median location = (N + 1) / 2

33
Examples

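1) If there are an odd number of data points
(1, 3, 3, 4, 4, 5, 6, 7, 12)
The median location is (9 + 1) / 2 = 5, so the
median is the item in the fifth position of the
ordered data set; therefore the median is 4.
2) If there are an even number of data points
The median location falls halfway between the two
middle positions, and the median is the average
of the two middle values.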
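
The same steps can be written as a short Python function. This is only a sketch that assumes the usual median-location rule, (N + 1) / 2; the even-sized data set is made up for the example.

```python
import math


def median(values):
    """Sort the data, find the median location, and average the value(s) there."""
    ordered = sorted(values)
    location = (len(ordered) + 1) / 2           # 1-based position, may end in .5
    lower = ordered[math.floor(location) - 1]   # value just at or below the location
    upper = ordered[math.ceil(location) - 1]    # value just at or above the location
    return (lower + upper) / 2


odd_data = [1, 3, 3, 4, 4, 5, 6, 7, 12]  # location 5, the fifth value is 4
even_data = [1, 3, 4, 6]                  # hypothetical; location 2.5

print(median(odd_data), median(even_data))
```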

34
The Mean

Finally, the most commonly used measure of
central tendency is called the mean (denoted X̄
for a sample, and μ for a population).
The mean is the same as what most of us call the
average, and it is calculated in the following
manner:
X̄ = ΣX / N

35
The Mean

For example, given the data set that we used to
calculate the median (odd number example), the
corresponding mean would be X̄ = 45 / 9 = 5.
Similarly, the mean height of our class,
as indicated by our sample, is

36
Mode vs. Median vs. Mean

In our height example, the mode and median were
the same, and the mean was fairly close to the
mode and median.
This was the case because the height distribution
was fairly symmetrical.
However, when the underlying distribution is not
symmetrical, the three measures of central
tendency can be quite different.
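
To see how far apart the three measures can drift, here is a small Python sketch using the standard statistics module. The positively skewed scores are invented for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical positively skewed scores (long tail to the right).
scores = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]

print("mode  :", multimode(scores))  # most frequent value(s)
print("median:", median(scores))     # middle of the ordered scores
print("mean  :", mean(scores))       # pulled toward the long right tail
```

With this positive skew the mode is the smallest of the three and the mean the largest.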

This raises the issue of which measure is best.
Note that if you were calculating these values,
you would show all your steps (it's good to be
prof!).

38
Some Visual Demos
Here is a demonstration that allows you to change
a frequency histogram while simultaneously noting
the effects of those changes on the mean versus
the median. As you use the demo, you should
easily be able to think about how these changes
are also affecting the mode, right?
39
Measures of Variability

In addition to knowing where the centre of the
distribution is, it is often helpful to know the
degree to which individual values cluster around
the centre.
This is known as variability.

40
Range

There are various measures of variability, the
most straightforward being the range of the
sample:
Range = highest value minus lowest value
While the range provides a good first pass at
variability, it is not the best measure because
of its sensitivity to extreme scores (see text).

41
The Average Deviation

Another approach to estimating variability is to
directly measure the degree to which individual
data points differ from the mean and then average
those deviations.
That is:
Average deviation = Σ(X - X̄) / N

42
The Average Deviation

However, if we try to do this with real data, the
result will always be zero, because the positive
and negative deviations cancel each other out.
Example: (2, 3, 4, 4, 6, 6, 10)
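
The cancelling of positive and negative deviations is easy to check in Python. The sketch below reuses the nine values from the earlier median example (mean 5); any other data set would behave the same way.

```python
data = [1, 3, 3, 4, 4, 5, 6, 7, 12]

mean = sum(data) / len(data)           # 45 / 9 = 5.0
deviations = [x - mean for x in data]  # negative below the mean, positive above it
average_deviation = sum(deviations) / len(data)

print(deviations)
print(average_deviation)
```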

43
The Mean Absolute Deviation (MAD)

One way to get around the problem with the
average deviation is to use the absolute value of
the differences, instead of the differences
themselves.
The absolute value of some number is just the
number without any sign.
For example, |-3| = 3

44
The Mean Absolute Deviation (MAD)

Thus, we could rewrite and solve our average
deviation equation as follows:
MAD = Σ|X - X̄| / N
The data set in question has a mean of 5 and a
mean absolute deviation of 2.
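
A small Python sketch of the calculation, assuming the seven values 2, 3, 4, 4, 6, 6, 10, which give the mean of 5 and the mean absolute deviation of 2 quoted above:

```python
data = [2, 3, 4, 4, 6, 6, 10]  # assumed values with mean 5 and MAD 2

mean = sum(data) / len(data)                          # 35 / 7
absolute_deviations = [abs(x - mean) for x in data]   # drop the signs
mad = sum(absolute_deviations) / len(data)            # 14 / 7

print(mean, mad)
```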

45
The Variance

Although the MAD is an acceptable measure of
variability, the most commonly used measure is
variance (denoted s² for a sample and σ² for a
population) and its square root, termed the
standard deviation (denoted s for a sample and σ
for a population).

46
The Variance

The computation of variance is also based on the
basic notion of the average deviation; however,
instead of getting around the zero problem by
using absolute deviations (as in MAD), the zero
problem is eliminated by squaring the
differences from the mean.
Specifically:
s² = Σ(X - X̄)² / (N - 1)

47
(No Transcript)
48
Alternate formula for s² and s

The definitional formula of variance just
presented was
s² = Σ(X - X̄)² / (N - 1)
An equivalent formula that is easier to work with
when calculating variances by hand is
s² = (ΣX² - (ΣX)² / N) / (N - 1)
Although this second formula may
look more intimidating, a few examples
will show you that it is actually easier to
work with (as you'll see in assignment 2).
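
As a quick check that the two formulas agree, here is a Python sketch. It assumes the N - 1 denominator for the sample variance and reuses the nine values from the median example.

```python
import math

data = [1, 3, 3, 4, 4, 5, 6, 7, 12]
n = len(data)
mean = sum(data) / n

# Definitional formula: average the squared deviations (using N - 1).
s2_definitional = sum((x - mean) ** 2 for x in data) / (n - 1)

# Computational formula: needs only the sum of X and the sum of X squared.
sum_x = sum(data)
sum_x_squared = sum(x ** 2 for x in data)
s2_computational = (sum_x_squared - sum_x ** 2 / n) / (n - 1)

s = math.sqrt(s2_definitional)  # standard deviation

print(s2_definitional, s2_computational, round(s, 3))
```

The computational version never requires the individual deviations, which is why it is quicker by hand.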

49
Visualizing Means and Standard Deviations
This demonstration allows you to play with the
mean and standard deviation of a distribution.
Note that changing the mean of the distribution
simply moves the entire distribution to the left
or right without changing its shape. In
contrast, changing the standard deviation alters
the spread of the data but does not affect where
the distribution is centered.
50
Estimating Population Parameters

So, the mean (X̄) and variance (s²) are the
descriptive statistics that are most commonly
used to represent the data points of some sample.
The real reason that they are the preferred
measures of central tendency and variability is
because of certain properties they have as
estimators of their corresponding population
parameters, μ and σ².

51
Estimating Population Parameters

Four properties are considered desirable in a
population estimator: sufficiency, unbiasedness,
efficiency, and resistance.
Both the mean and the variance are the best
estimators in their class in terms of the first
three of these four properties.
To understand these properties, you first need to
understand a concept in statistics called the
sampling distribution.

52
Sampling Distribution Demo
We will discuss sampling distributions off and on
throughout the course, and I only want to touch
on the notion now. Basically, the idea is this:
in order to examine the properties of a statistic we
often want to take repeated samples from some
population of data and calculate the relevant
statistic on each sample. We can then look at
the distribution of the statistic across these
samples and ask a variety of questions about
it. Check out this demonstration which I hope
makes the concept of sampling distributions more
clear.
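
For a very small population the idea can even be carried out exhaustively rather than by repeated random sampling. The Python sketch below assumes a made-up population of five scores and lists every possible sample of size 2 drawn with replacement, then tabulates the sample means.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5]  # hypothetical population, for illustration only
SAMPLE_SIZE = 2

# Every possible ordered sample of size 2, drawn with replacement (25 in all).
samples = list(product(population, repeat=SAMPLE_SIZE))

# The statistic of interest (here the mean), calculated on each sample.
sample_means = [Fraction(sum(s), SAMPLE_SIZE) for s in samples]

# The sampling distribution: how often each value of the statistic occurs.
sampling_distribution = Counter(sample_means)
for m in sorted(sampling_distribution):
    print(f"mean = {float(m):.1f}: {'*' * sampling_distribution[m]}")

mean_of_means = sum(sample_means) / len(sample_means)
print("mean of the sample means:", mean_of_means)
```

The mean of the sample means comes out equal to the population mean of 3.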
53
Properties of a Statistic

1) Sufficiency
A sufficient statistic is one that makes use of
all of the information in the sample to estimate
its corresponding parameter.

54
Estimating Population Parameters

2) Unbiasedness
A statistic is said to be an unbiased estimator
if its expected value (i.e., the mean of the
statistic across many repeated samples, such as
the mean of a number of sample means) is equal to
the population parameter it is estimating.
Explanation of N-1 in s² formula.

55
Assessing the Bias of an Estimator

Using this procedure, the mean can be shown to be
an unbiased estimator (see p 47).
However, if the more intuitive formula for s² is
used,
s² = Σ(X - X̄)² / N
it turns out to underestimate σ²

56
Assessing the Bias of an Estimator

This bias toward underestimation is caused by the
act of sampling, and it can be shown that this
bias can be eliminated if N-1 is used in the
denominator instead of N.
Note that this is only true when calculating s².
If you have a measurable population and you want
to calculate σ², you use N in the denominator,
not N-1.
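
The bias can be demonstrated exactly on a tiny example. The Python sketch below assumes a made-up population of five scores, lists every possible sample of size 3 drawn with replacement, and averages the sample variance computed with N and with N-1 in the denominator. Fractions are used so that the comparison is exact.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5]  # hypothetical population, for illustration only
mu = Fraction(sum(population), len(population))
# Population variance: N in the denominator, because this IS the population.
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)


def sample_variance(sample, denominator):
    """Sum of squared deviations from the sample mean, over the given denominator."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample) / denominator


n = 3
samples = list(product(population, repeat=n))  # all 125 samples, with replacement

avg_with_n = sum(sample_variance(s, n) for s in samples) / len(samples)
avg_with_n_minus_1 = sum(sample_variance(s, n - 1) for s in samples) / len(samples)

print("population variance :", sigma2)
print("average s2 using N  :", avg_with_n)
print("average s2 using N-1:", avg_with_n_minus_1)
```

Across all samples, the N version averages less than the population variance, while the N-1 version averages exactly the population variance.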

57
Degrees of Freedom

The mean of 6, 8, 10 is 8.
If I allow you to change as many of these numbers
as you want BUT the mean must stay 8, how many of
the numbers are you free to vary?

58
Degrees of Freedom

The point of this exercise is that when the mean
is fixed, it removes a degree of freedom from
your sample -- this is like actually subtracting
1 from the number of observations in your sample.
It is for exactly this reason that we use N-1 in
the denominator when we calculate s² (i.e., the
calculation requires that the mean be fixed first
which effectively removes -- fixes -- one of the
data points).
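
The exercise can be played out in Python. In this sketch (the helper name last_value is made up for the example), two of the three numbers are chosen freely and the fixed mean forces the third.

```python
def last_value(free_values, mean, n):
    """Return the one remaining value that is forced once the mean of n numbers is fixed."""
    return mean * n - sum(free_values)


# With the mean fixed at 8 and N = 3, only two numbers can be chosen freely.
print(last_value([6, 8], mean=8, n=3))    # the original data: third value must be 10
print(last_value([1, 100], mean=8, n=3))  # any two choices still force the third
```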

59
Estimating Population Parameters

3) Efficiency
The efficiency of a statistic is reflected in the
variance that is observed when one examines the
means of a bunch of independently chosen samples.
The smaller the variance, the more efficient the
statistic is said to be.
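
As a rough illustration, the Python sketch below compares the mean with the median as estimators of the centre. The comparison with the median, the made-up population of five scores, and the sample size of 3 are all assumptions of this example. Every possible sample drawn with replacement is listed, and the variance of each statistic across those samples is computed.

```python
from fractions import Fraction
from itertools import product
from statistics import median

population = [1, 2, 3, 4, 5]  # hypothetical population, for illustration only
samples = list(product(population, repeat=3))  # all 125 samples, with replacement


def variance_across_samples(values):
    """Variance of a statistic over the full list of enumerated samples."""
    centre = sum(values) / len(values)
    return sum((v - centre) ** 2 for v in values) / len(values)


means = [Fraction(sum(s), 3) for s in samples]
medians = [Fraction(median(s)) for s in samples]

var_of_mean = variance_across_samples(means)
var_of_median = variance_across_samples(medians)

print("variance of the sample means  :", var_of_mean)
print("variance of the sample medians:", var_of_median)
```

For this population the sample means vary less from sample to sample than the sample medians do, so the mean is the more efficient of the two here.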

60
Estimating Population Parameters

4) Resistance
The resistance of an estimator refers to the
degree to which that estimate is affected by
extreme values.
As mentioned previously, both X̄ and s² are
highly sensitive to extreme values.
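
A quick Python sketch makes the point. It takes the nine values from the median example and, as a made-up scenario, replaces the largest value (12) with an extreme score of 120.

```python
from statistics import mean, median, variance

data = [1, 3, 3, 4, 4, 5, 6, 7, 12]
with_extreme = data[:-1] + [120]  # hypothetical: the 12 replaced by 120

for label, values in (("original", data), ("with extreme value", with_extreme)):
    print(f"{label}: mean = {mean(values)}, "
          f"s2 = {variance(values)}, median = {median(values)}")
```

The mean and the variance change a great deal, while the median does not move at all.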

61
Estimating Population Parameters

4) Resistance
Despite this, they are still the most commonly
used estimates of the corresponding population
parameters, mostly because of their superiority
over other measures in terms of sufficiency,
unbiasedness, and efficiency.
