Title: Describing and Exploring Data
1Chapter 2
- Describing and Exploring Data
2Describing and Exploring Data
- Once a bunch of data has been collected, the raw numbers must be manipulated in some fashion to make them more informative.
- Several options are available, including plotting the data or calculating descriptive statistics.
3Plotting Data
- Often, the first thing one does with a set of raw data is to plot frequency distributions.
- Usually this is done by first creating a table of the frequencies broken down by values of the relevant variable; the frequencies in the table are then plotted in a histogram.
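The table-then-histogram procedure can be sketched in a few lines of Python. This is only an illustration: the `ages` list is made up (it is not the class questionnaire data), and `frequency_table` and `text_histogram` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical ages; not the class data from the slides.
ages = [19, 20, 20, 21, 19, 20, 22, 21, 20, 19]

def frequency_table(values):
    """Return {value: count}, sorted by value."""
    return dict(sorted(Counter(values).items()))

def text_histogram(table):
    """Render one row of '#' marks per value."""
    return [f"{value:>4} | {'#' * count}" for value, count in table.items()]

table = frequency_table(ages)
for row in text_histogram(table):
    print(row)
```

Each row of the printout plays the role of one histogram bar: its length is the frequency of that value.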
4 Table Example: Your age as estimated by the questionnaire from the first class.
- Note: The frequencies in the adjacent table were calculated by simply counting the number of subjects having the specified value for the age variable.
5Histogram
6Grouping Data
- Plotting is easy when the variable of interest has a relatively small number of values (like our age variable did).
- However, the values of a variable are sometimes more continuous, resulting in uninformative frequency plots if done in the above manner.
7Grouping Data
- For example, our weight variable ranges from 100 lb. to 200 lb. If we used the previously described technique, we would end up with about 100 bars, most of them with a frequency below 2 or 3 (and many with a frequency of zero).
- We can get around this problem by grouping our values into bins. Try for around 10 bins with natural splits.
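A minimal sketch of binning, assuming bins of width 10 that start at 100 lb. (one natural split for a 100 to 200 lb. range). The `weights` list is invented for illustration, and `bin_counts` is a hypothetical helper.

```python
from collections import Counter

WIDTH = 10   # assumed bin width
START = 100  # assumed lower edge of the first bin

# Hypothetical weights in pounds; not the class data from the slides.
weights = [104, 118, 121, 125, 133, 137, 138, 142, 149, 151, 156, 163, 178, 199]

def bin_counts(values, start=START, width=WIDTH):
    """Count values per bin; each bin covers low .. low + width - 1."""
    counts = Counter(start + ((v - start) // width) * width for v in values)
    return dict(sorted(counts.items()))

binned = bin_counts(weights)
for low, count in binned.items():
    print(f"{low}-{low + WIDTH - 1}: {'#' * count}")
```

Bins with no observations are simply absent from the result here; a plotted histogram would show them as empty bars.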
8Table Example: Binning our weight variable.
9Histogram
Check out this demo, which shows how the bin width that you select can affect the look of the data. Here is another, similar demonstration of the effects of bin width.
- See the section in the text on cumulative frequency distributions.
10Stem-and-Leaf Plots
- If values of a variable must be grouped prior to creating a frequency plot, then the information related to the specific values becomes lost in the process (i.e., the resulting graph depicts only the frequency values associated with the grouped values).
- However, it is possible to obtain the graphical advantage of grouping and still keep all of the information if stem-and-leaf plots are used.
11Stem-and-Leaf Plots
- These plots are created by splitting a data point into the part associated with the group and the part associated with the individual point.
- For example, the numbers 180, 180, 181, 182, 185, 186, 187, 187, 189 could be represented as
- 18 | 001256779
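The split into stem and leaf can be sketched as integer division and remainder by 10. This assumes non-negative whole numbers with a one-digit leaf; `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (value // 10) to its sorted leaves (value % 10) as a string."""
    plot = defaultdict(str)
    for v in sorted(values):
        plot[v // 10] += str(v % 10)
    return dict(plot)

data = [180, 180, 181, 182, 185, 186, 187, 187, 189]
for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {leaves}")
# prints: 18 | 001256779
```

Because every leaf is kept, the original values can be read back off the plot, which is exactly what a grouped histogram loses.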
12 Thus, we could represent our weight data in the following stem-and-leaf plot:
13 Stem-and-leaf plots are especially nice for comparing distributions.
14Terminology Related to Distributions
- Frequency histograms often have a roughly symmetrical bell shape; such distributions are called normal or Gaussian.
15Terminology Related to Distributions
- Sometimes, the bell shape is not symmetrical.
- The term positive skew refers to the situation where the tail of the distribution is to the right; negative skew is when the tail is to the left.
16 Example Pizza Data.
17 Notation Variables
- When we describe a set of data corresponding to the values of some variable, we will refer to that set using an uppercase letter such as X or Y.
- When we want to talk about specific data points within that set, we specify those points by adding a subscript to the uppercase letter, like $X_1$.
18For Example
- 5, 8, 12, 3, 6, 8, 7
- $X_1, X_2, X_3, X_4, X_5, X_6, X_7$
19Summation
- The Greek letter sigma, which looks like $\Sigma$, means add up or sum whatever follows it.
- Thus, $\sum X_i$ means add up all the $X_i$s.
- If we use the $X_i$s from the previous example, $\sum X_i = 49$ (or just $\sum X$).
20Summation
- Note that sometimes the $\Sigma$ has numbers above and below it. These numbers specify the range over which to sum.
- For example, if we again use the $X_i$s from the previous example, but now limit the summation to the first five values:
- $\sum_{i=1}^{5} X_i = 34$
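Summation with and without limits can be checked directly in Python. The limits 1 to 5 used below are an inference: they are the only run of consecutive subscripts in the example data that adds up to 34. `summation` is a hypothetical helper that takes 1-based, inclusive subscripts to match the slide notation.

```python
X = [5, 8, 12, 3, 6, 8, 7]  # X_1 .. X_7 from the example slide

def summation(values, first=1, last=None):
    """Sum values[first..last] using 1-based, inclusive subscripts."""
    if last is None:
        last = len(values)
    return sum(values[first - 1:last])

print(summation(X))        # 49
print(summation(X, 1, 5))  # 34
```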
21Nasty Example
22Nasty Example . . .continued
23Your turn
24Double Subscripts
- Things get more complicated when a capital letter (e.g., X) is used to refer to an entire data set (as opposed to a single variable) and multiple subscripts are used to identify specific data points.
25 $X_{24} = 3$; $\sum\sum X$ or $\sum\sum X_{ij} = 61$
26Measures of Central Tendency
- While distributions provide an overall picture of some data set, it is sometimes desirable to represent the entire data set using descriptive statistics.
- The first descriptive statistics we will discuss are those used to indicate where the centre of the distribution lies.
28The Mode
- There are, in fact, three different measures of central tendency.
- The first of these is called the mode.
- The mode is simply the value of the relevant variable that occurs most often (i.e., has the highest frequency) in the sample.
29The Mode
- Note that if you have done a frequency histogram, you can often identify the mode simply by finding the value with the highest bar.
- However, that will not work when grouping was performed prior to plotting the histogram (although you can still use the histogram to identify the modal group, just not the modal value).
30Finding the mode
- Create a non-grouped frequency table as described previously, then identify the value with the greatest frequency.
- Example: Class height.
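As a rough sketch, finding the mode amounts to building the ungrouped frequency table and picking the value (or values, if there is a tie) with the largest count. The `heights` list is invented, since the class height data is not reproduced here, and `modes` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical heights in inches; not the class data from the slides.
heights = [64, 66, 67, 67, 68, 68, 68, 70, 71, 73]

def modes(values):
    """Return every value tied for the highest frequency, in ascending order."""
    table = Counter(values)
    top = max(table.values())
    return sorted(v for v, count in table.items() if count == top)

print(modes(heights))  # [68]
```

Returning a list is a design choice for this sketch: a sample can have more than one mode, and a single return value would hide that.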
31The Median
- A second measure of central tendency is called the median.
- The median is the point corresponding to the score that lies in the middle of the distribution (i.e., there are as many data points above the median as there are below the median).
32The Median
- To find the median, the data points must first be sorted into either ascending or descending numerical order.
- The position of the median value can then be calculated using the following formula:
- $\text{median location} = \frac{N + 1}{2}$
33Examples
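- 1) If there is an odd number of data points:
- (1, 3, 3, 4, 4, 5, 6, 7, 12)
- The median is the item in the fifth position of the ordered data set; therefore the median is 4.
- 2) If there is an even number of data points:
- The median location falls halfway between the two middle positions, so the median is the average of the two middle values.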
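The location rule can be sketched as follows, assuming the median location is (N + 1) / 2 counted from 1 in the sorted data. The even-sized list in the second call is made up for illustration, and `median` is a hypothetical helper (Python's own `statistics.median` gives the same answers).

```python
def median(values):
    """Median via the (N + 1) / 2 location rule on the sorted data."""
    ordered = sorted(values)
    location = (len(ordered) + 1) / 2      # 1-based position; may end in .5
    lower = ordered[int(location) - 1]     # value at floor(location)
    if location.is_integer():
        return lower
    upper = ordered[int(location)]         # next value up
    return (lower + upper) / 2

print(median([1, 3, 3, 4, 4, 5, 6, 7, 12]))  # 4 (fifth position)
print(median([1, 3, 3, 4, 5, 6, 7, 12]))     # 4.5 (hypothetical even-N data)
```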
34The Mean
- Finally, the most commonly used measure of central tendency is called the mean (denoted $\bar{X}$ for a sample, and $\mu$ for a population).
- The mean is the same as what most of us call the average, and it is calculated in the following manner:
- $\bar{X} = \frac{\sum X}{N}$
35The Mean
- For example, given the data set that we used to calculate the median (odd number example), the corresponding mean would be
- $\bar{X} = \frac{\sum X}{N} = \frac{45}{9} = 5$
- Similarly, the mean height of our class, as indicated by our sample, is calculated in the same way.
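The mean of the odd-number median example can be verified with a one-line function; `mean` is a hypothetical helper equivalent to dividing the sum of the scores by N.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by N."""
    return sum(values) / len(values)

odd_example = [1, 3, 3, 4, 4, 5, 6, 7, 12]
print(sum(odd_example), len(odd_example), mean(odd_example))  # 45 9 5.0
```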
36Mode vs. Median vs. Mean
- In our height example, the mode and median were the same, and the mean was fairly close to the mode and median.
- This was the case because the height distribution was fairly symmetrical.
- However, when the underlying distribution is not symmetrical, the three measures of central tendency can be quite different.
37- This raises the issue of which measure is best.
- Note that if you were calculating these values, you would show all your steps (it's good to be prof!).
38Some Visual Demos
Here is a demonstration that allows you to change
a frequency histogram while simultaneously noting
the effects of those changes on the mean versus
the median. As you use the demo, you should
easily be able to think about how these changes
are also affecting the mode, right?
39Measures of Variability
- In addition to knowing where the centre of the distribution is, it is often helpful to know the degree to which individual values cluster around the centre.
- This is known as variability.
40Range
- There are various measures of variability, the most straightforward being the range of the sample:
- Range = highest value - lowest value
- While the range provides a good first pass at variability, it is not the best measure because of its sensitivity to extreme scores (see text).
41The Average Deviation
- Another approach to estimating variability is to directly measure the degree to which individual data points differ from the mean and then average those deviations.
- That is:
- $\text{average deviation} = \frac{\sum (X - \bar{X})}{N}$
42The Average Deviation
- However, if we try to do this with real data, the result will always be zero, because the deviations above the mean exactly cancel those below it.
- Example: (2, 3, 4, 4, 6, 6, 10)
43The Mean Absolute Deviation (MAD)
- One way to get around the problem with the average deviation is to use the absolute value of the differences, instead of the differences themselves.
- The absolute value of some number is just the number without any sign.
- For example: $|-3| = 3$
44The Mean Absolute Deviation (MAD)
- Thus, we could re-write and solve our average deviation question as follows:
- $\text{MAD} = \frac{\sum |X - \bar{X}|}{N}$
- The data set in question has a mean of 5 and a mean absolute deviation of 2.
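The range, the average deviation, and the MAD can be compared side by side in a short sketch. The values 2, 3, 4, 4, 6, 6, 10 are assumed here because they reproduce the mean of 5 and MAD of 2 quoted above; the function names are hypothetical.

```python
# Values assumed for illustration; they reproduce the mean of 5 and MAD of 2.
data = [2, 3, 4, 4, 6, 6, 10]

def mean(values):
    return sum(values) / len(values)

def average_deviation(values):
    """Mean of the signed deviations from the mean (zero, up to rounding)."""
    m = mean(values)
    return sum(x - m for x in values) / len(values)

def mean_absolute_deviation(values):
    """Mean of the absolute deviations from the mean."""
    m = mean(values)
    return sum(abs(x - m) for x in values) / len(values)

print(max(data) - min(data))          # range: 8
print(average_deviation(data))        # 0.0
print(mean_absolute_deviation(data))  # 2.0
```

With data whose mean is not exactly representable in floating point, the average deviation may print as a tiny nonzero number rather than exactly 0.0.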
45The Variance
- Although the MAD is an acceptable measure of variability, the most commonly used measure is variance (denoted $s^2$ for a sample and $\sigma^2$ for a population) and its square root, termed the standard deviation (denoted $s$ for a sample and $\sigma$ for a population).
46The Variance
- The computation of variance is also based on the basic notion of the average deviation. However, instead of getting around the zero problem by using absolute deviations (as in MAD), the zero problem is eliminated by squaring the differences from the mean.
- Specifically:
- $s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$
48Alternate formula for $s^2$ and $s$
- The definitional formula of variance just presented was
- $s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$
- An equivalent formula that is easier to work with when calculating variances by hand is
- $s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}$
- Although this second formula may look more intimidating, a few examples will show you that it is actually easier to work with (as you'll see in assignment 2).
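A quick numerical check that the two forms agree, assuming the standard textbook versions: the definitional form (sum of squared deviations from the mean, over N - 1) and the computational form (sum of squared scores minus the squared sum over N, all over N - 1). The data values and function names are illustrative assumptions.

```python
import math

# Values assumed for illustration (mean 5, N = 7).
data = [2, 3, 4, 4, 6, 6, 10]

def variance_definitional(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)

def variance_computational(values):
    """(Sum of X^2 minus (sum of X)^2 / N), divided by N - 1."""
    n = len(values)
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return (total_sq - total ** 2 / n) / (n - 1)

s2 = variance_definitional(data)
print(s2, variance_computational(data))  # 7.0 7.0
print(math.sqrt(s2))                     # standard deviation s
```

The computational form needs only two running totals (the sum of X and the sum of X squared), which is why it is convenient by hand.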
49Visualizing Means and Standard Deviations
This demonstration allows you to play with the
mean and standard deviation of a distribution.
Note that changing the mean of the distribution
simply moves the entire distribution to the left
or right without changing its shape. In
contrast, changing the standard deviation alters
the spread of the data but does not affect where
the distribution is centered.
50Estimating Population Parameters
- So, the mean ($\bar{X}$) and variance ($s^2$) are the descriptive statistics that are most commonly used to represent the data points of some sample.
- The real reason that they are the preferred measures of central tendency and variability is that they have certain properties as estimators of their corresponding population parameters, $\mu$ and $\sigma^2$.
51Estimating Population Parameters
- Four properties are considered desirable in a population estimator: sufficiency, unbiasedness, efficiency, and resistance.
- Both the mean and the variance are the best estimators in their class in terms of the first three of these four properties.
- To understand these properties, you first need to understand a concept in statistics called the sampling distribution.
52Sampling Distribution Demo
We will discuss sampling distributions off and on throughout the course, and I only want to touch on the notion now. Basically, the idea is this: in order to examine the properties of a statistic, we often want to take repeated samples from some population of data and calculate the relevant statistic on each sample. We can then look at the distribution of the statistic across these samples and ask a variety of questions about it. Check out this demonstration, which I hope makes the concept of sampling distributions clearer.
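The repeated-sampling idea can be simulated in a few lines. This is a sketch under stated assumptions, not the demonstration referred to above: the population (the integers 1 to 100), the sample size of 25, the 2000 repetitions, and the `sampling_distribution` helper are all made up for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

population = list(range(1, 101))  # hypothetical population: the integers 1..100

def sampling_distribution(statistic, sample_size, n_samples):
    """Compute `statistic` on repeated random samples drawn with replacement."""
    return [
        statistic(random.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]

sample_means = sampling_distribution(statistics.mean, sample_size=25, n_samples=2000)
print(statistics.mean(population))     # population mean: 50.5
print(statistics.mean(sample_means))   # close to 50.5
print(statistics.stdev(sample_means))  # spread of the sampling distribution
```

Swapping `statistics.mean` for `statistics.median` or `statistics.variance` gives the sampling distribution of a different statistic from the same population.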
53Properties of a Statistic
- 1) Sufficiency
- A sufficient statistic is one that makes use of
all of the information in the sample to estimate
its corresponding parameter.
54Estimating Population Parameters
- 2) Unbiasedness
- A statistic is said to be an unbiased estimator if its expected value (i.e., the mean of the statistic across many samples, such as the mean of a number of sample means) is equal to the population parameter it is estimating.
- Explanation of N-1 in the $s^2$ formula.
55Assessing the Bias of an Estimator
- Using this procedure, the mean can be shown to be an unbiased estimator (see p. 47).
- However, if the more intuitive formula for $s^2$ is used
- $s^2 = \frac{\sum (X - \bar{X})^2}{N}$
- it turns out to underestimate $\sigma^2$.
56Assessing the Bias of an Estimator
- This bias toward underestimation is caused by the act of sampling, and it can be shown that the bias is eliminated if N-1 is used in the denominator instead of N.
- Note that this is only true when calculating $s^2$. If you have a measurable population and you want to calculate $\sigma^2$, you use N in the denominator, not N-1.
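The bias can be seen in a small simulation. The sketch below assumes a normal population with a true variance of 1, samples of size 5, and 20000 repetitions; all of these settings and the function names are illustrative choices.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def variance(values, denominator):
    """Sum of squared deviations from the sample mean over the given denominator."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / denominator

def average_estimates(sample_size=5, n_samples=20000):
    """Average both variance estimates over many samples from N(0, 1)."""
    biased_total = unbiased_total = 0.0
    for _ in range(n_samples):
        sample = [random.gauss(0, 1) for _ in range(sample_size)]
        biased_total += variance(sample, sample_size)        # divide by N
        unbiased_total += variance(sample, sample_size - 1)  # divide by N - 1
    return biased_total / n_samples, unbiased_total / n_samples

biased, unbiased = average_estimates()
print(biased)    # near 0.8, i.e. (N - 1) / N times the true variance of 1
print(unbiased)  # near 1.0
```

The shortfall of the N-denominator version shrinks as the sample size grows, since (N - 1) / N approaches 1.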
57Degrees of Freedom
- The mean of 6, 8, 10 is 8.
- If I allow you to change as many of these numbers
as you want BUT the mean must stay 8, how many of
the numbers are you free to vary?
58Degrees of Freedom
- The point of this exercise is that when the mean is fixed, it removes a degree of freedom from your sample; this is like actually subtracting 1 from the number of observations in your sample.
- It is for exactly this reason that we use N-1 in the denominator when we calculate $s^2$ (i.e., the calculation requires that the mean be fixed first, which effectively removes, or fixes, one of the data points).
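The exercise with 6, 8, 10 can be expressed as a tiny calculation: once the mean is fixed and all but one value are chosen, the last value is forced. `forced_last_value` is a hypothetical helper written for this illustration.

```python
def forced_last_value(free_values, required_mean, n):
    """Given n - 1 freely chosen values, return the one value that keeps the mean fixed."""
    return required_mean * n - sum(free_values)

# Keep the mean of three numbers at 8 while choosing two of them freely.
print(forced_last_value([6, 8], 8, 3))    # 10, the original data
print(forced_last_value([1, 100], 8, 3))  # -77
```

Two values are free to vary and the third is not, which is the N - 1 degrees of freedom for N = 3.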
59Estimating Population Parameters
- 3) Efficiency
- The efficiency of a statistic is reflected in the
variance that is observed when one examines the
means of a bunch of independently chosen samples.
The smaller the variance, the more efficient the
statistic is said to be.
60Estimating Population Parameters
- 4) Resistance
- The resistance of an estimator refers to the degree to which that estimate is affected by extreme values.
- As mentioned previously, both $\bar{X}$ and $s^2$ are highly sensitive to extreme values.
61Estimating Population Parameters
- 4) Resistance
- Despite this, they are still the most commonly used estimates of the corresponding population parameters, mostly because of their superiority over other measures in terms of sufficiency, unbiasedness, and efficiency.
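The lack of resistance is easy to see by changing a single score. In this sketch the second list is hypothetical: it is the odd-number median example with the 12 replaced by 120, as if the value had been mistyped.

```python
import statistics

original = [1, 3, 3, 4, 4, 5, 6, 7, 12]
with_outlier = [1, 3, 3, 4, 4, 5, 6, 7, 120]  # hypothetical: 12 mistyped as 120

for label, data in (("original", original), ("with outlier", with_outlier)):
    print(
        label,
        statistics.mean(data),      # shifts from 5 to 17
        statistics.median(data),    # stays at 4
        statistics.variance(data),  # grows sharply
    )
```

The median is unchanged while the mean and the variance both move a long way, which is the sense in which the median is the more resistant of the measures.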