Title: Data Description
1CHAPTER 3
Data Description
2Objectives
- Summarize data using measures of central
tendency, such as the mean, median, mode, and
midrange.
- Describe data using the measures of variation,
such as the range, variance, and standard
deviation.
- Identify the position of a data value in a data
set using various measures of position, such as
percentiles, deciles, and quartiles.
3Objectives (contd.)
- Use the techniques of exploratory data analysis,
including boxplots and five-number summaries to
discover various aspects of data.
4Introduction
- Statistical methods can be used to summarize
data.
- Measures of average are also called measures of
central tendency and include the mean, median,
mode, and midrange.
- Measures that determine the spread of data values
are called measures of variation or measures of
dispersion and include the range, variance, and
standard deviation.
5Introduction (contd.)
- Measures of position tell where a specific data
value falls within the data set or its relative
position in comparison with other data values.
- The most common measures of position are
percentiles, deciles, and quartiles.
6 Introduction (contd.)
- The measures of central tendency, variation, and
position are part of what is called traditional
statistics. This type of statistics is typically
used to confirm conjectures about the data.
7Introduction (contd.)
- Another type of statistics is called exploratory
data analysis (EDA). These techniques include the
boxplot and the five-number summary. They can be
used to explore data to see what they show.
8Basic Vocabulary
- A statistic is a characteristic or measure
obtained by using the data values from a sample.
- A parameter is a characteristic or measure
obtained by using all the data values for a
specific population.
- When the data in a data set are ordered, the set
is called a data array.
9General Rounding Rule
- In statistics, the basic rounding rule is that
when a calculation involves several computations,
rounding should not be done until the final
answer is calculated.
10(3.2) The Arithmetic Average
- The mean is the sum of the values divided by the
total number of values.
- Rounding rule: the mean should be rounded to one
more decimal place than occurs in the raw data.
- The type of mean that considers an additional
factor, such as a weight attached to each value,
is called the weighted mean.
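The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names and the sample numbers are hypothetical, and the weighted mean is assumed to use the usual form (sum of weight times value, divided by the sum of the weights).

```python
def mean(values):
    """Sum of the values divided by the total number of values."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


scores = [10, 20, 30, 40]   # hypothetical raw data (whole numbers)
# Rounding rule: one more decimal place than the raw data.
print(round(mean(scores), 1))

grade_points = [4, 3, 2]    # hypothetical grade points
credits = [3, 3, 4]         # hypothetical credit hours, used as weights
print(round(weighted_mean(grade_points, credits), 1))
```

Rounding is applied only when printing, so the unrounded values stay available for later computations.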
11The Arithmetic Average
- The Greek letter μ (mu) is used to represent the
population mean.
- The symbol x̄ (x-bar) represents the sample
mean.
- Assume that data are obtained from a sample
unless otherwise specified.
12Median and Mode
- The median is the halfway point in a data set.
The symbol for the median is MD.
- The median is found by arranging the data in
order and selecting the middle point.
- The value that occurs most often in a data set is
called the mode.
- The mode for grouped data, or the class with the
highest frequency, is the modal class.
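A short Python sketch of the median and mode for ungrouped data follows. Two conventions are assumed here rather than taken from the slide: when the number of values is even, the median is the average of the two middle values, and when no value repeats, the data set is treated as having no mode. The function names and data are hypothetical.

```python
from collections import Counter


def median(values):
    """Middle point of the ordered data (data array)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Assumed convention: average the two middle values when n is even.
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes(values):
    """All values that occur most often; empty list if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: no repeated value means no mode
    return sorted(v for v, c in counts.items() if c == highest)


data = [9, 3, 5, 3, 8, 2]   # hypothetical sample
print(median(data))         # middle of 2, 3, 3, 5, 8, 9
print(modes(data))
```

Returning a list from `modes` allows for data sets with more than one mode.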
13Midrange
- The midrange is defined as the sum of the lowest
and highest values in the data set divided by 2.
- The symbol for midrange is MR.
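As a minimal sketch, the midrange can be computed directly from the lowest and highest values. The function name and the data are hypothetical.

```python
def midrange(values):
    """Sum of the lowest and highest values divided by 2."""
    return (min(values) + max(values)) / 2


data = [2, 3, 6, 8, 4, 1]   # hypothetical sample
print(midrange(data))       # (1 + 8) / 2
```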
14Central Tendency: The Mean
- One computes the mean by using all the values of
the data.
- The mean varies less than the median or mode when
samples are taken from the same population and
all three measures are computed for these
samples.
- The mean is used in computing other statistics,
such as variance.
15Central Tendency: The Mean (contd.)
- The mean for the data set is unique, and not
necessarily one of the data values.
- The mean cannot be computed for an open-ended
frequency distribution.
- The mean is affected by extremely high or low
values and may not be the appropriate average to
use in these situations.
16Central Tendency: The Median
- The median is used when one must find the center
or middle value of a data set.
- The median is used when one must determine
whether the data values fall into the upper half
or lower half of the distribution.
- The median is used to find the average of an
open-ended distribution.
- The median is affected less than the mean by
extremely high or extremely low values.
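The difference in sensitivity to extreme values can be checked numerically. The sketch below uses Python's standard `statistics` module on two hypothetical data sets that differ only in their largest value.

```python
import statistics

typical = [10, 12, 13, 15, 20]         # hypothetical data
with_extreme = [10, 12, 13, 15, 200]   # same data with one extremely high value

# The mean moves from 14 to 50, while the median stays at 13.
print(statistics.mean(typical), statistics.median(typical))
print(statistics.mean(with_extreme), statistics.median(with_extreme))
```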
17Central Tendency: The Mode
- The mode is used when the most typical case is
desired.
- The mode is the easiest average to compute.
- The mode can be used when the data are nominal,
such as religious preference, gender, or
political affiliation.
- The mode is not always unique. A data set can
have more than one mode, or the mode may not
exist for a data set.
18Central Tendency: The Midrange
- The midrange is easy to compute.
- The midrange gives the midpoint between the
lowest and highest values.
- The midrange is affected by extremely high or low
values in a data set.
19Distribution Shapes
- In a positively skewed or right skewed
distribution, the majority of the data values
fall to the left of the mean and cluster at the
lower end of the distribution.
20Distribution Shapes (contd.)
- In a symmetrical distribution, the data values
are evenly distributed on both sides of the mean.
21Distribution Shapes (contd.)
- When the majority of the data values fall to the
right of the mean and cluster at the upper end of
the distribution, with the tail to the left, the
distribution is said to be negatively skewed or
left skewed.
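A small numerical illustration of a positively skewed data set, using hypothetical values with a long right tail: most values fall below the mean, and the mean is pulled above the median. This is a single example of the pattern, not a general test for skewness.

```python
import statistics

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # hypothetical data, tail to the right

mean = statistics.mean(right_skewed)       # 38 / 8 = 4.75
median = statistics.median(right_skewed)   # average of the two middle 3s
below_mean = sum(1 for x in right_skewed if x < mean)

print(mean, median)
print(below_mean, "of", len(right_skewed), "values fall to the left of the mean")
```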
22The Range
- The range is the highest value minus the lowest
value in a data set.
- The symbol R is used for the range.
23(3.3) Variance and Standard Deviation
- The variance is the average of the squares of the
distances of each value from the mean. The symbol
for the population variance is σ² (sigma squared).
24Variance and Standard Deviation
- The standard deviation is the square root of the
variance. The symbol for the population standard
deviation is σ (sigma).
- Rounding rule: the final answer should be rounded
to one more decimal place than the original data.
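The population variance and standard deviation described above can be sketched as follows. The function names and data are hypothetical, and the code divides by the number of values, matching the "average of the squares" wording for a population; sample formulas are not covered here.

```python
import math


def population_variance(values):
    """Average of the squared distances of each value from the mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def population_std(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))


data = [10, 60, 50, 30, 40, 20]   # hypothetical population, mean 35
variance = population_variance(data)
std_dev = population_std(data)

# Rounding rule: round only the final answers, to one more decimal place
# than the original data.
print(round(variance, 1), round(std_dev, 1))
```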
25Coefficient of Variation
- The coefficient of variation is the standard
deviation divided by the mean. The result is
expressed as a percentage.
- The coefficient of variation is used to compare
standard deviations when the units are different
for the two variables being compared.
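A minimal sketch of the coefficient of variation, comparing two variables measured in different units. The function name and the summary figures (unit sales versus commissions in dollars) are hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation divided by the mean, expressed as a percentage."""
    return std_dev / mean * 100


# Hypothetical summary statistics for two variables in different units.
cv_sales = coefficient_of_variation(std_dev=5, mean=87)             # units sold
cv_commissions = coefficient_of_variation(std_dev=773, mean=5225)   # dollars

print(round(cv_sales, 1), round(cv_commissions, 1))
# The larger percentage indicates the more variable of the two.
```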
26Variance and Standard Deviation
- Variances and standard deviations can be used to
determine the spread of the data. If the variance
or standard deviation is large, the data are more
dispersed. The information is useful in comparing
two or more data sets to determine which is more
variable.
- The measures of variance and standard deviation
are used to determine the consistency of a
variable.
27Variance and Standard Deviation (contd.)
- The variance and standard deviation are used to
determine the number of data values that fall
within a specified interval in a distribution.
- The variance and standard deviation are used
quite often in inferential statistics.
28Chebyshev's Theorem
- The proportion of values from a data set that
will fall within k standard deviations of the
mean will be at least 1 − 1/k², where k is a
number greater than 1.
- This theorem applies to any distribution
regardless of its shape.
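The bound 1 − 1/k² is easy to evaluate for specific values of k. The sketch below is illustrative only; the function name is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


for k in (2, 3):
    # k = 2 gives at least 75%; k = 3 gives at least about 88.9%.
    print(k, round(chebyshev_lower_bound(k) * 100, 1))
```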
29Empirical Rule for Normal Distributions
- The following apply to a bell-shaped
distribution.
- Approximately 68% of the data values fall within
one standard deviation of the mean.
- Approximately 95% of the data values fall within
two standard deviations of the mean.
- Approximately 99.7% of the data values fall
within three standard deviations of the mean.
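These approximate percentages can be checked against an exact normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `proportion_within` is hypothetical.

```python
from statistics import NormalDist


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(proportion_within(k) * 100, 1))
```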
30Standard Scores
- A standard score or z score is used when direct
comparison of raw scores is impossible.
- A standard score or z score for a value is
obtained by subtracting the mean from the value
and dividing the result by the standard
deviation.
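As a sketch, the z score calculation lets two raw scores from different tests be compared on a common scale. The function name and the test figures are hypothetical.

```python
def z_score(value, mean, std_dev):
    """Subtract the mean from the value and divide by the standard deviation."""
    return (value - mean) / std_dev


# Hypothetical scores on two tests with different means and spreads.
z_test_a = z_score(value=65, mean=50, std_dev=10)
z_test_b = z_score(value=30, mean=25, std_dev=5)

print(z_test_a, z_test_b)
# The higher z score marks the better relative position.
```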
31Percentiles
- Percentiles are position measures used in
educational and health-related fields to indicate
the position of an individual in a group.
- A percentile, P, is an integer between 1 and 99
such that the Pth percentile is a value where P%
of the data values are less than or equal to the
value and (100 − P)% of the data values are
greater than or equal to the value.
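One way to relate a data value to this definition is to compute the percentage of data values that are less than or equal to it. The sketch below follows the wording above literally; textbooks and software use several slightly different percentile formulas, so this should be read as one possible convention. The function name and data are hypothetical.

```python
def percent_at_or_below(data, value):
    """Percentage of data values that are less than or equal to value."""
    count = sum(1 for x in data if x <= value)
    return count / len(data) * 100


scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]   # hypothetical test scores
print(percent_at_or_below(scores, 12))          # 7 of the 10 scores are <= 12
```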
32Quartiles and Deciles
- Quartiles divide the distribution into four
groups, separated by Q1, Q2, and Q3. Note that Q1
is the same as the 25th percentile; Q2 is the same
as the 50th percentile, or the median; and Q3
corresponds to the 75th percentile.
- Deciles divide the distribution into 10 groups.
They are denoted by D1, D2, ..., D10.
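Quartiles and deciles can be computed with `statistics.quantiles` from the Python standard library (Python 3.8 or later), which returns the n − 1 cut points that separate n groups. The sketch uses the function's default interpolation method, which may differ from the hand method taught in a given textbook, and the data are hypothetical.

```python
import statistics

data = [2, 4, 6, 8, 10, 12, 14]   # hypothetical data array

# n=4 returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
print(q1, q2, q3)

# n=10 returns the nine cut points between the ten groups.
deciles = statistics.quantiles(data, n=10)
print(len(deciles))
```

Here Q2 coincides with the median of the data, as the slide states.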
33Outliers
- An outlier is an extremely high or an extremely
low data value when compared with the rest of the
data values.
- Outliers can be the result of measurement or
observational error.
- When a distribution is normal or bell-shaped,
data values that are beyond three standard
deviations of the mean can be considered
suspected outliers.
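The three-standard-deviation check can be sketched as below. The function name and data are hypothetical, and the population standard deviation (`statistics.pstdev`) is assumed; the check is only meaningful when the data are roughly bell-shaped.

```python
import statistics


def suspected_outliers(data, k=3):
    """Values lying more than k standard deviations from the mean."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)   # assumption: population standard deviation
    return [x for x in data if abs(x - mu) > k * sigma]


# Hypothetical data: twenty values near 50 plus one extremely high value.
data = [48, 49, 50, 51, 52] * 4 + [95]
print(suspected_outliers(data))
```

A flagged value is only a suspect; whether it reflects a measurement or observational error still has to be checked.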
34Exploratory Data Analysis
- The purpose of exploratory data analysis is to
examine data in order to find out what
information can be discovered. For example:
- Are there any gaps in the data?
- Can any patterns be discerned?
35Boxplots and Five-Number Summaries
- Boxplots are graphical representations of a
five-number summary of a data set. The five
specific values that make up a five-number
summary are:
- The lowest value of the data set (minimum)
- Q1 (or 25th percentile)
- The median (or 50th percentile)
- Q3 (or 75th percentile)
- The highest value of the data set (maximum)
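The five values listed above can be collected with a short helper. This is a sketch with a hypothetical function name and data; the quartiles come from `statistics.quantiles` with its default method (Python 3.8 or later), which may give slightly different Q1 and Q3 than a textbook hand method.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a data set."""
    q1, q2, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, q2, q3, max(data)


data = [2, 4, 6, 8, 10, 12, 14]   # hypothetical data set
print(five_number_summary(data))
```

These are the five values a boxplot draws: the box spans Q1 to Q3 with a line at the median, and the whiskers extend to the minimum and maximum.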