Title: Probability and Statistics
1- Probability and Statistics
- Lecture notes 03
2Lesson Overview
- Types of Data
- Qualitative (Categorical)
- Quantitative (Numerical)
- Discrete vs. Continuous
- Levels of Measurement
- Nominal, Ordinal, Interval, Ratio
- Data Summary and Presentation
- The Stem-and-leaf Diagram
- The Frequency Distribution Tables
- Histogram
- The Box Plot
- Time Sequence Plots
3Types of Data
- Data can be classified as either numeric or nonnumeric. Specific terms are used as follows.
- Qualitative data are nonnumeric.
- Poor, Fair, Good, Better, Best; colors (ignoring any physical causes); and types of material (straw, sticks, bricks) are examples of qualitative data.
- Qualitative data are often termed categorical data.
- Some books use the terms individual and variable to reference the objects and characteristics described by a set of data.
- They also stress the importance of exact definitions of these variables, including what units they are recorded in.
- The reason the data were collected is also important.
4Types of Data
- Quantitative data are numeric.
- Quantitative data are further classified as either discrete or continuous.
- Discrete data are numeric data that have a finite or countable number of possible values. A classic example of discrete data is a finite subset of the counting numbers, 1, 2, 3, 4, 5, perhaps corresponding to Strongly Disagree, ..., Strongly Agree.
- When data represent counts, they are discrete. An example might be how many students were absent on a given day. Counts are usually considered exact and integer. Consider, however: if three tardies make an absence, then aren't two tardies equal to 0.67 absences?
5Quantitative data / Types of Data
- Continuous data have infinitely many possible values: 1.4, 1.41, 1.414, 1.4142, 1.41421, ... The real numbers are continuous, with no gaps or interruptions.
- Physically measurable quantities such as length, volume, time, and mass are generally considered continuous. At the physical level (microscopically), especially for mass, this may not be true, but for everyday situations it is a valid assumption.
- The structure and nature of data will greatly affect our choice of analysis method. By structure we mean, for example, that the data might be pairs of measurements.
6Levels of Measurement
- The experimental (scientific) method depends on physically measuring things.
- The concept of measurement has been developed in conjunction with the concepts of numbers and units of measurement.
- Statisticians categorize measurements according to levels.
- Each level corresponds to how the measurement can be treated mathematically.
7Levels of Measurement (Measurement Scales) Four
common types
- Nominal: Nominal data have no order and thus only give names or labels to various categories.
- Ordinal: Ordinal data have order, but the interval between measurements is not meaningful.
- Interval: Interval data have meaningful intervals between measurements, but there is no true starting point (zero).
- Ratio: Ratio data are at the highest level of measurement. Ratios between measurements as well as intervals are meaningful because there is a starting point (zero).
- (Gender is something you are born with, whereas sex is something you should get a license for.)
8Levels of Measurement (Measurement Scales) Four
common types
- Nominal: Categories are mutually exclusive (non-overlapping), but there is no order or ranking. For example, professors are divided into departments by subject, but no subject is ranked as better than another.
- Ordinal: Categories can be ordered, but the differences between them are not precise. Examples: letter grades, movie quality (excellent, good, adequate, bad, terrible).
- Interval: Data are ranked on a precise scale, but there is no meaningful zero. Examples: IQ scores and temperature (in Celsius or Fahrenheit). Neither has a meaningful zero.
- Ratio: Data can be ranked, there are precise differences between the ranks, and there is a meaningful zero. Examples: height, weight, salary, and age.
9Types of Data / Levels of Measurement
- Example 1: Colors. To most people, the colors black, brown, red, orange, yellow, green, blue, violet, gray, and white are just names of colors.
- To an electronics student familiar with color-coded resistors, these colors are in ascending order and thus represent at least ordinal data.
- To a physicist, the colors red, orange, yellow, green, blue, and violet correspond to specific wavelengths of light and would be an example of ratio data.
10Types of Data / Levels of Measurement
- Example 2: Temperatures. What level of measurement a temperature is depends on which temperature scale is used. Specific values:

  Celsius     Fahrenheit   Kelvin      Rankine
  0 °C        32 °F        273.15 K    491.67 °R
  100 °C      212 °F       373.15 K    671.67 °R
  -17.8 °C    0 °F         255.4 K     459.67 °R

- Only Kelvin and Rankine have true zeros (starting points), so ratios are meaningful. Celsius and Fahrenheit are interval data: order is certainly important and intervals are meaningful. However, a 180° dashboard is not twice as hot as the 90° outside temperature (Fahrenheit assumed)!
- Although ordinal data should not be used for calculations, it is not uncommon to find averages formed from data representing Strongly Disagree, ..., Strongly Agree! Also, an average of nominal data (zip codes, social security numbers) is rather meaningless!
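The dashboard remark can be checked numerically. The following is a minimal Python sketch, assuming the two readings are 180 °F and 90 °F; the helper name `fahrenheit_to_rankine` is made up for illustration. It converts both readings to the Rankine scale, which has a true zero, before forming the ratio.

```python
def fahrenheit_to_rankine(temp_f: float) -> float:
    """Convert Fahrenheit to Rankine (same degree size, but zero is absolute zero)."""
    return temp_f + 459.67


dashboard_f = 180.0  # dashboard temperature from the slide, in Fahrenheit
outside_f = 90.0     # outside temperature from the slide, in Fahrenheit

# Ratio taken directly on the interval scale: not physically meaningful.
naive_ratio = dashboard_f / outside_f

# Ratio taken on the ratio scale: this one is meaningful.
absolute_ratio = fahrenheit_to_rankine(dashboard_f) / fahrenheit_to_rankine(outside_f)

print(f"Ratio on the Fahrenheit scale: {naive_ratio:.2f}")
print(f"Ratio on the Rankine scale:    {absolute_ratio:.2f}")
```

On the absolute scale the dashboard is only about 16% hotter than the outside air, not 100% hotter.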
11Data Sources
- Published source
- Designed experiment
- Survey
- Observational study
12DATA SUMMARY
13DATA SUMMARY AND PRESENTATION
- The Stem-and-leaf Diagram
- The Frequency Tables
- Standard, Relative, and Cumulative
- Histograms
- The Box Plot
- Time Sequence Plots
14Graphical Displays
- The distribution of a variable describes what values the variable takes and how often each value occurs.
- The frequency of any value of a variable is the number of times that value occurs in the data.
- The relative frequency of any value is the proportion (fraction or percent) of all observations that have that value.
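As an illustration of these two definitions, here is a small Python sketch. The list `responses` is made-up data, not taken from the lecture.

```python
from collections import Counter

# Hypothetical survey responses (invented for illustration).
responses = ["Good", "Fair", "Good", "Best", "Poor", "Good", "Fair", "Best"]

# Frequency: the number of times each value occurs.
frequency = Counter(responses)

# Relative frequency: the proportion of all observations with that value.
n = len(responses)
relative_frequency = {value: count / n for value, count in frequency.items()}

for value, count in frequency.items():
    print(f"{value:<5} {count:>2} {relative_frequency[value]:.3f}")
```

The relative frequencies always add up to 1 (or 100%), which is a useful check on a hand-built table.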
15DATA SUMMARY AND PRESENTATION
- Frequency Tables: Standard, Relative, and Cumulative
- Histograms, Ogive, Pareto Diagrams
- Pie Charts
- Exploratory Data Analysis
- Stem-and-Leaf Diagram
- Boxplots
17Types of Variables
- Categorical variable: Places an individual into one of several categories.
- Examples: gender, race, political party, zip code
- Quantitative variable: Takes numerical values for which arithmetic operations make sense.
- Examples: OYS score, number of votes, cost of textbooks
18Graphs for categorical variables
- Pie charts require relative frequencies, since they display percentages rather than raw counts. The relative frequency of each category corresponds to the percent of the pie occupied by that category.
- Bar graphs display the categories on the horizontal axis and the frequencies (or relative frequencies) on the vertical axis.
19Graphs for quantitative variables
- Histograms
- The data are divided into classes of equal width and the number (or percentage) of observations in each class is counted.
- The data scale is on the horizontal axis.
- The frequency (or relative frequency) scale is on the vertical axis.
- Bars are drawn so that the base of each bar covers the class and the height of each bar represents the frequency (or relative frequency).
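The counting step behind a histogram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `histogram_counts` and the data are invented, the classes span exactly from the smallest to the largest value, and the data are assumed not to be all equal (otherwise the class width would be zero).

```python
def histogram_counts(data, num_classes):
    """Count observations in equal-width classes spanning min(data)..max(data)."""
    low, high = min(data), max(data)
    width = (high - low) / num_classes
    counts = [0] * num_classes
    for x in data:
        # The largest value is placed in the last class instead of opening a new one.
        index = min(int((x - low) / width), num_classes - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(num_classes + 1)]
    return edges, counts


# Hypothetical data (invented for illustration).
data = [10, 15, 21, 22, 25, 28, 31, 33, 34, 38, 41, 50]
edges, counts = histogram_counts(data, 4)

# Text rendering: one row per class, bar length equal to the frequency.
for i, count in enumerate(counts):
    print(f"[{edges[i]:5.1f}, {edges[i + 1]:5.1f}) {'#' * count}")
```

Here the bars are printed sideways as rows of `#`; a plotting library would draw the same counts as vertical bars over the class intervals.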
20- Stem-plots or Stem-and-Leaf Displays
- Separate each observation into a stem (all but the final, rightmost digit of the (rounded) data) and a leaf (the final digit of the (rounded) data).
- Write the stems in a vertical column, smallest to largest from top to bottom.
- Write each leaf in the row to the right of its stem, in increasing order.
21Histograms vs. Stem plots
- Both are used to describe the distribution of data.
- Stemplots display actual data values.
- Stemplots are used for small data sets (fewer than 100 values).
- Histograms can be constructed for larger data sets.
22Common Distributional Shapes
- A symmetric distribution is one where both sides about the center line are approximately mirror images of each other.
- A skewed distribution is one whose two sides are not mirror images: one side extends farther from the center line than the other.
- Skewed to the right: The right side of the histogram extends much farther than the left side.
- Skewed to the left: The left side of the histogram extends much farther than the right side.
23Common Distributional Shapes
- A bimodal distribution has two humps where much of the data lies.
- In a uniform distribution, all classes occur with approximately the same frequency.
- An outlier in any graph of data is an individual observation that falls outside the overall pattern of the graph.
24DATA SUMMARY AND PRESENTATION
- THE STEM-AND-LEAF DIAGRAM
- A stem-and-leaf diagram is a good way to obtain an informative visual display of a data set x1, x2, ..., xn, where each number xi consists of at least two digits.
- To construct a stem-and-leaf diagram, we divide each number xi into two parts:
- a stem, consisting of one or more of the leading digits, and
- a leaf, consisting of the remaining digits.
25- Write the stems in a vertical column, smallest to largest from top to bottom.
- Write each leaf in the row to the right of its stem, in increasing order.
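The construction steps can be sketched in Python as follows. This is a minimal illustration, assuming non-negative integer data with a single-digit leaf; the function name `stem_and_leaf` and the data are invented and are not the data set used in the lecture example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return {stem: [leaves]} for non-negative integers, both in increasing order."""
    rows = defaultdict(list)
    for value in sorted(values):
        # The leaf is the final digit; the stem is all the leading digits.
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    # List every stem between the smallest and largest so gaps stay visible.
    return {stem: rows.get(stem, []) for stem in range(min(rows), max(rows) + 1)}


# Hypothetical data (invented for illustration).
data = [76, 87, 101, 83, 97, 115, 90, 105, 87, 110]

for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem:3d} | {''.join(str(leaf) for leaf in leaves)}")
```

Sorting the values first guarantees that the leaves in each row come out in increasing order, and iterating over the full range of stems keeps empty rows in the display.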
26THE STEM-AND-LEAF DIAGRAM
- EXAMPLE
- Construct a stem-and-leaf display for the
following data
27THE STEM-AND-LEAF DIAGRAM
- SOLUTION
- We will select as stem values the numbers 7, 8, 9, 10, 11, ..., 24.
- The resulting stem-and-leaf diagram is presented in the following figure.
28THE STEM-AND-LEAF DIAGRAM
29THE STEM-AND-LEAF DIAGRAM
- Stems are sorted in decreasing order; leaves are ordered in increasing order.
30THE STEM-AND-LEAF DIAGRAM
- Inspection of this display immediately reveals that most of the data lie between 110 and 200 and that a central value is somewhere between 150 and 160. Furthermore, the data are distributed approximately symmetrically about the central value.
- The stem-and-leaf diagram enables us to determine quickly some important features of the data that were not immediately obvious in the original table.
31THE FREQUENCY DISTRIBUTION TABLES
- Frequency Tables
- Frequency refers to the number of times each category occurs in the original data.
- A frequency table lists in one column the data categories or classes and in another column the corresponding frequencies.
- A common way to summarize or present data is with a standard frequency table.
32Frequency Tables
- Often, the category column will have continuous data and hence be presented as a range of values. In such a case, the terms class limits, class boundaries, class widths, and class marks must be well understood.
- Class limits are the largest or smallest numbers which can actually belong to each class. Each class has a lower class limit and an upper class limit.
- Class boundaries are the numbers which separate classes. Each lies halfway between the upper limit of one class and the lower limit of the next.
33Frequency Tables
- Class marks are the midpoints of the classes. It may be necessary to use class marks to find the mean, standard deviation, etc. of data summarized in a frequency table.
- Class width is the difference between two consecutive class boundaries (or between corresponding limits of consecutive classes).
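The relationship between limits, boundaries, marks, and width can be made concrete with a short Python sketch. The class limits and frequencies below are made up for illustration, and the data are assumed to be whole numbers, so neighboring limits differ by 1. The last step shows one common use of class marks: approximating the mean by treating every observation as if it sat at the midpoint of its class.

```python
# Hypothetical classes as (lower limit, upper limit), for whole-number data.
class_limits = [(60, 69), (70, 79), (80, 89), (90, 99)]
frequencies = [3, 7, 12, 8]  # hypothetical frequency of each class

# Gap between an upper limit and the next lower limit (1 for whole numbers).
gap = class_limits[1][0] - class_limits[0][1]

# Boundaries lie halfway between neighboring limits.
boundaries = [(lo - gap / 2, hi + gap / 2) for lo, hi in class_limits]

# Class marks are the midpoints of the classes.
marks = [(lo + hi) / 2 for lo, hi in class_limits]

# Class width: difference between the two boundaries of a class.
width = boundaries[0][1] - boundaries[0][0]

# Approximate mean of the grouped data, using the class marks.
n = sum(frequencies)
grouped_mean = sum(f * m for f, m in zip(frequencies, marks)) / n

print("boundaries:", boundaries)
print("marks:", marks)
print("width:", width)
print(f"approximate mean: {grouped_mean:.2f}")
```

Note that the width (10) equals the difference between consecutive lower limits (70 - 60), not the difference between the limits of a single class (69 - 60 = 9).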
34Frequency Tables
- Following are guidelines for constructing frequency tables.
- The classes must be "mutually exclusive": no element can belong to more than one class.
- Even if the frequency is zero, include each and every class.
- Make all classes the same width. (However, open-ended classes may be inevitable.)
- Target between 5 and 20 classes, depending on the range and number of data points.
- Keep the limits as simple and as convenient as possible.
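A rough Python sketch of these guidelines follows, assuming whole-number data. The function names `make_classes` and `frequency_table` and the data are invented for illustration; a small target of 4 classes is used only to keep the output short. The classes are equal-width and non-overlapping, and a class with zero frequency is still listed.

```python
import math


def make_classes(data, target_classes=8):
    """Build equal-width, non-overlapping (lower, upper) class limits for integers."""
    low, high = min(data), max(data)
    # Round the width up so the classes cover the whole range of the data.
    width = math.ceil((high - low + 1) / target_classes)
    classes = []
    lower = low
    while lower <= high:
        classes.append((lower, lower + width - 1))
        lower += width
    return classes


def frequency_table(data, classes):
    """Count observations per class; classes with zero frequency are kept."""
    return {c: sum(c[0] <= x <= c[1] for x in data) for c in classes}


# Hypothetical data (invented for illustration).
data = [61, 64, 72, 75, 75, 78, 95, 98]
classes = make_classes(data, target_classes=4)
table = frequency_table(data, classes)

for (lo, hi), count in table.items():
    print(f"{lo}-{hi}: {count}")
```

This sketch starts the first class at the smallest observation. To keep the limits "simple and convenient", one would usually round that starting point down to a round number such as 60.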
35Frequency Tables
- Relative frequency tables contain the relative frequency instead of the absolute frequency. Relative frequencies can be expressed either as percentages or as their decimal fraction equivalents.
- Cumulative frequency tables contain frequencies which are cumulative for subsequent classes. In a cumulative frequency table, the words "less than" usually also appear in the left column.
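Both kinds of table can be derived from a standard frequency table, as in this minimal Python sketch. The class labels and frequencies are made up for illustration.

```python
from itertools import accumulate

# Hypothetical standard frequency table (invented for illustration).
classes = ["60-69", "70-79", "80-89", "90-99"]
frequencies = [3, 7, 12, 8]

n = sum(frequencies)

# Relative frequency: each class frequency divided by the total.
relative = [f / n for f in frequencies]

# Cumulative frequency: running total of the frequencies up to each class.
cumulative = list(accumulate(frequencies))

print(f"{'Class':<8}{'Freq':>6}{'Rel.':>8}{'Cum.':>6}")
for label, f, r, cf in zip(classes, frequencies, relative, cumulative):
    print(f"{label:<8}{f:>6}{r:>8.3f}{cf:>6}")
```

The last cumulative frequency always equals the total number of observations. In a "less than" table, the row for class 70-79 would be labeled "less than 79.5" (its upper class boundary) and would show the cumulative count 10.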
36Frequency Tables
- The frequency distribution
- A frequency distribution is a more compact summary of data than a stem-and-leaf diagram.
- To construct a frequency distribution, we must divide the range of the data into intervals, which are usually called class intervals, cells, or bins.
37Frequency Distribution Tables
- EXAMPLE
- Construct the frequency distribution table for
the following data
38THE FREQUENCY DISTRIBUTION TABLES
- SOLUTION
- Table columns: class, relative frequency, cumulative frequency
39Frequency Distribution Tables
- Another example, containing student distributions, is as follows