Title: Understanding Variability
1 Understanding Variability
Instructor: Ron S. Kenett
Email: ron_at_kpa.co.il
Course website: www.kpa.co.il/biostat
Course textbook: Modern Industrial Statistics, Kenett and Zacks, Duxbury Press, 1998
2 Course Syllabus
- Understanding Variability
- Variability in Several Dimensions
- Basic Models of Probability
- Sampling for Estimation of Population Quantities
- Parametric Statistical Inference
- Computer Intensive Techniques
- Multiple Linear Regression
- Statistical Process Control
- Design of Experiments
3 Discrete Data
A set of data is said to be discrete if the values or observations belonging to it are distinct and separate, that is, they can be counted (1, 2, 3, ...). For example:
- the number of kittens in a litter
- the number of patients in a doctor's surgery
- the number of flaws in one metre of cloth
- gender (male, female)
- blood group (O, A, B, AB)
4 Continuous Data
A set of data is said to be continuous if the values or observations belonging to it may take on any value within a finite or infinite interval. You can count, order and measure continuous data. For example:
- height
- weight
- temperature
- the amount of sugar in an orange
- the time required to run a mile
5 Types of Variables
- Qualitative Variables
  - Attributes, categories
  - Examples: male/female, registered to vote/not, ethnicity, eye color
- Quantitative Variables
  - Discrete: usually take on integer values but can take on fractions when the variable allows (counts, "how many")
  - Continuous: can take on any value at any point along an interval (measurements, "how much")
6 Self Assessment Test
For each of the following, indicate whether
the appropriate variable would be qualitative or
quantitative. If the variable is quantitative,
indicate whether it would be discrete or
continuous.
7 Self Assessment Test
- a) Whether you own an RCA Colortrak television set
  - Qualitative variable
  - two levels: yes/no
  - no measurement
- b) Your status as a full-time or a part-time student
  - Qualitative variable
  - two levels: full/part
  - no measurement
- c) Number of people who attended your school's graduation last year
  - Quantitative, discrete variable
  - a countable number
  - only whole numbers
8 Self Assessment Test
- d) The price of your most recent haircut
  - Quantitative, discrete variable
  - a countable number
  - only whole numbers
- e) Sam's travel time from his dorm to the Student Union
  - Quantitative, continuous variable
  - any number
  - time is measured
  - can take on any value greater than zero
9 Self Assessment Test
- f) The number of students on campus who belong to a social fraternity or sorority
  - Quantitative, discrete variable
  - a countable number
  - only whole numbers
10 Scales of Measurement
- Nominal Scale: Labels represent various levels of a categorical variable.
- Ordinal Scale: Labels represent an order that indicates either preference or ranking.
- Interval Scale: Numerical labels indicate order and distance between elements. There is no absolute zero and multiples of measures are not meaningful.
- Ratio Scale: Numerical labels indicate order and distance between elements. There is an absolute zero and multiples of measures are meaningful.
11 Self Assessment Test
Bill scored 1200 on the Scholastic Aptitude Test and entered college as a physics major. As a freshman, he changed to business because he thought it was more interesting. Because he made the dean's list last semester, his parents gave him $30 to buy a new Casio calculator. Identify at least one piece of information in the
12 Self Assessment Test
- a) nominal scale of measurement.
  - 1. Bill is going to college.
  - 2. Bill will buy a Casio calculator.
  - 3. Bill was a physics major.
  - 4. Bill is a business major.
  - 5. Bill was on the dean's list.
13 Self Assessment Test
- b) ordinal scale of measurement
  - Bill is a freshman.
- c) interval scale of measurement
  - Bill earned a 1200 on the SAT.
- d) ratio scale of measurement
  - Bill's parents gave him $30.
15 Histogram
A histogram is a way of summarising data that are measured on an interval scale (either discrete or continuous). It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. It divides up the range of possible values in a data set into classes or groups. For each group, a rectangle is constructed with a base length equal to the range of values in that specific group, and an area proportional to the number of observations falling into that group. Because frequency is represented by area, the heights of the rectangles are proportional to the frequencies only when all groups have the same width.
16 Key Terms
- Data array
  - An orderly presentation of data in either ascending or descending numerical order.
- Frequency Distribution
  - A table that represents the data in classes and that shows the number of observations in each class.
17 Key Terms
- Frequency Distribution
  - Class: the category
  - Frequency: number in each class
  - Class limits: boundaries for each class
  - Class interval: width of each class
  - Class mark: midpoint of each class
18 Sturges' Rule
- How to set the approximate number of classes to begin constructing a frequency distribution:
  k = 1 + 3.322 log10(n)
- where k = approximate number of classes to use and
- n = the number of observations in the data set.
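A minimal sketch of Sturges' rule, assuming the common form k = 1 + 3.322 log10(n). The function name `sturges_classes` is only illustrative and is not taken from the course material.

```python
import math

def sturges_classes(n: int) -> float:
    """Approximate number of classes suggested by Sturges' rule (assumed form)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 + 3.322 * math.log10(n)

k = sturges_classes(50)      # 50 observations, as in the hospital-cost example
print(round(k, 2))           # 6.64
print(math.ceil(k))          # 7, i.e. approximately 7 classes
```

The result is only a starting point; the number of classes is usually adjusted so that the class interval comes out to a convenient value.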
19 Frequency Distributions
1. Number of classes: Choose an approximate number of classes for your data. Sturges' rule can help.
2. Estimate the class interval: Divide the approximate number of classes (from Step 1) into the range of your data to find the approximate class interval, where the range is defined as the largest data value minus the smallest data value.
3. Determine the class interval: Round the estimate (from Step 2) to a convenient value.
20 Frequency Distributions
4. Lower Class Limit: Determine the lower class limit for the first class by selecting a convenient number that is smaller than the lowest data value.
5. Class Limits: Determine the other class limits by repeatedly adding the class width (from Step 3) to the prior class limit, starting with the lower class limit (from Step 4).
6. Define the classes: Use the sequence of class limits to define the classes.
21 Relative Frequency Distributions
1. Retain the same classes defined in the frequency distribution.
2. Sum the total number of observations across all classes of the frequency distribution.
3. Divide the frequency for each class by the total number of observations, forming the proportion (or, multiplied by 100, the percentage) of data values in each class.
22 Cumulative Relative Frequency Distributions
1. List the number of observations in the lowest class.
2. Add the frequency of the lowest class to the frequency of the second class. Record that cumulative sum for the second class.
3. For each subsequent class, continue to add the prior cumulative sum to the frequency for that class, so that the cumulative sum for the final class is the total number of observations in the data set.
23 Cumulative Relative Frequency Distributions
- 4. Divide the accumulated frequencies for each class by the total number of observations, giving you the percent of all observations that occurred up to and including that class.
- An Alternative: Accrue the relative frequencies for each class instead of the raw frequencies. Then you don't have to divide by the total to get percentages.
24 Example
- The average daily cost to community hospitals for patient stays during 1993 for each of the 50 U.S. states is given in the next table.
- a) Arrange these into a data array.
- b) Construct a stem-and-leaf display.
- c) Approximately how many classes would be appropriate for these data?
- d) Construct a frequency distribution. State interval width and class mark.
- e) Construct a histogram, a relative frequency distribution, and a cumulative relative frequency distribution.
25 Example Data List
AL   775    HI   823    MA 1,036    NM 1,046    SD   506
AK 1,136    ID   659    MI   902    NY   784    TN   859
AZ 1,091    IL   917    MN   652    NC   763    TX 1,010
AR   678    IN   898    MS   555    ND   507    UT 1,081
CA 1,221    IA   612    MO   863    OH   940    VT   676
CO   961    KS   666    MT   482    OK   797    VA   830
CT 1,058    KY   703    NE   626    OR 1,052    WA 1,143
DE 1,024    LA   875    NV   900    PA   861    WV   701
FL   960    ME   738    NH   976    RI   885    WI   744
GA   775    MD   889    NJ   829    SC   838    WY   537
26 Example Data Array
CA 1,221    TX 1,010    RI   885    NY   784    KS   666
WA 1,143    NH   976    LA   875    AL   775    ID   659
AK 1,136    CO   961    MO   863    GA   775    MN   652
AZ 1,091    FL   960    PA   861    NC   763    NE   626
UT 1,081    OH   940    TN   859    WI   744    IA   612
CT 1,058    IL   917    SC   838    ME   738    MS   555
OR 1,052    MI   902    VA   830    KY   703    WY   537
NM 1,046    NV   900    NJ   829    WV   701    ND   507
MA 1,036    IN   898    HI   823    AR   678    SD   506
DE 1,024    MD   889    OK   797    VT   676    MT   482
27 Example Stem and Leaf Display
Stem-and-Leaf Display: N = 50, Leaf Unit = 100
Count  Stem  Leaves
   1    12   21
   2    11   43, 36
   8    10   91, 81, 58, 52, 46, 36, 24, 10
   7     9   76, 61, 60, 40, 17, 02, 00
 (11)    8   98, 89, 85, 75, 63, 61, 59, 38, 30, 29, 23
   9     7   97, 84, 75, 75, 63, 44, 38, 03, 01
   7     6   78, 76, 66, 59, 52, 26, 12
   4     5   55, 37, 07, 06
   1     4   82
Range: 482 to 1,221
28 Example Frequency Distribution
- To approximate the number of classes we should use in creating the frequency distribution, use Sturges' Rule with n = 50:
  k = 1 + 3.322 log10(50) = 1 + 3.322(1.699) ≈ 6.64
- Sturges' rule suggests we use approximately 7 classes.
29 Example Frequency Distribution
- Step 1. Number of classes
  - Sturges' Rule: approximately 7 classes.
  - The range is 1,221 - 482 = 739
  - 739/7 ≈ 106 and 739/8 ≈ 92
- Steps 2 & 3. The Class Interval
  - So, if we use 8 classes, we can make each class 100 wide.
31 Example Frequency Distribution
- Step 4. The Lower Class Limit
  - If we start at 450, we can cover the range in 8 classes, each class 100 in width.
  - The first class: 450 up to 550
- Steps 5 & 6. Setting Class Limits
  - 450 up to 550          850 up to 950
  - 550 up to 650          950 up to 1,050
  - 650 up to 750          1,050 up to 1,150
  - 750 up to 850          1,150 up to 1,250
32 Example Frequency Distribution
Average daily cost        Number    Mark
450 to under 550             4        500
550 to under 650             3        600
650 to under 750             9        700
750 to under 850             9        800
850 to under 950            11        900
950 to under 1,050           7      1,000
1,050 to under 1,150         6      1,100
1,150 to under 1,250         1      1,200
Interval width: 100
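The class counts and class marks can be checked with a short sketch like the one below. It assumes classes of the form "lower limit up to, but not including, upper limit", starting at 450 with width 100. The helper name `frequency_distribution` is illustrative only; the values are the 50 state costs from the example data list.

```python
# Average daily hospital costs for the 50 states, copied from the data list.
costs = [
    775, 823, 1036, 1046, 506, 1136, 659, 902, 784, 859,
    1091, 917, 652, 763, 1010, 678, 898, 555, 507, 1081,
    1221, 612, 863, 940, 676, 961, 666, 482, 797, 830,
    1058, 703, 626, 1052, 1143, 1024, 875, 900, 861, 701,
    960, 738, 976, 885, 744, 775, 889, 829, 838, 537,
]

def frequency_distribution(values, lower, width, k):
    """Return (lower, upper, count, mark) for k equal-width classes.

    Each class includes its lower limit and excludes its upper limit.
    """
    counts = [0] * k
    for v in values:
        idx = int((v - lower) // width)
        if not 0 <= idx < k:
            raise ValueError(f"value {v} falls outside the classes")
        counts[idx] += 1
    return [
        (lower + i * width, lower + (i + 1) * width, counts[i],
         lower + i * width + width / 2)
        for i in range(k)
    ]

table = frequency_distribution(costs, lower=450, width=100, k=8)
for lo, hi, count, mark in table:
    print(f"{lo:>5} to under {hi:<5} {count:>3} {mark:>7.0f}")
```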
33 Example Histogram
34 Example Relative Frequency Distribution
Average daily cost        Number    Rel. Freq.
450 to under 550             4      4/50 = .08
550 to under 650             3      3/50 = .06
650 to under 750             9      9/50 = .18
750 to under 850             9      9/50 = .18
850 to under 950            11     11/50 = .22
950 to under 1,050           7      7/50 = .14
1,050 to under 1,150         6      6/50 = .12
1,150 to under 1,250         1      1/50 = .02
35 Example Polygon
36 Example Cumulative Frequency Distribution
Average daily cost        Number    Cum. Freq.
450 to under 550             4          4
550 to under 650             3          7
650 to under 750             9         16
750 to under 850             9         25
850 to under 950            11         36
950 to under 1,050           7         43
1,050 to under 1,150         6         49
1,150 to under 1,250         1         50
37 Example Cumulative Relative Frequency Distribution
Average daily cost        Cum. Freq.    Cum. Rel. Freq.
450 to under 550              4          4/50 = .08
550 to under 650              7          7/50 = .14
650 to under 750             16         16/50 = .32
750 to under 850             25         25/50 = .50
850 to under 950             36         36/50 = .72
950 to under 1,050           43         43/50 = .86
1,050 to under 1,150         49         49/50 = .98
1,150 to under 1,250         50         50/50 = 1.00
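As a rough check on the relative and cumulative relative frequencies, the sketch below starts from the eight class counts of the example (4, 3, 9, 9, 11, 7, 6, 1) and applies the steps listed earlier. Variable names are illustrative.

```python
from itertools import accumulate

counts = [4, 3, 9, 9, 11, 7, 6, 1]   # class frequencies from the example
total = sum(counts)                  # 50 observations

rel_freq = [c / total for c in counts]          # relative frequencies
cum_freq = list(accumulate(counts))             # running totals
cum_rel_freq = [c / total for c in cum_freq]    # cumulative relative frequencies

# The alternative route: accumulate the relative frequencies directly.
cum_rel_alt = list(accumulate(rel_freq))

for c, r, cf, crf in zip(counts, rel_freq, cum_freq, cum_rel_freq):
    print(f"{c:>3} {r:>5.2f} {cf:>3} {crf:>5.2f}")
```

Both routes give the same cumulative relative frequencies up to floating-point rounding, with the first class at 4/50 = .08 and the last at 1.00.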
38 Example Percentage Ogive
39 Statistical Description of Data
40 Key Terms
- Measures of Central Tendency, The Center
  - Mean (µ, population; x̄, sample)
  - Weighted Mean
  - Median
  - Mode
41 Key Terms
- Measures of Dispersion, The Spread
- Range
- Mean absolute deviation
- Variance
- Standard deviation
- Interquartile range
- Interquartile deviation
- Coefficient of variation
42 Key Terms
- Measures of Relative Position
- Quantiles
- Quartiles
- Deciles
- Percentiles
- Residuals
- Standardized values
43 The Mean
- Mean
  - Arithmetic average = (sum of all values)/(number of values)
  - Population: µ = (Σxi)/N
  - Sample: x̄ = (Σxi)/n
- Problem: Calculate the average number of truck shipments from the United States to five Canadian cities for the following data, given in thousands of bags:
  - Montreal, 64.0; Ottawa, 15.0; Toronto, 285.0
  - Vancouver, 228.0; Winnipeg, 45.0
  - (Ans: 127.4)
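The answer can be reproduced with a few lines; this is only a sketch, and the dictionary layout is an illustrative choice.

```python
# Truck shipments in thousands of bags, from the problem statement.
shipments = {
    "Montreal": 64.0,
    "Ottawa": 15.0,
    "Toronto": 285.0,
    "Vancouver": 228.0,
    "Winnipeg": 45.0,
}

# Arithmetic mean: sum of all values divided by the number of values.
mean = sum(shipments.values()) / len(shipments)
print(round(mean, 1))  # 127.4
```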
44 The Weighted Mean
- When what you have is grouped data, compute the mean using µ = (Σwixi)/(Σwi)
- Problem: Calculate the average profit from truck shipments, United States to Canada, for the following data, given in thousands of bags and profits per thousand bags:

  City         Shipments (thous. bags)    Profit per thous. bags
  Montreal            64.0                      15.00
  Ottawa              15.0                      13.50
  Toronto            285.0                      15.50
  Vancouver          228.0                      12.00
  Winnipeg            45.0                      14.00

- (Ans: 14.04 per thous. bags)
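A short sketch of the weighted mean for this problem, taking the shipment volumes as the weights wi and the profits per thousand bags as the values xi. Variable names are illustrative.

```python
# Weights: shipments in thousands of bags (Montreal, Ottawa, Toronto,
# Vancouver, Winnipeg). Values: profit per thousand bags.
bags = [64.0, 15.0, 285.0, 228.0, 45.0]
profit = [15.00, 13.50, 15.50, 12.00, 14.00]

# Weighted mean: sum of w_i * x_i divided by the sum of the weights.
weighted_mean = sum(w * x for w, x in zip(bags, profit)) / sum(bags)
print(round(weighted_mean, 2))  # 14.04

# For comparison, the unweighted mean of the five profit figures is 14.0.
print(sum(profit) / len(profit))
```

The weighted mean sits slightly above the simple mean because Toronto, the largest shipment, also has the highest profit per thousand bags.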
45 The Median
- To find the median:
  - 1. Put the data in an array.
  - 2A. If the data set has an ODD number of values, the median is the middle value.
  - 2B. If the data set has an EVEN number of values, the median is the AVERAGE of the middle two values.
  - (Note that the median of an even set of data values is not necessarily a member of the set of values.)
- The median is particularly useful if there are outliers in the data set, which otherwise tend to sway the value of an arithmetic mean.
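The two cases can be sketched as follows. The function name `median` and the even-sized list are illustrative; the odd-sized list reuses the five shipment figures from the mean problem.

```python
def median(values):
    """Median by the slide's recipe: sort, then take the middle value
    (odd count) or the average of the two middle values (even count)."""
    arr = sorted(values)          # step 1: put the data in an array
    n = len(arr)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                # step 2A: odd number of values
        return arr[mid]
    return (arr[mid - 1] + arr[mid]) / 2   # step 2B: even number of values

print(median([64.0, 15.0, 285.0, 228.0, 45.0]))  # 64.0
print(median([1, 3, 5, 100]))                    # 4.0, not a member of the set
```

For the shipment data the median (64.0) is well below the mean (127.4), which illustrates how the two large values pull the mean upward.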
46 The Mode
- The mode is the most frequent value.
- While there is just one value for the mean and one value for the median, there may be more than one value for the mode of a data set.
- The mode tends to be less frequently used than the mean or the median.
47 Comparing Measures of Central Tendency
- If mean = median = mode, the shape of the distribution is symmetric.
- If mode < median < mean, or if mean > median > mode,
  - the shape of the distribution trails to the right,
  - is positively skewed.
- If mean < median < mode, or if mode > median > mean,
  - the shape of the distribution trails to the left,
  - is negatively skewed.
48 The Range
- The range is the distance between the smallest and the largest data value in the set.
- Range = largest value - smallest value
- Sometimes range is reported as an interval, anchored between the smallest and largest data value, rather than the actual width of that interval.
49 Residuals
- Residuals are the differences between each data value in the set and the group mean:
  - for a population, xi - µ
  - for a sample, xi - x̄
50 The MAD
- The mean absolute deviation is found by summing the absolute values of all residuals and dividing by the number of values in the set:
  - for a population, MAD = (Σ|xi - µ|)/N
  - for a sample, MAD = (Σ|xi - x̄|)/n
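A small sketch of residuals and the mean absolute deviation, reusing the five shipment figures from the mean problem as illustrative data.

```python
# Shipments in thousands of bags (Montreal, Ottawa, Toronto, Vancouver, Winnipeg).
x = [64.0, 15.0, 285.0, 228.0, 45.0]
n = len(x)
mean = sum(x) / n                      # 127.4

# Residuals: each data value minus the mean. They always sum to zero
# (up to floating-point rounding), which is why absolute values are used.
residuals = [xi - mean for xi in x]

# Mean absolute deviation: average of the absolute residuals.
mad = sum(abs(r) for r in residuals) / n
print(round(mad, 2))  # 103.28
```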
51 The Variance
- Variance is one of the most frequently used measures of spread,
  - for a population, σ² = Σ(xi - µ)²/N = (Σxi² - Nµ²)/N
  - for a sample, s² = Σ(xi - x̄)²/(n - 1) = (Σxi² - n x̄²)/(n - 1)
- The right side of each equation is often used as a computational shortcut.
52 The Standard Deviation
- Since variance is given in squared units, we often find uses for the standard deviation, which is the square root of variance:
  - for a population, σ = √σ²
  - for a sample, s = √s²
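The sketch below computes variance and standard deviation both ways for the five shipment figures. It assumes the usual conventions: divisor N for a population and n - 1 for a sample. Whether the five cities are treated as a population or a sample is a modelling choice, so both results are shown.

```python
import math

x = [64.0, 15.0, 285.0, 228.0, 45.0]   # shipments, thousands of bags
n = len(x)
mean = sum(x) / n

# Sum of squared residuals, from the definition and from the shortcut
# (sum of squares minus n times the squared mean).
ss = sum((xi - mean) ** 2 for xi in x)
ss_shortcut = sum(xi ** 2 for xi in x) - n * mean ** 2

pop_var = ss / n              # population variance, divisor N
samp_var = ss / (n - 1)       # sample variance, divisor n - 1 (assumed convention)

pop_sd = math.sqrt(pop_var)   # standard deviations are in the original units
samp_sd = math.sqrt(samp_var)

print(round(pop_var, 2), round(samp_var, 2))
print(round(pop_sd, 2), round(samp_sd, 2))
```

The standard library's `statistics.pvariance` and `statistics.variance` follow the same two conventions and can serve as a cross-check.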
53 Quartiles
- One of the most frequently used quantiles is the quartile.
- Quartiles divide the values of a data set into four subsets of equal size, each comprising 25% of the observations.
- To find the first, second, and third quartiles:
  - 1. Arrange the N data values into an array.
  - 2. First quartile, Q1 = data value at position (N + 1)/4
  - 3. Second quartile, Q2 = data value at position 2(N + 1)/4
  - 4. Third quartile, Q3 = data value at position 3(N + 1)/4
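A sketch of the position rule q(N + 1)/4. The slide does not say what to do when the position is not a whole number; the code below assumes linear interpolation between the two neighbouring values, which is one common convention and not necessarily the one used in the course. The function name `quartile` is illustrative.

```python
import math

def quartile(values, q):
    """q-th quartile (q = 1, 2, 3) at 1-based position q * (N + 1) / 4.

    Fractional positions are handled by linear interpolation between the
    neighbouring array values (an assumption, see the lead-in).
    """
    arr = sorted(values)                 # step 1: data array
    n = len(arr)
    pos = q * (n + 1) / 4                # 1-based position in the array
    lo = math.floor(pos)
    frac = pos - lo
    lo = min(max(lo, 1), n)              # keep positions inside the array
    hi = min(lo + 1, n)
    return arr[lo - 1] + frac * (arr[hi - 1] - arr[lo - 1])

x = [64.0, 15.0, 285.0, 228.0, 45.0]     # shipments, thousands of bags
print(quartile(x, 1), quartile(x, 2), quartile(x, 3))  # 30.0 64.0 256.5
```

With N = 5 the positions are 1.5, 3 and 4.5, so Q2 is the middle value (the median) and Q1 and Q3 fall halfway between neighbouring values.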
54 Quartiles
55 Standardized Values
- How far above or below the population mean is the individual value, in units of standard deviation?
- "How far above or below" = (data value - mean), which is the residual...
- "In units of standard deviation" = divided by σ
- Standardized individual value: z = (x - µ)/σ
- A negative z means the data value falls below the mean.
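A minimal sketch of standardized values, z = (x - µ)/σ, treating the five shipment figures as a population purely for illustration.

```python
import math

x = {"Montreal": 64.0, "Ottawa": 15.0, "Toronto": 285.0,
     "Vancouver": 228.0, "Winnipeg": 45.0}   # thousands of bags

n = len(x)
mu = sum(x.values()) / n                                     # population mean
sigma = math.sqrt(sum((v - mu) ** 2 for v in x.values()) / n)  # population SD

# Standardized value: residual divided by the standard deviation.
z = {city: (v - mu) / sigma for city, v in x.items()}

for city, score in z.items():
    side = "below" if score < 0 else "above"
    print(f"{city:<10} z = {score:+.2f} ({side} the mean)")
```

Standardized values have mean 0 and, when σ is the population standard deviation, variance 1, so they can be compared across variables measured in different units.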