Title: the statistical analysis of data
1the statistical analysis of data
- by Dr. Dang Quang A and Dr. Bui The Hong
- Hanoi Institute of Information Technology
2Preface
- Statistics is the science of collecting,
organizing and interpreting numerical and
nonnumerical facts, which we call data.
- The collection and study of data are important in
the work of many professions, so training in
the science of statistics is valuable preparation
for a variety of careers, for example economists
and financial advisors, businessmen, engineers
and farmers.
- Knowledge of probability and statistical methods
is also useful for informatics specialists in
various fields such as data mining, knowledge
discovery, neural networks, fuzzy systems and so
on.
- Whatever else it may be, statistics is, first
and foremost, a collection of tools used for
converting raw data into information to help
decision makers in their work.
- The science of data, statistics, is the
subject of this course.
3Audience and objective
- Audience
- This tutorial, an introductory course in
statistics, is intended mainly for users such as
engineers, economists and managers who need to
use statistical methods in their work, and for
students. However, it will also be useful in many
respects for computer trainers.
- Objectives
- Understanding statistical reasoning
- Mastering basic statistical methods for analyzing
data, such as descriptive and inferential methods
- Ability to use statistical methods in practice
with the help of statistical software
- Entry requirements
- High school algebra course (elements of
calculus)
- Skill in working with computers
4Contents
- Preface
- Chapter 1 Introduction
- Chapter 2 Data presentation
- Chapter 3 Data characteristics: descriptive
summary statistics
- Chapter 4 Probability: Basic concepts
- Chapter 5 Basic Probability distributions
- Chapter 6 Sampling Distributions
- Chapter 7 Estimation
- Chapter 8 General Concepts of Hypothesis Testing
- Chapter 9 Applications of Hypothesis Testing
- Chapter 10 Categorical Data Analysis and
Analysis of Variance
- Chapter 11 Simple Linear regression and
correlation
- Chapter 12 Multiple regression
- Chapter 13 Nonparametric statistics
- References
- Appendix A
- Appendix B
- Appendix C
- Appendix D
- Index
5Chapter 1 Introduction
- 1.1 What is Statistics
- Whatever else it may be, statistics is, first and
foremost, a collection of tools used for
converting raw data into information to help
decision makers in their work.
- 1.2. Populations and samples
- A population is a whole, and a sample is a
fraction of the whole.
- A population is a collection of all the elements
we are studying and about which we are trying to
draw conclusions. Such a population is often
referred to as the target population.
- A sample is a collection of some, but not all, of
the elements of the population.
- 1.3. Descriptive and inferential statistics
- Descriptive statistics is devoted to the
summarization and description of data (population
or sample).
- Inferential statistics uses sample data to make
an inference about a population.
- 1.4. Brief history of statistics
- 1.5. Computer software for statistical analysis
6Chapter 2 Data presentation
- 2.1 Introduction
- The objective of data description is to summarize
the characteristics of a data set. Ultimately, we
want to make the data set more comprehensible and
meaningful. In this chapter we will show how to
construct charts and graphs that convey the
nature of a data set. The procedure that we will
use to accomplish this objective depends on the
type of data.
- 2.2 Types of data
- Quantitative data are observations measured on a
numerical scale.
- Nonnumerical data that can only be classified
into categories are said to be qualitative
data.
- 2.3 Qualitative data presentation
- Category frequency: the number of observations
that fall in that category.
- Relative frequency: the proportion of the total
number of observations that fall in that category.
- Percentage for a category = Relative frequency
for the category x 100
- 2.4 Graphical description of qualitative data
- Bar graphs and pie charts
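The three summaries in 2.3 (category frequency, relative frequency and percentage) can be sketched in a few lines of Python. The data set and the function name `frequency_table` are made up for illustration and do not come from the course material.

```python
from collections import Counter

# Hypothetical qualitative data: blood types of 10 donors (made-up values).
observations = ["A", "B", "O", "O", "A", "AB", "O", "A", "O", "B"]


def frequency_table(data):
    """Map each category to (frequency, relative frequency, percentage)."""
    counts = Counter(data)
    total = len(data)
    return {
        category: (count, count / total, 100 * count / total)
        for category, count in counts.items()
    }


table = frequency_table(observations)
for category, (freq, rel, pct) in sorted(table.items()):
    print(f"{category:>2}  frequency={freq}  relative={rel:.2f}  percentage={pct:.0f}%")
```

The frequencies in such a table are the bar heights of a bar graph, and the percentages are the slice sizes of a pie chart.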
7Chapter 2 (continued 1)
- 2.5 Graphical description of quantitative data:
Stem and Leaf displays
- Stem and leaf display is widely used in
exploratory data analysis when the data set is
small
- Steps to follow in constructing a Stem and Leaf
Display
- Advantages and disadvantages of a stem and leaf
display
- 2.6 Tabulating quantitative data: Relative
frequency distributions
- Frequency distribution is a table that organizes
data into classes
- Class frequency: the number of observations that
fall into the class.
- Class relative frequency = Class frequency / Total
number of observations
- Relative class percentage = Class relative
frequency x 100
- 2.7 Graphical description of quantitative data:
histogram and polygon
- Frequency histogram, relative frequency
histogram and percentage histogram.
- Frequency polygon, relative frequency polygon
and percentage polygon
- 2.8 Cumulative distributions and cumulative
polygons
- 2.9 Exercises
8Chapter 3 Data characteristics descriptive
summary statistics
- 3.1 Introduction
- 3.2 Types of numerical descriptive measures
- 3.3 Measures of central tendency
- 3.4 Measures of data variation
- 3.5 Measures of relative standing
- 3.6 Shape
- 3.7 Methods for detecting outliers
- 3.8 Calculating some statistics from grouped data
- 3.9 Computing descriptive summary statistics using
computer software
- 3.10 Exercises
9Chapter 3 (continued 1)
- 3.2 Types of numerical descriptive measures
- Location, Dispersion, Relative standing and
Shape
- 3.3 Measures of location (or central
tendency)
- 3.3.1 Mean
- 3.3.2 Median
- 3.3.3 Mode
- 3.3.4 Geometric mean
- 3.4 Measures of data variation
- 3.4.1 Range
- 3.4.2 Variance and standard deviation
- Uses of the standard deviation: Chebyshev's
Theorem
- The Empirical Rule
- 3.4.3 Relative dispersion: The coefficient of
variation
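As a rough illustration of the measures listed in 3.3 and 3.4, the sketch below computes them with Python's standard `statistics` module. The data values are made up, and the variance shown is the sample variance (divisor n - 1), which is an assumption since the outline does not say which version the course uses.

```python
import statistics

# Hypothetical quantitative data set (made-up values).
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of location (central tendency)
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
geometric_mean = statistics.geometric_mean(data)

# Measures of variation
data_range = max(data) - min(data)
variance = statistics.variance(data)   # sample variance, divisor n - 1 (assumed)
std_dev = statistics.stdev(data)       # square root of the sample variance

# Relative dispersion: coefficient of variation, expressed as a percentage
coefficient_of_variation = 100 * std_dev / mean

print(mean, median, mode, data_range)
print(round(variance, 3), round(std_dev, 3), round(coefficient_of_variation, 1))
```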
10Chapter 3 (continued 2)
- 3.5 Measures of relative standing: Descriptive
measures that locate the relative position of an
observation in relation to the other observations
are called measures of relative standing
- The pth percentile is a number such that p%
of the observations of the data set fall below it
and (100-p)% of the observations fall above it.
- Lower quartile: 25th percentile. Mid-quartile:
50th percentile.
- Upper quartile: 75th percentile. Interquartile
range, z-score
- 3.6 Shape
- 3.6.1 Skewness
- 3.6.2 Kurtosis
- 3.7 Methods for detecting outliers
- 3.8 Calculating some statistics from grouped data
11Chapter 4. Probability Basic concepts
- 4.1 Experiment, Events and Probability of an
Event
- 4.2 Approaches to probability
- 4.3 The field of events
- 4.4 Definitions of probability
- 4.5 Conditional probability and independence
- 4.6 Rules for calculating probability
- 4.7 Exercises
12Chapter 4 (continued 1)
- 4.1 Experiment, Events and Probability of an
Event
- The process of making an observation or recording
a measurement under a given set of conditions is
a trial or experiment.
- Outcomes of an experiment are called events.
- We denote events by capital letters A, B, C, ...
- The probability of an event A, denoted by P(A),
in general, is the chance A will happen.
- 4.2 Approaches to probability
- Definitions of probability as a quantitative
measure of the degree of certainty of the
observer of the experiment.
- Definitions that reduce the concept of
probability to the more primitive notion of
equal likelihood (the so-called classical
definition).
- Definitions that take as their point of
departure the relative frequency of occurrence
of the event in a large number of trials
(statistical definition).
13Chapter 4 (continued 2)
- 4.3 The field of events
- Definitions and relations between events: A
implies B, A and B are equivalent ($A = B$), product
or intersection of the events A and B ($A \cap B$), sum
or union of A and B ($A \cup B$), difference of A and B
($A - B$ or $A \setminus B$), certain (or sure) event, impossible
event, complement of A, mutually exclusive
events, simple (or elementary) events, sample space.
- Venn diagrams
- Field of events
- 4.4 Definitions of probability
- 4.4.1 The classical definition of probability
- 4.4.2 The statistical definition of probability
- 4.4.3 Axiomatic construction of the theory of
probability (optional)
- 4.5 Conditional probability and independence
- Definition, formula, multiplicative theorem,
independent and dependent events
14Chapter 4 (continued 3)
- 4.5 Conditional probability and independence
- 4.6 Rules for calculating probability
- 4.6.1 The addition rule
- for pairwise mutually exclusive events
- $P(A_1 \cup A_2 \cup \dots \cup A_n) = P(A_1) + P(A_2) + \dots + P(A_n)$
- for two nonmutually exclusive events A and B
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
- 4.6.2 Multiplicative rule
- $P(A \cap B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B)$
- 4.6.3 Formula of total probability
- $P(B) = P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2) + \dots + P(A_n)P(B \mid A_n)$
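A small numerical sketch may help with the formula of total probability. The scenario (three machines with made-up production shares and defect rates) and the function name `total_probability` are assumptions for illustration only.

```python
def total_probability(priors, conditionals):
    """Return P(B) = sum of P(A_i) * P(B | A_i) over a partition A_1, ..., A_n."""
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("the events A_i must partition the sample space")
    return sum(p_a * p_b_given_a for p_a, p_b_given_a in zip(priors, conditionals))


# Hypothetical example: three machines produce 50%, 30% and 20% of all items,
# with defect rates of 1%, 2% and 3% respectively.
p_machine = [0.5, 0.3, 0.2]
p_defect_given_machine = [0.01, 0.02, 0.03]

p_defect = total_probability(p_machine, p_defect_given_machine)

# Multiplicative rule for one term: P(A_1 and B) = P(A_1) * P(B | A_1)
p_machine1_and_defect = p_machine[0] * p_defect_given_machine[0]

print(round(p_defect, 4), round(p_machine1_and_defect, 4))
```

Each term of the sum is an application of the multiplicative rule, and the terms can be added because the events "item from machine i and defective" are mutually exclusive.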
15Chapter 5 Basic Probability distributions
- 5.1 Random variables
- 5.2 The probability distribution for a discrete
random variable
- 5.3 Numerical characteristics of a discrete
random variable
- 5.4 The binomial probability distribution
- 5.5 The Poisson distribution
- 5.6 Continuous random variables: distribution
function and density function
- 5.7 Numerical characteristics of a continuous
random variable
- 5.8 The normal distribution
- 5.9 Exercises
16Chapter 5 (continued 1)
- 5.1 Random variables
- A random variable is a variable that assumes
numerical values associated with events of an
experiment.
- Classification of random variables: discrete
random variables and continuous random variables
- 5.2 The probability distribution for a discrete
random variable
- The probability distribution for a discrete
random variable x is a table, graph, or formula
that gives the probability of observing each
value of x.
- Properties of the probability distribution
17Chapter 5 (continued 2)
- 5.3 Numerical characteristics of a discrete
random variable
- 5.3.1 Mean or expected value: $\mu = E(X) = \sum x\,p(x)$
- 5.3.2 Variance and standard deviation: $\sigma^2 = E[(X - \mu)^2]$
- 5.4 The binomial probability distribution
- Model (or characteristics) of a binomial random
variable
- The probability distribution
- Mean and variance for a binomial random variable
- 5.5 The Poisson distribution
- Model (or characteristics) of a Poisson random
variable
- The probability distribution
- Mean and variance for a Poisson random variable
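The definitions in 5.3 can be checked numerically for the binomial distribution. The sketch below uses the standard binomial probability formula, which the outline names but does not write out. The values n = 10 and p = 0.3 are made up. It computes the mean and variance directly from the definitions and compares them with the usual closed forms np and np(1 - p).

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.3  # hypothetical parameters
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance from the definitions: sum of x*p(x) and E[(X - mu)^2]
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(round(sum(pmf), 6))                          # probabilities sum to 1
print(round(mean, 6), n * p)                       # compare with n*p
print(round(variance, 6), round(n * p * (1 - p), 6))  # compare with n*p*(1-p)
```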
18Chapter 5 (continued 3)
- 5.6 Continuous random variables: distribution
function and density function
- Cumulative distribution function: $F(x) = P(X < x)$
- Probability density function: $f(x) = F'(x)$
- 5.7 Numerical characteristics of a continuous
random variable
- Mean or expected value: $\mu = E(X) = \int x\,f(x)\,dx$
- Variance and standard deviation
- 5.8 The normal distribution
- The density function, mean and variance for a
normal random variable
- $\sigma$, $2\sigma$ and $3\sigma$ rules
- The normal distribution as an approximation to
the binomial probability distribution
19Chapter 6 Sampling Distributions
- 6.1 Why the method of sampling is important
- 6.2 Obtaining a Random Sample
- 6.3 Sampling Distribution
- 6.4 The sampling distribution of sample mean:
the Central Limit Theorem
- 6.5 Summary
- 6.6 Exercises
20Chapter 6 (continued 1)
- 6.1 Why the method of sampling is important
- Two samples from the same population can provide
contradictory information about the population
- Random sampling eliminates the possibility of
bias in selecting a sample and, in addition,
provides a probabilistic basis for evaluating the
reliability of an inference
- 6.2 Obtaining a Random Sample
- A random sample of n experimental units is one
selected in such a way that every different
sample of size n has an equal probability of
selection
- Procedures for generating a random sample
21Chapter 6 (continued 2)
- 6.3 Sampling Distribution
- A numerical descriptive measure of a population
is called a parameter. A quantity computed from
the observations in a random sample is called a
statistic.
- A sampling distribution of a sample statistic
(based on n observations) is the relative
frequency distribution of the values of the
statistic theoretically generated by taking
repeated random samples of size n and computing
the value of the statistic for each sample.
- Examples of computer-generated random samples
- 6.4 The sampling distribution of sample mean:
the Central Limit Theorem
- If the sample size is sufficiently large, the mean of a
random sample from a population has a sampling
distribution that is approximately normal,
regardless of the shape of the relative frequency
distribution of the target population
- Mean and standard deviation of the sampling
distribution
- 6.5 Summary
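The idea of a sampling distribution can be illustrated by simulation. The following sketch draws repeated random samples from a strongly skewed population and looks at the resulting sample means. The population (exponential with mean 1 and standard deviation 1), the sample size and the number of repetitions are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable


def sample_means(n, repetitions):
    """Draw `repetitions` random samples of size n and return their sample means."""
    means = []
    for _ in range(repetitions):
        sample = [random.expovariate(1.0) for _ in range(n)]  # skewed population
        means.append(statistics.fmean(sample))
    return means


n = 50
means = sample_means(n, repetitions=2000)

mean_of_means = statistics.fmean(means)   # should be close to the population mean, 1
sd_of_means = statistics.stdev(means)     # should be close to sigma / sqrt(n)
expected_sd = 1 / n ** 0.5

print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_sd, 3))
```

A histogram of `means` would look roughly bell-shaped even though the population itself is far from normal, which is what the Central Limit Theorem describes.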
22Chapter 7. Estimation
- 7.1 Introduction
- 7.2 Estimation of a population mean: Large-sample
case
- 7.3 Estimation of a population mean: Small-sample
case
- 7.4 Estimation of a population proportion
- 7.5 Estimation of the difference between two
population means: Independent samples
- 7.6 Estimation of the difference between two
population means: Matched pairs
- 7.7 Estimation of the difference between two
population proportions
- 7.8 Choosing the sample size
- 7.9 Estimation of a population variance
- 7.10 Summary
23Chapter 7 (continued 1)
- 7.2 Estimation of a population mean: Large-sample
case
- Point estimate for a population mean $\mu$
- Large-sample $(1-\alpha)100\%$ confidence interval for a
population mean (use the fact that for a
sufficiently large sample size, $n > 30$, the sampling
distribution of the sample mean, $\bar{x}$, is
approximately normal).
- 7.3 Estimation of a population mean: Small-sample
case ($n < 30$)
- Problems arising for small sample sizes and the
assumption that the population has an approximately
normal distribution.
- $(1-\alpha)100\%$ confidence interval using the
t-distribution.
- 7.4 Estimation of a population proportion
- For sufficiently large samples, the sampling
distribution of the proportion p-hat is
approximately normal.
- Large-sample $(1-\alpha)100\%$ confidence interval for a
population proportion
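A minimal sketch of the large-sample confidence interval for a population mean follows. It assumes the usual form, sample mean plus or minus z times s divided by the square root of n, with the sample standard deviation s standing in for the population value. The data are made up and the function name `large_sample_ci` is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def large_sample_ci(sample, confidence=0.95):
    """Large-sample confidence interval for the mean: x_bar +/- z * s / sqrt(n)."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)  # sample standard deviation used in place of sigma
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95% confidence
    half_width = z * s / n ** 0.5
    return x_bar - half_width, x_bar + half_width


# Hypothetical sample of n = 35 measurements (n > 30, so the large-sample case applies).
sample = [50 + (i % 7) for i in range(35)]

lower, upper = large_sample_ci(sample, confidence=0.95)
print(round(lower, 2), round(upper, 2))
```

For the small-sample case the same structure applies, but the z value would be replaced by a value from the t-distribution, which the standard library does not provide.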
24Chapter 7 (continued 2)
- 7.5 Estimation of the difference between two
population means: Independent samples
- For sufficiently large sample sizes ($n_1$ and $n_2 > 30$),
the sampling distribution of $\bar{x}_1 - \bar{x}_2$,
based on independent random samples from two
populations, is approximately normal
- Small sample sizes under some assumptions on the
populations
- 7.6 Estimation of the difference between two
population means: Matched pairs
- Assumption: the population of paired differences
is normally distributed. Procedure.
- 7.7 Estimation of the difference between two
population proportions
- For sufficiently large sample sizes ($n_1$ and $n_2 > 30$),
the sampling distribution of $\hat{p}_1 - \hat{p}_2$,
based on independent random samples from two
populations, is approximately normal
- $(1-\alpha)100\%$ confidence interval for $p_1 - p_2$
25Chapter 8. General Concepts of Hypothesis Testing
- 8.1 Introduction
- The procedures to be discussed are useful in
situations where we are interested in making a
decision about a parameter value rather than
obtaining an estimate of its value
- 8.2 Formulation of Hypotheses
- A null hypothesis H0 is the hypothesis against
which we hope to gather evidence. The hypothesis
for which we wish to gather supporting evidence
is called the alternative hypothesis Ha
- One-tailed (directional) test and two-tailed test
- 8.3 Conclusions and Consequences for a Hypothesis
Test
- The goal of any hypothesis test is to make a
decision, based on sample information, whether to
reject H0 in favor of Ha. In doing so we may make
one of two types of error.
- A Type I error occurs if we reject H0 when it is
true. The probability of committing a Type I
error is denoted by $\alpha$ (also called the significance
level)
- A Type II error occurs if we do not reject H0
when it is false. The probability of committing a
Type II error is denoted by $\beta$.
26Chapter 8 (continued 1)
- 8.4 Test statistics and rejection regions
- The test statistic is a sample statistic upon
which the decision concerning the null and
alternative hypotheses is based.
- The rejection region is the set of possible
values of the test statistic for which the null
hypothesis will be rejected.
- Steps for testing a hypothesis
- Critical value: boundary value of the rejection
region
- 8.5 Summary
- 8.6 Exercises
27Chapter 9. Applications of Hypothesis Testing
- 9.1 Diagnosing a hypothesis test
- 9.2 Hypothesis test about a population mean
- 9.3 Hypothesis test about a population
proportion
- 9.4 Hypothesis tests about the difference between
two population means
- 9.5 Hypothesis tests about the difference between
two proportions
- 9.6 Hypothesis test about a population variance
- 9.7 Hypothesis test about the ratio of two
population variances
- 9.8 Summary
- 9.9 Exercises
28Chapter 9 (continued 1)
- 9.2 Hypothesis test about a population mean
- Large-sample test ($n > 30$)
- The sampling distribution of $\bar{x}$ is approximately
normal and s is a good approximation of $\sigma$.
- Procedure for large-sample test
- Small-sample test
- Assumption: the population has an approximately normal
distribution.
- Procedure for small-sample test (using the
t-distribution)
- 9.3 Hypothesis test about a population proportion
- Large-sample test
- 9.4 Hypothesis tests about the difference between
two population means
- Large-sample test
- Assumptions: $n_1 > 30$, $n_2 > 30$; samples are selected
randomly and independently from the populations
- Small-sample test
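The large-sample test about a population mean in 9.2 can be sketched as follows. This is an illustration under stated assumptions, not the course's own procedure: a two-tailed test, s used in place of the population standard deviation, made-up data, and a hypothetical function name `large_sample_z_test`.

```python
from statistics import NormalDist, mean, stdev


def large_sample_z_test(sample, mu0, alpha=0.05):
    """Two-tailed large-sample test of H0: mu = mu0 against Ha: mu != mu0."""
    n = len(sample)
    standard_error = stdev(sample) / n ** 0.5    # s approximates sigma for n > 30
    z = (mean(sample) - mu0) / standard_error    # test statistic
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # boundary of the rejection region
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    reject_h0 = abs(z) > critical
    return z, critical, p_value, reject_h0


# Hypothetical sample of n = 35 measurements; its sample mean is 53.
sample = [50 + (i % 7) for i in range(35)]

z_a, crit_a, p_a, reject_a = large_sample_z_test(sample, mu0=53)
z_b, crit_b, p_b, reject_b = large_sample_z_test(sample, mu0=50)

print(round(z_a, 2), reject_a)  # hypothesized mean equals the sample mean
print(round(z_b, 2), reject_b)  # hypothesized mean far from the sample mean
```

Rejecting H0 when the absolute value of z exceeds the critical value keeps the probability of a Type I error at the chosen significance level.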
29Chapter 9 (continued 2)
- 9.5 Hypothesis tests about the difference between
two proportions
- Assumptions, Procedure
- 9.6 Hypothesis test about a population variance
- Assumption: the population has an approximately normal
distribution.
- Procedure using the chi-square distribution
- 9.7 Hypothesis test about the ratio of two
population variances (optional)
- Assumptions: populations have approximately normal
distributions; random samples are independent.
- Procedure using the F-distribution
30Chapter 10. Categorical Data Analysis and
Analysis of Variance
- 10.1 Introduction
- 10.2 Tests of goodness of fit
- 10.3 The analysis of contingency tables
- 10.4 Contingency tables in statistical software
packages
- 10.5 Introduction to analysis of variance
- 10.6 Design of experiments
- 10.7 Completely randomized designs
- 10.8 Randomized block designs
- 10.9 Multiple comparisons of means and confidence
regions
- 10.10 Summary
- 10.11 Exercises
31Chapter 10 (continued 1)
- 10.1 Introduction
- 10.2 Tests of goodness-of-fit
- Purpose: to test for a dependence on a
qualitative variable that allows for more than two
categories for a response. Namely, it tests whether
there is a significant difference between an observed
frequency distribution and a theoretical
frequency distribution.
- Procedure for a chi-square goodness-of-fit test
- 10.3 The analysis of contingency tables
- Purpose: to determine whether a dependence exists
between two qualitative variables
- Procedure for a chi-square test for independence
of two directions of classification
- 10.4 Contingency tables in statistical software
packages
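A chi-square goodness-of-fit test compares observed and expected category counts. The sketch below computes the usual test statistic, the sum of (observed - expected)^2 / expected, for a made-up die-rolling experiment. The critical value 11.070 is the standard table value for 5 degrees of freedom at significance level 0.05. The data and function name are assumptions for illustration.

```python
def chi_square_statistic(observed, expected_proportions):
    """Sum over categories of (observed - expected)^2 / expected."""
    total = sum(observed)
    expected = [total * p for p in expected_proportions]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: outcomes of 60 rolls of a die, tested against a fair die.
observed = [8, 9, 12, 11, 6, 14]
fair_die = [1 / 6] * 6

chi2 = chi_square_statistic(observed, fair_die)

# Table value for k - 1 = 5 degrees of freedom at alpha = 0.05.
CRITICAL_VALUE_DF5_ALPHA05 = 11.070
reject_h0 = chi2 > CRITICAL_VALUE_DF5_ALPHA05

print(round(chi2, 3), reject_h0)
```

The same statistic is used for contingency tables, with the expected counts computed from the row and column totals instead of from hypothesized proportions.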
32Chapter 10 (continued 2)
- 10.5 Introduction to analysis of variance
- Purpose: comparison of more than two means
- 10.6 Design of experiments
- Concepts of experiment, design of the experiment,
response variable, factor, treatment
- Concepts of between-sample variation and
within-sample variation
- 10.7 Completely randomized designs
- This design involves a comparison of the means of
k treatments, based on independent random samples
of $n_1, n_2, \dots, n_k$ observations drawn from
the populations.
- Assumptions: all k populations are normal and have
equal variances
- F-test for comparing k population means
- 10.8 Randomized block designs
- Concept of randomized block design
- Tests to compare k treatment and b block means
- 10.9 Multiple comparisons of means and confidence
regions
33Chapter 11. Simple Linear regression and
correlation
- 11.1 Introduction: Bivariate relationships
- 11.2 Simple Linear regression: Assumptions
- 11.3 Estimating A and B: the method of least
squares
- 11.4 Estimating $\sigma^2$
- 11.5 Making inferences about the slope, B
- 11.6. Correlation analysis
- 11.7 Using the model for estimation and
prediction
- 11.8. Simple Linear Regression: An Overview
Example
- 11.9 Exercises
34Chapter 11 (continued 1)
- 11.1 Introduction Bivariate relationships
- The subject is to determine the relationship between
two variables.
- Types of relationships: direct and inverse
- Scattergram
- 11.2 Simple Linear regression: Assumptions
- A simple linear regression model: $y = A + Bx + e$
- Assumptions required for a linear regression
model: $E(e) = 0$, e is normal, $\sigma^2$ is equal to a
constant for all values of x.
- 11.3 Estimating A and B: the method of least
squares
- The least squares estimators a and b; formulas
for a and b
- 11.4 Estimating $\sigma^2$
- Formula for $s^2$, an estimator for $\sigma^2$
- Interpretation of s, the estimated standard
deviation of e
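The least squares estimates a and b and the estimator s^2 can be sketched in a few lines. The formulas below are the usual textbook ones, stated here as an assumption since the outline does not write them out: b = SSxy / SSxx, a = y_bar - b * x_bar and s^2 = SSE / (n - 2). The data are made up.

```python
def least_squares(x, y):
    """Return (a, b, s2): intercept, slope and estimated error variance."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = ss_xy / ss_xx            # slope estimate
    a = y_bar - b * x_bar        # intercept estimate
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)           # estimator of sigma^2 (n - 2 degrees of freedom)
    return a, b, s2


# Hypothetical data (made-up values).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

a, b, s2 = least_squares(x, y)
s = s2 ** 0.5  # estimated standard deviation of the random error e

print(round(a, 3), round(b, 3), round(s2, 3), round(s, 3))
```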
35Chapter 11 (continued 2)
- 11.5 Making inferences about the slope, B
- Problem of making an inference about the
population regression line $E(y) = A + Bx$ based on the
sample regression line $\hat{y} = a + bx$
- Sampling distribution of the least squares
estimator of the slope, b
- Test of the utility of the model: $H_0: B = 0$
against $H_a: B \neq 0$ (or $B > 0$, $B < 0$)
- A $(1-\alpha)100\%$ confidence interval for B
- 11.6. Correlation analysis
- Correlation analysis is the statistical tool for
describing the degree to which one variable is
linearly related to another.
- The coefficient of correlation r is a measure of
the strength of the linear relationship between
two variables
- The coefficient of determination
- 11.7 Using the model for estimation and
prediction
- A $(1-\alpha)100\%$ confidence interval for the mean
value of y for $x = x_p$
- A $(1-\alpha)100\%$ confidence interval for an
individual y for $x = x_p$
- 11.8. Simple Linear Regression: An Example
36Chapter 12. Multiple regression
- 12.1. Introduction: the general linear model
- 12.2 Model assumptions
- 12.3 Fitting the model: the method of least
squares
- 12.4 Estimating $\sigma^2$
- 12.5 Estimating and testing hypotheses about the
B parameters
- 12.6. Checking the utility of a model
- 12.7. Using the model for estimating and
prediction
- 12.8 Multiple linear regression: An overview
example
- 12.8. Model building: interaction models
- 12.9. Model building: quadratic models
- 12.10 Exercises
37Chapter 12 (continued 1)
- 12.1. Introduction the general linear model
- $y = B_0 + B_1 x_1 + \dots + B_k x_k + e$, where y is the
dependent variable, $x_1, x_2, \dots, x_k$ are the independent
variables and e is the random error.
- 12.2 Model assumptions
- For any given set of values $x_1, x_2, \dots, x_k$, the
random error e has a normal probability
distribution with mean equal to 0 and variance
equal to $\sigma^2$.
- The random errors are independent.
- 12.3 Fitting the model: the method of least
squares
- Least squares prediction equation: $\hat{y} = b_0 + b_1 x_1 + \dots + b_k x_k$
- 12.4 Estimating $\sigma^2$
- 12.5 Estimating and testing hypotheses about the
B parameters
- Sampling distributions of $b_0, b_1, \dots, b_k$
- A $(1-\alpha)100\%$ confidence interval for $B_i$ ($i = 0, 1, \dots, k$)
- Test of an individual parameter coefficient $B_i$
38Chapter 12 (continued 2)
- 12.6. Checking the utility of a model
- Finding a measure of how well a linear model
fits a set of data: the multiple coefficient of
determination
- Testing the overall utility of the model
- 12.7. Using the model for estimating and
prediction
- A $(1-\alpha)100\%$ confidence interval for the mean
value of y for a given x
- A $(1-\alpha)100\%$ confidence interval for an
individual y for a given x
- 12.8 Multiple linear regression: An overview
example
- 12.8. Model building: interaction models
- Interaction model with two independent variables:
$E(y) = B_0 + B_1 x_1 + B_2 x_2 + B_3 x_1 x_2$
- Procedure to build an interaction model
- 12.9. Model building: quadratic models
- Quadratic model in a single variable:
$E(y) = B_0 + B_1 x + B_2 x^2$
- Procedure to build a quadratic model
39Chapter 13. Nonparametric statistics
- 13.1. Introduction
- Situations where t and F tests are unsuitable
- What do nonparametric methods use?
- 13.2. The sign test for a single population
- Purpose: to test hypotheses about the median of any
population
- Procedure for the sign test for a population
median
- Sign test based on a large sample ($n > 10$)
- 13.3 Comparing two populations based on
independent random samples: Wilcoxon rank sum
test
- A nonparametric test about the difference between
two populations is a test to detect whether
distribution 1 is shifted to the right of
distribution 2 or vice versa.
- Wilcoxon rank sum test for a shift in population
locations
- The case of large samples ($n_1 \geq 10$, $n_2 \geq 10$)
40Chapter 13 (continued 1)
- 13.4. Comparing two populations based on matched
pairs
- Wilcoxon signed ranks test for a shift in
population locations
- Wilcoxon signed ranks test for large samples
($n \geq 25$)
- 13.5. Comparing populations using a completely
randomized design: The Kruskal-Wallis H test
- The Kruskal-Wallis H test is the nonparametric
equivalent of the ANOVA F test when the assumptions
that populations are normally distributed with
common variance are not satisfied.
- The Kruskal-Wallis H test for comparing k
population probability distributions
- 13.6 Rank Correlation: Spearman's $r_s$ statistic
- A statistic developed to measure and to test
for correlation between two random variables.
- Formula for computing Spearman's rank correlation
coefficient $r_s$
- Spearman's nonparametric test for rank
correlation
- 13.7 Exercises
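Spearman's rank correlation coefficient can be sketched with the common shortcut formula r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between the two ranks of each observation. This shortcut is an assumption here, since the outline does not give the formula, and it is exact only when there are no ties. The data and function names are made up for illustration.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rs(x, y):
    """Spearman's rank correlation via the shortcut formula (no ties)."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n**2 - 1))


# Hypothetical paired observations (made-up values, no ties).
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]

rs = spearman_rs(x, y)
print(round(rs, 3))
```

Because only the ranks enter the calculation, the statistic does not depend on the populations being normally distributed.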