Title: Review of Top 10 Concepts in Statistics
1Review of Top 10 Concepts in Statistics
- NOTE: This PowerPoint file is not an introduction, but rather a checklist of topics to review
2Top Ten 1
3Measures of Central Location
4Mean
- Population mean: µ = Σx/N = (5+1+6)/3 = 12/3 = 4
- Algebra: Σx = Nµ = (3)(4) = 12
- Sample mean: x-bar = Σx/n
- Example: the number of hours spent on the Internet: 4, 8, and 9
- x-bar = (4+8+9)/3 = 7 hours
- Do NOT use if the number of observations is small or there are extreme values
- Ex: Do NOT use if 3 houses were sold this week, and one was a mansion
5Median
- Median = middle value
- Example: 5, 1, 6
- Step 1: Sort data: 1, 5, 6
- Step 2: Middle value = 5
- When there is an even number of observations, the median is computed by averaging the two observations in the middle.
- OK even if there are extreme values
- Home sales: 100K, 200K, 900K, so mean = 400K, but median = 200K
6Mode
- Mode = most frequent value
- Ex: female, male, female
- Mode = female
- Ex: 1, 1, 2, 3, 5, 8
- Mode = 1
- It may not be a very good measure; see the following example
7Measures of Central Location - Example
- Sample: 0, 0, 5, 7, 8, 9, 12, 14, 22, 23
- Sample mean: x-bar = Σx/n = 100/10 = 10
- Median = (8+9)/2 = 8.5
- Mode = 0
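The three measures for this sample can be checked with a short sketch. It uses Python's standard `statistics` module; the variable names are illustrative only.

```python
import statistics

# Sample from the slide above
sample = [0, 0, 5, 7, 8, 9, 12, 14, 22, 23]

mean = statistics.mean(sample)      # sum of values / number of values
median = statistics.median(sample)  # even n, so the two middle values are averaged
mode = statistics.mode(sample)      # most frequent value

print("mean:", mean, "median:", median, "mode:", mode)
```

Note how the mode (0) sits far from the mean and median here, which is why it may not be a very good measure of central location.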
8Relationship
- Case 1: if the probability distribution is symmetric (ex. bell-shaped, normal distribution),
- Mean = Median = Mode
- Case 2: if the distribution is positively skewed (skewed to the right) (ex. incomes of employees in a large firm: a large number of relatively low-paid workers and a small number of high-paid executives),
- Mode < Median < Mean
9Relationship contd
- Case 3: if the distribution is negatively skewed (skewed to the left) (ex. the time taken by students to write exams: few students hand in their exams early and the majority of students turn in their exams at the end of the exam),
- Mean < Median < Mode
10Dispersion Measures of Variability
- How much spread of data
- How much uncertainty
- Measures
- Range
- Variance
- Standard deviation
11Range
- Range = Max - Min ≥ 0
- But range is affected by unusual values
- Ex: Santa Monica has a high of 105 degrees and a low of 30 once a century, but the range would be 105 - 30 = 75
12Standard Deviation (SD)
- Better than range because all data are used
- Population SD = square root of variance = sigma (σ)
- SD ≥ 0
13Empirical Rule
- Applies to mound-shaped or bell-shaped curves
- Ex: normal distribution
- 68% of data within one SD of mean
- 95% of data within two SD of mean
- 99.7% of data within three SD of mean
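As a rough check of these percentages, the sketch below computes the area under the standard normal curve within k standard deviations of the mean. It relies on `statistics.NormalDist` from the standard library; the helper name `share_within` is an illustrative assumption.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k SDs of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, "SD:", round(share_within(k), 4))
```

The exact normal areas are slightly different from the rounded 68 / 95 / 99.7 figures, which is why the rule is called "empirical".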
14Standard Deviation = Square Root of Variance
15Sample Standard Deviation
16Standard Deviation
- Total variation = 34
- Sample variance = 34/4 = 8.5
- Sample standard deviation = square root of 8.5 = 2.9
17Measures of Variability - Example
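- The hourly wages earned by a sample of five students are 7, 5, 11, 8, and 6
- Range = 11 - 5 = 6
- Variance: s2 = Σ(x - x-bar)2/(n - 1) = 21.2/4 = 5.3, where x-bar = 37/5 = 7.4
- Standard deviation: s = square root of 5.3 = 2.30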
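The range, sample variance, and sample standard deviation for the wage data can be computed with a minimal sketch like the following. It assumes the sample (n - 1) formulas, which is what `statistics.variance` and `statistics.stdev` use.

```python
import statistics

wages = [7, 5, 11, 8, 6]  # hourly wages from the slide above

value_range = max(wages) - min(wages)
sample_variance = statistics.variance(wages)  # divides by n - 1
sample_sd = statistics.stdev(wages)           # square root of the sample variance

print(value_range, sample_variance, round(sample_sd, 2))
```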
18Coefficient of Variation (CV)
- CV = Standard Deviation/Mean
- Relative measure of spread
- Example
- Country 1: CV = 100/1000 = 0.10
- Country 2: CV = 200/4000 = 0.05
- Although Country 2 has the higher standard deviation, Country 1 has the higher relative spread
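The two-country comparison can be written as a small sketch; the function name `coefficient_of_variation` is an illustrative choice, not part of the original slides.

```python
def coefficient_of_variation(sd, mean):
    """Relative spread: standard deviation divided by the mean."""
    return sd / mean

cv_country_1 = coefficient_of_variation(100, 1000)
cv_country_2 = coefficient_of_variation(200, 4000)

# Country 2 has the larger SD, but Country 1 has the larger relative spread.
print(cv_country_1, cv_country_2, cv_country_1 > cv_country_2)
```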
19Graphical Tools
- Line chart: trend over time
- Scatter diagram: relationship between two variables
- Bar chart: frequency for each category
- Histogram: frequency for each class of measured data (graph of frequency distribution)
- Box plot: graphical display based on quartiles, which divide data into 4 parts
20Top Ten 2
21H0: Null Hypothesis
- Population mean = µ
- Population proportion = p
- A statement about the value of a population parameter
- Never include a sample statistic (such as x-bar) in a hypothesis
22HA or H1: Alternative Hypothesis
- ONE-TAIL ALTERNATIVE
- Right tail: µ > number (smog check)
- p > fraction (defectives)
- Left tail: µ < number (weight in box of crackers)
- p < fraction (unpopular President's approval low)
23One-Tailed Tests
- A test is one-tailed when the alternate hypothesis, H1 or HA, states a direction, such as
- H1: The mean yearly salary earned by full-time employees is more than 45,000. (µ > 45,000)
- H1: The average speed of cars traveling on the freeway is less than 75 miles per hour. (µ < 75)
- H1: Less than 20 percent of the customers pay cash for their gasoline purchase. (p < 0.2)
24Two-Tail Alternative
- Population mean not equal to number (too hot or too cold)
- Population proportion not equal to fraction (% alcohol too weak or too strong)
25Two-Tailed Tests
- A test is two-tailed when no direction is specified in the alternate hypothesis
- H1: The mean amount of time spent on the Internet is not equal to 5 hours. (µ ≠ 5)
- H1: The mean price for a gallon of gasoline is not equal to 2.54. (µ ≠ 2.54)
26Reject Null Hypothesis (H0) If
- Absolute value of test statistic > critical value
- Reject H0 if |Z value| > critical Z
- Reject H0 if |t value| > critical t
- Reject H0 if p-value < significance level (alpha)
- Note that direction of inequality is reversed!
- Reject H0 if very large difference between sample statistic and population parameter in H0
Test statistic: A value, determined from sample information, used to determine whether or not to reject the null hypothesis.
Critical value: The dividing point between the region where the null hypothesis is rejected and the region where it is not rejected.
27Example Smog Check
- H0: µ = 80
- HA: µ > 80
- If test statistic = 2.2 and critical value = 1.96, reject H0, and conclude that the population mean is likely > 80
- If test statistic = 1.6 and critical value = 1.96, do not reject H0, and reserve judgment about H0
28Type I vs. Type II Error
- Alpha (α) = P(type I error) = significance level = probability that you reject a true null hypothesis
- Beta (β) = P(type II error) = probability that you do not reject the null hypothesis, given H0 is false
- Ex: H0: Defendant innocent
- α = P(jury convicts innocent person)
- β = P(jury acquits guilty person)
29Type I vs. Type II Error
30Example Smog Check
- H0: µ = 80
- HA: µ > 80
- If p-value = 0.01 and alpha = 0.05, reject H0, and conclude that the population mean is likely > 80
- If p-value = 0.07 and alpha = 0.05, do not reject H0, and reserve judgment about H0
31Test Statistic
- When testing for the population mean from a large sample and the population standard deviation is known, the test statistic is given by z = (x-bar - µ)/(σ/square root of n)
32Example
- The processors of Best Mayo indicate on the label that the bottle contains 16 ounces of mayo. The standard deviation of the process is 0.5 ounces. A sample of 36 bottles from the last hour's production showed a mean weight of 16.12 ounces per bottle. At the .05 significance level, can we conclude that the mean amount per bottle is greater than 16 ounces?
33Example contd
- 1. State the null and the alternative hypotheses
- H0: µ = 16, H1: µ > 16
- 2. Select the level of significance. In this case, we selected the .05 significance level.
- 3. Identify the test statistic. Because we know the population standard deviation, the test statistic is z.
- 4. State the decision rule.
- Reject H0 if z > 1.645 (= z0.05)
34Example contd
- 5. Compute the value of the test statistic
- z = (x-bar - µ)/(σ/square root of n) = (16.12 - 16)/(0.5/square root of 36) = 1.44
- 6. Conclusion: Do not reject the null hypothesis. We cannot conclude the mean is greater than 16 ounces.
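The six steps of the Best Mayo example can be sketched in a few lines. This is only an illustration under the slide's assumptions (known population standard deviation, right-tailed z test); the variable names are hypothetical, and the critical value is taken from `statistics.NormalDist` rather than a printed Z table.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 16.0    # value of the population mean in H0
sigma = 0.5    # known population standard deviation
n = 36         # sample size
x_bar = 16.12  # sample mean
alpha = 0.05   # significance level

# Test statistic: z = (x-bar - mu) / (sigma / sqrt(n))
z = (x_bar - mu_0) / (sigma / sqrt(n))

# Right-tail critical value (about 1.645) and right-tail p-value
z_critical = NormalDist().inv_cdf(1 - alpha)
p_value = 1 - NormalDist().cdf(z)

reject_h0 = z > z_critical
print(round(z, 2), round(z_critical, 3), round(p_value, 4), reject_h0)
```

Both decision rules agree: z is below the critical value and the p-value is above alpha, so H0 is not rejected.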
35Example: Using Empirical Rule in Hypothesis Testing
- H0: µ = 80
- HA: µ ≠ 80
- The following information is given
- Computed test statistic (z) = -1.14
- Significance level (alpha) = 0.05
- Should you reject H0?
36Example contd
- Two-tail HA with alpha = 0.05 has the same critical Z value as a 95% confidence interval
- Empirical rule, if bell-shaped
- 95% of data within two standard deviations of mean
- That is, 95% of the normal curve is between Z = -2.0 and Z = 2.0
- Computed test statistic (Z = -1.14) is between -2.0 and 2.0, so do NOT reject H0
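The empirical-rule shortcut above uses 2.0 as an approximate cutoff. A minimal sketch of the same two-tailed decision with the more precise critical value (about 1.96) follows; the names are illustrative, and the normal quantile comes from `statistics.NormalDist`.

```python
from statistics import NormalDist

z = -1.14     # computed test statistic from the slide
alpha = 0.05  # significance level

# Two-tailed test: alpha is split evenly between the two tails.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_h0 = abs(z) > z_critical
print(round(z_critical, 2), round(p_value, 4), reject_h0)
```

The conclusion matches the slide: |z| is inside the critical values, so H0 is not rejected.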
37Top Ten 3
- Confidence Intervals: Mean and Proportion
38Confidence Interval
- A confidence interval is a range of values within
which the population parameter is expected to
occur.
39Factors for Confidence Interval
- The factors that determine the width of a confidence interval are
- The sample size, n
- The variability in the population, usually estimated by the standard deviation.
- The desired level of confidence.
40Confidence Interval Mean
- Use normal distribution (Z table) if
- population standard deviation (sigma) is known and either (1) or (2)
- (1) Normal population
- (2) Sample size > 30
41Confidence Interval Mean
42Normal Table
- Tail = .5(1 - confidence level)
- NOTE! Different statistics texts have different normal tables
- This review uses the tail of the bell curve
- Ex: 95% confidence: tail = .5(1 - .95) = .025
- Z.025 = 1.96
43Example
- n = 49, Σx = 490, σ = 2, 95% confidence
- x-bar = 490/49 = 10, so the interval is 10 ± 1.96(2/square root of 49) = 10 ± 0.56
- 9.44 < µ < 10.56
44Another Example
- One of the SOM professors wants to estimate the mean number of hours worked per week by students. A sample of 49 students showed a mean of 24 hours. It is assumed that the population standard deviation is 4 hours. What is the population mean?
45Another Example contd
- 95 percent confidence interval for the population mean:
- 24 ± 1.96(4/square root of 49) = 24 ± 1.12
The confidence limits range from 22.88 to 25.12. We estimate with 95 percent confidence that the average number of hours worked per week by students lies between these two values.
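A confidence interval for the mean with known population standard deviation can be sketched as below. The function name `z_interval` is hypothetical; the tail area follows the ".5(1 - confidence level)" convention used in this review, and the Z value comes from `statistics.NormalDist` instead of a printed table.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence):
    """Confidence interval for a mean when sigma is known."""
    tail = 0.5 * (1 - confidence)        # area in one tail
    z = NormalDist().inv_cdf(1 - tail)   # e.g. about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Hours worked per week: n = 49, x-bar = 24, sigma = 4, 95% confidence
low, high = z_interval(24, 4, 49, 0.95)
print(round(low, 2), round(high, 2))
```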
46Confidence Interval Mean t distribution
- Use if normal population but population standard deviation (σ) not known
- If you are given the sample standard deviation (s), use t table, assuming normal population
- If one population, n-1 degrees of freedom
47Confidence Interval Mean t distribution
48Confidence Interval Proportion
- Use if success or failure
- (ex: defective or not defective, satisfactory or unsatisfactory)
- Normal approximation to binomial OK if
- (n)(p) > 5 and (n)(1-p) > 5, where
- n = sample size
- p = population proportion
- NOTE: NEVER use the t table if proportion!!
49Confidence Interval Proportion
- Ex: 8 defectives out of 100, so p = .08 and n = 100, 95% confidence
50Confidence Interval Proportion
- A sample of 500 people who own their house revealed that 175 planned to sell their homes within five years. Develop a 98% confidence interval for the proportion of people who plan to sell their house within five years.
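One way to work this example is the normal-approximation interval, p ± z × square root of p(1-p)/n. The sketch below assumes that formula; the function name `proportion_interval` is hypothetical and the Z value is computed with `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n                # sample proportion
    tail = 0.5 * (1 - confidence)        # area in one tail
    z = NormalDist().inv_cdf(1 - tail)   # about 2.33 for 98% confidence
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# 175 of 500 homeowners plan to sell within five years, 98% confidence
low, high = proportion_interval(175, 500, 0.98)
print(round(low, 3), round(high, 3))
```

Here (n)(p) = 175 and (n)(1-p) = 325 are both well above 5, so the normal approximation condition from the earlier slide is met.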
51Interpretation
- If 95% confidence, then 95% of all confidence intervals will include the true population parameter
- NOTE! Never use the term "probability" when estimating a parameter!! (ex: Do NOT say "Probability that population mean is between 23 and 32 is .95" because the parameter is not a random variable. In fact, the population mean is a fixed but unknown quantity.)
52Point vs. Interval Estimate
- Point estimate: statistic (single number)
- Ex: sample mean, sample proportion
- Each sample gives a different point estimate
- Interval estimate: range of values
- Ex: Population mean = sample mean ± error
- Parameter = statistic ± error
53Width of Interval
- Ex: sample mean = 23, error = 3
- Point estimate = 23
- Interval estimate = 23 ± 3, or (20, 26)
- Width of interval = 26 - 20 = 6
- Wide interval: point estimate unreliable
54Wide Confidence Interval If
- (1) small sample size (n)
- (2) large standard deviation
- (3) high confidence level (ex: 99% confidence interval wider than 95% confidence interval)
- If you want a narrow interval, you need a large sample size or small standard deviation or low confidence level.
55Top Ten 4
56Linear Regression
- Regression equation: y-hat = b0 + b1x
- y-hat = dependent variable = predicted value
- x = independent variable
- b0 = y-intercept = predicted value of y if x = 0
- b1 = slope = regression coefficient = change in y per unit change in x
57Slope vs. Correlation
- Positive slope (b1 > 0): positive correlation between x and y (y increases if x increases)
- Negative slope (b1 < 0): negative correlation (y decreases if x increases)
- Zero slope (b1 = 0): no correlation (predicted value for y is mean of y), no linear relationship between x and y
58Simple Linear Regression
- Simple: one independent variable, one dependent variable
- Linear: graph of regression equation is straight line
59Example
- y = salary (female manager, in thousands of dollars)
- x = number of children
- n = number of observations
60Given Data
61Totals
62Slope (b1) -6.5
- Method of Least Squares formulas not on BUS 302 exam
- b1 = -6.5 given
Interpretation: If one female manager has 1 more child than another, her expected salary is 6,500 lower; that is, the salary of female managers is expected to decrease by 6.5 (in thousands of dollars) per child
63Intercept (b0)
- b0 = 44.33 - (-6.5)(2.33) = 59.5
- If number of children is zero, expected salary is 59,500
64Regression Equation
65Forecast Salary If 3 Children
- 59.5 - 6.5(3) = 40
- 40,000 = expected salary
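The intercept and forecast above can be reproduced with a short sketch. It assumes that 44.33 and 2.33 are the sample means of y and x (the usual least-squares relationship b0 = y-bar - b1 × x-bar); the function name `predict_salary` is hypothetical.

```python
b1 = -6.5       # slope, given on the slide
mean_y = 44.33  # assumed: mean salary, in thousands of dollars
mean_x = 2.33   # assumed: mean number of children

# Intercept from the means; the slides round this to 59.5
b0 = mean_y - b1 * mean_x

def predict_salary(children, intercept=59.5, slope=-6.5):
    """Predicted salary (thousands of dollars) from the regression equation."""
    return intercept + slope * children

print(round(b0, 1), predict_salary(3))
```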
66Standard Error of Estimate
67Standard Error of Estimate
68Standard Error of Estimate
Actual salary typically 1,900 away from expected
salary
69Coefficient of Determination
- R2 = % of total variation in y that can be explained by variation in x
- Measure of how closely the linear regression line fits the points in a scatter diagram
- R2 = 1: max. possible value: perfect linear relationship between y and x (straight line)
- R2 = 0: min. value: no linear relationship
70Sources of Variation (V)
- Total V = Explained V + Unexplained V
- SS = Sum of Squares = V
- Total SS = Regression SS + Error SS
- SST = SSR + SSE
- SSR = Explained V, SSE = Unexplained V
71Coefficient of Determination
- R2 = SSR/SST
- R2 = 197/200.5 = .98
- Interpretation: 98% of total variation in salary can be explained by variation in number of children
72 0 ≤ R2 ≤ 1
- 0: No linear relationship since SSR = 0 (explained variation = 0)
- 1: Perfect relationship since SSR = SST (unexplained variation = SSE = 0), but does not prove cause and effect
73R = Correlation Coefficient
- Case 1: slope (b1) < 0
- R < 0
- R is the negative square root of the coefficient of determination
74Our Example
- Slope = b1 = -6.5
- R2 = .98
- R = -.99
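The step from sums of squares to R2 and then to R can be sketched as follows, using the SSR and SST values from the salary example. The sign of R is copied from the slope; the variable names are illustrative.

```python
from math import copysign, sqrt

ssr = 197.0   # regression (explained) sum of squares
sst = 200.5   # total sum of squares
b1 = -6.5     # slope of the regression line

r_squared = ssr / sst
# R takes the sign of the slope: negative slope gives negative R.
r = copysign(sqrt(r_squared), b1)

print(round(r_squared, 2), round(r, 4))
```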
75Case 2: Slope > 0
- R is the positive square root of the coefficient of determination
- Ex: R2 = .49
- R = .70
- R has no interpretation
- R overstates relationship
76Caution
- Nonlinear relationship (parabola, hyperbola, etc.) can NOT be measured by R2
- In fact, you could get R2 = 0 with a nonlinear graph on a scatter diagram
77Summary: Correlation Coefficient
- Case 1: If b1 > 0, R is the positive square root of the coefficient of determination
- Ex1: y = 4 + 3x, R2 = .36, R = .60
- Case 2: If b1 < 0, R is the negative square root of the coefficient of determination
- Ex2: y = 80 - 10x, R2 = .49, R = -.70
- NOTE! Ex2 has the stronger relationship, as measured by the coefficient of determination
78Extreme Values
- R = 1: perfect positive correlation
- R = -1: perfect negative correlation
- R = 0: zero correlation
79MS Excel Output
Correlation Coefficient (-0.9912) Note that you
need to change the sign because the sign of slope
(b1) is negative (-6.5)
Coefficient of Determination
Standard Error of Estimate
Regression Coefficient
80Top Ten 5
81Expected Value
- Expected Value = E(x) = ΣxP(x)
- = x1P(x1) + x2P(x2) + ...
- Expected value is a weighted average, also a long-run average
82Example
- Find the expected age at high school graduation if 11 were 17 years old, 80 were 18 years old, and 5 were 19 years old
- Step 1: 11 + 80 + 5 = 96
83Step 2
84Another Example of E(x)
- A news rack has 2 papers left. In the past, on 20% of days you sold both papers, while on 50% of days you sold one paper. Find the expected number of papers sold.
- Answer
- First, find P(0) = 1 - 0.20 - 0.50 = 0.30
- E(X) = 0(0.30) + 1(0.50) + 2(0.20) = 0.9
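The news rack calculation is a probability-weighted sum, which a short sketch makes explicit. The dictionary layout and names are illustrative choices.

```python
# Known probabilities for selling 1 or 2 papers
p_two = 0.20
p_one = 0.50
p_zero = 1 - p_two - p_one  # remaining probability: no papers sold

distribution = {0: p_zero, 1: p_one, 2: p_two}

# E(X) = sum of x * P(x) over all possible values of x
expected_papers = sum(x * p for x, p in distribution.items())

print(round(expected_papers, 2))
```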
85Top Ten 6
- What Distribution to Use?
86Use Binomial Distribution If
- Random variable (x) is number of successes in n trials
- Each trial is success or failure
- Independent trials
- Constant probability of success (p) on each trial
- Sampling with replacement (in practice, people may use binomial w/o replacement, but theory is with replacement)
87Success vs. Failure
- The binomial experiment can result in only one of two possible outcomes
- Male vs. Female
- Defective vs. Non-defective
- Yes or No
- Pass (8 or more right answers) vs. Fail (fewer than 8)
- Buy drink (21 or over) vs. Cannot buy drink
88Binomial Is Discrete
- Integer values
- 0, 1, 2, ..., n
- Binomial is often skewed, but may be symmetric
89Example of Binomial
- If 60% of all voters in a precinct are Democrats, find the probability that a sample of 3 voters has (a) all Democrats (b) no Democrats
- Answer for (a)
- P(3) = (0.6)(0.6)(0.6) = 0.216
- Answer for (b)
- P(0) = (0.4)(0.4)(0.4) = 0.064
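Both answers are special cases of the binomial probability formula, P(k) = C(n, k) × p^k × (1-p)^(n-k). A minimal sketch, with the hypothetical helper name `binomial_pmf`:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 3, 0.6  # 3 voters sampled, 60% of voters are Democrats

p_all_democrats = binomial_pmf(3, n, p)
p_no_democrats = binomial_pmf(0, n, p)

print(round(p_all_democrats, 3), round(p_no_democrats, 3))
```

The same function also gives the in-between cases (exactly 1 or 2 Democrats), which the shortcut multiplication on the slide does not cover.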
90Normal Distribution
- Continuous, bell-shaped, symmetric
- Mean = median = mode
- Measurement (dollars, inches, years)
- Cumulative probability under normal curve: use Z table if you know population mean and population standard deviation
- Sample mean: use Z table if you know population standard deviation and either normal population or n > 30
91t Distribution
- Continuous, mound-shaped, symmetric
- Applications similar to normal
- More spread out than normal
- Use t if normal population but population standard deviation not known
- Degrees of freedom = df = n-1 if estimating the mean of one population
- t approaches z as df increases
92Normal or t Distribution?
- Use t table if normal population but population standard deviation (σ) is not known
- If you are given the sample standard deviation (s), use t table, assuming normal population
93Top Ten 7
94P-value
- P-value = probability of getting a sample statistic as extreme as (or more extreme than) the sample statistic you got from your sample, given that the null hypothesis is true
95P-value Example: one-tail test
- H0: µ = 40
- HA: µ > 40
- Sample mean = 43
- P-value = P(sample mean > 43, given H0 true)
- Meaning: probability of observing a sample mean at least as large as 43 when the population mean is 40
- How to use it: Reject H0 if p-value < α (significance level)
96Two Cases
- Suppose α = .05
- Case 1: suppose p-value = .02, then reject H0 (unlikely H0 is true; you believe population mean > 40)
- Case 2: suppose p-value = .08, then do not reject H0 (H0 may be true; you have reason to believe that the population mean may be 40)
97P-value Example: two-tail test
- H0: µ = 70
- HA: µ ≠ 70
- Sample mean = 72
- If two-tails, then P-value = 2 × P(sample mean > 72) = 2(.04) = .08
- If α = .05, p-value > α, so do not reject H0
98Top Ten 8
- Variation Creates Uncertainty
99No Variation
- Certainty, exact prediction
- Standard deviation = 0
- Variance = 0
- All data exactly the same
- Example: all workers in minimum wage job
100High Variation
- Uncertainty, unpredictable
- High standard deviation
- Ex 1: Workers in downtown L.A. have variation between CEOs and garment workers
- Ex 2: New York temperatures in spring range from below freezing to very hot
101Comparing Standard Deviations
- Temperature Example
- Beach city: small standard deviation (single temperature reading close to mean)
- High Desert city: high standard deviation (hot days, cool nights in spring)
102Standard Error of the Mean
- Standard deviation of sample mean
- = standard deviation/square root of n
- Ex: standard deviation = 10, n = 4, so standard error of the mean = 10/2 = 5
- Note that 5 < 10, so standard error < standard deviation.
- As n increases, standard error decreases.
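A short sketch of the standard error formula shows how it shrinks as the sample size grows. The function name `standard_error` and the extra sample sizes are illustrative.

```python
from math import sqrt

def standard_error(sd, n):
    """Standard error of the mean: standard deviation / square root of n."""
    return sd / sqrt(n)

# Slide example first (n = 4), then larger samples with the same SD
for n in (4, 25, 100):
    print(n, standard_error(10, n))
```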
103Sampling Distribution
- Expected value of sample mean = population mean, but an individual sample mean could be smaller or larger than the population mean
- Population mean is a constant parameter, but sample mean is a random variable
- Sampling distribution is distribution of sample means
104Example
- Mean age of all students in the building is population mean
- Each classroom has a sample mean
- Distribution of sample means from all classrooms is sampling distribution
105Sampling
- Sampling Distribution concepts assume probability sample
- Probability sample requires calculation of probability of being in the sample
- Probability sample more accurate than judgment or convenience sample
106Central Limit Theorem (CLT)
- If n > 30, the sampling distribution of sample means is approximately normal (use the Z table if the population standard deviation is known)
- CLT applies even if original population is skewed
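A small simulation can illustrate the theorem. The sketch below is an assumption-laden example, not part of the original review: it draws repeated samples of size 36 from a skewed exponential population (mean 1, standard deviation 1) and checks that the sample means center on the population mean with spread close to standard deviation/square root of n.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

n = 36              # size of each sample (greater than 30)
num_samples = 5000  # number of repeated samples

# Skewed population: exponential with mean 1 and standard deviation 1
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)  # should be near 1
sd_of_means = statistics.stdev(sample_means)    # should be near 1/6

print(round(mean_of_means, 3), round(sd_of_means, 3))
```

A histogram of `sample_means` would look roughly bell-shaped even though individual exponential values are strongly skewed to the right.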
107Top Ten 9
108Population
- Collection of all items (all light bulbs made at factory)
- Parameter: measure of population
- (1) population mean (average number of hours in life of all bulbs)
- (2) population proportion (% of all bulbs that are defective)
109Sample
- Part of population (bulbs tested by inspector)
- Statistic: measure of sample = estimate of parameter
- (1) sample mean (average number of hours in life of bulbs tested by inspector)
- (2) sample proportion (% of bulbs in sample that are defective)
110Top Ten 10
- Qualitative vs. Quantitative
111Levels of Measurement
- I. QUALITATIVE
- Nominal: No order (Ex: color)
- Ordinal: Order important (Ex: good, fair, poor)
- II. QUANTITATIVE
- Interval: Order important (Ex: temp, shoe size)
- Ratio: Order important AND ratio meaningful (Ex: 20/hr twice as good as 10/hr)
112Qualitative
- Categorical data
- success vs. failure
- ethnicity
- marital status
- color
- zip code
- 4 star hotel in tour guide
113Qualitative
- If you need an "average", do not calculate the mean
- However, you can compute the mode ("average" person is married, buys a blue car made in America)
114Quantitative
- Two cases
- Case 1: discrete
- Case 2: continuous
115Discrete
- (1) integer values (0, 1, 2, ...)
- (2) example: binomial
- (3) finite number of possible values
- (4) counting
- (5) number of brothers
- (6) number of cars arriving at gas station
116Continuous
- Real numbers, such as decimal values (22.22)
- Examples: Z, t
- Infinite number of possible values
- Measurement
- Miles per gallon, distance, duration of time
117Graphical Tools
- Pie chart or bar chart: qualitative
- Joint frequency table: qualitative (relate marital status vs. zip code)
- Scatter diagram: quantitative (distance from CSUN vs. duration of time to reach CSUN)
118Hypothesis Testing and Confidence Intervals
- Quantitative: Mean
- Qualitative: Proportion