Chapter 10/ Chapter 11 - PowerPoint PPT Presentation

Provided by: MBre65
1
Chapter 10/ Chapter 11
2
Main Tools
  • Pie charts
  • Line graphs
  • Histograms
  • Stem plots
  • Box plots
  • 5 number summaries

3
Representing the Data
  • Most everyone is familiar with using graphs to
    display information about data
  • Review Handout for more details
  • When evaluating a graph look for
  • Trends, patterns, deviations, variation
  • Read making good graphs in Ch 10
  • To describe distributions look at the shape, look
    for center and spread, patterns

4
Evaluating Distributions (cont)
  • Look for Skewness
  • Right skewed
  • Left skewed
  • Look for symmetry (two halves of the graph are
    mirror images of each other)

5
Examples of Bad Graphs
  • Evaluate the following graphs and identify the
    problems

6
(No Transcript)
7
(No Transcript)
8
(No Transcript)
9
(No Transcript)
10
Interpretation of the misleading pictogram
Country   Exports   % diff    Exports³      % diff
USA       582       --        197,137,368   --
Germany   502       16        126,506,008   56
Japan     421       19 (38)   74,618,461    70 (164)
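The figures in this table can be re-derived. The sketch below assumes that the second pair of columns is the cube of each export value (the numbers match exactly), which is what happens when a pictogram scales a symbol in height, width, and depth at once. The helper names are illustrative.

```python
# Export figures as given on the slide.
exports = {"USA": 582, "Germany": 502, "Japan": 421}

def pct_more(a, b):
    """Percent by which a exceeds b."""
    return (a / b - 1) * 100

def compare(a, b):
    """Return (true % difference, % difference after cubing both values)."""
    return round(pct_more(a, b)), round(pct_more(a ** 3, b ** 3))

usa_vs_germany = compare(exports["USA"], exports["Germany"])
germany_vs_japan = compare(exports["Germany"], exports["Japan"])
usa_vs_japan = compare(exports["USA"], exports["Japan"])
print(usa_vs_germany, germany_vs_japan, usa_vs_japan)
```

A true difference of 16% is drawn as a difference of 56%, which is why the pictogram misleads.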
11
Stem and Leaf Ex.
  • Make a stem and leaf plot from the following data
  • 100 110 114 115 115 112 116 118 120
  • 33 48 24 43 33 39 22 29 38 20 25 37
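One possible way to build these plots in code is sketched below. It assumes non-negative whole numbers, takes the last digit as the leaf and the remaining digits as the stem; the function names are illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

def show(values):
    plot = stem_and_leaf(values)
    # Print every stem between the smallest and largest, even one with no leaves.
    for stem in range(min(plot), max(plot) + 1):
        leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
        print(f"{stem:>3} | {leaves}")

first = [100, 110, 114, 115, 115, 112, 116, 118, 120]
second = [33, 48, 24, 43, 33, 39, 22, 29, 38, 20, 25, 37]
show(first)
show(second)
```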

12
Recap
  • A good way to display data is using a graph
  • When constructing a graph use good techniques
  • When evaluating a graph/distribution look for
    patterns, outliers, variation, skewness etc.
  • Try to describe distributions by the shape

13
Chapter 12
  • Describing Distributions with Numbers
  • Statistics-Concepts and Controversies, 6th
    edition, David S. Moore

14
Comparing Data
  • Median and Quartiles
  • One way to describe the spread of the data is to
    give the median and the quartiles
  • The median is the midpoint of a distribution:
    half of the numbers are larger and half are
    smaller
  • To find the median
  • Order the observations from smallest to largest
  • If the number of observations n is odd, then the
    median is the middle value. You can find the
    median by counting (n+1)/2 observations from the
    top or bottom of the list.
  • If the number of observations is even then the
    median is the average of the two middle values.

15
Example
  • Find the median of the following data sets
  • 1 5 9 6 3 7 8 5 1
  • 95 68 35 42 96 55 25 35 13 79 28 85
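The median procedure from the previous slide can be sketched as follows, applied to the two data sets above. The function name is illustrative.

```python
def median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n+1)/2-th observation, which is index n//2 when counting from 0.
        return ordered[mid]
    # Even n: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 5, 9, 6, 3, 7, 8, 5, 1]))                          # 5
print(median([95, 68, 35, 42, 96, 55, 25, 35, 13, 79, 28, 85]))     # 48.5
```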

16
  • Quartiles
  • The quartiles are the numbers that divide the
    data into quarters.

17
Comparing Data
  • Calculating the Quartiles
  • Arrange the observations from smallest to largest
    and find the median
  • Q1: the median of all observations below the
    median (Q2)
  • Q3: the median of all observations above the
    median (Q2)

18
Examples
  • Find the median and quartiles of these data sets
  • 1 2 4 7 9 4 6 5 5 2 8 9 0 1 7 8 8
    3 2 6
  • 29 84 57 89 46 35 76 83 26 35 15 36
    47 98 63 62 88 87 35 62

19
The 5 Number Summary
  • Minimum
  • Q1
  • Median(Q2)
  • Q3
  • Maximum
  • These numbers give a good identification of
    center and spread of the data
  • What is the 5 number summary for the two data
    sets we just examined?
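A minimal sketch of the five-number summary, using the quartile rule stated earlier (Q1 and Q3 are the medians of the observations below and above the median). Other software may use slightly different quartile conventions; the function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # observations below the median's position
    upper = ordered[half + n % 2:]   # observations above it (middle value skipped when n is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

set_a = [1, 2, 4, 7, 9, 4, 6, 5, 5, 2, 8, 9, 0, 1, 7, 8, 8, 3, 2, 6]
set_b = [29, 84, 57, 89, 46, 35, 76, 83, 26, 35, 15, 36,
         47, 98, 63, 62, 88, 87, 35, 62]
print(five_number_summary(set_a))   # (0, 2.0, 5.0, 7.5, 9)
print(five_number_summary(set_b))   # (15, 35.0, 59.5, 83.5, 98)
```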

20
Visuals of the 5 Number Summary
  • Box plot
  • A box plot contains all 5 numbers in the 5 number
    summary
  • To illustrate, draw a box plot of the previous
    data set

21
Other Measures
  • Mean: another way to measure the center of a
    distribution, also called the average (the
    median is more robust than the mean)
  • Sample mean: $\bar{x} = \dfrac{\text{sum of all observations}}{\text{number of observations}}$
  • Note: the population mean is denoted $\mu$

22
Other Measures
  • Sample standard deviation s:
  • roughly the average distance of an observation
    from the mean
  • How to find the standard deviation:
  • Find the distance of each observation from the
    mean and then square it
  • Add all the squared distances together, then
    average them by dividing by n - 1 instead of n
    (this is the variance)
  • Take the square root
  • Note: the population standard deviation is
    denoted $\sigma$
  • Best illustrated with examples

23
Examples
  • Given this data set find the mean and standard
    deviation
  • 8 9 6 14 8 3
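The steps for the mean and the sample standard deviation can be followed in code. This is a sketch with illustrative function names, applied to the data set above.

```python
from math import sqrt

def sample_mean(values):
    return sum(values) / len(values)

def sample_std(values):
    m = sample_mean(values)
    squared_distances = [(v - m) ** 2 for v in values]       # step 1: square each distance
    variance = sum(squared_distances) / (len(values) - 1)    # step 2: divide by n - 1
    return sqrt(variance)                                    # step 3: take the square root

data = [8, 9, 6, 14, 8, 3]
print(sample_mean(data))            # 8.0
print(round(sample_std(data), 3))   # 3.633
```

Here the squared distances sum to 66, so the variance is 66 / 5 = 13.2.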

24
More about the Standard Deviation
  • s measures the spread about the mean.
  • We use s to describe the spread of a distribution
    only when we use the mean to describe the
    center.
  • s = 0 only when there is no spread.
  • This happens only when all the observations have
    the same value. Otherwise, s > 0.
  • As the observations become more spread out about
    their mean, s gets larger.

25
Choosing Numerical Descriptions
  • The five-number summary is the best short
    description for most distributions.
  • The mean and standard deviation are harder to
    understand but are more common.
  • We must keep in mind that the mean is greatly
    influenced by a few extreme observations; the
    median is not.

26
  • Symmetric distributions: The mean and the median
    are about the same.
  • Skewed distributions: The mean runs away from
    the median toward the long tail.
  • i.e., skewed right: median < mean
  • Standard deviation: It is pulled up by outliers
    or long tails. The quartiles are much less
    sensitive to a few extreme observations.

27
Chapter 13
  • Normal Distributions
  • Statistics-Concepts and Controversies, 6th
    Edition, David S. Moore

28
Strategies for Exploring Data
  • Plot the data
  • Histogram/(stem and leaf plot)
  • Best to visualize shape
  • Look for the overall pattern and for striking
    deviations such as outliers
  • Choose either the five-number summary or the mean
    and standard deviation to briefly describe the
    center and spread in the data.
  • Sometimes, the overall pattern of a large number
    of observations is so regular that we can
    describe it by a smooth curve.

29
  • Histograms of Large Data Sets (Density Curves)
  • Remark: A relative frequency histogram for an
    infinitely large data set looks like a smooth
    curve.
  • Notes:
  • 1.) By convention, area (not height) measures
    relative frequency.
  • 2.) Area under the entire curve = 1

[Figure: density curve over x, with points a and b marked and the area labeled as relative frequency]
30
Center and Spread
  • Areas under a density curve represent proportions
    of the total of observations.
  • Median: the center point such that each half has
    equal area
  • Quartiles: divide the area under the curve into
    quarters
  • Mean: the balance point (the point at which the
    curve would balance if made of solid material)
  • Symmetric density curve: the mean and median are
    the same point.
  • Curve that is skewed to the right: the mean is
    larger than the median.
  • Recall that the mean is affected by outliers more
    than the median. Hence, the few high
    observations pull the mean towards the tail.
  • Curve that is skewed to the left: the mean is
    smaller than the median.

31
The Normal Distribution
  • Symmetric, bell-shaped curves with the following
    properties
  • A specific normal curve is completely described
    by giving its mean and standard deviation.
  • The mean determines the center.
  • it is symmetric about the mean
  • The standard deviation determines the shape of
    the curve.
  • It is the distance from the mean to the
    change-of-curvature points on either side.

32
The Empirical Rule
  • A.K.A. the 68-95-99.7 Rule
  • In any normal distribution,
  • approximately 68% of the observations fall within
    one standard deviation of the mean.
  • approximately 95% of the observations fall within
    two standard deviations of the mean.
  • approximately 99.7% of the observations fall
    within three standard deviations of the mean.
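The three percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) for the standard normal curve; the helper name `within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

def within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within(k) * 100, 1))
```

The exact values are close to, but not exactly, 68, 95, and 99.7, which is why the rule is stated with "approximately".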

33
Visualization of the Empirical Rule
34
Standard Scores Z
  • One way of describing the location of any
    observation in a normal distribution is to
    calculate its standard score. This score
    indicates how many standard deviations an
    observation lies above or below the mean.
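In symbols, the standard score is (observation - mean) / (standard deviation). A minimal sketch follows, using the numbers from Example 1 on the next slide; the function name is illustrative.

```python
def standard_score(observation, mean, sd):
    """Number of standard deviations the observation lies above (+) or below (-) the mean."""
    return (observation - mean) / sd

z = standard_score(25, mean=50, sd=12)
print(round(z, 2))   # -2.08
```

A negative score means the observation lies below the mean, here by a little more than two standard deviations.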

35
  • Example 1
  • The mean of a data set is 50 and the standard
    deviation is 12. What is the standard score if I
    have an observation of 25?
  • Example 2

36
Percentiles
  • The cth percentile of a distribution is a value
    such that c percent of the observations lie below
    it and the rest lie above.
  • Think of the quartiles
  • We can use Appendix B to find the percentiles of
    standard scores

37
Example
  • Using table B and the previous Example 2, what is
    the percentile for the standard score for our
    observation of 37?
  • What is the percentile of 24?

38
Examples
  • Suppose that on a statistics test the mean grade
    is 75 and the standard deviation is 8. Assume
    that the scores on the test vary according to a
    distribution that is approximately normal.
  • Sixty-eight percent of the data fall into what
    range?
  • Almost all (99.7%) of the scores fall in what
    range?
  • How high did the top 2.5% of the class score?
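These questions need only the 68-95-99.7 rule and arithmetic. A sketch with illustrative names:

```python
def range_within(mean, sd, k):
    """Interval covering k standard deviations on either side of the mean."""
    return (mean - k * sd, mean + k * sd)

mean, sd = 75, 8
middle_68 = range_within(mean, sd, 1)     # about 68% of scores
middle_997 = range_within(mean, sd, 3)    # about 99.7% of scores
# About 95% lie within 2 standard deviations, so about 5% lie outside,
# split evenly between the two tails: the top 2.5% score above mean + 2 sd.
top_2_5_cutoff = mean + 2 * sd
print(middle_68, middle_997, top_2_5_cutoff)
```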

39
Example
  • A set of SAT scores has a mean of 890 and a
    standard deviation of 120. Assume the data are
    bell-shaped.
  • Lane's SAT score is 1130. Calculate her standard
    score.
  • Approximately what proportion of students
    received a score higher than Lane's?
  • Mike's SAT score is 770. Calculate his standard
    score.
  • Approximately what proportion of students
    received a score lower than Mike's?
  • Kevin's SAT score is 1010. Calculate his
    standard score.
  • Approximately what proportion of students
    received a score lower than Kevin's?
  • Approximately what proportion of students
    received a score between 650 and 890?
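The same questions can be answered with software instead of the empirical rule. The sketch below assumes the scores are exactly normal and uses `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

sat = NormalDist(mu=890, sigma=120)

def standard_score(x):
    return (x - sat.mean) / sat.stdev

lane_z = standard_score(1130)
mike_z = standard_score(770)
kevin_z = standard_score(1010)

above_lane = 1 - sat.cdf(1130)                 # proportion scoring higher than Lane
below_mike = sat.cdf(770)                      # proportion scoring lower than Mike
below_kevin = sat.cdf(1010)                    # proportion scoring lower than Kevin
between_650_890 = sat.cdf(890) - sat.cdf(650)  # proportion between 650 and 890

print(lane_z, mike_z, kevin_z)
print(round(above_lane, 3), round(below_mike, 3),
      round(below_kevin, 3), round(between_650_890, 3))
```

The empirical rule gives the rounder answers 2.5%, 16%, 84%, and 47.5%; the exact normal values differ slightly.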

40
Example
  • The length of human pregnancies from conception
    to birth varies according to a distribution that
    is approximately normal with mean 266 days and
    standard deviation 16 days. Use this information
    to answer the questions below.
  • Between what values do the lengths of the middle
    99.7% of all pregnancies fall?
  • About what percent of the pregnancies are less
    than 234 days?
  • How long are the longest 16% of all human
    pregnancies?
  • What percent of these pregnancies last less than
    258 days?
  • What percent of these pregnancies last more than
    290 days?
  • What percent of these pregnancies last between
    258 and 290 days?
  • How long is a pregnancy which falls into the
    13.57th percentile?

41
Example
  • SAT math scores among Kentucky high school
    seniors in a recent year were normally
    distributed with mean 440 and standard deviation
    60. Use this information to answer the following
    questions.
  • Into what percentile would a student with a score
    of 398 have fallen?
  • What score would a student have achieved if she
    fell into the top 3.6%?
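Going from a score to a percentile uses the normal curve in one direction; going from a percentile back to a score uses it in reverse. The sketch below uses `NormalDist.cdf` and `NormalDist.inv_cdf` in place of the printed table, so answers may differ slightly from table lookups.

```python
from statistics import NormalDist

scores = NormalDist(mu=440, sigma=60)

# Score -> percentile: the percent of students scoring below 398.
percentile_398 = scores.cdf(398) * 100

# Percentile -> score: the top 3.6% begins at the 96.4th percentile.
cutoff_top_3_6 = scores.inv_cdf(1 - 0.036)

print(round(percentile_398, 1), round(cutoff_top_3_6))
```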

42
Example
  • Suppose that the height of adult males is
    normally distributed with a mean of 70 inches
    and a standard deviation of 2.5 inches.
  • What percentile does a man who is 68 inches fall
    into?
  • What percentile does a man who is 73 inches fall
    into?
  • What proportion of men are shorter than 74
    inches?
  • What percent of men are taller than 72 inches?
  • How tall is a man in the 9.68th percentile?
  • How tall is a man who has 8% of all men taller
    than him?
  • Determine the percentage of men falling between
    69.25 inches and 73.5 inches.

43
Example
  • In the summer, a grocery store brings in a large
    supply of watermelons. The mean weight in pounds
    is 22. The variance is 16.
  • What percent of watermelons weigh between 18 and
    20 pounds?
  • What percent of watermelons weigh less than 18
    pounds?
  • What percent of watermelons weigh more than 17
    pounds?
  • What percent of watermelons weigh more than 30
    pounds?
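Note that this example gives the variance, not the standard deviation, so the first step is to take the square root. The sketch below additionally assumes the weights are normally distributed, which the slide does not state explicitly.

```python
from math import sqrt
from statistics import NormalDist

variance = 16
weights = NormalDist(mu=22, sigma=sqrt(variance))   # standard deviation is 4 pounds

between_18_and_20 = weights.cdf(20) - weights.cdf(18)
below_18 = weights.cdf(18)
above_17 = 1 - weights.cdf(17)
above_30 = 1 - weights.cdf(30)

for name, value in [("18 to 20", between_18_and_20), ("below 18", below_18),
                    ("above 17", above_17), ("above 30", above_30)]:
    print(name, round(value * 100, 1))
```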

44
Chapter 14
  • Describing Relationships
  • Scatter plots and Correlation
  • Statistics-Concepts and Controversies, 6th
    Edition, David S. Moore

45
Scatterplot
  • Show the relationship between two quantitative
    variables measured on the same individual. Each
    individual appears as a point in the plot, fixed
    by the values of both variables for that
    individual.
  • When applicable, put the explanatory variable on
    the horizontal axis and the response variable on
    the vertical axis.

46
Interpretation
  • Look for an overall pattern and for striking
    deviations from that pattern.
  • Describe the pattern by form, direction,
    strength of the relationship.
  • Form: Look for clusters and for the shape
    (i.e. curved/linear/nothing/other)
  • Direction: Is it positively associated,
    negatively associated, or neither?
  • Strength: Determined by how closely the points
    follow a clear form.
  • Locate outliers

47
(No Transcript)
48
Linear Relationships
  • Why?
  • Correlation: Describes the direction and strength
    of a straight-line relationship between 2
    quantitative variables. The notation is r.

49
Understanding Correlation
  • Some important facts about Correlation
  • Positive r indicates positive association.
  • Negative r indicates negative association.
  • Always between -1 and 1.
  • The closer r is to 1 or -1, the stronger the
    relationship.
  • Does not change if we change units
  • Ignores the distinction between explanatory and
    response variables.
  • Measures the strength of ONLY straight-line
    associations between 2 variables.
  • Strongly affected by a few outlying observations.
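Several of these facts can be checked directly. The slides do not give a formula for r; the sketch below assumes the usual Pearson correlation, and the height and weight values are made up for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

heights_in = [64, 66, 68, 70, 72]        # hypothetical heights in inches
weights_lb = [130, 145, 150, 170, 180]   # hypothetical weights in pounds

r = correlation(heights_in, weights_lb)
r_in_cm = correlation([h * 2.54 for h in heights_in], weights_lb)   # change of units
r_swapped = correlation(weights_lb, heights_in)                     # swap x and y
print(round(r, 3), round(r_in_cm, 3), round(r_swapped, 3))
```

All three values agree: r is positive for a positive association, is unchanged by converting inches to centimeters, and does not depend on which variable is called explanatory.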

50
Examples
  • Would you expect these correlations to be
    positive, negative or nothing (i.e. r = 0)?
  • The heights and weights of adult men
  • The age of secondhand cars and their prices
  • The weight of new cars and their gas mileages in
    miles per gallon
  • The heights and the IQ scores of adult men
  • The heights of husbands and the heights of their
    wives
  • The number of work-hours in safety training and
    the number of work-hours lost due to accidents

51
Chapter 15
  • Describing Relationships: Regression, Prediction
    and Causation
  • Statistics-Concepts and Controversies, 6th
    Edition, David S. Moore

52
Regression
  • When we have a scatterplot with a linear
    relationship, we are often interested in
    summarizing the overall pattern. We can do this
    by drawing a line on the graph. This type of
    line is called a regression line. A regression
    line is a straight line that describes how a
    response variable y changes as an explanatory
    variable x changes. We are often interested in
    using this line to predict a value of y for a
    given value of x.

53
Regression Lines
  • In order to draw a regression line, we must have
    a regression equation.
  • Using the data we are given we come up with the
    best regression equation which will result in the
    best regression line. The best regression line
    is the one that comes the closest to the data
    points in the vertical direction. There are many
    ways to make this distance as small as
    possible.
  • Least-squares method: the most common method.
    The least-squares regression line of y on x is
    the line that makes the sum of the squares of the
    vertical distances of the data points from the
    line as small as possible.
  • Note: We will not actually perform the
    least-squares method. Instead, we are interested
    in being able to use the resulting line.

54
Background of a Line
  • Recall from past math courses that the equation
    of a line has the form y = a + bx.
  • We write the regression equation in the form
    y = a + bx.
  • y represents the (response) variable on the
    y-axis, or the vertical axis.
  • x represents the (explanatory) variable on the
    x-axis, or the horizontal axis.
  • b represents the slope. The slope tells us how
    much increase there is in the y for every one
    unit increase in the x. In other words, b = 3
    would mean that if the x variable increases one
    unit, then the y variable increases 3 units.
  • a represents the y-intercept, or the point where
    the line crosses the vertical axis.
  • We can use this equation to predict the value of
    y for any given x.

55
Example
  • The regression line between the age of a wife and
    the age of a husband is given by y = 3.6 + 0.97x
    where x is the wife's age in years and y is the
    husband's age in years.
  • If a wife is 30 years old, then we would estimate
    that her husband is approximately 32.7 years old.
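The prediction in this example can be reproduced with a one-line function. The names are illustrative; the intercept 3.6 and slope 0.97 are the values from the slide.

```python
def predict_husband_age(wife_age, a=3.6, b=0.97):
    """Predicted husband's age from the regression line y = a + bx."""
    return a + b * wife_age

print(round(predict_husband_age(30), 1))   # 32.7

# The slope is the change in predicted y for a one-unit increase in x.
print(round(predict_husband_age(31) - predict_husband_age(30), 2))   # 0.97
```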

56
Correlation and Regression
  • Recall that correlation measures the strength and
    direction of a linear relationship. We now know
    that regression is what is used to draw the line
    representing this relationship.
  • Correlation and regression are closely connected.
  • Both correlation and regression are strongly
    affected by outliers.
  • The usefulness of the regression line depends on
    the correlation between the two variables.
  • We use the square of the correlation, called
    r-squared ($r^2$). It is the fraction of the
    variation in the values of y that is explained
    by the least-squares regression of y on x.
  • The idea is that when there is a straight-line
    relationship, some of the variation in y is
    accounted for by the fact that as x changes, it
    pulls y along with it.
  • Ex. If r = 0.6, then $r^2$ = 0.36, meaning that
    roughly 36% of the variation is accounted for by
    the straight-line relationship.
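The claim that the squared correlation is the fraction of variation explained can be verified on a small data set. The sketch below fits the least-squares line with the standard closed-form slope and intercept (not shown in the slides) and compares the explained fraction with the squared correlation. The data are made up for illustration.

```python
from math import sqrt

def least_squares(xs, ys):
    """Intercept a and slope b of the least-squares line y = a + bx."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b * mean_x, b

xs = [1, 2, 3, 4, 5]   # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]   # hypothetical response values

a, b = least_squares(xs, ys)
mean_y = sum(ys) / len(ys)
total_variation = sum((y - mean_y) ** 2 for y in ys)
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
fraction_explained = 1 - unexplained / total_variation

mean_x = sum(xs) / len(xs)
r = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sqrt(sum((x - mean_x) ** 2 for x in xs) * total_variation))
print(round(fraction_explained, 3), round(r ** 2, 3))
```

For these data the line is y = 2.2 + 0.6x, and both printed values are 0.6, so 60% of the variation in y is accounted for by the line.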

57
Prediction
  • Prediction is based on fitting some model to a
    set of data.
  • All of the models we will be looking at involve
    only a linear relationship between one
    explanatory and one response variable. Other
    prediction methods use more elaborate models.
  • Prediction works best when the model fits the
    data closely.
  • The closer the data actually follows a linear
    pattern, the better the prediction will be.
  • Prediction outside the range of the available
    data is risky.
  • It is not a good idea to use a regression
    equation to predict values far outside the range
    where the original data fell. For example, the
    data used to calculate the regression equation
    for the relationship between a husband's and a
    wife's age given above came from men and women
    ranging in age from about 20 to about 65.
    Therefore, it would not be a good idea to use
    this equation to estimate the age of a woman's
    husband if she was 75. We should only use the
    equation for a minor extrapolation beyond the
    range of the original data.
  • i.e., you wouldn't use a young child's growth to
    predict how tall they will be at age 40

58
Causation
  • Watch Out! A strong relationship between two
    variables does not always mean that changes in
    one variable cause changes in the other.
  • The relationship between two variables is often
    influenced by other variables lurking in the
    background.

59
Funny Examples
  • A strong correlation has been found in a certain
    city in the northeastern United States between
    weekly sales of hot chocolate and weekly sales of
    facial tissues.
  • Can we conclude causation?
  • There is a strong correlation between the number
    of women in the work force versus the number of
    Christmas trees sold in the United States for
    each year between 1930 and the present.
  • Can we conclude causation?

60
Why Two Variables Could Be Related
  • The explanatory variable is the direct cause of
    the response variable.
  • Example: Variable A is the pollen count from
    grasses and variable B is the percentage of
    people suffering from allergy symptoms, measured
    over a year. A is the direct cause of B.
  • The response variable is causing a change in the
    explanatory variable.
  • Example: In a study in Resource Manual, it was
    noted that divorced men were twice as likely to
    abuse alcohol as married men. The authors
    concluded that getting divorced caused alcohol
    abuse. But, it is just as reasonable to assume
    that alcohol abuse causes divorce.

61
Why Two Variables Could Be Related
  • The explanatory variable is a contributing, but
    not the sole, cause of the response variable.
  • Example: Consider the relationship between hours
    studied per day and grade point average.
    Studying increases grade point average, but it is
    also reasonable that a desire to do well in
    school means that a person studies more and that
    their grade point average is high.
  • Confounding variables may exist.
  • Example: Meditation was found to be related to
    lower levels of an aging factor. It may be that
    meditation does indeed slow aging, but the
    influence of other factors cannot be separated in
    terms of their effect on aging, like a general
    concern for one's well-being.

62
Why Two Variables Could Be Related
  • Both variables may result from a common cause.
  • Look at the example under reason two. Divorce
    and alcohol abuse are related. It may be that
    both result from an unhappy relationship, for
    whatever reason.
  • Both variables are changing over time.
  • Example: The number of divorces and the number
    of suicides have both increased dramatically
    since 1900. This does not mean that divorces are
    causing suicides. All such statistics increase
    as the population increases.

63
Why Two Variables Could Be Related
  • The association may be nothing more than
    coincidence.
  • Example: Paulos (1994) relates this story.
    Consider the near panic that ensued last year
    when a guest on a national talk show blamed his
    wife's recent death from brain cancer on her use
    of a cellular telephone. The man alleged that
    there was a causal connection between his wife's
    frequent use of their cellular phone and her
    subsequent brain cancer. It is then noted that
    if brain cancer rates among cellular phone users
    were equal to the rate for the general population
    there would be about 700 cases a year, yet only a
    few have come to light.

64
Causation
  • The best evidence for causation comes from
    randomized comparative experiments.
  • The only legitimate way to try to establish a
    causal connection statistically is through the
    use of designed experiments.
  • Evidence of a possible causal connection.
  • 1. There is a reasonable explanation of cause and
    effect.
  • 2. The connection happens under varying
    conditions.
  • 3. Potential confounding variables are ruled out.
  • Other things to keep in mind: Data from an
    observational study in the absence of any other
    evidence cannot be used to establish causation.

65
Example
  • For a certain type of automobile, yearly repair
    costs in dollars (Y) are approximately linearly
    related to the age in years (X) of the car. A
    sample of cars which were 1 to 10 years old
    yielded a regression line of
  • y = 69.7548 + 9.5221x.
  • Estimate the repair costs of a 6-year-old car and
    a 3-year-old car.
  • What is the slope of the regression line?
  • Interpret what the slope means for this
    regression line.
  • Is the correlation between repair cost and age
    positive or negative? Support your answer.

66
Chapter 21
  • What is a Confidence Interval?
  • Statistics-Concepts and Controversies, 6th
    Edition, David S. Moore

67
What is a Confidence Interval?
  • Statistical inference: draws conclusions about a
    population on the basis of data from a sample.
  • Recall: parameters describe populations, whereas
    statistics describe samples.
  • We will use a sample statistic to estimate a
    population parameter.

68
Estimating Sample Statistics
  • p-hat will be the statistic that we use to
    estimate the true population proportion p, the
    parameter we wish to estimate.

69
Examples
  • Suppose I survey 200 UK students and ask if they
    have studied in the library in the past week.
    Suppose 150 answer yes.
  • What is the sample proportion?
  • What is the sample?
  • What is the population?
  • I take a survey of 500 Gainesville, FL residents,
    who are over 18 and registered to vote, and ask
    if they are planning to vote in the next
    election. Suppose 450 answer yes.
  • What is the sample proportion?
  • What is the sample?
  • What is the population?

70
Confidence Intervals
  • 95% confidence interval: an interval calculated
    from sample data that is designed to contain the
    true population parameter in 95% of all samples.
    (Notice the confidence is in the method!)
  • Using repeated sampling!

71
Estimating Confidence
  • We will use p-hat from an SRS to estimate the
    true population proportion p.
  • What we want to ask ourselves is: What would
    happen if we took many samples?
  • Note: p-hat varies from sample to sample.
    Sampling variability, however, has a clear
    pattern in the long run, a pattern that is well
    described by a normal curve centered around p.
  • Sampling distribution: the distribution of values
    taken by the statistic in all possible samples of
    the same size from the same population.
  • We have briefly discussed this before

72
Estimating Confidence cont. Paraphrase of CLT
(Important)
  • Take an SRS of size n from a large population
    that contains a proportion p of successes. Let
    p-hat be the sample proportion of successes. If
    the sample is large enough, then
  • The sampling distribution of p-hat is
    approximately normal.
  • The mean of the sampling distribution is p.
  • The standard deviation of the sampling
    distribution is $\sqrt{p(1-p)/n}$.
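A small simulation can illustrate these three facts. The sketch below makes several assumptions not in the slides: the population proportion p = 0.3 and sample size n = 500 are made up, each sampled individual is modeled as an independent draw (reasonable when the population is large), and the seed is fixed only so the run is repeatable.

```python
import random
from math import sqrt
from statistics import mean, stdev

def simulate_p_hats(p, n, num_samples, seed=0):
    """Draw num_samples samples of size n and return the sample proportion of each."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n
            for _ in range(num_samples)]

p, n = 0.3, 500                       # hypothetical population proportion and sample size
p_hats = simulate_p_hats(p, n, num_samples=2000)
theoretical_sd = sqrt(p * (1 - p) / n)

print("mean of p-hats:", round(mean(p_hats), 3))
print("sd of p-hats:  ", round(stdev(p_hats), 4))
print("sqrt(p(1-p)/n):", round(theoretical_sd, 4))
```

The simulated mean should land close to p and the simulated standard deviation close to the formula's value; a histogram of `p_hats` would look roughly bell-shaped.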

73
Example
  • We ask 500 adults if they jog. Of the people we
    asked, 120 responded yes. Suppose we know that
    15% of all adults jog.
  • What is p-hat?
  • What is the sampling distribution of p-hat?
  • Find the range for the middle 95% of all
    samples.
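The arithmetic for this example can be sketched as follows, using the standard deviation formula $\sqrt{p(1-p)/n}$ for the sampling distribution of a sample proportion and the 95% part of the 68-95-99.7 rule. The variable names are illustrative.

```python
from math import sqrt

n = 500
p = 0.15                      # known population proportion of joggers
p_hat = 120 / n               # sample proportion

sd = sqrt(p * (1 - p) / n)    # standard deviation of the sampling distribution
middle_95 = (p - 2 * sd, p + 2 * sd)

print(p_hat)
print(round(sd, 4))
print(tuple(round(x, 3) for x in middle_95))
```

The observed p-hat of 0.24 lies far above the upper end of the middle 95% range, so a sample like this would be very unusual if 15% of all adults really jog.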

74
Relationship to Confidence Interval

  • 95% of all samples give an outcome of p-hat such
    that the population truth p is captured by the
    interval from p-hat - 2(s.d.) to p-hat + 2(s.d.).

This is written as $\hat{p} \pm 2(\text{s.d.})$. In practice, however, this
is not very helpful. In order to find the exact
standard deviation, we must know p, but if we
knew p, we would not be doing the sampling in the
first place. Since in practice we do not know p,
we will use p-hat as our estimate for p.
75
Creating A Confidence Interval
  • 95% Confidence Interval for a Proportion
  • Choose an SRS of size n from a large population
    that contains an unknown proportion p of
    successes. Call the proportion of successes in
    this sample p-hat.
  • An approximate 95% confidence interval for the
    parameter p is $\hat{p} \pm 2\sqrt{\hat{p}(1-\hat{p})/n}$

76
Creating an Interval Cont.
  • We can use other levels of Confidence.
  • A level C confidence interval has two parts.
  • The Interval calculated from the data.
  • The Confidence Level, C, which gives the
    probability that the interval will capture the
    true parameter value in repeated samples.
  • (So, a 95% confidence interval means that
    95% of the time, the method produces an interval
    that does capture the true parameter.)
  • z* is called the critical value of the normal
    distribution.
  • Table 21.1 (pg. 435) gives you z* for a
    particular level of confidence

77
General Formula for a Confidence Interval
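  • $\hat{p} \pm z^{*}\sqrt{\hat{p}(1-\hat{p})/n}$, where z* is the
    critical value for the chosen confidence level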

78
Examples
  • Suppose we flip a poker chip 100 times. One side
    of the poker chip is black and the other side is
    red.
  • Suppose we get 48 reds. Construct a 95%
    confidence interval for the overall proportion of
    reds.
  • Suppose we flip another poker chip and get only
    35 reds. What is the 95% confidence interval
    now?
  • What conclusions can you make using the above
    confidence intervals?
  • Suppose we flip a fair coin 100 times. Describe
    the sampling distribution of the proportion of
    heads. Apply the Empirical rule to this
    distribution.
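The two poker-chip intervals can be computed with the approximate 95% recipe, p-hat plus or minus 2 standard deviations, where the standard deviation is estimated by $\sqrt{\hat{p}(1-\hat{p})/n}$. The function name is illustrative.

```python
from math import sqrt

def ci95(successes, n):
    """Approximate 95% confidence interval for a proportion: p-hat +/- 2 s.d."""
    p_hat = successes / n
    margin = 2 * sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - margin, p_hat + margin)

first_chip = ci95(48, 100)
second_chip = ci95(35, 100)
print(tuple(round(x, 3) for x in first_chip))
print(tuple(round(x, 3) for x in second_chip))
```

The first interval contains 0.5, so a fair chip is plausible; the second lies entirely below 0.5, which suggests that chip does not land red half the time.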

79
More Examples!
  • Suppose we took two different SRSs of 500 adults
    and asked them if they jogged. In the first
    sample, 70 people said yes. In the other sample,
    100 people said yes. Find a 95% confidence
    interval for the true population proportion p
    using each of the sample statistics.

80
Example
  • UK's student government decides to examine the
    proportion of students who eat at on-campus
    restaurants. Of the 250 students surveyed, 175
    eat on campus.
  • Say in words what the population proportion p is
    for this situation.
  • Find a 95% confidence interval for the proportion
    of all students who eat on campus.
  • Interpret the resulting interval in words that a
    statistically naïve reader would understand.
  • Given your interval in part (c), should you
    conclude that the majority of all students eat on
    campus? Why or why not?

81
Example
  • Construct a 90% and a 99% confidence interval for
    the proportion of all students who eat on campus
    using the sample result in the previous example.
    How do these intervals compare to the 95%
    confidence interval?
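A sketch of the general recipe at several confidence levels, using the on-campus dining sample (175 of 250). Instead of reading the critical value from the textbook table, this version computes it with `NormalDist.inv_cdf`, so results may differ slightly from answers based on rounded table values or on the "plus or minus 2" shortcut.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(successes, n, level):
    """Level-C confidence interval for a proportion: p-hat +/- z* x estimated s.d."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf((1 + level) / 2)   # critical value for this level
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - margin, p_hat + margin)

intervals = {level: confidence_interval(175, 250, level)
             for level in (0.90, 0.95, 0.99)}
for level, (low, high) in intervals.items():
    print(f"{level:.0%}: {low:.3f} to {high:.3f}")
```

Higher confidence requires a wider interval. Here even the 99% interval stays above 0.5.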

82
Chapter 22
  • What is a Test of Significance?
  • Statistics-Concepts and Controversies, 6th
    Edition, David S. Moore

83
Statistical Tests
  • Suppose a friend tells you, "I can run a 5K race
    in under 20 minutes," but your friend's time at
    the 6 5K races you have both run is over 27
    minutes. What would you be inclined to believe?
    Why?

84
Tests of Significance
  • Determine the hypotheses
  • Null hypothesis: an assumption concerning the
    value of the population parameter being studied
    (usually represents no effect, no change, no
    difference, etc.)
  • Notation: H0
  • Note: The null always contains the equality
  • Alternative hypothesis: a statement that
    specifies an alternative set of possible values
    for the population parameter that is not included
    in the null hypothesis (states the result for
    which we hope to find evidence)
  • Notation: HA (or H1)
  • Note: The null and alternative always contradict
    each other
  • Note: The null hypothesis may or may not be true.
    We will carry out a study and then determine if
    we have strong enough evidence to conclude that
    the null hypothesis is false (meaning our
    evidence suggests that HA is true).

85
Examples
  • A psychology text states that 10% of the
    population is left-handed. You do not know
    whether the proportion of left-handers is more or
    less than 0.1. What null and alternative
    hypotheses should you test?
  • Note the differences in hypotheses between
  • Not equal to: HA: p ≠ 0.10
  • Greater than: HA: p > 0.10
  • Less than: HA: p < 0.10

86
Tests of Significance
  • Obtain a simple random sample of n observations
    from the desired population and calculate the
    observed sample statistic.
  • For example, if we want to test something about a
    population proportion (p), then we would
    calculate the sample proportion p-hat.
  • If the null hypothesis is true, our sample
    proportion can be approximately described by a
    normal curve with mean $p_0$ and standard
    deviation $\sqrt{p_0(1-p_0)/n}$, where $p_0$
    denotes the value of p stated in H0.
  • We use this mean and standard deviation to
    construct the test statistic (z) for our observed
    sample statistic: $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$

87
Evaluating a Test
  • Determine the strength of your evidence.
  • The evidence is strong if the outcome we observe
    would rarely occur assuming the null hypothesis
    is true (meaning it is more probable that the
    alternative hypothesis is true).
  • The evidence is weak if the outcome we observe
    has a high probability of occurring assuming the
    null hypothesis is true.
  • We measure the strength of the evidence by
    calculating a P-value.
  • p-value: the probability of obtaining a sample
    outcome at least as extreme as the actual
    observed outcome, assuming the null is true.
    (Know this definition!)
  • The smaller the p-value, the stronger the
    evidence is against H0.
  • (You may also think of the p-value as describing
    the risk of making a mistake if we wrongly reject
    the null.)
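To make the p-value concrete, here is a sketch of a two-sided test for the left-handedness hypotheses (H0: p = 0.10 against HA: p ≠ 0.10). The sample of 100 people with 15 left-handers is hypothetical, chosen only for illustration, and the function name is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0):
    """Two-sided z test of H0: p = p0; returns the test statistic and p-value."""
    p_hat = successes / n
    sd = sqrt(p0 * (1 - p0) / n)    # standard deviation of p-hat if H0 is true
    z = (p_hat - p0) / sd
    # "At least as extreme" in either direction, hence both tails.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = one_proportion_z_test(successes=15, n=100, p0=0.10)   # hypothetical sample
print(round(z, 2), round(p_value, 3))
```

With these made-up numbers, a sample proportion at least this far from 0.10 would occur a bit less than 10% of the time if H0 were true, which is only weak evidence against H0.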

88
Evaluating a Test (cont)
  • Draw a conclusion.
  • If the p-value is small, then we reject H0 in
    favor of HA.
  • If the p-value is large, then we fail to reject
    H0, meaning we cannot conclude H0 is false.
  • You may NEVER conclude that the null is true,
    i.e., we would not state that we accept the
    null. Unfortunately, you CANNOT be certain that
    you have made the correct conclusion.

89
More about conclusions
  • We decide in advance how small the p-value must
    be in order to conclude that we have strong
    evidence against H0.
  • The value we choose is called the significance
    level (written as α).
  • If the p-value is as small as or smaller than α,
    then we say that the data is statistically
    significant at level α (meaning that the observed
    outcome would rarely occur by chance).
  • α = .05 is the most common. This means that the
    data must give evidence so strong that, when H0
    is true, a result like this would occur by
    chance no more than 5% of the time.
  • α = .01 requires stronger evidence against H0.
  • Statistically significant does NOT necessarily
    mean practically important. It only means not
    likely to happen by chance alone.
  • Giving the p-value is ALWAYS more informative
    than just stating if the results are
    statistically significant or not.

90
The p-value approach
  • Advantages to this Approach
  • When the P-value is reported, the decision of
    whether or not to reject the null hypothesis is
    left up to the reader.
  • For example, suppose a p-value of .03 is
    reported. If you, the reader, think that a 5%
    level of significance (α = .05) is sufficient,
    then you would choose to reject the null
    hypothesis in favor of the alternative
    hypothesis. If, however, a second reader thinks
    that a 5% level of significance is insufficient
    and would rather use α = .01, then he or she
    would fail to reject the null hypothesis.
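
The two readers in this example can be sketched as follows. The helper name decide is hypothetical; the rule it applies is the one stated earlier in the deck (reject when the p-value is as small as or smaller than the significance level).

```python
def decide(p_value, alpha):
    """Reject H0 when the p-value is as small as or smaller than alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

reported_p = 0.03
print(decide(reported_p, 0.05))  # reader who accepts a 5% level
print(decide(reported_p, 0.01))  # reader who insists on a 1% level
```

The same reported p-value leads to different decisions, which is why reporting the p-value itself is more informative than reporting only "significant" or "not significant".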

91
Publishing Our Results
  • p-values are very often reported when describing
    the results of studies in many fields.
    Therefore, it is very important to understand
    what they are telling you.
  • Example: The financial aid office of a university
    asks a sample of students about their employment
    and earnings. The report says, "For academic
    year earnings, a significant difference (p-value
    = .038) was found between the sexes, with men
    earning more on the average."
  • Interpretation: If there really is no difference
    in academic year earnings between the sexes, then
    we would have seen a difference this big or
    bigger in only 3.8% of all samples. (i.e., chance
    alone would produce a difference this large only
    3.8% of the time.)

92
Examples
  • An economist states that 10% of a city's labor
    force is unemployed. Suppose you think that this
    estimate is too low. What null and alternative
    hypotheses should you test?
  • A state legislature says that it is going to
    decrease its funding of the state university
    because, according to its sources, 36% of the
    graduates move out of the state within 3 years of
    graduation. As a faculty member at the
    university, you want to show that the percent of
    graduates who move out of state is less than 36%.
    What null and alternative hypotheses should you
    test?

93
Examples
  • Suppose you think that the proportion of people
    who wear contact lenses that experience no
    difficulty is less than 80%. You wish to conduct
    a hypothesis test to determine if you are
    correct.
  • What null and alternative hypotheses should you
    use?
  • Suppose you carry out the above hypothesis test
    on a sample of 200 students. You obtain a sample
    proportion of .745 and get a P-value of .0262.
    Carefully explain what this P-value means in this
    particular situation.

94
Examples
  • A new method to teach soldiers to shoot a rifle
    is being tested. A larger proportion of soldiers
    pass the marksmanship test after 3 days of
    training using the new method. (74% vs. 70%,
    P-value = .0228)
  • Set up the appropriate hypothesis.
  • What does this p-value lead you to conclude?
  • A farmer planted wheat in 100 plots of land. On
    50 plots, he used fertilizer by company 1 and on
    the other 50 plots, he used fertilizer by company
    2. The average yield on the plots is different.
    (P-value = .0004, fertilizer by co. 1 yielded 14%
    higher)
  • What does this p-value lead you to conclude?

95
Examples
  • Verify that the P-value in the problem concerning
    people who wear contact lenses is correct.
  • Is the result in part (b) of the problem
    statistically significant at the 5% level? At
    the 1% level?
  • Going back to the state legislature problem,
    suppose you obtain a random sample of 160
    graduates and find that 40 moved out of state
    within 3 years of graduation. Calculate the
    corresponding P-value.
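
One way to check both calculations, sketched under the assumption that each is a one-sided (lower-tail) z test for a proportion. The helper lower_tail_p is a name introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist

def lower_tail_p(p_hat, p0, n):
    """z statistic and P-value for HA: p < p0."""
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    return z, NormalDist().cdf(z)

# Contact lenses: H0: p = .80, HA: p < .80, n = 200, sample proportion .745
z_lens, p_lens = lower_tail_p(0.745, 0.80, 200)
# Same test with z rounded to two decimals, as a printed normal table requires
p_lens_table = NormalDist().cdf(round(z_lens, 2))

# State legislature: H0: p = .36, HA: p < .36, 40 of 160 moved out of state
z_grad, p_grad = lower_tail_p(40 / 160, 0.36, 160)

print(round(z_lens, 2), round(p_lens, 4), round(p_lens_table, 4))
print(round(z_grad, 2), round(p_grad, 4))
```

With z rounded to -1.94, the table lookup reproduces the .0262 quoted for the contact lens problem; the unrounded z gives a slightly smaller value.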

96
Example
  • In a random sample of 200 walnut panels, 32 had
    major flaws. Is this sufficient evidence for
    concluding that the proportion of walnut panels
    that contain major flaws is greater than .1?
  • Use α = .05.
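
A possible worked version of this example, assuming H0: p = .1 against HA: p > .1 and the normal approximation used throughout the chapter. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, flawed, p0, alpha = 200, 32, 0.10, 0.05
p_hat = flawed / n                            # sample proportion, 32/200
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)    # test statistic under H0
p_value = 1 - NormalDist().cdf(z)             # upper tail, since HA: p > .1

print(round(z, 2), round(p_value, 4))
print("reject H0" if p_value <= alpha else "fail to reject H0")
```

The upper tail is used because only a sample proportion well above .1 counts as evidence for this alternative.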

97
Example
  • According to Myers-Briggs estimates, about 82% of
    college student government leaders are
    extroverts. (Source: Myers-Briggs Type
    Indicator Atlas of Type Tables.) Suppose that a
    Myers-Briggs personality preference test was
    given to a random sample of 73 student government
    leaders attending a large national leadership
    conference and that 56 were found to be
    extroverts. Does this indicate that the
    population proportion of extroverts among college
    student government leaders is not 82%? Use a 5%
    level of significance.
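
A sketch of this test, assuming H0: p = .82 against the two-sided alternative HA: p ≠ .82. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, extroverts, p0, alpha = 73, 56, 0.82, 0.05
p_hat = extroverts / n                            # sample proportion, 56/73
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)        # test statistic under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # both tails, since HA: p != .82

print(round(p_hat, 3), round(z, 2), round(p_value, 2))
print("reject H0" if p_value <= alpha else "fail to reject H0")
```

Because the question asks only whether the proportion "is not 82%", departures in either direction count, so the tail area is doubled.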

98
Example
  • The publisher of a magazine is told that the
    percentage of subscribers to the magazine that
    are younger than 36 years is 60%. The publisher
    thinks that the percentage is different.
  • What null and alternative hypotheses should you
    use?
  • Suppose you carry out the above hypothesis test
    on a sample of 150 subscribers. You obtain a
    sample proportion of 0.51.
  • Calculate the correct test statistic and p-value
    for this problem.
  • What should the publisher conclude?
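
A sketch of the test statistic and p-value for this problem, assuming H0: p = .60 against HA: p ≠ .60 (the publisher thinks the percentage is "different"). Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, p_hat, p0 = 150, 0.51, 0.60
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)        # standard deviation under H0 is 0.04
p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # two-sided alternative

print(round(z, 2), round(p_value, 4))
for alpha in (0.05, 0.01):
    print(alpha, "reject H0" if p_value <= alpha else "fail to reject H0")
```

The loop shows that the publisher's conclusion depends on the significance level chosen in advance.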

99
Chapter 23
  • Use and Abuse of Statistical Inference
  • Statistics-Concepts and Controversies, 6th
    edition, David S. Moore

100
Using Inference Wisely
  • The design of the data production matters.
  • For our confidence interval and test for a
    proportion p, we must have a simple random sample
    (SRS).
  • If we have poor data collection methods our
    results may be invalid.

101
Know how confidence intervals behave
  • The confidence level says how often the method
    catches the true parameter under repeated
    sampling.
  • We are confident in the method! i.e., the method
    works 95% of the time for a 95% confidence
    interval.
  • We never know if our particular interval actually
    contains p.
  • The higher the confidence level, the wider the
    interval.
  • Larger samples give shorter intervals. Notice
    the formula: p̂ ± z* sqrt(p̂(1 − p̂)/n)
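
The slide's own formula did not survive in this transcript. The sketch below assumes the standard large-sample margin of error for a proportion, z* sqrt(p̂(1 − p̂)/n), and uses a hypothetical sample proportion of 0.5 to show both behaviors listed above.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(p_hat, n, confidence):
    """Half-width of the large-sample confidence interval for a proportion."""
    # z* is the critical value leaving (1 - confidence) / 2 in each tail
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z_star * sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical sample proportion of 0.5 at several sample sizes and levels
for n in (100, 400, 1600):
    for confidence in (0.90, 0.95, 0.99):
        print(n, confidence, round(margin_of_error(0.5, n, confidence), 3))
```

Because n sits under a square root, quadrupling the sample size only halves the margin of error.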

102
The advantages of confidence intervals
  • Confidence intervals can be more informative than
    tests because they actually estimate a population
    parameter.
  • They are also easier to interpret.
  • It is good practice to give confidence intervals
    whenever possible.

103
Hypothesis Tests
  • Know what statistical significance says.
  • A significance test answers only one question:
    "How strong is the evidence that the null
    hypothesis is not true?"
  • The p-value measures how unlikely our data would
    be if the null is true.
  • We never know whether the hypothesis is true for
    the specific population.

104
Hypothesis Test cont.
  • Know what your methods require.
  • Our test and confidence interval for a proportion
    p require that the population be much larger than
    the sample.
  • They also require that the sample itself be
    reasonably large.
  • Keep in mind the effects of sample size on
    hypothesis tests.

105
When does testing for significance make sense?
  • Significance tests work when we form a hypothesis
    and then wait for the data.
  • "Look back and take the best" isn't a suitable
    foundation for a significance test.

106
The woes of significance tests.
  • Larger samples make tests of significance more
    sensitive whereas tests of significance based on
    small samples are not sensitive.
  • When reporting a p-value, you should also include
    the sample size and the statistic that describes
    the sample outcome. Reason The p-value depends
    strongly on the size of the sample and the truth
    about the population.
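
To illustrate how strongly the p-value depends on sample size, the sketch below holds a hypothetical sample outcome fixed (a sample proportion of 0.53 tested against H0: p = 0.50) and varies only n. The numbers and the helper two_sided_p are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(p_hat, p0, n):
    """Two-sided P-value for a one-proportion z test."""
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same observed proportion, three different sample sizes
for n in (100, 1000, 10000):
    print(n, round(two_sided_p(0.53, 0.50, n), 4))
```

The same three-point gap is unremarkable in a small sample and overwhelming in a very large one, which is why the sample size and the sample statistic should accompany any reported p-value.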

107
Significance at the 5% level isn't magical
  • There is no sharp border between "significant"
    and "insignificant", only increasingly strong
    evidence as the p-value decreases.
  • This means that there is no practical distinction
    between a p-value of 0.051 and a p-value of 0.049.

108
Beware of Searching for Significance
  • The main goal is to find a significant effect
    that you were looking for.
  • This does not mean go out and run a hundred tests
    and then report which ones you found that were
    significant.
  • Remember: at a 5% significance level, out of 100
    tests in which the null is actually true, on
    average 5 of them would be found to be
    significant by chance alone.
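
The arithmetic behind this warning can be sketched as follows, assuming 100 independent tests in which every null hypothesis is true. Independence is an assumption added here; the slides do not state it.

```python
alpha = 0.05
tests = 100

# Expected number of "significant" results when every null hypothesis is true
expected_false_positives = alpha * tests

# Chance that at least one of the independent tests comes out significant
p_at_least_one = 1 - (1 - alpha) ** tests

print(expected_false_positives)
print(round(p_at_least_one, 3))
```

Under these assumptions, finding at least one "significant" result among 100 tests is close to certain, so a result found by searching is weak evidence on its own.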