Statistics: Two Issues - PowerPoint PPT Presentation

Provided by: santac9
Transcript and Presenter's Notes

1
Statistics: Two Issues
  • How to convince other people of things you want
    them to believe
  • Both how to do it, and
  • How not to be a victim of people doing it
  • How to Lie With Statistics
  • How to actually learn things about the data
  • Different ways of summarizing data
  • And of identifying patterns
  • You might want to know
  • What sort of cases your firm made most money by
    taking
  • How SCU could increase its bar passage rate
  • Whether there was statistical evidence for a mass
    tort that you could try to base a class action
    case on
  • Do you care whether the evidence shows that the tort
    is real,
  • Or only whether the evidence can be used to convince a
    judge and jury?

2
Ways of Fooling or Being Fooled
  • Have the vertical axis start well above zero, to
    magnify changes
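
A rough arithmetic sketch of why a truncated axis magnifies changes. The two values and the axis baseline are hypothetical numbers chosen only for illustration, and `apparent_ratio` is a made-up helper name.

```python
def apparent_ratio(old, new, axis_start=0.0):
    """Ratio of drawn bar heights when the vertical axis starts at axis_start."""
    return (new - axis_start) / (old - axis_start)

# Hypothetical figures: a quantity rises from 100 to 105 (a 5% increase).
honest = apparent_ratio(100, 105)                    # axis starts at zero
truncated = apparent_ratio(100, 105, axis_start=95)  # axis starts at 95

print(f"axis from 0:  new bar is {honest:.2f}x the old one")
print(f"axis from 95: new bar is {truncated:.2f}x the old one")
```

The data are identical in both cases; only the baseline of the picture changes.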

3
Pick your endpoints
  • My high school textbook: From 1920 to 1930, farm income fell 40%.
    Presidents Harding and Coolidge did nothing.
  • The Chief Justice: Over the past ten years, judicial salaries,
    adjusted for inflation, have fallen more than 20%.

Select your (very nonrandom) sample
  • "In three states with no paper trails, we have
    exit poll/final tally disagreement. In three
    states with paper trails, we have exit poll/final
    tally congruence."

4
Normal Distribution
  • A particular family of distributions (bell
    curve)
  • Where once you know the mean and the standard
    deviation
  • you know the distribution
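  • $A e^{-(x - \langle x \rangle)^2 / 2\sigma^2}$ gives a bell-shaped curve
    ($\langle x \rangle$ is the mean, $\sigma$ the standard deviation)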
  • Which many real world distributions approximate
  • And which has characteristics that are known and
    useful
  • About 68% within one stdev, 95% within two, 99.7%
    within three
  • If you know the mean IQ is 100 and the stdev is
    15, just how special is your IQ 150 kid?
  • Z score table is the continuous version of that
    rule
  • Z score is the number of standard deviations from
    the mean.
  • Table tells you how likely it is that the Z score
    is no higher than that
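
A minimal sketch of what a Z table encodes, assuming an exactly normal distribution. It uses the standard library's `math.erf` to compute the cumulative probability, checks the 68/95/99.7 rule, and applies it to the IQ example above. The helper names `normal_cdf` and `within` are illustrative, not from the slides.

```python
import math

def normal_cdf(z):
    """P(Z <= z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return normal_cdf(k) - normal_cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} stdev: {within(k):.4f}")

# The IQ example from the slide: mean 100, stdev 15, score 150.
z = (150 - 100) / 15
upper_tail = 1.0 - normal_cdf(z)
print(f"z = {z:.2f}, fraction at or above: {upper_tail:.5f}")
```

Real IQ scores are only approximately normal, especially far out in the tails, so the tail figure is best read as an order of magnitude.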

5
Estimation vs. Hypothesis Testing
  • One use of statistics is to test a hypothesis
  • Does legalizing concealed carry reduce crime?
  • Does capital punishment deter?
  • Does the defendant firm discriminate against
    women?
  • Does mercury in vaccines cause autism?
  • Does an extra semester of torts increase bar
    passage?
  • The other is to estimate population
    characteristics
  • How many people will vote for Obama vs McCain?
  • What is the average height of an adult American?
  • You want to know not only the average but
  • The standard deviation, or other measures of the
    distribution
  • And you want to know how sure you are that your
    result is correct
  • "The margin of error of this poll is ..."
  • Sometimes you estimate population characteristics
    in the process of testing a hypothesis.
  • Estimate the standard deviation for the
    population in order to see
  • Whether your results are unreasonably far from
    the value being tested.
  • Stay tuned
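
As an illustration of "how sure you are", here is a sketch of the usual margin-of-error calculation for a poll. It assumes a simple random sample and the normal approximation at roughly 95% confidence (z = 1.96). The poll numbers are hypothetical, and `margin_of_error` is an illustrative helper name.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion (normal approximation)."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

# Hypothetical poll: 1,000 respondents, 52% favor one candidate.
moe = margin_of_error(0.52, 1000)
print(f"52% +/- {100 * moe:.1f} points")
```

This figure covers sampling error only; it says nothing about a biased sample or untruthful answers.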

6
Sources of Error
  • One problem with samples is sampling error
  • When you select ten students,
  • by chance they might be taller or shorter than
    average
  • This is what "the uncertainty in this figure is ..."
    usually means
  • Another is bias: Was this a random sample?
  • If you are measuring age, not height, and select
    students in this class
  • Since it isn't taken by first years
  • Your sample is biased towards older students
  • Famous example: the telephone poll that showed Dewey
    would win
  • A third is validity: Are these facts true?
  • If you test age by asking people their age when
    their friends are around
  • In some populations people prefer to exaggerate
    their age
  • In others to make it look smaller
  • Similarly for asking about adultery in the
    presence of a spouse
  • Or drug use when the questioner knows the name of
    the respondent
  • Note that bias and invalidity may be either
    accidental or deliberate
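
A small simulation sketch of sampling error alone, with no bias and no invalid answers. The population of heights is invented (normal, mean 170 cm, stdev 10 cm), and the seed is fixed only so the run is repeatable.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 heights in cm, mean 170, stdev 10.
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Draw many random samples of ten students and record each sample's mean.
sample_means = [statistics.mean(rng.sample(population, 10)) for _ in range(2_000)]

spread = statistics.stdev(sample_means)
print(f"population mean: {statistics.mean(population):.1f}")
print(f"sample means range from {min(sample_means):.1f} to {max(sample_means):.1f}")
print(f"stdev of the sample means: {spread:.2f}")
```

The spread of the sample means should come out near 10 divided by the square root of 10, which is the sense in which sampling error, unlike bias, can be calculated.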

7
Rents paid by law students at SCU
  • Take a sample of 100
  • First deduce standard deviation of the population
    from the sample
  • Calculate the mean of the sample, $\langle \text{rent} \rangle$
  • For each rent, calculate $(\text{rent} - \langle \text{rent} \rangle)^2$
  • Add up and divide by 99 (why 99 not 100?)
  • The square root is your estimate of the standard
    deviation of the population, $\sigma$
  • Which measures how much rents vary from student
    to student
  • Then deduce the standard deviation of the mean
  • Standard deviation of the mean of a sample of size $n$
    goes as $\sigma/\sqrt{n}$
  • For samples of that size, that's how much their
    means would vary
  • How likely is it that $\langle \text{rent} \rangle$ is at least that far
    from 1000?
  • The distribution of means is approximately normal
  • You know its standard deviation, $\sigma/10$
  • So $z = (\langle \text{rent} \rangle - 1000)/(\sigma/10)$; consult the z table
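
The steps above can be sketched in a few lines. The rents below are hypothetical and far fewer than the 100 in the slide, so the result only illustrates the mechanics. `z_for_mean` is an illustrative name; `statistics.stdev` is the standard library's sample standard deviation, which divides by n - 1.

```python
import math
import statistics

def z_for_mean(sample, hypothesized_mean):
    """z score of the sample mean against a hypothesized population mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    # statistics.stdev divides the summed squared deviations by n - 1.
    sigma = statistics.stdev(sample)
    std_error = sigma / math.sqrt(n)  # standard deviation of the mean
    return (mean - hypothesized_mean) / std_error

# Hypothetical rents, just to show the steps.
rents = [900, 1100, 950, 1000, 850, 1050, 975, 925]
z = z_for_mean(rents, 1000)
print(f"z = {z:.2f}")
```

With a sample this small the mean is not reliably normal and a t table would be the more appropriate lookup; with 100 rents the z table is a reasonable approximation.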

8
The Calculation
  • Hypothesis being tested: average rent = 1000
  • Hypothetical numbers (from the book)
  • Sample size = 100
  • $\langle \text{rent} \rangle = 950$: average of the sample
  • $\sigma = 150$: standard deviation of the population
    (estimate)
  • $\sigma/\sqrt{100} = \sigma/10 = 15$: standard deviation of the
    mean
  • So $|Z| = 50/15 = 3.33$ standard deviations from 1000
    (below it, since 950 is less than 1000)
  • Two-tailed test: why?
  • Z table shows more than .995 below 3.33, less than .005
    below -3.33
  • So more than .99 between the two values
  • So less than .01 probability that $\langle \text{rent} \rangle$ is at
    least that far from 1000 by chance
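
The same calculation as a sketch in code, using the slide's numbers and computing the normal tail area directly instead of reading a table. The variable names are illustrative. The exact tail area comes out smaller than the rounded table figures, so the result is comfortably significant at .01.

```python
import math

def normal_cdf(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n = 100
sample_mean = 950.0
sigma = 150.0          # estimated standard deviation of the population
hypothesized = 1000.0

std_error = sigma / math.sqrt(n)              # 150 / 10 = 15
z = (sample_mean - hypothesized) / std_error  # negative: the sample mean is below 1000
p_two_tailed = 2.0 * normal_cdf(-abs(z))      # both tails, since either direction counts

print(f"standard error = {std_error:.1f}")
print(f"z = {z:.2f}")
print(f"two-tailed p = {p_two_tailed:.5f}")
```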

9
What does it mean?
  • If the average rent for all students is 1000
  • There is less than one chance in 100
  • That a sample of 100 rents would have a mean
  • At least 50 higher or lower
  • Significance at .01--very strong result
  • That does not mean either
  • That the probability the rent is actually 1000
    is .01
  • How high do you think it is?
  • Or that the difference of the rent from 1000 is
    significant in the normal sense--i.e. large
  • Suppose the population were San Jose, n = 10,000
  • Z = 3.33 represents a mean how far from 1000?
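
A quick sketch of the answer to that last question, assuming the spread of rents stays at 150 (an assumption; a city-wide population could well vary more). `gap_for_z` is an illustrative helper name.

```python
import math

def gap_for_z(z, sigma, n):
    """Distance between sample mean and hypothesized mean that yields a given z."""
    return z * sigma / math.sqrt(n)

sigma = 150.0   # assumed: same spread of rents as in the earlier calculation
z = 50 / 15     # the z score from the sample of 100

for n in (100, 10_000):
    print(f"n = {n:>6}: that z corresponds to a gap of about {gap_for_z(z, sigma, n):.1f}")
```

The same level of statistical significance corresponds to a gap ten times smaller once the sample is a hundred times larger.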

10
Hypothesis Testing
  • The basic logic of confidence results
  • You have a null hypothesis: this coin is fair
  • You have a sample: say, the result of flipping the
    coin ten times, 7 heads.
  • You want to decide whether the null hypothesis is
    true
  • In the background there is an alternative
    hypothesis
  • Which is relevant to how you test the null
    hypothesis
  • For instance: this coin is not fair, but I don't
    know in which direction
  • You ask: If the null hypothesis is true, how
    likely is a result at least this far from what it
    predicts, in the direction the alternative
    predicts?
  • For example, if the coin is fair
  • How likely is it that the result of my experiment
    would be this far from 50/50?
  • Suppose the answer is that if the coin is fair,
    the chance of being this far off 50/50 is less
    than .05 (i.e. 5%)
  • You then say that the null hypothesis is rejected
    at the .05 level
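
For the ten-flip example, the question "how likely is a result at least this far from 50/50?" can be answered exactly by counting outcomes. This is a sketch for the two-sided alternative (unfair in an unknown direction); `two_sided_p` is an illustrative name.

```python
from math import comb

def two_sided_p(heads, flips):
    """P(result at least this far from flips/2, in either direction | fair coin)."""
    distance = abs(heads - flips / 2)
    outcomes = sum(comb(flips, k) for k in range(flips + 1)
                   if abs(k - flips / 2) >= distance)
    return outcomes / 2 ** flips

p = two_sided_p(7, 10)
print(f"p = {p:.4f}")
print("rejected at the .05 level" if p < 0.05 else "not rejected at the .05 level")
```

Seven heads in ten flips is weak evidence: a fair coin lands at least that far from five heads roughly a third of the time.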

11
To Restate
  • Confidence level tells you how strong this piece
    of evidence against the null hypothesis is
  • but not how likely the null hypothesis is to be
    true
  • analogously, it might be that a witness
    identification has only one chance in four of
    being wrong by chance
  • but if you have a solid alibi, you still get
    acquitted
  • "Statistically significant" doesn't mean
    "important"; it means "unlikely to occur by
    chance"
  • I take a random coin and flip it enough times
    (say, ten million)
  • the result will show it isn't a fair coin to a
    very high level of significance
  • Even if it is "unfair" only by .501 vs .499
    probability
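
A rough sketch of how many flips a bias that small needs before it shows up. It computes the expected z score against the fair-coin null for a coin whose true chance of heads is .501; the flip counts are illustrative, and `expected_z` is a made-up helper name.

```python
import math

def expected_z(p_heads, flips):
    """Expected z score against the fair-coin null when the true P(heads) is p_heads."""
    excess_heads = (p_heads - 0.5) * flips
    std_dev_under_null = math.sqrt(flips * 0.25)
    return excess_heads / std_dev_under_null

for flips in (10_000, 1_000_000, 10_000_000):
    print(f"{flips:>10,} flips: expected z = {expected_z(0.501, flips):.1f}")
```

The expected z grows as the square root of the number of flips, so a practically negligible bias is invisible in a short run and overwhelmingly "significant" in a long enough one.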

12
This is all sampling error
  • Sampling error can be calculated, but ...
  • Other forms of error may be more important
  • So "the margin of error is ..." may be misleading
  • Consider DNA tests
  • "The chance that the defendant's DNA would match
    this closely is less than one in a hundred
    million"
  • May be a true statement about sampling error
  • But there have been far more mistaken results
    than that number suggests
  • Rates of human error are much higher than that
  • As are rates of deliberate fraud
  • Think of sampling error as a lower bound

13
Controlling for Bias
  • Suppose you can't get a random sample
  • Sampling SCU law students, but
  • You are only around in the daytime
  • And easiest to sample those in your classes
  • You can try to correct for the bias
  • Suppose 20% of students are part-time, but
  • Only 5% of your sample are
  • Let each of them count as five in your
    calculations
  • 25/(95 + 25) = 25/120, about 21%, close to the 20% target
  • So part-time students are now roughly the right fraction
  • Similarly for fraction of women, 1st, 2nd, 3rd
    year
  • What is the risk in this procedure?
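
A sketch of the reweighting arithmetic, assuming a sample of 100 with 5 part-time and 95 full-time students. Counting each part-time student five times gives 25 of 120 weighted observations; the second helper solves for the weight that hits the target share exactly. Both function names are illustrative.

```python
def weighted_share(n_group, n_rest, weight):
    """Share of the weighted sample made up by a group whose members each count `weight` times."""
    return n_group * weight / (n_group * weight + n_rest)

def weight_for_target(n_group, n_rest, target_share):
    """Weight per group member that makes the group's weighted share equal target_share."""
    return target_share * n_rest / (n_group * (1.0 - target_share))

# Sample of 100 with 5 part-time and 95 full-time students.
share = weighted_share(5, 95, 5)
print(f"weight 5 -> part-time share {share:.3f}")

w = weight_for_target(5, 95, 0.20)
print(f"weight for exactly 20%: {w:.2f}")
```

Either way, the part-time figure in the final answer rests on only five people, each counted several times over, so any quirk among those five is magnified.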

14
Bayesian Statistics
  • Consider again my coin flipping experiment
  • Take a coin from my pocket, flip it twice
  • Null hypothesis: It's a fair coin
  • Alternative: It's double headed
  • Get two heads
  • If the coin is fair, the chance of evidence that strong
    for the alternative is .25
  • We don't conclude it has that probability of
    being double headed
  • Why?
  • We start with a prior probability
  • very few coins are double headed
  • So the chance of drawing one and then getting
    heads twice
  • Is much lower than the chance of drawing a fair
    coin and getting heads twice
  • So the latter is what probably happened

15
Done formally
  • Suppose one coin in 1000 is double headed
  • Probability of pulling one from my pocket = .001
  • If it is double headed, prob of two heads = 1
  • So joint probability--that both happen--is .001
  • 999 in 1000 coins are (approximately) fair
  • P of pulling a fair coin from pocket = .999
  • If fair, p of two heads = .25
  • Joint probability is .25 x .999 = .24975
  • We know one of these two things happened
  • Relative probability is .001/.24975, approx. 1/250
  • So odds about 250 to 1 that the coin is fair
  • This is Bayesian statistics as opposed to
    classical statistics
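
The same calculation as a sketch in code, using the slide's prior of one double-headed coin in 1000. The function simply normalizes prior times likelihood over the two hypotheses; the name `posterior` and the dictionary layout are illustrative choices.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: normalize prior * likelihood over the competing hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}

priors = {"double-headed": 0.001, "fair": 0.999}
likelihood_two_heads = {"double-headed": 1.0, "fair": 0.25}

post = posterior(priors, likelihood_two_heads)
for hypothesis, prob in post.items():
    print(f"P({hypothesis} | two heads) = {prob:.3f}")
```

Changing the prior changes the answer, which is why the result is only as good as the knowledge behind the one-in-1000 figure.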

16
Bayesian Statistics
  • Tells you how to
  • Start with a set of prior probabilities (.001,
    .999)
  • Combine with the result of an experiment
  • Deduce posterior probabilities (.004, .996)
  • It doesn't tell you
  • How to find your prior probabilities
  • Those come from knowledge of the situation
  • Modified by past experiments
  • No prior, no posterior

17
How to Lie Part 2
  • Report sampling error as if it were the only error
  • Report confidence result with meaning reversed
  • The theory that the firm didn't discriminate
    against women
  • Can be rejected at the .05 level
  • So the odds are twenty to one that it did
  • Report selected result
  • This study found our product clearly worked
  • And we aren't telling you about the other 19
    studies
  • And this happens even without trying
  • Academic version: if you don't get results you
    can't publish
  • Popular version: the most striking result gets
    the press
  • Both can cause unintentionally misleading
    results, but also
  • Are incentives to deliberately distort results
  • Since getting published and getting press may be
    the objectives
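
A sketch of why the selected result misleads, assuming the studies are independent and the product has no effect at all. Each study then has a .05 chance of a "significant" result by luck; `chance_of_false_positive` is an illustrative name.

```python
def chance_of_false_positive(studies, alpha=0.05):
    """P(at least one of `studies` independent tests is significant at alpha) when no effect exists."""
    return 1.0 - (1.0 - alpha) ** studies

for k in (1, 5, 20):
    print(f"{k:>2} studies: {chance_of_false_positive(k):.2f}")
```

With twenty studies, at least one chance success is more likely than not, so a single reported success says little unless the number of attempts is known.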

18
You can also just lie
  • Statistics prove that
  • 95% of quoted statistics are invented

Including this one