Title: Statistics: Two Issues
1 Statistics: Two Issues
- How to convince other people of things you want them to believe
  - Both how to do it, and
  - How not to be a victim of people doing it
  - How to Lie With Statistics
- How to actually learn things about the data
  - Different ways of summarizing data
  - And of identifying patterns
- You might want to know
  - What sort of cases your firm made most money by taking
  - How SCU could increase its bar passage rate
  - Whether there was statistical evidence for a mass tort that you could try to base a class action case on
    - Do you care if the evidence shows that the tort is real
    - Or only if the evidence can be used to convince a judge and jury?
2 Ways of Fooling or Being Fooled
- Have the vertical axis start well above zero, to magnify changes
3 My High School Textbook
- From 1920 to 1930, farm income fell 40%. Presidents Harding and Coolidge did nothing.
The Chief Justice
- Over the past ten years, judicial salaries, adjusted for inflation, have fallen more than 20%.
Pick your endpoints
- "In three states with no paper trails, we have exit poll/final tally disagreement. In three states with paper trails, we have exit poll/final tally congruence."
Select your (very nonrandom) sample
4 Normal Distribution
- A particular family of distributions (bell curve)
  - Where once you know the mean and the standard deviation
  - you know the distribution
  - A e^(-(x - ⟨x⟩)^2 / (2σ^2)) gives a bell-shaped curve
- Which many real-world distributions approximate
- And which has characteristics that are known and useful
  - About 68% within one stdev, 95% within two, 99.7% within three
  - If you know the mean IQ is 100 and the stdev is 15, just how special is your IQ 150 kid?
- Z score table is the continuous version of that rule
  - Z score is the number of standard deviations from the mean.
  - Table tells you how likely it is that the Z score is no higher than that
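The 68/95/99.7 rule and the IQ question can be checked without a printed z table. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper names `share_within` and `z_score` are made up for illustration.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k: float) -> float:
    """Fraction of a normal population within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

def z_score(value: float, mean: float, stdev: float) -> float:
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / stdev

for k in (1, 2, 3):
    print(f"within {k} stdev: {share_within(k):.4f}")

# The IQ 150 kid, with mean 100 and stdev 15.
z_iq = z_score(150, mean=100, stdev=15)
share_below = std_normal.cdf(z_iq)  # this is the number a z table reports
print(f"z = {z_iq:.2f}, share at or below = {share_below:.5f}")
print(f"share above = {1 - share_below:.5f}")
```

If IQ really were normally distributed, the last line would suggest that fewer than one person in a thousand scores above 150.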
5 Estimation v. Hypothesis Testing
- One use of statistics is to test a hypothesis
  - Does legalizing concealed carry reduce crime?
  - Does capital punishment deter?
  - Does the defendant firm discriminate against women?
  - Does mercury in vaccines cause autism?
  - Does an extra semester of torts increase bar passage?
- The other is to estimate population characteristics
  - How many people will vote for Obama vs McCain?
  - What is the average height of an adult American?
  - You want to know not only the average but
    - The standard deviation, or other measures of the distribution
  - And you want to know how sure you are that your result is correct
    - "The margin of error of this poll is ..."
- Sometimes you estimate population characteristics in the process of testing a hypothesis.
  - Estimate the standard deviation for the population in order to see
  - Whether your results are unreasonably far from the value being tested.
  - Stay tuned
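As a rough illustration of what a poll's "margin of error" usually stands for, here is a sketch of the textbook formula for a proportion. It assumes a simple random sample and a 95% confidence level, neither of which the slides specify, and the poll numbers are invented.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(p_hat: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of the normal-approximation interval for a sample proportion.

    Covers sampling error only, and only for a simple random sample.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return z * sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical poll: 1000 respondents, half of them favoring one candidate.
moe = margin_of_error(0.5, 1000)
print(f"margin of error: +/- {100 * moe:.1f} percentage points")
```

Quadrupling the sample size only halves the margin, because the sample size enters through its square root.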
6 Sources of Error
- One problem with samples is sampling error
  - When you select ten students,
  - by chance they might be taller or shorter than average
  - This is what "the uncertainty in this figure is ..." usually means
- Another is bias: Was this a random sample?
  - If you are measuring age, not height, and select students in this class
  - Since it isn't taken by first years
  - Your sample is biased towards older students
  - Famous example: telephone poll that showed Dewey would win
- A third is validity: Are these facts true?
  - If you test age by asking people their age when their friends are around
  - In some populations people prefer to exaggerate their age
  - In others to make it look smaller
  - Similarly for asking about adultery in the presence of a spouse
  - Or drug use when the questioner knows the name of the respondent
- Note that bias and invalidity may be either accidental or deliberate
7 Rents paid by law students at SCU
- Take a sample of 100
- First deduce standard deviation of the population from the sample
  - Calculate the mean of the sample, ⟨rent⟩
  - For each rent, calculate (rent - ⟨rent⟩)^2
  - Add up and divide by 99 (why 99 not 100?)
  - The square root is your estimate of the standard deviation of the population, σ
  - Which measures how much rents vary from student to student
- Then deduce the standard deviation of the mean
  - Standard deviation of the mean of a sample of size n goes as σ/√n
  - For samples of that size, that's how much their means would vary
- How likely is it that ⟨rent⟩ is at least that far from 1000?
  - The distribution of means is approximately normal
  - You know its standard deviation: σ/10
  - So (⟨rent⟩ - 1000)/(σ/10) is z; consult the z table
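The steps above can be sketched as one small function. This is an illustration, not the book's procedure verbatim. The function name `z_test_mean` and the ten rents are invented, and with a sample this small the normal approximation is rougher than it would be with 100 rents.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def z_test_mean(sample, hypothesized_mean):
    """Return (sample mean, sigma estimate, std. dev. of the mean, z, two-tailed p)."""
    n = len(sample)
    sample_mean = mean(sample)
    sigma_hat = stdev(sample)         # divides by n - 1, as on the slide
    std_error = sigma_hat / sqrt(n)   # standard deviation of the mean
    z = (sample_mean - hypothesized_mean) / std_error
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return sample_mean, sigma_hat, std_error, z, p_two_tailed

# Made-up monthly rents, for illustration only.
rents = [900, 1100, 950, 1050, 800, 1200, 1000, 975, 1025, 1000]

sample_mean, sigma_hat, std_error, z, p = z_test_mean(rents, hypothesized_mean=900)
print(f"mean={sample_mean}, sigma={sigma_hat:.1f}, se={std_error:.1f}, z={z:.2f}, p={p:.4f}")
```

`statistics.stdev` already divides by n - 1, so no manual correction is needed. `statistics.pstdev` is the version that divides by n.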
8 The Calculation
- Hypothesis being tested: average rent = 1000
- Hypothetical numbers (from the book)
  - Sample size = 100
  - ⟨rent⟩ = 950: average of the sample
  - σ = 150: standard deviation of the population (estimate)
  - σ/√100 = σ/10 = 15: standard deviation of the mean
- So Z = 50/15 = 3.33 standard deviations (the sample mean is below 1000)
- Two-tailed test: why?
- Z table shows more than .995 below 3.33, less than .005 below -3.33
- So more than .99 between the two values
- So less than .01 probability that ⟨rent⟩ is at least that far from 1000 by chance
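The same arithmetic can be run without a z table. This sketch plugs in the slide's hypothetical numbers; the variable names are illustrative. The exact two-tailed probability it computes is smaller than the rounded table figures, and is still below .01.

```python
from math import sqrt
from statistics import NormalDist

n = 100                     # sample size
sample_mean = 950.0         # average of the sample
hypothesized_mean = 1000.0  # value being tested
sigma_hat = 150.0           # estimated standard deviation of the population

std_error = sigma_hat / sqrt(n)                    # 150 / 10 = 15
z = (sample_mean - hypothesized_mean) / std_error  # -50 / 15, about -3.33

# Two-tailed: count samples this far below 1000 and this far above it.
p_two_tailed = 2 * NormalDist().cdf(-abs(z))

print(f"standard deviation of the mean = {std_error:.1f}")
print(f"z = {z:.2f}")
print(f"two-tailed probability = {p_two_tailed:.5f}")
```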
9 What does it mean?
- If the average rent for all students is 1000
  - There is less than one chance in 100
  - That a sample of 100 rents would have a mean
  - At least 50 higher or lower
  - Significance at .01--very strong result
- That does not mean either
  - That the probability the rent is actually 1000 is .01
    - How high do you think it is?
  - Or that the difference of the rent from 1000 is significant in the normal sense--i.e. large
    - Suppose the population were San Jose, n = 10,000
    - Z = 3.33 represents a mean how far from 1000?
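One way to work the San Jose question is to invert the z formula. The sketch below assumes the same estimated standard deviation of 150 applies to the larger sample, which the slide does not state. The function name `gap_for_z` is made up.

```python
from math import sqrt

def gap_for_z(z: float, sigma: float, n: int) -> float:
    """Distance between sample mean and hypothesized mean that yields a given z."""
    return z * sigma / sqrt(n)

gap_100 = gap_for_z(3.33, sigma=150, n=100)        # sample of 100 rents
gap_10000 = gap_for_z(3.33, sigma=150, n=10_000)   # sample of 10,000 rents

print(f"n = 100:    z = 3.33 means a gap of about {gap_100:.2f}")
print(f"n = 10,000: z = 3.33 means a gap of about {gap_10000:.2f}")
```

Under that assumption the same z score corresponds to a gap of about 50 with 100 rents and about 5 with 10,000. The result is equally significant statistically but far less significant in the everyday sense.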
10 Hypothesis Testing
- The basic logic of confidence results
- You have a null hypothesis: this coin is fair
- You have a sample: say the result of flipping the coin ten times. 7 heads.
- You want to decide whether the null hypothesis is true
- In the background there is an alternative hypothesis
  - Which is relevant to how you test the null hypothesis
  - For instance: this coin is not fair, but I don't know in which direction
- You ask: If the null hypothesis is true, how likely is a result at least this far from what it predicts in the direction the alternative predicts?
  - For example, if the coin is fair
  - How likely is it that the result of my experiment would be this far from 50/50?
- Suppose the answer is that if the coin is fair, the chance of being this far off 50/50 is less than .05 (i.e. 5%)
  - You then say that the null hypothesis is rejected at the .05 level
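For the ten-flip example the question can be answered exactly by counting outcomes. The sketch below is one way to do it, assuming the two-sided alternative described above (not fair, direction unknown). The function name is illustrative.

```python
from math import comb

def two_sided_p_fair_coin(heads: int, flips: int) -> float:
    """Chance that a fair coin lands at least this far from 50/50, in either direction."""
    expected = flips / 2
    observed_gap = abs(heads - expected)
    # Count every sequence whose number of heads is at least as far from 50/50.
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_gap
    )
    return extreme / 2 ** flips

p_7_of_10 = two_sided_p_fair_coin(7, 10)
p_9_of_10 = two_sided_p_fair_coin(9, 10)
print(f"7 heads in 10: p = {p_7_of_10:.4f}")
print(f"9 heads in 10: p = {p_9_of_10:.4f}")
```

Seven heads in ten is nowhere near rejection at the .05 level. Nine heads in ten would be.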
11 To Restate
- Confidence level tells you how strong this piece of evidence against the null hypothesis is
  - but not how likely the null hypothesis is to be true
  - analogously, it might be that a witness identification has only one chance in four of being wrong by chance
  - but if you have a solid alibi, you still get acquitted
- "Statistically significant" doesn't mean "important"; it means "unlikely to occur by chance"
  - I take a random coin and flip it ten million times
  - the result will prove it isn't a fair coin to a very high level of significance
  - Even if it is "unfair" only by .501 vs .499 probability
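How many flips it takes for a tiny bias to register depends on the sample size. As a rough sketch, the function below (a made-up helper) gives the z score one would expect on average from a coin whose true chance of heads is .501, tested against the fair-coin hypothesis.

```python
from math import sqrt

def expected_z(true_p: float, flips: int) -> float:
    """Typical z score against the fair-coin null when the true heads probability is true_p."""
    std_error = sqrt(0.25 / flips)  # std. dev. of the share of heads if the coin is fair
    return (true_p - 0.5) / std_error

for flips in (10_000, 1_000_000, 10_000_000):
    print(f"{flips:>10,} flips: expected z = {expected_z(0.501, flips):.2f}")
```

The expected z grows with the square root of the number of flips, so a bias this small goes unnoticed in thousands of flips and becomes overwhelming in tens of millions.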
12 This is all sampling error
- Sampling error can be calculated, but...
  - Other forms of error may be more important
  - So "the margin of error is ..." may be misleading
- Consider DNA tests
  - "The chance that the defendant's DNA would match this closely is less than one in a hundred million"
  - May be a true statement about sampling error
  - But there have been far more mistaken results than that number suggests
    - Rates of human error are much higher than that
    - As are rates of deliberate fraud
- Think of sampling error as a lower bound
13 Controlling for Bias
- Suppose you can't get a random sample
  - Sampling SCU law students, but
  - You are only around in the daytime
  - And easiest to sample those in your classes
- You can try to correct for the bias
  - Suppose 20% of students are part-time, but
  - Only 5% of your sample are
  - Let each of them count as five in your calculations
  - 25/120 ≈ 21%, close to the 20% target (a weight of 4.75 would make it exact)
  - So part-time students are now about the right fraction
  - Similarly for fraction of women, 1st, 2nd, 3rd year
- What is the risk in this procedure?
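The reweighting idea can be sketched in a few lines. The group labels and helper names are invented, and the sample is assumed to be 100 students with 5 part-timers. Counting each part-timer five times against 95 full-timers gives 25 out of 120. Weights computed from the target shares hit 20% exactly.

```python
def weighted_share(counts, weights):
    """Share of the weighted sample that each group represents."""
    total = sum(counts[g] * weights[g] for g in counts)
    return {g: counts[g] * weights[g] / total for g in counts}

def exact_weights(counts, target_shares):
    """Per-person weights that make each group's weighted share match its target."""
    n = sum(counts.values())
    return {g: target_shares[g] * n / counts[g] for g in counts}

# Hypothetical sample of 100 students, 5 of them part-time.
sample_counts = {"part_time": 5, "full_time": 95}

rough = weighted_share(sample_counts, {"part_time": 5, "full_time": 1})
weights = exact_weights(sample_counts, {"part_time": 0.20, "full_time": 0.80})
exact = weighted_share(sample_counts, weights)

print(f"count each part-timer as five: {rough['part_time']:.3f}")
print(f"exact weights: {weights}")
print(f"share with exact weights: {exact['part_time']:.3f}")
```

Either way, the answers of five people now stand in for a fifth of the student body, so any quirk among those five is magnified. That is one way to see the risk the slide asks about.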
14 Bayesian Statistics
- Consider again my coin flipping experiment
  - Take a coin from my pocket, flip it twice
  - Null hypothesis: It's a fair coin
  - Alternative: It's double headed
  - Get two heads
  - If the coin is fair, chance of evidence that strong for the alternative is .25
- We don't conclude it has that probability of being double headed
  - Why?
- We start with a prior probability
  - very few coins are double headed
  - So the chance of drawing one and then getting heads twice
  - Is much lower than the chance of drawing a fair coin and getting heads twice
  - So the latter is what probably happened
15 Done formally
- Suppose one coin in 1000 is double headed
  - Probability of pulling one from my pocket = .001
  - If it is double headed, prob of two heads = 1
  - So joint probability--that both happen--is .001
- 999 in 1000 coins are (approximately) fair
  - P of pulling a fair coin from pocket = .999
  - If fair, p of two heads = .25
  - Joint probability is .25 x .999 = .24975
- We know one of these two things happened
  - Relative probability is .001/.24975 ≈ 1/250
  - So odds about 250 to 1 that the coin is fair
- This is Bayesian statistics as opposed to classical statistics
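The calculation above generalizes to any set of competing hypotheses: multiply each prior by the chance of the evidence under that hypothesis, then rescale so the results sum to one. This is a minimal sketch using the slide's numbers; the function and dictionary names are illustrative.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of mutually exclusive hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}  # prior x chance of evidence
    total = sum(joint.values())                              # chance of the evidence overall
    return {h: joint[h] / total for h in joint}

priors = {"double_headed": 0.001, "fair": 0.999}
likelihoods = {"double_headed": 1.0, "fair": 0.25}  # chance of two heads in two flips

post = posterior(priors, likelihoods)
odds_fair = post["fair"] / post["double_headed"]

print(f"P(double headed | two heads) = {post['double_headed']:.4f}")
print(f"P(fair | two heads)          = {post['fair']:.4f}")
print(f"odds that the coin is fair   = {odds_fair:.1f} to 1")
```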
16 Bayesian Statistics
- Tells you how to
  - Start with a set of prior probabilities (.001, .999)
  - Combine with the result of an experiment
  - Deduce posterior probabilities (.004, .996)
- It doesn't tell you
  - How to find your prior probabilities
  - Those come from knowledge of the situation
  - Modified by past experiments
- No prior, no posterior
17 How to Lie Part 2
- Report sampling error as if it was all error
- Report confidence result with meaning reversed
  - The theory that the firm didn't discriminate against women
  - Can be rejected at the .05 level
  - So the odds are twenty to one that it did
- Report selected result
  - This study found our product clearly worked
  - And we aren't telling you about the other 19 studies
- And this happens even without trying
  - Academic version: if you don't get results you can't publish
  - Popular version: the most striking result gets the press
  - Both can cause unintentionally misleading results, but also
  - Are incentives to deliberately distort results
  - Since getting published and getting press may be the objectives
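To see why one reported study out of twenty proves little, consider a product that does nothing at all. The sketch below assumes the twenty studies are independent and each is tested at the .05 level. The function name is made up.

```python
def chance_of_false_positive(studies: int, alpha: float = 0.05) -> float:
    """Chance that at least one of several independent null studies looks significant."""
    return 1 - (1 - alpha) ** studies

one_study = chance_of_false_positive(1)
twenty_studies = chance_of_false_positive(20)

print(f"one study:      {one_study:.3f}")
print(f"twenty studies: {twenty_studies:.3f}")
```

With twenty tries, a "clearly worked" result somewhere in the pile is more likely than not, even when there is nothing to find.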
18 You can also just lie
- Statistics prove that
  - 95% of quoted statistics are invented
  - Including this one